Find and Compare Differences in Excel Sheets with Python
Table of Contents
- Introduction
- Installing the Required Dependencies
- Comparing Two Datasets with the Same Shape
- Defining File Paths to Excel Files
- Bringing the Excel Data into a Pandas DataFrame
- Comparing the Data Using Pandas Compare Method
- Additional Arguments for the Compare Method
- Exporting the DataFrame Using the to_excel Method
- Highlighting Differences and Adding Comments Using xlwings
- Comparing Two Datasets with Different Numbers of Rows
- Merging the Dataframes to Show the Difference
- Highlighting Rows in Excel Using xlwings
Python Excel Sheet Comparison: Finding Differences Between Two Sheets
In this article, we will explore how to compare two Excel sheets using Python and find their differences. We will use the Python libraries pandas and xlwings to manipulate the Excel files and perform the comparison. The article guides you through installing the necessary dependencies and demonstrates, step by step, how to compare two datasets. We cover scenarios where the datasets have the same shape as well as cases where they have different numbers of rows. We will also learn how to highlight the differences in the Excel sheets and export the comparison results. Let's get started.
1. Introduction
When working with Excel sheets, it is common to encounter situations where you need to compare two datasets and identify the differences between them. For example, you may have an initial version of a spreadsheet and a modified version, and you want to find out which cells have been changed. Python provides powerful libraries like pandas and xlwings that enable us to perform such comparisons efficiently and effectively. In this article, we will demonstrate how to use these libraries to compare Excel sheets and highlight the differences.
2. Installing the Required Dependencies
Before we start comparing the Excel sheets, we need to make sure that the necessary dependencies are installed. We will be using the pandas and xlwings libraries for this task. In addition, pandas relies on the optional dependency openpyxl to read and write .xlsx files. To install these dependencies, open your command prompt or terminal and run the following commands:
pip install openpyxl
pip install pandas
pip install xlwings
Once the installations are complete, we can proceed to the next steps.
3. Comparing Two Datasets with the Same Shape
We will begin by exploring an example where we compare two datasets that have the same shape. By shape, we mean that the datasets have the same number of columns and rows. Let's assume we have two Excel files: the initial spreadsheet and the updated version. Our task is to identify the differences between these two files. We will achieve this using the pandas library.
To start, let's define the file paths to these Excel files. Both workbooks are located in the current working directory of this Python script, in a folder called 'Same_Shape'. We can use the pathlib module to create the file paths.
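As a minimal sketch of this step, the paths could be built as follows. The two workbook file names are hypothetical placeholders, since the article does not name the files; only the 'Same_Shape' folder comes from the text.

```python
from pathlib import Path

# Folder inside the current working directory, as described in the text.
base_dir = Path.cwd() / "Same_Shape"

# Hypothetical file names: replace them with the names of your own workbooks.
initial_version = base_dir / "initial_version.xlsx"
updated_version = base_dir / "updated_version.xlsx"

print(initial_version)
print(updated_version)
```

Using pathlib keeps the paths portable across Windows, macOS, and Linux, because the `/` operator inserts the correct separator for the platform.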
Once we have the file paths defined, we can bring the Excel data into pandas dataframes by passing each file path to the pd.read_excel function. This function reads the Excel file and creates a dataframe from its data. We will have two dataframes: one for the initial spreadsheet and one for the updated version. We can then use the pandas compare method to compare the data in these two dataframes and find the differences.
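The following sketch illustrates loading and comparing. The helper `load_sheets` is a hypothetical name introduced here, and the two small dataframes are made-up stand-ins for the workbooks so that the comparison can be run without any files.

```python
import pandas as pd


def load_sheets(initial_path, updated_path):
    """Read both workbooks into dataframes (requires openpyxl for .xlsx)."""
    df_initial = pd.read_excel(initial_path)
    df_updated = pd.read_excel(updated_path)
    return df_initial, df_updated


# Made-up stand-ins for the two workbooks: same shape, one changed price.
df_initial = pd.DataFrame({"Product": ["A", "B", "C"], "Price": [10, 20, 30]})
df_updated = pd.DataFrame({"Product": ["A", "B", "C"], "Price": [10, 25, 30]})

# compare() keeps only the rows and columns that differ.
# "self" holds the value from df_initial, "other" the value from df_updated.
diff = df_initial.compare(df_updated)
print(diff)
```

With real workbooks, the two stand-in dataframes would be replaced by the result of `load_sheets(initial_version, updated_version)`.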
After running the comparison, pandas returns a dataframe that contains only the rows and columns where the two datasets differ. For each changed cell, the previous value appears under self and the current value under other. We can also customize the comparison behavior by using additional arguments with the compare method. For instance, we can align the differences on rows instead of columns with align_axis, or keep the original shape of the dataframes with keep_shape.
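To make these options concrete, here is a short sketch that uses made-up data. The arguments `align_axis`, `keep_shape`, and `keep_equal` are parameters of `DataFrame.compare`; the dataframes themselves are illustrative assumptions.

```python
import pandas as pd

# Made-up data: same shape, one changed price.
df_initial = pd.DataFrame({"Product": ["A", "B", "C"], "Price": [10, 20, 30]})
df_updated = pd.DataFrame({"Product": ["A", "B", "C"], "Price": [10, 25, 30]})

# Stack "self" and "other" on rows instead of placing them side by side.
by_rows = df_initial.compare(df_updated, align_axis=0)

# Keep every row and column; unchanged cells are shown as NaN.
full_shape = df_initial.compare(df_updated, keep_shape=True)

# Keep every row and column and show the unchanged values as well.
full_values = df_initial.compare(df_updated, keep_shape=True, keep_equal=True)

print(by_rows)
print(full_shape)
print(full_values)
```

Aligning on rows tends to be easier to read when many columns differ, while `keep_shape=True` is useful when the position of each cell must be preserved.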
Finally, we can export the comparison results to a new Excel file using the to_excel method provided by pandas. This allows us to have a clear overview of the differences between the two datasets. We can save this file with an appropriate name to easily refer to it later.
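A possible export step is sketched below. The helper name `export_differences` and the output file name are hypothetical, and the data is made up. The call that writes the file is left commented out because it needs openpyxl and creates a file on disk.

```python
import pandas as pd


def export_differences(diff, output_path="differences.xlsx"):
    """Write the comparison result to an Excel file and return its path."""
    # Keep the default index=True: the result of compare() has two header
    # levels, and pandas may refuse to write such columns without an index.
    diff.to_excel(output_path)
    return output_path


# Made-up data standing in for the two workbooks.
df_initial = pd.DataFrame({"Product": ["A", "B", "C"], "Price": [10, 20, 30]})
df_updated = pd.DataFrame({"Product": ["A", "B", "C"], "Price": [10, 25, 30]})
diff = df_initial.compare(df_updated)

# Uncomment to write the file (requires openpyxl):
# export_differences(diff, "differences.xlsx")
```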
4. Highlighting Differences and Adding Comments Using xlwings
In addition to using pandas for comparing and exporting the Excel sheets, we can also utilize the xlwings library to highlight the differences and add comments in the Excel file itself. This can be useful when you want to visually inspect the changes and make it easier to understand the comparison results.
With xlwings, we can create a new instance of Excel and open both the initial and updated workbooks within it. We can then iterate over each cell in the used range of the worksheets and compare the current value with the old value. If they differ, we can add a comment to the cell indicating the change. Additionally, we can change the background color of the cell to visually highlight the difference.
Once we have made the necessary changes, we can save the updated workbook as a new file. This will create a new workbook with the changes highlighted and comments added where necessary. By utilizing xlwings, we can provide a more comprehensive comparison and make it easier for users to understand the differences in the Excel sheet.
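The steps above could be sketched roughly as follows. This is an illustration under several assumptions: Excel is installed (xlwings needs it), the data sits on the first sheet of each workbook, and the comment is added through the underlying COM object via `cell.api.AddComment`, which is specific to Windows. The function names and the comment wording are hypothetical.

```python
from pathlib import Path


def change_note(old_value, source_name):
    """Build the comment text for a changed cell (wording is illustrative)."""
    return f"Value from {source_name}: {old_value}"


def highlight_cell_differences(initial_path, updated_path, output_path):
    """Mark every changed cell in the updated workbook and save a copy."""
    import xlwings as xw  # imported here so change_note works without Excel

    app = xw.App(visible=False)  # separate Excel instance
    try:
        wb_initial = app.books.open(str(initial_path))
        wb_updated = app.books.open(str(updated_path))
        sheet_initial = wb_initial.sheets[0]  # assumption: first sheet
        sheet_updated = wb_updated.sheets[0]

        for cell in sheet_updated.used_range:
            old_value = sheet_initial.range((cell.row, cell.column)).value
            if cell.value != old_value:
                # Assumption: Windows, where .api exposes the COM Range object.
                cell.api.AddComment(change_note(old_value, Path(initial_path).name))
                cell.color = (255, 255, 0)  # yellow background

        wb_updated.save(str(output_path))
    finally:
        app.quit()  # closes Excel without saving the other workbook


print(change_note(20, "initial_version.xlsx"))
```

Saving under a new path leaves both original workbooks untouched, which makes it safe to rerun the comparison.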
5. Comparing Two Datasets with Different Numbers of Rows
Next, let's explore a more complicated example where we compare two datasets with different numbers of rows. In this case, the spreadsheets may have additional rows or entries. The pandas compare method only works on identically labeled dataframes, so it cannot be applied directly. Instead, we will merge the two dataframes to show the differences.
Before merging the dataframes, we reset the index of the updated dataframe. This turns each row's original position into a regular column, so that after the merge we can still tell which row of the updated sheet an entry came from. Once the index is reset, we can perform an outer merge on the dataframes, which keeps the rows from both datasets.
We can use the indicator argument in the merge method to create a new column in the merged dataframe that indicates whether a row exists in the left dataframe, right dataframe, or both. By filtering out the rows that exist in both dataframes, we can obtain a dataframe that shows the differences. This dataframe will highlight the entries that have changed or are unique to one of the datasets.
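The merge described above might look like this. The data is made up, and the indicator column name "Exists" is an arbitrary choice; `how`, `indicator`, and `reset_index` are standard pandas features.

```python
import pandas as pd

# Made-up data: the updated sheet has a changed price and one extra row.
df_initial = pd.DataFrame({"Product": ["A", "B", "C"], "Price": [10, 20, 30]})
df_updated = pd.DataFrame(
    {"Product": ["A", "B", "C", "D"], "Price": [10, 25, 30, 40]}
).reset_index()  # original row position becomes a column named "index"

# Outer merge on the shared columns (Product, Price).
# "Exists" records where each row was found: left_only, right_only, or both.
merged = df_initial.merge(df_updated, how="outer", indicator="Exists")

# Keep only the rows that are not present in both sheets.
diff = merged[merged["Exists"] != "both"]
print(diff)
```

A changed row shows up twice in the result: once as left_only with its old values and once as right_only with its new values.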
6. Highlighting Rows in Excel Using xlwings
To provide a visual representation of the differences in the Excel sheet, we can utilize xlwings once again. This time, we will highlight the rows that have changed or are unique to one of the datasets. To do this, we can iterate over every row in the used range of the Excel worksheet and check if the row number exists in the list of rows that have differences. If a row has a difference, we can change the background color of that entire row.
After highlighting the rows, we can save the workbook as a new file with an appropriate name. The new file will contain the updated version of the Excel sheet with the highlighted rows, making it easy to identify and review the differences.
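A rough sketch of the row highlighting follows. It assumes that the header is in Excel row 1 with data starting in row 2, that the merged dataframe has an "index" column and an "Exists" indicator column, and that Excel is installed for xlwings. Function names, the constant, and the color are hypothetical choices. Only rows marked right_only can be highlighted, because they are the ones present in the updated sheet.

```python
import pandas as pd

FIRST_DATA_ROW = 2  # assumption: header in Excel row 1, data starts in row 2


def excel_rows_to_highlight(diff, indicator_col="Exists", index_col="index"):
    """Translate dataframe positions of changed or new rows to Excel rows."""
    right_only = diff[diff[indicator_col] == "right_only"]
    return sorted(int(i) + FIRST_DATA_ROW for i in right_only[index_col])


def highlight_rows(updated_path, output_path, rows, color=(255, 125, 125)):
    """Color the listed rows in the updated workbook and save a copy."""
    import xlwings as xw  # imported here so the helper above works without Excel

    rows = set(rows)
    app = xw.App(visible=False)
    try:
        wb = app.books.open(str(updated_path))
        sheet = wb.sheets[0]  # assumption: data is on the first sheet
        for row_range in sheet.used_range.rows:
            if row_range.row in rows:
                row_range.color = color  # colors the used part of the row
        wb.save(str(output_path))
    finally:
        app.quit()


# Made-up example of a filtered merge result.
diff = pd.DataFrame(
    {
        "Product": ["B", "B", "D"],
        "Price": [20, 25, 40],
        "index": [float("nan"), 1.0, 3.0],
        "Exists": ["left_only", "right_only", "right_only"],
    }
)
rows = excel_rows_to_highlight(diff)
print(rows)
```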
By using pandas and xlwings together, we can effectively compare Excel sheets with different numbers of rows and provide a clear visual representation of the differences. This approach allows us to efficiently identify and analyze changes in the data, making it a valuable tool for data analysis and quality control.
7. Conclusion
In this article, we have learned how to compare two Excel sheets using Python. We have explored scenarios where the datasets have the same shape and different numbers of rows. We have seen how to use the pandas library to compare dataframes and highlight differences. Additionally, we have utilized the xlwings library to add comments and highlight rows in the Excel sheet itself. By combining these libraries, we can perform comprehensive comparisons and provide a clear visual representation of the differences.