Python Pandas - Duplicated Data



Duplicated data refers to rows in a dataset that appear more than once. Duplicates can arise for several reasons, such as data collection errors, records entered more than once, or merging datasets that overlap. Identifying and removing duplicates is an essential step in data preprocessing and data analysis, because repeated rows can skew counts, averages, and other results.

Consider this sample dataset containing student names and their dates of birth −

Student    Date of Birth
Rahul      01 December 2017
Raj        14 April 2018
Rahul      01 December 2017

In this dataset, the first and last rows hold the same values in every column, so the second "Rahul" row is a duplicate entry.

Pandas provides two primary methods to detect and remove duplicate rows in a DataFrame −

  • duplicated(): Identifies duplicate rows and returns a Boolean mask, where True indicates a duplicate entry.

  • drop_duplicates(): Removes duplicate rows from the DataFrame while keeping the first occurrence by default.

In this tutorial, we will learn how to identify duplicates, check for duplicates in specific columns, and remove them using Pandas.

Identifying Duplicates in a DataFrame

Pandas DataFrame.duplicated() method is used to identify duplicate rows in a DataFrame. By default, it considers all columns to identify duplicates and marks them as True, except for the first occurrence.

This method returns a Boolean Series indicating whether a row is duplicated, where −

  • False: The row is either unique or the first occurrence of a repeated row.

  • True: The row repeats an earlier row in the DataFrame.

Example

The following example demonstrates how to identify duplicate rows in a Pandas DataFrame using the duplicated() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

print("Original DataFrame:")
print(df)

# Find duplicates in the DataFrame
result = df.duplicated()

# Display the resulting Boolean mask
print('\nResult after finding the duplicates:')
print(result)

Following is the output of the above code −

Original DataFrame:
    Name     Date_of_Birth
0  Rahul  01 December 2017
1    Raj     14 April 2018
2  Rahul  01 December 2017

Result after finding the duplicates:
0    False
1    False
2     True
dtype: bool

In the example, the third row is marked as a duplicate since it has the same values as the first row.
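Which occurrence counts as "the duplicate" is controlled by the keep parameter of duplicated(). The sketch below reuses the sample data from the example above and compares the three accepted values; the variable names are only illustrative.

```python
import pandas as pd

# Same sample data as the example above
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

# keep='first' (the default): later repeats are flagged
first_mask = df.duplicated(keep='first')

# keep='last': earlier repeats are flagged, the last occurrence is not
last_mask = df.duplicated(keep='last')

# keep=False: every row that has an identical twin is flagged
all_mask = df.duplicated(keep=False)

# Show the three masks side by side for comparison
print(pd.DataFrame({'first': first_mask, 'last': last_mask, 'all': all_mask}))
```

Passing keep=False is handy when you want to inspect every copy of a repeated row rather than only the later ones.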

Identifying Duplicates on Specific Columns

To find duplicates based on specific columns, use the subset parameter of the duplicated() method.

Example

The following example demonstrates how to identify duplicate rows based on specific columns (here, Name and City) using the subset parameter of the duplicated() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

print("Original DataFrame:")
print(df)

# Find duplicates based on the Name and City columns only
result = df.duplicated(subset=['Name', 'City'])

# Display the resulting Boolean mask
print('\nResult after finding the duplicates:')
print(result)

Following is the output of the above code −

Original DataFrame:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
2    Rahul  01 December 2017    Kolkata
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai
5  Karthik      14 July 2000  Hyderabad

Result after finding the duplicates:
0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

Only row 5 is flagged, because it repeats the Name and City of row 3. Row 2 is not flagged even though it shares a Name with row 0, because its City is different.
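Because duplicated() returns a Boolean Series, the mask can be summed to count duplicates or used for Boolean indexing to view the offending rows. The following is a minimal sketch on the same sample data; the names mask, duplicate_count, and duplicate_rows are illustrative.

```python
import pandas as pd

# Same sample data as the example above
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

# Boolean mask of rows that repeat an earlier (Name, City) pair
mask = df.duplicated(subset=['Name', 'City'])

# True counts as 1, so the sum is the number of flagged rows
duplicate_count = int(mask.sum())

# Boolean indexing keeps only the flagged rows
duplicate_rows = df[mask]

print("Number of duplicates:", duplicate_count)
print(duplicate_rows)
```

Inspecting the flagged rows before removing anything is a cheap way to confirm that the chosen subset of columns really identifies repeats.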

Removing Duplicates

The drop_duplicates() method is used to remove duplicate rows from the DataFrame. By default, it considers all columns and keeps the first occurrence of each duplicated row, while removing the rest.

Example

This example removes the duplicate rows from a Pandas DataFrame using the drop_duplicates() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

print("Original DataFrame:")
print(df)

# Drop rows that are duplicated across all columns
result = df.drop_duplicates()

# Display the DataFrame without the duplicate rows
print('\nResult after dropping the duplicates:')
print(result)

Following is the output of the above code −

Original DataFrame:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
2    Rahul  01 December 2017    Kolkata
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai
5  Karthik      14 July 2000  Hyderabad

Result after dropping the duplicates:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
2    Rahul  01 December 2017    Kolkata
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai

Only row 5 is removed, because it matches row 3 in every column. Row 2 is kept because its City differs from that of row 0.
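drop_duplicates() returns a new DataFrame and leaves the original untouched. It also accepts a keep parameter, and an ignore_index flag that renumbers the result from 0. The sketch below, on the same sample data, keeps the last occurrence instead of the first; result_last is an illustrative name.

```python
import pandas as pd

# Same sample data as the example above
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

# Keep the last occurrence of each repeated row and renumber the index
result_last = df.drop_duplicates(keep='last', ignore_index=True)

print(result_last)

# The original DataFrame still has all six rows
print("Rows in original:", len(df))
```

Here row 3 is dropped rather than row 5, so "Karthik" moves to the end of the result.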

Removing Duplicates in Specific Columns

You can also remove duplicates based on specific columns using the subset parameter of the drop_duplicates() method.

Example

This example removes rows that repeat a value in the Date_of_Birth column, using the subset parameter of the drop_duplicates() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

print("Original DataFrame:")
print(df)

# Drop rows whose Date_of_Birth repeats an earlier row
result = df.drop_duplicates(subset=['Date_of_Birth'])

# Display the DataFrame without the duplicate rows
print('\nResult after dropping the duplicates:')
print(result)

Following is the output of the above code −

Original DataFrame:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
2    Rahul  01 December 2017    Kolkata
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai
5  Karthik      14 July 2000  Hyderabad

Result after dropping the duplicates:
      Name     Date_of_Birth       City
0    Rahul  01 December 2017  Hyderabad
1      Raj     14 April 2018    Chennai
3  Karthik      14 July 2000  Hyderabad
4     Arya       26 May 2000    Chennai

Rows 2 and 5 are removed because their Date_of_Birth values already appear in rows 0 and 3. Row 2 is dropped even though its City differs, since only the subset column is compared.
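The two methods are closely related: dropping duplicates gives the same rows as filtering with the negated duplicated() mask. The sketch below checks this on the same sample data and also shows keep=False, which discards every row whose subset value is repeated; by_dob, via_mask, and unique_only are illustrative names.

```python
import pandas as pd

# Same sample data as the example above
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

# Drop rows whose Date_of_Birth repeats an earlier row
by_dob = df.drop_duplicates(subset=['Date_of_Birth'])

# Equivalent: keep the rows that duplicated() does NOT flag
via_mask = df[~df.duplicated(subset=['Date_of_Birth'])]
print("Same result:", by_dob.equals(via_mask))

# keep=False removes every row whose Date_of_Birth occurs more than once
unique_only = df.drop_duplicates(subset=['Date_of_Birth'], keep=False)
print(unique_only)
```

The mask form is useful when the duplicate test needs to be combined with other conditions before filtering.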