Python Pandas - Duplicated Data

Duplicated data refers to rows in a dataset that appear more than once. Duplicates can arise from data collection errors, repeated records, or merging datasets. Identifying and removing them is an essential step in data preprocessing and analysis, because repeated rows can distort counts, averages, and other results.

Consider this sample dataset containing student names and their dates of birth −

Student	Date of Birth
Rahul	01 December 2017
Raj	14 April 2018
Rahul	01 December 2017

In this dataset, the first and third rows hold identical values, so the second "Rahul" row is a duplicate entry.

Pandas provides two primary methods to detect and remove duplicate rows in a DataFrame −

duplicated(): Identifies duplicate rows and returns a Boolean mask, where True indicates a duplicate entry.
drop_duplicates(): Removes duplicate rows from the DataFrame while keeping the first occurrence by default.

In this tutorial, we will learn how to identify duplicates, check for duplicates in specific columns, and remove them using Pandas.

Identifying Duplicates in a DataFrame

The DataFrame.duplicated() method identifies duplicate rows in a DataFrame. By default, it compares all columns and marks every repeat of a row as True, while the first occurrence is marked False.

This method returns a Boolean Series indicating whether a row is duplicated, where −

False: The row is unique, or it is the first occurrence of a repeated row.
True: The row repeats an earlier row in the DataFrame.
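Which occurrence counts as the "original" can be changed with the keep parameter of duplicated(), which this tutorial does not otherwise cover. The following is a minimal sketch, assuming the three-row sample dataset from above; the variable names are illustrative only.

```python
import pandas as pd

# Three-row sample: rows 0 and 2 are identical
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

# keep='first' (the default): later repeats are flagged
first_mask = df.duplicated(keep='first')

# keep='last': earlier repeats are flagged, the last occurrence is not
last_mask = df.duplicated(keep='last')

# keep=False: every row that has a repeat anywhere is flagged
all_mask = df.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True]
print(last_mask.tolist())   # [True, False, False]
print(all_mask.tolist())    # [True, False, True]
```

Using keep=False is handy when you want to inspect every member of a duplicate group rather than only the repeats.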

Example

The following example demonstrates how to identify duplicate rows in a Pandas DataFrame using the duplicated() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

print("Original DataFrame:")
print(df)

# Flag rows that repeat an earlier row across all columns
result = df.duplicated()

# Display the resulting Boolean Series
print('\nResult after finding the duplicates:')
print(result)

Following is the output of the above code −

Original DataFrame:

	Name	Date_of_Birth
0	Rahul	01 December 2017
1	Raj	14 April 2018
2	Rahul	01 December 2017

Result after finding the duplicates:

0    False
1    False
2     True
dtype: bool

In the example, the third row is marked as a duplicate since it has the same values as the first row.
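Because the result is a Boolean Series, it can be summed to count the duplicates or used as a mask to pull out the flagged rows. The snippet below is a small sketch of both uses on the same sample data; the names num_duplicates and duplicate_rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

mask = df.duplicated()

# True counts as 1, so the sum is the number of flagged rows
num_duplicates = int(mask.sum())

# Boolean indexing keeps only the rows where the mask is True
duplicate_rows = df[mask]

print(num_duplicates)  # 1
print(duplicate_rows)
```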

Identifying Duplicates on Specific Columns

To find duplicates based on specific columns, use the subset parameter of the duplicated() method.

Example

The following example demonstrates how to identify duplicates based on specific columns, here Name and City, using the subset parameter of the duplicated() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017', '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

print("Original DataFrame:")
print(df)

# Flag rows whose Name and City both repeat an earlier row
result = df.duplicated(subset=['Name', 'City'])

# Display the resulting Boolean Series
print('\nResult after finding the duplicates:')
print(result)

Following is the output of the above code −

Original DataFrame:

	Name	Date_of_Birth	City
0	Rahul	01 December 2017	Hyderabad
1	Raj	14 April 2018	Chennai
2	Rahul	01 December 2017	Kolkata
3	Karthik	14 July 2000	Hyderabad
4	Arya	26 May 2000	Chennai
5	Karthik	14 July 2000	Hyderabad

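Result after finding the duplicates:

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

Only the last row is flagged, because its Name and City both match row 3. Row 2 shares a name with row 0 but has a different city, so it is not treated as a duplicate.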
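The subset parameter also accepts a single column label, and it can be combined with keep=False to view every row in a duplicate group side by side. The following sketch assumes the same six-row dataset; keep=False is an addition not used elsewhere in this tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

# A single column label: rows whose Name repeats an earlier Name
name_mask = df.duplicated(subset='Name')

# keep=False flags all members of each (Name, City) duplicate group
group_mask = df.duplicated(subset=['Name', 'City'], keep=False)
matching_rows = df[group_mask]

print(name_mask.tolist())            # [False, False, True, False, False, True]
print(matching_rows.index.tolist())  # [3, 5]
```

Comparing name_mask with the two-column result shows how widening the subset makes the duplicate test stricter.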

Removing Duplicates

The drop_duplicates() method is used to remove duplicate rows from the DataFrame. By default, it considers all columns and keeps the first occurrence of each duplicated row, while removing the rest.

Example

This example removes the duplicate rows from a Pandas DataFrame using the drop_duplicates() method.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017', '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

print("Original DataFrame:")
print(df)

# Drop rows that repeat an earlier row across all columns
result = df.drop_duplicates()

# Display the DataFrame without duplicates
print('\nResult after dropping the duplicates:')
print(result)

Following is the output of the above code −

Original DataFrame:

	Name	Date_of_Birth	City
0	Rahul	01 December 2017	Hyderabad
1	Raj	14 April 2018	Chennai
2	Rahul	01 December 2017	Kolkata
3	Karthik	14 July 2000	Hyderabad
4	Arya	26 May 2000	Chennai
5	Karthik	14 July 2000	Hyderabad

Result after dropping the duplicates:

	Name	Date_of_Birth	City
0	Rahul	01 December 2017	Hyderabad
1	Raj	14 April 2018	Chennai
2	Rahul	01 December 2017	Kolkata
3	Karthik	14 July 2000	Hyderabad
4	Arya	26 May 2000	Chennai
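The default behavior can be adjusted with two further parameters of drop_duplicates() that this tutorial does not demonstrate: keep, which selects the occurrence to retain, and ignore_index, which renumbers the result (available in pandas 1.0 and later). A minimal sketch under those assumptions, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

# keep='last' retains the final occurrence of each repeated row
last_kept = df.drop_duplicates(keep='last')

# keep=False drops every row that has a repeat
none_kept = df.drop_duplicates(keep=False)

# ignore_index=True relabels the surviving rows 0..n-1
renumbered = df.drop_duplicates(keep='last', ignore_index=True)

print(last_kept.index.tolist())   # [0, 1, 2, 4, 5]
print(none_kept.index.tolist())   # [0, 1, 2, 4]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
print(len(df))                    # 6, the original DataFrame is unchanged
```

Note that drop_duplicates() returns a new DataFrame by default, so the original df still contains all six rows.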

Removing Duplicates Based on Specific Columns

You can also remove duplicates based on specific columns using the subset parameter of the drop_duplicates() method.

Example

This example removes rows whose Date_of_Birth repeats an earlier value, using the subset parameter of the drop_duplicates() method. Because only the subset column is compared, a row is dropped even if its other columns differ.

import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017', '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

print("Original DataFrame:")
print(df)

# Drop rows whose Date_of_Birth repeats an earlier row
result = df.drop_duplicates(subset=['Date_of_Birth'])

# Display the DataFrame without duplicates
print('\nResult after dropping the duplicates:')
print(result)

Following is the output of the above code −

Original DataFrame:

	Name	Date_of_Birth	City
0	Rahul	01 December 2017	Hyderabad
1	Raj	14 April 2018	Chennai
2	Rahul	01 December 2017	Kolkata
3	Karthik	14 July 2000	Hyderabad
4	Arya	26 May 2000	Chennai
5	Karthik	14 July 2000	Hyderabad

Result after dropping the duplicates:

	Name	Date_of_Birth	City
0	Rahul	01 December 2017	Hyderabad
1	Raj	14 April 2018	Chennai
3	Karthik	14 July 2000	Hyderabad
4	Arya	26 May 2000	Chennai
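The two methods are closely related: dropping duplicates gives the same rows as filtering with the negated mask from duplicated(). The sketch below checks this on the sample data, assuming the same subset column; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

# Remove rows whose Date_of_Birth repeats an earlier row
dropped = df.drop_duplicates(subset=['Date_of_Birth'])

# Build the same result by hand: keep rows where the mask is False
mask = df.duplicated(subset=['Date_of_Birth'])
filtered = df[~mask]

print(dropped.equals(filtered))  # True
print(filtered.index.tolist())   # [0, 1, 3, 4]
```

The mask-based form is useful when the duplicate test needs to be combined with other conditions before filtering.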
