
Python Pandas - Duplicated Data
Duplicated data refers to rows in a dataset that appear more than once. Duplicates can arise from data collection errors, repeated records, or merging datasets. Identifying and removing them is an essential step in data preprocessing and analysis, because repeated rows can skew counts, averages, and other results.
Consider this sample dataset containing student names and their dates of birth −
Student | Date of Birth |
---|---|
Rahul | 01 December 2017 |
Raj | 14 April 2018 |
Rahul | 01 December 2017 |
In this dataset, the first and last rows hold identical values, so the last "Rahul" row is a duplicate of the first.
Pandas provides two primary methods to detect and remove duplicate rows in a DataFrame −
duplicated(): Identifies duplicate rows and returns a Boolean mask, where True indicates a duplicate entry.
drop_duplicates(): Removes duplicate rows from the DataFrame while keeping the first occurrence by default.
In this tutorial, we will learn how to identify duplicates, check for duplicates in specific columns, and remove them using Pandas.
Identifying Duplicates in a DataFrame
The DataFrame.duplicated() method identifies duplicate rows in a DataFrame. By default, it compares all columns and marks every repeated row as True, except for its first occurrence.
This method returns a Boolean Series with one entry per row, where −
False: The row is unique, or it is the first occurrence of a repeated row.
True: The row repeats an earlier row in the DataFrame.
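Which occurrence is treated as the original can be changed with the keep parameter of duplicated(). The following is a minimal sketch on the sample student data; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

# keep='first' (the default): later repeats are flagged, the first occurrence is not
first_mask = df.duplicated(keep='first')

# keep='last': earlier repeats are flagged, the last occurrence is not
last_mask = df.duplicated(keep='last')

# keep=False: every row that has a repeat is flagged
all_mask = df.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True]
print(last_mask.tolist())   # [True, False, False]
print(all_mask.tolist())    # [True, False, True]
```

Using keep=False is handy when you want to inspect all copies of a repeated row rather than only the later ones.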
Example
The following example demonstrates how to identify duplicate rows in a Pandas DataFrame using the duplicated() method.

    import pandas as pd

    # Sample dataset
    df = pd.DataFrame({
        'Name': ['Rahul', 'Raj', 'Rahul'],
        'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

    print("Original DataFrame:")
    print(df)

    # Find duplicates in the DataFrame
    result = df.duplicated()

    # Display the resulting Boolean mask
    print('\nResult after finding the duplicates:')
    print(result)

Following is the output of the above code −
Original DataFrame:
Index | Name | Date_of_Birth |
---|---|---|
0 | Rahul | 01 December 2017 |
1 | Raj | 14 April 2018 |
2 | Rahul | 01 December 2017 |
Result after finding the duplicates:

    0    False
    1    False
    2     True
    dtype: bool

In the example, the third row is marked as a duplicate since it has the same values as the first row.
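Because the result is a Boolean Series, it can be used directly for counting and filtering. The sketch below assumes the same sample data as the example; n_duplicates and duplicate_rows are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017']})

mask = df.duplicated()

# True counts as 1, so summing the mask gives the number of duplicate rows
n_duplicates = int(mask.sum())

# Boolean indexing keeps only the rows flagged as duplicates
duplicate_rows = df[mask]

print(n_duplicates)    # 1
print(duplicate_rows)  # the row with index label 2
```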
Identifying Duplicates on Specific Columns
To find duplicates based on specific columns, use the subset parameter of the duplicated() method.
Example
The following example demonstrates how to identify duplicate rows based on specific columns (here Name and City) using the subset parameter of the duplicated() method.

    import pandas as pd

    # Sample dataset
    df = pd.DataFrame({
        'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
        'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                          '14 July 2000', '26 May 2000', '14 July 2000'],
        'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

    print("Original DataFrame:")
    print(df)

    # Find duplicates based on the Name and City columns only
    result = df.duplicated(subset=['Name', 'City'])

    # Display the resulting Boolean mask
    print('\nResult after finding the duplicates:')
    print(result)

Following is the output of the above code −
Original DataFrame:
Index | Name | Date_of_Birth | City |
---|---|---|---|
0 | Rahul | 01 December 2017 | Hyderabad |
1 | Raj | 14 April 2018 | Chennai |
2 | Rahul | 01 December 2017 | Kolkata |
3 | Karthik | 14 July 2000 | Hyderabad |
4 | Arya | 26 May 2000 | Chennai |
5 | Karthik | 14 July 2000 | Hyderabad |
Result after finding the duplicates:

    0    False
    1    False
    2    False
    3    False
    4    False
    5     True
    dtype: bool

Only row 5 is marked True, because it repeats the Name and City combination of row 3. Rows 0 and 2 share a name but differ in city, so neither is flagged.
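To see how often each combination of key columns occurs, rather than only which rows repeat, one option is to group by those columns and count. This is a sketch of a complementary technique using groupby(), not part of duplicated() itself; counts and repeated are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

# Count how many rows share each (Name, City) combination
counts = df.groupby(['Name', 'City']).size()

# Keep only the combinations that occur more than once
repeated = counts[counts > 1]

print(repeated)
```

Here only the combination ('Karthik', 'Hyderabad') occurs twice, which agrees with the single True value that duplicated(subset=['Name', 'City']) reports for this data.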
Removing Duplicates
The drop_duplicates() method is used to remove duplicate rows from the DataFrame. By default, it considers all columns and keeps the first occurrence of each duplicated row, while removing the rest.
Example
This example removes the duplicate rows from a Pandas DataFrame using the drop_duplicates() method.

    import pandas as pd

    # Sample dataset
    df = pd.DataFrame({
        'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
        'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                          '14 July 2000', '26 May 2000', '14 July 2000'],
        'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

    print("Original DataFrame:")
    print(df)

    # Drop rows that repeat an earlier row in every column
    result = df.drop_duplicates()

    # Display the DataFrame without duplicates
    print('\nResult after removing the duplicates:')
    print(result)

Following is the output of the above code −
Original DataFrame:
Index | Name | Date_of_Birth | City |
---|---|---|---|
0 | Rahul | 01 December 2017 | Hyderabad |
1 | Raj | 14 April 2018 | Chennai |
2 | Rahul | 01 December 2017 | Kolkata |
3 | Karthik | 14 July 2000 | Hyderabad |
4 | Arya | 26 May 2000 | Chennai |
5 | Karthik | 14 July 2000 | Hyderabad |
Result after removing the duplicates:
Index | Name | Date_of_Birth | City |
---|---|---|---|
0 | Rahul | 01 December 2017 | Hyderabad |
1 | Raj | 14 April 2018 | Chennai |
2 | Rahul | 01 December 2017 | Kolkata |
3 | Karthik | 14 July 2000 | Hyderabad |
4 | Arya | 26 May 2000 | Chennai |
Row 5 is removed because it matches row 3 in every column. Row 2 is kept because its City differs from that of row 0.
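The drop_duplicates() method also accepts a keep parameter, and an ignore_index parameter that renumbers the result. The sketch below assumes pandas 1.0 or later (where ignore_index is available); the result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

# keep='last': keep the last copy of each repeated row (row 3 is dropped)
last_kept = df.drop_duplicates(keep='last')

# keep=False: drop every copy of a repeated row (rows 3 and 5 are dropped)
none_kept = df.drop_duplicates(keep=False)

# ignore_index=True: relabel the remaining rows 0..n-1
renumbered = df.drop_duplicates(ignore_index=True)

print(last_kept.index.tolist())   # [0, 1, 2, 4, 5]
print(none_kept.index.tolist())   # [0, 1, 2, 4]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

Note that drop_duplicates() returns a new DataFrame in each case; the original df is left unchanged.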
Removing Duplicates in Specific Columns
You can also remove duplicates based on specific columns using the subset parameter of the drop_duplicates() method.
Example
This example removes the duplicate data of a DataFrame based on specific columns using the subset parameter of the drop_duplicates() method.

    import pandas as pd

    # Sample dataset
    df = pd.DataFrame({
        'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
        'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                          '14 July 2000', '26 May 2000', '14 July 2000'],
        'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

    print("Original DataFrame:")
    print(df)

    # Drop rows whose Date_of_Birth repeats an earlier row
    result = df.drop_duplicates(subset=['Date_of_Birth'])

    # Display the DataFrame without duplicates
    print('\nResult after removing the duplicates:')
    print(result)

Following is the output of the above code −
Original DataFrame:
Index | Name | Date_of_Birth | City |
---|---|---|---|
0 | Rahul | 01 December 2017 | Hyderabad |
1 | Raj | 14 April 2018 | Chennai |
2 | Rahul | 01 December 2017 | Kolkata |
3 | Karthik | 14 July 2000 | Hyderabad |
4 | Arya | 26 May 2000 | Chennai |
5 | Karthik | 14 July 2000 | Hyderabad |
Result after removing the duplicates:
Index | Name | Date_of_Birth | City |
---|---|---|---|
0 | Rahul | 01 December 2017 | Hyderabad |
1 | Raj | 14 April 2018 | Chennai |
3 | Karthik | 14 July 2000 | Hyderabad |
4 | Arya | 26 May 2000 | Chennai |
Rows 2 and 5 are removed because their dates of birth already appear in rows 0 and 3. The remaining rows keep their original index labels.
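The two methods are closely related: dropping duplicates on a subset gives the same rows as filtering with the negated mask from duplicated() on that subset. The following sketch checks this on the sample data; cols, dropped, and filtered are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rahul', 'Raj', 'Rahul', 'Karthik', 'Arya', 'Karthik'],
    'Date_of_Birth': ['01 December 2017', '14 April 2018', '01 December 2017',
                      '14 July 2000', '26 May 2000', '14 July 2000'],
    'City': ['Hyderabad', 'Chennai', 'Kolkata', 'Hyderabad', 'Chennai', 'Hyderabad']})

cols = ['Date_of_Birth']

# Remove rows whose Date_of_Birth repeats an earlier row
dropped = df.drop_duplicates(subset=cols)

# Equivalent: keep the rows that duplicated() does NOT flag
filtered = df[~df.duplicated(subset=cols)]

print(dropped.equals(filtered))               # True
print(dropped.duplicated(subset=cols).any())  # False: no repeats remain
```

The second check is a simple way to confirm that a cleaned DataFrame no longer contains duplicates on the chosen columns.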