Python | Pandas Index.duplicated()
The Index.duplicated() method in Pandas is a powerful tool for identifying duplicate values within an index. It returns a boolean array in which duplicates are marked as True, according to the specified criteria, while False denotes unique values or the first occurrence of a duplicate. This method is especially useful for data cleaning and preprocessing, ensuring that your data is free from redundancy and inconsistencies.
How Index.duplicated() Works
This method iterates over the values in a Pandas index and checks for duplicates:
- First occurrence: always marked as False.
- Subsequent duplicates: marked as True.
You can optionally change this behavior using the keep parameter, which lets you specify whether to retain the first occurrence, the last occurrence, or none.
Syntax:
Index.duplicated(keep='first')
keep: determines which duplicates to mark as True.
- 'first' (default): marks all duplicates except the first occurrence.
- 'last': marks all duplicates except the last occurrence.
- False: marks all occurrences of duplicates.
Example 1: Default Behavior (keep='first')
import pandas as pd
# Create an Index with duplicates
idx = pd.Index(['Apple', 'Banana', 'Apple', 'Cherry', 'Banana'])
# Identify duplicates, keeping the first occurrence as unique
print(idx.duplicated(keep='first'))
# Output: [False False True False True]
In this example, the first occurrences of "Apple" and "Banana" are marked as unique (False), while subsequent occurrences are flagged as duplicates (True).
Example 2: Retaining the Last Occurrence (keep='last')
import pandas as pd
idx = pd.Index(['Apple', 'Banana', 'Apple', 'Cherry', 'Banana'])
# Retain the last occurrence of duplicates
print(idx.duplicated(keep='last'))
# Output: [ True True False False False]
Here, the last occurrence of each duplicate is flagged as False, while the earlier ones are True.
Example 3: Marking All Duplicates (keep=False)
import pandas as pd
idx = pd.Index(['Apple', 'Banana', 'Apple', 'Cherry', 'Banana'])
# Mark all duplicates
print(idx.duplicated(keep=False))
# Output: [ True True True False True]
Additional Use Cases of Index.duplicated() in Pandas
Beyond its basic usage, it can handle more complex scenarios, making it an essential feature for advanced data cleaning and preprocessing tasks. Below are some additional examples and use cases that demonstrate the power and flexibility of this method.
1. Filtering Rows with Duplicated Indices
If you want to filter out rows with duplicate indices, you can combine Index.duplicated() with boolean indexing.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Filter rows with unique indices
unique_rows = data[~data.index.duplicated(keep='first')]
print("Filtered DataFrame (Unique Indices):")
print(unique_rows)
Output
Filtered DataFrame (Unique Indices):
   Values
A      10
B      20
C      40
This is particularly useful when you need to retain only the first occurrence of each index while discarding duplicates.
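The same pattern works with keep='last' if you would rather retain the final occurrence of each index label; a minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])

# Invert the duplicate mask built with keep='last' to retain
# the last occurrence of each index label instead of the first
last_rows = data[~data.index.duplicated(keep='last')]
print(last_rows)
```

Here the row A/10 is dropped and A/30 survives, the mirror image of the keep='first' result above.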
2. Identifying All Duplicate Entries
To identify all occurrences of duplicate indices, use keep=False. This flags every instance of a duplicate value.
import pandas as pd
# Create a DataFrame with duplicate indices
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Identify all duplicated indices
duplicates = data.index[data.index.duplicated(keep=False)]
print("All Duplicate Indices:")
print(duplicates)
Output
All Duplicate Indices:
Index(['A', 'A'], dtype='object')
This approach is helpful when you need to isolate all rows associated with non-unique indices for further inspection.
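To pull the actual rows rather than just the labels, the same keep=False mask can index the DataFrame directly; a minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])

# Select every row whose index label occurs more than once
dup_rows = data[data.index.duplicated(keep=False)]
print(dup_rows)
```

This returns both 'A' rows (values 10 and 30), ready for inspection.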
3. Handling Missing Values (NaN) in Indices
The Index.duplicated() method treats NaN like any other value: a single NaN is considered unique, while repeated NaN entries in an index are flagged as duplicates.
import pandas as pd
# Index containing NaN values
idx = pd.Index([1, 2, None, 1, None])
print(idx.duplicated(keep='first'))
Output
[False False False True True]
This is useful when dealing with datasets that include missing or null values in the index and you need to ensure proper handling of such cases.
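If NaN labels should not participate in duplicate detection at all, one option is to drop them before calling duplicated(); a sketch, assuming NumPy is available:

```python
import pandas as pd
import numpy as np

idx = pd.Index([1, 2, np.nan, 1, np.nan])

# Remove NaN labels first, then check the remaining values for duplicates
clean = idx[idx.notna()]
print(clean.duplicated(keep='first'))  # only the repeated 1 is flagged
```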
4. Grouping Data by Non-Unique Indices
When working with non-unique indices, grouping data by these indices can help aggregate or summarize information.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30]}, index=['A', 'B', 'A'])
# Group by index and calculate the sum
grouped_data = data.groupby(level=0).sum()
print("Grouped Data:")
print(grouped_data)
Output
Grouped Data:
   Values
A      40
B      20
This technique is useful for resolving duplicates by aggregating data instead of simply removing them.
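Sum is not the only way to resolve duplicate labels: groupby accepts any aggregation. A sketch showing several at once:

```python
import pandas as pd

data = pd.DataFrame({'Values': [10, 20, 30]}, index=['A', 'B', 'A'])

# Summarize each index label with several aggregations at once
summary = data.groupby(level=0)['Values'].agg(['sum', 'mean', 'count'])
print(summary)
```

For label 'A' this yields sum 40, mean 20.0, and count 2.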
5. Detecting Duplicate Labels Before Operations
Before performing operations like merging or concatenation, it’s crucial to check for duplicate labels to avoid unexpected behavior.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Check if the index has duplicates
has_duplicates = data.index.has_duplicates
print("Does the index have duplicates?")
print(has_duplicates)
Output
Does the index have duplicates?
True
This helps prevent errors caused by non-unique indices during operations that require unique labels.
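One way to turn this check into a hard guarantee is pandas' verify_integrity option, which makes concat raise if the resulting index would contain duplicates; a sketch:

```python
import pandas as pd

a = pd.DataFrame({'Values': [10]}, index=['A'])
b = pd.DataFrame({'Values': [30]}, index=['A'])

# verify_integrity=True raises ValueError if the concatenated
# index would contain duplicate labels
try:
    combined = pd.concat([a, b], verify_integrity=True)
except ValueError as err:
    print("Duplicate labels detected:", err)
```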
6. Using Boolean Indexing to Extract Duplicates
You can extract rows corresponding to duplicate indices using boolean indexing.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Extract rows with duplicated indices
duplicate_rows = data[data.index.duplicated(keep='first')]
print("Rows with Duplicated Indices:")
print(duplicate_rows)
Output
Rows with Duplicated Indices:
   Values
A      30
This is useful when you need to analyze or process only the duplicate entries in your dataset.
7. Combining Index.duplicated() with Custom Logic
For advanced scenarios, you can combine Index.duplicated() with custom logic to handle duplicates differently based on specific conditions.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Custom logic: keep the first occurrence of 'A', but flag its later occurrences
custom_duplicates = data.index.duplicated(keep='first') & (data.index == 'A')
print("Custom Duplicate Flags:")
print(custom_duplicates)
Output
Custom Duplicate Flags:
[False False  True False]
This allows for tailored handling of duplicates based on domain-specific requirements.