Python | Pandas Index.duplicated()
The Index.duplicated() method in Pandas is a powerful tool for identifying duplicate values within an index. It returns a boolean array in which duplicates are marked as True, according to the specified criteria, while False denotes unique values or the first occurrence of a duplicate. This method is especially useful for data cleaning and preprocessing, ensuring that your data is free from redundancy and inconsistencies.
How Index.duplicated() Works
This method iterates over the values in a Pandas index and checks for duplicates:
- First occurrence: always marked as False.
- Subsequent duplicates: marked as True.
You can optionally change this behavior using the keep parameter, which lets you specify whether to retain the first occurrence, the last occurrence, or none.
Syntax:
Index.duplicated(keep='first')
keep: determines which duplicates to mark as True.
- 'first' (default): marks all duplicates except the first occurrence.
- 'last': marks all duplicates except the last occurrence.
- False: marks all occurrences of duplicates.
Example 1: Default Behavior (keep='first')
import pandas as pd
# Create an Index with duplicates
idx = pd.Index(['Apple', 'Banana', 'Apple', 'Cherry', 'Banana'])
# Identify duplicates, keeping the first occurrence as unique
print(idx.duplicated(keep='first'))
# Output: [False False True False True]
In this example, the first occurrences of "Apple" and "Banana" are marked as unique (False), while subsequent occurrences are flagged as duplicates (True).
Example 2: Retaining the Last Occurrence (keep='last')
import pandas as pd
idx = pd.Index(['Apple', 'Banana', 'Apple', 'Cherry', 'Banana'])
# Retain the last occurrence of duplicates
print(idx.duplicated(keep='last'))
# Output: [ True True False False False]
Here, the last occurrence of each duplicate is flagged as False, while the earlier ones are True.
Example 3: Marking All Duplicates (keep=False)
import pandas as pd
idx = pd.Index(['Apple', 'Banana', 'Apple', 'Cherry', 'Banana'])
# Mark all duplicates
print(idx.duplicated(keep=False))
# Output: [ True True True False True]
Additional Use Cases of Index.duplicated() in Pandas
Beyond its basic usage, it can handle more complex scenarios, making it an essential feature for advanced data cleaning and preprocessing tasks. Below are some additional examples and use cases that demonstrate the power and flexibility of this method.
1. Filtering Rows with Duplicated Indices
If you want to filter out rows with duplicate indices, you can combine Index.duplicated() with boolean indexing.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Filter rows with unique indices
unique_rows = data[~data.index.duplicated(keep='first')]
print("Filtered DataFrame (Unique Indices):")
print(unique_rows)
Output
Filtered DataFrame (Unique Indices):
   Values
A      10
B      20
C      40
This is particularly useful when you need to retain only the first occurrence of each index while discarding duplicates.
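The same pattern works with keep='last' if you would rather retain the final occurrence of each index label; a minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])

# Invert the duplicate mask built with keep='last' to retain
# the last occurrence of each index label instead of the first
last_rows = data[~data.index.duplicated(keep='last')]
print(last_rows)
```

Here the row A/10 is dropped and A/30 survives, the mirror image of the keep='first' result above.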
2. Identifying All Duplicate Entries
To identify all occurrences of duplicate indices, use keep=False. This flags every instance of a duplicate value.
import pandas as pd
# Create a DataFrame with duplicate indices
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Identify all duplicated indices
duplicates = data.index[data.index.duplicated(keep=False)]
print("All Duplicate Indices:")
print(duplicates)
Output
All Duplicate Indices:
Index(['A', 'A'], dtype='object')
This approach is helpful when you need to isolate all rows associated with non-unique indices for further inspection.
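To pull the actual rows rather than just the labels, the same keep=False mask can index the DataFrame directly; a minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])

# Select every row whose index label occurs more than once
dup_rows = data[data.index.duplicated(keep=False)]
print(dup_rows)
```

This returns both 'A' rows (values 10 and 30), ready for inspection.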
3. Handling Missing Values (NaN) in Indices
The Index.duplicated() method treats NaN like any other value: a single NaN is considered unique, while repeated NaN entries in an index are flagged as duplicates.
import pandas as pd
# Index containing NaN values
idx = pd.Index([1, 2, None, 1, None])
print(idx.duplicated(keep='first'))
Output
[False False False True True]
This is useful when dealing with datasets that include missing or null values in the index and you need to ensure proper handling of such cases.
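If NaN labels should not participate in duplicate detection at all, one option is to drop them before calling duplicated(); a sketch, assuming NumPy is available:

```python
import pandas as pd
import numpy as np

idx = pd.Index([1, 2, np.nan, 1, np.nan])

# Remove NaN labels first, then check the remaining values for duplicates
clean = idx[idx.notna()]
print(clean.duplicated(keep='first'))  # only the repeated 1 is flagged
```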
4. Grouping Data by Non-Unique Indices
When working with non-unique indices, grouping data by these indices can help aggregate or summarize information.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30]}, index=['A', 'B', 'A'])
# Group by index and calculate the sum
grouped_data = data.groupby(level=0).sum()
print("Grouped Data:")
print(grouped_data)
Output
Grouped Data:
   Values
A      40
B      20
This technique is useful for resolving duplicates by aggregating data instead of simply removing them.
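Sum is not the only way to resolve duplicate labels: groupby accepts any aggregation. A sketch showing several at once:

```python
import pandas as pd

data = pd.DataFrame({'Values': [10, 20, 30]}, index=['A', 'B', 'A'])

# Summarize each index label with several aggregations at once
summary = data.groupby(level=0)['Values'].agg(['sum', 'mean', 'count'])
print(summary)
```

For label 'A' this yields sum 40, mean 20.0, and count 2.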
5. Detecting Duplicate Labels Before Operations
Before performing operations like merging or concatenation, it’s crucial to check for duplicate labels to avoid unexpected behavior.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Check if the index has duplicates
has_duplicates = data.index.has_duplicates
print("Does the index have duplicates?")
print(has_duplicates)
Output
Does the index have duplicates?
True
This helps prevent errors caused by non-unique indices during operations that require unique labels.
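One way to turn this check into a hard guarantee is pandas' verify_integrity option, which makes concat raise if the resulting index would contain duplicates; a sketch:

```python
import pandas as pd

a = pd.DataFrame({'Values': [10]}, index=['A'])
b = pd.DataFrame({'Values': [30]}, index=['A'])

# verify_integrity=True raises ValueError if the concatenated
# index would contain duplicate labels
try:
    combined = pd.concat([a, b], verify_integrity=True)
except ValueError as err:
    print("Duplicate labels detected:", err)
```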
6. Using Boolean Indexing to Extract Duplicates
You can extract rows corresponding to duplicate indices using boolean indexing.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Extract rows with duplicated indices
duplicate_rows = data[data.index.duplicated(keep='first')]
print("Rows with Duplicated Indices:")
print(duplicate_rows)
Output
Rows with Duplicated Indices:
   Values
A      30
This is useful when you need to analyze or process only the duplicate entries in your dataset.
7. Combining Index.duplicated() with Custom Logic
For advanced scenarios, you can combine Index.duplicated() with custom logic to handle duplicates differently based on specific conditions.
import pandas as pd
data = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=['A', 'B', 'A', 'C'])
# Custom logic: keep the first occurrence of 'A', but flag its later occurrences
custom_duplicates = data.index.duplicated(keep='first') & (data.index == 'A')
print("Custom Duplicate Flags:")
print(custom_duplicates)
Output
Custom Duplicate Flags:
[False False  True False]
This allows for tailored handling of duplicates based on domain-specific requirements.