SQL | Remove Duplicates without Distinct

Last Updated : 02 Dec, 2024

In SQL, removing duplicate records is a common task. The DISTINCT keyword handles it for query results, but it can lead to performance issues on large datasets, because the engine has to sort or hash the rows to compare them, which increases the processing load. DISTINCT also only filters the output of a SELECT; it does not delete duplicate rows stored in a table.

In this article, we’ll explain various alternatives to remove duplicates in SQL, including using ROW_NUMBER(), self-joins, and GROUP BY. Each method will be explained in detail with examples and outputs.

Why Remove Duplicates in SQL?

Duplicate records can lead to incorrect data analysis and reporting, and can increase storage requirements. Removing them therefore improves data integrity and makes database operations more efficient. Fortunately, there are several ways to remove duplicates in SQL without using DISTINCT.

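1. Remove Duplicates Using ROW_NUMBER()

The ROW_NUMBER() function assigns a sequential number, starting at 1, to each row within a partition of a result set. When the partition is defined on the columns that make rows duplicates of each other, every row numbered higher than 1 is a duplicate that can be removed.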

Example

Let’s assume we have a table named Employees, and we want to remove duplicate rows based on the EmployeeName, EmployeeAddress, and EmployeeSex columns.

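WITH CTE AS (
    -- Number the rows in each group of duplicates; the lowest EmployeeID gets RowNum = 1.
    SELECT EmployeeID, EmployeeName, EmployeeAddress, EmployeeSex,
           ROW_NUMBER() OVER (PARTITION BY EmployeeName, EmployeeAddress, EmployeeSex ORDER BY EmployeeID) AS RowNum
    FROM Employees
)
-- Delete every row after the first one in each group.
DELETE FROM CTE WHERE RowNum > 1;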

Explanation:

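The ROW_NUMBER() function numbers the rows within each partition of duplicate values, ordered by EmployeeID.
The CTE (Common Table Expression) exposes that numbering, and DELETE removes every row with RowNum > 1, so only the first occurrence of each duplicate remains.
Deleting through a CTE in this way is SQL Server syntax; other databases need the numbered rows to be matched back to the table by key.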
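The same idea can be tried end to end with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the sample rows are hypothetical, and because SQLite cannot delete through a CTE, the numbered rows are matched back to the table by EmployeeID instead. It also assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    "EmployeeID INTEGER PRIMARY KEY, EmployeeName TEXT, "
    "EmployeeAddress TEXT, EmployeeSex TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "Pune", "F"),
        (2, "Asha", "Pune", "F"),    # duplicate of row 1
        (3, "Ravi", "Delhi", "M"),
        (4, "Asha", "Pune", "F"),    # duplicate of row 1
        (5, "Ravi", "Mumbai", "M"),  # different address, not a duplicate
    ],
)

# Number the rows inside each duplicate group, then delete every row
# whose number is greater than 1 by looking it up through its EmployeeID.
conn.execute(
    """
    DELETE FROM Employees
    WHERE EmployeeID IN (
        SELECT EmployeeID
        FROM (
            SELECT EmployeeID,
                   ROW_NUMBER() OVER (
                       PARTITION BY EmployeeName, EmployeeAddress, EmployeeSex
                       ORDER BY EmployeeID
                   ) AS RowNum
            FROM Employees
        ) AS numbered
        WHERE RowNum > 1
    )
    """
)

remaining_ids = [
    row[0]
    for row in conn.execute("SELECT EmployeeID FROM Employees ORDER BY EmployeeID")
]
print(remaining_ids)  # [1, 3, 5]
```

Rows 2 and 4 are removed because they repeat row 1 on all three columns, while row 5 stays because its address differs.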

2. Remove Duplicates Using a Self-Join

A self-join involves joining a table to itself to identify and remove duplicates based on specific criteria. This method is ideal for comparing columns within the same table.

Example

Let’s consider the Employees table again. We’ll use a self-join to remove duplicate entries where EmployeeName and EmployeeAddress are the same.

DELETE A 
FROM Employees A
JOIN Employees B ON A.EmployeeName = B.EmployeeName 
    AND A.EmployeeAddress = B.EmployeeAddress
WHERE A.EmployeeID > B.EmployeeID;

Explanation:

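The self-join compares records within the same table (aliased as A and B).
The condition A.EmployeeID > B.EmployeeID ensures that, within each group of duplicates, every row except the one with the lowest EmployeeID is deleted.
The DELETE ... FROM ... JOIN form shown here is accepted by SQL Server and MySQL; PostgreSQL uses DELETE ... USING instead.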
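A runnable sketch of the self-join logic, assuming SQLite through Python's sqlite3 module and hypothetical sample rows. SQLite does not accept a JOIN directly in DELETE, so the self-join is used as a subquery that first previews and then selects the rows to remove.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    "EmployeeID INTEGER PRIMARY KEY, EmployeeName TEXT, EmployeeAddress TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (1, "Asha", "Pune"),
        (2, "Ravi", "Delhi"),
        (3, "Asha", "Pune"),    # duplicate of row 1
        (4, "Ravi", "Delhi"),   # duplicate of row 2
        (5, "Asha", "Nagpur"),  # same name, different address
    ],
)

# The self-join pairs each row (A) with any earlier row (B) that has the
# same name and address; every A found this way is a duplicate.
duplicate_ids_sql = """
    SELECT DISTINCT A.EmployeeID
    FROM Employees AS A
    JOIN Employees AS B
      ON A.EmployeeName = B.EmployeeName
     AND A.EmployeeAddress = B.EmployeeAddress
    WHERE A.EmployeeID > B.EmployeeID
"""

# Preview the rows that would be deleted before changing anything.
to_delete = sorted(row[0] for row in conn.execute(duplicate_ids_sql))
print(to_delete)  # [3, 4]

# Delete exactly those rows.
conn.execute(f"DELETE FROM Employees WHERE EmployeeID IN ({duplicate_ids_sql})")

kept = conn.execute(
    "SELECT EmployeeID, EmployeeName, EmployeeAddress "
    "FROM Employees ORDER BY EmployeeID"
).fetchall()
print(kept)  # [(1, 'Asha', 'Pune'), (2, 'Ravi', 'Delhi'), (5, 'Asha', 'Nagpur')]
```

Running the SELECT on its own before the DELETE is a cheap way to confirm that the join condition matches only the intended rows.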

3. Remove Duplicates Using GROUP BY

The GROUP BY clause can be used to remove duplicates from a query result by collapsing rows with identical values in the selected columns into one row per group. Combined with aggregate functions such as MIN() or MAX(), it also lets us choose which record of each group to retain (for example, the first or last entry).

Example

To remove duplicates based on FirstName, LastName, and MobileNo, we can group by these columns so that each combination appears only once in the result.

SELECT FirstName, LastName, MobileNo
FROM Customers
GROUP BY FirstName, LastName, MobileNo;

Explanation:

The GROUP BY clause groups records with the same FirstName, LastName, and MobileNo values.
This effectively removes any duplicate entries based on these columns and returns only unique combinations.
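The sketch below runs the GROUP BY query on hypothetical sample data using Python's sqlite3 module, and then adds MIN() to pick the first record of each group. The CustomerID column is an assumption introduced for illustration; the original query does not mention it.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows.
# CustomerID is an assumed key column, not part of the original example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers ("
    "CustomerID INTEGER PRIMARY KEY, FirstName TEXT, "
    "LastName TEXT, MobileNo TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?, ?)",
    [
        (1, "Meera", "Nair", "555-0101"),
        (2, "Meera", "Nair", "555-0101"),  # duplicate of row 1
        (3, "Arjun", "Shah", "555-0102"),
        (4, "Meera", "Nair", "555-0199"),  # different number, not a duplicate
    ],
)

# One row per unique (FirstName, LastName, MobileNo) combination.
unique_rows = conn.execute(
    """
    SELECT FirstName, LastName, MobileNo
    FROM Customers
    GROUP BY FirstName, LastName, MobileNo
    ORDER BY FirstName, LastName, MobileNo
    """
).fetchall()
print(unique_rows)
# [('Arjun', 'Shah', '555-0102'), ('Meera', 'Nair', '555-0101'), ('Meera', 'Nair', '555-0199')]

# MIN(CustomerID) identifies the first record in each group, which is
# useful when one specific row per group has to be retained.
first_ids = conn.execute(
    """
    SELECT MIN(CustomerID) AS FirstID, FirstName, LastName, MobileNo
    FROM Customers
    GROUP BY FirstName, LastName, MobileNo
    ORDER BY FirstID
    """
).fetchall()
print([row[0] for row in first_ids])  # [1, 3, 4]
```

Note that both queries only shape the result set; the Customers table itself still contains all four rows.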

4. Remove Duplicates Using `DISTINCT ON` (PostgreSQL)

For PostgreSQL users, the DISTINCT ON clause is a powerful way to remove duplicates based on specific columns while retaining additional data from the same rows.

Example

SELECT DISTINCT ON (EmployeeName) EmployeeName, EmployeeAddress
FROM Employees
ORDER BY EmployeeName, EmployeeID;

Explanation:

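The DISTINCT ON clause keeps the first row of each EmployeeName and discards the other rows with the same name.
The ORDER BY clause specifies which row should be retained when duplicates are found. It must begin with the DISTINCT ON expression; here the second sort key, EmployeeID, means the row with the lowest EmployeeID is kept for each name.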
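DISTINCT ON is specific to PostgreSQL, so it cannot be run in most other engines. As a rough illustration of what it returns, the sketch below produces the same result with ROW_NUMBER() in SQLite via Python's sqlite3 module. This is a substitute technique, not DISTINCT ON itself, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    "EmployeeID INTEGER PRIMARY KEY, EmployeeName TEXT, EmployeeAddress TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (1, "Asha", "Pune"),
        (2, "Ravi", "Delhi"),
        (3, "Asha", "Nagpur"),  # same name as row 1, later EmployeeID
        (4, "Ravi", "Delhi"),   # same name as row 2, later EmployeeID
    ],
)

# Equivalent of:
#   SELECT DISTINCT ON (EmployeeName) EmployeeName, EmployeeAddress
#   FROM Employees ORDER BY EmployeeName, EmployeeID;
# Rank the rows of each name by EmployeeID and keep only the first one.
first_rows = conn.execute(
    """
    SELECT EmployeeName, EmployeeAddress
    FROM (
        SELECT EmployeeName, EmployeeAddress,
               ROW_NUMBER() OVER (
                   PARTITION BY EmployeeName
                   ORDER BY EmployeeID
               ) AS RowNum
        FROM Employees
    ) AS ranked
    WHERE RowNum = 1
    ORDER BY EmployeeName
    """
).fetchall()
print(first_rows)  # [('Asha', 'Pune'), ('Ravi', 'Delhi')]
```

Asha's address comes from row 1 rather than row 3 because the ordering on EmployeeID decides which row counts as the first.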

5. Use of `EXCEPT` to Remove Duplicates

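The EXCEPT operator returns the records from the first query that are not present in the second query. Like the other set operators, it also removes duplicate rows from its result (unless EXCEPT ALL is used, where supported), so subtracting an empty result set leaves the unique rows of the first query.

Example

SELECT EmployeeName, EmployeeAddress FROM Employees
EXCEPT
SELECT EmployeeName, EmployeeAddress FROM Employees WHERE 1 = 0;

Explanation:

The first query returns every EmployeeName and EmployeeAddress pair, including duplicates, and the second query returns no rows because its WHERE condition is always false.
Nothing is subtracted, and because EXCEPT removes duplicates from its output, each unique pair is returned once.
Subtracting SELECT DISTINCT * from SELECT * on the same table does not work: every row of the first query also appears in the second, so the result is empty.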
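How EXCEPT treats duplicates is easy to check on a small table. The sketch below assumes SQLite through Python's sqlite3 module and hypothetical sample rows. It shows that EXCEPT returns a duplicate-free result when an empty set is subtracted, and that the set difference between a table and its own distinct rows is empty.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    "EmployeeID INTEGER PRIMARY KEY, EmployeeName TEXT, EmployeeAddress TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (1, "Asha", "Pune"),
        (2, "Asha", "Pune"),  # same name and address as row 1
        (3, "Ravi", "Delhi"),
    ],
)

# Subtracting an empty set: EXCEPT still removes duplicates from its
# output, so each (name, address) pair appears once.
unique_pairs = sorted(
    conn.execute(
        """
        SELECT EmployeeName, EmployeeAddress FROM Employees
        EXCEPT
        SELECT EmployeeName, EmployeeAddress FROM Employees WHERE 1 = 0
        """
    ).fetchall()
)
print(unique_pairs)  # [('Asha', 'Pune'), ('Ravi', 'Delhi')]

# Subtracting a table's distinct rows from the table itself: every row
# is present on both sides, so nothing is left.
self_difference = conn.execute(
    """
    SELECT * FROM Employees
    EXCEPT
    SELECT DISTINCT * FROM Employees
    """
).fetchall()
print(self_difference)  # []
```

The explicit column list matters as well: with SELECT *, the unique EmployeeID makes every row different, so no duplicates would be detected.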

Conclusion

There are several ways to remove duplicates in SQL without using the DISTINCT keyword. ROW_NUMBER() and self-joins delete duplicate rows from a table, while GROUP BY, DISTINCT ON, and EXCEPT return duplicate-free query results, helping us maintain data integrity in our databases. These techniques can improve query performance on large datasets where DISTINCT slows down execution, although the gain depends on the database engine, the indexes, and the data, so it is worth comparing execution plans.

FAQs

How to remove duplicates in SQL other than DISTINCT?

We can remove duplicates in SQL by using the DELETE statement with a self-join, or by using ROW_NUMBER() with a CTE (Common Table Expression) to identify and delete duplicate rows based on certain conditions.

How to avoid DISTINCT in SQL?

To avoid using DISTINCT in SQL, you can use GROUP BY on the columns that must be unique, optionally with aggregate functions like COUNT(), SUM(), or MAX() to summarise each group, or use a set operator such as EXCEPT, which removes duplicates from its result.

How to remove the duplicate records in SQL?

Duplicate records can be removed using DELETE combined with a self-join, or by using ROW_NUMBER() to assign a sequential number to each row and deleting the rows with a row number greater than 1 within the same group.