PostgreSQL – Deleting Duplicate Rows using Subquery
In PostgreSQL, handling duplicate rows is a common task, especially when working with large datasets. Fortunately, PostgreSQL provides several techniques to efficiently delete duplicate rows, and one of the most effective approaches is using subqueries.
In this article, we will demonstrate how to identify and remove duplicate rows while keeping the row with either the lowest or highest ID, depending on your requirements.
Setting Up a Sample Table
For demonstration, let's set up a sample table named basket that stores fruits:
CREATE TABLE basket(
    id SERIAL PRIMARY KEY,
    fruit VARCHAR(50) NOT NULL
);
INSERT INTO basket(fruit) VALUES('apple');
INSERT INTO basket(fruit) VALUES('apple');
INSERT INTO basket(fruit) VALUES('orange');
INSERT INTO basket(fruit) VALUES('orange');
INSERT INTO basket(fruit) VALUES('orange');
INSERT INTO basket(fruit) VALUES('banana');
SELECT * FROM basket;
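On a freshly created table, this should return the six inserted rows:
 id | fruit
----+--------
  1 | apple
  2 | apple
  3 | orange
  4 | orange
  5 | orange
  6 | banana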
Now that we have set up the sample table, we will query for the duplicates using the following.
Query:
SELECT fruit, COUNT( fruit ) FROM basket GROUP BY fruit HAVING COUNT( fruit )> 1 ORDER BY fruit;
For the sample data, this should return the two fruits that appear more than once:
 fruit  | count
--------+-------
 apple  |     2
 orange |     3
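The count alone does not show which rows are involved. As a minimal sketch, the query below extends the duplicate check with PostgreSQL's ARRAY_AGG aggregate to list the ids in each duplicate group; the column aliases occurrences and ids are illustrative names, not part of the original query.

```sql
-- List each duplicated fruit together with the ids that carry it.
SELECT fruit,
       COUNT(*)                  AS occurrences,
       ARRAY_AGG(id ORDER BY id) AS ids
FROM basket
GROUP BY fruit
HAVING COUNT(*) > 1
ORDER BY fruit;
```

With the sample data, and assuming the ids were assigned 1 through 6 in insertion order, this lists ids 1 and 2 for apple and ids 3, 4, and 5 for orange.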
Deleting Duplicate Rows with a Subquery
To delete the duplicate rows while keeping the row with the lowest ID, you can use a subquery with the ROW_NUMBER() window function. This method ensures that only one row per fruit is retained and all other duplicates are removed.
Query:
DELETE FROM basket
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER( PARTITION BY fruit ORDER BY id ) AS row_num
        FROM basket
    ) t
    WHERE t.row_num > 1
);
Explanation:
- The inner subquery assigns a row number to each row within each partition (grouped by ‘fruit’), ordered by ‘id’.
- The ROW_NUMBER() function starts counting from 1 for each group, so the first row in each group is retained, and the rest are marked for deletion.
- The outer DELETE statement removes the rows identified by the subquery.
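Before running the DELETE, it can help to preview what the subquery marks for removal. The sketch below runs the same window function as a plain SELECT against the sample basket table; the outer ORDER BY is an addition for readable output only.

```sql
-- Preview the row number assigned within each fruit group.
-- Rows with row_num > 1 are the ones the DELETE would remove.
SELECT id,
       fruit,
       ROW_NUMBER() OVER (PARTITION BY fruit ORDER BY id) AS row_num
FROM basket
ORDER BY fruit, id;
```

With the six sample rows, and assuming ids 1 through 6 were assigned in insertion order, the rows with ids 2, 4, and 5 receive a row_num greater than 1.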
Keeping the Row with the Highest ID
To keep the row with the highest ID instead, sort each partition in descending order by changing ORDER BY id to ORDER BY id DESC in the subquery:
DELETE FROM basket
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER( PARTITION BY fruit ORDER BY id DESC ) AS row_num
        FROM basket
    ) t
    WHERE t.row_num > 1
);
This query will retain the row with the highest ID for each duplicate group and delete all other duplicates.
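Because a DELETE is destructive, one cautious way to try it is inside a transaction. The sketch below is a suggestion rather than part of the original method: it runs the keep-highest-ID variant (ORDER BY id DESC) with a RETURNING clause so PostgreSQL reports each deleted row, then rolls the change back.

```sql
BEGIN;

DELETE FROM basket
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY fruit ORDER BY id DESC) AS row_num
        FROM basket
    ) t
    WHERE t.row_num > 1
)
RETURNING id, fruit;  -- lists every row that was deleted

-- Inspect the returned rows, then choose one of the following:
ROLLBACK;   -- undo the delete
-- COMMIT;  -- or make it permanent
```

Once the returned rows look right, rerun the block with COMMIT in place of ROLLBACK.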
Deleting Duplicates Based on Multiple Columns
To delete duplicates based on the values of multiple columns, use the following query template.
Query:
DELETE FROM table_name
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER( PARTITION BY column_1, column_2 ORDER BY id ) AS row_num
        FROM table_name
    ) t
    WHERE t.row_num > 1
);
Explanation:
- The PARTITION BY clause includes multiple columns (‘column_1’, ‘column_2’), ensuring duplicates are identified based on the combination of those columns.
- The rest of the logic remains the same.
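To make the template concrete, here is a small sketch using a hypothetical table, basket_items, with a hypothetical color column. Neither name appears in the original example; both are introduced only for illustration.

```sql
-- Hypothetical table, used only to illustrate the multi-column template.
CREATE TABLE basket_items (
    id    SERIAL PRIMARY KEY,
    fruit VARCHAR(50) NOT NULL,
    color VARCHAR(50) NOT NULL
);

INSERT INTO basket_items (fruit, color) VALUES
    ('apple',  'red'),
    ('apple',  'red'),     -- duplicate of the first row
    ('apple',  'green'),   -- same fruit, different color: not a duplicate
    ('orange', 'orange');

-- Keep the lowest id for each (fruit, color) combination.
DELETE FROM basket_items
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY fruit, color ORDER BY id) AS row_num
        FROM basket_items
    ) t
    WHERE t.row_num > 1
);
```

Only the second ('apple', 'red') row is removed here, because the green apple differs in one of the partitioning columns.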
Verifying the Result
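The multi-column template removes every row whose combination of ‘column_1’ and ‘column_2’ values already appears in an earlier row, leaving one row per combination. To verify that no duplicates remain in the basket table, run the duplicate check again.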
Query:
SELECT fruit, COUNT( fruit ) FROM basket GROUP BY fruit HAVING COUNT( fruit )> 1 ORDER BY fruit;
Output:
If the deletion was successful, this query should return an empty result set, indicating no duplicates remain.
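As an additional sanity check, you could compare the total row count with the number of distinct fruits. This is a sketch, not part of the original steps, and the aliases total_rows and distinct_fruits are illustrative.

```sql
-- The two counts match when every fruit appears exactly once.
SELECT COUNT(*)              AS total_rows,
       COUNT(DISTINCT fruit) AS distinct_fruits
FROM basket;
```

Because fruit is declared NOT NULL, the two numbers are equal exactly when no duplicates remain. For the sample data, both should be 3 after the deletion.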