SQL: Row Numbers and Duplicates
Author: jason
Date: 2014-01-23 08:50:32
Category: Musing

The ability to find duplicates in a table can be very handy. The core of finding duplicates, with the below script, is the ability to assign a Row Number. A Row Number will allow us to see if the item is in the Table with more than 1 row.

This query creates a temp table and puts the name Apple into it twice, along with Peach and Pear once. Then using the OVER / PARTITION BY statement we can assign a Row Number to each row, using the Name.
Code

DECLARE @Temp AS TABLE ([id] BIGINT, [name] VARCHAR(10))
INSERT INTO @Temp SELECT 1, 'Apple'
INSERT INTO @Temp SELECT 2, 'Apple'
INSERT INTO @Temp SELECT 3, 'Peach'
INSERT INTO @Temp SELECT 4, 'Pear'

SELECT [t].[id], [t].[name], ROW_NUMBER() OVER
(PARTITION BY [t].[name] ORDER BY [t].[name] DESC) AS [Row_Number]
FROM @Temp t


As you can see Apple has 2 rows while Peach and Pear only have 1. The Row Number restarts its numbering based on the Name.





Taking this a step further we can Update or Delete the row we determine to be the duplicate. We can use the ORDER BY clause to flip the order of the rows around, if we have more to sort by.
Code

DECLARE @Temp AS TABLE ([id] BIGINT, [name] VARCHAR(10), [count] INT, [date] DATE, [is_active] BIT)
INSERT INTO @Temp SELECT 10, 'Apple', 1, '1/20/2014', 1
INSERT INTO @Temp SELECT 12, 'Apple', 1, '1/20/2014', 1
INSERT INTO @Temp SELECT 16, 'Peach', 3, '1/21/2014', 1
INSERT INTO @Temp SELECT 19, 'Apple', 2, '1/22/2014', 1
INSERT INTO @Temp SELECT 25, 'Peach', 3, NULL, 1

SELECT * FROM @Temp

DECLARE @RowNumberTable AS TABLE ([temp_id] BIGINT, [temp_name] VARCHAR(10), [row_number] INT)
INSERT INTO @RowNumberTable
SELECT [t].[id] AS [temp_id], [t].[name] AS [temp_name], ROW_NUMBER() OVER
(PARTITION BY [t].[name], [t].[count], [t].[is_active] ORDER BY [t].[date] DESC)
FROM @Temp t

UPDATE [t] SET [t].[is_active] = 0
FROM @Temp t WHERE [t].[id] IN (SELECT [r].[temp_id]
FROM @RowNumberTable r
WHERE [r].[row_number] > 1)

SELECT [r].*, [t].*
FROM @RowNumberTable r
INNER JOIN @Temp t ON [t].[id] = [r].[temp_id]
ORDER BY [r].[temp_name], [r].[row_number]


On Line 16, in the below image, is where the UPDATE statement decides which row to set Is_Active = Zero. I am using the ID of the Temp table to determine the correct row to update. Where that ID has a Row Number greater than One, set it to Zero.




The first table, in the below image, is before we make any update. The second table is after the update. We've set the Is_Active column to Zero where we've determined that row to be the Duplicate row.







jason @ jasonthomasfrance.com - www.masterstationlog.com - copyright 2009