I was recently working on a data cleanup problem where I had to do lots of comparisons of one row to the next row and I was trying to do my best to avoid using cursor for this. I was using the old trick of having an IDENTITY() field and doing a self-join where the identity field in the main table = the identity field in the “comparison” table –1. This was working great until I had to do a second set of deletes from my temp table and realized that some of my rows were now missing and therefore my IDENTITY() numbers weren’t always sequential anymore.
I found a great trick from MSDN (here) on how to get around this. Below I’m providing some sample code to demonstrate this trick. I believe this should work on SQL Server 2005 or higher. The trick is by the comment called “Renumber the RowIDs”. It involves using a CTE and the ROW_NUMBER() window function as part of an UPDATE statement.
Be sure to use semicolons after your statements when you start using CTEs or MERGE statements as SQL Server can start getting confused if the code is ambiguous.
--Setup a table variable with some data that has some definite duplicates and some
--“not so sure” duplicates – this happens to be someone’s job history.
DECLARE @SomeData TABLE
(RowID INT UNIQUE,
EmpID VARCHAR(8),
PositionTitle VARCHAR(40),
PositionEffDate DATETIME,
PositionID VARCHAR(8));
INSERT INTO @SomeData VALUES
(1,'00000012','Manager','1/1/2000','ABC123'),
(2,'00000012','Manager','1/1/2000','ABC123ZZ'),
(3,'00000012','Sr. Manager','1/1/2002','ABC125'),
(4,'00000012','Sr. Manager','1/1/2002','ABC125'),
(5,'00000012','Sr. Manager','1/1/2002','ABC125ZZ'),
(6,'00000012','Director','1/1/2006','ABC126'),
(7,'00000012','Director','1/1/2006','ABC126'),
(8,'00000012','Sr. Director','1/1/2009','ABC129ZZ');
SELECT 'Has Dups and possibly bad rows' as [Description],
* FROM @SomeData;
--this gets rid of rows that are certainly dups - same position title,
-- effective date, and position ID.
DELETE FROM @SomeData WHERE RowID IN (
SELECT compare.RowID
FROM @SomeData main
LEFT OUTER JOIN @SomeData compare on main.EmpID = compare.EmpID
AND main.RowID = compare.RowID -1
where main.PositionTitle = compare.PositionTitle
AND main.PositionID = compare.PositionID
AND main.PositionEffDate = compare.PositionEffDate);
SELECT 'Pure dups removed, some possibly bad rows, split row ids' as [Description],
* FROM @SomeData;
--Renumber the RowIDs
WITH newPH AS(
SELECT RowID, ROW_NUMBER() OVER(Order By RowID ASC) as [newRowID] FROM @SomeData)
UPDATE newPH
SET RowID = newRowID;
SELECT 'Pure dups removed, some possibly bad rows, fixed row ids' as [Description],
* FROM @SomeData;
--this gets rid of rows that are still dups - same position title,
-- effective date, keeping the first row that matches
DELETE FROM @SomeData WHERE RowID IN (
SELECT compare.RowID
FROM @SomeData main
LEFT OUTER JOIN @SomeData compare on main.EmpID = compare.EmpID
AND main.RowID = compare.RowID -1
where main.PositionTitle = compare.PositionTitle
AND main.PositionEffDate = compare.PositionEffDate);
--See that the "Sr. Director" row is kept since that has a different title than the
-- prior row. The second Sr. Manager row (with the ZZ code in position ID) is removed.
SELECT 'Fixed up' as [Description],
* FROM @SomeData;
No comments:
Post a Comment