Frequently when importing / scrubbing data within CRM, identifying duplicates to remove or ‘flag’ may be needed. It’s helpful to have a query that will return all the duplicates in the database – but exclude a single ‘surviving’ record. (i.e. Return all-but-one of the duplicated records.)
This query is one of my go-to query patterns – I rarely, if ever, use it as-is, but will modify it to fit the particular entity / scenario I need to investigate.
The returned record set might used as the source for a Scribe Insight job to either ‘flag’ or inactivate or delete the duplicates depending on the situation. – Of course you’ll need to take into account the record’s relationship to other entities etc.
This example query assumes the duplicated contacts were entered more than 3 milliseconds apart in SQL. If your data was migrated from another system or entered as a result of a large batch and you need finer control, consider using the 'versionnumber' field as a unique timestamp field. (The VERSIONNUMBER field is a unique value that gets incremented as records are updated – it can be very useful.)
Another variations might be that you want to change MAX to MIN – and adjust the JOIN in the query to return all but the most recently created or updated record.
The key is that this is a pattern you can build on and adapt to your particular situation.
( Also, by blogging it here, I no longer have to search my laptop for copies of it when I need a example for someone. )
/*
Title: Find Duplicate Contact to Remove (Example)
Author: Scott Sewell, Customer Effective
Description: Find the most recently created contacts that duplicates an existing contact based on fullname and emailaddress1. The Subquery finds the Name/Email and CreatedOn date of the most recently created duplicates. e.g. if there are multiple contact records with the same name and emailaddress1, return the date of all but the oldest one.
*/
SELECT
FC.contactid
,FC.fullname
,FC.emailaddress1
FROM
dbo.FilteredContact AS FC
JOIN
( SELECT
FC2.fullname
,FC2.emailaddress1
,MIN(FC2.createdon) MIN_CREATED_ON
FROM
dbo.FilteredContact AS FC2
WHERE
FC2.emailaddress1 IS NOT NULL
AND FC2.statecode = 0
GROUP BY
FC2.fullname
,FC2.emailaddress1
HAVING
COUNT(FC2.contactid) >= 2
) AS FC_DUPE
ON FC_DUPE.MIN_CREATED_ON < FC.createdon
AND FC.statecode = 0
AND FC.fullname = FC_DUPE.fullname
AND FC.emailaddress1 = FC_DUPE.emailaddress1
-- @ScottSewell
Hello,
I've been reading the blog for sometime and thought I'd provide some feedback. Although your query works, it unfortunately doesn't work for all scenarios. In our database we have some contacts with exactly the same createdon date (we have a few records like that!). Therefore they are excluded. To fix this, your query could be reworked to take advantage of the newer SQL feature of OVER clause. The same query with OVER clause that works with the same createdon date would be (which i think is clearer to read as well) :-
SELECT contactid ,
fullname ,
emailaddress1
FROM ( SELECT contactid ,
fullname ,
emailaddress1 ,
ROW_NUMBER() OVER ( PARTITION BY fullname, emailaddress1 ORDER BY modifiedon DESC ) AS mostrecentrownumber ,
COUNT(*) OVER ( PARTITION BY fullname, emailaddress1 ) AS duplicatecount
FROM dbo.FilteredContact
WHERE statecode = 0
AND emailaddress1 IS NOT NULL
) AS dt
WHERE duplicatecount > 1
AND mostrecentrownumber = 1
ORDER BY dt.fullname
Thanks Glenn
Posted by: Glenn | May 25, 2012 at 07:12 AM
Thanks Glenn
I like it and will definitely try it out - (That's part of the fun of this work, there's always a newer/better way to solve the puzzles. - Thanks for the tip!)
Scott -
Posted by: Scott Sewell | May 25, 2012 at 08:14 AM