Re: VERY chalanging question - Explanation



groupy (liav.ezer@xxxxxxxxx) writes:
A | B | C | D
--------------------
a1 b1 c1 d1
a1 b2 c2 d2
a1 b1 c3 d3
a4 b4 c4 d3
a5 b5 c5 d5
a6 b6 c6 d3

The duplications are:
rows 1+2+3 on A
row 1+3 on B
rows 3+4+6 on D
the only unique (in all params) row is 5
note: finding first that row 1 similar to 2 on A & deleting it will
loose information because we WON'T know if row 1 similar to row 3 on B.
The same goes for the deletion of row 3 : it will cause lose of data
regarding it's similarity to row 4 on D


The Simple query for retriving all duplicated rows which consumes most
time is:
SELECT COUNT(*),A,B,C,D
FROM tbl
GROUP BY A,B,C,D
HAVING count(*)>1
It takes about 2 weaks on a 1.5 million rows, while all fields are
nvchars & the DB is in SQL-Server

I sincerely doubt that this statement takes two weeks to run for 1.5
million rows. Had you said 1.5 milliard rows, I could maybe have
believed it.

Anyway, first index each column individually. Then try:

DELETE tbl
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.B > b.B OR
a.C > b.C OR
a.D > b.D

DELETE tbl
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.C > b.C OR
a.D > b.D

DELETE tbl
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.D > b.C

After this operation, you still have the rows that have the same values
in four columns. But it is not clear from your description whether you
have such duplicates. If you have this maybe the best:

ATLER TABLE tbl ADD ident int IDENTITY

DELETE tbl
FROM tbl a
JOIN tbl b ON a.A = b.A
WHERE a.ident > b.ident

DELETE tbl
FROM tbl a
JOIN tbl b ON a.B = b.B
WHERE a.ident > b.ident

DELETE tbl
FROM tbl a
JOIN tbl b ON a.C = b.C
WHERE a.ident > b.ident

ALTER TABLE tbl DROP COLUMN ident

Note: all the above is untested. For tested solutions (at least with
regards to correctness), please post:

o CREATE TABLE statement for the table.
o INSERT statements with sample data.
o The desired result given the sample.

--
Erland Sommarskog, SQL Server MVP, esquel@xxxxxxxxxxxxx

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
.



Relevant Pages

  • Re: Help with tricky T-SQL
    ... JOIN tbl b ON a.VId = b.VId ... Erland Sommarskog, SQL Server MVP, esquel@xxxxxxxxxxxxx ... JOIN prevID ...
    (comp.databases.ms-sqlserver)
  • Re: Update Identity Column
    ... Actually, I implemented the trigger, and it seems to be working fine. ... Sylvain Lafontaine, ing. ... SQL Server will almost invariably decide that that's the best order to ... INSERT INTO tbl VALUES; ...
    (microsoft.public.access.adp.sqlserver)
  • Re: Update Identity Column
    ... is a real problem. ... Sylvain Lafontaine, ing. ... SQL Server will almost invariably decide that that's the best order to ... INSERT INTO tbl VALUES; ...
    (microsoft.public.access.adp.sqlserver)
  • Re: SQL Express - Identity specification property - how to change
    ... UPDATE tbl ... You can use ALTER TABLE ALTER COLUMN to add an IDENTITY column, ... Erland Sommarskog, SQL Server MVP, esquel@xxxxxxxxxxxxx ... Books Online for SQL Server 2005 at ...
    (comp.databases.ms-sqlserver)
  • Re: SQL question
    ... JOIN tbl b ON a.UserID = b.UserID ... INSERT statements with sample data. ... Erland Sommarskog, SQL Server MVP, esquel@xxxxxxxxxxxxx ... Books Online for SQL Server 2005 at ...
    (comp.databases.ms-sqlserver)