Page 1 of 1

SELECT distinct v/s Remove Duplicate Stage

Posted: Thu Oct 11, 2012 12:22 am
by jerome_rajan
Hi,

We have a table that currently has about 60 million records and will keep growing. This table contains a lot of duplicates. To extract only the unique records from a table of such a large volume, will it be efficient to use the remove duplicates stage or should I fire a DISTINCT in the SQL query of the DB2 connector?

I am currently testing it and will post the result when the job finishes but it'll be helpful if anyone can share a prior experience.
Thanks

Posted: Thu Oct 11, 2012 1:30 am
by ssreeni3
Hi jerome_rajan,

My opinion is:

1)Oracle DISTINCT keyword treats first record in the group as a unique record and the remaing records in that goup are duplicate records of the first record based on the value/values of key/keys used in the DISTINCT.

2)RemoveDuplicate Stage has an options Retain First,Retain Last Options.

3)RemoveDuplicate Stage Retain First Option is equivalent to Oracle Distinct.

Experts,correct me if i am wrong.

Thanks,
Srini

Posted: Thu Oct 11, 2012 1:35 am
by jerome_rajan
Ssreeni,

I know what the SQL distinct does and how the remove duplicate stage functions. My question is which method would be me more efficient for a data volume of 60 million + records.
I am not bothered about which record to retain since the records are pure duplicates i.e. every field in the duplicate record has the exact same value as the original record.

Posted: Thu Oct 11, 2012 5:09 am
by jerome_rajan
Oops! Just realized I posted this in the wrong forum :( . Can the admin please move it to the right forum?

Posted: Thu Oct 11, 2012 6:04 am
by ArndW
I try to remove excess data as early as possible in a job. In this case the distinct at the database level will save overhead, particularly if the DB is remote.

Posted: Thu Oct 11, 2012 8:00 am
by chulett
jerome_rajan wrote:Oops! Just realized I posted this in the wrong forum :( . Can the admin please move it to the right forum?
You betcha. :wink:

Posted: Thu Oct 11, 2012 3:42 pm
by ray.wurlod
If the basis for "distinct" is indexed in the database it will make a huge amount of sense to include DISTINCT in the query, mainly for the reason Arnd articulated, but also because the query can be resolved entirely in the index, a huge performance gain.

Relative cost (volume of distinct rows / volume of rows) should be your guideline otherwise about where to apply this constraint. Available spare capacity on the database server versus the DataStage server will also be a consideration.