Hi,
We have a table that currently holds about 60 million records and will keep growing. It contains a lot of duplicates. To extract only the unique records from a table of this volume, would it be more efficient to use the Remove Duplicates stage, or should I add a DISTINCT to the SQL query in the DB2 connector?
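For reference, a minimal sketch of what the DISTINCT approach would look like in the DB2 connector's user-defined SQL (the schema, table, and column names here are placeholders, not from the original post):

```sql
-- Push the de-duplication down to DB2 so that only unique rows
-- travel from the database to the DataStage job.
SELECT DISTINCT col1, col2, col3
FROM   my_schema.my_table;
```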
I am currently testing this and will post the result when the job finishes, but it would be helpful if anyone can share prior experience.
Thanks
SELECT distinct v/s Remove Duplicate Stage
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
Hi jerome_rajan,
My opinion is:
1) The Oracle DISTINCT keyword treats the first record in a group as the unique record; the remaining records in that group are duplicates of it, based on the value(s) of the key(s) used in the DISTINCT.
2) The Remove Duplicates stage has Retain First and Retain Last options.
3) The Remove Duplicates stage with the Retain First option is equivalent to Oracle DISTINCT.
Experts, correct me if I am wrong.
Thanks,
Srini
Ssreeni,
I know what SQL DISTINCT does and how the Remove Duplicates stage functions. My question is which method would be more efficient for a data volume of 60 million plus records.
I am not bothered about which record to retain since the records are pure duplicates, i.e. every field in the duplicate record has exactly the same value as in the original record.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
Oops! Just realized I posted this in the wrong forum. Can the admin please move it to the right forum?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
I try to remove excess data as early as possible in a job. In this case the DISTINCT at the database level will save overhead, particularly if the database is remote.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
If the basis for the "distinct" is indexed in the database, it makes a huge amount of sense to include DISTINCT in the query, mainly for the reason Arnd articulated, but also because the query can then be resolved entirely in the index, a huge performance gain.
Otherwise, relative cost (volume of distinct rows / total volume of rows) should be your guideline about where to apply this constraint. Available spare capacity on the database server versus the DataStage server will also be a consideration.
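As a sketch of the index-only-access point above (the index, schema, and column names are illustrative, not from the original post): if an index covers every column in the SELECT list, DB2 can answer the DISTINCT from the index alone, without touching the table data.

```sql
-- A covering index over the columns being de-duplicated lets DB2
-- resolve the DISTINCT entirely within the index.
CREATE INDEX my_schema.ix_dedup
    ON my_schema.my_table (col1, col2, col3);

SELECT DISTINCT col1, col2, col3
FROM   my_schema.my_table;
```

Whether the optimizer actually chooses index-only access depends on the plan; checking the access plan (e.g. with DB2's explain facility) is the way to confirm.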
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.