Remove Duplicates...

mallireddibalaji · Post by **mallireddibalaji** » Fri Apr 18, 2008 12:38 am

Hi gurus,

I would like a small clarification.
I want to remove duplicates.
Which gives better performance: a DISTINCT clause in the source query, or a Remove Duplicates stage in the job?

Thanks in advance.....

GOPIK · Post by **GOPIK** » Fri Apr 18, 2008 12:56 am

Is your source system a database or flat files?
If it is a database, the query will be faster in my experience.

mallireddibalaji · Post by **mallireddibalaji** » Fri Apr 18, 2008 1:22 am

GOPIK wrote: Is your source system a database or flat files?
If it is a database, the query will be faster in my experience.

Thanks GOPIK,

My source system is a database.
We are doing performance tuning of the jobs. However, about six months ago somebody else tuned these jobs, and they replaced the DISTINCT clause in the query with a Remove Duplicates stage.

ray.wurlod · Post by **ray.wurlod** » Fri Apr 18, 2008 4:00 am

Is the key (set of one or more columns) that identifies duplicates supported by an index in the database? If so, DISTINCT may be quicker.

On the other hand, partitioned data may mean that finish time is faster because each node is only looking after 1/N of the rows (on average).
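
To illustrate the 1/N point, here is a minimal Python sketch (not DataStage code) of de-duplicating partitioned data. It assumes the rows are hash-partitioned on the duplicate key, so that all rows sharing a key land in the same partition; without that, duplicates could survive in different partitions. The function names are hypothetical, and the partitions are processed in a simple loop here, whereas a parallel engine would handle them concurrently.

```python
def hash_partition(rows, key, n_partitions):
    """Split rows into n_partitions lists, hashing on the duplicate key."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(key(row)) % n_partitions].append(row)
    return partitions


def remove_duplicates(partition, key):
    """Keep the first row seen for each key within one partition."""
    seen = set()
    kept = []
    for row in partition:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


def dedup_partitioned(rows, key, n_partitions=4):
    """De-duplicate each partition independently and collect the results."""
    result = []
    for partition in hash_partition(rows, key, n_partitions):
        # Each partition holds roughly 1/N of the rows (on average).
        result.extend(remove_duplicates(partition, key))
    return result
```

Each call to `remove_duplicates` only ever looks at its own partition, which is why the work per node shrinks as the number of partitions grows.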

jhmckeever · Post by **jhmckeever** » Sun Apr 20, 2008 7:48 pm

A couple of other factors which *might* influence your decision are:

1. Is the source table partitioned on the relevant keys?
2. What percentage of rows are duplicates? (or are anticipated to be duplicates in the production data)

As ever, the 'true' answer will come from trying both with realistic data and comparing results!
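
A rough sketch of what "trying both" could look like, using Python's built-in `sqlite3` module as a stand-in for the real source database. The table `src`, its columns, and the function name are made up for illustration, and the timings from an in-memory SQLite database will not reflect a real database plus DataStage job; only the shape of the comparison carries over.

```python
import sqlite3
import time


def compare_dedup(rows):
    """Time DISTINCT in the query against de-duplicating after the read."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (k INTEGER, v TEXT)")  # hypothetical table
    conn.executemany("INSERT INTO src VALUES (?, ?)", rows)

    # Option 1: let the database remove duplicates.
    start = time.perf_counter()
    in_db = conn.execute("SELECT DISTINCT k, v FROM src").fetchall()
    t_db = time.perf_counter() - start

    # Option 2: read every row, then remove duplicates on the client side.
    start = time.perf_counter()
    all_rows = conn.execute("SELECT k, v FROM src").fetchall()
    in_client = list(dict.fromkeys(all_rows))  # keeps first occurrence
    t_client = time.perf_counter() - start

    conn.close()
    return in_db, in_client, t_db, t_client
```

Run it against data with a realistic duplicate ratio, check that both result sets match, and only then compare the two timings.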

J

Minhajuddin · Post by **Minhajuddin** » Mon Apr 21, 2008 12:03 pm

A DISTINCT would be a better option if the speed at which data gets read from your DB is slow.

If the speed of data transfer from your DB to your DS Server is fast enough, then a Remove Duplicates would be a better choice.

kumar_s · Post by **kumar_s** » Mon Apr 21, 2008 1:06 pm

Minhajuddin wrote:A DISTINCT would be a better option if the speed at which data gets read from your DB is slow.

If the speed of data transfer from your DB to your DS Server is fast enough, then a Remove Duplicates would be a better choice.

I believe he is making the point that if you include the DISTINCT and it eliminates a lot of records, less data is transferred between the DB and DS. In other words, this factor is about data movement over the network.
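
As a back-of-the-envelope sketch of that data movement argument in Python: the function name and the sample figures in the usage note are hypothetical, and `duplicate_fraction` is assumed to mean the share of source rows that de-duplication would drop.

```python
def rows_transferred(total_rows, duplicate_fraction, distinct_in_query):
    """Estimate how many rows cross the network from the DB to DS."""
    if not 0.0 <= duplicate_fraction <= 1.0:
        raise ValueError("duplicate_fraction must be between 0 and 1")
    if distinct_in_query:
        # The database drops duplicates before anything is sent.
        return round(total_rows * (1 - duplicate_fraction))
    # A Remove Duplicates stage sees every row, so all of them are sent.
    return total_rows
```

For example, calling `rows_transferred(1_000_000, 0.4, True)` and `rows_transferred(1_000_000, 0.4, False)` shows how much less would be sent with DISTINCT when 40% of the rows are duplicates. With a low duplicate fraction the two numbers are nearly the same, and the network factor matters much less.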