Remove Duplicates...

Posted: Fri Apr 18, 2008 12:38 am
by mallireddibalaji
Hi gurus,

I need a small clarification.
I want to remove duplicates.
Which gives better performance: a DISTINCT clause in the source query, or a Remove Duplicates stage in the job?


Thanks in advance.

Posted: Fri Apr 18, 2008 12:56 am
by GOPIK
Is your source system a database or flat files?
If it is a database, the query will be faster in my experience.

Posted: Fri Apr 18, 2008 1:22 am
by mallireddibalaji
GOPIK wrote: Is your source system a database or flat files?
If it is a database, the query will be faster in my experience.

Thanks GOPIK,

My source system is a database.
We are doing performance tuning of the jobs. About six months ago somebody else tuned them and replaced the DISTINCT clause in the query with a Remove Duplicates stage.

Posted: Fri Apr 18, 2008 4:00 am
by ray.wurlod
Is the key (set of one or more columns) that identifies duplicates supported by an index in the database? If so, DISTINCT may be quicker.

On the other hand, with partitioned data the Remove Duplicates stage may finish sooner, because each node only has to look after 1/N of the rows (on average).
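
To illustrate the 1/N point, here is a minimal Python sketch of the idea, not DataStage code. It assumes rows are hash-partitioned on the duplicate key so that all copies of a key land on the same node; the function names and the dict-based row layout are made up for the illustration, and the partitions are processed one after another here, whereas a parallel job would run them concurrently.

```python
from collections import defaultdict


def hash_partition(rows, key, num_nodes):
    """Assign each row to a partition by hashing its key columns.

    Rows with the same key always land in the same partition.
    """
    partitions = defaultdict(list)
    for row in rows:
        k = tuple(row[col] for col in key)
        partitions[hash(k) % num_nodes].append(row)
    return partitions


def remove_duplicates(rows, key):
    """Keep the first row seen for each key value."""
    seen = set()
    kept = []
    for row in rows:
        k = tuple(row[col] for col in key)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


def partitioned_dedup(rows, key, num_nodes):
    """De-duplicate each partition independently, then collect the results."""
    partitions = hash_partition(rows, key, num_nodes)
    result = []
    for node in range(num_nodes):
        # Each node only sees its own share of the rows.
        result.extend(remove_duplicates(partitions.get(node, []), key))
    return result
```

Because the partitioning is on the same key used to detect duplicates, no duplicate can survive by hiding on a different node. If the data were partitioned on some other column, the per-node results would not be globally unique.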

Posted: Sun Apr 20, 2008 7:48 pm
by jhmckeever
A couple of other factors which *might* influence your decision are:

1. Is the source table partitioned on the relevant keys?
2. What percentage of rows are duplicates? (or are anticipated to be duplicates in the production data)

As ever, the 'true' answer will come from trying both with realistic data and comparing results! :-)

J
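
A rough way to try the "test both" advice outside of DataStage is sketched below in Python. It is only a stand-in: it uses an in-memory SQLite table, so there is no network hop and no parallel engine, and the table name `src`, the columns `cust_id` and `region`, and the helper names are all invented for the example. The shape of the comparison (same data, de-duplicate in the query versus de-duplicate after reading) is the part that carries over.

```python
import sqlite3
import time


def build_table(conn, n_rows, n_distinct):
    """Create a hypothetical source table containing repeated rows."""
    conn.execute("CREATE TABLE src (cust_id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO src VALUES (?, ?)",
        ((i % n_distinct, "R%d" % (i % n_distinct % 7)) for i in range(n_rows)),
    )
    conn.commit()


def dedup_in_query(conn):
    """Let the database remove duplicates with DISTINCT."""
    return conn.execute("SELECT DISTINCT cust_id, region FROM src").fetchall()


def dedup_after_read(conn):
    """Read every row, then remove duplicates on the client side."""
    seen = set()
    kept = []
    for row in conn.execute("SELECT cust_id, region FROM src"):
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept


def timed(fn, conn):
    """Run fn(conn) and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = fn(conn)
    return rows, time.perf_counter() - start


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_table(conn, n_rows=200_000, n_distinct=5_000)
    for fn in (dedup_in_query, dedup_after_read):
        rows, seconds = timed(fn, conn)
        print(fn.__name__, len(rows), "rows in", round(seconds, 3), "s")
```

Timings from a toy like this say nothing about a real job. For a real answer, run both versions of the job against production-like volumes and duplicate ratios.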

Posted: Mon Apr 21, 2008 12:03 pm
by Minhajuddin
A DISTINCT would be a better option if the speed at which data gets read from your DB is slow.

If the speed of data transfer from your DB to your DS Server is fast enough, then a Remove Duplicates would be a better choice.

Posted: Mon Apr 21, 2008 1:06 pm
by kumar_s
Minhajuddin wrote: A DISTINCT would be a better option if the speed at which data gets read from your DB is slow.

If the speed of data transfer from your DB to your DS Server is fast enough, then a Remove Duplicates would be a better choice.

I believe the point he is making is that if you include the DISTINCT and it eliminates a lot of records, less data is transferred between the DB and DS. So this factor is about data movement over the network.
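
The data-movement argument can be put into numbers with a back-of-the-envelope sketch like the one below (Python, purely illustrative). The function name and the idea of describing the data by a single duplicate fraction are assumptions made for the example; real row counts have to come from the actual table.

```python
def rows_over_network(total_rows, duplicate_fraction, distinct_in_query):
    """Estimate how many rows travel from the DB to the DS server.

    duplicate_fraction is the share of rows that are redundant copies
    (0.0 = no duplicates, 1.0 = everything is a duplicate).
    """
    if not 0.0 <= duplicate_fraction <= 1.0:
        raise ValueError("duplicate_fraction must be between 0 and 1")
    if distinct_in_query:
        # The database drops the redundant copies before sending.
        return round(total_rows * (1.0 - duplicate_fraction))
    # Without DISTINCT every row is sent and de-duplicated in the job.
    return total_rows
```

With 1,000,000 source rows of which a quarter are duplicates, `rows_over_network(1_000_000, 0.25, True)` gives 750,000 rows on the wire against 1,000,000 without the DISTINCT. When the duplicate fraction is small the saving is small too, which is when the network factor stops mattering much.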