Remove Duplicates...

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
mallireddibalaji
Participant
Posts: 24
Joined: Wed Mar 19, 2008 1:13 am

Remove Duplicates...

Post by mallireddibalaji »

Hi guru's,

I want small clarification,
I want to remove the duplicates.
which one give better performance either DISTINCT clause in source query or Remove duplicates stage in our job.


Thanks in advance.....
M.BALAJI
GOPIK
Participant
Posts: 1
Joined: Sat Jun 24, 2006 2:17 am

Post by GOPIK »

Your souce system is Database or Flatfiles?
If it is Database Query will be faster in my experience
mallireddibalaji
Participant
Posts: 24
Joined: Wed Mar 19, 2008 1:13 am

Post by mallireddibalaji »

GOPIK wrote:Your souce system is Database or Flatfiles?
If it is Database Query will be faster in my experience
Thanks GOPIK,

MY source system is Database.
we are doing the performance tuning of jobs.But before 6 months back somebody tuned the jobs. They are replaced the DISTINCT clause in query by Remove Duplicate stage.
M.BALAJI
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Is the key (set of one or more columns) that identifies duplicates supported by an index in the database? If so, DISTINCT may be quicker.

On the other hand, partitioned data may mean that finish time is faster because each node is only looking after 1/N of the rows (on average).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jhmckeever
Premium Member
Premium Member
Posts: 301
Joined: Thu Jul 14, 2005 10:27 am
Location: Melbourne, Australia
Contact:

Post by jhmckeever »

A couple of other factors which *might* influence your decision are:

1. Is the source table partitioned on the relevant keys?
2. What percentage of rows are duplicates? (or are anticipated to be duplicates in the production data)

As ever, the 'true' answer will come from trying both with realistic data and comparing results! :-)

J
<b>John McKeever</b>
Data Migrators
<b><a href="https://www.mettleci.com">MettleCI</a> - DevOps for DataStage</b>
<a href="http://www.datamigrators.com/"><img src="https://www.datamigrators.com/assets/im ... l.png"></a>
Minhajuddin
Participant
Posts: 467
Joined: Tue Mar 20, 2007 6:36 am
Location: Chennai
Contact:

Post by Minhajuddin »

A DISTINCT would be a better option if the speed at which data gets read from your DB is slow.

If the speed of data transfer from your DB to your DS Server is fast enough, then a Remove Duplicates would be a better choice.
Minhajuddin

<a href="http://feeds.feedburner.com/~r/MyExperi ... ~6/2"><img src="http://feeds.feedburner.com/MyExperienc ... lrow.3.gif" alt="My experiences with this DLROW" border="0"></a>
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Minhajuddin wrote:A DISTINCT would be a better option if the speed at which data gets read from your DB is slow.

If the speed of data transfer from your DB to your DS Server is fast enough, then a Remove Duplicates would be a better choice.
I believe, he is insisting up on the point that, if you included the Distinct and if that reduces / rejects lot of records, the Data transfer between DB and DS will be less. Thus this factor considering the Data movement over networking.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Post Reply