Hi guru's,
I want small clarification,
I want to remove the duplicates.
which one give better performance either DISTINCT clause in source query or Remove duplicates stage in our job.
Thanks in advance.....
Remove Duplicates...
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 24
- Joined: Wed Mar 19, 2008 1:13 am
Remove Duplicates...
M.BALAJI
-
- Participant
- Posts: 24
- Joined: Wed Mar 19, 2008 1:13 am
Thanks GOPIK,GOPIK wrote:Your souce system is Database or Flatfiles?
If it is Database Query will be faster in my experience
MY source system is Database.
we are doing the performance tuning of jobs.But before 6 months back somebody tuned the jobs. They are replaced the DISTINCT clause in query by Remove Duplicate stage.
M.BALAJI
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Is the key (set of one or more columns) that identifies duplicates supported by an index in the database? If so, DISTINCT may be quicker.
On the other hand, partitioned data may mean that finish time is faster because each node is only looking after 1/N of the rows (on average).
On the other hand, partitioned data may mean that finish time is faster because each node is only looking after 1/N of the rows (on average).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 301
- Joined: Thu Jul 14, 2005 10:27 am
- Location: Melbourne, Australia
- Contact:
A couple of other factors which *might* influence your decision are:
1. Is the source table partitioned on the relevant keys?
2. What percentage of rows are duplicates? (or are anticipated to be duplicates in the production data)
As ever, the 'true' answer will come from trying both with realistic data and comparing results!
J
1. Is the source table partitioned on the relevant keys?
2. What percentage of rows are duplicates? (or are anticipated to be duplicates in the production data)
As ever, the 'true' answer will come from trying both with realistic data and comparing results!
J
<b>John McKeever</b>
Data Migrators
<b><a href="https://www.mettleci.com">MettleCI</a> - DevOps for DataStage</b>
<a href="http://www.datamigrators.com/"><img src="https://www.datamigrators.com/assets/im ... l.png"></a>
Data Migrators
<b><a href="https://www.mettleci.com">MettleCI</a> - DevOps for DataStage</b>
<a href="http://www.datamigrators.com/"><img src="https://www.datamigrators.com/assets/im ... l.png"></a>
-
- Participant
- Posts: 467
- Joined: Tue Mar 20, 2007 6:36 am
- Location: Chennai
- Contact:
A DISTINCT would be a better option if the speed at which data gets read from your DB is slow.
If the speed of data transfer from your DB to your DS Server is fast enough, then a Remove Duplicates would be a better choice.
If the speed of data transfer from your DB to your DS Server is fast enough, then a Remove Duplicates would be a better choice.
Minhajuddin
<a href="http://feeds.feedburner.com/~r/MyExperi ... ~6/2"><img src="http://feeds.feedburner.com/MyExperienc ... lrow.3.gif" alt="My experiences with this DLROW" border="0"></a>
<a href="http://feeds.feedburner.com/~r/MyExperi ... ~6/2"><img src="http://feeds.feedburner.com/MyExperienc ... lrow.3.gif" alt="My experiences with this DLROW" border="0"></a>
I believe, he is insisting up on the point that, if you included the Distinct and if that reduces / rejects lot of records, the Data transfer between DB and DS will be less. Thus this factor considering the Data movement over networking.Minhajuddin wrote:A DISTINCT would be a better option if the speed at which data gets read from your DB is slow.
If the speed of data transfer from your DB to your DS Server is fast enough, then a Remove Duplicates would be a better choice.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'