Hi,
We have a table that currently holds about 60 million records and will keep growing. It contains a lot of duplicates. To extract only the unique records from a table of this volume, would it be more efficient to use the Remove Duplicates stage, or should I add a DISTINCT to the SQL query in the DB2 connector?
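For reference, a minimal sketch of what the DISTINCT approach would look like in the DB2 connector's user-defined SQL (the schema, table, and column names here are placeholders, not from the original post):

```sql
-- Push the de-duplication down to DB2 so that only unique rows
-- travel from the database to the DataStage job.
SELECT DISTINCT col1, col2, col3
FROM   my_schema.my_table;
```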
I am currently testing this and will post the result when the job finishes, but it would be helpful if anyone can share prior experience.
Thanks
SELECT distinct v/s Remove Duplicate Stage
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
Hi jerome_rajan,
My opinion is:
1) The Oracle DISTINCT keyword treats the first record in a group as the unique record; the remaining records in that group are duplicates of it, based on the value(s) of the key(s) used in the DISTINCT.
2) The Remove Duplicates stage has Retain First and Retain Last options.
3) The Remove Duplicates stage with the Retain First option is equivalent to Oracle DISTINCT.
Experts, correct me if I am wrong.
Thanks,
Srini
Ssreeni,
I know what SQL DISTINCT does and how the Remove Duplicates stage functions. My question is which method would be more efficient for a data volume of 60 million plus records.
I am not bothered about which record to retain since the records are pure duplicates, i.e. every field in the duplicate record has exactly the same value as in the original record.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
Oops! Just realized I posted this in the wrong forum. Can the admin please move it to the right forum?
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
I try to remove excess data as early as possible in a job. In this case the DISTINCT at the database level will save overhead, particularly if the database is remote.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
If the basis for the "distinct" is indexed in the database, it makes a huge amount of sense to include DISTINCT in the query, mainly for the reason Arnd articulated, but also because the query can then be resolved entirely in the index, a huge performance gain.
Otherwise, relative cost (volume of distinct rows / total volume of rows) should be your guideline about where to apply this constraint. Available spare capacity on the database server versus the DataStage server will also be a consideration.
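As a sketch of the index-only-access point above (the index, schema, and column names are illustrative, not from the original post): if an index covers every column in the SELECT list, DB2 can answer the DISTINCT from the index alone, without touching the table data.

```sql
-- A covering index over the columns being de-duplicated lets DB2
-- resolve the DISTINCT entirely within the index.
CREATE INDEX my_schema.ix_dedup
    ON my_schema.my_table (col1, col2, col3);

SELECT DISTINCT col1, col2, col3
FROM   my_schema.my_table;
```

Whether the optimizer actually chooses index-only access depends on the plan; checking the access plan (e.g. with DB2's explain facility) is the way to confirm.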
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.