Guys,
I normally use the Remove Duplicates stage to handle duplicates. There are other options as well, such as the Sort stage, Transformer stage, Aggregator, and hashed file (server edition). My questions are:
1. Which is the best way to handle duplicates if the volume of data is huge?
2. Is it better to handle duplicates at the database level or through DataStage?
Thanks
handling duplicate records
It depends (as always).
If the columns that identify duplicates are indexed, then it's better done in the database, because it can be solved quickly in the index.
A unique sort of some kind removes duplicates fairly efficiently. The Remove Duplicates stage gives you the opportunity to specify the first or last in each group as the record to retain. If this is your business requirement, then use the Remove Duplicates stage.
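To make the "first or last in each group" behaviour concrete, here is a minimal Python sketch of what that option does. This is an illustration of the logic only, not DataStage code; the row layout and column names (`cust_id`, `amount`) are made up for the example, and the input is assumed to be pre-sorted on the key, just as the Remove Duplicates stage requires.

```python
# Sketch of the Remove Duplicates stage behaviour: for each group of rows
# sharing a key, retain either the first or the last record.
# Input must already be sorted on the key column.

def remove_duplicates(rows, key, keep="first"):
    """rows: iterable of dicts, pre-sorted on `key`; keep: 'first' or 'last'."""
    result = []
    prev_key = object()  # sentinel that matches no real key value
    for row in rows:
        k = row[key]
        if k != prev_key:
            result.append(row)      # first row of a new key group
            prev_key = k
        elif keep == "last":
            result[-1] = row        # overwrite: the last row in the group wins
    return result

rows = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 1, "amount": 20},
    {"cust_id": 2, "amount": 30},
]
print(remove_duplicates(rows, "cust_id", keep="first"))  # cust 1 keeps amount=10
print(remove_duplicates(rows, "cust_id", keep="last"))   # cust 1 keeps amount=20
```

Whether you want the first or the last record of each group is purely a business-rule decision, which is why the stage exposes it as an option.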
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I would say it is best done in the most economical environment that still meets your SLA with the end customer.
If your database is VERY busy servicing reports for your end users, so that every CPU second saved on that system matters, then do it in DataStage.
Otherwise, the database will usually perform that activity a lot faster, and you will transfer less data across your network. Your overall job will be faster because the filtering happens as far upstream as it possibly can.
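To illustrate the upstream-filtering point, here is a small hedged sketch comparing deduplication pushed into the database (`SELECT DISTINCT`) against pulling every row out and deduplicating client-side. It uses an in-memory SQLite table purely as a stand-in for a real source database; the table and column names are invented for the example.

```python
# Dedup upstream (in the database) vs downstream (after fetching all rows).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "EU"), (1, "EU"), (2, "US"), (2, "US"), (3, "EU")])

# Option 1: dedup upstream -- only the distinct rows cross the connection.
upstream = conn.execute("SELECT DISTINCT cust_id, region FROM orders").fetchall()

# Option 2: dedup downstream -- every row is transferred, then filtered.
all_rows = conn.execute("SELECT cust_id, region FROM orders").fetchall()
downstream = list(dict.fromkeys(all_rows))  # keeps first occurrence of each row

print(len(all_rows), "rows fetched downstream vs", len(upstream), "upstream")
```

Both options produce the same distinct rows; the difference is how much data travels before the duplicates disappear, which is exactly the trade-off described above.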