handling duplicate records

Posted: Thu Dec 23, 2010 5:12 pm
by just4u_sharath
Guys,
I normally use the Remove Duplicates stage to handle duplicates. There are many other options for handling duplicates, such as the Sort stage, Transformer stage, Aggregator and hashed file (Server edition). My questions are:
1. Which is the best way to handle duplicates if the volume of data is huge?
2. Is it better to handle duplicates at the database level or through DataStage?

thanks

Posted: Thu Dec 23, 2010 6:44 pm
by ray.wurlod
It depends (as always).

If the columns that identify duplicates are indexed, then it's better done in the database, because the duplicates can be resolved quickly from the index.
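
For illustration, a minimal sketch of the database-side approach, using Python with an in-memory SQLite database standing in for the real server (table and column names are made up; SQLite 3.25+ is needed for window functions):

    import sqlite3

    # In-memory demo; in practice this would be the real source database.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customers (cust_id INTEGER, loaded_at TEXT, name TEXT)")
    con.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "2010-12-01", "Alice"),
            (1, "2010-12-20", "Alice B."),  # duplicate key, later load
            (2, "2010-12-05", "Bob"),
        ],
    )
    # An index on the duplicate-identifying column lets the database
    # resolve the grouping from the index rather than scanning the table.
    con.execute("CREATE INDEX ix_customers_id ON customers (cust_id)")

    # Keep the latest row per cust_id; everything else is a duplicate.
    rows = con.execute(
        """
        SELECT cust_id, loaded_at, name
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY cust_id
                       ORDER BY loaded_at DESC
                   ) AS rn
            FROM customers
        )
        WHERE rn = 1
        ORDER BY cust_id
        """
    ).fetchall()
    print(rows)  # [(1, '2010-12-20', 'Alice B.'), (2, '2010-12-05', 'Bob')]

DataStage then only ever sees the already-deduplicated rows.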

A unique sort of some kind removes duplicates fairly efficiently. The Remove Duplicates stage gives you the opportunity to specify the first or last in each group as the record to retain. If this is your business requirement, then use the Remove Duplicates stage.
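
For illustration only (in the stage itself the first/last choice is just a property you set, not code), here is a Python sketch of that sort-then-keep-first-or-last logic on made-up records:

    from itertools import groupby
    from operator import itemgetter

    # Hypothetical records: (key, timestamp, payload)
    records = [
        ("A", "2010-12-01", "first A"),
        ("B", "2010-12-03", "only B"),
        ("A", "2010-12-20", "latest A"),
    ]

    # Sort on the duplicate-identifying key (plus a tie-breaker), then
    # keep exactly one row per key -- first or last, as the rule requires.
    records.sort(key=itemgetter(0, 1))

    keep_last = True  # retain the last record in each group
    deduped = []
    for _, group in groupby(records, key=itemgetter(0)):
        group = list(group)
        deduped.append(group[-1] if keep_last else group[0])

    print(deduped)  # [('A', '2010-12-20', 'latest A'), ('B', '2010-12-03', 'only B')]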

Posted: Thu Dec 23, 2010 9:16 pm
by PaulVL
I would say it is best done in the most economical environment that still meets your SLA with the end customer.

If your database is VERY busy servicing reports for your end users, and every CPU second saved on that system is important, then do it in DataStage.

In most cases, though, your database will perform that activity a lot faster, and as a result you will transfer less data across your network. Your overall job will be faster because the filtering happens as far upstream as it possibly can.
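
A toy illustration of that last point, again assuming Python with an in-memory SQLite database standing in for a remote server (all names made up): deduplicating upstream means fewer rows ever cross the network.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE src (k INTEGER, v TEXT)")
    con.executemany(
        "INSERT INTO src VALUES (?, ?)",
        [(i % 100, "payload") for i in range(10_000)],
    )

    # Dedup downstream: every row crosses the "network" before filtering.
    downstream = con.execute("SELECT k, v FROM src").fetchall()
    print(len(downstream))  # 10000 rows fetched, deduplication still to do

    # Dedup upstream: the database ships only the surviving rows.
    upstream = con.execute("SELECT DISTINCT k, v FROM src").fetchall()
    print(len(upstream))  # 100 rows fetched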