Guys,
I normally use the Remove Duplicates stage to handle duplicates. There are other options as well, such as the Sort stage, Transformer stage, Aggregator, and hashed file (server edition). My questions are:
1. Which is the best way to handle duplicates if the volume of data is huge?
2. Is it better to handle duplicates at the database level or through DataStage?
Thanks
handling duplicate records
It depends (as always).
If the columns that identify duplicates are indexed, then it's better done in the database, because it can be solved quickly in the index.
A unique sort of some kind removes duplicates fairly efficiently. The Remove Duplicates stage gives you the opportunity to specify the first or last in each group as the record to retain. If this is your business requirement, then use the Remove Duplicates stage.
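To make the "first or last in each group" behaviour concrete, here is a minimal Python sketch of what that option does. This is an illustration of the logic only, not DataStage code; the row layout and column names (`cust_id`, `amount`) are made up for the example, and the input is assumed to be pre-sorted on the key, just as the Remove Duplicates stage requires.

```python
# Sketch of the Remove Duplicates stage behaviour: for each group of rows
# sharing a key, retain either the first or the last record.
# Input must already be sorted on the key column.

def remove_duplicates(rows, key, keep="first"):
    """rows: iterable of dicts, pre-sorted on `key`; keep: 'first' or 'last'."""
    result = []
    prev_key = object()  # sentinel that matches no real key value
    for row in rows:
        k = row[key]
        if k != prev_key:
            result.append(row)      # first row of a new key group
            prev_key = k
        elif keep == "last":
            result[-1] = row        # overwrite: the last row in the group wins
    return result

rows = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 1, "amount": 20},
    {"cust_id": 2, "amount": 30},
]
print(remove_duplicates(rows, "cust_id", keep="first"))  # cust 1 keeps amount=10
print(remove_duplicates(rows, "cust_id", keep="last"))   # cust 1 keeps amount=20
```

Whether you want the first or the last record of each group is purely a business-rule decision, which is why the stage exposes it as an option.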
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I would say it is best done in the most economical environment that still meets your SLA with the end customer.
If your database is VERY busy servicing reports for your end users, so that every CPU second saved on that system matters, then do it in DataStage.
Otherwise, the database will usually perform that activity a lot faster, and you will transfer less data across your network. Your overall job will be faster because the filtering happens as far upstream as it possibly can.
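To illustrate the upstream-filtering point, here is a small hedged sketch comparing deduplication pushed into the database (`SELECT DISTINCT`) against pulling every row out and deduplicating client-side. It uses an in-memory SQLite table purely as a stand-in for a real source database; the table and column names are invented for the example.

```python
# Dedup upstream (in the database) vs downstream (after fetching all rows).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "EU"), (1, "EU"), (2, "US"), (2, "US"), (3, "EU")])

# Option 1: dedup upstream -- only the distinct rows cross the connection.
upstream = conn.execute("SELECT DISTINCT cust_id, region FROM orders").fetchall()

# Option 2: dedup downstream -- every row is transferred, then filtered.
all_rows = conn.execute("SELECT cust_id, region FROM orders").fetchall()
downstream = list(dict.fromkeys(all_rows))  # keeps first occurrence of each row

print(len(all_rows), "rows fetched downstream vs", len(upstream), "upstream")
```

Both options produce the same distinct rows; the difference is how much data travels before the duplicates disappear, which is exactly the trade-off described above.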