Remove duplicate issue

abhik05 · Post by **abhik05** » Tue Aug 07, 2012 11:31 pm

We are having problem with remove duplicate stage.We are getting different outpu set for different run of the same job.Our requirement is to sort the source data on 3 different fields(A,B,C).Then preserving the sort order we need to remove duplicate on field D.

We tried it like:

Design 1.
Source -Copy stage (partition on D) -Sort stage(sort by A,B,C With SAME partition)-Remove duplicate stage(Key field D,SAME partition)

Design 2.
Source -Sort stage(sort by A,B,C With partition on D)-Remove duplicate stage(Key field D,SAME partition)

In both cases we are getting different output set from multiple run of the job.

Please suggest if anyone has faced similar issue before.

Thanks
Abhik.

ray.wurlod · Post by **ray.wurlod** » Tue Aug 07, 2012 11:47 pm

Try sorting and partitioning on D,A,B,C.

BI-RMA · Post by **BI-RMA** » Wed Aug 08, 2012 3:33 am

Whatever you are trying to do, your description does not make much sense to me. The Remove Duplicates Stage expects your data to be sorted by the key-columns specified. If you sort by A, B and C and then specify D as key for Remove Duplicates I would expect DataStage to throw a warning for inadequate sorting and to insert an additional sort-step sorting your data by column D. The result is - as you discovered - random.

I guess what you really want to do is to remove Duplicates on Column D and to use sorting on columns A, B and C to identify the correct record to keep. So you will have to sort your data by D first and then by A, B and C.

Hash-Partition by D and - if necessary to avoid skewing - by one or more of columns A, B and C in the same order you use them for sorting. Remove Duplicates on Column D. Make sure you choose to keep the correct record (first or last) in Remove Duplicates according to sort order.