Remove duplicate issue

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
abhik05
Participant
Posts: 28
Joined: Thu Mar 08, 2012 8:31 am

Remove duplicate issue

Post by abhik05 »

We are having problem with remove duplicate stage.We are getting different outpu set for different run of the same job.Our requirement is to sort the source data on 3 different fields(A,B,C).Then preserving the sort order we need to remove duplicate on field D.

We tried it like:

Design 1.
Source -Copy stage (partition on D) -Sort stage(sort by A,B,C With SAME partition)-Remove duplicate stage(Key field D,SAME partition)

Design 2.
Source -Sort stage(sort by A,B,C With partition on D)-Remove duplicate stage(Key field D,SAME partition)

In both cases we are getting different output set from multiple run of the job.

Please suggest if anyone has faced similar issue before.

Thanks
Abhik.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Try sorting and partitioning on D,A,B,C.
Last edited by ray.wurlod on Wed Aug 08, 2012 3:19 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
BI-RMA
Premium Member
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

Whatever you are trying to do, your description does not make much sense to me. The Remove Duplicates Stage expects your data to be sorted by the key-columns specified. If you sort by A, B and C and then specify D as key for Remove Duplicates I would expect DataStage to throw a warning for inadequate sorting and to insert an additional sort-step sorting your data by column D. The result is - as you discovered - random.

I guess what you really want to do is to remove Duplicates on Column D and to use sorting on columns A, B and C to identify the correct record to keep. So you will have to sort your data by D first and then by A, B and C.

Hash-Partition by D and - if necessary to avoid skewing - by one or more of columns A, B and C in the same order you use them for sorting. Remove Duplicates on Column D. Make sure you choose to keep the correct record (first or last) in Remove Duplicates according to sort order.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
Post Reply