sort stage followed by remove duplicates stage

pillip
Premium Member
Posts: 50
Joined: Thu Dec 10, 2009 10:43 am

Post by pillip »

Hi,

We are using a Sort stage followed by a Remove Duplicates stage in a DataStage job.
Hash partitioning is done on col1, col2, col3 and sorting is done on col1, col2, col4 in the Sort stage. The Remove Duplicates stage then removes duplicates on col1, col2, col3, retaining the first row.

The Remove Duplicates stage is not working correctly. Sometimes it selects the first row and sometimes the last row.

My question is: is it mandatory that the duplicate rows be adjacent for Remove Duplicates to retain the correct row?


Thanks.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Think about what this combination of partitioning and sorting is doing.

By including col3 as a partitioning key, you are placing combinations of (col1,col2) on different partitions, which is causing the duplicates.

Try partitioning by col1 alone or, if it has fewer distinct values than your configuration has nodes, then col1 and col2 only.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pillip
Premium Member
Posts: 50
Joined: Thu Dec 10, 2009 10:43 am

Post by pillip »

Hi Ray,


Current job: a Sort stage followed by a Remove Duplicates stage. In the Sort stage we partition by col1, col2, col3 and then sort by col1, col2, col4; col1 and col2 are sorted ascending, whereas col4 is sorted descending. The Remove Duplicates stage then removes duplicates on col1, col2, col3.

What we think is wrong in the current job is that the partitioning in the Remove Duplicates stage is set to Auto. We suspect that this is causing the duplicates.

We plan to change the partitioning to Same and the sort to a stable sort in the Remove Duplicates stage, and then test it.

Can you let me know your thoughts on this?

Thank you
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I would partition only by col1, and sort by col1, col2, col3. Sorting by col4 doesn't achieve anything, unless it is there to govern the meaning of First or Last in the Remove Duplicates stage.
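To see why the sort keys matter, here is a rough Python sketch, not DataStage code. It assumes Remove Duplicates only compares neighbouring rows within a partition, which is what the advice to sort on the key columns implies. The sample rows and the helper name remove_duplicates_keep_first are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def remove_duplicates_keep_first(rows, key_cols):
    """Keep the first row of each run of adjacent rows sharing key_cols.

    Only neighbouring rows are compared, so the input must already be
    sorted on key_cols for every duplicate to be caught.
    """
    key = itemgetter(*key_cols)
    return [next(group) for _, group in groupby(rows, key=key)]


# Hypothetical rows that all sit on one partition.
rows = [
    {"col1": 1, "col2": 1, "col3": "A", "col4": 10},
    {"col1": 1, "col2": 1, "col3": "B", "col4": 20},
    {"col1": 1, "col2": 1, "col3": "A", "col4": 30},
]
dedup_keys = ("col1", "col2", "col3")

# Sort as in the original job: col1, col2 ascending, col4 descending.
# The two col3 == "A" rows end up separated by the "B" row.
sorted_as_posted = sorted(
    rows, key=lambda r: (r["col1"], r["col2"], -r["col4"])
)
result_as_posted = remove_duplicates_keep_first(sorted_as_posted, dedup_keys)

# Sort on every dedup key first, then col4 descending so that "first"
# means the row with the highest col4.
sorted_on_keys = sorted(
    rows, key=lambda r: (r["col1"], r["col2"], r["col3"], -r["col4"])
)
result_on_keys = remove_duplicates_keep_first(sorted_on_keys, dedup_keys)

print(len(result_as_posted))  # duplicates survive
print([(r["col3"], r["col4"]) for r in result_on_keys])
```

With the posted sort order all three rows survive, because the duplicates are never side by side. Sorting on col1, col2, col3 first leaves one row per key, and the trailing col4 key only decides which row counts as First.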

For any given value of col1, all values of col1,col2,col3 will occur on the same partition.

Only if there are too few distinct values of col1 would I consider partitioning by col1,col2. In that case, rows for a given col1 value may be spread across several partitions, but all rows sharing the same (col1,col2) still land on the same partition.
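The partitioning point can be checked with a small Python simulation. This is only a sketch: zlib.crc32 stands in for whatever hash function the parallel framework really uses, and the row data, the partition count of 4, and the helper names are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, part_cols, num_partitions):
    """Assign each row to a partition by hashing its part_cols values."""
    parts = defaultdict(list)
    for row in rows:
        key = repr(tuple(row[c] for c in part_cols)).encode()
        parts[zlib.crc32(key) % num_partitions].append(row)
    return parts


def partitions_per_key(parts, key_cols):
    """Map each key_cols combination to the set of partitions holding it."""
    seen = defaultdict(set)
    for part_no, part_rows in parts.items():
        for row in part_rows:
            seen[tuple(row[c] for c in key_cols)].add(part_no)
    return seen


# Hypothetical input: every (col1, col2, col3) combination occurs twice.
rows = [
    {"col1": c1, "col2": c2, "col3": c3}
    for c1 in range(4)
    for c2 in range(3)
    for c3 in range(2)
    for _ in range(2)
]
dedup_keys = ("col1", "col2", "col3")

by_col1 = partitions_per_key(hash_partition(rows, ("col1",), 4), dedup_keys)
by_col1_col2 = partitions_per_key(
    hash_partition(rows, ("col1", "col2"), 4), dedup_keys
)

# True when every duplicate set lives on exactly one partition.
co_located_col1 = all(len(p) == 1 for p in by_col1.values())
co_located_col1_col2 = all(len(p) == 1 for p in by_col1_col2.values())
print(co_located_col1, co_located_col1_col2)
```

Partitioning on any subset of the Remove Duplicates keys keeps each duplicate set on a single partition, since rows that agree on col1,col2,col3 also agree on col1 and on col1,col2. The choice between the two is then about how evenly the rows spread across nodes.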

Don't sort in Remove Duplicates stage at all. Provided that the Sort stage is immediately upstream of the Remove Duplicates stage, the framework will not insert any tsort operator.