Replicating Syncsort in PX

joesat
Participant
Posts: 93
Joined: Wed Jun 20, 2007 2:12 am

Post by joesat »

I have to replicate the following scenario in DataStage.

There is a sequential file which has been sorted using five different keys in descending order using a SyncSort program. The output of this file is given to another SyncSort program which uses three keys (these three are part of the five used earlier) and removes the duplicates and outputs to another sequential file.

Now, I have tried to replicate this by using a Sort stage followed by a Remove Duplicates stage. But we already know that if the keys used in a Sort stage and those used in a following stage are different, then a warning is shown. But here there is no other option as I have to replicate the existing scenario. I have used hash partitioning for the Sort stage and 'Same' partitioning for the Remove Dups stage.

The number of output records from the PX job is the same as from the SyncSort utility, but the order is jumbled. Also, is there any way to remove the following warning in this particular scenario? When checking operator: User inserted sort "Sort_stage" does not fulfill the sort requirements of the downstream operator "Remove_Dups_Stage".
Joel Satire
stefanfrost1
Premium Member
Posts: 99
Joined: Mon Sep 03, 2007 7:49 am
Location: Stockholm, Sweden

Post by stefanfrost1 »

As far as I know, the design you describe should give the result you are after; just make sure DataStage hasn't inserted any operators of its own (an inserted sort). We use a similar approach at my site, except that we filter duplicates using an extra column. We accepted those warnings after extensive testing.

However, to get rid of the warning, the Remove Duplicates keys have to be in the same order as your sort keys. For example:

Sort on key1, key2, key3, key4, key5
then remove duplicates on key1, key2, key3

This will eliminate all warnings.
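
The sort-then-deduplicate sequence above can be sketched outside DataStage. This is a minimal Python illustration, not DataStage code: the column names key1 to key5 and the sample rows are hypothetical, and it assumes the duplicate removal keeps the first record of each group.

```python
from itertools import groupby
from operator import itemgetter

SORT_KEYS = ["key1", "key2", "key3", "key4", "key5"]  # hypothetical column names
DEDUP_KEYS = SORT_KEYS[:3]  # leading subset of the sort keys, in the same order


def sort_descending(rows):
    """Sort rows on all five keys, each in descending order."""
    return sorted(rows, key=itemgetter(*SORT_KEYS), reverse=True)


def remove_duplicates(sorted_rows):
    """Keep the first row of each run of equal (key1, key2, key3) values.

    This only works if the input is already grouped on those keys,
    which the five-key sort guarantees because they are its leading keys.
    """
    grouped = groupby(sorted_rows, key=itemgetter(*DEDUP_KEYS))
    return [next(group) for _, group in grouped]


# Hypothetical sample data
rows = [
    {"key1": 1, "key2": 1, "key3": 1, "key4": 2, "key5": 0},
    {"key1": 2, "key2": 0, "key3": 0, "key4": 0, "key5": 0},
    {"key1": 1, "key2": 1, "key3": 1, "key4": 5, "key5": 0},
    {"key1": 2, "key2": 0, "key3": 0, "key4": 1, "key5": 3},
]

result = remove_duplicates(sort_descending(rows))
for row in result:
    print(row)
```

Because key4 and key5 take part in the sort, they decide which record of each (key1, key2, key3) group comes first and therefore survives.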
joesat
Participant
Posts: 93
Joined: Wed Jun 20, 2007 2:12 am

Post by joesat »

Stefan, by 'inserted sort' do you mean the 'perform sort' option within the sort stage and the remove dups stage? If that is so, yes I have disabled them.
And yes I have used the keys in the order that you have shown.

I guess the warnings are not an issue. The problem is that the sorted data is jumbled once it gets into the Remove Dups stage, i.e. the output from the Sort stage (which has five keys in descending order) is 5, 4, 3, 2, 1, but the output from the Remove Dups stage is 3, 4, 1, 2, 5.

Can someone provide me with possible reasons as to why this jumbling up occurs?
Joel Satire
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Are you running a 1-node configuration? If not, try it and see whether your 'jumbling' is coming from your repartitioning.
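
One way partitioning across nodes could produce this kind of jumbling is sketched below in Python, not DataStage. The two-partition split, the modulo stand-in for the hash function, and the sample values are all assumptions for illustration. Each partition is sorted on its own, but reading the partitions back one after another does not give a globally sorted stream; merging on the sort key does.

```python
import heapq

NODES = 2  # assumed two-node configuration


def hash_partition(values, nodes=NODES):
    """Assign each value to a partition; modulo stands in for the real hash."""
    parts = [[] for _ in range(nodes)]
    for v in values:
        parts[v % nodes].append(v)
    return parts


def sort_partitions(parts):
    """Sort each partition independently, in descending order."""
    return [sorted(p, reverse=True) for p in parts]


def collect_in_partition_order(parts):
    """Concatenate the partitions one after another, with no merge on the key."""
    return [v for p in parts for v in p]


def collect_with_merge(parts):
    """Merge the already-sorted partitions on the key, keeping descending order."""
    return list(heapq.merge(*parts, reverse=True))


parts = sort_partitions(hash_partition([3, 1, 5, 2, 4]))
unordered = collect_in_partition_order(parts)
merged = collect_with_merge(parts)

print("per partition:", parts)
print("concatenated: ", unordered)
print("merged:       ", merged)
```

On a single node there is only one partition, so the two ways of collecting give the same result, which is why the 1-node run is a useful test.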
joesat
Participant
Posts: 93
Joined: Wed Jun 20, 2007 2:12 am

Post by joesat »

We are using a two-node configuration. As I mentioned earlier, the Sort stage uses hash partitioning and the Remove Dups stage uses 'Same' partitioning.
Joel Satire