Replicating Syncsort in PX

joesat · Post by **joesat** » Mon Sep 17, 2007 11:12 pm

I have to replicate the following scenario in DataStage.

There is a sequential file which is sorted on five different keys in descending order by a Syncsort program. The output of that sort is fed to another Syncsort program, which uses three keys (all three are among the five used earlier), removes the duplicates, and writes the result to another sequential file.

I have tried to replicate this with a Sort stage followed by a Remove Duplicates stage. As we know, if the keys used in a Sort stage differ from those used in the following stage, a warning is shown. Here there is no other option, as I have to replicate the existing scenario. I have used hash partitioning for the Sort stage and 'Same' partitioning for the Remove Duplicates stage.

The number of output records from the PX job is the same as from the Syncsort utility, but the order is jumbled. Also, is there any way to remove the warning in this particular scenario? The warning is: When checking operator: User inserted sort "Sort_stage" does not fulfill the sort requirements of the downstream operator "Remove_Dups_Stage".

stefanfrost1 · Post by **stefanfrost1** » Tue Sep 18, 2007 12:13 am

As far as I know, the approach you describe should give you the result you want. Just make sure DataStage hasn't inserted any operators of its own (such as an inserted sort). We use a similar approach at my site, except that we filter duplicates using an extra column. We accepted those warnings after extensive testing.

However, to get rid of the warning, you have to remove duplicates on keys in the same order as your sort. For example:

Sort on key1, key2, key3, key4, key5
then remove duplicates on key1, key2, key3

This will eliminate all warnings.
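The sort-then-deduplicate pattern described above can be sketched outside DataStage in plain Python. This is only an illustration under assumptions: the sample rows are hypothetical, the function name `sort_then_dedup` is made up for this example, and keeping the first row of each duplicate group is an assumed retention rule, not something stated in the thread.

```python
from itertools import groupby

# Hypothetical sample rows: (key1, key2, key3, key4, key5)
rows = [
    (2, 1, 1, 5, 0),
    (2, 1, 1, 9, 3),
    (1, 3, 2, 4, 4),
    (2, 1, 1, 9, 7),
    (1, 3, 2, 8, 1),
    (3, 0, 0, 1, 1),
]


def sort_then_dedup(rows, sort_width=5, dedup_width=3):
    """Sort descending on the first `sort_width` columns, then keep the
    first row of every group that shares the first `dedup_width` columns.

    Assumption: "first row wins" stands in for the duplicate-retention rule.
    """
    ordered = sorted(rows, key=lambda r: r[:sort_width], reverse=True)
    # groupby only merges adjacent rows, which is why the dedup keys
    # must be a leading prefix of the sort keys.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[:dedup_width])]


result = sort_then_dedup(rows)
for row in result:
    print(row)
```

Because the three deduplication keys are the leading columns of the five-key sort, rows with equal key1, key2, key3 are guaranteed to be adjacent, and the surviving row is the one with the highest key4, key5 within its group.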

joesat · Post by **joesat** » Tue Sep 18, 2007 10:16 pm

Stefan, by 'inserted sort' do you mean the 'perform sort' option within the Sort stage and the Remove Duplicates stage? If so, yes, I have disabled them.
And yes, I have used the keys in the order that you have shown.

I guess the warnings are not an issue. The problem is that the sorted data is jumbled once it passes through the Remove Duplicates stage, i.e. the output from the Sort stage (which has five keys in descending order) is 5, 4, 3, 2, 1, but the output from the Remove Duplicates stage is 3, 4, 1, 2, 5.

Can someone provide me with possible reasons as to why this jumbling up occurs?

ArndW · Post by **ArndW** » Tue Sep 18, 2007 10:49 pm

Are you running a 1-node configuration? If not, try it and see whether your 'jumbling' is coming from repartitioning.

joesat · Post by **joesat** » Tue Sep 18, 2007 11:04 pm

We are using a two-node configuration. As I mentioned earlier, the Sort stage uses hash partitioning and the Remove Duplicates stage uses 'Same' partitioning.
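A small Python simulation may help show why a two-node run can look jumbled even when every partition is correctly sorted. It is a sketch under assumptions, not a model of the PX engine: `key % nodes` is a hypothetical stand-in for the real hash partitioner, single integers stand in for the five-key rows, and the exact interleaving a real job produces will differ.

```python
import heapq

NODES = 2  # two-node configuration, as described in the thread


def partition(keys, nodes=NODES):
    """Hypothetical stand-in for hash partitioning: route by key % nodes."""
    parts = [[] for _ in range(nodes)]
    for key in keys:
        parts[key % nodes].append(key)
    return parts


keys = [1, 2, 3, 4, 5]

# Each node sorts only its own partition, in descending order.
parts = [sorted(p, reverse=True) for p in partition(keys)]

# Reading the partitions one after another: every partition is sorted,
# but the combined stream is not globally ordered.
concatenated = [key for p in parts for key in p]

# Merging the already-sorted partitions on the sort key restores the
# global descending order.
merged = list(heapq.merge(*parts, reverse=True))

print(parts)         # [[4, 2], [5, 3, 1]]
print(concatenated)  # [4, 2, 5, 3, 1]
print(merged)        # [5, 4, 3, 2, 1]
```

With a single node there is only one partition, so the concatenated and merged results coincide, which is consistent with the suggestion to test on a 1-node configuration. With two nodes, the final order depends on how the partitions are collected into the output file.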