Sorting Algorithm for Datastage Sort

joesat · Post by **joesat** » Mon Dec 31, 2007 4:58 am

I have been trying to re-create a CoSort program in DataStage PX. I have managed to replicate everything except the sort order.

CoSort uses a /SORT statement to sort incoming data. When records have equal keys, however, it still orders them according to some unknown internal algorithm.

DataStage doesn't do that. Whether or not we use the Stable sort option, records with equal keys keep their existing order (which is what we need).

Let me illustrate. Suppose there are two records, with the first field as the key column:

AAA|456
AAA|123

Cosort's sort returns
AAA|123
AAA|456

While DataStage keeps the input order
AAA|456
AAA|123

My client wants me to replicate the CoSort behaviour, and I have not been able to do that in DataStage. What can I do to replicate it? Alternatively, what algorithm does DataStage use for sorting?

Thanks.

ArndW · Post by **ArndW** » Mon Dec 31, 2007 5:05 am

Hello Joe,

there are so many complex, data-dependent mathematical approaches to sorting that the output order of records with equal sort keys is always undefined unless the remaining columns are part of the sort criteria. Since the sort algorithms and methods in CoSort and DataStage are not published, you can never guarantee that the results will be identical when you have duplicates. In DataStage, if you don't specify "stable sort", the order of the duplicates depends on where and how the records were hashed and collated. This means the order can change if you add an extra record to the file, even one that has no field values in common with the duplicates.

The only way that you can guarantee identical output is to add explicit sort conditions in both CoSort and DataStage to give a unique order within the duplicate records. Or you can explicitly sort within DataStage using CoSort.
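As a rough illustration of the first option (in Python, not DataStage or CoSort), the sketch below sorts the two sample records from the question once on the key column alone and once with the second field added as a tie-breaker. The helper names `key_only` and `full_key` are made up for this example, and the assumption that CoSort's observed order corresponds to "ascending on the second field" rests only on the single sample given above.

```python
# Sample records from the question, plus one extra (hypothetical) record.
records = ["AAA|456", "AAA|123", "AAB|001"]

def key_only(rec):
    # Sort on the first field only; ties are left to the sort algorithm.
    return rec.split("|")[0]

def full_key(rec):
    # Sort on the first field, then on the second field as a tie-breaker.
    # Assumption: the second field compares correctly as a string.
    fields = rec.split("|")
    return (fields[0], fields[1])

# Python's sorted() is stable, so equal keys keep their input order.
stable_result = sorted(records, key=key_only)

# With every column in the key there are no ties left to resolve,
# so any correct sort must produce this same output.
tiebreak_result = sorted(records, key=full_key)

# The fully keyed result does not depend on the input order.
reversed_input_result = sorted(reversed(records), key=full_key)

print(stable_result)
print(tiebreak_result)
```

Once the key is unique per record, the output no longer depends on which tool or algorithm performs the sort, which is why adding the tie-breaker to both jobs is the portable fix.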

joesat · Post by **joesat** » Mon Dec 31, 2007 6:15 am

Sorry ArndW, I wasn't able to read the complete message. Can you let me know what approach I need to take?

Thanks

ArndW · Post by **ArndW** » Mon Dec 31, 2007 6:18 am

Either use CoSort in DataStage, or make sure that both the CoSort and DataStage sorts include every column needed to order records with duplicate keys, so that no ties are left unsorted. This is the only way you will get guaranteed identical behaviour.

joesat · Post by **joesat** » Mon Dec 31, 2007 6:23 am

Thanks, but the difference arises only while I am sorting the records

...how can I ensure that the sorting of the duplicate records in datastage is similar to what I described for Cosort?

By the way, can CoSort be plugged into PX? I thought that was available only in the Server edition.

ArndW · Post by **ArndW** » Mon Dec 31, 2007 6:26 am

As I explained in my original reply, because the sorting methods are proprietary you cannot guarantee identical output. So either specify all the columns you need in your sort criteria, or explicitly sort using CoSort in DataStage. Also, because DataStage partitions the data, you will get separate streams of data, so you need to handle that as well.
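To make the partitioning point concrete, here is a small Python model (not DataStage's actual mechanism) of a hash-partitioned sort followed by a merge on the same full key. The function names, the use of `zlib.crc32` as the partitioning hash, and the sample records are all assumptions chosen for the sketch.

```python
import heapq
import zlib

def full_key(rec):
    # Every column takes part in the sort key, so there are no ties.
    return tuple(rec.split("|"))

def partition(records, n):
    # Hash-partition on the first field so equal keys land together.
    parts = [[] for _ in range(n)]
    for rec in records:
        key = rec.split("|")[0]
        parts[zlib.crc32(key.encode()) % n].append(rec)
    return parts

def partitioned_sort(records, n):
    # Sort each partition independently, then merge the sorted streams
    # on the same full key to get one globally ordered output.
    sorted_parts = [sorted(p, key=full_key) for p in partition(records, n)]
    return list(heapq.merge(*sorted_parts, key=full_key))

records = ["BBB|2", "AAA|456", "CCC|9", "AAA|123", "BBB|1"]
single = partitioned_sort(records, 1)
multi = partitioned_sort(records, 4)

print(single)
print(multi)
```

In this model the output is the same for one partition or four, because the separate streams are recombined with a merge on the complete key. Simply concatenating the partitions instead would give an order that depends on how the records were distributed.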