Sorting Algorithm for Datastage Sort

joesat
Participant
Posts: 93
Joined: Wed Jun 20, 2007 2:12 am

Post by joesat »

I have been trying to re-create a CoSort program in DataStage PX. I have managed to replicate everything except the sort order.

CoSort uses a /SORT statement to sort incoming data. When records have equal keys, however, it still orders them according to some internal algorithm that I do not know.

DataStage doesn't do that. Whether or not we use the Stable sort option, records with equal keys retain their existing order (which is what we need).

To illustrate, suppose there are two records, with the first field as the key column:

AAA|456
AAA|123

CoSort's sort returns:
AAA|123
AAA|456

while DataStage returns the records in their original order:
AAA|456
AAA|123

My client wants me to replicate the CoSort behaviour, and I have not been able to do that in DataStage. What can I do to replicate it? Or what algorithm does DataStage use for sorting?

Thanks.
Joel Satire
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Hello Joe,

There are so many complex, data-dependent mathematical approaches to sorting that the output order of records with equal keys is always undefined unless the columns that distinguish them are part of the sort criteria. Since the sort algorithms and methods in CoSort and DataStage are not published, you can never guarantee that the results will be identical when you have duplicates. In DataStage, if you don't specify "stable sort", the order of the duplicates depends on where and how the records were hashed and collated. This means the order can change if you add an extra record to the file, even one that has no field values in common with the duplicates.

The only way to guarantee identical output is to add explicit sort conditions in both CoSort and DataStage that give a unique order within the duplicate records. Alternatively, you can sort explicitly within DataStage using CoSort.
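As a rough illustration of the tie-breaking idea (a Python sketch, not DataStage or CoSort code), the two records from the question can be sorted once on the key alone and once on the key plus the second field. The pipe-delimited layout comes from the example above; the function names and the choice to compare the second field as a string are assumptions made for the sketch.

```python
# Records from the question, plus one extra key so the sort has work to do.
records = ["AAA|456", "AAA|123", "AAB|001"]


def key_only(rec):
    """Sort key using only the first field (the declared key column)."""
    return rec.split("|")[0]


def key_then_value(rec):
    """Sort key using every field, so no two distinct records tie."""
    fields = rec.split("|")
    # Assumption: the second field is compared as a string, not a number.
    return (fields[0], fields[1])


# Python's sorted() is stable: records with equal keys keep their input order.
stable_by_key = sorted(records, key=key_only)

# With the second field as a tie-breaker, the order of the duplicates
# no longer depends on the input order or on the sort algorithm used.
fully_ordered = sorted(records, key=key_then_value)

print(stable_by_key)
print(fully_ordered)
```

Once every column that distinguishes the duplicates is part of the key, any correct sort has to produce the same sequence, which is why the same criteria must be declared in both tools.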
joesat
Participant
Posts: 93
Joined: Wed Jun 20, 2007 2:12 am

Post by joesat »

Sorry ArndW, I wasn't able to read the complete message. Could you let me know what approach I need to take?

Thanks
Joel Satire
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Either use CoSort in DataStage, or make sure that in both CoSort and DataStage no column is left out of the sort criteria for records with duplicate keys. This is the only way you will get guaranteed identical behaviour.
joesat
Participant
Posts: 93
Joined: Wed Jun 20, 2007 2:12 am

Post by joesat »

Thanks, but the difference arises only while I am sorting the records :( How can I ensure that DataStage sorts the duplicate records the same way as I described for CoSort?

By the way, can CoSort be plugged into PX? I thought that functionality was available only in the Server edition.
Joel Satire
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

As I explained in my original reply, the sorting methods are proprietary, so you cannot guarantee identical output. Either specify all the columns you need in your sort criteria, or sort explicitly using CoSort in DataStage. Also, because of partitioning, DataStage will give you separate streams of data, so you need to handle that as well.
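The partitioning point can be sketched in Python as well. This only models the general idea of separate sorted streams that have to be brought back together; it is not how DataStage is implemented. The CRC32-based partitioning, the function names and the sample records are all assumptions made for the illustration.

```python
import heapq
import zlib


def sort_key(rec):
    """Full sort key: the key column followed by the tie-breaking column."""
    key, value = rec.split("|")
    return (key, value)


def partition(records, n):
    """Distribute records over n streams by a hash of the key column."""
    parts = [[] for _ in range(n)]
    for rec in records:
        key = rec.split("|")[0]
        # CRC32 stands in for a hash partitioner; equal keys share a stream.
        parts[zlib.crc32(key.encode()) % n].append(rec)
    return parts


def parallel_sort(records, n=2):
    """Sort each stream independently, then merge on the same full key."""
    sorted_parts = [sorted(p, key=sort_key) for p in partition(records, n)]
    return list(heapq.merge(*sorted_parts, key=sort_key))


records = ["BBB|2", "AAA|456", "CCC|9", "AAA|123", "BBB|1"]
result = parallel_sort(records, n=3)
print(result)
```

Because the merge uses the same full key as the per-stream sorts, the combined output does not depend on how many streams there were. Simply concatenating the streams would be sorted within each stream but not overall.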