Sort in PX Version 7

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
johnman
Participant
Posts: 8
Joined: Sat Jan 10, 2004 1:45 pm

Sort in PX Version 7

Post by johnman »

Hi All,
I have real concerns about the parallelism in Data Stage.
The general problems coming is Join Stage not joining the Data, Aggregator stage not aggregating the data, Sort stage not doing the Sort.
Then after setting the Input Partitioning to Hash and performing sort on Input through "Perform Sort" check box, Join stage and aggregator stage are working fine.
Can anybody pls tell me what is the use of setting the Input Partitioning to "Auto" if DataStage can not automatically key partition and sort?

But sort stage is not giving the desired result.
Even with Input Partitioning set to Hash, the data is not coming in the Sorted order. It is sorting the data in partitions but sorting across partitioning is not happening.

For Ex: Input Data is - 2,5,3, 8,34,64, 23,76,24, 12,1,6
The output is coming like this: 1,2,3, 34,64,76, 12,23,24, 5,6,8
Basically It is sorting 3 - 3 records in each partition.

Can anybody help me in this?
PS: I am running the job with 4 nodes configuration.
For getting the Sort stage to work as desired, I have to set the Execution Mode to 'Sequential' i.e. not taking the advantage of Parallelism.

The funniest thing i have seen is that we have the "Perform Sort" check box in the Input Properties for Hash Partitioning. Then what is Sort Stage doing? Again using that option also not giving me the desired result.

Regards
John
vzoubov
Participant
Posts: 28
Joined: Tue Feb 05, 2002 12:30 pm
Location: Boston, MA

Re: Sort in PX Version 7

Post by vzoubov »

John,

Please see my notes below...
johnman wrote:Hi All,
I have real concerns about the parallelism in Data Stage.
The general problems coming is Join Stage not joining the Data, Aggregator stage not aggregating the data, Sort stage not doing the Sort.
Then after setting the Input Partitioning to Hash and performing sort on Input through "Perform Sort" check box, Join stage and aggregator stage are working fine.
Can anybody pls tell me what is the use of setting the Input Partitioning to "Auto" if DataStage can not automatically key partition and sort?
Setting partitioning type to "Auto" means that you simply don't care how DataStage will partition the data. However the Sort stage REQUIRES that the records with the same key go into the same partition. Hash isn't the only partitioning type that complies with this requirement. DB2 partitioning also can be used. On the other hand if your data already hash partitioned before the Sort stage there's no need to do repartitioning. And in this case you'd use Same for Sort.
johnman wrote: But sort stage is not giving the desired result.
Even with Input Partitioning set to Hash, the data is not coming in the Sorted order. It is sorting the data in partitions but sorting across partitioning is not happening.

For Ex: Input Data is - 2,5,3, 8,34,64, 23,76,24, 12,1,6
The output is coming like this: 1,2,3, 34,64,76, 12,23,24, 5,6,8
Basically It is sorting 3 - 3 records in each partition.

Can anybody help me in this?
PS: I am running the job with 4 nodes configuration.
For getting the Sort stage to work as desired, I have to set the Execution Mode to 'Sequential' i.e. not taking the advantage of Parallelism.
Besides choosing the right Partitioning strategy you also have to take care of Collecting your data in your PX job. If you need the output to be sorted use the Sort Merge Collector type.
johnman wrote: The funniest thing i have seen is that we have the "Perform Sort" check box in the Input Properties for Hash Partitioning. Then what is Sort Stage doing?
The Sort stage sorts previously partitioned data. "Perform Sort" check in the Input Properties is merely a space saver: you tell PX to sort data without dropping the Sort stage on the pallets.

Vitali.
johnman
Participant
Posts: 8
Joined: Sat Jan 10, 2004 1:45 pm

Post by johnman »

Thanks a Lot Vitali.

One more point I wud like to clarify:
Right now I am setting the input partitioning to Hash and checking "Perform Sort" to get the desired result. Is the any better and efficient method to do the same.
Which partitioning is better when we want same key values to land up in same partition?

Thanks & Regards
John
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Sort in PX Version 7

Post by Teej »

johnman wrote:The general problems coming is Join Stage not joining the Data, Aggregator stage not aggregating the data, Sort stage not doing the Sort.
This is your biggest challenge in understanding Parallelism -- the huge flexibility that the tool offers you in handling data the way YOU want to do it, not what they THINK you want to do it.

It may seems 'obvious' to yourself that it should only sort on key fields, but...

What if you have already sorted those key fields on the previous stage, and do not want to waste processing time doing another sort? Just set the partition to "Same" and you're good.

What if you have an unique requirement that have you specifically sort it on a particular field that could be repeated across an unique set of keys that needs to be on specific partitions?

Trust me on this, and handle the partitioning yourself. The control is explicitly your to do it. We generally use "Hash" due to its relatively low cost behavior, and our mostly balanced data (though this is not always true). I do have some co-workers here who like to explicitly state "Same" partitioning on stages that do not need to be 'rearranged'. I prefers 'Auto', but that person's solution may be faster than mine due to the concept behind it. (We have not yet done the metrics on that -- Life on the OverDue Project Lane.)

If you would Hash on the key fields for all of the stages you are complaining about, you would notice an amazing thing: Those things are suddenly behaving the way you expect them to do. Sorting is only required when the Stage requires it. The Help menu is VERY clear on which stages requires it (click on the Help button on the main window of each stage and read its description). Otherwise, don't sort it unless you have a business rule requiring it. Join Stage is one of those that do require it on all input links.

One other thing that you can do is sort and partition on some key fields, and sort ONLY on extra fields -- (How do you get the lowest/highest value based on a key? The above trick will do it within the Remove Duplicate Stage.)

Trust me on this: Be very careful on your research on Partitioning and Sorting. Doing it wrong will produce wrong data or extremely poor performance. Doing it right will amaze your manager.

-T.J.

P.S. If your database is DB2... use it. As far as I know, it's FAAAAAAAAAAAAST for DB2-based data compared to other options.
Developer of DataStage Parallel Engine (Orchestrate).
vzoubov
Participant
Posts: 28
Joined: Tue Feb 05, 2002 12:30 pm
Location: Boston, MA

Post by vzoubov »

johnman wrote:Thanks a Lot Vitali.

One more point I wud like to clarify:
Right now I am setting the input partitioning to Hash and checking "Perform Sort" to get the desired result. Is the any better and efficient method to do the same.
Which partitioning is better when we want same key values to land up in same partition?

Thanks & Regards
John
It depends on what you need the sorted data for. If you are doing joins aggregations, de-dupping etc. Hash most likely will be the best partitioning type. However if you are doing a total sort (across all partitions) consider Range partitioning combined with Ordered collecting.
If you need more information on this subject send me an e-mail.

Vitali.
rggoud
Participant
Posts: 15
Joined: Thu Nov 06, 2003 9:59 am

Post by rggoud »

Hi,

We are using only "Auto" Partitioning Method in our loads. And all of them are working as expected. Is there any project level settings that takes care of the issue Johnman is referring ? When i checked the log messgages in director, our loads initializes "orchsort" operator. When i tried the job with runtime option "nosortinsert", the results are not correct. What is the difference between the job with and without this runtime option ?

Thanks.


vzoubov wrote:
johnman wrote:Thanks a Lot Vitali.

One more point I wud like to clarify:
Right now I am setting the input partitioning to Hash and checking "Perform Sort" to get the desired result. Is the any better and efficient method to do the same.
Which partitioning is better when we want same key values to land up in same partition?

Thanks & Regards
John
It depends on what you need the sorted data for. If you are doing joins aggregations, de-dupping etc. Hash most likely will be the best partitioning type. However if you are doing a total sort (across all partitions) consider Range partitioning combined with Ordered collecting.
If you need more information on this subject send me an e-mail.

Vitali.
Post Reply