Sorting and DataSets

Posted: Tue Jul 19, 2005 11:40 am
by gsherry1
The documentation suggests that for components with a sort requirement, DS will analyze the incoming flow to determine whether the sort criteria have already been met by a previous (sort) stage. If the criteria have not been met, it will insert a sort internal to the stage.

Questions:
1.) Will DataStage perform this sort analysis across jobs? Suppose I have 2 jobs in a sequence. Job A sorts by key X and writes to a sequential file, and Job B reads the sequential file from A and wants to aggregate grouped by key X. Will DS be smart enough to recognize, while executing this sequence, that a sort in the second job is not necessary?

2.) Same scenario, but using DataSets rather than sequential files. Is there any information in the dataset format that DS can use to know that it is sorted already?

Posted: Tue Jul 19, 2005 6:25 pm
by vmcburney
Staging data to sequential files from a parallel job is very inefficient, because the first job has to collect the partitioned data back into a single sequential stream to write the file, and the second job then has to repartition it across nodes. If you feed a sequential file directly to an aggregation stage without any sorting options, the stage will assume the data is already sorted.

It is better to stage the data to a dataset as it retains partitioning. You should sort and partition the data in the first job to suit the aggregation that will occur in the second job.
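
To illustrate why the sort order matters for the aggregation in the second job, here is a minimal Python sketch. It is not DataStage code; the function name `aggregate_sorted` and the sample rows are made up for illustration. It sums values per key in a single streaming pass, which only works when rows with the same key are adjacent, i.e. the input is already sorted (or at least grouped) by that key.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows):
    """Sum values per key in one pass, assuming rows are grouped by key.

    rows: iterable of (key, value) tuples. Hypothetical helper, for illustration only.
    """
    result = []
    # groupby only merges *adjacent* rows with equal keys, so it relies on
    # the input already being sorted (or at least grouped) by key.
    for key, group in groupby(rows, key=itemgetter(0)):
        result.append((key, sum(value for _, value in group)))
    return result


# Made-up sample data: the same rows, once sorted by key and once not.
sorted_rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
unsorted_rows = [("a", 1), ("b", 5), ("a", 2), ("c", 3), ("c", 4)]

sorted_totals = aggregate_sorted(sorted_rows)
unsorted_totals = aggregate_sorted(unsorted_rows)

print(sorted_totals)    # one total per key
print(unsorted_totals)  # key "a" shows up twice because its rows were not adjacent
```

With unsorted input nothing fails; the totals are simply wrong, which is why an aggregation that assumes sorted input has to be fed data that really is sorted on the grouping key.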

Posted: Tue Jul 19, 2005 9:42 pm
by legendkiller
Just to add one more thing: you have to keep the same partitioning in the second job, from reading the dataset up to the aggregator stage, so that you need not sort again in job 2.
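
A rough Python sketch of why the partitioning has to stay the same (again not DataStage code; `hash_partition`, `round_robin_partition` and `aggregate_partition` are hypothetical names, and the CRC32 hash is just a stand-in for whatever key-based partitioner is used). If sorted rows are partitioned by a hash of the key, every key ends up in exactly one partition and each partition stays sorted, so aggregating each partition independently gives the right totals. If the rows are then redistributed by some other method, rows of one key can land in different partitions and the per-partition totals come out split.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def hash_partition(rows, num_partitions):
    """Send each (key, value) row to the partition chosen by a hash of its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def round_robin_partition(rows, num_partitions):
    """Deal rows out one by one, ignoring the key."""
    partitions = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        partitions[position % num_partitions].append(row)
    return partitions


def aggregate_partition(rows):
    """Sum values per key within one partition, assuming it is sorted by key."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(rows, key=itemgetter(0))]


# Made-up sample data, already sorted by key.
rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]

# Key-based partitioning: each key lives in one partition, order is preserved.
hashed_totals = sorted(total
                       for part in hash_partition(rows, 2)
                       for total in aggregate_partition(part))

# Key-agnostic repartitioning: rows of the same key get separated.
rr_totals = sorted(total
                   for part in round_robin_partition(rows, 2)
                   for total in aggregate_partition(part))

print(hashed_totals)  # one total per key
print(rr_totals)      # keys "a" and "c" are split across partitions
```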

Posted: Tue Jul 19, 2005 11:51 pm
by vmcburney
A parallel job will automatically add sorting and partitioning to certain stages such as the remove duplicates and join stages. If you know the data is already sorted, let's say you have a sorted sequential file or a database source with an ORDER BY clause, you can turn off this automatic insertion by adding the environment variables $APT_NO_PART_INSERTION and $APT_NO_SORT_INSERTION to your job and setting them to True. This puts you in total control of the sorting and partitioning of the job, so you don't sort already sorted data.
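
Since turning off sort insertion means nothing will catch input that is not actually sorted, it can be worth checking a sample of the data first. A minimal sketch in Python, outside DataStage (the helper name `is_sorted_by` and the sample rows are assumptions for illustration):

```python
def is_sorted_by(rows, key):
    """Return True if rows are in non-decreasing order of key(row)."""
    previous = None
    first = True
    for row in rows:
        current = key(row)
        # Any row whose key is smaller than the one before it breaks the order.
        if not first and current < previous:
            return False
        previous = current
        first = False
    return True


# Made-up sample rows of (key X, amount).
good_rows = [("a", 1), ("a", 2), ("b", 5), ("c", 3)]
bad_rows = [("a", 1), ("b", 5), ("a", 2), ("c", 3)]

good_is_sorted = is_sorted_by(good_rows, key=lambda row: row[0])
bad_is_sorted = is_sorted_by(bad_rows, key=lambda row: row[0])

print(good_is_sorted)
print(bad_is_sorted)
```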