Sorting and DataSets

gsherry1
Charter Member
Posts: 173
Joined: Fri Jun 17, 2005 8:31 am
Location: Canada

Post by gsherry1 »

The documentation suggests that for stages with a sort requirement, DS will analyze the incoming flow to determine whether the sort criteria have already been met by a previous (sort) stage. If the criteria have not been met, it will insert a sort internal to the stage.

Questions:
1.) Will DataStage perform this sort analysis across jobs? Suppose I have 2 jobs in a sequence. Job A sorts by key X and writes to a sequential file, and Job B reads the sequential file from A and aggregates grouped by key X. Will DS be smart enough to recognize, while executing this sequence, that a sort in the second job is not necessary?

2.) Same scenario, but using DataSets rather than sequential files. Is there any information in the dataset format that DS can use to know that it is sorted already?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Staging data to sequential files from a parallel job is very inefficient: the first job has to unpartition the data back into a single sequential stream to write the file, and the second job has to repartition it across nodes. If you feed a sequential file directly to an aggregation stage without any sorting options, it will assume the data is already sorted.

It is better to stage the data to a dataset as it retains partitioning. You should sort and partition the data in the first job to suit the aggregation that will occur in the second job.
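
To illustrate the idea outside DataStage, here is a minimal Python sketch. It is not DataStage code; the function names, the sample rows, and the two-partition setup are all made up for the example. "Job A" hash-partitions on key X and sorts each partition, and "Job B" aggregates each partition as-is, relying on the order it was handed.

```python
from itertools import groupby
from operator import itemgetter


def hash_partition(rows, key, num_partitions):
    """Send every row with the same key value to the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def aggregate_sorted(partition, key, value):
    """Sum `value` per `key`, relying on the rows already being sorted by `key`."""
    return {
        k: sum(row[value] for row in group)
        for k, group in groupby(partition, key=itemgetter(key))
    }


# Hypothetical sample data: key column "X", measure column "amt".
rows = [
    {"X": 3, "amt": 10},
    {"X": 1, "amt": 5},
    {"X": 3, "amt": 7},
    {"X": 2, "amt": 1},
    {"X": 1, "amt": 2},
]

# "Job A": partition on X, sort each partition on X, then stage the result.
staged = [sorted(p, key=itemgetter("X")) for p in hash_partition(rows, "X", 2)]

# "Job B": read the staged partitions unchanged and aggregate without re-sorting.
totals = {}
for partition in staged:
    totals.update(aggregate_sorted(partition, "X", "amt"))

print(totals)
```

Because all rows for a given X sit in one partition and are adjacent within it, the per-partition results can simply be merged. If the second job repartitioned the data differently, that guarantee would be lost.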
legendkiller
Participant
Posts: 60
Joined: Sun Nov 21, 2004 2:24 am

Post by legendkiller »

Just to add one more thing: in the second job you have to keep the same partitioning from the dataset read all the way up to the Aggregator stage, so that you do not need to sort again in job 2.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

A parallel job will automatically add sorting and partitioning to certain stages, such as the Remove Duplicates and Join stages. If you know the data is already sorted, for example a sorted sequential file or a database source with an ORDER BY clause, you can turn off this insertion by adding the environment variables $APT_NO_PART_INSERTION and $APT_NO_SORT_INSERTION to your job and setting them to True. This puts you in total control of the sorting and partitioning of the job, so you don't sort already sorted data.
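
Being in total control also means nothing will rescue you if the data turns out not to be sorted. As a rough illustration only (plain Python, not DataStage; the function and the sample rows are hypothetical), a duplicate-removal step that compares each row with the previous one behaves like this on sorted versus unsorted input:

```python
def remove_adjacent_duplicates(rows, key):
    """Keep the first row of each run of equal keys (needs key-sorted input)."""
    result = []
    previous = object()  # sentinel that compares unequal to any real key
    for row in rows:
        if row[key] != previous:
            result.append(row)
            previous = row[key]
    return result


# Hypothetical sample rows keyed on "id".
sorted_rows = [{"id": 1}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 3}]
unsorted_rows = [{"id": 1}, {"id": 2}, {"id": 1}, {"id": 3}, {"id": 3}]

deduped_sorted = [r["id"] for r in remove_adjacent_duplicates(sorted_rows, "id")]
deduped_unsorted = [r["id"] for r in remove_adjacent_duplicates(unsorted_rows, "id")]

print(deduped_sorted)    # [1, 2, 3]
print(deduped_unsorted)  # [1, 2, 1, 3] -- the second id=1 row survives
```

So only switch the automatic insertion off when you are sure the upstream sort and partitioning really match what the stage needs.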