The documentation suggests that for components with a sort requirement, DataStage (DS) will analyze the incoming flow to determine whether the sort criteria have already been met by a previous sort stage. If they have not, it will insert a sort internal to the stage.
Questions:
1.) Will DataStage perform this sort analysis across jobs? Suppose I have 2 jobs in a sequence. Job A sorts by key X and writes to a sequential file, and Job B reads the sequential file from A and aggregates grouped by key X. Will DS be smart enough to recognize, while executing this sequence, that a sort in the second job is not necessary?
2.) Same scenario, but using DataSets rather than sequential files. Is there any information in the dataset format that DS can use to know that it is sorted already?
Sorting and DataSets
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
Staging data to sequential files from a parallel job is very inefficient: the first job has to unpartition the data back into a single sequential stream to write the file, and the second job has to repartition it across nodes. If you feed a sequential file directly to an aggregation stage without any sorting options, it will assume the data is already sorted.
It is better to stage the data to a dataset as it retains partitioning. You should sort and partition the data in the first job to suit the aggregation that will occur in the second job.
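To illustrate why the sort and partitioning in the first job have to match the grouping key of the second, here is a minimal Python sketch. It is not DataStage code: `hash_partition`, `aggregate_sorted`, the column names `x` and `amt`, and the two-partition layout are all made up for the example.

```python
from itertools import groupby
from operator import itemgetter
from zlib import crc32


def hash_partition(rows, key, n_parts):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        idx = crc32(str(row[key]).encode()) % n_parts
        parts[idx].append(row)
    return parts


def aggregate_sorted(rows, key, value):
    """Streaming sum per key; only correct if rows arrive sorted by key."""
    return [
        (k, sum(r[value] for r in group))
        for k, group in groupby(rows, key=itemgetter(key))
    ]


rows = [
    {"x": "b", "amt": 2},
    {"x": "a", "amt": 1},
    {"x": "b", "amt": 5},
    {"x": "c", "amt": 4},
    {"x": "a", "amt": 3},
]

# "Job A": partition on the grouping key, then sort within each partition.
staged = [sorted(p, key=itemgetter("x")) for p in hash_partition(rows, "x", 2)]

# "Job B": aggregate each partition independently, with no further sorting.
totals = dict(pair for part in staged for pair in aggregate_sorted(part, "x", "amt"))

# The same aggregation on the unsorted input splits groups apart.
broken = aggregate_sorted(rows, "x", "amt")
```

Because all rows for a given key land in one partition and arrive in key order, the second step never needs to re-sort or repartition. Running `aggregate_sorted` on the unsorted `rows` yields several partial groups for the same key, which is the failure you risk when the staged data does not actually match the aggregation key.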
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
-
- Participant
- Posts: 60
- Joined: Sun Nov 21, 2004 2:24 am
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
A parallel job will automatically add sorting and partitioning to certain stages, such as the Remove Duplicates and Join stages. If you know the data is already sorted, for example because you have a sorted sequential file or a database source with an ORDER BY clause, you can turn off this automatic insertion by adding the environment variables $APT_NO_PART_INSERTION and $APT_NO_SORT_INSERTION to your job and setting them to true. This puts you in total control of the sorting and partitioning of the job, so you don't sort already sorted data.
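Once automatic sort insertion is off, nothing checks that the input really is in key order, so it can be worth verifying a staged file before relying on it. A minimal Python sketch of such a check follows. It is not part of DataStage, and `is_sorted_by` and the column name `x` are hypothetical names chosen for the example.

```python
from operator import itemgetter


def is_sorted_by(rows, key):
    """Return True if rows are in non-descending order of rows[i][key].

    Reads the input once and stops at the first out-of-order pair.
    """
    get_key = itemgetter(key)
    previous = None
    first = True
    for row in rows:
        current = get_key(row)
        if not first and current < previous:
            return False
        previous, first = current, False
    return True


sorted_rows = [{"x": 1}, {"x": 1}, {"x": 2}]
unsorted_rows = [{"x": 2}, {"x": 1}]

ok = is_sorted_by(sorted_rows, "x")
bad = is_sorted_by(unsorted_rows, "x")
```

The check is a single pass, so it costs far less than re-sorting and can run as a cheap guard before a job that assumes sorted input.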