Sort and Aggregation Options

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
vijayrc
Participant
Posts: 197
Joined: Sun Apr 02, 2006 10:31 am
Location: NJ

Sort and Aggregation Options

Post by vijayrc »

Say I have three different type of aggregations to be done
Aggr1 = Groupkey1,Groupkey2,Groupkey3,
Aggr2 = Groupkey1,Groupkey2,Groupkey3,Groupkey4
Aggr3 = Groupkey1,Groupkey2,Groupkey3,Groupkey4,Groupkey5

Option 1:
Sort1 [GK1,GK2,GK3], Aggregate1 using GK1-3
Sort2 [GK1,GK2,GK3,GK4], Aggregate2 using GK1-4
Sort3 [GK1,GK2,GK3,GK4,GK5], Aggregate3 using GK1-5

Option 2:
Sort1 [GK1,GK2,GK3]--->Agg1
................................--->Sort2[GK4]-->Agg2
......................................................-->Sort3[GK5]-->Agg3

I guess Option 2 makes more sense for performance. What do you guys suggest??
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Sort it at once, in the begining on all the grouping keys, in that order. Then you can use that dataset and pass it through different aggregations for different groupings.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Or, if there's some reason not to sort by GK4 and GK5 up front (for example you may not have them until you've performed a lookup or join) then Option 2 is definitiely to be preferred - but you MUST preserve the sorting order of GK1 through GK3. Do this with an explicit Sort stage, with GK1 through GK3 specfied as "Don't sort (already sorted)".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vijayrc
Participant
Posts: 197
Joined: Sun Apr 02, 2006 10:31 am
Location: NJ

Post by vijayrc »

ray.wurlod wrote:Or, if there's some reason not to sort by GK4 and GK5 up front (for example you may not have them until you've performed a lookup or join) then Option 2 is definitiely to be preferred - but you MUST p ...
Thanks Ray..I'm going to go with Option2
Post Reply