Page 1 of 1

Sort and Aggregation Options

Posted: Thu Dec 21, 2006 7:48 pm
by vijayrc
Say I have three different type of aggregations to be done
Aggr1 = Groupkey1,Groupkey2,Groupkey3,
Aggr2 = Groupkey1,Groupkey2,Groupkey3,Groupkey4
Aggr3 = Groupkey1,Groupkey2,Groupkey3,Groupkey4,Groupkey5

Option 1:
Sort1 [GK1,GK2,GK3], Aggregate1 using GK1-3
Sort2 [GK1,GK2,GK3,GK4], Aggregate2 using GK1-4
Sort3 [GK1,GK2,GK3,GK4,GK5], Aggregate3 using GK1-5

Option 2:
Sort1 [GK1,GK2,GK3]--->Agg1
................................--->Sort2[GK4]-->Agg2
......................................................-->Sort3[GK5]-->Agg3

I guess Option 2 makes more sense for performance. What do you guys suggest??

Posted: Thu Dec 21, 2006 9:38 pm
by DSguru2B
Sort it at once, in the begining on all the grouping keys, in that order. Then you can use that dataset and pass it through different aggregations for different groupings.

Posted: Fri Dec 22, 2006 12:33 am
by ray.wurlod
Or, if there's some reason not to sort by GK4 and GK5 up front (for example you may not have them until you've performed a lookup or join) then Option 2 is definitiely to be preferred - but you MUST preserve the sorting order of GK1 through GK3. Do this with an explicit Sort stage, with GK1 through GK3 specfied as "Don't sort (already sorted)".

Posted: Fri Dec 22, 2006 6:31 am
by vijayrc
ray.wurlod wrote:Or, if there's some reason not to sort by GK4 and GK5 up front (for example you may not have them until you've performed a lookup or join) then Option 2 is definitiely to be preferred - but you MUST p ...
Thanks Ray..I'm going to go with Option2