Hashing keys and grouping columns

meet_deb85 · Post by **meet_deb85** » Tue Feb 16, 2010 3:59 am

Hi All,
I am confused about one kind of job. In some of our jobs, the data is hash partitioned on, say, columns A, B and C before the Aggregator stage, and in the Aggregator the grouping is done on A, B, C and D (where D is not constant).
Will the result be correct, and what will be the impact on performance?

thanks

ray.wurlod · Post by **ray.wurlod** » Tue Feb 16, 2010 4:18 am

What's your confusion?

If data are partitioned on A, B and C then - for any particular combination of A, B and C - all values of D will be on the one node, so grouping by A, B, C and D will yield accurate results.
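To see why this holds, here is a minimal Python sketch of the idea. It is not DataStage code: the helper names `hash_partition` and `count_groups`, the sample rows and the node count are all made up for illustration. Rows are hash partitioned on A, B and C, each node counts its own groups on A, B, C and D, and the per-node outputs are simply put together.

```python
from collections import Counter, defaultdict

def hash_partition(rows, key_cols, n_nodes):
    """Send each row to a node chosen by hashing its key columns (illustrative only)."""
    nodes = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        nodes[hash(key) % n_nodes].append(row)
    return nodes

def count_groups(rows, group_cols):
    """Count rows per distinct combination of the grouping columns."""
    return Counter(tuple(row[c] for c in group_cols) for row in rows)

# Hypothetical sample data; D varies within the same (A, B, C) combination.
rows = [
    {"A": 1, "B": "x", "C": 10, "D": "p"},
    {"A": 1, "B": "x", "C": 10, "D": "q"},
    {"A": 1, "B": "x", "C": 10, "D": "p"},
    {"A": 2, "B": "y", "C": 20, "D": "p"},
    {"A": 2, "B": "z", "C": 20, "D": "q"},
]

group_cols = ["A", "B", "C", "D"]
nodes = hash_partition(rows, ["A", "B", "C"], n_nodes=4)

# Aggregate each node independently, then put the per-node outputs together.
per_node = [count_groups(node_rows, group_cols) for node_rows in nodes.values()]
combined = Counter()
for counts in per_node:
    combined.update(counts)

# If no group were split across nodes, the per-node group counts add up
# exactly to the number of distinct groups overall.
total_groups = sum(len(c) for c in per_node)
```

Because every row with the same A, B and C lands on the same node, every row with the same A, B, C and D does too, so `combined` matches what a single-node grouping of all rows would give.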

Performance is immaterial: there's only one way to get the correct result, namely grouping by A, B, C and D (though partitioning on A alone would probably work as well). If data are sorted by A (and maybe then by B and C) then the Sort method for the Aggregator stage will probably finish faster than the Hash method for reasonable volumes of data.
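As a rough illustration of the difference between the two methods, here is a Python sketch. It is only a model of the general idea, not how the Aggregator stage is implemented: the function names, the `amount` column and the sample rows are hypothetical, and the sorted variant assumes the rows arrive sorted on all of the grouping columns.

```python
from itertools import groupby
from operator import itemgetter

def aggregate_hash(rows, group_cols, value_col):
    """Hash-style: keep one running total per group in memory until the input ends."""
    key_of = itemgetter(*group_cols)
    totals = {}
    for row in rows:
        key = key_of(row)
        totals[key] = totals.get(key, 0) + row[value_col]
    return totals

def aggregate_sorted(rows, group_cols, value_col):
    """Sort-style: input must be sorted on group_cols; emit a group when its key changes."""
    key_of = itemgetter(*group_cols)
    for key, group in groupby(rows, key=key_of):
        yield key, sum(r[value_col] for r in group)

group_cols = ["A", "B", "C", "D"]

# Hypothetical sample data with a made-up numeric column to sum.
rows = [
    {"A": 2, "B": "y", "C": 20, "D": "p", "amount": 5},
    {"A": 1, "B": "x", "C": 10, "D": "q", "amount": 3},
    {"A": 1, "B": "x", "C": 10, "D": "p", "amount": 1},
    {"A": 1, "B": "x", "C": 10, "D": "p", "amount": 2},
]

hash_result = aggregate_hash(rows, group_cols, "amount")

# The sort-style pass needs sorted input; here the sort is done explicitly.
sorted_rows = sorted(rows, key=itemgetter(*group_cols))
sorted_result = dict(aggregate_sorted(sorted_rows, group_cols, "amount"))
```

The hash-style version has to hold every group until the last row is read, while the sort-style version only holds the current group, which is one reason already-sorted input can favour it.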

meet_deb85 · Post by **meet_deb85** » Tue Feb 16, 2010 4:23 am

Thanks Ray.