Hi All,
I am confused about one job design. In some jobs the data is hash partitioned on, let's say, columns A, B and C before the Aggregator stage, but the Aggregator groups on A, B, C and D (where D is not constant).
Will the result be correct, and what will be the impact on performance?
Thanks
Hashing keys and grouping columns
-
- Premium Member
- Posts: 132
- Joined: Tue Sep 04, 2007 11:38 am
- Location: NOIDA
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
What's your confusion?
If data are partitioned on A, B and C then - for any particular combination of A, B and C - all values of D will be on the one node, so grouping by A, B, C and D will yield accurate results.
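To make the point above concrete, here is a small Python sketch. It is not DataStage code; the function names (hash_partition, aggregate, parallel_aggregate), the four "nodes" and the sample rows are all made up for illustration. It hash partitions rows on a subset of the grouping columns, aggregates each node separately, and compares the combined output with a single-node aggregation.

```python
from collections import Counter, defaultdict

def hash_partition(rows, key_cols, num_nodes):
    """Route each row to a node by hashing only the partitioning columns."""
    nodes = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        nodes[hash(key) % num_nodes].append(row)
    return list(nodes.values())

def aggregate(rows, group_cols):
    """Count rows per group; returns a sorted list of (group_key, count)."""
    counts = Counter(tuple(row[c] for c in group_cols) for row in rows)
    return sorted(counts.items())

def parallel_aggregate(rows, part_cols, group_cols, num_nodes=4):
    """Aggregate each node independently and concatenate the output rows."""
    out = []
    for node_rows in hash_partition(rows, part_cols, num_nodes):
        out.extend(aggregate(node_rows, group_cols))
    return sorted(out)

# Hypothetical sample data with uneven group sizes.
rows = [
    {"A": a, "B": b, "C": c, "D": d}
    for a in range(3) for b in range(2) for c in range(2) for d in range(5)
    for _ in range(a + d + 1)
]

group_cols = ["A", "B", "C", "D"]
serial = aggregate(rows, group_cols)                        # one node
by_abc = parallel_aggregate(rows, ["A", "B", "C"], group_cols)
by_a = parallel_aggregate(rows, ["A"], group_cols)

print(by_abc == serial, by_a == serial)
```

Because every row of a given (A, B, C, D) group shares the same (A, B, C) values, the whole group lands on one node, so no group is split into two output rows.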
Performance is immaterial: there's only one way to get the correct result, namely grouping by A, B, C and D (though partitioning on A alone would probably work as well). If data are sorted by A (and maybe then by B and C) then the Sort method for the Aggregator stage will probably finish faster than the Hash method for reasonable volumes of data.
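As a rough illustration of the difference between the two methods, the Python sketch below contrasts a hash-style aggregation (one table entry per group held in memory) with a sort-style aggregation (one group at a time over pre-sorted input). It is only an analogy, not how the Aggregator stage is implemented. The names hash_method and sort_method and the "amount" column are hypothetical, and the sketch assumes the input to the sort-style version is sorted on all of the grouping columns.

```python
from itertools import groupby
from operator import itemgetter

def hash_method(rows, group_cols):
    """Keep a running total per group in a dict; input order does not matter."""
    key = itemgetter(*group_cols)
    totals = {}
    for row in rows:
        k = key(row)
        totals[k] = totals.get(k, 0) + row["amount"]
    return totals

def sort_method(sorted_rows, group_cols):
    """Stream over input already sorted on group_cols, one group at a time."""
    key = itemgetter(*group_cols)
    for k, group in groupby(sorted_rows, key=key):
        yield k, sum(r["amount"] for r in group)

# Hypothetical sample data; "amount" is an invented column to sum.
sample = [
    {"A": a, "B": b, "C": c, "D": d, "amount": a + 2 * b + 3 * c + d}
    for d in range(4) for c in range(2) for b in range(2) for a in range(3)
]
cols = ["A", "B", "C", "D"]

hashed = hash_method(sample, cols)
streamed = dict(sort_method(sorted(sample, key=itemgetter(*cols)), cols))

print(hashed == streamed)
```

The sort-style version only ever holds one group's running total, which is why it tends to cope better when there are many distinct groups, provided the data arrive already sorted.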
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.