Hi,
When using the Aggregator with the sort method, the input records must be sorted on ALL the aggregation key columns. But is it mandatory to partition on ALL the key columns as well (as the documentation specifies)? Isn't it enough to partition on just one of the key columns?
Thanks and Regards,
Senthil
Partitioning for Aggregator
Re: Partitioning for Aggregator
I think we need to partition on the key on which we are going to aggregate. If it's a single key, then it should work.
Thanks,
Akarsh Kapoor
When you have multiple keys it is sufficient to partition on just the first of those keys.
Last edited by ArndW on Thu Aug 23, 2012 5:19 am, edited 1 time in total.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It usually suffices to partition on just the first grouping (sorting) key. If this has only a few values in its domain, add the second grouping (sorting) key to the partitioning algorithm too.
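To illustrate the point about a first key with only a few distinct values, here is a minimal Python sketch, not DataStage code. The `hash_partition` helper, the column values, and the four-partition layout are all assumptions made up for the example. It simulates hash partitioning on the first key alone and on the first two keys, then counts rows per partition.

```python
import zlib
from collections import Counter

N_PARTS = 4  # assumed number of partitions (nodes)

def hash_partition(key_values, n_parts=N_PARTS):
    """Hypothetical stand-in for a hash partitioner: map key values to a partition number."""
    text = "|".join(str(v) for v in key_values)
    return zlib.crc32(text.encode()) % n_parts

# Made-up rows: the first key (region) has 2 distinct values,
# the second key (account) has 50.
ROWS = [(region, account) for region in ("EAST", "WEST") for account in range(50)]

# Partition on the first key only: at most 2 of the 4 partitions can receive rows.
first_only = Counter(hash_partition(row[:1]) for row in ROWS)

# Partition on the first two keys: rows can spread over all partitions.
first_two = Counter(hash_partition(row[:2]) for row in ROWS)

print("partitions used, first key only:", sorted(first_only.items()))
print("partitions used, first two keys:", sorted(first_two.items()))
```

With only two distinct values in the first key, at least two of the four partitions stay empty, which is the situation where adding the second key to the partitioning helps.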
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Re: Partitioning for Aggregator
Hi - Thanks for your comments.
Ray/ArndW - Is it mandatory to pick only the 'first' key in the sort order as the partition key? Will there be any issues if the data is partitioned on the fourth or fifth key?
The reason I am asking is that I am dealing with huge input datasets (>500 million records) that are partitioned on one key (think of it as an account #/customer # that provides good distribution) and sorted on some six keys (with the partition key column fifth in the sort order).
I am doing aggregation multiple times on this dataset. I want to avoid repartitioning and sorting this dataset again unless it is absolutely necessary. That's why I thought I would get clarity on this.
Regards,
Senthil
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Nothing is mandatory. But think about what will happen if you don't partition on the first sort key value. The same value of the first sort key may occur on more than one partition which means you will end up with more groups than there really are.
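The effect described above can be reproduced with a small Python simulation. This is only a sketch under assumed data, not DataStage code: the rows, the column names (region, account, amount), and the `partition` and `sorted_aggregate` helpers are hypothetical. Each partition is sorted and aggregated independently, as a sort-method aggregation running in parallel would do.

```python
import zlib
from collections import defaultdict
from itertools import groupby

# Made-up rows: (region, account, amount).
ROWS = [
    ("EAST", 1, 10), ("EAST", 2, 20), ("WEST", 1, 30),
    ("WEST", 2, 40), ("EAST", 1, 50), ("WEST", 2, 60),
]

def partition(rows, part_fn, n_parts=2):
    """Distribute rows over n_parts partitions using the given key function."""
    parts = defaultdict(list)
    for row in rows:
        parts[part_fn(row) % n_parts].append(row)
    return parts

def sorted_aggregate(parts, group_fn):
    """Sort each partition, then sum amount per group within that partition only."""
    out = []
    for rows in parts.values():
        for key, grp in groupby(sorted(rows), key=group_fn):
            out.append((key, sum(r[2] for r in grp)))
    return out

by_region = lambda r: zlib.crc32(r[0].encode())   # partition on the grouping key
by_account = lambda r: r[1]                       # partition on a different column

# Group on region, but partition on account: each region is split across partitions.
split = sorted_aggregate(partition(ROWS, by_account), lambda r: r[0])

# Group on region and partition on region: each region stays in one partition.
whole = sorted_aggregate(partition(ROWS, by_region), lambda r: r[0])

# Group on (region, account) and partition on account: account is one of the
# grouping keys, so every group still sits entirely in one partition.
fine = sorted_aggregate(partition(ROWS, by_account), lambda r: (r[0], r[1]))

print(len(split), "output rows when grouping on region, partitioned on account")
print(len(whole), "output rows when grouping on region, partitioned on region")
print(len(fine), "output rows when grouping on (region, account), partitioned on account")
```

In this sketch the first case emits four rows for two real groups, because each partition produces its own EAST and WEST totals. The third case suggests that what matters is whether the partitioning column is among the columns actually being grouped on, which is worth checking for each aggregation run against the existing dataset.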
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.