Hi,
I need a clarification regarding partitioning in the Filter stage.
For example, consider the following set of records.
1,NY
2,NJ
3,NJ
4,NY
5,NJ
6,NY
Now I want to filter all the records from NY using a Filter stage (with a two-node config file). How does the partitioning work in the following cases?
1. Auto partition - Will the Filter stage use the filter columns to partition the records?
2. If I explicitly set the partitioning to Hash on state name, will the performance be improved? As the records from NY move to one node and NJ to another node, will the system know not to apply the filter on the node which has NJ?
3. Does setting the partition on Serial no affect performance?
Thanks,
Firvin
Partitioning in Filter stage
Moderators: chulett, rschirm, roy
- Premium Member
- Posts: 49
- Joined: Fri Dec 14, 2007 1:43 pm
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
With only six rows nothing you do will make a lot of difference.
What Auto partitioning uses depends on what's upstream of the Filter stage. Hash partitioning may worsen performance if it causes your data to be skewed (you have, for example, many more NY than NJ). The Filter stage does not use the partitioning algorithm in its filtering calculations; on every node it will always check all the WHERE conditions against every row it receives. Sequential execution (which is what I assume you mean by "serial") will not improve anything.
Data do not need to be key partitioned for the Filter stage, so Round Robin will give the most equitable balance of rows over available processing nodes. However, if a downstream stage does require key-partitioned data, then effecting this as far upstream as possible will minimize the need for subsequent re-partitioning.
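To make the point concrete, here is a rough Python simulation of the six sample rows on two nodes. It is only a sketch and not DataStage code: the function names are made up for illustration, and `zlib.crc32` is a stand-in assumption for whatever hash function the engine really uses. It shows that the filter test is evaluated once per row under either partitioning method, so hashing on state saves no comparisons; it only changes which node does the work.

```python
import zlib

# The six sample records from the question: (serial no, state).
ROWS = [(1, "NY"), (2, "NJ"), (3, "NJ"), (4, "NY"), (5, "NJ"), (6, "NY")]
NODES = 2  # two-node configuration, as in the question


def round_robin(rows, nodes):
    """Deal rows to nodes in turn, ignoring their content."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def hash_on_state(rows, nodes):
    """Send each row to a node chosen by hashing the state column.

    crc32 is an illustrative assumption, not the engine's real hash.
    """
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[zlib.crc32(row[1].encode()) % nodes].append(row)
    return parts


def filter_all(parts, state="NY"):
    """Apply the WHERE test to every row on every node.

    Returns the rows kept and the number of comparisons made.
    """
    kept, checked = [], 0
    for part in parts:
        for row in part:
            checked += 1  # the test runs even on a node holding only NJ
            if row[1] == state:
                kept.append(row)
    return kept, checked


if __name__ == "__main__":
    for name, partitioner in (("round robin", round_robin),
                              ("hash on state", hash_on_state)):
        parts = partitioner(ROWS, NODES)
        kept, checked = filter_all(parts)
        print(name, "rows per node:", [len(p) for p in parts],
              "comparisons:", checked, "kept:", sorted(kept))
```

With real volumes, the rows-per-node figures are where hash partitioning on a low-cardinality column such as state can hurt: one node may end up with most of the data while the other sits nearly idle.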
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.