Partitioning in Filter stage

Post by **visvacfirvin** » Mon Jul 28, 2008 12:14 pm

Hi,
I need a clarification regarding partitioning on the Filter stage.

For example, consider the following set of records:

1,NY
2,NJ
3,NJ
4,NY
5,NJ
6,NY

Now I want to filter all the records from NY using the Filter stage (with a two-node configuration file). How does the partitioning work in the following cases?

1. Auto partitioning: will the Filter stage use the filter columns to partition the records?
2. If I explicitly set the partitioning to Hash on state name, will performance improve? Since the NY records move to one node and the NJ records to another, will the system know not to apply the filter on the node that holds NJ?
3. Does running the stage in serial (sequential) mode affect performance?

Thanks,
Firvin

Post by **ray.wurlod** » Mon Jul 28, 2008 4:08 pm

With only six rows nothing you do will make a lot of difference.

What Auto partitioning uses depends on what's upstream of the Filter stage. Hash partitioning may worsen performance if it causes your data to be skewed (for example, if you have many more NY rows than NJ rows). The Filter stage does not use the partitioning algorithm in its filtering calculations; it will always check all the WHERE conditions, whichever node the rows land on. Sequential execution (which is what I assume you mean by "serial") will not improve anything. Data do not need to be key partitioned for the Filter stage, so Round Robin will give the most equitable balance of rows over available processing nodes. However, if a downstream stage does require key-partitioned data, then effecting this as far upstream as possible will minimize the need for subsequent re-partitioning.
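
To make the skew point concrete, here is a rough Python simulation of the two partitioning methods on a two-node layout. It is only a sketch, not DataStage code: the helper names (`round_robin`, `hash_on_state`, `filter_ny`) are made up for illustration, and `crc32` merely stands in for whatever hash function the engine actually uses. The filter is modeled as a per-row check of the condition, run independently on each node.

```python
from zlib import crc32

# The six sample rows from the question: (serial number, state).
ROWS = [(1, "NY"), (2, "NJ"), (3, "NJ"), (4, "NY"), (5, "NJ"), (6, "NY")]
NODES = 2  # assumed two-node configuration


def round_robin(rows, nodes):
    """Deal rows to nodes in turn, ignoring their content."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def hash_on_state(rows, nodes):
    """Send every row with the same state to the same node.

    crc32 is a stand-in hash; two states may still share a node.
    """
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[crc32(row[1].encode()) % nodes].append(row)
    return parts


def filter_ny(partition):
    """Check the condition on every row of one partition.

    Returns the rows kept and the number of rows evaluated.
    """
    kept = [row for row in partition if row[1] == "NY"]
    return kept, len(partition)


# A skewed input: 90 NY rows and 10 NJ rows.
SKEWED = [(i, "NY") for i in range(90)] + [(i, "NJ") for i in range(90, 100)]

for name, partitioner in (("round robin", round_robin), ("hash", hash_on_state)):
    sizes = [len(p) for p in partitioner(SKEWED, NODES)]
    evaluated = sum(filter_ny(p)[1] for p in partitioner(ROWS, NODES))
    print(f"{name}: skewed partition sizes {sizes}, "
          f"rows evaluated on sample {evaluated}")
```

Under either method every row is evaluated once, so hashing on state saves no filter work; it only changes which node does it. With the skewed input, round robin splits the rows evenly, while hashing puts at least the 90 NY rows on a single node.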