Hash Partitioning

Madhavan VM · Post by **Madhavan VM** » Thu Jul 21, 2005 9:20 am

The question is related to partitioning

From a copy stage,cpy_split, i am splitting the record into two links. one goes to the Aggregator,Agg_sum, and one
pass to a Join stage,Jn_k1.The link which is passed to the Aggregator,Agg_sum is Hashed and sorted on a Key,say,K1.
In Agg_sum we are calculating sum of a column based on key,k1.Now, the link from Aggregator,Agg_sum is joined with
the join,Jn_k1.

The join has two inputs the link.From Aggregator,Agg_sum, and the link from copy stage,cpy_split. The key on which the
records are getting joined is again the key,k1.

The partitioning i have followed for the input to the join is as follows:
From the copy stage,cpy_split, the data is hashed and sorted on key k1
From the Aggregator,Agg_sum the data is same partitioned as before entering the Aggregator i have hashed and sorted
on the key,k1.

In the output one set of data was missing.

Let the output from Aggregator be:

k1
"1","310","4"
"3","342","4"
"4","887","5"
"2","543","4"

Let the output from copy stage be:

k1
"1","X","0001","00000000123","210","349"
"1","X","0002","00000000114","110","349"
"1","X","0003","00000000272","310","349"
"1","X","0004","00000000256","121","349"
"3","X","0001","00000000364","342","147"
"3","X","0002","00000000489","111","147"
"3","X","0003","00000000954","222","147"
"3","X","0004","00000000326","333","147"
"2","X","0001","00000000895","432","247"
"2","X","0002","00000000654","543","247"
"2","X","0003","00000000156","123","247"
"2","X","0004","00000000258","232","247"
"4","X","0001","00000000195","444","47"
"4","X","0002","00000000659","555","47"
"4","X","0003","00000000485","666","47"
"4","X","0004","00000000147","777","47"
"4","X","0004","00000000147","887","47"

The output from the join stage is:
k1
"1","310","4","X","0001","00000000123","210","349","100","489"
"1","310","4","X","0003","00000000272","310","349","0","489"
"1","310","4","X","0004","00000000256","121","349","189","489"
"1","310","4","X","0002","00000000114","110","349","200","489"
"3","342","4","X","0001","00000000364","342","147","0","360"
"3","342","4","X","0004","00000000326","333","147","9","360"
"3","342","4","X","0003","00000000954","222","147","120","360"
"3","342","4","X","0002","00000000489","111","147","231","360"
"4","887","5","X","0003","00000000485","666","47","221","1106"
"4","887","5","X","0004","00000000147","777","47","110","1106"
"4","887","5","X","0001","00000000195","444","47","443","1106"
"4","887","5","X","0002","00000000659","555","47","332","1106"
"4","887","5","X","0004","00000000147","887","47","0","1106"

If you observe the data you can find that the value for Key,k1=2 is not found on the output.

But, If I change the partitioning as below i am getting all the rows as output:
The change is as follows:

The partitioning i have changed for the input to the join is as follows:
From the copy stage,cpy_split, the data is hashed and sorted on key k1(same as before)
From the Aggregator,Agg_sum the data is same partitioned and sorted on key k1.

Could someone tell me what the issue is and more about hash partitioning?

vmcburney · Post by **vmcburney** » Thu Jul 21, 2005 6:58 pm

Immediately when I see input of 1,1,1,1,3,3,3,3,2,2,2,2,4,4,4 results in an output from the join of 1,3,4 I would look at sorting problems. The join stage needs both links to be sorted and usually adds a join if one does not exist. Check your symbols on your links to see if the sort appears on both join input links. Why don't you sort before your aggregation? Output from aggregation is 1,3,2,4. Why not get the 1,2,3,4 order correct from the beginning and sort and partition by k1 all the way through.

elavenil · Post by **elavenil** » Fri Jul 22, 2005 3:51 am

As suggessted by Vincent, sort & partition could cause this issue. Pls look at the generated OSH code to identify the sort and partition that are used in the aggregator and join stage. And use the right partition and sort the data that will help you in resolving the issue.

Regards
Saravanan