Strange behaviour of aggregator stage

Posted: Sat Jan 21, 2012 8:49 am
by soumya5891
I have a job with the following design.

From a dataset I have one copy stage; from that copy stage one link goes to an aggregator stage and another link goes to an rdup stage. The outputs of the aggregator and rdup stages are then joined. The aggregator keys, rdup keys and join keys are the same, and I have specified the partitioning properly.

The input of the aggregator is hash partitioned and sorted on the aggregation keys. Now, when I set the Aggregator method to Hash, the number of records output from the join stage is different from the join output when the Aggregator method is Sort.

Re: Strange behaviour of aggregator stage

Posted: Wed Feb 15, 2012 2:59 am
by felixyong
If you sort the data, then you should be using "sort" in the Aggregator. Hash is used if you didn't "sort" the data in advance.
soumya5891 wrote: I have a job with the following design.

From a dataset I have one copy stage; from that copy stage one link goes to an aggregator stage and another link goes to an rdup stage. The outputs of the aggregator and rdup stages are then joined. The aggregator keys, rdup keys and join keys are the same, and I have specified the partitioning properly.

The input of the aggregator is hash partitioned and sorted on the aggregation keys. Now, when I set the Aggregator method to Hash, the number of records output from the join stage is different from the join output when the Aggregator method is Sort.
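
To illustrate the difference between the two methods, here is a minimal sketch in plain Python. It is not DataStage code and does not claim to reproduce the stage internals; the function names aggregate_hash and aggregate_sort and the sample rows are hypothetical. It only shows the general idea: a hash-style aggregation keeps one accumulator per key regardless of input order, while a sort-style aggregation closes a group each time the key changes and therefore relies on the input arriving sorted on the keys.

```python
from collections import defaultdict
from itertools import groupby


def aggregate_hash(rows):
    """Hash-style: one accumulator per key, so input order does not matter."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return sorted(totals.items())


def aggregate_sort(rows):
    """Sort-style: emits a group every time the key changes.

    Assumes the rows are already sorted (grouped) on the key; if they are
    not, the same key can be emitted more than once.
    """
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(rows, key=lambda row: row[0])
    ]


# Hypothetical sample data: (key, value) pairs.
sorted_rows = [("A", 1), ("A", 2), ("B", 5)]
unsorted_rows = [("A", 1), ("B", 5), ("A", 2)]

# On sorted input both methods agree.
print(aggregate_hash(sorted_rows))
print(aggregate_sort(sorted_rows))

# On unsorted input the sort-style version splits key "A" into two groups,
# so the number of output records differs.
print(len(aggregate_hash(unsorted_rows)), len(aggregate_sort(unsorted_rows)))
```

If the record counts differ between the two methods in a real job, one thing worth checking is whether the data reaching the aggregator is really still sorted and partitioned on exactly the aggregation keys. This is only one possible cause, not a diagnosis of the job above.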

Posted: Wed Feb 15, 2012 3:19 am
by kandyshandy
Soumya,

Read the IBM doc for your issue. It is barely one page and will give you a clear picture.

The method you choose should depend on your data and aggregation requirements!