Strange behaviour of aggregator stage

soumya5891 · Post by **soumya5891** » Sat Jan 21, 2012 8:49 am

I have a job with the following design.

From a dataset I have one Copy stage. From that Copy stage, one link goes to an Aggregator stage and another link goes to a Remove Duplicates (rdup) stage. The outputs of the Aggregator and rdup stages are then joined. The Aggregator keys, rdup keys and join keys are the same, and I have set the partitioning properly.

The input of the Aggregator is hash partitioned and sorted on the aggregation keys. Now, when I set the Aggregator method to Hash, the number of records output from the Join stage is different from the join output when the Aggregator method is Sort.

felixyong · Post by **felixyong** » Wed Feb 15, 2012 2:59 am

If you sort the data, then you should be using the "sort" method in the Aggregator. The "hash" method is for when you have not sorted the data in advance.
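To make the difference concrete, here is a rough Python sketch of the two general techniques. It is only an illustration of hash-based versus sort-based grouping, not DataStage's actual implementation; the function names and sample rows are made up for the example. The hash-style version keeps one counter per key, so input order does not matter. The sort-style version closes a group every time the key changes, so it assumes rows with the same key arrive together.

```python
from collections import Counter
from itertools import groupby


def aggregate_hash(rows):
    """Hash-style count: one counter per key, input order is irrelevant."""
    counts = Counter()
    for key, _value in rows:
        counts[key] += 1
    return list(counts.items())


def aggregate_sort(rows):
    """Sort-style count: emit a group whenever the key changes.

    Assumes rows with equal keys are adjacent (i.e. the input is sorted).
    """
    return [
        (key, sum(1 for _ in group))
        for key, group in groupby(rows, key=lambda row: row[0])
    ]


# Hypothetical sample data: (key, value) pairs.
sorted_rows = [("a", 1), ("a", 2), ("b", 3)]
unsorted_rows = [("a", 1), ("b", 3), ("a", 2)]

# On sorted input both approaches produce one row per key.
hash_on_sorted = aggregate_hash(sorted_rows)
sort_on_sorted = aggregate_sort(sorted_rows)

# On unsorted input the sort-style version splits key "a" into two groups.
hash_on_unsorted = aggregate_hash(unsorted_rows)
sort_on_unsorted = aggregate_sort(unsorted_rows)
```

In this sketch the two approaches agree whenever the input really is grouped by key. So if the record counts differ, one thing worth checking is whether the data reaching the Aggregator is actually sorted on the full set of aggregation keys within each partition.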

kandyshandy · Post by **kandyshandy** » Wed Feb 15, 2012 3:19 am

Soumya,

Read the IBM documentation on the Aggregator stage for your issue. It is barely one page long and will give you a clear picture.

The method you choose should depend on your data and aggregation requirements!
