Aggregator - Partitioning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Raftsman
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Aggregator - Partitioning

Post by Raftsman »

I searched the online manual and it was very vague about the partitioning requirements for the Aggregator stage. I am working on a two-node box, and when the data reaches the Aggregator stage it aggregates on each node separately rather than as one. I end up with duplicate records and two different totals. Let's just say node 1 = 6 and node 2 = 3: I get two records with the same grouping, when my requirement was a single total of 9.

To get around this, I hash-sorted the records before the Aggregator stage and got the correct result. Is this the correct way, or is there a better choice?
I didn't think I needed to re-sort the records before the stage; I thought Auto partitioning would handle this. I really don't like overriding defaults.
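The symptom can be sketched in plain Python (not DataStage — just an illustration of the mechanics): when rows for one group are split across two nodes, each node produces its own partial total, so the group appears twice.

```python
# Minimal sketch in plain Python of the symptom: rows for one group
# split across two "nodes" yield two partial totals for that group.
from collections import Counter

rows = [("A", 1)] * 9                 # nine rows for group "A"

# Data happens to arrive with 6 rows on node 1 and 3 on node 2.
nodes = [rows[:6], rows[6:]]
per_node = [Counter(k for k, _ in n) for n in nodes]
# per_node -> [Counter({'A': 6}), Counter({'A': 3})]: two output
# records with the same grouping, instead of one record with total 9.

# Hash partitioning on the grouping key routes every "A" row to the
# same node, so the per-node total is the correct global total.
hashed = [[], []]
for row in rows:
    hashed[hash(row[0]) % 2].append(row)
per_node_hashed = [Counter(k for k, _ in n) for n in hashed]
# One node now holds all 9 rows; the other holds none of group "A".
```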


Thanks,
Jim Stewart
Zhang Bo
Participant
Posts: 15
Joined: Sun Jul 15, 2007 9:22 pm

Re: Aggregator - Partitioning

Post by Zhang Bo »

I think that in the previous stage you may have left the Preserve Partitioning flag set, which causes the Auto partition type in the Aggregator stage to behave like Same.
MaDFroG
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

What aggregation method were you using when you got the duplicates?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Auto will not 100% assure you of the sorting and partitioning of the data.
Insert a Dump score into the log to see the actual partition and sort operations performed in that job.
What is the actual job design, and which partition methods are used?
You can sort the data with a Sort stage prior to the Aggregator, partition the data using Hash partitioning, and keep 'Same' partitioning on the Aggregator, which will not re-partition or re-sort.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Raftsman
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

I inserted a Dump_Score and got the following message:

op37[2p] {(parallel inserted tsort operator {key={value=TIME_ID, subArgs={asc, nulls={value=first}}}, key={value=REGION_RCM_NUMBER, subArgs={asc, cs}}, key={value=SUB_ACT_CD, subArgs={asc, cs}}, key={value=PI_QUALIFIER_ID, subArgs={asc}}, key={value=PI_ID, subArgs={asc, cs}}}(0) in Aggregator_211)
on nodes (
node1[op37,p0]
node2[op37,p1]
)}
I am not sure what this tells me. As for the design, I guess there are many ways I could accomplish my task.

Design one:

Transformer Stage => Remove_Duplicate Stage => Aggregator Stage.

Grouping is TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID

Count on REGIST_NUM

Running on 2 Nodes.

Incorrect counts; a separate count per group resides on each node.

Design 2

Transformer Stage => Sort Stage => Remove Dup Stage => Aggregator Stage.

Sort Stage runs in Sequential mode.

Correct Amounts

Design 3

Transformer Stage => Remove Dup => Aggregator Stage

Aggregator stage set to HASH and selected grouping

Correct Amounts BUT stage repartitions data.

Design 4

Transformer Stage => Sort Stage => Aggregator Stage

Sort Stage set to Sequential and remove duplicates.

Count correct.


What is the best solution?

Thanks
Jim Stewart
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India

Post by thebird »

Raftsman wrote: Design 4

Transformer Stage => Sort Stage => Aggregator Stage

Sort Stage set to Sequential and remove duplicates.

Count correct.
Thanks
Why not just hash-partition the data into the Sort stage instead of setting the stage to sequential mode? Set the Aggregator method to Sort instead of Hash, and this should give you the desired result.
I believe this to be better than any of the others - unless someone here points out otherwise.

Aneesh
------------------
Aneesh
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

The best solution is always to minimize re-partitioning. An Aggregator aggregates by node, so if all rows sharing the same aggregation keys are not on the same node, your aggregation will be wrong. Auto partitioning will not examine your data to make sure it is partitioned correctly on the Aggregator stage keys (I'm pretty sure, but could be wrong). So if you have partitioned the data using one set of keys and want to aggregate on a different set that might cross nodes, it will not work; you MUST re-partition or go completely sequential.
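The co-location guarantee described above can be checked with a small plain-Python sketch (not DataStage; key values are made up): partitioning on the full set of grouping keys means no key ever spans two partitions, so per-partition aggregates can simply be merged.

```python
# Sketch: hash partitioning on the grouping keys puts all rows that
# share those keys in one partition, so per-partition aggregation
# followed by a simple merge equals the global aggregation.
from collections import Counter

rows = [("2007Q1", "R1"), ("2007Q1", "R1"), ("2007Q1", "R2"),
        ("2007Q2", "R1"), ("2007Q2", "R1"), ("2007Q1", "R2")]

NPART = 2
parts = [[] for _ in range(NPART)]
for row in rows:
    parts[hash(row) % NPART].append(row)   # partition on the grouping keys

# Aggregate each partition independently, then merge the results.
per_part = [Counter(p) for p in parts]
merged = {}
for c in per_part:
    merged.update(c)                        # safe: no key spans two partitions
```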
thamark
Premium Member
Posts: 43
Joined: Thu Jan 29, 2004 12:12 am
Location: US

Post by thamark »

Raftsman wrote: Design one:

Transformer Stage => Remove_Duplicate Stage => Aggregator Stage.

Grouping is TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID

Count on REGIST_NUM

Running on 2 Nodes.

Incorrect counts; a separate count per group resides on each node.

Design 2

Transformer Stage => Sort Stage => Remove Dup Stage => Aggregator Stage.

Sort Stage runs in Sequential mode.

Correct Amounts
Based on the above information, you would like to group by the following columns and summarize the REGIST_NUM column value after removing duplicate records:
TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID

You could do

Transformer Stage => Sort Stage (parallel) => Aggregator Stage (parallel).

Sort Stage(parallel):

Hash-partition by the group columns you mentioned, then go to the Sort stage properties: there is an option "Allow Duplicates = True"; set this to False to remove duplicates.

Aggregator Stage(parallel):

Define the partitioning as Same so you don't re-partition the data again.

This should work best.
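The whole recipe — hash-partition on the grouping keys, sort within each partition, drop duplicates, aggregate with Same partitioning downstream — can be sketched in plain Python (not DataStage; column names are illustrative only):

```python
# Sketch of the recipe: partition on grouping keys, per-partition
# sort, duplicate removal, then per-partition aggregation.
from itertools import groupby

# (TIME_ID, RCM_CD, REGIST_NUM) rows, including one exact duplicate
rows = [("T1", "R1", 10), ("T1", "R1", 10),   # duplicate record
        ("T1", "R1", 20), ("T2", "R1", 30)]

NPART = 2
parts = [[] for _ in range(NPART)]
for r in rows:
    parts[hash(r[:2]) % NPART].append(r)      # hash on the grouping keys

result = {}
for p in parts:
    p.sort()                                  # Sort stage, per partition
    deduped = [k for k, _ in groupby(p)]      # Allow Duplicates = False
    # Aggregator: count rows per (TIME_ID, RCM_CD) group
    for key, grp in groupby(deduped, key=lambda r: r[:2]):
        result[key] = len(list(grp))
# ("T1","R1") has 3 rows but one is a duplicate, so its count is 2.
```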
Hmm, I will fill this in whenever I get one
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

The information in the Dump score just tells you that a sort (tsort) has been inserted, and lists the keys used. You should also have another entry in the log stating the partitioning used at each stage. Otherwise, explain the partitioning you used at each stage in your design.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Raftsman
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

I am still confused about how I should be partitioning and sorting data for the Aggregator stage.

I get this message, and I have tried many different scenarios to eliminate it, but it still appears. I am sure it's something simple.

I get this,

Cnt_Instr_Num: When checking operator: User inserted sort "Sort_375" does not fulfill the sort requirements of the downstream operator "APT_SortedGroup2Operator in Cnt_Instr_Num"

Here's my design.

TRANSFORMER>>>SORT >>> AGGREGATOR

TRANSFORMER has the following data

AGR_ID, TIME_ID, PI_QUALIFIER, PI_ID, SUB_ACT_CD, RCM_CD

SORT Keys are

TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID

To ensure that the partitioning is correct for the Aggregator stage, I forced a re-partition on the following fields (not sure if I must do this, but I have run into problems before):

TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID

AGGREGATOR stage group keys are:

TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID

Count rows on AGR_ID

Input partitioning is set to SAME


I get this warning in all my jobs and I can't seem to figure out why.

Could someone please enlighten me.

Thanks
Jim Stewart
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Partition by the grouping keys. Make sure that the data are sorted by at least these fields, and in the order specified by the Sort aggregation method if you are using that.
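The sort-order requirement matters because a Sort-method Aggregator (like Python's `itertools.groupby`, used here as a plain-Python stand-in, not DataStage) only merges *adjacent* rows with equal keys — which is also why an unsorted input reproduces the duplicate-group symptom:

```python
# Sketch: a sort-method aggregation only merges adjacent equal keys,
# so unsorted input produces duplicate groups.
from itertools import groupby

rows = ["A", "B", "A", "B"]

unsorted_groups = [(k, len(list(g))) for k, g in groupby(rows)]
# Four groups of 1: the two "A" runs are not adjacent, so they come
# out as separate (duplicate) groups.

sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(rows))]
# Two groups: ("A", 2) and ("B", 2) -- correct totals.
```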
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
say2prabhu1
Participant
Posts: 27
Joined: Mon Jul 09, 2007 1:06 am

Post by say2prabhu1 »

I think you may have missed some columns in the output; make sure every column in the output is either a grouping key or the result of a calculation.
Raftsman
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

In the Sort stage I hash-partition, but I do not check the sort checkbox; I assumed it was not needed. I tried checking it and still got the same warnings. Both my Sort keys and my Aggregator grouping keys are the same.
Jim Stewart