Aggregator - Partitioning
Moderators: chulett, rschirm, roy
Aggregator - Partitioning
I searched the online manual and it was very vague on the partitioning requirements for the Aggregator Stage. I am working on a two node box and when the data reaches the aggregator stage, it aggregates the data on the two nodes and not as one. I end up with duplicate records and two different totals. Let just say node 1 = 6 and node 2 = 3. I get two records with the same grouping. My requirement was to have 9 as the total.
So to get around this, I hash sorted the records before the aggregator stage and I got the correct result. Is this the correct way or is there a better choice.
I didn't think I needed to resort the records before the stage. I thought AUTO partitioning would handle this. I really don't like overriding defaults.
Thanks,
So to get around this, I hash sorted the records before the aggregator stage and I got the correct result. Is this the correct way or is there a better choice.
I didn't think I needed to resort the records before the stage. I thought AUTO partitioning would handle this. I really don't like overriding defaults.
Thanks,
Jim Stewart
Re: Aggregator - Partitioning
i think in the previous stage,maybe you chose preserve partitioning be set,that will cause Auto partition type in Aggregator stage equals Same type,
MaDFroG
Auto will not 100% assure you the Sort and partition of the data.
Insert the Dump sore in the log to see actual Partition and sort operation held in that job.
What is the actual job design and partition metiods used?
You can sort the data using a sort stage prior to the aggregator and partition the data using Hash partition and maintain the 'Same' partition on aggregator which will not do a repartition or re sort.
Insert the Dump sore in the log to see actual Partition and sort operation held in that job.
What is the actual job design and partition metiods used?
You can sort the data using a sort stage prior to the aggregator and partition the data using Hash partition and maintain the 'Same' partition on aggregator which will not do a repartition or re sort.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
I insert a Dump_Score and got the following message:
op37[2p] {(parallel inserted tsort operator {key={value=TIME_ID, subArgs={asc, nulls={value=first}}}, key={value=REGION_RCM_NUMBER, subArgs={asc, cs}}, key={value=SUB_ACT_CD, subArgs={asc, cs}}, key={value=PI_QUALIFIER_ID, subArgs={asc}}, key={value=PI_ID, subArgs={asc, cs}}}(0) in Aggregator_211)
on nodes (
node1[op37,p0]
node2[op37,p1]
)}
I am not sure what this tells me. As for the design, I guess there are many ways I could accomplish my task.
Design one:
Transformer Stage => Remove_Duplicate Stage => Aggregator Stage.
Grouping is TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID
Count on REGIST_NUM
Running on 2 Nodes.
Incorrect counts. Count reside on each node per group
Design 2
Transformer Stage => Sort Stage => Remover Dup Stage => Aggregator Stage.
Sort Stage runs in Sequential mode.
Correct Amounts
Design 3
Transformer Stage => Remove Dup => Aggregator Stage
Aggregator stage set to HASH and selected grouping
Correct Amounts BUT stage repartitions data.
Design 4
Transformer Stage => Sort Stage = Aggregator Stage
Sort Stage set to Sequential and remove duplicates.
Count correct.
What is the best solution?
Thanks
op37[2p] {(parallel inserted tsort operator {key={value=TIME_ID, subArgs={asc, nulls={value=first}}}, key={value=REGION_RCM_NUMBER, subArgs={asc, cs}}, key={value=SUB_ACT_CD, subArgs={asc, cs}}, key={value=PI_QUALIFIER_ID, subArgs={asc}}, key={value=PI_ID, subArgs={asc, cs}}}(0) in Aggregator_211)
on nodes (
node1[op37,p0]
node2[op37,p1]
)}
I am not sure what this tells me. As for the design, I guess there are many ways I could accomplish my task.
Design one:
Transformer Stage => Remove_Duplicate Stage => Aggregator Stage.
Grouping is TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID
Count on REGIST_NUM
Running on 2 Nodes.
Incorrect counts. Count reside on each node per group
Design 2
Transformer Stage => Sort Stage => Remover Dup Stage => Aggregator Stage.
Sort Stage runs in Sequential mode.
Correct Amounts
Design 3
Transformer Stage => Remove Dup => Aggregator Stage
Aggregator stage set to HASH and selected grouping
Correct Amounts BUT stage repartitions data.
Design 4
Transformer Stage => Sort Stage = Aggregator Stage
Sort Stage set to Sequential and remove duplicates.
Count correct.
What is the best solution?
Thanks
Jim Stewart
Why not just Hash partition the data to the Sort stage instead of setting the stage to a sequential mode? Set the Aggregator Method to Sort instead of Hash - and this should give you the desired.Raftsman wrote: Design 4
Transformer Stage => Sort Stage = Aggregator Stage
Sort Stage set to Sequential and remove duplicates.
Count correct.
Thanks
I believe this to be better than any of the others - unless someone here points otherwise.
Aneesh
------------------
Aneesh
Aneesh
-
- Charter Member
- Posts: 299
- Joined: Wed Nov 13, 2002 5:38 pm
- Location: USA
The best solution is to always attempt to minimize re-partitioning. An aggregator will aggregate by node. Therefore if all the Aggregation keys are not in the same node, your Aggregation will be wrong. Auto partition will not look at your data to make sure you've partitioned it correctly based on the the Aggregator stage keys (I'm pretty sure but could be wrong). So if you have partitioned the data using one set of keys and you desire to aggregate based on a different set that might cross nodes, it will not work and you MUST re-partition or go all Sequential on it.
Based on above information you would like to group by following columns and sumarize REGIST_NUM column value after removing duplicate records.Raftsman wrote: Design one:
Transformer Stage => Remove_Duplicate Stage => Aggregator Stage.
Grouping is TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID
Count on REGIST_NUM
Running on 2 Nodes.
Incorrect counts. Count reside on each node per group
Design 2
Transformer Stage => Sort Stage => Remover Dup Stage => Aggregator Stage.
Sort Stage runs in Sequential mode.
Correct Amounts
TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID
You could do
Transformer Stage => Sort Stage (parallel) => Aggregator Stage (parallel).
Sort Stage(parallel):
Hash patition by the group columns you mentioned and then go to properties of the sort stage you have one option "Allow Duplicates = True" set this to false to remove duplicates.
Aggregator Stage(parallel):
define the partition as same so you don't re-partition data again.
This should work best.
Hmm i will fill this when ever i get one
Information in the Dump score was just giveing out the info regarding a sort (tsort) been included. And the list of keys been used. You would have got another entry in the log which says the partition been used at each stage. Or eplain the partiton used in you design at each stage.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
I am still confused on how I should be partitioning and sorting data for the aggregator stage.
I get this message and I have tried many different scenerios to eliminate it and still it appears. I am sure it's something simple.
I get this,
Cnt_Instr_Num: When checking operator: User inserted sort "Sort_375" does not fulfill the sort requirements of the downstream operator "APT_SortedGroup2Operator in Cnt_Instr_Num"
Here's my design.
TRANSFORMER>>>SORT >>> AGGREGATOR
TRANSFORMER has the following data
AGR_ID, TIME_ID, PI_QUALIFIER, PI_ID, SUB_ACT_CD, RCM_CD
SORT Keys are
TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID
To ensure that the partitioning is correct for the aggregator stage, I forced a re-partition with the following fields. (not sure if I must do this but I have run into problems before)
TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID
AGGREGATOR Stage group key are;
TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID
Count rows on AGR_ID
Input partitioning is set to SAME
I get this warning in all my jobs and I can't seem to figure out why.
Could someone please enlighten me.
Thanks
I get this message and I have tried many different scenerios to eliminate it and still it appears. I am sure it's something simple.
I get this,
Cnt_Instr_Num: When checking operator: User inserted sort "Sort_375" does not fulfill the sort requirements of the downstream operator "APT_SortedGroup2Operator in Cnt_Instr_Num"
Here's my design.
TRANSFORMER>>>SORT >>> AGGREGATOR
TRANSFORMER has the following data
AGR_ID, TIME_ID, PI_QUALIFIER, PI_ID, SUB_ACT_CD, RCM_CD
SORT Keys are
TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID
To ensure that the partitioning is correct for the aggregator stage, I forced a re-partition with the following fields. (not sure if I must do this but I have run into problems before)
TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID
AGGREGATOR Stage group key are;
TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID
Count rows on AGR_ID
Input partitioning is set to SAME
I get this warning in all my jobs and I can't seem to figure out why.
Could someone please enlighten me.
Thanks
Jim Stewart
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Partition by the grouping keys. Make sure that the data are sorted by at least these fields, and in the order specified by the Sort aggregation method if you are using that.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 27
- Joined: Mon Jul 09, 2007 1:06 am