Aggregator - Partitioning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Raftsman
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Aggregator - Partitioning

Post by Raftsman »

I searched the online manual and it was very vague about the partitioning requirements for the Aggregator stage. I am working on a two-node box, and when the data reaches the Aggregator stage it aggregates on each node separately rather than as one. I end up with duplicate records and two different totals. Let's just say node 1 = 6 and node 2 = 3: I get two records with the same grouping, when my requirement was a single total of 9.

To get around this, I hash-sorted the records before the Aggregator stage and got the correct result. Is this the correct way, or is there a better choice?
I didn't think I needed to re-sort the records before the stage; I thought Auto partitioning would handle this. I really don't like overriding defaults.
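The symptom can be sketched in plain Python (not DataStage — just an illustration of the mechanics): when rows for one group are split across two nodes, each node produces its own partial total, so the group appears twice.

```python
# Minimal sketch in plain Python of the symptom: rows for one group
# split across two "nodes" yield two partial totals for that group.
from collections import Counter

rows = [("A", 1)] * 9                 # nine rows for group "A"

# Data happens to arrive with 6 rows on node 1 and 3 on node 2.
nodes = [rows[:6], rows[6:]]
per_node = [Counter(k for k, _ in n) for n in nodes]
# per_node -> [Counter({'A': 6}), Counter({'A': 3})]: two output
# records with the same grouping, instead of one record with total 9.

# Hash partitioning on the grouping key routes every "A" row to the
# same node, so the per-node total is the correct global total.
hashed = [[], []]
for row in rows:
    hashed[hash(row[0]) % 2].append(row)
per_node_hashed = [Counter(k for k, _ in n) for n in hashed]
# One node now holds all 9 rows; the other holds none of group "A".
```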


Thanks,
Jim Stewart
Zhang Bo
Participant
Posts: 15
Joined: Sun Jul 15, 2007 9:22 pm

Re: Aggregator - Partitioning

Post by Zhang Bo »

I think that in the previous stage you may have left the Preserve Partitioning flag set, which causes the Auto partition type in the Aggregator stage to behave like Same.
MaDFroG
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

What aggregation method were you using when you got the duplicates?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Auto will not 100% assure you of the sorting and partitioning of the data.
Insert a Dump score into the log to see the actual partition and sort operations performed in that job.
What is the actual job design, and which partition methods are used?
You can sort the data with a Sort stage prior to the Aggregator, partition the data using Hash partitioning, and keep 'Same' partitioning on the Aggregator, which will not re-partition or re-sort.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Raftsman
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

I inserted a Dump_Score and got the following message:

op37[2p] {(parallel inserted tsort operator {key={value=TIME_ID, subArgs={asc, nulls={value=first}}}, key={value=REGION_RCM_NUMBER, subArgs={asc, cs}}, key={value=SUB_ACT_CD, subArgs={asc, cs}}, key={value=PI_QUALIFIER_ID, subArgs={asc}}, key={value=PI_ID, subArgs={asc, cs}}}(0) in Aggregator_211)
on nodes (
node1[op37,p0]
node2[op37,p1]
)}
I am not sure what this tells me. As for the design, I guess there are many ways I could accomplish my task.

Design one:

Transformer Stage => Remove_Duplicate Stage => Aggregator Stage.

Grouping is TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID

Count on REGIST_NUM

Running on 2 Nodes.

Incorrect counts; a separate count per group resides on each node.

Design 2

Transformer Stage => Sort Stage => Remove Dup Stage => Aggregator Stage.

Sort Stage runs in Sequential mode.

Correct Amounts

Design 3

Transformer Stage => Remove Dup => Aggregator Stage

Aggregator stage set to HASH and selected grouping

Correct Amounts BUT stage repartitions data.

Design 4

Transformer Stage => Sort Stage => Aggregator Stage

Sort Stage set to Sequential and remove duplicates.

Count correct.


What is the best solution?

Thanks
Jim Stewart
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India

Post by thebird »

Raftsman wrote: Design 4

Transformer Stage => Sort Stage => Aggregator Stage

Sort Stage set to Sequential and remove duplicates.

Count correct.
Thanks
Why not just hash-partition the data into the Sort stage instead of setting the stage to sequential mode? Set the Aggregator method to Sort instead of Hash, and this should give you the desired result.
I believe this to be better than any of the others - unless someone here points out otherwise.

Aneesh
------------------
Aneesh
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

The best solution is always to minimize re-partitioning. An Aggregator aggregates by node, so if all rows sharing the same aggregation keys are not on the same node, your aggregation will be wrong. Auto partitioning will not examine your data to make sure it is partitioned correctly on the Aggregator stage keys (I'm pretty sure, but could be wrong). So if you have partitioned the data using one set of keys and want to aggregate on a different set that might cross nodes, it will not work; you MUST re-partition or go completely sequential.
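The co-location guarantee described above can be checked with a small plain-Python sketch (not DataStage; key values are made up): partitioning on the full set of grouping keys means no key ever spans two partitions, so per-partition aggregates can simply be merged.

```python
# Sketch: hash partitioning on the grouping keys puts all rows that
# share those keys in one partition, so per-partition aggregation
# followed by a simple merge equals the global aggregation.
from collections import Counter

rows = [("2007Q1", "R1"), ("2007Q1", "R1"), ("2007Q1", "R2"),
        ("2007Q2", "R1"), ("2007Q2", "R1"), ("2007Q1", "R2")]

NPART = 2
parts = [[] for _ in range(NPART)]
for row in rows:
    parts[hash(row) % NPART].append(row)   # partition on the grouping keys

# Aggregate each partition independently, then merge the results.
per_part = [Counter(p) for p in parts]
merged = {}
for c in per_part:
    merged.update(c)                        # safe: no key spans two partitions
```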
thamark
Premium Member
Posts: 43
Joined: Thu Jan 29, 2004 12:12 am
Location: US

Post by thamark »

Raftsman wrote: Design one:

Transformer Stage => Remove_Duplicate Stage => Aggregator Stage.

Grouping is TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID

Count on REGIST_NUM

Running on 2 Nodes.

Incorrect counts; a separate count per group resides on each node.

Design 2

Transformer Stage => Sort Stage => Remove Dup Stage => Aggregator Stage.

Sort Stage runs in Sequential mode.

Correct Amounts
Based on the above information, you would like to group by the following columns and summarize the REGIST_NUM column value after removing duplicate records:
TIME_ID, RCM_CD, SUB_ACT_CD, QUAL_ID, PI_ID

You could do

Transformer Stage => Sort Stage (parallel) => Aggregator Stage (parallel).

Sort Stage(parallel):

Hash-partition by the group columns you mentioned, then go to the Sort stage properties: there is an option "Allow Duplicates = True"; set this to False to remove duplicates.

Aggregator Stage(parallel):

Define the partitioning as Same so you don't re-partition the data again.

This should work best.
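The whole recipe — hash-partition on the grouping keys, sort within each partition, drop duplicates, aggregate with Same partitioning downstream — can be sketched in plain Python (not DataStage; column names are illustrative only):

```python
# Sketch of the recipe: partition on grouping keys, per-partition
# sort, duplicate removal, then per-partition aggregation.
from itertools import groupby

# (TIME_ID, RCM_CD, REGIST_NUM) rows, including one exact duplicate
rows = [("T1", "R1", 10), ("T1", "R1", 10),   # duplicate record
        ("T1", "R1", 20), ("T2", "R1", 30)]

NPART = 2
parts = [[] for _ in range(NPART)]
for r in rows:
    parts[hash(r[:2]) % NPART].append(r)      # hash on the grouping keys

result = {}
for p in parts:
    p.sort()                                  # Sort stage, per partition
    deduped = [k for k, _ in groupby(p)]      # Allow Duplicates = False
    # Aggregator: count rows per (TIME_ID, RCM_CD) group
    for key, grp in groupby(deduped, key=lambda r: r[:2]):
        result[key] = len(list(grp))
# ("T1","R1") has 3 rows but one is a duplicate, so its count is 2.
```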
Hmm, I will fill this in whenever I get one
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

The information in the Dump score just tells you that a sort (tsort) has been inserted, and lists the keys used. You should also have another entry in the log stating the partitioning used at each stage. Otherwise, explain the partitioning you used at each stage in your design.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Raftsman
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

I am still confused about how I should be partitioning and sorting data for the Aggregator stage.

I get this message, and I have tried many different scenarios to eliminate it, but it still appears. I am sure it's something simple.

I get this,

Cnt_Instr_Num: When checking operator: User inserted sort "Sort_375" does not fulfill the sort requirements of the downstream operator "APT_SortedGroup2Operator in Cnt_Instr_Num"

Here's my design.

TRANSFORMER>>>SORT >>> AGGREGATOR

TRANSFORMER has the following data

AGR_ID, TIME_ID, PI_QUALIFIER, PI_ID, SUB_ACT_CD, RCM_CD

SORT Keys are

TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID

To ensure that the partitioning is correct for the Aggregator stage, I forced a re-partition on the following fields (not sure if I must do this, but I have run into problems before):

TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID

AGGREGATOR stage group keys are:

TIME_ID, RCM_CD, SUB_ACT_CD, PI_QUALIFIER, PI_ID

Count rows on AGR_ID

Input partitioning is set to SAME


I get this warning in all my jobs and I can't seem to figure out why.

Could someone please enlighten me.

Thanks
Jim Stewart
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Partition by the grouping keys. Make sure that the data are sorted by at least these fields, and in the order specified by the Sort aggregation method if you are using that.
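The sort-order requirement matters because a Sort-method Aggregator (like Python's `itertools.groupby`, used here as a plain-Python stand-in, not DataStage) only merges *adjacent* rows with equal keys — which is also why an unsorted input reproduces the duplicate-group symptom:

```python
# Sketch: a sort-method aggregation only merges adjacent equal keys,
# so unsorted input produces duplicate groups.
from itertools import groupby

rows = ["A", "B", "A", "B"]

unsorted_groups = [(k, len(list(g))) for k, g in groupby(rows)]
# Four groups of 1: the two "A" runs are not adjacent, so they come
# out as separate (duplicate) groups.

sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(rows))]
# Two groups: ("A", 2) and ("B", 2) -- correct totals.
```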
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
say2prabhu1
Participant
Posts: 27
Joined: Mon Jul 09, 2007 1:06 am

Post by say2prabhu1 »

I think you may have missed some columns in the output; make sure every column in the output is either a grouping key or the result of a calculation.
Raftsman
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

In the Sort stage I hash-partition, but I do not check the sort checkbox; I assumed it was not needed. I tried checking it and still got the same warnings. Both my Sort keys and my Aggregator grouping keys are the same.
Jim Stewart