Partitioning Problem

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

mydsworld
Participant
Posts: 321
Joined: Thu Sep 07, 2006 3:55 am

Partitioning Problem

Post by mydsworld »

Because of partitioning, I often find that operations such as calculating a max value, or the Funnel sequencing option, work on each partition separately. The results are then wrong for the data as a whole: the value or the sequence is correct within each partition but not across the whole data set.

How do I overcome that? Is forcing the stage to run in sequential mode the only solution?

Please advise.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If you want one record for the whole input file or data set, then yes, you can force the stage to run in sequential mode.
But if you want a result per key, then the partitioning that was done earlier is incorrect.
Key-based partitioning should use the same key as the operation you are performing, such as aggregation or Remove Duplicates.
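To illustrate the point about partitioning on the operation's key, here is a minimal Python sketch (not DataStage code). The function names, the sample rows, and the simple character-sum hash are all assumptions made for the illustration. It shows a per-key max computed independently on each partition, first with round-robin partitioning and then with key-based partitioning.

```python
def partition_rows(rows, num_partitions, chooser):
    """Distribute rows over partitions; chooser(index, row) picks the target."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[chooser(i, row) % num_partitions].append(row)
    return parts

def max_per_key(partition):
    """Per-key max, as one Aggregator instance would compute it on its own partition."""
    result = {}
    for key, value in partition:
        if key not in result or value > result[key]:
            result[key] = value
    return result

def aggregate_in_parallel(parts):
    """Run the aggregation on each partition independently and concatenate the outputs."""
    out = []
    for part in parts:
        out.extend(sorted(max_per_key(part).items()))
    return out

# Hypothetical sample data: (key, value) pairs.
rows = [("A", 10), ("B", 7), ("A", 25), ("B", 3), ("A", 5), ("B", 9)]

# Round robin: rows with the same key are spread over several partitions.
round_robin = partition_rows(rows, 3, lambda i, row: i)

# Key-based: a deterministic hash of the key decides the partition
# (Python's built-in hash() for strings varies between runs, so it is avoided here).
keyed = partition_rows(rows, 3, lambda i, row: sum(ord(c) for c in row[0]))

round_robin_result = aggregate_in_parallel(round_robin)
keyed_result = aggregate_in_parallel(keyed)

print(round_robin_result)  # each key shows up once per partition that held it
print(keyed_result)        # each key shows up exactly once, with its true max
```

With key-based partitioning every row for a given key lands on the same partition, so the per-partition result for that key is also the overall result.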
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You could partition the single-row maximum value using the Entire partitioning algorithm. That way it would be the same on every node.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

Ray, wouldn't that be a lot of extra load on resources, that is, sending all rows through every node?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Of course. But your original request sought "any other solution".
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

So what would be an ideal solution for, say, computing a max with an Aggregator in a multi-node scenario? Let's say you have 3 nodes, with 100 rows going through each node. How would you get a proper max value across all nodes? Or does DataStage handle it automatically, which I think it does?
mydsworld
Participant
Posts: 321
Joined: Thu Sep 07, 2006 3:55 am

Post by mydsworld »

Good question.

But I doubt whether DS does it automatically.

Any thoughts?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you have an Aggregator after collection to sequential mode you will be able to derive the maximum value from all partitions on its output.
mydsworld
Participant
Posts: 321
Joined: Thu Sep 07, 2006 3:55 am

Post by mydsworld »

Even if you have a sequential-mode input such as

Seq File -> Aggregator -> ...

the Aggregator will still run on 3 nodes and compute the aggregation for each partition.

Ray, I am not sure whether I understood your point correctly.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

ParallelStuff  ----> [Collector] Aggregator  ----> Target
Set the Aggregator stage to run in sequential mode and to use Sort/Merge collector.
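As a rough illustration of that layout, here is a minimal Python sketch (not DataStage code). It assumes each partition is already sorted, which is what a Sort/Merge collector expects; the function names and sample values are hypothetical. The sorted partitions are merged into one stream, and a single aggregator instance then sees every row.

```python
import heapq

# Hypothetical partitions, each already sorted on the value being collected.
partitions = [
    [3, 18, 42],
    [7, 11, 95],
    [1, 56, 60],
]

def sort_merge_collect(parts):
    """Merge already-sorted partitions into a single sorted stream."""
    return heapq.merge(*parts)

def sequential_max(stream):
    """A single aggregator instance reading the whole collected stream."""
    current = None
    for value in stream:
        if current is None or value > current:
            current = value
    return current

collected = list(sort_merge_collect(partitions))
overall_max = sequential_max(collected)

print(collected)
print(overall_max)
```

Because only one instance reads the collected stream, the max is the max of the whole data set, not of one partition.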
shiva_reddys447
Participant
Posts: 21
Joined: Sat Sep 08, 2007 12:04 am
Location: bangalore

Post by shiva_reddys447 »

Use a stage variable in a Transformer, say Cnt.

Set the initial value of Cnt to 0.

Use the derivation below for the column that generates the sequence.

If Cnt=0 then @PARTITIONNUM+1 Else Cnt+@NUMPARTITIONS
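To see what this derivation produces, here is a small Python simulation (not DataStage code). It assumes the expression is evaluated once per row as the derivation of the stage variable Cnt and that the output column takes the value of Cnt; the function name and the row counts are hypothetical.

```python
def sequence_for_partition(partition_num, num_partitions, row_count):
    """Simulate the Cnt derivation on one partition (partition_num is zero-based)."""
    cnt = 0  # initial value of the stage variable
    generated = []
    for _ in range(row_count):
        # If Cnt=0 then @PARTITIONNUM+1 Else Cnt+@NUMPARTITIONS
        cnt = partition_num + 1 if cnt == 0 else cnt + num_partitions
        generated.append(cnt)
    return generated

num_partitions = 3
rows_per_partition = [4, 3, 2]  # deliberately uneven

sequences = [
    sequence_for_partition(p, num_partitions, n)
    for p, n in enumerate(rows_per_partition)
]
all_values = sorted(v for seq in sequences for v in seq)

print(sequences)
print(all_values)
```

Each partition steps through its own interleaved series, so the values never collide across partitions. They are only gap-free when every partition processes the same number of rows.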
OddJob
Participant
Posts: 163
Joined: Tue Feb 28, 2006 5:00 am
Location: Sheffield, UK

Post by OddJob »

If you're aggregating only a small data size then using the Aggregator in Sequential mode is going to be fine.

If you have a lot of data, use an Aggregator that runs in parallel mode, then feed its output into another Aggregator that runs in sequential mode to achieve the desired result, i.e.

Partitioned Data -> Aggregator (Parallel: gives aggregations per node, large number of records) -> Aggregator (Sequential: aggregates across the nodes, small number of records)
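The two-step layout can be sketched in Python as follows (not DataStage code; the function names and sample rows are assumptions for illustration). The first function stands in for the parallel Aggregator working on one partition, the second for the sequential Aggregator that combines the small per-partition results.

```python
def parallel_aggregate(partition):
    """Step 1: per-key max within one partition (one instance per node)."""
    result = {}
    for key, value in partition:
        if key not in result or value > result[key]:
            result[key] = value
    return result

def sequential_aggregate(partials):
    """Step 2: combine the per-partition results in a single instance."""
    final = {}
    for partial in partials:
        for key, value in partial.items():
            if key not in final or value > final[key]:
                final[key] = value
    return final

# Hypothetical partitioned data: (key, value) rows on three nodes.
partitions = [
    [("A", 10), ("B", 3), ("A", 4)],
    [("B", 7), ("A", 5)],
    [("A", 25), ("B", 9), ("B", 2)],
]

partials = [parallel_aggregate(p) for p in partitions]
final = sequential_aggregate(partials)

print(partials)
print(final)
```

This works directly for max, min, and sum, because combining the partial results gives the same answer as aggregating all rows at once. An average would need the first step to output a sum and a count, with the division done in the second step.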