Partitioning Problem

mydsworld · Post by **mydsworld** » Fri Apr 25, 2008 11:47 am

Because of partitioning, I often run into the problem where, say, a max value or the Funnel sequencing option is applied per partition. When I look at the result, the value or the sequence is not correct (in the sense that it is true for each partition but not for the data as a whole).

How do I overcome that? Is forcing the stage to run in sequential mode the only solution?

Please advise.

kumar_s · Post by **kumar_s** » Fri Apr 25, 2008 12:04 pm

If you want one record for the whole input file/data then yes, you can force it to sequential mode.
But if you are looking for a result per key, then the partitioning that was done earlier is incorrect.
Key-based partitioning should be done on the key on which you are performing operations such as Aggregation or Remove Duplicates.
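
The point about key-based partitioning can be illustrated outside DataStage. The following is only a minimal Python sketch, not DataStage code: `hash_partition`, `max_per_key`, and the `cust`/`amt` column names are hypothetical, and `zlib.crc32` stands in for whatever hash function the engine actually uses. It shows that when rows are partitioned on the aggregation key, every row for a given key lands in the same partition, so the per-partition results are already correct per key.

```python
import zlib

def hash_partition(rows, key, num_partitions):
    """Route each row to a partition chosen from a hash of its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode()) % num_partitions
        partitions[index].append(row)
    return partitions

def max_per_key(partition, key, value):
    """Aggregate one partition: maximum of `value` for each `key`."""
    result = {}
    for row in partition:
        k = row[key]
        result[k] = max(result.get(k, row[value]), row[value])
    return result

# Hypothetical sample data.
rows = [
    {"cust": "A", "amt": 10},
    {"cust": "B", "amt": 7},
    {"cust": "A", "amt": 25},
    {"cust": "C", "amt": 3},
    {"cust": "B", "amt": 12},
]

partitions = hash_partition(rows, "cust", 3)
per_partition = [max_per_key(p, "cust", "amt") for p in partitions]

# Each key lives in exactly one partition, so merging the
# per-partition results never has to reconcile duplicates.
combined = {}
for partial in per_partition:
    combined.update(partial)
```

If the data had instead been partitioned round robin, the same key could appear in several partitions and each partition would report its own maximum for that key.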

ray.wurlod · Post by **ray.wurlod** » Fri Apr 25, 2008 3:07 pm

You could partition the single-row maximum value using the Entire partitioning algorithm. That way it would be the same on every node.

abc123 · Post by **abc123** » Wed Apr 30, 2008 9:41 pm

Ray, wouldn't that be a lot of extra load on resources, that is, sending all rows through every node?

ray.wurlod · Post by **ray.wurlod** » Wed Apr 30, 2008 10:28 pm

Of course. But your original request sought "any other solution".

abc123 · Post by **abc123** » Thu May 01, 2008 8:17 am

So what would be an ideal solution for, say, computing a max using an Aggregator in a multi-node scenario? Let's say you have 3 nodes and 100 rows are going through each node. How would you get a proper max value across all nodes? Or does DataStage handle it automatically, which I think it does?

mydsworld · Post by **mydsworld** » Thu May 01, 2008 9:51 am

Good question.

But I doubt whether DS does it automatically.

Any thoughts?

ray.wurlod · Post by **ray.wurlod** » Thu May 01, 2008 2:41 pm

If you have an Aggregator after collection to sequential mode you will be able to derive the maximum value from all partitions on its output.

mydsworld · Post by **mydsworld** » Thu May 01, 2008 10:13 pm

Even if you have sequential mode input like

Seq File -> Aggregator-> ...

even then the Aggregator will run on 3 nodes and compute the aggregation for each partition.

Ray, I am not sure whether I got your point correctly.

ray.wurlod · Post by **ray.wurlod** » Thu May 01, 2008 10:59 pm

ParallelStuff  ----> [Collector] Aggregator  ----> Target

Set the Aggregator stage to run in sequential mode and to use Sort/Merge collector.

shiva_reddys447 · Post by **shiva_reddys447** » Wed May 07, 2008 3:02 am

Take a stage variable in the Transformer, let's say Cnt.

Initial value of Cnt = 0.

Use the derivation below for the sequence-generating column.

If Cnt=0 then @PARTITIONNUM+1 Else Cnt+@NUMPARTITIONS
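
To see what this derivation produces, here is a small Python simulation. It is a sketch only: `sequence_numbers` is a hypothetical helper, `partition_num` and `num_partitions` stand in for `@PARTITIONNUM` and `@NUMPARTITIONS`, and it assumes Cnt is updated with the generated value on every row.

```python
def sequence_numbers(partition_num, num_partitions, row_count):
    """Simulate the Cnt derivation for one partition."""
    cnt = 0  # initial value of the stage variable
    generated = []
    for _ in range(row_count):
        # If Cnt=0 then @PARTITIONNUM+1 Else Cnt+@NUMPARTITIONS
        cnt = partition_num + 1 if cnt == 0 else cnt + num_partitions
        generated.append(cnt)
    return generated

num_partitions = 3
generated = [sequence_numbers(p, num_partitions, 3) for p in range(num_partitions)]
# generated == [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```

Each partition steps through its own interleaved series, so the values never collide across partitions. The combined sequence is free of gaps only when every partition processes the same number of rows.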

OddJob · Post by **OddJob** » Wed May 07, 2008 4:32 am

If you're aggregating only a small data size then using the Aggregator in Sequential mode is going to be fine.

If you have a lot of data, use an Aggregator that runs in parallel mode, then feed its output into a second Aggregator running in sequential mode to get the desired result, i.e.

Partitioned Data -> Aggregator (Parallel: gives aggregations per node, large number of records) -> Aggregator (Sequential: aggregates across the nodes, small number of records)
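
The two-step pattern can be sketched in Python as follows. This is an illustration rather than DataStage code: the three lists stand in for three partitions, and `local_max` is a hypothetical stand-in for the parallel Aggregator.

```python
def local_max(partition):
    """First step: each node reduces its own partition to one value."""
    return max(partition) if partition else None

# Hypothetical data spread over three nodes.
partitions = [[4, 17, 9], [23, 1, 8], [15, 6, 11]]

# Parallel step: one partial result per node.
per_node = [local_max(p) for p in partitions]

# Sequential step: combine the few partial results into the true maximum.
overall = max(m for m in per_node if m is not None)
```

The sequential step only sees one row per node, not the full data volume. The same shape works for min, sum, and count; an average cannot be combined this way directly and would need per-node sums and counts carried through to the second step.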