Partition & Pipeline Parallelism

jerome_rajan · Post by **jerome_rajan** » Mon Apr 30, 2012 11:16 pm

I understand that partition parallelism is something that we can control, i.e., we can choose whether a stage should process data in a partitioned manner or run in sequential mode.

I was wondering if we could say the same about pipeline parallelism. I don't see any way to control pipelining other than completely doing away with buffers. But I am also sure that DataStage does handle this in some way internally, for example in stages like Aggregator, Sort, etc.

Would appreciate it if someone can explain how this happens and if we can control it like we can control partitioning.

Does the "Method" property in an Aggregator have anything to do with this?

Thanks in advance

jerome_rajan · Post by **jerome_rajan** » Sat May 05, 2012 10:44 am

Can someone help me understand this? Thank You

ray.wurlod · Post by **ray.wurlod** » Sat May 05, 2012 4:29 pm

Whether in sequential mode or not, a parallel job executes logically as
op0 | op1 | op2 | op3...

So pipeline parallelism is a given. Sequential mode simply means that each operator executes on a single node.
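To illustrate the `op0 | op1 | op2` picture, here is a minimal Python sketch using generators. It is only an analogy: the operator names, the row values and the `log` list are made up for illustration, and generators run in a single thread, so this shows just the row-at-a-time data flow, not real concurrently running operators. The point is that the downstream operators see the first row before the upstream operator has produced the second.

```python
def op0(n, log):
    """Source operator: emits rows one at a time."""
    for i in range(n):
        log.append(("op0", i))
        yield i


def op1(rows, log):
    """Transform operator: handles each row as soon as it arrives."""
    for row in rows:
        log.append(("op1", row))
        yield row * 10


def op2(rows, log):
    """Target operator: collects whatever comes down the pipe."""
    out = []
    for row in rows:
        log.append(("op2", row))
        out.append(row)
    return out


log = []
result = op2(op1(op0(3, log), log), log)  # op0 | op1 | op2
```

Inspecting `log` afterwards shows the entries interleaved per row (op0, op1, op2, then op0 again) rather than grouped per operator, which is the essence of pipelining.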

Data actually move through virtual Data Sets. There is one of these per link. Each consists (by default) of two 3MB buffers, though both of those numbers are configurable. When data have to be transmitted between nodes (for example, during re-partitioning), "transport buffers" are used. Again, these are configurable for size.
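As a rough model of one link with two buffers, the sketch below connects a producer and a consumer thread through a bounded queue. This is an assumption-laden analogy, not how DataStage is implemented: `BLOCK_ROWS`, `NUM_BUFFERS`, `producer`, `consumer` and `run` are hypothetical names, and a small row count stands in for the 3MB buffer size. It shows why the upstream operator stalls once both buffers are full and resumes when the downstream operator drains one.

```python
import queue
import threading

BLOCK_ROWS = 4   # hypothetical rows per buffer (stand-in for the 3MB size)
NUM_BUFFERS = 2  # two buffers per link, as described above


def producer(link, rows):
    """Upstream operator: fills a buffer, then hands it to the link."""
    block = []
    for row in rows:
        block.append(row)
        if len(block) == BLOCK_ROWS:
            link.put(block)  # blocks while both buffers are still full
            block = []
    if block:
        link.put(block)      # flush the final, partly filled buffer
    link.put(None)           # end-of-data marker


def consumer(link, out):
    """Downstream operator: drains one buffer at a time."""
    while True:
        block = link.get()
        if block is None:
            break
        out.extend(block)


def run(n_rows):
    link = queue.Queue(maxsize=NUM_BUFFERS)  # the "virtual Data Set"
    out = []
    up = threading.Thread(target=producer, args=(link, range(n_rows)))
    down = threading.Thread(target=consumer, args=(link, out))
    up.start()
    down.start()
    up.join()
    down.join()
    return out
```

Calling `run(10)` returns the ten rows in their original order. Changing `NUM_BUFFERS` or `BLOCK_ROWS` changes how far the producer can run ahead of the consumer, which is the kind of tuning the configurable buffer count and size allow.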