Partition & Pipeline Parallelism

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Partition & Pipeline Parallelism

Post by jerome_rajan »

I understand that partition parallelism is something that we can control, i.e. we can choose whether a stage should process data in a partitioned manner or run in sequential mode.

I was wondering if we could say the same about pipeline parallelism. I don't see any way to control pipelining other than completely doing away with buffers. But I am also sure that DataStage handles this in some way internally, e.g. in stages like Aggregator, Sort, etc.

Would appreciate it if someone can explain how this happens and if we can control it like we can control partitioning.

Does the "Method" property in an Aggregator have anything to do with this?

Thanks in advance
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

Can someone help me understand this? Thank You
Jerome
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Whether in sequential mode or not, a parallel job executes logically as
op0 | op1 | op2 | op3...

So pipeline parallelism is a given. Sequential mode simply means that the operator executes as a single process on a single node.
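
To make the op0 | op1 | op2 picture concrete, here is a minimal Python sketch of pipeline parallelism. It is only an analogy, not how the parallel engine is implemented: each "operator" is a thread, each "link" is a bounded queue, and the names run_pipeline, _stage and buffers_per_link are made up for the illustration. A downstream operator starts work as soon as the first record arrives, and a full queue makes the upstream operator wait.

```python
import queue
import threading

_END = object()  # sentinel marking end of data on a link (illustrative)


def _stage(func, inbox, outbox):
    """Run one operator: read a record, transform it, pass it downstream."""
    while True:
        record = inbox.get()
        if record is _END:
            outbox.put(_END)  # propagate end-of-data to the next operator
            return
        outbox.put(func(record))


def run_pipeline(records, operators, buffers_per_link=2):
    """Push records through operators connected by bounded queues.

    buffers_per_link is a hypothetical knob: a full queue blocks the
    upstream operator until the downstream one catches up.
    """
    links = [queue.Queue(maxsize=buffers_per_link)
             for _ in range(len(operators) + 1)]
    workers = [
        threading.Thread(target=_stage, args=(op, links[i], links[i + 1]))
        for i, op in enumerate(operators)
    ]
    results = []

    def drain():
        # Collect the output of the last operator.
        while True:
            item = links[-1].get()
            if item is _END:
                return
            results.append(item)

    sink = threading.Thread(target=drain)
    for t in workers + [sink]:
        t.start()

    # Feed the first link; all operators run concurrently from here on.
    for record in records:
        links[0].put(record)
    links[0].put(_END)

    for t in workers + [sink]:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 10]))
```

Nothing in the sketch switches pipelining on or off: it follows from connecting the operators with links, which is the point being made above.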

Data actually move through virtual Data Sets, one per link. By default each consists of two 3MB buffers, though both the number and the size are configurable. When data have to be transmitted between nodes (for example during re-partitioning), "transport buffers" are used; their size is also configurable.
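
As a rough illustration of what re-partitioning means, the sketch below redistributes records by a key so that rows with the same key value always land in the same partition. It is an assumption-laden toy: hash_partition, the cust column and the use of CRC32 are invented for the example and are not the engine's hash partitioner.

```python
import zlib


def hash_partition(records, key, num_partitions):
    """Assign each record to a partition based on a hash of its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # CRC32 is used only because it is deterministic across runs.
        digest = zlib.crc32(str(record[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(record)
    return partitions


if __name__ == "__main__":
    rows = [{"cust": c, "amt": i} for i, c in enumerate("abacba")]
    for node, part in enumerate(hash_partition(rows, "cust", 3)):
        print(node, part)
```

In a real job, any record that ends up on a different node from the one it started on has to be moved, which is where the transport buffers mentioned above come in.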
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.