I understand that partition parallelism is something we can control, i.e. we can choose whether a stage processes data in a partitioned manner or runs in sequential mode.
I was wondering if we could say the same about pipeline parallelism. I don't see any way to control pipelining other than doing away with buffers completely. But I am sure that DataStage handles this in some way internally, for example in stages like Aggregator, Sort, etc.
Would appreciate it if someone can explain how this happens and if we can control it like we can control partitioning.
Does the "Method" property in an Aggregator have anything to do with this?
Thanks in advance
Partition & Pipeline Parallelism
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
Can someone help me understand this? Thank You
Jerome
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
Whether in sequential mode or not, a parallel job executes logically as
op0 | op1 | op2 | op3...
So pipeline parallelism is a given. Sequential mode simply means that the operator executes on a single node.
Data actually move through virtual Data Sets. There is one of these per link. Each consists (by default) of two 3 MB buffers, though both of those numbers are configurable. When data have to be transmitted between nodes (for example during re-partitioning), "transport buffers" are used. Again, their size is configurable.
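To make the op0 | op1 | op2 picture concrete, here is a minimal Python sketch of pipelining through bounded buffers. It is only an analogy and not how DataStage is implemented: the names run_operator, run_pipeline and buffer_rows are hypothetical, threads stand in for operator processes, and the buffer is sized in rows rather than bytes.

```python
import queue
import threading

SENTINEL = object()  # marks end of data on a link


def run_operator(func, inbound, outbound):
    """Read rows from the inbound link, apply func, write to the outbound link."""
    while True:
        row = inbound.get()
        if row is SENTINEL:
            outbound.put(SENTINEL)  # pass end-of-data downstream
            return
        outbound.put(func(row))  # blocks while the downstream buffer is full


def run_pipeline(rows, funcs, buffer_rows=2):
    """Run funcs as concurrent stages joined by bounded queues (the "links")."""
    # One bounded queue per link: source -> op0 -> op1 -> ... -> collector.
    links = [queue.Queue(maxsize=buffer_rows) for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=run_operator, args=(f, links[i], links[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for worker in workers:
        worker.start()

    def feed():
        # The source writes rows as fast as the first buffer allows.
        for row in rows:
            links[0].put(row)
        links[0].put(SENTINEL)

    feeder = threading.Thread(target=feed)
    feeder.start()

    # Collect output while upstream stages are still running.
    results = []
    while True:
        row = links[-1].get()
        if row is SENTINEL:
            break
        results.append(row)

    feeder.join()
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 10]))
```

Every stage starts working as soon as its first row arrives, so all stages run at once. The buffer size only decides how far a fast producer can run ahead of a slow consumer; it does not switch pipelining on or off.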
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.