where data is stored between stages when the job is running

sjordery
Premium Member
Posts: 202
Joined: Thu Jun 08, 2006 5:58 am

Post by sjordery »

Hi All,

I have a conceptual doubt.
When data flows from one stage (process) to another, where is the intermediate data stored, i.e. after the first stage finishes reading and before the second stage starts reading?
Does it use any temporary data sets?

Also, when does buffering come into the picture?

Thanks in Advance.
miwinter
Participant
Posts: 396
Joined: Thu Jun 22, 2006 7:00 am
Location: England, UK

Post by miwinter »

In virtual datasets and in buffers (files), both in memory and in dataset/scratch space.

viewtopic.php?t=118718&highlight=virtual+dataset
Mark Winter
Nothing appeases a troubled mind more than good music
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Realize that the vast majority of the time there's no... 'break'... between stages. They run in a serial fashion, and a single record can go all the way through the job before the next one starts. So you don't typically have one stage finishing before the next stage even starts.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Oritech
Premium Member
Posts: 140
Joined: Thu May 07, 2009 9:32 pm

Post by Oritech »

Wondering: how does the parallel engine exploit parallel execution?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

By dividing the data up across the 'nodes' by partitioning.
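
As a rough illustration of the idea (not DataStage code), the sketch below hash-partitions records across a fixed number of nodes in plain Python. The function name `partition_by_key`, the `cust_id` column, and the node count are all hypothetical, and the real engine offers several partitioning methods beyond hashing.

```python
import zlib
from collections import defaultdict


def partition_by_key(records, key, node_count):
    """Assign each record to a node number by hashing its key column."""
    partitions = defaultdict(list)
    for record in records:
        # crc32 is stable across runs, unlike Python's salted hash() for strings.
        node = zlib.crc32(str(record[key]).encode("utf-8")) % node_count
        partitions[node].append(record)
    return dict(partitions)


# Hypothetical sample data: eight rows spread over two nodes.
rows = [{"cust_id": i, "amount": i * 10} for i in range(8)]
parts = partition_by_key(rows, "cust_id", node_count=2)
```

Rows with the same key always land on the same node, so each node can work on its share of the data independently.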
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There are two schemes of parallelism: pipeline parallelism (the one implied by the original question) and partition parallelism (the one implied by Craig's response to Oritech's unrelated question).

You can read about both in the Parallel Job Developer's Guide.

To address the original question, ideally the data are not stored anywhere, but remain resident in memory as they are "passed" from one operator to the next. If there is not enough memory, then the overflow lands on disk, either scratch disk as configured or paging disk, depending on a number of factors.
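
The "memory first, disk on overflow" behaviour can be pictured with an analogy from Python's standard library. This is only an analogy, not how the parallel engine is implemented: `tempfile.SpooledTemporaryFile` holds data in memory until it grows past `max_size`, then rolls over to a real temporary file. The `MEMORY_LIMIT` value and the `pass_through` helper are hypothetical.

```python
import tempfile

MEMORY_LIMIT = 64  # hypothetical threshold in bytes before spilling to disk


def pass_through(records):
    """Hand records from a writer to a reader via a spooled temporary file."""
    with tempfile.SpooledTemporaryFile(max_size=MEMORY_LIMIT, mode="w+") as spool:
        for record in records:
            # Stays in memory until MEMORY_LIMIT is exceeded, then rolls over to disk.
            spool.write(record + "\n")
        spool.seek(0)
        return spool.read().splitlines()


# 100 short rows exceed the 64-byte limit, so the overflow lands on disk.
rows = pass_through(f"row-{i}" for i in range(100))
```

The reader gets the same records back either way; only where they were held in the meantime differs.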

The only way that data are "stored" by a parallel job is if you have a stage type that causes the data to be stored.

Associated with each link is a "virtual Data Set" (a data set structure in memory). Each of these is managed as a configurable number of buffers (usually two) - one is being written to by the upstream, or producer, operator while the other is being read from by the downstream, or consumer, operator. Thresholds for switching buffers and/or buffers beginning to resist input are also configurable.
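
The producer/consumer arrangement described above can be sketched, very loosely, with two Python threads joined by a bounded queue. This is a toy model under stated assumptions, not the engine's actual buffering code: `BUFFER_COUNT`, `BUFFER_SIZE`, and the doubling step in the consumer are made up for illustration, and the real thresholds are set through the engine's own configuration.

```python
import queue
import threading

BUFFER_COUNT = 2   # assumption: two buffers per link, as described above
BUFFER_SIZE = 4    # hypothetical number of records per buffer
_DONE = object()   # sentinel marking the end of the data


def producer(records, link):
    """Upstream operator: fill a buffer, hand it over, start the next one."""
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) == BUFFER_SIZE:
            link.put(buffer)  # blocks while the link is full (back-pressure)
            buffer = []
    if buffer:
        link.put(buffer)      # flush the final, partly filled buffer
    link.put(_DONE)


def consumer(link, output):
    """Downstream operator: drain each buffer as it arrives."""
    while True:
        buffer = link.get()
        if buffer is _DONE:
            break
        output.extend(record * 2 for record in buffer)


def run_pipeline(records):
    link = queue.Queue(maxsize=BUFFER_COUNT)  # stands in for the virtual Data Set
    output = []
    threads = [
        threading.Thread(target=producer, args=(records, link)),
        threading.Thread(target=consumer, args=(link, output)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output


result = run_pipeline(range(10))
```

Because the queue is bounded, a slow consumer makes the producer wait rather than letting data pile up without limit, which is roughly what "resisting input" means here.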
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.