Where data is stored between stages when the job is running

sjordery · Post by **sjordery** » Thu Aug 27, 2009 6:53 am

Hi All,

I have a conceptual question.
When data flows from one stage (process) to another, where is the intermediate data stored, i.e. after the 1st stage finishes reading and before the 2nd stage starts reading?
Does it use any temporary data sets?

Also, when does buffering come into the picture?

Thanks in Advance.

miwinter · Post by **miwinter** » Thu Aug 27, 2009 6:57 am

In virtual datasets and in buffers (files), both in memory and in dataset/scratch space.

viewtopic.php?t=118718&highlight=virtual+dataset

chulett · Post by **chulett** » Thu Aug 27, 2009 8:50 am

Realize that the vast majority of the time there's no... 'break'... between stages; they run in a serial fashion, and a single record can go all the way through the job before the next one starts. So you don't typically have one stage finishing before the next stage even starts.
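To make the "no break between stages" idea concrete, here is a minimal sketch using Python generators. It is only an analogy: the stage names (`read_rows`, `transform`, `write_rows`) and the `trace` list are hypothetical, and a real parallel job runs its operators as separate processes rather than in one thread.

```python
trace = []  # hypothetical log of which stage touched which row, in order


def read_rows(n):
    """Source stage: emits rows one at a time."""
    for i in range(n):
        trace.append(("read", i))
        yield {"id": i}


def transform(rows):
    """Middle stage: processes each row as soon as it arrives."""
    for row in rows:
        trace.append(("transform", row["id"]))
        yield {**row, "doubled": row["id"] * 2}


def write_rows(rows, sink):
    """Target stage: consumes rows as the upstream stage yields them."""
    for row in rows:
        trace.append(("write", row["id"]))
        sink.append(row)


sink = []
write_rows(transform(read_rows(3)), sink)
```

In this sketch row 0 is read, transformed and written before row 1 is read at all, so no stage ever holds the full data set.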

Oritech · Post by **Oritech** » Thu Aug 27, 2009 6:20 pm

Wondering how the parallel engine exploits parallel execution?

chulett · Post by **chulett** » Thu Aug 27, 2009 6:32 pm

By dividing the data up across the 'nodes' by partitioning.
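As a rough illustration of partitioning, the sketch below splits rows across a number of nodes by hashing a key column. The function name `hash_partition`, the column `cust_id` and the node count are assumptions made for the example; this is not the engine's own partitioner.

```python
import zlib


def hash_partition(rows, key, num_nodes):
    """Assign each row to one of num_nodes partitions by hashing its key column."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        # crc32 is stable between runs, unlike Python's built-in hash() for strings
        node = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        partitions[node].append(row)
    return partitions


# Hypothetical input: 8 rows, 4 distinct customer ids, 3 nodes
rows = [{"cust_id": i % 4, "amount": i * 10} for i in range(8)]
parts = hash_partition(rows, "cust_id", num_nodes=3)
```

Each partition can then be processed independently, and rows with the same key always land in the same partition.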

ray.wurlod · Post by **ray.wurlod** » Thu Aug 27, 2009 6:51 pm

There are two schemes of parallelism.

Pipeline parallelism (the one implied by the original question) and partition parallelism (the one implied by Craig's response to Oritech's unrelated question).

You can read about both in the Parallel Job Developer's Guide.

To address the original question, ideally the data are not stored anywhere, but remain resident in memory as they are "passed" from one operator to the next. If there is not enough memory, then the overflow lands on disk, either scratch disk as configured or paging disk, depending on a number of factors.
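The "stay in memory, overflow to disk" behaviour can be mimicked with Python's `tempfile.SpooledTemporaryFile`, which holds data in memory until it grows past `max_size` and then moves it to a temporary file. This is an analogy only: `MEMORY_LIMIT` and `buffer_rows` are hypothetical names, and the engine's real thresholds come from its own configuration.

```python
import tempfile

MEMORY_LIMIT = 1024  # hypothetical in-memory limit, in bytes


def buffer_rows(rows, memory_limit=MEMORY_LIMIT):
    """Hold rows in memory; the spool rolls over to a temp file past the limit."""
    spool = tempfile.SpooledTemporaryFile(max_size=memory_limit, mode="w+")
    for row in rows:
        spool.write(row + "\n")
    spool.seek(0)  # rewind so the downstream reader starts at the first row
    return spool


# About 4 KB of rows, which exceeds the 1 KB limit and so overflows to disk
with buffer_rows(f"row-{i}" for i in range(500)) as spool:
    lines = spool.read().splitlines()
```

The reader sees the same rows either way; only where they were held in the meantime differs.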

The only way that data are "stored" by a parallel job is if you have a stage type that causes the data to be stored.

Associated with each link is a "virtual Data Set" (a data set structure in memory). Each of these is managed as a configurable number of buffers (usually two) - one is being written to by the upstream, or producer, operator while the other is being read from by the downstream, or consumer, operator. The thresholds at which buffers are switched, and/or at which buffers begin to resist input, are also configurable.
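A minimal sketch of the two-buffer producer/consumer arrangement, using a bounded `queue.Queue` to stand in for the link. `NUM_BUFFERS`, `BLOCK_SIZE` and the function names are assumptions for illustration, and threads stand in for what would be separate operator processes.

```python
import queue
import threading

NUM_BUFFERS = 2  # assumption, mirroring "usually two" above
BLOCK_SIZE = 4   # hypothetical number of rows per buffer
_DONE = object() # sentinel marking end of data


def producer(rows, link):
    """Upstream operator: fills a buffer, then hands it to the link."""
    block = []
    for row in rows:
        block.append(row)
        if len(block) == BLOCK_SIZE:
            link.put(block)  # blocks here while all buffers are full
            block = []
    if block:
        link.put(block)      # flush the final, partly filled buffer
    link.put(_DONE)


def consumer(link, out):
    """Downstream operator: drains one buffer while the producer fills another."""
    while True:
        block = link.get()
        if block is _DONE:
            break
        out.extend(block)


def run_link(rows):
    link = queue.Queue(maxsize=NUM_BUFFERS)
    out = []
    threads = [
        threading.Thread(target=producer, args=(rows, link)),
        threading.Thread(target=consumer, args=(link, out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


result = run_link(list(range(10)))
```

When the consumer falls behind, `link.put` blocks the producer, which is a crude form of a buffer "resisting input".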