where data is stored between stages when the job is running

Posted: Thu Aug 27, 2009 6:53 am
by sjordery
Hi All,

I have a conceptual question.
When data flows from one stage (process) to another, where is the intermediate data stored, i.e. after the first stage finishes reading and before the second stage starts reading?
Does it use any temporary data sets?

Also, when does buffering come into the picture?

Thanks in Advance.

Posted: Thu Aug 27, 2009 6:57 am
by miwinter
In virtual datasets and in buffers (files) both in memory and in dataset/scratch space.

viewtopic.php?t=118718&highlight=virtual+dataset

Posted: Thu Aug 27, 2009 8:50 am
by chulett
Realize that the vast majority of the time there's no... 'break'... between stages; they run in a serial fashion, and a single record can go all the way through the job before the next one starts. So you don't typically have one stage finishing before the next stage even starts.
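
As a rough illustration of a record travelling through every stage before the next one is read, here is a small Python sketch using generators. This is only an analogy: the stage names (read_stage, transform_stage, write_stage) and the log list are made up for the example, and the real parallel engine runs stages as separate processes that pass blocks of records, not single rows.

```python
def read_stage(rows, log):
    # Hypothetical source stage: hands rows downstream one at a time.
    for row in rows:
        log.append(("read", row))
        yield row


def transform_stage(rows, log):
    # Hypothetical transform stage: pulls a row only when asked for one.
    for row in rows:
        log.append(("transform", row))
        yield row * 10


def write_stage(rows, log):
    # Hypothetical target stage: drives the whole chain by consuming rows.
    out = []
    for row in rows:
        log.append(("write", row))
        out.append(row)
    return out


log = []
result = write_stage(transform_stage(read_stage([1, 2], log), log), log)
```

The log shows row 1 being read, transformed and written before row 2 is read at all, so no stage "finishes" before the next one starts.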

Posted: Thu Aug 27, 2009 6:20 pm
by Oritech
Wondering how the parallel engine exploits parallel execution?

Posted: Thu Aug 27, 2009 6:32 pm
by chulett
By dividing the data up across the 'nodes' by partitioning.
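
To make that concrete, here is a minimal Python sketch of key-based (hash) partitioning. It is an assumption-laden illustration rather than the engine's actual algorithm: partition_by_key, the "cust" column and the use of CRC32 as the hash are all invented for the example.

```python
import zlib


def partition_by_key(records, key, node_count):
    """Assign each record to one of node_count partitions by hashing a key column.

    Illustrative only: the hash function and record layout are assumptions.
    """
    partitions = [[] for _ in range(node_count)]
    for rec in records:
        # Same key value -> same hash -> same partition ("node").
        idx = zlib.crc32(str(rec[key]).encode("utf-8")) % node_count
        partitions[idx].append(rec)
    return partitions


records = [
    {"cust": "A", "amt": 1},
    {"cust": "B", "amt": 2},
    {"cust": "A", "amt": 3},
    {"cust": "C", "amt": 4},
]
parts = partition_by_key(records, "cust", 2)
```

Each partition can then be processed independently on its own node, and all rows for a given key stay together.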

Posted: Thu Aug 27, 2009 6:51 pm
by ray.wurlod
There are two schemes of parallelism.

Pipeline parallelism (the one implied by the original question) and partition parallelism (the one implied by Craig's response to Oritech's unrelated question).

You can read about both in the Parallel Job Developer's Guide.

To address the original question, ideally the data are not stored anywhere, but remain resident in memory as they are "passed" from one operator to the next. If there is not enough memory, then the overflow lands on disk, either scratch disk as configured or paging disk, depending on a number of factors.

The only way that data are "stored" by a parallel job is if you have a stage type that causes the data to be stored.

Associated with each link is a "virtual Data Set" (a data set structure in memory). Each of these is managed as a configurable number of buffers (usually two) - one is being written to by the upstream, or producer, operator while the other is being read from by the downstream, or consumer, operator. Thresholds for switching buffers and/or buffers beginning to resist input are also configurable.
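
The producer/consumer arrangement on a link can be sketched in Python roughly as follows. This is a simplified analogy, not the engine's implementation: a bounded queue of two blocks stands in for the pair of buffers, threads stand in for operator processes, and BLOCK_SIZE, BUFFER_COUNT and run_link are hypothetical names chosen for the example. Nothing here spills to scratch disk.

```python
import queue
import threading

BLOCK_SIZE = 3     # records per buffer block (hypothetical value)
BUFFER_COUNT = 2   # two buffers per link, as described above
_END = object()    # sentinel marking end of data


def producer(records, link):
    # Upstream operator: fills a block, then hands it to the link.
    block = []
    for rec in records:
        block.append(rec)
        if len(block) == BLOCK_SIZE:
            link.put(block)  # waits here while both buffers are full
            block = []
    if block:
        link.put(block)      # flush the final partial block
    link.put(_END)


def consumer(link, out):
    # Downstream operator: reads blocks as soon as they are available.
    while True:
        block = link.get()
        if block is _END:
            break
        out.extend(rec.upper() for rec in block)


def run_link(records):
    link = queue.Queue(maxsize=BUFFER_COUNT)
    out = []
    workers = [
        threading.Thread(target=producer, args=(records, link)),
        threading.Thread(target=consumer, args=(link, out)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return out
```

Calling run_link(["a", "b", "c", "d"]) passes the records across the link in blocks; when the consumer falls behind, link.put blocks the producer, which is the sketch's stand-in for a buffer "resisting" input.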