Using a dataset as an intermediate stage between two jobs

Madhu1981 · Post by **Madhu1981** » Tue May 01, 2007 2:54 pm

I have two jobs. The first runs on four nodes and writes its intermediate results to a dataset. The second job reads this dataset and loads the data into DB2. For performance reasons, my team lead has asked me to run the job on 2 nodes, but the dataset I am using was written on 4 nodes. How should I handle this?

I have set Preserve Partitioning to Clear and run the job, but I would like to know whether this approach is really sound. Is there another approach?

Please advise.

ray.wurlod · Post by **ray.wurlod** » Tue May 01, 2007 3:13 pm

A Data Set stage is ideal as the staging area between two jobs because it preserves the internal Data Set structure: internal formats, partitioning and sorting.

You will need to read the Data Set with a configuration file that is compatible with the one used when it was written. You may, therefore, need to re-run the writing job under the new, two-node, configuration.

You simply cannot use a two-node configuration file to read a Data Set that was written with a four-node configuration file. If your team lead says you can, demand to know how.

swades · Post by **swades** » Tue May 01, 2007 3:21 pm

I have a question.

Must we re-run the previous Data Set writing job? Or can we still use that 4-node Data Set for further processing with a 2-node config file, at the expense of performance?

nick.bond · Post by **nick.bond** » Tue May 01, 2007 6:33 pm

I'm not 100% sure about this and have no system to test it on, but I think all the data will still be read if the dataset was created on 4 nodes and you then read it on 2 nodes. You will get a warning about the data being repartitioned, though, and so lose performance. (And because there will be a warning in the job, sequences may fail if you have them set to accept success only.)

vmcburney · Post by **vmcburney** » Wed May 02, 2007 6:24 pm

Given the cost of repartitioning, you may find that restricting both jobs to 2 nodes is faster than running on 4 nodes followed by 2 nodes.

There may be a way to be clever with the configuration file so that you have a node pool of 4 nodes for datasets and 2 nodes for everything else. I don't know enough about pooling to be sure. This may avoid having to rebuild the dataset on 2 nodes.
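
As a rough, untested sketch of the node-pool idea above (not something confirmed in this thread): a parallel configuration file can assign each node to one or more pools, and stages run by default on the nodes in the default pool `""`. In the fragment below, the host name `etlhost`, the pool name `ds4` and all paths are hypothetical. Whether a Data Set written on four nodes can then be read without repartitioning would still need to be verified on a real system.

```
{
  /* Hypothetical sketch: host name, pool name "ds4" and paths are assumptions. */
  /* node1 and node2 belong to the default pool "" and to the "ds4" pool. */
  node "node1"
  {
    fastname "etlhost"
    pools "" "ds4"
    resource disk "/data/datasets/node1" {pools ""}
    resource scratchdisk "/data/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools "" "ds4"
    resource disk "/data/datasets/node2" {pools ""}
    resource scratchdisk "/data/scratch/node2" {pools ""}
  }
  /* node3 and node4 belong to the "ds4" pool only, not to the default pool. */
  node "node3"
  {
    fastname "etlhost"
    pools "ds4"
    resource disk "/data/datasets/node3" {pools ""}
    resource scratchdisk "/data/scratch/node3" {pools ""}
  }
  node "node4"
  {
    fastname "etlhost"
    pools "ds4"
    resource disk "/data/datasets/node4" {pools ""}
    resource scratchdisk "/data/scratch/node4" {pools ""}
  }
}
```

With a layout like this, stages without a node pool constraint would be expected to run on node1 and node2 only, while a stage constrained to the `ds4` pool could use all four nodes.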

vijayrc · Post by **vijayrc** » Wed May 02, 2007 6:29 pm

Madhu1981 wrote:
I have two jobs. The first runs on four nodes and writes its intermediate results to a dataset. The second job reads this dataset and loads the data into DB2. For performance reasons, my team lead has asked me to run the job on 2 nodes, but the dataset I am using was written on 4 nodes. How should I handle this?

I have set Preserve Partitioning to Clear and run the job, but I would like to know whether this approach is really sound. Is there another approach?

Please advise.

You can use the FROM NODES/FROM PARTITIONS variables.

ray.wurlod · Post by **ray.wurlod** » Wed May 02, 2007 6:44 pm

I understood that these properties could only refer to the nodes/partitions in the current configuration file.
