
Dataset used as an intermediate stage between two jobs

Posted: Tue May 01, 2007 2:54 pm
by Madhu1981
I have two jobs. The first job runs on four nodes and loads its intermediate results to a dataset. My second job then reads this dataset and loads the data to DB2. For performance reasons, my team lead asked me to run the second job on 2 nodes, but the dataset it reads was written on 4 nodes. How do I handle this?

I have set Preserve Partitioning to Clear and ran the job, but I would like to know whether this approach is really sound. Is there another approach?

Please advise!

Posted: Tue May 01, 2007 3:13 pm
by ray.wurlod
A Data Set stage is ideal as the staging area between two jobs because it preserves the internal Data Set structure: internal formats, partitioning and sorting.

You will need to read the Data Set with a configuration file that is compatible with the one used when it was written. You may, therefore, need to re-run the writing job under the new, two-node, configuration.
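For readers unfamiliar with what "configuration file" means here: a parallel configuration file is a plain-text list of node definitions, selected per run via the APT_CONFIG_FILE environment variable or job parameter. A minimal two-node file might look something like the sketch below (the hostname and paths are placeholders, not taken from this thread):

```
{
  node "node1" {
    fastname "etlserver"
    pools ""
    resource disk "/data/datasets" { pools "" }
    resource scratchdisk "/data/scratch" { pools "" }
  }
  node "node2" {
    fastname "etlserver"
    pools ""
    resource disk "/data/datasets" { pools "" }
    resource scratchdisk "/data/scratch" { pools "" }
  }
}
```

A Data Set's descriptor file records which configuration it was written under, which is why the reading job's configuration needs to be compatible with the writing job's.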

You simply cannot use a two-node configuration file to read a Data Set that was written with a four-node configuration file. If your team lead says you can, demand to know how.

Posted: Tue May 01, 2007 3:21 pm
by swades
I have a question. :?:

Is it a must that we re-run the previous dataset-writing job, or can we still use that 4-node dataset for further processing with a 2-node config file, at the expense of performance?

Posted: Tue May 01, 2007 6:33 pm
by nick.bond
I'm not 100% sure about this and have no system to test it on, but I think all the data will still be read if the dataset was created on 4 nodes and you then read it on 2 nodes; you will get a warning about the data being repartitioned, hence losing performance. (And because there will be a warning in the job, sequences may fail if you have "success only" set.)

Posted: Wed May 02, 2007 6:24 pm
by vmcburney
Given the cost of repartitioning, you may find that restricting both jobs to 2 nodes is faster than 4 nodes followed by 2 nodes.

There may be a way to be clever with the configuration file so that you have a node pool of 4 nodes for datasets and 2 nodes for everything else. I don't know enough about pooling to be sure. This might avoid having to rebuild the dataset on 2 nodes.
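As an untested sketch of that pooling idea (the hostname, paths, and the pool name "dspool" are all made up for illustration): all four nodes could carry a named pool reserved for the Data Set stages, while only two nodes sit in the default pool used by everything else:

```
{
  node "node1" {
    fastname "etlserver"
    pools "" "dspool"
    resource disk "/data/datasets" { pools "" }
    resource scratchdisk "/data/scratch" { pools "" }
  }
  node "node2" {
    fastname "etlserver"
    pools "" "dspool"
    resource disk "/data/datasets" { pools "" }
    resource scratchdisk "/data/scratch" { pools "" }
  }
  node "node3" {
    fastname "etlserver"
    pools "dspool"
    resource disk "/data/datasets" { pools "" }
    resource scratchdisk "/data/scratch" { pools "" }
  }
  node "node4" {
    fastname "etlserver"
    pools "dspool"
    resource disk "/data/datasets" { pools "" }
    resource scratchdisk "/data/scratch" { pools "" }
  }
}
```

The Data Set stages would then be constrained to "dspool" (4 nodes) via a node pool constraint, while unconstrained stages run in the two-node default pool. Whether this actually avoids rebuilding the dataset would need testing.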

Re: Dataset used as an intermediate stage between two jobs

Posted: Wed May 02, 2007 6:29 pm
by vijayrc
Madhu1981 wrote: ...my team lead asked to run the job in 2 nodes, but the dataset which i am using has run on 4 nodes. How to come up with this solution.
You can use the FROM NODES / FROM PARTITIONS variables!

Posted: Wed May 02, 2007 6:44 pm
by ray.wurlod
I understood that these properties could only refer to the nodes/partitions in the current configuration file.