Using a dataset as an intermediate stage between two jobs

Madhu1981
Participant
Posts: 69
Joined: Wed Feb 22, 2006 7:49 am

Post by Madhu1981 »

I have two jobs. The first runs on four nodes and writes its intermediate results to a dataset. The second job reads this dataset and loads the data into DB2. For performance reasons, my team lead has asked me to run the job on 2 nodes, but the dataset I am using was written on 4 nodes. How should I handle this?

I set Preserve Partitioning to Clear and ran the job, but I would like to know whether this is really a good approach. Is there another approach?

Please advise.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A Data Set stage is ideal as the staging area between two jobs because it preserves the internal Data Set structure: internal formats, partitioning and sorting.

You will need to read the Data Set with a configuration file that is compatible with the one used when it was written. You may, therefore, need to re-run the writing job under the new, two-node, configuration.

You simply cannot use a two-node configuration file to read a Data Set that was written with a four-node configuration file. If your team lead says you can, demand to know how.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
swades
Premium Member
Posts: 323
Joined: Mon Dec 04, 2006 11:52 pm

Post by swades »

I have a question.

Is it a must that we re-run the previous job that wrote the Data Set? Or can we still use that 4-node Data Set for further processing with a 2-node configuration file, at the expense of performance?
nick.bond
Charter Member
Posts: 230
Joined: Thu Jan 15, 2004 12:00 pm
Location: London

Post by nick.bond »

I'm not 100% sure about this and have no system to test it on, but I think all the data will still be read if the dataset was created on 4 nodes and you then read it on 2 nodes. You will get a warning about the data being repartitioned, though, and so lose performance. (That warning will also appear in the job, so sequences may fail if you have "success only" set.)
Regards,

Nick.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Given the cost of repartitioning you may find restricting both jobs to 2 nodes is faster than 4 nodes followed by 2 nodes.

There may be a way to be clever with the configuration file so that you have a node pool of 4 nodes for datasets and 2 nodes for everything else. I don't know enough about pooling to be sure. This may avoid having to rebuild the dataset on 2 nodes.
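
As a rough, untested illustration of the node-pool idea: a parallel configuration file could define four nodes but place only two of them in the default pool (""), so that stages run two-way unless they are explicitly constrained to a named pool. The node names, the host name etlhost, the directory paths and the pool name "ds4" below are all made up for the example, and nothing in this thread confirms that such a file lets a two-node job read a Data Set written on four nodes.

```
/* Hypothetical sketch only: host name, paths and the "ds4" pool are assumptions. */
{
	node "node1"
	{
		fastname "etlhost"
		pools "" "ds4"
		resource disk "/data/ds/node1" {pools ""}
		resource scratchdisk "/scratch/node1" {pools ""}
	}
	node "node2"
	{
		fastname "etlhost"
		pools "" "ds4"
		resource disk "/data/ds/node2" {pools ""}
		resource scratchdisk "/scratch/node2" {pools ""}
	}
	/* node3 and node4 are outside the default pool, so they are only
	   used by stages constrained to the "ds4" node pool. */
	node "node3"
	{
		fastname "etlhost"
		pools "ds4"
		resource disk "/data/ds/node3" {pools ""}
		resource scratchdisk "/scratch/node3" {pools ""}
	}
	node "node4"
	{
		fastname "etlhost"
		pools "ds4"
		resource disk "/data/ds/node4" {pools ""}
		resource scratchdisk "/scratch/node4" {pools ""}
	}
}
```

With a file like this, the Data Set stage would be the one constrained to the "ds4" pool, and the data would still have to be repartitioned from four partitions to two before the DB2 load.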
vijayrc
Participant
Posts: 197
Joined: Sun Apr 02, 2006 10:31 am
Location: NJ

Re: Using a dataset as an intermediate stage between two jobs

Post by vijayrc »

Madhu1981 wrote: I have two jobs. The first runs on four nodes and writes its intermediate results to a dataset. The second job reads this dataset and loads the data into DB2. For performance reasons, my team lead has asked me to run the job on 2 nodes, but the dataset I am using was written on 4 nodes. How should I handle this?

I set Preserve Partitioning to Clear and ran the job, but I would like to know whether this is really a good approach. Is there another approach?

Please advise.

You can use the FROM NODES/FROM PARTITIONS variables.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I understood that these properties could only refer to the nodes/partitions in the current configuration file.