Shared containers - balancing act??

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsxuserrio
Participant
Posts: 82
Joined: Thu Dec 02, 2004 10:27 pm
Location: INDIA

Shared containers - balancing act??

Post by dsxuserrio »

Hello
Shared containers (as compared to writing to datasets and using those datasets to continue processing) perform better because of pipeline parallelism. On the other hand, it is always recommended not to build big jobs; build smaller jobs because of restartability, manageability, and other advantages.
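A rough analogy in Python (not DataStage code, and the stage logic here is made up purely for illustration): landing to a dataset forces stage 1 to finish writing every row before stage 2 can start reading, while pipeline parallelism lets rows flow straight from one stage to the next with no intermediate file.

```python
import os
import tempfile

# Staged approach: stage 1 lands all rows to disk, stage 2 re-reads them.
def staged(rows):
    tmp = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".ds")
    with tmp as f:                        # stage 1: write the full dataset
        for r in rows:
            f.write(f"{r}\n")
    with open(tmp.name) as f:             # stage 2: read it all back
        out = [int(line) * 2 for line in f]
    os.remove(tmp.name)
    return out

# Pipelined approach: stage 2 consumes rows as stage 1 produces them;
# nothing is landed, and the stages overlap in time.
def pipelined(rows):
    doubled = (r * 2 for r in rows)       # lazy, row-by-row handoff
    return list(doubled)

print(staged(range(5)))     # [0, 2, 4, 6, 8]
print(pipelined(range(5)))  # [0, 2, 4, 6, 8]
```

Both produce the same result; the difference is only the I/O cost and the serialization of the two stages.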

So is choosing between shared containers and breaking into smaller jobs a balancing act?? Or are there some other advantages to using shared containers in PX??

Thanks for your time.
dsxuserrio

Kannan.N
Bangalore,INDIA
memrinal
Participant
Posts: 74
Joined: Wed Nov 24, 2004 9:13 pm

Post by memrinal »

Even I am facing a similar situation.
I have a set of jobs which use the output from the previous job as their input. Since each job was writing to a file and the next job was reading from it, I/O overhead was a major concern. So to reduce the I/O load I encapsulated complete jobs in shared containers and created a single job which used these containers.
Say earlier the output file from job A was the input for job B, and the output file of job B was the input of job C. Now containers were created as follows: container 1 encapsulated job A with the output file replaced by an output link. Similarly, containers 2 and 3 were created from jobs B and C.
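To sketch that restructuring (again only an analogy in Python, with hypothetical stand-ins for jobs A, B, and C), the file-based chain "A writes a file, B reads it, B writes a file, C reads it" becomes direct composition, where the output link of one container feeds the input link of the next:

```python
# Hypothetical stage logic for jobs A, B, C; each consumes and
# produces a stream of rows rather than a landed file.
def job_a(rows):
    for r in rows:
        yield {"id": r, "val": r * 10}

def job_b(rows):
    for r in rows:
        r["val"] += 1
        yield r

def job_c(rows):
    for r in rows:
        yield r["val"]

# Container-based chain: compose the three stages directly,
# replacing the two intermediate files with in-memory links.
result = list(job_c(job_b(job_a(range(3)))))
print(result)  # [1, 11, 21]
```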

But the problem that I faced in using the output of one container as the input of another was that I was unable to match the metadata. To overcome this problem I had to use transformers wherever the output of one container was the input of the second container. This led to several transformers in my job.
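The exact column definitions aren't given in the post, but the role those interposed transformers play can be sketched roughly like this: rename and coerce the upstream container's output columns so they match the metadata the downstream container expects (the column names and types below are invented for illustration):

```python
# Hypothetical mapping: container 1 emits CUST_ID/AMT as strings,
# while container 2 expects customer_id (int) and amount (float).
COLUMN_MAP = {"CUST_ID": ("customer_id", int), "AMT": ("amount", float)}

def adapt(rows, column_map):
    """Rename and coerce columns so downstream metadata matches."""
    for row in rows:
        yield {new: cast(row[old]) for old, (new, cast) in column_map.items()}

rows_in = [{"CUST_ID": "7", "AMT": "12.5"}]
print(list(adapt(rows_in, COLUMN_MAP)))
# [{'customer_id': 7, 'amount': 12.5}]
```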

So now the overhead due to I/O has been replaced by overhead due to transformers. With the volume of data that I have (about 3000 records) I am unable to decide which approach is better. Perhaps using containers may be a good idea when the data volume is very large, but otherwise small jobs would be okay because I/O will be less. I am myself not sure about this.

Anyone who has faced a similar situation and knows which approach is better for large volumes of data, please share your experience.

Mrinal