Multi-instance job appending DataSet.

videsh77
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am

Post by videsh77 »

Hi

We have a DataStage job that allows multiple invocations (a multi-instance job). This job is expected to append to a DataSet.

My question: at any given point in time we may have 5 instances of this job running together, all trying to append to the same DataSet.
Even though a DataSet is parallel in nature, I am not sure this is safe. Will there be any contention while two or more instances of the job are attempting to append to the DataSet? Will there be any locking issues?

If yes, which environment variable can we alter to control this locking issue?
Thanks with regards,
videsh.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

DataSets cannot be concurrently written to. Each instance will have to write to a separate DS and then the files need to be merged when all writes have finished.

Update - Let me retract that. When you append to a dataset the engine will create new data files and add them to the descriptor. So each job that is appending to a dataset could, without corrupting the previous data or concurrent writer's data, legitimately run if the descriptor file concurrency is controlled. I think it is worth trying this out or getting a definitive statement from support.
The only thing that can go wrong is if the descriptor file is opened for writing by 2 jobs at the same time, in which case the last one to close the file "wins". If the file is set to 1-writer or n-readers or simultaneous access is otherwise guaranteed and controlled within the PX engine you might be in luck.
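The "last one to close wins" case can be illustrated with a small Python sketch. This is only a toy model, not how the PX engine actually stores descriptors: the JSON descriptor, the file names and the helper functions are all assumptions made for illustration. Two jobs read the descriptor before either writes it back, so the second write discards the first job's entry.

```python
import json
import os
import tempfile


def read_descriptor(path):
    """Return the list of data file names recorded in the toy descriptor."""
    with open(path) as f:
        return json.load(f)


def write_descriptor(path, segments):
    """Overwrite the toy descriptor with the given list of data file names."""
    with open(path, "w") as f:
        json.dump(segments, f)


def simulate_lost_update():
    workdir = tempfile.mkdtemp()
    descriptor = os.path.join(workdir, "demo.ds")  # hypothetical descriptor
    write_descriptor(descriptor, ["base.part"])

    # Both jobs open the descriptor before either one has written it back.
    view_a = read_descriptor(descriptor)
    view_b = read_descriptor(descriptor)

    # Each job adds only its own data file to the copy it read earlier.
    write_descriptor(descriptor, view_a + ["job_a.part"])
    write_descriptor(descriptor, view_b + ["job_b.part"])  # last writer wins

    return read_descriptor(descriptor)


print(simulate_lost_update())  # ['base.part', 'job_b.part']
```

In this model job A's data file would still exist on disk, but nothing in the descriptor points to it any more, which matches the "lose a whole data file" risk rather than corruption of existing data.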
videsh77

Post by videsh77 »

You are pointing to exactly what I am afraid of.

If I wait until all the datasets are written and then append them, I need to allocate twice the node space: once for all the individual datasets and once for the combined dataset.
Also, with this approach there will be a wait time until all the datasets are written.

Isn't there a suitable way for a single dataset descriptor to be used when appending the data written by the individual job instances?
Thanks with regards,
videsh.
ArndW

Post by ArndW »

You should ask your support provider for a definitive answer. I would think that in 99%+ of cases this will work, but in the one case you might lose a whole data file out of the data set if the concurrency control isn't guaranteed.
Please do post the answer you get; I am very curious whether this will work.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Data Sets work like text files, only in a parallel environment. It is the operating system that limits you, not DataStage. Therefore there is no workaround in DataStage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW

Post by ArndW »

Ray,
actually it works somewhat differently in PX. When you "append" to a dataset, the PX engine does not append to the existing data files; it creates new sequential files and adds their paths to the descriptor. Thus, for concurrent appends to work, the descriptor file is the important one to control, not the data files. If the engine single-threads access to the descriptor then you can do concurrent writes to a dataset.
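As a rough illustration of that idea, here is a Python sketch in which each writer creates its own data file and only the descriptor update is serialised. It is a toy model under stated assumptions, not the PX engine's mechanism: the JSON descriptor, the ".lock" file and the function names are hypothetical, threads stand in for job instances, and `fcntl.flock` is POSIX-only.

```python
import fcntl
import json
import os
import tempfile
import threading


def append_segment(descriptor, segment_name):
    # Each writer creates its own new data file; nobody else touches it.
    data_path = os.path.join(os.path.dirname(descriptor), segment_name)
    with open(data_path, "w") as data_file:
        data_file.write("rows from " + segment_name + "\n")

    # Only the descriptor's read-modify-write cycle is serialised.
    with open(descriptor + ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until we own the lock
        try:
            with open(descriptor) as f:
                segments = json.load(f)
            segments.append(segment_name)
            with open(descriptor, "w") as f:
                json.dump(segments, f)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def run_concurrent_appends(instances=5):
    workdir = tempfile.mkdtemp()
    descriptor = os.path.join(workdir, "demo.ds")  # hypothetical descriptor
    with open(descriptor, "w") as f:
        json.dump([], f)

    # One thread per simulated job instance.
    threads = [
        threading.Thread(target=append_segment,
                         args=(descriptor, "inst%d.part" % i))
        for i in range(instances)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with open(descriptor) as f:
        return sorted(json.load(f))


print(run_concurrent_appends())
```

With the lock in place every instance's data file ends up listed in the descriptor, whatever order the writers finish in. Whether the real engine provides an equivalent guarantee is exactly the open question in this thread.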