Multi-instance job appending DataSet.

videsh77 · Post by **videsh77** » Sun Jul 08, 2007 8:05 am

Hi

We have a DataStage job that allows multiple invocations (multi-instance). This job is expected to append to a DataSet.

My question: at any given time we may have 5 instances of this job running together, all trying to append to the same DataSet.
Even though a DataSet is parallel in nature, I doubt whether this is safe. Will there be any contention while two or more instances of the job are attempting to append to the DataSet? Will there be any lock issues?

If so, which environment variable can we change to control this locking issue?

ArndW · Post by **ArndW** » Sun Jul 08, 2007 4:09 pm

DataSets cannot be concurrently written to. Each instance will have to write to a separate DS and then the files need to be merged when all writes have finished.

Update - Let me retract that. When you append to a dataset the engine will create new data files and add them to the descriptor. So each job that is appending to a dataset could, without corrupting the previous data or concurrent writer's data, legitimately run if the descriptor file concurrency is controlled. I think it is worth trying this out or getting a definitive statement from support.
The only thing that can go wrong is if the descriptor file is opened for writing by 2 jobs at the same time, in which case the last one to close the file "wins". If the file is set to 1-writer or n-readers or simultaneous access is otherwise guaranteed and controlled within the PX engine you might be in luck.
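To make the "last one to close wins" case concrete, here is a minimal Python sketch of a lost update. It is a toy model only: the JSON descriptor, the `toy.ds` file name and the segment names are assumptions made for illustration and do not reflect the real PX descriptor format or engine behaviour.

```python
import json
import os
import tempfile


def read_descriptor(path):
    """Return the list of data file names recorded in the toy descriptor."""
    with open(path) as f:
        return json.load(f)


def write_descriptor(path, segments):
    """Overwrite the toy descriptor with the given list of data file names."""
    with open(path, "w") as f:
        json.dump(segments, f)


def simulate_lost_update():
    workdir = tempfile.mkdtemp()
    descriptor = os.path.join(workdir, "toy.ds")  # hypothetical descriptor file
    write_descriptor(descriptor, ["part_existing"])

    # Both jobs read the descriptor before either one writes it back.
    snapshot_a = read_descriptor(descriptor)
    snapshot_b = read_descriptor(descriptor)

    # Each job adds only its own new data file to the copy it read.
    write_descriptor(descriptor, snapshot_a + ["part_job_a"])
    write_descriptor(descriptor, snapshot_b + ["part_job_b"])  # closes last

    return read_descriptor(descriptor)


result = simulate_lost_update()
print(result)
```

In this model the data file written by job A still exists on disk, but the descriptor no longer refers to it, which is the "lose a whole data file" outcome.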

videsh77 · Post by **videsh77** » Sun Jul 08, 2007 11:33 pm

You are pointing to exactly what I am afraid of.

If I wait until all the datasets are written and then append them, I need to allocate twice the node space: once for all the individual datasets and once for the combined dataset.
This approach also adds a wait time until all the datasets are written.

Isn't there a suitable way to use a single dataset descriptor for appending the datasets written by the individual job instances?

ArndW · Post by **ArndW** » Sun Jul 08, 2007 11:51 pm

You should ask your support provider for a definitive answer. I would think that in 99%+ of cases this will work, but in the one case you might lose a whole data file out of the data set if the concurrency control isn't guaranteed.
Please do post your answer; I am very curious whether this will work.

ray.wurlod · Post by **ray.wurlod** » Mon Jul 09, 2007 12:27 am

Data Sets work like text files, only in a parallel environment. It is the operating system that limits you, not DataStage. Therefore there is no workaround in DataStage.

ArndW · Post by **ArndW** » Mon Jul 09, 2007 2:37 am

Ray,
actually it works somewhat differently in PX. When you "append" to a dataset, the PX engine does not append to the existing data files; it creates new sequential files and adds their paths to the descriptor. Thus, in order for concurrent writes to work, the descriptor file is the important one to control, not the data files. If the engine single-threads access to the descriptor then you can do concurrent writes to a dataset.
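A rough Python sketch of that idea, under stated assumptions: each writer produces its own new data file, and only the step that registers the file in the descriptor is serialized. The `ToyDescriptor` class and the segment names are hypothetical, and `threading.Lock` stands in for whatever concurrency control the engine may or may not apply. Real job instances are separate processes, so an actual implementation would need an OS-level file lock or equivalent.

```python
import threading


class ToyDescriptor:
    """Toy stand-in for a dataset descriptor: a list of data file names."""

    def __init__(self):
        self.segments = []
        self._lock = threading.Lock()

    def append_segment(self, name):
        # Only one writer at a time may read, modify and store the list.
        with self._lock:
            current = list(self.segments)
            current.append(name)
            self.segments = current


def run_instances(count):
    descriptor = ToyDescriptor()

    def job(instance_id):
        # Each instance writes its own data file (not modelled here),
        # then registers that file in the shared descriptor.
        descriptor.append_segment(f"segment_instance_{instance_id}")

    threads = [threading.Thread(target=job, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return descriptor.segments


segments = run_instances(5)
print(sorted(segments))
```

With the descriptor update serialized, all five segments are registered regardless of the order in which the instances finish. Whether the PX engine provides this guarantee is exactly the open question in this thread.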