Multiple instance job: Different input files, one output file

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Multiple instance job: Different input files, one output file

Post by pdntsap »

We have a parallel job that has five input files and writes to two output files. We are planning to make this job a multiple instance job due to the record count involved. Let us say one of the input files has about 15 million records. We might run three instances of the same job.
My questions are:

1. I need to divide the 15 million record input file into three different files based on a column value and then use each file for each instance of the job. How can I specify different files for different instances?

2. The output file must be created/overwritten by the first instance, and the remaining two instances must append to it. How can I achieve create/overwrite for the first instance that completes, and then append for the remaining instances? One way might be to create three different output files and then merge them. Any other suggestions?

Any suggestions are greatly appreciated.

Thanks.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

1. Typically, one would include the Invocation ID as part of the name.

2. Since you'll need to set the job to "Append" for the remaining two invocations to work properly, you'll need to do something before the job runs to delete any existing version of the output file. Append will happily create the file if it doesn't exist. Of course, you'd need to be careful if this was actually in the "before-job" area to make sure it only happened during the initial invocation. That, or use a Sequence such that an Execute Command stage before the job activities does the dirty work.
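As a rough sketch of that cleanup step (the function and argument names here are made up, not DataStage built-ins — this is just the kind of thing an Execute Command stage or before-job subroutine might call):

```shell
# Hypothetical before-job cleanup: truncate the shared output file only
# when the first invocation runs, so all invocations can safely "Append".
truncate_if_first() {
    invocation_id="$1"   # assumed convention: invocation ids are 1, 2, 3
    out_file="$2"
    if [ "$invocation_id" = "1" ]; then
        : > "$out_file"  # create the file empty / overwrite any old copy
    fi
}
```

The other invocations then open the file in append mode and never touch this step.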
-craig

"You can never have too many knives" -- Logan Nine Fingers
pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Post by pdntsap »

Thanks for the suggestions Craig.

I need to start looking into invocation ID and related things and might come back with more questions.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

We'll be here. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

My initial thought, as Craig mentions, is to use the invocation id as part of your input file name.

You would need a job to split the file (Unix or DS). When you are reading the file, use an input mask which references the DSJobInvocationId macro, e.g. /folder_path/Filename_: DSJobInvocationId :.txt
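If you go the Unix route for the split, something along these lines would do it (the comma delimiter, key column, three-way bucketing rule, and file names are all assumptions — adjust to your data):

```shell
# Hypothetical split step: route each record of the big input file to one
# of three part files, named so each job invocation can pick up its own
# part via the DSJobInvocationId mask described above.
split_by_key() {
    in_file="$1"
    out_dir="$2"
    awk -F',' -v dir="$out_dir" '{
        # Bucket on the first column; here a trivial length-mod-3 rule
        # stands in for whatever real column-value rule you need.
        inv = (length($1) % 3) + 1
        print > (dir "/Filename_" inv ".txt")
    }' "$in_file"
}
```

Each instance then reads /folder_path/Filename_: DSJobInvocationId :.txt with invocation ids 1, 2 and 3.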

The output is a bit trickier. You could try using an External Target stage: pass it two columns, one being the file name, and the rest of your columns concatenated into one column (use a Column Export stage) with whatever delimiter your target file needs.

In the destination program of the External Target stage, you would set the Target method to "Specific program" and use the following code.

Code:

awk '{nPosField1=index($0,",");print substr($0,nPosField1+1)>substr($0,1,nPosField1-1)}'
This should have the effect of letting the job write rows out to several different files at the same time. Not sure whether it will work for multiple jobs appending to the same file, though. But it might.
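A quick way to see what that one-liner does (the file paths and sample rows here are invented for illustration): the text up to the first comma is used as the output file name, and everything after it becomes the record written to that file.

```shell
# Wrap the posted awk one-liner so it can be fed rows on stdin.
# Field 1 (up to the first comma) = target file name; the rest = the row.
route_rows() {
    awk '{nPosField1=index($0,",");print substr($0,nPosField1+1)>substr($0,1,nPosField1-1)}'
}
```

So a two-column input of (file name, exported row) fans out into one file per distinct name, which is exactly the External Target setup described above.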
arunkumarmm
Participant
Posts: 246
Joined: Mon Jun 30, 2008 3:22 am
Location: New York

Post by arunkumarmm »

For your 2nd question, you can create a job which overwrites the output file from a null file, call it in the job sequence only when it is your first instance, and bypass it for the other two. And in the core job you can always use 'Append'.
Arun