Multiple instance job: Different input files, one output file

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Multiple instance job: Different input files, one output file

Post by pdntsap »

We have a parallel job that has five input files and writes to two output files. We are planning to make this job a multiple instance job due to the record count involved. Let us say one of the input files has about 15 million records. We might run three instances of the same job.
My questions are:

1. I need to divide the 15 million record input file into three different files based on a column value and then use each file for each instance of the job. How can I specify different files for different instances?

2. The output file must be created/overwritten by the first instance, and the remaining two instances must append to it. How can I achieve create/overwrite for the first instance that completes, and then append for the remaining instances? One way might be to create three different output files and then merge them. Any other suggestions?

Any suggestions are greatly appreciated.

Thanks.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

1. Typically, one would include the Invocation ID as part of the name.

2. Since you'll need to set the job to "Append" for the remaining two invocations to work properly, you'll need to do something before the job runs to delete any existing version of the output file. Append will happily create the file if it doesn't exist. Of course, you'd need to be careful if this was actually in the "before-job" area to make sure it only happened during the initial invocation. That, or use a Sequence such that an Execute Command stage before the job activities does the dirty work.
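As a rough sketch of that cleanup step (the function and argument names here are made up, not DataStage built-ins — this is just the kind of thing an Execute Command stage or before-job subroutine might call):

```shell
# Hypothetical before-job cleanup: truncate the shared output file only
# when the first invocation runs, so all invocations can safely "Append".
truncate_if_first() {
    invocation_id="$1"   # assumed convention: invocation ids are 1, 2, 3
    out_file="$2"
    if [ "$invocation_id" = "1" ]; then
        : > "$out_file"  # create the file empty / overwrite any old copy
    fi
}
```

The other invocations then open the file in append mode and never touch this step.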
-craig

"You can never have too many knives" -- Logan Nine Fingers
pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Post by pdntsap »

Thanks for the suggestions Craig.

I need to start looking into invocation ID and related things and might come back with more questions.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

We'll be here. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

My initial thought, as Craig mentions, is to use the invocation id as part of your input file name.

You would need a job to split the file (Unix or DS). When you are reading the file, use an input mask which references the DSJobInvocationId macro, e.g. /folder_path/Filename_: DSJobInvocationId :.txt
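If you go the Unix route for the split, something along these lines would do it (the comma delimiter, key column, three-way bucketing rule, and file names are all assumptions — adjust to your data):

```shell
# Hypothetical split step: route each record of the big input file to one
# of three part files, named so each job invocation can pick up its own
# part via the DSJobInvocationId mask described above.
split_by_key() {
    in_file="$1"
    out_dir="$2"
    awk -F',' -v dir="$out_dir" '{
        # Bucket on the first column; here a trivial length-mod-3 rule
        # stands in for whatever real column-value rule you need.
        inv = (length($1) % 3) + 1
        print > (dir "/Filename_" inv ".txt")
    }' "$in_file"
}
```

Each instance then reads /folder_path/Filename_: DSJobInvocationId :.txt with invocation ids 1, 2 and 3.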

The output is a bit trickier. You could try using an External Target stage: pass it two columns, one being the file name, and the rest of your columns concatenated into one column (use a Column Export stage) with whatever delimiter your target file needs.

In the destination program of the External Target stage, you would set the Target method to "Specific program" and use the following code.

Code:

awk '{nPosField1=index($0,",");print substr($0,nPosField1+1)>substr($0,1,nPosField1-1)}'
This should have the effect of letting the job write rows out to several different files at the same time. Not sure whether it will work for multiple jobs appending to the same file, though. But it might.
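A quick way to see what that one-liner does (the file paths and sample rows here are invented for illustration): the text up to the first comma is used as the output file name, and everything after it becomes the record written to that file.

```shell
# Wrap the posted awk one-liner so it can be fed rows on stdin.
# Field 1 (up to the first comma) = target file name; the rest = the row.
route_rows() {
    awk '{nPosField1=index($0,",");print substr($0,nPosField1+1)>substr($0,1,nPosField1-1)}'
}
```

So a two-column input of (file name, exported row) fans out into one file per distinct name, which is exactly the External Target setup described above.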
arunkumarmm
Participant
Posts: 246
Joined: Mon Jun 30, 2008 3:22 am
Location: New York

Post by arunkumarmm »

For your 2nd question, you can create a job which overwrites the output file from a null file, call it in the job sequence only when it is your first instance, and bypass it for the other two. And in the core job you can always use 'Append'.
Arun