We have a parallel job that has five input files and writes to two output files. We are planning to make this a multiple instance job due to the record counts involved. Let us say one of the input files has about 15 million records. We might run three instances of the same job.
My questions are:
1. I need to divide the 15 million record input file into three different files based on a column value and then use each file for each instance of the job. How can I specify different files for different instances?
2. The output file must be created/overwritten by the first instance, and the remaining two instances must append to that file. How can I achieve create/overwrite for the first instance that completes, and append for the remaining instances? One way might be to create three different output files and then merge them. Any other suggestions?
Any suggestions are greatly appreciated.
Thanks.
Multiple instance job: Different input files,one output file
1. Typically, one would include the Invocation ID as part of the name.
2. Since you'll need to set the job to "Append" for the remaining two invocations to work properly, you'll need to do something before the job runs to delete any existing version of the output file. Append will happily create the file if it doesn't exist. Of course, if this was actually in the "before job" area, you'd need to be careful to make sure it only happened during the initial invocation. That, or use a Sequence such that an Execute Command stage before the job activities does the dirty work.
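A minimal sketch of that approach, assuming a made-up output path (run the cleanup once, e.g. from the Sequence's Execute Command stage, before any of the instances start):

```shell
#!/bin/sh
# Pre-job cleanup: remove any leftover output so every instance can
# safely run in Append mode (Append creates the file if missing).
# OUTFILE is a hypothetical path -- substitute your real output file.
OUTFILE=/tmp/demo_output.txt
rm -f "$OUTFILE"

# Simulate the three invocations, each appending its rows.
for inst in 1 2 3; do
    printf 'rows from instance %s\n' "$inst" >> "$OUTFILE"
done
```

After all three "instances" run, the single output file holds the rows from each of them, with no stale data from a previous run.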
-craig
"You can never have too many knives" -- Logan Nine Fingers
My initial thought, as Craig mentions, is to use the Invocation ID as part of your input file name.
You would need a job to split the file (Unix or DS). When you are reading the file, use an input mask which references the DSJobInvocationId macro, e.g. /folder_path/Filename_: DSJobInvocationId :.txt
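A rough sketch of the Unix side of the split, assuming invented paths and that the routing column is the first comma-separated field (whose values you would use as the Invocation IDs):

```shell
#!/bin/sh
# Split one input file into per-instance files whose names match the
# job's input mask: /folder_path/Filename_: DSJobInvocationId :.txt
# Directory, file names and data are invented for illustration.
mkdir -p /tmp/split_demo
cat > /tmp/split_demo/input.csv <<'EOF'
A,rec1
B,rec2
A,rec3
C,rec4
EOF

# Route each record to Filename_<column1>.txt; running the instances
# with Invocation IDs A, B and C lets each pick up its own file.
awk -F',' '{ print > ("/tmp/split_demo/Filename_" $1 ".txt") }' \
    /tmp/split_demo/input.csv
```

With three distinct column values this stays well under awk's open-file limits; for many values you would need to close files as you go.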
The output is a bit trickier. You could try using an External Target stage: pass it two columns, one being the file name, and then pass the rest of your columns as one column (use a Column Export stage) with whatever delimiter your target file needs.
Then, in the Destination Program of the External Target stage, you would set your Target Method to Specific Program and use the following code.
This should have the effect of allowing the job to concatenate the rows to various different files at the same time. Not sure if it will work for multiple different jobs appending to the same file though. But it might.
Code:
awk '{nPosField1=index($0,",");print substr($0,nPosField1+1)>substr($0,1,nPosField1-1)}'
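To illustrate what that one-liner does, here is a small demo with invented file names and rows: the text before the first comma is treated as the target file name, and the rest of the record is written to that file.

```shell
#!/bin/sh
# Demo of the awk one-liner above. Within a single awk run each output
# file stays open, so repeated rows for the same name are concatenated.
cd /tmp && rm -f out1.txt out2.txt
printf 'out1.txt,alpha\nout2.txt,beta\nout1.txt,gamma\n' |
awk '{nPosField1=index($0,",");print substr($0,nPosField1+1)>substr($0,1,nPosField1-1)}'
cat out1.txt    # prints alpha then gamma
```

Note that a fresh awk run truncates each file on its first write to it, so separate jobs appending to the same file this way would still need the Append-vs-overwrite handling discussed above.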