Can I accomplish sequential file bursting in a parallel job?
I have an input link to a Sequential File Stage. One of the several columns contains 4-digit-time information: 2-digit year and 2-digit month.
I'll call this column YYMM_CHAR, and this is my burst key. Example values would be 0902, 0508, 1312, 1502, and so on.
The output of the Sequential File Stage should split ("burst") the Input Link records into several output files: one file for each distinct occurrence of YYMM_CHAR.
All Records with YYMM_CHAR='0902' should be in <SequentialFileName>.0902
All Records with YYMM_CHAR='0508' should be in <SequentialFileName>.0508
etc.
(Ideally, the column YYMM_CHAR itself should be omitted in the output files, but that's of secondary importance at this time.)
At job design time, it is not possible to predict the values for YYMM_CHAR for each individual execution.
The possible range starts in the 1990s and reaches up to the present - the present at execution time, that is.
I've discarded the idea of putting a Switch stage before the Sequential File stage because of
* the fact that I'd have to enumerate all possible values for the burst key YYMM_CHAR,
* the high maintenance effort of adding new output links as time progresses,
* but mainly its limit of 128 output links, which would cover 128/12 years - less than eleven, which will not suffice.
Is there any chance to burst the data of the Sequential File Stage's input link dynamically into several files as described?
Thanks for reading,
Robert
[edit: deleted a "not" in a sentence.]
This feature was first introduced in DataStage 9.1.0. You can set the write method to "Generate Multiple Files" and use a key column, akin to what you labelled as "bursting." You can optionally append the key value to the file name.
http://www-01.ibm.com/support/knowledge ... stage.html
http://www-01.ibm.com/support/knowledge ... egory.html
For the file name, you can provide a root file string, which can include a path. The key value may be appended, just as in your example. I ran a quick test in 11.3.1 to confirm the behavior.
Notes:
- It inserts a sort for you, based on your key.
- When appending the key value to the file name, it appends a period followed by the key value. It appears that the period is hard-coded for you; I don't see an option to avoid the period.
- If you don't force it to sequential mode, then it also appends a period followed by the partition/node number (i.e. ".part00000", ".part00001", etc.)
- If you needed the key value anywhere other than at the end of the file name, then I think you would be in trouble...
It looks like you are on a slightly older release. Do you have the option to upgrade?
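To picture the naming convention described in the notes above, here is a minimal sketch; the root string "custdata", the key values, and the two-node setup are made-up examples, while the period separators and ".partNNNNN" suffixes are the behavior the stage was observed to produce:

```shell
# Sketch of the file names produced by the "Generate Multiple Files" write
# method (root string, keys, and node count are hypothetical examples).
root="custdata"
names=""
for key in 0902 0508 1312; do
  # forced sequential: one file per key value, period hard-coded
  names="$names ${root}.${key}"
  # default parallel on 2 nodes: an extra ".partNNNNN" suffix per partition
  for part in part00000 part00001; do
    names="$names ${root}.${key}.${part}"
  done
done
echo $names
```

This is only an illustration of the naming pattern, not of how the stage is configured.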
Choose a job you love, and you will never have to work a day in your life. - Confucius
From what I understand you want to create several output files from the one input link. If so, you could accomplish what you are after by using an External Target stage.
Prior to outputting the data, you would have to sort by your burst key column, and based on that key you would generate the required file name. Next, combine all the other columns that you want to output using a Column Export stage (if a static file format, use explicit columns; otherwise you'd have to build schemas for each output file).
Set the derivation of an output column in a transformer to something like
Code: Select all
lnk_out.FILE_NAME : '~' : lnk_out.DATA_COLUMNS
In this example I am using '~' as a delimiter between the filename and the data.
Then in an External Target stage, select Specified Program as the target method, and set the destination program to
Code: Select all
awk 'BEGIN { FS = "~" } ; { print $2 > $1 }'
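The awk trick above can be demonstrated standalone; the file names and record contents below are made-up, and only the awk one-liner itself is from the post:

```shell
# Standalone demo of the awk "burst" trick: field 1 (before '~') names the
# target file, field 2 is the record to write into it.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
printf '%s\n' \
  'out.0902~record A' \
  'out.0902~record B' \
  'out.0508~record C' \
| awk 'BEGIN { FS = "~" } ; { print $2 > $1 }'
cat out.0902   # prints: record A, record B
cat out.0508   # prints: record C
```

Each distinct value in field 1 yields one output file, with no need to enumerate the values in advance.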
Consider having a Server job run afterwards that sends the data through a Folder stage... the Folder stage can put out multiple sequential files. It's been a long time since I've done it, and it might not be as flexible as you need, but it's worth looking into.
Ernie
Ernie Ostic
blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
ShaneMuir - thanks so much for your suggestion.
I followed it for several reasons:
1. It looks like I'm on AIX. Sorry for the confusion; I only have access to the Windows-based GUI tools.
2. The original data columns are already concatenated into one single, long record in the SQL query of the Oracle Connector stage, using a definable separator character; in this case a semicolon.
3. The stream was already sorted by the burst key prior to sending it into the Sequential File stage.
All I did now was prefix the concatenated data fields with the burst key, using an additional, different separator character: ^
So the only field that goes into the External Target stage - which now replaces the Sequential File stage - looks like this (2 example records):
Code: Select all
1309^30924537;20130930230012;30.09.2013 23:04:04;1210805;889;;2382657;[...and so on...]
1310^31111382;20131001230005;01.10.2013 23:04:19;1218557;2884;;2383754;[...and so on...]
And the "Destination Program" of the External Target stage, according to your posting, is the following:
Code: Select all
awk 'BEGIN { FS = "^" } ; { print $2 > $1 }'
However: I somehow miss the options, or probably fail to know the exact place and/or method where/how to specify the output directory and a "base file name pattern" (both are job parameters) for each file; meaning I don't see any output file(s) after the job has finished without errors or warnings.
You wrote "and based on that key you would generate the required file name" - do I have to generate the filename for each output record in the data stream, too?
Thanks in advance!
It's always a sad thing to reply to your own post, but ...
In the end, I got it: from the job parameters and the burst key value, I had to generate the whole path and filename for each record and put it in front of the "custom separator character" ~ (or ^ in my case).
Thanks again, ShaneMuir.
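For anyone landing here later, the final design can be sketched as follows; the directory, base name, and sample records are made-up stand-ins for the job parameters and data, while the awk target program is unchanged from the posts above:

```shell
# The derivation now emits the FULL path+filename before the '^' separator,
# built from the job parameters, so awk itself needs no extra configuration.
OutDir="$(mktemp -d)"   # stands in for the output-directory job parameter
BaseName="custdata"     # stands in for the base-file-name job parameter
printf '%s\n' \
  "${OutDir}/${BaseName}.1309^30924537;20130930230012" \
  "${OutDir}/${BaseName}.1310^31111382;20131001230005" \
| awk 'BEGIN { FS = "^" } ; { print $2 > $1 }'
ls "$OutDir"   # custdata.1309  custdata.1310
```

Because every record carries its own target path, new YYMM values need no job changes at all.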