Can I accomplish sequential file bursting in a parallel job?
I have an input link to a Sequential File Stage. One of the several columns contains 4-digit-time information: 2-digit year and 2-digit month.
I'll call this column YYMM_CHAR, and this is my burst key. Example values would be 0902, 0508, 1312, 1502, and so on.
The output of the Sequential File Stage should split ("burst") the Input Link records into several output files: one file for each distinct occurrence of YYMM_CHAR.
All Records with YYMM_CHAR='0902' should be in <SequentialFileName>.0902
All Records with YYMM_CHAR='0508' should be in <SequentialFileName>.0508
etc.
(Ideally, the column YYMM_CHAR itself should be omitted in the output files, but that's of secondary importance at this time.)
At job design time, it is not possible to predict the values for YYMM_CHAR for each individual execution.
The possible range starts in the 1990s and reaches up to the present - the present at execution time, that is.
I've discarded the idea of putting a Switch stage before the Sequential File stage because of
* the fact that I'd have to enumerate all possible values for the burst key YYMM_CHAR,
* the high maintenance effort of adding new output links as time progresses,
* but mainly its limit of 128 output links, which would cover 128/12 years - less than eleven, which will not suffice.
Is there any chance to burst the data of the Sequential File Stage's input link dynamically into several files as described?
Thanks for reading,
Robert
[edit: deleted a "not" in a sentence.]
This feature was first introduced in DataStage 9.1.0. You can set the write method to "Generate Multiple Files" and use a key column, akin to what you labelled as "bursting." You can optionally append the key value to the file name.
http://www-01.ibm.com/support/knowledge ... stage.html
http://www-01.ibm.com/support/knowledge ... egory.html
For the file name, you can provide a root file string, which can include a path. The key value may be appended, just as in your example. I ran a quick test in 11.3.1 to confirm the behavior.
Notes:
- It inserts a sort for you, based on your key.
- When appending the key value to the file name, it appends a period followed by the key value. It appears that the period is hard-coded for you; I don't see an option to avoid the period.
- If you don't force it to sequential mode, then it also appends a period followed by the partition/node number (i.e. ".part00000", ".part00001", etc.)
- If you needed the key value anywhere other than at the end of the file name, then I think you would be in trouble...
It looks like you are on a slightly older release. Do you have the option to upgrade?
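To picture the naming convention described in the notes above, here is a minimal sketch; the root string "custdata", the key values, and the two-node setup are made-up examples, while the period separators and ".partNNNNN" suffixes are the behavior the stage was observed to produce:

```shell
# Sketch of the file names produced by the "Generate Multiple Files" write
# method (root string, keys, and node count are hypothetical examples).
root="custdata"
names=""
for key in 0902 0508 1312; do
  # forced sequential: one file per key value, period hard-coded
  names="$names ${root}.${key}"
  # default parallel on 2 nodes: an extra ".partNNNNN" suffix per partition
  for part in part00000 part00001; do
    names="$names ${root}.${key}.${part}"
  done
done
echo $names
```

This is only an illustration of the naming pattern, not of how the stage is configured.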
Choose a job you love, and you will never have to work a day in your life. - Confucius
From what I understand you want to create several output files from the one input link. If so, you could accomplish what you are after by using an External Target stage.
Prior to outputting the data, you would have to sort by your burst key column, and based on that key you would generate the required file name. Next, combine all the other columns that you want to output using a Column Export stage (if a static file format, use explicit columns; otherwise you'd have to build schemas for each output file).
Set the derivation of an output column in a transformer to something like
Code: Select all
lnk_out.FILE_NAME : '~' : lnk_out.DATA_COLUMNS
In this example I am using '~' as a delimiter between the filename and the data.
Then in an External Target stage, select Specified Program as the target method, and set the destination program to
Code: Select all
awk 'BEGIN { FS = "~" } ; { print $2 > $1 }'
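The awk trick above can be demonstrated standalone; the file names and record contents below are made-up, and only the awk one-liner itself is from the post:

```shell
# Standalone demo of the awk "burst" trick: field 1 (before '~') names the
# target file, field 2 is the record to write into it.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
printf '%s\n' \
  'out.0902~record A' \
  'out.0902~record B' \
  'out.0508~record C' \
| awk 'BEGIN { FS = "~" } ; { print $2 > $1 }'
cat out.0902   # prints: record A, record B
cat out.0508   # prints: record C
```

Each distinct value in field 1 yields one output file, with no need to enumerate the values in advance.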
Consider having a Server job run afterwards that sends the data through a Folder stage... the Folder stage can put out multiple sequential files. It's been a long time since I've done it, and it might not be as flexible as you need, but it's worth looking into.
Ernie
Ernie Ostic
blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
ShaneMuir - thanks so much for your suggestion.
I followed it for several reasons:
1. It looks like I'm on AIX. Sorry for the confusion; I only have access to the Windows-based GUI tools.
2. The original data columns are already concatenated into one single, long record in the SQL query of the Oracle Connector stage, using a definable separator character; in this case a semicolon.
3. The stream was already sorted by the burst key prior to sending it into the Sequential File stage.
All I did now was prefix the concatenated data fields with the burst key, using an additional, different separator character: ^
So the only field that goes into the External Target stage - which now replaces the Sequential File stage - looks like this (2 example records):
Code: Select all
1309^30924537;20130930230012;30.09.2013 23:04:04;1210805;889;;2382657;[...and so on...]
1310^31111382;20131001230005;01.10.2013 23:04:19;1218557;2884;;2383754;[...and so on...]
And the "Destination Program" of the External Target stage, according to your posting, is the following:
Code: Select all
awk 'BEGIN { FS = "^" } ; { print $2 > $1 }'
However: I somehow miss the options, or probably fail to know the exact place and/or method where/how to specify the output directory and a "base file name pattern" (both are job parameters) for each file; meaning I don't see any output file(s) after the job has finished without errors or warnings.
You wrote "and based on that key you would generate the required file name" - do I have to generate the filename for each output record in the data stream, too?
Thanks in advance!
It's always a sad thing to reply to your own post, but ...
In the end, I got it: from the job parameters and the burst key value, I had to generate the whole path and filename for each record and put it in front of the "custom separator character" ~ (or ^ in my case).
Thanks again, ShaneMuir.
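For anyone landing here later, the final design can be sketched as follows; the directory, base name, and sample records are made-up stand-ins for the job parameters and data, while the awk target program is unchanged from the posts above:

```shell
# The derivation now emits the FULL path+filename before the '^' separator,
# built from the job parameters, so awk itself needs no extra configuration.
OutDir="$(mktemp -d)"   # stands in for the output-directory job parameter
BaseName="custdata"     # stands in for the base-file-name job parameter
printf '%s\n' \
  "${OutDir}/${BaseName}.1309^30924537;20130930230012" \
  "${OutDir}/${BaseName}.1310^31111382;20131001230005" \
| awk 'BEGIN { FS = "^" } ; { print $2 > $1 }'
ls "$OutDir"   # custdata.1309  custdata.1310
```

Because every record carries its own target path, new YYMM values need no job changes at all.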