
multiple sequential file creation?

Posted: Tue Jan 13, 2009 4:58 am
by jonathanhale
I have a table, blobs, consisting of blob_id (char 16) and blob_cont (char 4 MB).

I need to create a sequential file for each row in blobs, where the filename = blob_id and the file content = blob_cont.

Is there a way of creating multiple differently named sequential files simultaneously in DataStage?

Posted: Wed Jan 14, 2009 9:40 am
by chulett
Hmmm... not really, unless you know the maximum number you'll have ahead of time and build that many output stages into your job. The typical answer would be to create a single file and then run an after-job script to split it into multiple files with the names you require.
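The after-job split described above can be done with a short script in any language. A minimal sketch in Python, assuming the job exported one pipe-delimited row per blob (the delimiter, file names, and function name are all hypothetical):

```python
from pathlib import Path

def split_export(export_path, out_dir, sep="|"):
    """Split a combined export (blob_id<sep>blob_cont per line) into
    one file per row, each named after its blob_id."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(export_path, encoding="utf-8") as fh:
        for line in fh:
            # Split on the first delimiter only, so the content may
            # itself contain the delimiter character.
            blob_id, blob_cont = line.rstrip("\n").split(sep, 1)
            (out / blob_id).write_text(blob_cont, encoding="utf-8")
```

Note this assumes blob_cont contains no embedded newlines; a real 4 MB character blob would need an escaping scheme or a length-prefixed format instead.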

Another answer might be to write the output to a type 19 (?) hashed file, which is basically a directory where every 'record' becomes a file inside that directory. Pretty sure it's a type 19, but again you may need to rename the files post-job if you have a particular naming scheme in mind, as I don't believe you can control the filenames.

Yet another answer may be (oddly enough) an XML Output stage with a 'trigger' column, just letting your data pass through the stage with no 'XML-ing' going on. It creates a new file whenever the value in the trigger column changes, and the trigger column doesn't need to be output.
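The trigger-column behaviour described above is the classic control-break pattern: rows are pre-sorted on the trigger, and a new output file is opened each time its value changes. Outside DataStage it can be sketched like this (function and file names are hypothetical):

```python
from pathlib import Path

def split_on_trigger(rows, out_dir):
    """rows: iterable of (trigger, data) pairs, pre-sorted by trigger.
    Writes consecutive rows sharing a trigger value into one file and
    opens a new file whenever the trigger value changes."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    current, fh = None, None
    for trigger, data in rows:
        if trigger != current:
            # Control break: close the old file, start a new one.
            if fh:
                fh.close()
            fh = open(out / f"{trigger}.txt", "w", encoding="utf-8")
            current = trigger
        fh.write(data + "\n")
    if fh:
        fh.close()
```

For the blobs requirement (exactly one row per file), every row would carry a distinct trigger value.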

Posted: Wed Jan 14, 2009 10:48 am
by chulett
Well... not sure I would have ponied up some of those thoughts if I'd known we were talking about two million files daily. Except perhaps academically. :wink:

I think your "Option 3" is a perfectly valid solution, kudos for coming up with that.

Posted: Wed Jan 14, 2009 11:21 am
by throbinson
Option 4
A single job that contains an output link to a Folder stage. This will write a unique file per row of incoming data, almost exactly like Option 3. The first column in the derivation would be the file path/name; the second, third, etc. are the data.

Posted: Wed Jan 21, 2009 5:13 am
by jonathanhale
Option 4 also works, though slightly slower than the routine version. In my testing I was not successful in passing the path as data, i.e. the path must be set by a job parameter; column 1 to the Folder stage becomes the filename, and column 2 etc. the file content.

Even though Parallel jobs are theoretically unlikely to be particularly helpful for this requirement, we are still interested in comparing the performance of parallel against server.

However, there is no Folder stage available for parallel jobs. Is there an equivalent or alternative? Can File Sets be used like this for output?

A parallel routine cannot be written in BASIC. Has anybody ever come across a BASIC-to-C++ converter? :D

Otherwise, I guess I need to write a C++ routine that sits on the server file system and is called from the parallel job? Is that the right theory?

Any other comments or remarks?

Posted: Wed Jan 21, 2009 5:28 am
by throbinson
A parameter for the path will work in the Folder Path name on the Folder stage's Properties tab. Didn't this work for you? I do not know a way to replicate the Folder stage write in EE, although I am sure that cat can be skinned.