How to split data into files?

vivek_rs
Participant
Posts: 37
Joined: Thu Nov 25, 2004 8:44 pm
Location: Bangalore, Karnataka, India

Post by vivek_rs »

Hi,
I have an input file that has to be split into different files depending on the value of the fifth field. The problem is that the number of files is determined by another sequential file, which contains the values that the fifth field can take.

The sequential file's contents are as follows:
account
address
registration

So, if the fifth field is 'account', the row should go into account.txt.
If it is 'address', the row should go into address.txt.

The data in the sequential file can change.

I'm in a hurry.
Please help!!!
Regards,
Vivek RS
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

vivek_rs,

Two ways that immediately come to mind are to use a Transformer stage and split the rows using constraints, or to use the stage in PX that is designed for this, the "Switch" stage. See the documentation for details and examples.
vivek_rs
Participant
Posts: 37
Joined: Thu Nov 25, 2004 8:44 pm
Location: Bangalore, Karnataka, India

Post by vivek_rs »

ArndW,
The problem is that I do not know exactly how many output files there are.
The sequential file has to be read, and there must be as many output files as there are rows in that file. How do I do that?
Regards,
Vivek RS
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Vivek,

Aha, the plot thickens :D That is a bit more difficult, especially dynamically in a single pass. Offhand, I think you might need to make two passes over your main data, with the first pass collecting a complete list of possible values. I also just thought about using PX's built-in partitioning for datasets; I recall that you can supply your own (C-language) formula, so you might be able to do a dynamic allocation that way. Hopefully someone on this forum has done that or knows how...
vigneshra
Participant
Posts: 86
Joined: Wed Jun 09, 2004 6:07 am
Location: Chennai

Post by vigneshra »

Vivek,

As far as I can tell, this is not possible in DataStage, because each output file you write to has to be determined at job development time. One thing I can suggest is doing it with DOS batch files. Please try that and let us know the result.
Vignesh.

"A conclusion is simply the place where you got tired of thinking."
vivek_rs
Participant
Posts: 37
Joined: Thu Nov 25, 2004 8:44 pm
Location: Bangalore, Karnataka, India

Post by vivek_rs »

Thanks a lot guys...
I guess I'll have to figure out another design then.
I'll try the approach suggested sometime later.
Thanks a lot anyways...
Regards,
Vivek RS
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Use any normal method (such as an Aggregator stage) to identify the distinct file values.

Use the 'Multiple Instance' property to create multiple files from the same source file, and use a parameter for the target file. This parameter can take its value from the distinct values mentioned above.

As you may be using a version newer than 7.5, you can use the looping mechanism in the sequencer for this purpose.

But note that Windows may lock the file if multiple processes access it at the same time.
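
The first step above (finding the distinct values of the fifth field) can be sketched outside DataStage in a few lines of Python. This is only an illustration of the logic, not DataStage code: the function name `distinct_fifth_field`, the comma delimiter, and the sample rows are all assumptions made for the example.

```python
import csv
import io

def distinct_fifth_field(lines, delimiter=","):
    """Return the distinct values of the fifth field, in first-seen order."""
    seen = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) >= 5:                  # skip rows that are too short
            seen.setdefault(row[4], None)  # a dict preserves insertion order
    return list(seen)

# Hypothetical sample data; the real file layout is not given in the thread.
sample = io.StringIO(
    "1,a,b,c,account\n"
    "2,a,b,c,address\n"
    "3,a,b,c,account\n"
)
print(distinct_fifth_field(sample))  # ['account', 'address']
```

Each value in the resulting list would then drive one job instance, with the value passed in as the parameter for the target file.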
vivek_rs
Participant
Posts: 37
Joined: Thu Nov 25, 2004 8:44 pm
Location: Bangalore, Karnataka, India

Post by vivek_rs »

This seems to be a good idea.
I'm using 7.1. So, I'll have to write a Job Control that reads a sequential file and calls multiple instances of the job to extract different segments into different files.
Can anyone think of anything better?
TIA
Regards,
Vivek RS
chalasaniamith
Participant
Posts: 36
Joined: Wed Feb 16, 2005 5:20 pm
Location: IL

File

Post by chalasaniamith »

vivek_rs wrote:This seems to be a good idea.
I'm using 7.1. So, I'll have to write a Job Control that reads a sequential file and calls multiple instances of the job to extract different segments into different files.
Can anyone think of anything better?
TIA
In case you are using a Transformer, you can make the output file name a parameter so that the job writes to a specified file.
Let me know if I am wrong.
vivek_rs
Participant
Posts: 37
Joined: Thu Nov 25, 2004 8:44 pm
Location: Bangalore, Karnataka, India

Post by vivek_rs »

There are two parameters:
one specifies the segment I am supposed to extract;
the other specifies the file into which the rows have to go.
The job control calls multiple instances of the job with different parameters.
Regards,
Vivek RS
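
The job-control design described above can be pictured with a small Python sketch. It is not DataStage job control code: `extract_segment` stands in for one instance of the parameterised job (segment value and target file are its two parameters), and `run_all_segments` stands in for the job control that reads the list file. Both names, the comma delimiter, and the sample data are assumptions for illustration.

```python
import csv
import os
import tempfile

def extract_segment(source_path, segment, target_path, delimiter=","):
    """Stand-in for one job instance: copy rows whose fifth field equals segment."""
    written = 0
    with open(source_path, newline="") as src, \
         open(target_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        for row in csv.reader(src, delimiter=delimiter):
            if len(row) >= 5 and row[4] == segment:
                writer.writerow(row)
                written += 1
    return written

def run_all_segments(source_path, control_path, out_dir):
    """Stand-in for the job control: one extract_segment call per list-file line."""
    counts = {}
    with open(control_path) as control:
        for line in control:
            segment = line.strip()
            if segment:  # ignore blank lines in the list file
                target = os.path.join(out_dir, segment + ".txt")
                counts[segment] = extract_segment(source_path, segment, target)
    return counts

# Hypothetical demo data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "input.txt")
    control = os.path.join(tmp, "segments.txt")
    with open(source, "w", newline="") as f:
        f.write("1,a,b,c,account\n2,a,b,c,address\n3,a,b,c,account\n")
    with open(control, "w") as f:
        f.write("account\naddress\nregistration\n")
    print(run_all_segments(source, control, tmp))
    # {'account': 2, 'address': 1, 'registration': 0}
```

Note the cost of this design: the source file is read once per value in the list file, so the number of passes grows with the number of output files.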
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

You will need to create a buildop or custom stage to do this type of task.

Just pre-sort the data and feed it into a buildop that opens a file and watches the data flow through; when the key value changes, it closes the file and opens a new one.

Buildop definitely will serve you well on this one.
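
The open/close-on-key-change logic can be sketched as follows. A real buildop would be written in C++ against the PX framework; this Python version only illustrates the control flow. The function name `split_sorted_rows`, the use of the fifth field as the key, and the assumption that key values are safe to use as file names are all hypothetical.

```python
import csv
import os
import tempfile

def split_sorted_rows(rows, out_dir):
    """Write rows (already sorted by the fifth field) to one file per key value.

    Only one output file is open at a time. If the input is NOT sorted, a key
    that reappears later would reopen its file in "w" mode and overwrite it.
    """
    current_key = None
    out = None
    writer = None
    paths = []
    try:
        for row in rows:
            key = row[4]
            if key != current_key:
                if out is not None:
                    out.close()          # key changed: close the previous file
                path = os.path.join(out_dir, key + ".txt")
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                paths.append(path)
                current_key = key
            writer.writerow(row)
    finally:
        if out is not None:
            out.close()                  # close the last file
    return paths

# Hypothetical demo: sort first, then split in a single pass.
rows = [
    ["1", "a", "b", "c", "account"],
    ["2", "a", "b", "c", "address"],
    ["3", "a", "b", "c", "account"],
]
with tempfile.TemporaryDirectory() as tmp:
    created = split_sorted_rows(sorted(rows, key=lambda r: r[4]), tmp)
    print([os.path.basename(p) for p in created])
```

Because the file list is discovered from the data itself, this approach needs no separate list file and handles an unknown number of output files in one pass over the sorted input.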
mujeebur
Participant
Posts: 46
Joined: Sun Mar 06, 2005 3:02 pm
Location: Philly,USA

Post by mujeebur »

Hi,

I guess this can be done as follows:

Sort the file and pass it to a Transformer stage. Use a stage variable to compare each row with the previous one; if the value changes, write to a different file, otherwise write to the same file. That way you can have account.txt, address.txt, etc. by using the constraint mechanism.
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

mujeebur wrote: Sort the file and pass it to a Transformer stage. Use a stage variable to compare each row with the previous one; if the value changes, write to a different file, otherwise write to the same file. That way you can have account.txt, address.txt, etc. by using the constraint mechanism.
This does not address the "unknown number of files" condition.
vivek_rs
Participant
Posts: 37
Joined: Thu Nov 25, 2004 8:44 pm
Location: Bangalore, Karnataka, India

Post by vivek_rs »

As of now, I am using multiple instances. I'd love to use custom stages, but I do not know how to use them.
Does anyone have any documentation or knowledge articles on how to write custom stages and buildops?
Regards,
Vivek RS
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

The following document (part of your DataStage PDF library), parjdev.pdf, has a chapter on this: Chapter 55: Specifying your own parallel stages.

How well do you know C++?

There are sample files within the PXEngine directory that you could use for custom stages.