creating many output files

veerabusani185512
Participant
Posts: 11
Joined: Fri Jan 30, 2009 3:21 am

creating many output files

Post by veerabusani185512 »

Hi,

I have one source file (a dataset) and I need to create many output files from it, based on different conditions.

There are around 1000 conditions and I need to create one output file for each condition.

I am thinking of creating 30 jobs, each of which will create 30 or 40 output files.

Is there any other way to do this?

Thanks & Regards,
Veera.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

I would suggest attaching a condition ID of some sort to each row of your data and then using an OS-level command to split the file into many files.
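A minimal sketch of that idea, assuming the job writes a comma-delimited file whose first column is the condition ID. The directory, file names and sample rows below are hypothetical, made up for illustration only:

```shell
#!/bin/sh
# Hypothetical working directory and sample data, for illustration only.
work="${TMPDIR:-/tmp}/split_by_id_demo"
rm -rf "$work" && mkdir -p "$work/out"

# Stand-in for the file the job would produce: condition_id,col1,col2
cat > "$work/tagged.csv" <<'EOF'
C001,alice,10
C002,bob,20
C001,carol,30
EOF

# Write each row to out/<condition_id>.txt, dropping the ID column.
# >> appends, so rows written earlier in this run are kept (out/ was
# emptied above). close() after each write keeps only one file open.
awk -F, -v dir="$work/out" '{
    f = dir "/" $1 ".txt"
    print substr($0, length($1) + 2) >> f
    close(f)
}' "$work/tagged.csv"
```

Closing the file after every row trades speed for safety against the open-file limit. With around 1000 IDs and a very large input, sorting on the ID first and closing only when the ID changes is likely to be much faster.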
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Re: creating many output files

Post by PhilHibbs »

The last time I had to do something like this I did it in a Routine in a Server Job. The client pushed back as everything was supposed to be Parallel Jobs, but we persuaded them that a Parallel Job was an inappropriate choice for this operation, and that a Server Job was an acceptable solution. Another option would be to do it in a shell script.
Phil Hibbs | Capgemini
Technical Consultant
veerabusani185512
Participant
Posts: 11
Joined: Fri Jan 30, 2009 3:21 am

Re: creating many output files

Post by veerabusani185512 »

Thanks for the replies, Phil and DSguru. I had also suggested a Server job to my client, but the client did not accept it because the source data is 1.6 billion records. So I am planning to do it in a shell script. Thanks for the suggestion.
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Re: creating many output files

Post by samyamkrishna »

You can do it like this:

Code:

dataset ----- transformer -------- seqfile

In the transformer, derive a single output column:

If [condition 1] Then "echo ":inputlink.col1:inputlink.col2:inputlink.col3...:">/directory/file1.txt"
Else If [condition 2] Then "echo ":inputlink.col1:inputlink.col2:inputlink.col3...:">/directory/file2.txt"
.
.
.
Else ''

Then, in the after-job subroutine, run ExecSH with the command: sh < seqfile.txt

This should work ...
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Re: creating many output files

Post by PhilHibbs »

samyamkrishna wrote:

Code:

If [condition 1] Then "echo ":inputlink.col1:inputlink.col2:inputlink.col3...:">/directory/file1.txt"
Else If [condition 2] Then "echo ":inputlink.col1:inputlink.col2:inputlink.col3...:">/directory/file2.txt"

That should be >> not >, otherwise each row overwrites the file instead of appending to it.
Phil Hibbs | Capgemini
Technical Consultant
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Re: creating many output files

Post by samyamkrishna »

Yes, sorry, my bad.
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Re: creating many output files

Post by PhilHibbs »

...although each output file will need to be deleted beforehand, unless you can detect the first occurrence of each rule and do a > on the first row and a >> on each subsequent one. Personally, I wouldn't do it by generating a shell script like this. I'd just generate the CSV and then handle the splitting in a separate shell script. That way you aren't tying the DataStage build to a particular shell.
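The delete-then-append variant could look roughly like the sketch below. It assumes the job has already written one echo command per row into the command file; the directory, file names and rows are hypothetical stand-ins, not taken from the real job:

```shell
#!/bin/sh
# Hypothetical working directory, for illustration only.
work="${TMPDIR:-/tmp}/append_demo"
mkdir -p "$work"

# Leftover from an imagined earlier run; without cleanup, >> would append to it.
echo "stale row" > "$work/file1.txt"

# Delete old outputs first so that >> starts from empty files.
rm -f "$work"/file*.txt

# Stand-in for the command file the job would write (one echo per row).
cat > "$work/seqfile.txt" <<EOF
echo "a,1" >> "$work/file1.txt"
echo "b,2" >> "$work/file2.txt"
echo "c,3" >> "$work/file1.txt"
EOF

# Same command the after-job subroutine would run.
sh < "$work/seqfile.txt"
```

After this runs, file1.txt holds only the two rows from the current run, because the stale file was removed before anything was appended.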
Phil Hibbs | Capgemini
Technical Consultant
netgurutoo
Participant
Posts: 6
Joined: Wed Mar 09, 2005 9:35 am

Re: creating many output files

Post by netgurutoo »

If you need to do this in DataStage, you can simply use a job with a Filter stage after the file. A Filter stage can have many output links, and each one only needs a where clause such as ColumnA = 123 (leave out the word "where"). The trick will be keeping track of which link number goes with each file. I would write them down one at a time as you place the links, starting with 0: the first link you attach will be 0, then 1, 2, 3, and so on.

This should be an efficient way of doing it, and the data should be sorted on your key. You don't want to use a Transformer if there are no transformations.
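One way to keep that link bookkeeping outside the Designer is to hold the where clauses in a text file, in the order the links are attached, and number them from 0. This is only a sketch: the file names and the clauses are hypothetical examples.

```shell
#!/bin/sh
# Hypothetical working directory, for illustration only.
work="${TMPDIR:-/tmp}/filter_links_demo"
mkdir -p "$work"

# Hypothetical list of where clauses, in the order the links are attached.
cat > "$work/where_clauses.txt" <<'EOF'
ColumnA = 123
ColumnA = 456
ColumnA = 789
EOF

# Number the clauses from 0 so each line shows link number, tab, clause.
awk '{ printf "%d\t%s\n", NR - 1, $0 }' "$work/where_clauses.txt" > "$work/link_map.txt"
cat "$work/link_map.txt"
```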
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Re: creating many output files

Post by PhilHibbs »

netgurutoo wrote:
If you need to do this in DataStage, you can simply use a job with a Filter stage after the file. A Filter stage can have many output links, and each one only needs a where clause such as ColumnA = 123 (leave out the word "where"). The trick will be keeping track of which link number goes with each file. I would write them down one at a time as you place the links, starting with 0: the first link you attach will be 0, then 1, 2, 3, and so on.

I think the question was to find out if there is a way of doing it without around 1000 output links in a job.
Phil Hibbs | Capgemini
Technical Consultant
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

You could use an External Target stage and pass it a single column that starts with the fully qualified file name, followed by the rest of your columns (with whatever delimiter your target file needs).

I think you would also want to sort the incoming data by that first column, but I am not sure it's entirely necessary.

In the External Target stage, set the Target Method to Specific Program and use the following code as the destination program. The example assumes a comma as the delimiter.

Code:

awk '{ nPosField1 = index($0, ","); print substr($0, nPosField1 + 1) > (substr($0, 1, nPosField1 - 1)) }'

Or you could just do it in Unix directly.
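If the rows are sorted on the file name, the same awk idea can close each file as soon as the name changes, so only one output file is open at a time. The sketch below runs directly in Unix under that assumption; the directory, file names and rows are hypothetical, and it uses >> so that a file name that reappears later is appended to rather than truncated:

```shell
#!/bin/sh
# Hypothetical working directory, for illustration only.
work="${TMPDIR:-/tmp}/target_split_demo"
rm -rf "$work" && mkdir -p "$work"

# Stand-in for the rows sent to the stage: <full file name>,<rest of the row>,
# already sorted on the file name.
printf '%s\n' \
  "$work/east.txt,alice,10" \
  "$work/east.txt,carol,30" \
  "$work/west.txt,bob,20" |
awk '{
    n = index($0, ",")              # position of the first delimiter
    file = substr($0, 1, n - 1)     # everything before it is the file name
    if (file != prev) {             # file name changed: close the old file
        if (prev != "") close(prev)
        prev = file
    }
    print substr($0, n + 1) >> file # write the row without the file name
}'
```

Because it appends, the output directory has to be emptied before each run, which the first lines of the sketch do.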