
creating many output files

Posted: Wed Jul 13, 2011 8:51 am
by veerabusani185512
Hi,

I have one source file (a dataset), and I need to create many output files from this source file based on different conditions.

There are around 1000 conditions and I need to create 1 output file for each condition.

I am thinking of creating 30 jobs, with each job creating 30 or 40 output files.

Is there any other way to do this?

Thanks & Regards,
Veera.

Posted: Wed Jul 13, 2011 9:32 am
by DSguru2B
I would suggest attaching a condition ID of some sort to your data and then using an OS-level command to split your file into many files.
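
For example, a minimal sketch of that approach, assuming the condition ID has been prepended as the first comma-separated field and that file names like split_<id>.txt are acceptable:

Code: Select all

# Remove any previous outputs so the appends start clean.
rm -f split_*.txt
# Write each record to the file named after its condition ID (field 1);
# close() after each write avoids open-file limits with ~1000 IDs.
awk -F',' '{ out = "split_" $1 ".txt"; print >> out; close(out) }' data.csv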

Re: creating many output files

Posted: Wed Jul 13, 2011 9:33 am
by PhilHibbs
The last time I had to do something like this I did it in a Routine in a Server Job. The client pushed back as everything was supposed to be Parallel Jobs, but we persuaded them that a Parallel Job was an inappropriate choice for this operation, and that a Server Job was an acceptable solution. Another option would be to do it in a shell script.

Re: creating many output files

Posted: Thu Jul 14, 2011 2:55 am
by veerabusani185512
Thanks for the reply, Phil and DSguru. I had also suggested a Server job to my client, but the client did not accept it as the source data is 1.6 billion records. So I am planning to do it in a shell script. Thanks for the suggestion.

Re: creating many output files

Posted: Thu Jul 14, 2011 3:47 am
by samyamkrishna
You can do this:

Code: Select all

dataset ----- transformer ----- seqfile

In the transformer, use a derivation like:

Code: Select all

If [condition 1] Then "echo ":inputlink.col1:" ":inputlink.col2:" ":inputlink.col3:...:" > /directory/file1.txt"
Else If [condition 2] Then "echo ":inputlink.col1:" ":inputlink.col2:" ":inputlink.col3:...:" > /directory/file2.txt"
.
.
.
Else ''

Then in the after-job subroutine, run ExecSH with "sh < seqfile.txt".
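
For illustration (with hypothetical column values), the generated seqfile would then contain one shell command per row, like:

Code: Select all

echo val1 val2 val3 > /directory/file1.txt
echo val4 val5 val6 > /directory/file2.txt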

This should work ...

Re: creating many output files

Posted: Thu Jul 14, 2011 4:54 am
by PhilHibbs
samyamkrishna wrote:

Code: Select all

If [condition 1] Then "echo ":inputlink.col1:" ":inputlink.col2:" ":inputlink.col3:...:" > /directory/file1.txt"
Else If [condition 2] Then "echo ":inputlink.col1:" ":inputlink.col2:" ":inputlink.col3:...:" > /directory/file2.txt"
That should be >> not >

Re: creating many output files

Posted: Thu Jul 14, 2011 5:10 am
by samyamkrishna
Ya, sorry, my bad.

Re: creating many output files

Posted: Thu Jul 14, 2011 5:14 am
by PhilHibbs
...although the file will need to be deleted beforehand, unless you can detect the first occurrence of each rule and do a > on the first and a >> on each subsequent. Personally I wouldn't do it by generating a shell script like this, I'd just generate the CSV and then handle the splitting in a shell script - that way you aren't tying the DataStage build to a particular shell.
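
A minimal sketch of that shell-script approach, assuming the generated CSV carries the condition ID as its first field and a hypothetical output directory:

Code: Select all

# Truncate each file on its first write (>), append afterwards (>>),
# so no pre-delete step is needed; close() keeps the number of
# simultaneously open files low with ~1000 conditions.
awk -F',' '{
    out = "/output/dir/file_" $1 ".txt"
    if (out in seen) { print >> out }
    else { print > out; seen[out] = 1 }
    close(out)
}' data.csv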

Re: creating many output files

Posted: Thu Jul 14, 2011 5:34 am
by netgurutoo
If you need to do this in DataStage, you can simply use jobs that use a Filter stage after the file. You can have many output links from the Filter stage, and you only need to put a where clause like ColumnA = 123 on each (omitting the WHERE keyword). The trick will be keeping track of which link numbers go with each file. I would write them down one at a time as you place the links, starting with 0: the first link you attach will be 0, then 1, 2, 3, and so on.

This should be an efficient way of doing it, and the data should be sorted on your key. You don't want to use a transformer if there are no transformations.
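
For illustration (with hypothetical conditions and file names), the link bookkeeping would look something like:

Code: Select all

Where clause: ColumnA = 1    -> output link 0 -> file1.txt
Where clause: ColumnA = 2    -> output link 1 -> file2.txt
Where clause: ColumnA = 3    -> output link 2 -> file3.txt
...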

Re: creating many output files

Posted: Thu Jul 14, 2011 5:42 am
by PhilHibbs
netgurutoo wrote: If you need to do this in DataStage, you can simply use jobs that use a Filter stage after the file. You can have many output links from the Filter stage, and you only need to put a where clause like ColumnA = 123 on each (omitting the WHERE keyword). The trick will be keeping track of which link numbers go with each file. I would write them down one at a time as you place the links, starting with 0: the first link you attach will be 0, then 1, 2, 3, and so on.
I think the question was to find out if there is a way of doing it without around 1000 output links in a job.

Posted: Mon Jul 18, 2011 3:52 pm
by ShaneMuir
You could use an External Target stage: pass it one column which starts with the fully qualified file name, followed by the rest of your columns (with whatever delimiter your target file needs).

I think you would also want to look at sorting the incoming data by the first column here, but I am not sure it's entirely necessary.

In the destination program of the External Target stage, you would set your Target method to Specific program and use the following code, which uses a comma as the delimiter.

Code: Select all

awk '{nPosField1=index($0,",");print substr($0,nPosField1+1)>substr($0,1,nPosField1-1)}'
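That one-liner takes everything before the first comma as the output file name and writes the remainder of the record to that file. One caution: with around 1000 distinct output files, some awk implementations can hit the per-process open-file limit (the close() pattern sketched earlier avoids this).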
Or you could just do it in UNIX directly.