creating many output files

veerabusani185512 · Post by **veerabusani185512** » Wed Jul 13, 2011 8:51 am

Hi,

I have one source file (a dataset) and I need to create many output files from it, based on different conditions.

There are around 1000 conditions and I need to create one output file for each condition.

I am thinking of creating 30 jobs, each of which will create 30 or 40 output files.

Is there any other way to do this?

Thanks & Regards,
Veera.

DSguru2B · Post by **DSguru2B** » Wed Jul 13, 2011 9:32 am

I would suggest attaching a condition id of some sort to your data and then using an OS-level command to split the file into many files.
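A minimal sketch of that idea, assuming the job writes a comma-delimited file whose first column is the condition id. The directory, file names, and sample rows are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample data: the first column is the condition id attached by the job.
workdir=./split_by_id_demo
mkdir -p "$workdir"
printf '%s\n' '1,alice' '2,bob' '1,carol' > "$workdir/tagged.csv"

# Write each record to a file named after its condition id.
# Within one awk run, ">" truncates a file the first time it is opened
# and appends on later writes to the same name.
awk -F',' -v dir="$workdir" '{ print > (dir "/cond_" $1 ".txt") }' "$workdir/tagged.csv"
```

After the run, cond_1.txt holds the two rows tagged 1 and cond_2.txt holds the row tagged 2.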

PhilHibbs · Post by **PhilHibbs** » Wed Jul 13, 2011 9:33 am

The last time I had to do something like this I did it in a Routine in a Server Job. The client pushed back as everything was supposed to be Parallel Jobs, but we persuaded them that a Parallel Job was an inappropriate choice for this operation, and that a Server Job was an acceptable solution. Another option would be to do it in a shell script.

veerabusani185512 · Post by **veerabusani185512** » Thu Jul 14, 2011 2:55 am

Thanks for the replies, Phil and DSguru. I had also suggested the Server Job to my client, but the client did not accept it because the source data is 1.6 billion records. So I am planning to do it in a shell script. Thanks for the suggestions.

samyamkrishna · Post by **samyamkrishna** » Thu Jul 14, 2011 3:47 am

You can do this:

Code:

dataset ----- transformer -------- seqfile

In the transformer, use a derivation like this:

If [condition 1] Then "echo ":inputlink.col1: inputlink.col2:inputlink.col3...:">/directorey/file1.txt"
else If[ condition 2] Then "echo ":inputlink.col1: inputlink.col2:inputlink.col3...:">/directorey/file2.txt"
.
.
.
.
.
.
else ''

Then, in the after-job subroutine, use ExecSH to run: sh < seqfile.txt

This should work ...

PhilHibbs · Post by **PhilHibbs** » Thu Jul 14, 2011 4:54 am

samyamkrishna wrote:

Code:

If [condition 1] Then "echo ":inputlink.col1: inputlink.col2:inputlink.col3...:">/directorey/file1.txt"
else If[ condition 2] Then "echo ":inputlink.col1: inputlink.col2:inputlink.col3...:">/directorey/file2.txt"

That should be >> (append), not > (overwrite).

samyamkrishna · Post by **samyamkrishna** » Thu Jul 14, 2011 5:10 am

Yes, sorry, my bad.

PhilHibbs · Post by **PhilHibbs** » Thu Jul 14, 2011 5:14 am

...although the file will need to be deleted beforehand, unless you can detect the first occurrence of each rule and do a > on the first and a >> on each subsequent one. Personally I wouldn't do it by generating a shell script like this; I'd just generate the CSV and then handle the splitting in a shell script. That way you aren't tying the DataStage build to a particular shell.
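The first-occurrence idea can be sketched in awk, assuming a hypothetical comma-delimited file whose first column names the rule. The sketch closes each file after every write and uses an array to remember which files it has already started; all names and rows are invented for the example:

```shell
#!/bin/sh
workdir=./first_seen_demo
mkdir -p "$workdir"
# A stale output from an earlier run, to show that it gets overwritten.
echo 'stale row' > "$workdir/rule_a.txt"
printf '%s\n' 'a,1' 'b,2' 'a,3' > "$workdir/rules.csv"

awk -F',' -v dir="$workdir" '
{
    out = dir "/rule_" $1 ".txt"
    if (out in started) {
        print >> out          # later rows for this rule: append
    } else {
        print > out           # first row for this rule: truncate old contents
        started[out] = 1
    }
    close(out)                # file is reopened on next use, so > versus >> matters
}' "$workdir/rules.csv"
```

Closing after every row is slow on large inputs; it is done here only to make the > versus >> distinction visible and to keep a single file open at a time.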

netgurutoo · Post by **netgurutoo** » Thu Jul 14, 2011 5:34 am

If you need to do this in DataStage, you can simply use a job with a Filter stage after the file. The Filter stage can have many output links, and each one only needs a where clause such as ColumnA = 123 (leave out the word "where" itself). The trick will be keeping track of which link number goes with each file. I would write them down one at a time as you place the links, starting with 0: the first link you attach will be 0, then 1, 2, 3, and so on.

This should be an efficient way of doing it, and the data should be sorted on your key. You don't want to use a transformer if there are no transformations.

PhilHibbs · Post by **PhilHibbs** » Thu Jul 14, 2011 5:42 am

netgurutoo wrote: If you need to do this in DataStage, you can simply use a job with a Filter stage after the file. The Filter stage can have many output links, and each one only needs a where clause such as ColumnA = 123 (leave out the word "where" itself). The trick will be keeping track of which link number goes with each file. I would write them down one at a time as you place the links, starting with 0: the first link you attach will be 0, then 1, 2, 3, and so on.

I think the question was to find out if there is a way of doing it without around 1000 output links in a job.

ShaneMuir · Post by **ShaneMuir** » Mon Jul 18, 2011 3:52 pm

You could use an External Target stage: pass it one column that starts with the fully qualified file name, followed by the rest of your columns (with whatever delimiter your target file needs).

I think you would also want to look at sorting the incoming data by the first column, but I am not sure it's entirely necessary.

In the External Target stage, set the Target method to Specific program and use the following code as the destination program. This example uses a comma as the delimiter.

Code:

awk '{ nPosField1 = index($0, ","); print substr($0, nPosField1 + 1) > (substr($0, 1, nPosField1 - 1)) }'

Or you could just do it in unix directly.
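If the data is sorted on the first column, a variant of the awk command can close each output file as soon as its key ends, so only one file is open at a time. This is a sketch under that assumption, with made-up file names and rows; with unsorted input it would overwrite earlier rows for a key.

```shell
#!/bin/sh
workdir=./sorted_split_demo
mkdir -p "$workdir"
# Hypothetical input: target file name, a comma, then the record; sorted by name.
printf '%s\n' \
    "$workdir/x.txt,1,foo" \
    "$workdir/x.txt,2,bar" \
    "$workdir/y.txt,3,baz" > "$workdir/input.csv"

awk '{
    pos  = index($0, ",")
    name = substr($0, 1, pos - 1)
    if (name != prev) {
        if (prev != "") close(prev)   # finished with the previous key
        prev = name
    }
    print substr($0, pos + 1) > name  # write the record without the file name
}' "$workdir/input.csv"
```

Without the close(), around 1000 output files would stay open until awk exits, which may exceed the open-file limit of some awk implementations or of the process.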
