Page 1 of 1

Splitting a file into multiple files based on first column

Posted: Thu Jun 29, 2006 8:52 am
by ram9009
Hi,

I have a big file ( '|' delimted ). I need to split this file into multiple files based on the value in the first column.

all the rows with the same first column value should go into one file.|77

there are 314 diffrent values in first column so the big file needs to split in to 314 different files.


example:


xx|33|55|66|ggg|ddfc|
xx|67|67|89||568|fdsk|
zz|44|67|55||568|fdsk|
yy|456|jdfjd|dfksd|567|67|
xx|45|67|49||588|fdsk|
zz|67|67|89||568|fdsk|



output files are 3

xx.txt

xx|33|55|66|ggg|ddfc|
xx|67|67|89||568|fdsk|
xx|45|67|49||588|fdsk|

yy.txt

yy|456|jdfjd|dfksd|567|67|

zz.txt

zz|67|67|89||568|fdsk|
zz|44|67|55||568|fdsk|


Thanks,

Ram

Posted: Thu Jun 29, 2006 9:28 am
by dsdesigner
My 2 cents,
I would not do it using Datastage, simply because it will be too tedious and the job very cumbersome.

I would rather use a shell script (perl script / a java program) to achieve what you want to. A simple algorithm would be as follows

Preferably (but not necessary) Sort the data based on the first column.

Read each line line from the master file (file containing all the records).

Read the value in the first column using the cut command. Assign this to a variable. This will be the file name you want to write to.

Append the row to the file identified by the variable.

This might be time consuming depending upon the number of rows you have, but in my opinion elegant.

Hope this helps.

Thanks,
Shekar

Posted: Thu Jun 29, 2006 9:45 am
by DSguru2B
Welcome aboard Ram 8)
This is something that you want to keep outside datastage. Write a small shell script. That will get you going.
From the top of my head, something like

Code: Select all

#!/usr/bin/ksh
awk -F"|" '{ print $1 }' myfile.txt | sort | uniq > uniqfile.txt
cat uniqfile.txt | while read filenames
do
cat myfile.txt | grep ${filenames} > ${filenames}.txt
done
myfile.txt is your "huge" file. Provide a fully qualified path to it.

Posted: Thu Jun 29, 2006 9:53 am
by kcbland
Try out using the Folder stage to do the split out on the first column.

Posted: Thu Jun 29, 2006 10:09 am
by ram9009
Hi,
We already have a perl/shell script program doing that.But we wanted to move to datastage just to keep all the code in datstage if performance is same as perl.

Anyway we keep it in perl.

Thanks for quick reply

- Ram

Posted: Thu Jun 29, 2006 10:10 am
by DSguru2B
If you want to keep everything in datastage, then go for Ken's suggestion.