Splitting a file into multiple files based on first column

ram9009 · Post by **ram9009** » Thu Jun 29, 2006 8:52 am

Hi,

I have a big file ( '|' delimted ). I need to split this file into multiple files based on the value in the first column.

all the rows with the same first column value should go into one file.|77

there are 314 diffrent values in first column so the big file needs to split in to 314 different files.

example:

xx|33|55|66|ggg|ddfc|
xx|67|67|89||568|fdsk|
zz|44|67|55||568|fdsk|
yy|456|jdfjd|dfksd|567|67|
xx|45|67|49||588|fdsk|
zz|67|67|89||568|fdsk|

output files are 3

xx.txt

xx|33|55|66|ggg|ddfc|
xx|67|67|89||568|fdsk|
xx|45|67|49||588|fdsk|

yy.txt

yy|456|jdfjd|dfksd|567|67|

zz.txt

zz|67|67|89||568|fdsk|
zz|44|67|55||568|fdsk|

Thanks,

Ram

dsdesigner · Post by **dsdesigner** » Thu Jun 29, 2006 9:28 am

My 2 cents,
I would not do it using Datastage, simply because it will be too tedious and the job very cumbersome.

I would rather use a shell script (perl script / a java program) to achieve what you want to. A simple algorithm would be as follows

Preferably (but not necessary) Sort the data based on the first column.

Read each line line from the master file (file containing all the records).

Read the value in the first column using the cut command. Assign this to a variable. This will be the file name you want to write to.

Append the row to the file identified by the variable.

This might be time consuming depending upon the number of rows you have, but in my opinion elegant.

Hope this helps.

Thanks,
Shekar

DSguru2B · Post by **DSguru2B** » Thu Jun 29, 2006 9:45 am

Welcome aboard Ram

This is something that you want to keep outside datastage. Write a small shell script. That will get you going.
From the top of my head, something like

Code: Select all

#!/usr/bin/ksh
awk -F"|" '{ print $1 }' myfile.txt | sort | uniq > uniqfile.txt
cat uniqfile.txt | while read filenames
do
cat myfile.txt | grep ${filenames} > ${filenames}.txt
done

myfile.txt is your "huge" file. Provide a fully qualified path to it.

kcbland · Post by **kcbland** » Thu Jun 29, 2006 9:53 am

Try out using the Folder stage to do the split out on the first column.

ram9009 · Post by **ram9009** » Thu Jun 29, 2006 10:09 am

Hi,
We already have a perl/shell script program doing that.But we wanted to move to datastage just to keep all the code in datstage if performance is same as perl.

Anyway we keep it in perl.

Thanks for quick reply

- Ram

DSguru2B · Post by **DSguru2B** » Thu Jun 29, 2006 10:10 am

If you want to keep everything in datastage, then go for Ken's suggestion.