Splitting a file into multiple files based on first column

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
ram9009
Participant
Posts: 3
Joined: Wed Nov 24, 2004 8:49 am

Splitting a file into multiple files based on first column

Post by ram9009 »

Hi,

I have a big file ( '|' delimted ). I need to split this file into multiple files based on the value in the first column.

all the rows with the same first column value should go into one file.|77

there are 314 diffrent values in first column so the big file needs to split in to 314 different files.


example:


xx|33|55|66|ggg|ddfc|
xx|67|67|89||568|fdsk|
zz|44|67|55||568|fdsk|
yy|456|jdfjd|dfksd|567|67|
xx|45|67|49||588|fdsk|
zz|67|67|89||568|fdsk|



output files are 3

xx.txt

xx|33|55|66|ggg|ddfc|
xx|67|67|89||568|fdsk|
xx|45|67|49||588|fdsk|

yy.txt

yy|456|jdfjd|dfksd|567|67|

zz.txt

zz|67|67|89||568|fdsk|
zz|44|67|55||568|fdsk|


Thanks,

Ram
dsdesigner
Participant
Posts: 34
Joined: Thu Jul 29, 2004 1:03 pm

Post by dsdesigner »

My 2 cents,
I would not do it using Datastage, simply because it will be too tedious and the job very cumbersome.

I would rather use a shell script (perl script / a java program) to achieve what you want to. A simple algorithm would be as follows

Preferably (but not necessary) Sort the data based on the first column.

Read each line line from the master file (file containing all the records).

Read the value in the first column using the cut command. Assign this to a variable. This will be the file name you want to write to.

Append the row to the file identified by the variable.

This might be time consuming depending upon the number of rows you have, but in my opinion elegant.

Hope this helps.

Thanks,
Shekar
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Welcome aboard Ram 8)
This is something that you want to keep outside datastage. Write a small shell script. That will get you going.
From the top of my head, something like

Code: Select all

#!/usr/bin/ksh
awk -F"|" '{ print $1 }' myfile.txt | sort | uniq > uniqfile.txt
cat uniqfile.txt | while read filenames
do
cat myfile.txt | grep ${filenames} > ${filenames}.txt
done
myfile.txt is your "huge" file. Provide a fully qualified path to it.
Last edited by DSguru2B on Thu Jun 29, 2006 9:55 am, edited 1 time in total.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Try out using the Folder stage to do the split out on the first column.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ram9009
Participant
Posts: 3
Joined: Wed Nov 24, 2004 8:49 am

Post by ram9009 »

Hi,
We already have a perl/shell script program doing that.But we wanted to move to datastage just to keep all the code in datstage if performance is same as perl.

Anyway we keep it in perl.

Thanks for quick reply

- Ram
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

If you want to keep everything in datastage, then go for Ken's suggestion.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Post Reply