
Adding a column during concatenation

Posted: Sun Apr 22, 2007 6:43 am
by chulett
Not really a DataStage question as I don't want a job solution, but a UNIX one.

I have 16M records in 51 identical files that need to be concatenated together for processing. I'd like to add a single letter as a trailing pipe-delimited column to each record during the process, if possible. I know which files need what letter, so don't worry about that. Just wondering if there's some way to maintain the speed of a straight command-line 'cat *.xxx > file' operation and add a column to the end of each record at the same time.

Thanks!

Posted: Sun Apr 22, 2007 9:46 am
by DSguru2B
Look into sed, Craig. Something like

Code:

sed -e 's/$/your character/g' infile 
You can run this command on the files either before or after concatenating them.
As for the speed, you will have to test it out. Sed is pretty fast, but I don't know how it will fare with 16M records across 51 files.
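
Since different files need different letters, the same substitution can be run per file with the letter chosen inside a loop. A minimal sketch, assuming a hypothetical `letter_for` function and made-up sample files (`a.xxx`, `b.xxx`); the thread never gives the real mapping, so that part is purely illustrative:

```shell
#!/bin/sh
# Sketch: append a per-file letter as a trailing pipe-delimited column.
# The sample files and the letter_for mapping are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '1|foo\n2|bar\n' > a.xxx
printf '3|baz\n' > b.xxx

# Hypothetical mapping from file name to suffix letter.
letter_for() {
    case "$1" in
        a.xxx) echo M ;;
        *)     echo N ;;
    esac
}

: > fileout                      # start with an empty output file
for f in *.xxx; do
    letter=$(letter_for "$f")
    # Double quotes let the shell expand $letter; \$ keeps the regex anchor literal.
    sed -e "s/\$/|$letter/" "$f" >> fileout
done
cat fileout
```

The one sed process per file costs a little startup time, which should be negligible next to reading millions of records, though that is worth timing rather than assuming.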

Posted: Sun Apr 22, 2007 10:33 am
by chulett
Thanks - running some timing tests. I can pipe cats through sed:

Code:

cat *.xxx | sed -e 's/$/|M' > fileout
Two birds, one stone. :D

Posted: Sun Apr 22, 2007 10:49 am
by chulett
This does work but adds some overhead: concatenating 2.6M records went from 23 seconds to 1 minute 23 seconds. I can live with that, but I'm still curious whether there is something with less impact that could be done.

Posted: Sun Apr 22, 2007 7:24 pm
by ray.wurlod
How about using the wildcard as stdin for sed (so they work one at a time) and append-redirecting the output into your file? You could also parallelize the sed operations with a bit more scripting.
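
The parallel idea could look roughly like the following. This is only a sketch under assumptions: the sample files are made up, every file gets the same `M` suffix, and each sed writes to its own hypothetical `.part` file so the outputs cannot interleave before they are joined in file-name order:

```shell
#!/bin/sh
# Sketch: run one sed per input file in the background, then join the pieces.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '1|foo\n' > a.xxx
printf '2|bar\n' > b.xxx

for f in *.xxx; do
    # Each sed writes to its own temporary piece so outputs never interleave.
    sed -e 's/$/|M/' "$f" > "$f.part" &
done
wait                             # block until every background sed has finished

cat *.xxx.part > fileout         # glob order keeps the pieces in file-name order
rm -f *.xxx.part
cat fileout
```

Whether this beats a single sed depends on whether the job is CPU-bound or disk-bound, and launching all 51 at once may be too many for the machine, so the batch size would need tuning. It also writes the data twice (pieces, then the final file), which may cancel out the gain.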

Posted: Sun Apr 22, 2007 7:39 pm
by kcbland
Syncsort available?

Posted: Sun Apr 22, 2007 8:49 pm
by chulett
kcbland wrote: Syncsort available?
Sadly, no. :cry:

Posted: Sun Apr 22, 2007 8:54 pm
by chulett
ray.wurlod wrote: How about using the wildcard as stdin for sed (so they work one at a time) and append-redirecting the output into your file?
Err... meaning... this?

Code:

sed -e 's/$/|M' *.xxx >> fileout
I'll give it a shot.

Edited to add: WTH?

The substitution syntax that worked fine in the other form won't parse in this one, in spite of me cribbing it directly from the man pages.

sed: Function s/$/|M cannot be parsed

It doesn't seem to matter what I put between the quotes; the dollar sign and the pipe are not the issue here. :?

Posted: Mon Apr 23, 2007 6:53 am
by DSguru2B
Try this:

Code:

sed -e 's/$/|M/g' *.xxx >> fileout
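
The change that matters here is the closing slash. The `s` command takes the form `s/pattern/replacement/flags`, and all three delimiters are required. The `g` flag is optional and makes no difference when the pattern is `$`, which matches only once per line. A small sketch to see the difference, using made-up input and assuming a sed that rejects an unterminated `s` command (GNU and BSD sed both do, as far as I know):

```shell
#!/bin/sh
# Sketch: the s command needs all three delimiters, s/pattern/replacement/.
good=$(printf 'a|b\n' | sed -e 's/$/|M/')
echo "with trailing slash:    $good"

# Without the closing delimiter sed rejects the script and exits non-zero.
if printf 'a|b\n' | sed -e 's/$/|M' >/dev/null 2>&1; then
    bad_status=accepted
else
    bad_status=rejected
fi
echo "without trailing slash: $bad_status"
```

The exact error text varies by implementation, so the sketch checks only the exit status.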

Posted: Mon Apr 23, 2007 11:08 am
by chulett
:oops: When I transcribed the previous syntax I was using into the post I left off the trailing slash, which is why it could no longer be parsed. So now both are working and here are some timing tests for anyone interested:

Code:

sed -e 's/$/|M/' *.xxx >> fileout          # 55 sec avg

cat *.xxx | sed -e 's/$/|M/' >> fileout    # 77 sec avg
This is for 1.7M records. I'll go with the former, though either would be 'fine' in the long run. :D
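
For anyone wanting to repeat the comparison on their own box, something like this would do. It is only a sketch: the three generated files, their names, and the record layout are made up, the elapsed times use whole seconds from `date +%s`, and the numbers will not match the ones above:

```shell
#!/bin/sh
# Sketch: time both forms on generated data and confirm the outputs match.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build three hypothetical input files of 100,000 pipe-delimited records each.
for n in 1 2 3; do
    awk -v n="$n" 'BEGIN { for (i = 1; i <= 100000; i++) print n "|" i "|data" }' > "part$n.xxx"
done

start=$(date +%s)
sed -e 's/$/|M/' *.xxx > direct.out
end=$(date +%s)
echo "sed reading the files directly: $((end - start)) sec"

start=$(date +%s)
cat *.xxx | sed -e 's/$/|M/' > piped.out
end=$(date +%s)
echo "cat piped through sed:          $((end - start)) sec"

# Both forms should yield byte-identical output.
cmp direct.out piped.out && echo "outputs match"
```

Running each form several times and averaging, as was done above, smooths out file-cache effects between the first and later runs.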