UNIX sort on multiple keys

chulett · Post by **chulett** » Tue Dec 21, 2004 11:54 am

Looking for a little syntax help for a Filter command I am using in a Sequential File stage...

I am dropping a delimited file to disk in order to leverage the UNIX sort command rather than the Sort plugin as it can be orders of magnitude faster. It's fairly simply when sorting on character fields, as by default it does a string sort. I'm wondering if there is 'shorter' syntax than the only way I've found so far when sorting multiple numeric keys.

Abbreviated Example: the first 9 fields in the file are numeric fields and I need to sort on them in order. I thought I could do something simple, like one of the following:

Code: Select all

sort -n -k 1,9
sort -k 1,9n
sort -k 1n,9n

All of these garner me the prestigious 'Row Out Of Sequence' award from the downstream Aggregator it is feeding. The only I could get it to work for me was to individually mention each key:

Code: Select all

sort -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5,5n -k 6,6n -k 7,7n -k 8,8n -k 9,9n

Sure seems like there should be a better way to do it than that.

Anyone done this before and know some more better syntax to accomplish the same thing?

kcbland · Post by **kcbland** » Tue Dec 21, 2004 12:56 pm

I do it the same way you did it, it absolutely guarantees that each key is identified and datatyped. The only downside is that this statement must be kept in sync with any file structural changes, such as moving columns around. Other than that, it's a very fast technique to drop to a file prior to an aggregation stage. You write from a transformer to a sequential file under one name, use an after-transformer ExecSH routine call to sort the file catting output to a new file, and have the sequential stage read the "just-in-time" new file going into the aggregator. Kind of like a "roll-your-own" custom stage without the stage solution.

chulett · Post by **chulett** » Tue Dec 21, 2004 2:22 pm

I've done that in the past, using two files and all. Nowadays, it's pretty cool to use the Filter option in the Sequential stage to do the sorting 'on the fly' and not need the second file. The output of the sort command goes to Standard Out and the stage picks that up and processes it just as if it were picking it up from the disk.

Perhaps one method is faster than the other, I haven't benchmarked the two to see. Only other potential 'downside' is there's nothing to 'View' when using a filter... that and you can't seem to use a filter command that needs a quote of any kind in it.

kduke · Post by **kduke** » Tue Dec 21, 2004 5:02 pm

Craig

The easiest way is to reformat the data to have the columns to be sorted to the first of the line. You need to format everything to sort in string or alpha order so right justify numbers. Dates should be YMD[4,2,2].

You fix the columns back after reading them after sorting them.

chulett · Post by **chulett** » Tue Dec 21, 2004 6:09 pm

That's an interesting thought. In essence, build a 'sort key' - a single concatenated field strictly for the sort process to use that could be pitched after that. As you said, one would have to make sure the fields that make up the key are formatted appropriately.

Certainly an option to keep in mind.

Christina Lim · Post by **Christina Lim** » Tue Dec 28, 2004 5:23 am

Code: Select all

sort -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5,5n -k 6,6n -k 7,7n -k 8,8n -k 9,9n

But my 3rd column is an alphabetic datatype.

I applied the following syntax n it convert all the fields to character sorting.

Code: Select all

sort -k 1,1n -k 2,2n -k 3,3 -k 4,4n -k 5,5n -k 6,6n -k 7,7n -k 8,8n -k 9,9n

Wonder if anyone face the same problem..

roy · Post by **roy** » Sat Jan 01, 2005 4:03 pm

Hi,
to add on what was said,
If you need the sort for agg stage for agg not with first()/last() that are crucial to the outcome you want, sorting the file on the most varying column alone (or a few of them) and specifying a group by on the others might be enough to get the job done without all sort fields specified.

IHTH,

tonystark622 · Post by **tonystark622** » Sat Jan 01, 2005 9:41 pm

Craig,

I had similar experiences as you did. I had to mention each key and put the begin, end syntax (e.g. -k9,9) to make it work.

Just my 2 cents worth,
Tony

chulett · Post by **chulett** » Sun Jan 02, 2005 12:11 am

Kinda what I figured... but was hoping I was overlooking something to make the syntax a little easier to handle. Oh well.