UNIX sort on multiple keys

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

UNIX sort on multiple keys

Post by chulett »

Looking for a little syntax help for a Filter command I am using in a Sequential File stage...

I am dropping a delimited file to disk in order to leverage the UNIX sort command rather than the Sort plugin as it can be orders of magnitude faster. It's fairly simply when sorting on character fields, as by default it does a string sort. I'm wondering if there is 'shorter' syntax than the only way I've found so far when sorting multiple numeric keys.

Abbreviated Example: the first 9 fields in the file are numeric fields and I need to sort on them in order. I thought I could do something simple, like one of the following:

Code: Select all

sort -n -k 1,9
sort -k 1,9n
sort -k 1n,9n
All of these garner me the prestigious 'Row Out Of Sequence' award from the downstream Aggregator it is feeding. The only I could get it to work for me was to individually mention each key:

Code: Select all

sort -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5,5n -k 6,6n -k 7,7n -k 8,8n -k 9,9n
Sure seems like there should be a better way to do it than that. :? Anyone done this before and know some more better syntax to accomplish the same thing?
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I do it the same way you did it, it absolutely guarantees that each key is identified and datatyped. The only downside is that this statement must be kept in sync with any file structural changes, such as moving columns around. Other than that, it's a very fast technique to drop to a file prior to an aggregation stage. You write from a transformer to a sequential file under one name, use an after-transformer ExecSH routine call to sort the file catting output to a new file, and have the sequential stage read the "just-in-time" new file going into the aggregator. Kind of like a "roll-your-own" custom stage without the stage solution.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I've done that in the past, using two files and all. Nowadays, it's pretty cool to use the Filter option in the Sequential stage to do the sorting 'on the fly' and not need the second file. The output of the sort command goes to Standard Out and the stage picks that up and processes it just as if it were picking it up from the disk. 8)

Perhaps one method is faster than the other, I haven't benchmarked the two to see. Only other potential 'downside' is there's nothing to 'View' when using a filter... that and you can't seem to use a filter command that needs a quote of any kind in it. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Craig

The easiest way is to reformat the data to have the columns to be sorted to the first of the line. You need to format everything to sort in string or alpha order so right justify numbers. Dates should be YMD[4,2,2].

You fix the columns back after reading them after sorting them.
Mamu Kim
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's an interesting thought. In essence, build a 'sort key' - a single concatenated field strictly for the sort process to use that could be pitched after that. As you said, one would have to make sure the fields that make up the key are formatted appropriately.

Certainly an option to keep in mind. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Christina Lim
Participant
Posts: 74
Joined: Tue Sep 30, 2003 4:25 am
Location: Malaysia

Post by Christina Lim »

Code: Select all

sort -k 1,1n -k 2,2n -k 3,3n -k 4,4n -k 5,5n -k 6,6n -k 7,7n -k 8,8n -k 9,9n
But my 3rd column is an alphabetic datatype.

I applied the following syntax n it convert all the fields to character sorting.

Code: Select all

sort -k 1,1n -k 2,2n -k 3,3 -k 4,4n -k 5,5n -k 6,6n -k 7,7n -k 8,8n -k 9,9n
Wonder if anyone face the same problem..
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
to add on what was said,
If you need the sort for agg stage for agg not with first()/last() that are crucial to the outcome you want, sorting the file on the most varying column alone (or a few of them) and specifying a group by on the others might be enough to get the job done without all sort fields specified.

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
Image
tonystark622
Premium Member
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Craig,

I had similar experiences as you did. I had to mention each key and put the begin, end syntax (e.g. -k9,9) to make it work.

Just my 2 cents worth,
Tony
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Kinda what I figured... but was hoping I was overlooking something to make the syntax a little easier to handle. Oh well. :cry:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply