sort order

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
tostay2003
Participant
Posts: 97
Joined: Tue Feb 21, 2006 6:45 am

sort order

Post by tostay2003 »

Hi,

A small question.

If I have got two fields, Field1 takes upto 10000 values and Field2 takes upto 4 values.

which would be the better way (performance) to specify the properties in the sort stage.

a) Field1 ASC, Field2 ASC
b) Field2 ASC, Field1 ASC
c) Doesnt matter

My Hinch is (a)

Thank You
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Neither one is 'better', the answer depends on what order you need them sorted in downstream from there.
-craig

"You can never have too many knives" -- Logan Nine Fingers
tostay2003
Participant
Posts: 97
Joined: Tue Feb 21, 2006 6:45 am

Post by tostay2003 »

I need to check duplicates based on Field1 and Field2
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

As Craig said, it really depends how you want them grouped. Does Field2 come under Field1 or viceversa. The sort order will be specific to your requirement.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Re: sort order

Post by Ultramundane »

I bumped up Field2 and ran this test. Field 1 with a cycle of 10000 values for every distinct value in Field2 where Field2 had cycle of 40 values. Both columns defined as int not null.

Field1, Field2 Average 10 Runs: 65 seconds
Field2, Field1 Average 10 Runs: 88 seconds

I saw an improvement in speed by about 26% by sorting by Field1, Field2.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Again. If Field1 needs is the Main group and Field2 is the sub-group then the sorting should be in that order, regardless. Switching the sort order will mess up the data. If performance does become an issue, resort to the unix sort.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

I agree that you should sort how you need that data to be sorted.

It sounded like in this case the author just wanted to sort to find duplicate values. He does not care the actual order. Just sorting to find duplicates on some key values. In this case, I would sort by whatever is the most efficient.

He gave us some values and I tested them.

I found that sorting by field1 then field2 was about 26% than sorting by field2 then field1.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

How many time you repeated the test to confirm the effeciency?
Is your server consistantly loaded for each test for all the times?
What is the volume of data you took for this test?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

These speeds also depend greatly on what order the incoming data is already in.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

How many time you repeated the test to confirm the effeciency?
Field1, Field2 Average 10 Runs: 65 seconds
Field2, Field1 Average 10 Runs: 88 seconds

Is your server consistantly loaded for each test for all the times?
I was the only user on this system and this was only the job running.

What is the volume of data you took for this test?
I bumped up Field2 and ran this test. Field 1 with a cycle of 10000 values for every distinct value in Field2 where Field2 had cycle of 40 values. Both columns defined as int not null. Thus, 10000 * 40 = 400000 records.

These speeds also depend greatly on what order the incoming data is already in. I tried randomizing and sorting descending as well. Got the same results.
Post Reply