Join stage requires sorted input?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

mingzhang
Participant
Posts: 16
Joined: Fri Feb 27, 2009 3:31 pm

Join stage requires sorted input?

Post by mingzhang »

Does Join stage really require input to be sorted?
I just tried a small case of join with unsorted inputs. I did get the correct output of join, and the output are sorted on key.
Anybody can explain?
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

Please read the Parallel Job Developer's Guide 19-2, then you will understand why the Join stage's driving and reference datasets should be sorted first.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Note the use of the word 'should' rather than 'must'. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard.

The answer is yes but, for a small enough number of rows, you can get away with unsorted input. The stage even has an "ignore unsorted input" option if I recall correctly, to ignore the fact that the data are not sorted.

If you don't have sorting of the data specified on the input link, then DataStage will insert a "tsort operator" to effect the sorting. There is an environment variable you can set to override this behaviour.

Why is sorted input required? Basically this is about efficient use of memory. If it is known that the inputs are sorted, then it is sufficient only to read those rows with the next key value from the left input, then to read those rows with the same key value from the right input into memory. Once that join has been effected, memory can be freed and the next key value from the left input processed. Without sorted input, each input data set would need to be resident in memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mingzhang
Participant
Posts: 16
Joined: Fri Feb 27, 2009 3:31 pm

Post by mingzhang »

Thanks all for the reply.

I understand that Join on sorted inputs will have much better performance.
But from the technical view (not considering the performance), does Join stage require inputs to be sorted first? ("should-but not must" or "must"?) otherwise, the output of Join couldn't be guarantee correct?

BTW, seems I couldn't find the "ignore unsorted input" option from Join stage.
ray.wurlod wrote:Welcome aboard.

The answer is yes but, for a small enough number of rows, you can get away with unsorted input. The stage even has an "ignore unsorted input" option if I recall correctly, to ignor ...
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

For guaranteed correct results, the answer is "must" be sorted. Investigate the job score. If you use auto partitioning in the join stage, it may be sorting the data for you implicitly. With random data, correct results are purely an accident.

Mike
mingzhang
Participant
Posts: 16
Joined: Fri Feb 27, 2009 3:31 pm

Post by mingzhang »

Actually, I tried both cases:
1. "Auto" partition for inputs
2. "Hash" partition on key, but NOT "perform sort"

Both cases produce same correct results. So they are just "accidental" correct?
Mike wrote:For guaranteed correct results, the answer is "must" be sorted. Investigate the job score. If you use auto partitioning in the join stage, it may be sorting the data for you implicitly. With random data, correct results are purely an accident.

Mike
betterthanever
Participant
Posts: 152
Joined: Tue Jan 13, 2009 8:59 am

Post by betterthanever »

as long as the same key values endup in the same partition for all the input/references you should be okay which would have happened in the 2 step you mentioned.

the first step already would have inserted the sort operator when you set AUTO partinoning..which you should be able to see in the director setting APT_DUMP_SCORE to true
girija
Participant
Posts: 89
Joined: Fri Mar 24, 2006 1:51 pm
Location: Hartford

Post by girija »

But if we are not specfying partition and sort in the input link, it all depends on these two environment variable : APT_NO_SORT_INSERTION
and APT_NO_PART_INSERTION. Means project dependent code. I think it is better to say "Input data must be partitioned and sorted" for join stage.
mingzhang
Participant
Posts: 16
Joined: Fri Feb 27, 2009 3:31 pm

Post by mingzhang »

I meant "two cases", not "two steps".

First case is using "Auto" partition, which could do sort implicitly.

But second case, I used "Hash" partition and didn't check "perform sort", which should not sort the inputs, right?
betterthanever wrote:as long as the same key values endup in the same partition for all the input/references you should be okay which would have happened in the 2 step you mentioned.

the first step already would have inserted the sort operator when you set AUTO partinoning..which you should be able to see in the director setting APT_DUMP_SCORE to true
betterthanever
Participant
Posts: 152
Joined: Tue Jan 13, 2009 8:59 am

Post by betterthanever »

in the second scenario hash partitnoning makes sure that the key values end up in the same partition...
mingzhang
Participant
Posts: 16
Joined: Fri Feb 27, 2009 3:31 pm

Post by mingzhang »

Yes, but not necessary SORTED (since didn't "perform sort").
And my question is: does Join stage require inputs to be sorted?
betterthanever wrote:in the second scenario hash partitnoning makes sure that the key values end up in the same partition...
betterthanever
Participant
Posts: 152
Joined: Tue Jan 13, 2009 8:59 am

Post by betterthanever »

[quote="mingzhang"]Yes, but not necessary SORTED (since didn't "perform sort").
And my question is: does Join stage require inputs to be sorted?

[quote="betterthanever"]in the second scenario hash partitnoning makes sure that the key values end up in the same partition...[/quote][/quote]

did you mention explicitly not to sort the data???

in APT_NO_SORT_INSERTION???
mingzhang
Participant
Posts: 16
Joined: Fri Feb 27, 2009 3:31 pm

Post by mingzhang »

I just tried to set APT_NO_SORT_INSERTION "True" and "False". Both scenarios got same correct output for my test job.
betterthanever wrote:
mingzhang wrote:Yes, but not necessary SORTED (since didn't "perform sort").
And my question is: does Join stage require inputs to be sorted?
betterthanever wrote:in the second scenario hash partitnoning makes sure that the key values end up in the same partition...
did you mention explicitly not to sort the data???

in APT_NO_SORT_INSERTION???
girija
Participant
Posts: 89
Joined: Fri Mar 24, 2006 1:51 pm
Location: Hartford

Post by girija »

If you are not specifying SORT order it will effect the performance not the result. But if you not define partition it will effect the result.
Post Reply