Join stage requires sorted input?

mingzhang · Post by **mingzhang** » Wed Mar 04, 2009 3:49 pm

Does Join stage really require input to be sorted?
I just tried a small case of join with unsorted inputs. I did get the correct output of join, and the output are sorted on key.
Anybody can explain?

lstsaur · Post by **lstsaur** » Wed Mar 04, 2009 5:10 pm

Please read the Parallel Job Developer's Guide 19-2, then you will understand why the Join stage's driving and reference datasets should be sorted first.

chulett · Post by **chulett** » Wed Mar 04, 2009 5:44 pm

Note the use of the word 'should' rather than 'must'.

ray.wurlod · Post by **ray.wurlod** » Wed Mar 04, 2009 5:57 pm

Welcome aboard.

The answer is yes but, for a small enough number of rows, you can get away with unsorted input. The stage even has an "ignore unsorted input" option if I recall correctly, to ignore the fact that the data are not sorted.

If you don't have sorting of the data specified on the input link, then DataStage will insert a "tsort operator" to effect the sorting. There is an environment variable you can set to override this behaviour.

Why is sorted input required? Basically this is about efficient use of memory. If it is known that the inputs are sorted, then it is sufficient only to read those rows with the next key value from the left input, then to read those rows with the same key value from the right input into memory. Once that join has been effected, memory can be freed and the next key value from the left input processed. Without sorted input, each input data set would need to be resident in memory.

mingzhang · Post by **mingzhang** » Thu Mar 05, 2009 9:41 am

Thanks all for the reply.

I understand that Join on sorted inputs will have much better performance.
But from the technical view (not considering the performance), does Join stage require inputs to be sorted first? ("should-but not must" or "must"?) otherwise, the output of Join couldn't be guarantee correct?

BTW, seems I couldn't find the "ignore unsorted input" option from Join stage.

ray.wurlod wrote:Welcome aboard.

The answer is yes but, for a small enough number of rows, you can get away with unsorted input. The stage even has an "ignore unsorted input" option if I recall correctly, to ignor ...

Mike · Post by **Mike** » Thu Mar 05, 2009 9:49 am

For guaranteed correct results, the answer is "must" be sorted. Investigate the job score. If you use auto partitioning in the join stage, it may be sorting the data for you implicitly. With random data, correct results are purely an accident.

Mike

mingzhang · Post by **mingzhang** » Thu Mar 05, 2009 9:57 am

Actually, I tried both cases:
1. "Auto" partition for inputs
2. "Hash" partition on key, but NOT "perform sort"

Both cases produce same correct results. So they are just "accidental" correct?

Mike wrote:For guaranteed correct results, the answer is "must" be sorted. Investigate the job score. If you use auto partitioning in the join stage, it may be sorting the data for you implicitly. With random data, correct results are purely an accident.

Mike

betterthanever · Post by **betterthanever** » Thu Mar 05, 2009 10:10 am

as long as the same key values endup in the same partition for all the input/references you should be okay which would have happened in the 2 step you mentioned.

the first step already would have inserted the sort operator when you set AUTO partinoning..which you should be able to see in the director setting APT_DUMP_SCORE to true

girija · Post by **girija** » Thu Mar 05, 2009 11:23 am

But if we are not specfying partition and sort in the input link, it all depends on these two environment variable : APT_NO_SORT_INSERTION
and APT_NO_PART_INSERTION. Means project dependent code. I think it is better to say "Input data must be partitioned and sorted" for join stage.

mingzhang · Post by **mingzhang** » Thu Mar 05, 2009 11:28 am

I meant "two cases", not "two steps".

First case is using "Auto" partition, which could do sort implicitly.

But second case, I used "Hash" partition and didn't check "perform sort", which should not sort the inputs, right?

betterthanever wrote:as long as the same key values endup in the same partition for all the input/references you should be okay which would have happened in the 2 step you mentioned.

the first step already would have inserted the sort operator when you set AUTO partinoning..which you should be able to see in the director setting APT_DUMP_SCORE to true

betterthanever · Post by **betterthanever** » Thu Mar 05, 2009 11:34 am

in the second scenario hash partitnoning makes sure that the key values end up in the same partition...

mingzhang · Post by **mingzhang** » Thu Mar 05, 2009 11:48 am

Yes, but not necessary SORTED (since didn't "perform sort").
And my question is: does Join stage require inputs to be sorted?

betterthanever wrote:in the second scenario hash partitnoning makes sure that the key values end up in the same partition...

betterthanever · Post by **betterthanever** » Thu Mar 05, 2009 12:56 pm

[quote="mingzhang"]Yes, but not necessary SORTED (since didn't "perform sort").
And my question is: does Join stage require inputs to be sorted?

[quote="betterthanever"]in the second scenario hash partitnoning makes sure that the key values end up in the same partition...[/quote][/quote]

did you mention explicitly not to sort the data???

in APT_NO_SORT_INSERTION???

mingzhang · Post by **mingzhang** » Thu Mar 05, 2009 1:33 pm

I just tried to set APT_NO_SORT_INSERTION "True" and "False". Both scenarios got same correct output for my test job.

betterthanever wrote:
mingzhang wrote:Yes, but not necessary SORTED (since didn't "perform sort").
And my question is: does Join stage require inputs to be sorted?

betterthanever wrote:in the second scenario hash partitnoning makes sure that the key values end up in the same partition...
did you mention explicitly not to sort the data???

in APT_NO_SORT_INSERTION???

girija · Post by **girija** » Thu Mar 05, 2009 1:54 pm

If you are not specifying SORT order it will effect the performance not the result. But if you not define partition it will effect the result.