Join stage requires sorted input?
Moderators: chulett, rschirm, roy
Join stage requires sorted input?
Does Join stage really require input to be sorted?
I just tried a small case of join with unsorted inputs. I did get the correct output of join, and the output are sorted on key.
Anybody can explain?
I just tried a small case of join with unsorted inputs. I did get the correct output of join, and the output are sorted on key.
Anybody can explain?
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Welcome aboard.
The answer is yes but, for a small enough number of rows, you can get away with unsorted input. The stage even has an "ignore unsorted input" option if I recall correctly, to ignore the fact that the data are not sorted.
If you don't have sorting of the data specified on the input link, then DataStage will insert a "tsort operator" to effect the sorting. There is an environment variable you can set to override this behaviour.
Why is sorted input required? Basically this is about efficient use of memory. If it is known that the inputs are sorted, then it is sufficient only to read those rows with the next key value from the left input, then to read those rows with the same key value from the right input into memory. Once that join has been effected, memory can be freed and the next key value from the left input processed. Without sorted input, each input data set would need to be resident in memory.
The answer is yes but, for a small enough number of rows, you can get away with unsorted input. The stage even has an "ignore unsorted input" option if I recall correctly, to ignore the fact that the data are not sorted.
If you don't have sorting of the data specified on the input link, then DataStage will insert a "tsort operator" to effect the sorting. There is an environment variable you can set to override this behaviour.
Why is sorted input required? Basically this is about efficient use of memory. If it is known that the inputs are sorted, then it is sufficient only to read those rows with the next key value from the left input, then to read those rows with the same key value from the right input into memory. Once that join has been effected, memory can be freed and the next key value from the left input processed. Without sorted input, each input data set would need to be resident in memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Thanks all for the reply.
I understand that Join on sorted inputs will have much better performance.
But from the technical view (not considering the performance), does Join stage require inputs to be sorted first? ("should-but not must" or "must"?) otherwise, the output of Join couldn't be guarantee correct?
BTW, seems I couldn't find the "ignore unsorted input" option from Join stage.
I understand that Join on sorted inputs will have much better performance.
But from the technical view (not considering the performance), does Join stage require inputs to be sorted first? ("should-but not must" or "must"?) otherwise, the output of Join couldn't be guarantee correct?
BTW, seems I couldn't find the "ignore unsorted input" option from Join stage.
ray.wurlod wrote:Welcome aboard.
The answer is yes but, for a small enough number of rows, you can get away with unsorted input. The stage even has an "ignore unsorted input" option if I recall correctly, to ignor ...
Actually, I tried both cases:
1. "Auto" partition for inputs
2. "Hash" partition on key, but NOT "perform sort"
Both cases produce same correct results. So they are just "accidental" correct?
1. "Auto" partition for inputs
2. "Hash" partition on key, but NOT "perform sort"
Both cases produce same correct results. So they are just "accidental" correct?
Mike wrote:For guaranteed correct results, the answer is "must" be sorted. Investigate the job score. If you use auto partitioning in the join stage, it may be sorting the data for you implicitly. With random data, correct results are purely an accident.
Mike
-
- Participant
- Posts: 152
- Joined: Tue Jan 13, 2009 8:59 am
as long as the same key values endup in the same partition for all the input/references you should be okay which would have happened in the 2 step you mentioned.
the first step already would have inserted the sort operator when you set AUTO partinoning..which you should be able to see in the director setting APT_DUMP_SCORE to true
the first step already would have inserted the sort operator when you set AUTO partinoning..which you should be able to see in the director setting APT_DUMP_SCORE to true
I meant "two cases", not "two steps".
First case is using "Auto" partition, which could do sort implicitly.
But second case, I used "Hash" partition and didn't check "perform sort", which should not sort the inputs, right?
First case is using "Auto" partition, which could do sort implicitly.
But second case, I used "Hash" partition and didn't check "perform sort", which should not sort the inputs, right?
betterthanever wrote:as long as the same key values endup in the same partition for all the input/references you should be okay which would have happened in the 2 step you mentioned.
the first step already would have inserted the sort operator when you set AUTO partinoning..which you should be able to see in the director setting APT_DUMP_SCORE to true
-
- Participant
- Posts: 152
- Joined: Tue Jan 13, 2009 8:59 am
-
- Participant
- Posts: 152
- Joined: Tue Jan 13, 2009 8:59 am
[quote="mingzhang"]Yes, but not necessary SORTED (since didn't "perform sort").
And my question is: does Join stage require inputs to be sorted?
[quote="betterthanever"]in the second scenario hash partitnoning makes sure that the key values end up in the same partition...[/quote][/quote]
did you mention explicitly not to sort the data???
in APT_NO_SORT_INSERTION???
And my question is: does Join stage require inputs to be sorted?
[quote="betterthanever"]in the second scenario hash partitnoning makes sure that the key values end up in the same partition...[/quote][/quote]
did you mention explicitly not to sort the data???
in APT_NO_SORT_INSERTION???
I just tried to set APT_NO_SORT_INSERTION "True" and "False". Both scenarios got same correct output for my test job.
betterthanever wrote:did you mention explicitly not to sort the data???mingzhang wrote:Yes, but not necessary SORTED (since didn't "perform sort").
And my question is: does Join stage require inputs to be sorted?
betterthanever wrote:in the second scenario hash partitnoning makes sure that the key values end up in the same partition...
in APT_NO_SORT_INSERTION???