Page 1 of 1

is sorting before joining mandatory?

Posted: Tue Nov 28, 2006 5:37 am
by tejaswini
Is it enough if I partition the data on the joining keys before join? Or should I also sort the data on the joining keys? Also if I do not sort, will the join output be wrong?

Re: is sorting before joining mandatory?

Posted: Tue Nov 28, 2006 6:23 am
by sanjumsm
Hi,

The result will not affect. But you know it will not be optimal and it will cause thrashing. As you know sorting takes lots of memory and time. So it would be better to sort and partitin the dataset before joining.

Note:
-----It also minimizes memory requirements because fewer rows need to be in memory at any one time.---------
tejaswini wrote:Is it enough if I partition the data on the joining keys before join? Or should I also sort the data on the joining keys? Also if I do not sort, will the join output be wrong?

Posted: Tue Nov 28, 2006 6:28 am
by BalageBaju
Hi,

It is always better to sort the data based on the partition keys. It will increase the performance of the job.

If the stage is Partitioning data, then the sort will occur after the partitioning. If the stage is collecting data, then sort occurs before the collection of data.

Posted: Tue Nov 28, 2006 8:11 am
by Nageshsunkoji
Hi,

To get the accurate results, It is always better to sort the data and at the same time perform HASH partition also. One more thing here you can include is using of environmental variable APT_SORT_INSERTION_CHECK_ONLY. For stages like join, datastage will insert Tsort operaor. It will happen, even though you have sorted the data before sending to the join stage. The above mentioned variable will just check the sort order, if it is sorted, it will not include the Tsort operator.It will increase your performance in a countable manner.

Posted: Tue Nov 28, 2006 1:16 pm
by ray.wurlod
The Join stage requires its inputs to be sorted, so that it can employ an efficient memory management algorithm. If you don't specify sorted data the composed score will have tsort operators inserted on the input links so that the data will, in fact, be sorted. It is far better technique to retain control of sorting, so that unnecessary sorting does not occur.