is sorting before joining mandatory?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

Is it enough if I partition the data on the joining keys before join? Or should I also sort the data on the joining keys? Also if I do not sort, will the join output be wrong?
sanjumsm
Premium Member
Posts: 64
Joined: Tue Oct 17, 2006 11:29 pm
Location: Toronto

Re: is sorting before joining mandatory?

Post by sanjumsm »

Hi,

The result will not be affected, but the job will not be optimal and it can cause thrashing. Sorting takes a lot of memory and time, so it is better to sort and partition the dataset yourself before joining.

Note: Sorted input also minimizes memory requirements, because fewer rows need to be in memory at any one time.
sanjeev kumar
BalageBaju
Participant
Posts: 34
Joined: Fri Sep 22, 2006 10:59 pm
Location: India

Post by BalageBaju »

Hi,

It is always better to sort the data on the partitioning keys; it improves the performance of the job.

If the stage is partitioning data, the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the data is collected.
Regards,
Balaji.
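To illustrate the "partition first, then sort within each partition" order described above, here is a minimal Python sketch. It is not DataStage code: the function names, the row layout (dicts), and the use of `zlib.crc32` as the hash are all assumptions made for the example.

```python
import zlib

def hash_partition(rows, key, num_partitions):
    """Route each row to a partition chosen from a stable hash of its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is used only because it is stable across runs (an assumption
        # for this sketch, not what a parallel engine necessarily uses).
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions

def partition_then_sort(rows, key, num_partitions):
    """Hash-partition on the key, then sort each partition independently."""
    return [
        sorted(part, key=lambda r: r[key])
        for part in hash_partition(rows, key, num_partitions)
    ]
```

Because every row with the same key value hashes to the same partition, each partition can be sorted and joined on its own without seeing the others.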
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

Hi,

To get accurate results, it is always better to sort the data and hash-partition it on the same keys. One more thing you can use here is the environment variable APT_SORT_INSERTION_CHECK_ONLY. For stages like Join, DataStage inserts a tsort operator on the inputs, even if you have already sorted the data before sending it to the Join stage. With this variable set, the inserted sort only checks that the data is in the expected order instead of sorting it again, which can improve performance noticeably.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
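The difference between re-sorting and only checking the order can be sketched in a few lines of Python. This is an illustration of the idea, not the behaviour of the tsort operator itself; `check_sorted` and the dict-based rows are hypothetical names chosen for the example.

```python
def check_sorted(rows, key):
    """Pass rows through unchanged, raising if the key order ever decreases."""
    previous = None
    first = True
    for row in rows:
        current = row[key]
        if not first and current < previous:
            raise ValueError(
                f"input not sorted on {key!r}: {current!r} came after {previous!r}"
            )
        previous = current
        first = False
        yield row
```

A check like this holds one key value at a time and does a single comparison per row, whereas a real sort has to buffer and reorder the data.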
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The Join stage requires its inputs to be sorted, so that it can employ an efficient memory management algorithm. If you don't specify sorted data, the composed score will have tsort operators inserted on the input links so that the data will, in fact, be sorted. It is a far better technique to retain control of sorting, so that unnecessary sorting does not occur.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
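For readers wondering why sorted input lets a join get by with little memory, here is a minimal sort-merge inner join in Python. It is a sketch under assumptions (rows are dicts, both inputs are already sorted ascending on the key, `merge_join` is a made-up name) and does not claim to be how the Join stage is implemented.

```python
from itertools import groupby
from operator import itemgetter

def merge_join(left, right, key):
    """Inner-join two iterables of dicts that are already sorted on `key`."""
    get_key = itemgetter(key)
    left_groups = groupby(left, key=get_key)
    right_groups = groupby(right, key=get_key)
    left_item = next(left_groups, None)
    right_item = next(right_groups, None)
    while left_item is not None and right_item is not None:
        left_key, left_rows = left_item
        right_key, right_rows = right_item
        if left_key < right_key:
            # No match possible for this left key; move the left side forward.
            left_item = next(left_groups, None)
        elif left_key > right_key:
            right_item = next(right_groups, None)
        else:
            # Only the right-hand rows for the current key are buffered.
            right_buffer = list(right_rows)
            for left_row in left_rows:
                for right_row in right_buffer:
                    yield {**left_row, **right_row}
            left_item = next(left_groups, None)
            right_item = next(right_groups, None)

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 2, "a": "z"}, {"id": 4, "a": "w"}]
right = [{"id": 2, "b": 10}, {"id": 3, "b": 20}, {"id": 4, "b": 30}]
joined = list(merge_join(left, right, "id"))
```

In this sketch only the rows for the current key are held in memory. If the inputs were not sorted, the same loop would silently miss matches, which is consistent with the engine making sure a sort is in place.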