sorted input to Join

djoni
Participant
Posts: 98
Joined: Wed Oct 05, 2005 1:01 pm

sorted input to Join

Post by djoni »

Must the Join stage have sorted inputs (all of them, or just any one)?
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
To quote the help:
The data sets input to the Join stage must be key partitioned and sorted.
Last edited by roy on Wed Feb 08, 2006 10:01 am, edited 1 time in total.
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
felixyong
Participant
Posts: 35
Joined: Tue Jul 22, 2003 7:24 pm
Location: Australia

Re: sorted input to Join

Post by felixyong »

djoni wrote: Must Join stage have sorted inputs (all or any one)?
It is recommended to sort before the join so that the join is more efficient. If you sort, then all sources must be sorted the same way before the join.

It is even better to sort in the RDBMS if the join key is already indexed there, so that you save processing time and resources in DataStage Server.
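
To illustrate why all inputs have to be sorted the same way, here is a minimal sketch of a generic sort-merge inner join in plain Python. It is not DataStage code and makes no claim about how the Join stage is implemented internally; the function name `merge_join` and the sample rows are made up for the example. It assumes both inputs are sorted ascending on the key.

```python
def merge_join(left, right, key):
    """Inner join of two lists of dicts, both assumed sorted ascending on `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1          # left key too small, advance left
        elif lk > rk:
            j += 1          # right key too small, advance right
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row carrying this key with that run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


# Hypothetical sample data.
orders = [{"id": 1, "item": "a"}, {"id": 2, "item": "b"}, {"id": 3, "item": "c"}]
customers = [{"id": 1, "name": "x"}, {"id": 3, "name": "z"}]

sorted_result = merge_join(orders, customers, "id")

# Same rows, but the left input is no longer sorted on the key.
unsorted_result = merge_join([orders[2], orders[0], orders[1]], customers, "id")

print(len(sorted_result))    # 2 (ids 1 and 3 match)
print(len(unsorted_result))  # 1 (the match for id 1 is silently missed)
```

With sorted inputs the join makes a single pass over each side. With an unsorted input this kind of algorithm does not fail, it just drops matches, which is why the sort order of every input matters.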
djoni
Participant
Posts: 98
Joined: Wed Oct 05, 2005 1:01 pm

Re: sorted input to Join

Post by djoni »

felixyong wrote:
djoni wrote: Must Join stage have sorted inputs (all or any one)?
It is recommended to sort before the join so that the join is more efficient. If you sort, then all sources must be sorted the same way before the join.

It is even better to sort in the RDBMS if the join key is already indexed there, so that you save processing time and resources in DataStage Server.
Recommended or Mandatory?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Mandatory, if the manual is to be believed. I believe it, so I have always key partitioned and sorted Join stage inputs. Perhaps you'd like to try without, and let us know the result?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
djoni
Participant
Posts: 98
Joined: Wed Oct 05, 2005 1:01 pm

Post by djoni »

ray.wurlod wrote: Mandatory, if the manual is to be believed. I believe it, so I have always key partitioned and sorted Join stage inputs. Perhaps you'd like to try without, and let us know the result?
It runs fine on two unsorted sequential files, with Auto partitioning and no sort.

So, is something wrong with the manual and the EE Essential course?
Gaurav.Dave
Premium Member
Posts: 62
Joined: Tue Sep 21, 2004 10:24 am
Location: IBM - Chicago Area

Post by Gaurav.Dave »

Well, with sequential files it behaves differently.

But when you use Datasets, the data is partition based, so you need to key partition and sort it before it goes into your Join stage.

Gaurav Dave
dsusr
Premium Member
Posts: 104
Joined: Sat Sep 03, 2005 11:30 pm

Post by dsusr »

djoni wrote:
ray.wurlod wrote: Mandatory, if the manual is to be believed. I believe it, so I have always key partitioned and sorted Join stage inputs. Perhaps you'd like to try without, and let us know the result?
It runs fine on two unsorted sequential files, with Auto partitioning and no sort.

So, is something wrong with the manual and the EE Essential course?

The problem shows up when you run the job on multiple nodes with a large amount of data. If you have not hash partitioned on the key, two records with the same key value can land in different partitions, and the join will not take place for them. We have faced this type of issue in one of our projects.
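
A toy simulation of that effect, in plain Python rather than DataStage: each "node" joins only the rows in its own partition. The helper names (`hash_partition`, `round_robin_partition`, `join_per_partition`) and the sample rows are hypothetical, and round robin is used here only as a stand-in for any non-key-based partitioning.

```python
from zlib import crc32


def hash_partition(rows, key, nodes):
    """Send each row to a partition chosen from a hash of its key value."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[crc32(str(row[key]).encode()) % nodes].append(row)
    return parts


def round_robin_partition(rows, nodes):
    """Deal rows out in turn, ignoring the key entirely."""
    parts = [[] for _ in range(nodes)]
    for n, row in enumerate(rows):
        parts[n % nodes].append(row)
    return parts


def join_per_partition(left_parts, right_parts, key):
    """Inner join done independently inside each partition, as each node would."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        lookup = {}
        for r in rp:
            lookup.setdefault(r[key], []).append(r)
        for l in lp:
            for r in lookup.get(l[key], []):
                out.append({**l, **r})
    return out


# Hypothetical inputs: same four keys on both sides, arriving in different order.
left = [{"id": k, "l": f"L{k}"} for k in (1, 2, 3, 4)]
right = [{"id": k, "r": f"R{k}"} for k in (4, 1, 2, 3)]
NODES = 2

hash_joined = join_per_partition(
    hash_partition(left, "id", NODES), hash_partition(right, "id", NODES), "id"
)
rr_joined = join_per_partition(
    round_robin_partition(left, NODES), round_robin_partition(right, NODES), "id"
)

print(len(hash_joined))  # 4: equal keys always share a partition
print(len(rr_joined))    # 0: matching keys ended up on different nodes
```

On a single node both variants return all four matches, which is consistent with a small test job appearing to run well without key partitioning.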

As far as the sort is concerned, it is basically there to improve performance. So partitioning is mandatory, but sorting is preferable.

Regards
dsusr