Question on Partitioning before join stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
ds_teg
Premium Member
Premium Member
Posts: 51
Joined: Tue Aug 11, 2009 6:53 am
Location: Chicago

Question on Partitioning before join stage

Post by ds_teg »

Hi All ,

I have some basic question about join stage . The design of the job is as below

Input1:comes from oracle enterprise stage select col1 , col2 , col3 , col4 from table

Input2: dataset ( dataset is generated in previous job with design Oracle enterprise ---> Sort Col1 , Col2 --> Dataset )


Input1 ---> Sort stage ( sort on Key columns ) --> Left link of join stage

Input2 ---> Sort Stage ( Don't sort previous sorted on key columns ) ---> Right link of join stage



Now My question is where should I do hash partitioning on key columns ? Is it on the input links of Sort stages or input links of join stage ?

Please suggest .

Thanks
Developer9
Premium Member
Premium Member
Posts: 187
Joined: Thu Apr 14, 2011 5:10 pm

Re: Question on Partitioning before join stage

Post by Developer9 »

Hi,

Code: Select all

Left Link (sort input1),right link (sort input2)>>>>join stage >>>Output (???)
Is this is what your are looking for ???Can you please quote your job design in "code" with input and out put links

If job design is clear you can expect more responses

Thank you
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Get your partitioning done as early as possible in the job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ds_teg
Premium Member
Premium Member
Posts: 51
Joined: Tue Aug 11, 2009 6:53 am
Location: Chicago

Post by ds_teg »

Code: Select all


TABLE -> Sort  ---->Join  ->o/p
                              ^
                              ||
DATASET --> Don't sort previously sorted

Here I am doing partitioning first ( before sort stages ) & keeping same partitioning in join stage ..Just wanted to know if it is correct
josejohny
Premium Member
Premium Member
Posts: 10
Joined: Wed Nov 26, 2008 11:14 pm
Location: Bangalore

Post by josejohny »

What I used to do is Hash partitioning on Sort stage and same partitioning on join stage input links. Partitioning key you can give same as joining key. So you will get matching data in one partition.
Thanks & Regards
Jose Johny
Project Engineer
Wipro Technologies |Bangalore
"Life is a process of cultivating goodness & removing evilness"
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

At least in front of the Sort stage for Input 1, if not earlier as suggested by Ray as a best practice (in your job that may be the earliest location you can). Because Input 2 is already sorted, it should have also been partitioned at that time.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
ds_teg
Premium Member
Premium Member
Posts: 51
Joined: Tue Aug 11, 2009 6:53 am
Location: Chicago

Post by ds_teg »

Input 2 is already sorted but not partitioned in the previous job while creating dataset . So , If i am using a hash partition before sort stage ( don't sort already sorted ) will not disturb the sort order ? Here I am using same partitiong on the both the links of the join stage .
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Repartitioning a dataset invalidates the sort order (it is no longer guaranteed) and potentially disrupts the order of the data in the partitions of a dataset. You should properly partition the data when you sort it, on the input to or prior to the sort stage.

In the job which creates your input2 dataset, if you haven't disabled partition insertion, more than likely the data has been partitioned automatically by DataStage when it is sorted unless your job design dictates otherwise. My personal preference and recommendation is to specify the partitioning so that you know it has been done to properly meet the needs of the downstream processing logic.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
nagarjuna
Premium Member
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

You need to make sure that partitioning of data takes before the sorting of data ...
Nag
Post Reply