Page 1 of 1

Question on Partitioning before join stage

Posted: Fri Jan 06, 2012 2:42 pm
by ds_teg
Hi All ,

I have some basic question about join stage . The design of the job is as below

Input1:comes from oracle enterprise stage select col1 , col2 , col3 , col4 from table

Input2: dataset ( dataset is generated in previous job with design Oracle enterprise ---> Sort Col1 , Col2 --> Dataset )


Input1 ---> Sort stage ( sort on Key columns ) --> Left link of join stage

Input2 ---> Sort Stage ( Don't sort previous sorted on key columns ) ---> Right link of join stage



Now My question is where should I do hash partitioning on key columns ? Is it on the input links of Sort stages or input links of join stage ?

Please suggest .

Thanks

Re: Question on Partitioning before join stage

Posted: Fri Jan 06, 2012 3:17 pm
by Developer9
Hi,

Code: Select all

Left Link (sort input1),right link (sort input2)>>>>join stage >>>Output (???)
Is this is what your are looking for ???Can you please quote your job design in "code" with input and out put links

If job design is clear you can expect more responses

Thank you

Posted: Fri Jan 06, 2012 3:26 pm
by ray.wurlod
Get your partitioning done as early as possible in the job.

Posted: Fri Jan 06, 2012 3:33 pm
by ds_teg

Code: Select all


TABLE -> Sort  ---->Join  ->o/p
                              ^
                              ||
DATASET --> Don't sort previously sorted

Here I am doing partitioning first ( before sort stages ) & keeping same partitioning in join stage ..Just wanted to know if it is correct

Posted: Fri Jan 06, 2012 3:46 pm
by josejohny
What I used to do is Hash partitioning on Sort stage and same partitioning on join stage input links. Partitioning key you can give same as joining key. So you will get matching data in one partition.

Posted: Fri Jan 06, 2012 5:12 pm
by jwiles
At least in front of the Sort stage for Input 1, if not earlier as suggested by Ray as a best practice (in your job that may be the earliest location you can). Because Input 2 is already sorted, it should have also been partitioned at that time.

Regards,

Posted: Sat Jan 07, 2012 4:11 am
by ds_teg
Input 2 is already sorted but not partitioned in the previous job while creating dataset . So , If i am using a hash partition before sort stage ( don't sort already sorted ) will not disturb the sort order ? Here I am using same partitiong on the both the links of the join stage .

Posted: Sat Jan 07, 2012 10:16 am
by jwiles
Repartitioning a dataset invalidates the sort order (it is no longer guaranteed) and potentially disrupts the order of the data in the partitions of a dataset. You should properly partition the data when you sort it, on the input to or prior to the sort stage.

In the job which creates your input2 dataset, if you haven't disabled partition insertion, more than likely the data has been partitioned automatically by DataStage when it is sorted unless your job design dictates otherwise. My personal preference and recommendation is to specify the partitioning so that you know it has been done to properly meet the needs of the downstream processing logic.

Regards,

Posted: Sat Jan 07, 2012 12:22 pm
by nagarjuna
You need to make sure that partitioning of data takes before the sorting of data ...