Job design help

swerajan · Post by **swerajan** » Wed Sep 09, 2009 11:50 pm

Hi
I have to do a join on two files. The join requires presorted input. hence am using Hash partition with Stable sort option on both the input links. Is this the right option to do? or should i use a same partition with sort?? Also please explain about the performance improvement that takes place in following any one of the stated method

ArndW · Post by **ArndW** » Thu Sep 10, 2009 12:45 am

Even though the phrasing of the question sounds like an excercise question for an interview, I'll give it a go:

1. Partitioning. What method of partitioning you use makes no difference in the join as long as it guarantees that the join key is present in the same partition for all the input links. Normally this implies that the hash key and the join key are the same (or at least the primary join key, if several are used).

2. Sorting. A join requires that all the input data is sorted on the join key(s), if you don't do it explicitly then DataStage will do it implicitly unless you turn off that functionality.

Partitioning and Sorting are prerequisites. How you do either might be performance-related.

swerajan · Post by **swerajan** » Thu Sep 10, 2009 1:18 am

thanks !! it is not for an interview. I had a job that needs to developed so thought of posting the query. as my development data is less i could see no difference in using either method

ArndW · Post by **ArndW** » Thu Sep 10, 2009 1:58 am

No worries. It was just the way the problem was formulated sounded very much like a typical open-ended interview question.

Radhika Reddy · Post by **Radhika Reddy** » Tue Feb 15, 2011 12:54 am

I have designed a job in that for full outer join i am using same partition.
If i enable perform sort option its giving difference in the o/p rec count without enabling perform sort option. I have already sorted the data in the previous stage. please explian why it is giving difference in the count.

nani0907 · Post by **nani0907** » Tue Feb 15, 2011 1:20 am

As indicated by Guru-- Andrw, Sorting & partioning are the pre-requisites for Join stage.

Note:If there is more than one join key used, make sure that the order of the keys defined in the both input links should be same order ,else it might give inconsistent results.........

jwiles · Post by **jwiles** » Tue Feb 15, 2011 6:54 am

Radhika,

Is automatic sort insertion disabled in your job/project/environment? When you select the sort option, do your sort keys match the join keys? Since you are choosing Same partitioning, has your data already been partitioned correctly--using at least the first/primary join key column--upstream from the join?

As already stated, Joins (and Merges also) requires that the incoming data be partitioned (for parallel jobs) and sorted correctly in order to provide the correct results. The purpose of partitioning is twofold: 1-distribute data among the multiple partitions to spread the workload and 2-place related records into the same partition for correct processing. For Join and Merge, all inputs should be partitioned using the same method on the same key columns.

Since this is a full outer join, the change in the number of output rows (when you enable the sort option) indicates that you may have duplicate records in two or more input links.

Regards,