Job design help

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
swerajan
Participant
Posts: 42
Joined: Tue Aug 05, 2008 4:34 am

Job design help

Post by swerajan »

Hi
I have to do a join on two files. The join requires presorted input. hence am using Hash partition with Stable sort option on both the input links. Is this the right option to do? or should i use a same partition with sort?? Also please explain about the performance improvement that takes place in following any one of the stated method
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Even though the phrasing of the question sounds like an excercise question for an interview, I'll give it a go:

1. Partitioning. What method of partitioning you use makes no difference in the join as long as it guarantees that the join key is present in the same partition for all the input links. Normally this implies that the hash key and the join key are the same (or at least the primary join key, if several are used).

2. Sorting. A join requires that all the input data is sorted on the join key(s), if you don't do it explicitly then DataStage will do it implicitly unless you turn off that functionality.

Partitioning and Sorting are prerequisites. How you do either might be performance-related.
swerajan
Participant
Posts: 42
Joined: Tue Aug 05, 2008 4:34 am

Post by swerajan »

thanks !! it is not for an interview. I had a job that needs to developed so thought of posting the query. as my development data is less i could see no difference in using either method
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

No worries. It was just the way the problem was formulated sounded very much like a typical open-ended interview question.
Radhika Reddy
Participant
Posts: 1
Joined: Tue Jul 20, 2010 5:56 am

Post by Radhika Reddy »

I have designed a job in that for full outer join i am using same partition.
If i enable perform sort option its giving difference in the o/p rec count without enabling perform sort option. I have already sorted the data in the previous stage. please explian why it is giving difference in the count.
nani0907
Participant
Posts: 155
Joined: Wed Apr 18, 2007 10:30 am

Post by nani0907 »

As indicated by Guru-- Andrw, Sorting & partioning are the pre-requisites for Join stage.

Note:If there is more than one join key used, make sure that the order of the keys defined in the both input links should be same order ,else it might give inconsistent results.........
thanks n regards
nani
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Radhika,

Is automatic sort insertion disabled in your job/project/environment? When you select the sort option, do your sort keys match the join keys? Since you are choosing Same partitioning, has your data already been partitioned correctly--using at least the first/primary join key column--upstream from the join?

As already stated, Joins (and Merges also) requires that the incoming data be partitioned (for parallel jobs) and sorted correctly in order to provide the correct results. The purpose of partitioning is twofold: 1-distribute data among the multiple partitions to spread the workload and 2-place related records into the same partition for correct processing. For Join and Merge, all inputs should be partitioned using the same method on the same key columns.

Since this is a full outer join, the change in the number of output rows (when you enable the sort option) indicates that you may have duplicate records in two or more input links.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
Post Reply