Hi
I have to do a join on two files. The join requires presorted input. hence am using Hash partition with Stable sort option on both the input links. Is this the right option to do? or should i use a same partition with sort?? Also please explain about the performance improvement that takes place in following any one of the stated method
Job design help
Moderators: chulett, rschirm, roy
Even though the phrasing of the question sounds like an excercise question for an interview, I'll give it a go:
1. Partitioning. What method of partitioning you use makes no difference in the join as long as it guarantees that the join key is present in the same partition for all the input links. Normally this implies that the hash key and the join key are the same (or at least the primary join key, if several are used).
2. Sorting. A join requires that all the input data is sorted on the join key(s), if you don't do it explicitly then DataStage will do it implicitly unless you turn off that functionality.
Partitioning and Sorting are prerequisites. How you do either might be performance-related.
1. Partitioning. What method of partitioning you use makes no difference in the join as long as it guarantees that the join key is present in the same partition for all the input links. Normally this implies that the hash key and the join key are the same (or at least the primary join key, if several are used).
2. Sorting. A join requires that all the input data is sorted on the join key(s), if you don't do it explicitly then DataStage will do it implicitly unless you turn off that functionality.
Partitioning and Sorting are prerequisites. How you do either might be performance-related.
-
- Participant
- Posts: 1
- Joined: Tue Jul 20, 2010 5:56 am
I have designed a job in that for full outer join i am using same partition.
If i enable perform sort option its giving difference in the o/p rec count without enabling perform sort option. I have already sorted the data in the previous stage. please explian why it is giving difference in the count.
If i enable perform sort option its giving difference in the o/p rec count without enabling perform sort option. I have already sorted the data in the previous stage. please explian why it is giving difference in the count.
Radhika,
Is automatic sort insertion disabled in your job/project/environment? When you select the sort option, do your sort keys match the join keys? Since you are choosing Same partitioning, has your data already been partitioned correctly--using at least the first/primary join key column--upstream from the join?
As already stated, Joins (and Merges also) requires that the incoming data be partitioned (for parallel jobs) and sorted correctly in order to provide the correct results. The purpose of partitioning is twofold: 1-distribute data among the multiple partitions to spread the workload and 2-place related records into the same partition for correct processing. For Join and Merge, all inputs should be partitioned using the same method on the same key columns.
Since this is a full outer join, the change in the number of output rows (when you enable the sort option) indicates that you may have duplicate records in two or more input links.
Regards,
Is automatic sort insertion disabled in your job/project/environment? When you select the sort option, do your sort keys match the join keys? Since you are choosing Same partitioning, has your data already been partitioned correctly--using at least the first/primary join key column--upstream from the join?
As already stated, Joins (and Merges also) requires that the incoming data be partitioned (for parallel jobs) and sorted correctly in order to provide the correct results. The purpose of partitioning is twofold: 1-distribute data among the multiple partitions to spread the workload and 2-place related records into the same partition for correct processing. For Join and Merge, all inputs should be partitioned using the same method on the same key columns.
Since this is a full outer join, the change in the number of output rows (when you enable the sort option) indicates that you may have duplicate records in two or more input links.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.