Difference between Explicit sort and Sort on partition
Posted: Thu Jun 29, 2006 8:13 am
This question has been lingering in my mind for a long time, and I thought it best to ask the forum for opinions.
This question relates to stages such as Join and Aggregator, where the data needs to be partitioned and sorted. In the Join stage chapter of the Parallel Job Developer Guide, an explicit Sort stage is used to sort the data before the join. I have religiously followed the guide in my development and sorted the data with an explicit Sort stage before the join. Yet some developers I have come across do not use an explicit Sort stage; instead they sort each partition on the Join stage itself. In my experiments, both methods (sorting with an explicit Sort stage and sorting on partition) yield correct results.
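To make the setup concrete, here is a minimal sketch in plain Python (not DataStage code) of what both approaches have to achieve before a join: rows are hash-partitioned on the join key, each partition is sorted on that key, and the sorted partitions are merge-joined. The function names, the `id` key, and the sample rows are all hypothetical, and the sketch says nothing about which method performs better; it only illustrates why both can give the same result.

```python
def hash_partition(rows, key, n_parts):
    """Route each row to a partition chosen by the hash of its join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def merge_join(left, right, key):
    """Inner join of two inputs that are already sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left-hand row with that key against the run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


def partitioned_join(left, right, key, n_parts=4):
    """Hash-partition both inputs, sort each partition, then merge-join."""
    result = []
    left_parts = hash_partition(left, key, n_parts)
    right_parts = hash_partition(right, key, n_parts)
    for lp, rp in zip(left_parts, right_parts):
        lp_sorted = sorted(lp, key=lambda r: r[key])
        rp_sorted = sorted(rp, key=lambda r: r[key])
        result.extend(merge_join(lp_sorted, rp_sorted, key))
    return result


# Hypothetical sample data with integer join keys.
orders = [
    {"id": 3, "item": "c"},
    {"id": 1, "item": "a"},
    {"id": 2, "item": "b"},
    {"id": 1, "item": "d"},
]
customers = [
    {"id": 2, "name": "Bo"},
    {"id": 1, "name": "Al"},
    {"id": 4, "name": "Di"},
]

joined = partitioned_join(orders, customers, "id")
```

Because both inputs are partitioned with the same hash on the same key, matching rows always land in the same partition, so each partition only needs to be sorted locally. Whether that local sort is requested in a separate Sort stage or on the input link of the Join stage does not change this logic.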
My question is: which method is preferred? In other words, which one yields better performance, and why? What is the difference between the two methods?
I would greatly appreciate your thoughts on the above questions.
Thanks in advance.
dsdesigner