What is difference between explicit Sort stage and sort ....

Posted: Thu Sep 20, 2007 2:57 am
by mavrick21
Hi All,

To use a Join stage, the data should be hash partitioned and sorted. In our jobs we join two tables. Before the Join stage we use a Sort stage on each input link, both to sort the data and to hash partition it.
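
To make the requirement above concrete, here is a minimal Python sketch of why a join wants both inputs hash partitioned and sorted on the join key. It is an illustration only, not DataStage code: the function names (`hash_partition`, `merge_join`, `partitioned_join`) and the sample data are hypothetical, and the real engine's partitioner and join operator are assumed to behave only roughly like this.

```python
from collections import defaultdict
from operator import itemgetter


def hash_partition(rows, key, num_partitions):
    """Send each row to a partition chosen from the hash of its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def merge_join(left, right, key):
    """Inner-join two lists that are already sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left row with every right row sharing the key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                out.append({**left[i], **right[j_end]})
                j_end += 1
            i += 1
    return out


def partitioned_join(left, right, key, num_partitions=4):
    """Hash partition both inputs, sort each partition, then merge-join."""
    lparts = hash_partition(left, key, num_partitions)
    rparts = hash_partition(right, key, num_partitions)
    result = []
    for p in range(num_partitions):
        # Matching keys always land in the same partition number on both
        # sides, so each partition can be sorted and joined independently.
        l_sorted = sorted(lparts[p], key=itemgetter(key))
        r_sorted = sorted(rparts[p], key=itemgetter(key))
        result.extend(merge_join(l_sorted, r_sorted, key))
    return result


customers = [{"id": 3, "name": "c"}, {"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 2, "amt": 20}, {"id": 3, "amt": 30}, {"id": 3, "amt": 31}]
joined = partitioned_join(customers, orders, "id", num_partitions=2)
```

Hash partitioning guarantees that rows with equal keys meet on the same node, and sorting lets the join walk both inputs once instead of searching for matches.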

Is there any advantage to using an explicit Sort stage over in-stage sorting? By in-stage sorting I mean the Sort option inside the Join stage.

Is an explicit sort better performance-wise than an in-stage sort?

Can anyone please clarify and provide more details on this?

Thanks in advance!

Posted: Thu Sep 20, 2007 3:04 am
by Raghavendra
An explicit Sort stage uses temporary disk space when performing a sort. I believe that when you are handling huge volumes of data you will not run into resource problems, because the sort is using temporary disk space.

Let's see what our DS experts have to say on this query.

Posted: Thu Sep 20, 2007 5:56 am
by rajeevn80
When you need sort functionality along with another stage, such as a Join in this case, it is always better to do in-stage (implicit) sorting. This has advantages over adding an extra stage.
For example, in your case you have probably put two Sort stages, one on each link before the Join. This increases the number of processes in the job. Running on 'n' nodes creates even more sort processes, each of which requires additional resources. With implicit or in-stage sorting, the DS engine sorts the data directly in memory and treats the join and sort as a single process.

Posted: Thu Sep 20, 2007 4:45 pm
by ray.wurlod
Explicit Sort stages allow you to more easily control the memory allocated to sorting, and to perform sub-sorts without re-sorting the already-sorted sort key columns.
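
As a rough illustration of the sub-sort idea mentioned above, the Python sketch below sorts on a secondary column only within runs of an already-sorted key, without touching the existing order of that key. It is not DataStage code; the function name `sub_sort` and the column names `cust` and `date` are hypothetical, and the input is assumed to arrive already sorted on the first key.

```python
from itertools import groupby
from operator import itemgetter


def sub_sort(rows, sorted_key, sub_key):
    """Sort on `sub_key` within each run of equal `sorted_key` values.

    Assumes `rows` already arrives ordered on `sorted_key`; that order is
    trusted and never re-sorted.
    """
    for _, group in groupby(rows, key=itemgetter(sorted_key)):
        # Only one group is sorted at a time, not the whole data set.
        yield from sorted(group, key=itemgetter(sub_key))


rows = [
    {"cust": 1, "date": 3},
    {"cust": 1, "date": 1},
    {"cust": 2, "date": 5},
    {"cust": 2, "date": 2},
]
result = list(sub_sort(rows, "cust", "date"))
```

Because each group is sorted on its own, the work and memory needed scale with the largest group instead of the full input.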

All forms of sort will create extra processes. Look at the score to verify that this is the case.

Posted: Fri Sep 21, 2007 12:21 am
by mavrick21
Thanks, all!