Is the sort merge collector optimised for node-sorted input?

zulfi123786 · Post by **zulfi123786** » Sun Sep 28, 2014 2:43 pm

Hi,

I was wondering whether the sort merge collector is optimised for parallel sorted input. One of the developers wanted a totally sorted sequential file and, to get it, put a sort stage before the sequential file and left the collector in auto mode.

Before switching the collector to sort merge, I wanted to know whether it would blindly re-sort the data or whether it is intelligent enough to recognise that the incoming data is already grouped and node sorted. The file is 100 GB, which is why I am thinking along these lines.

Interestingly, the file from the current run, despite being 100 GB, was totally sorted. I was expecting at least a few breaks. Such are the mysteries of auto mode.

Thanks

ray.wurlod · Post by **ray.wurlod** » Sun Sep 28, 2014 4:33 pm

Sort-merge collector does not re-sort the data (blindly or otherwise). It depends on the fact that the data are sorted already, partition by partition, on the indicated key, and monitors the next value queued to come in from each partition, transferring the one that is next in sorted order.
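To make the mechanism concrete, here is a minimal Python sketch of a k-way merge of the kind described above. It is only an illustration of the idea, not DataStage's actual implementation; the function name `sort_merge_collect` and the sample partitions are made up for the example. Each partition is assumed to be sorted already, and the collector only ever compares the row at the head of each partition.

```python
import heapq

_DONE = object()  # sentinel marking an exhausted partition


def sort_merge_collect(partitions, key=lambda row: row):
    """Merge partitions that are each already sorted on `key`.

    Nothing is re-sorted: only the next queued row of each partition
    is compared, and the smallest one is passed through.
    """
    iterators = [iter(p) for p in partitions]
    heap = []
    # Prime the heap with the first row of every non-empty partition.
    for idx, it in enumerate(iterators):
        row = next(it, _DONE)
        if row is not _DONE:
            heap.append((key(row), idx, row))
    heapq.heapify(heap)

    while heap:
        _, idx, row = heapq.heappop(heap)
        yield row
        # Refill from the partition that just supplied a row.
        nxt = next(iterators[idx], _DONE)
        if nxt is not _DONE:
            heapq.heappush(heap, (key(nxt), idx, nxt))


# Hypothetical data: three partitions, each sorted on its own.
partitions = [[1, 4, 9], [2, 3, 10], [5, 6, 7]]
merged = list(sort_merge_collect(partitions))
print(merged)
```

With k partitions the heap never holds more than k rows, so the cost per row is a comparison against k queued values rather than a second sort of the whole 100 GB.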

Auto does not select sort-merge as the collection algorithm. It may be that your sorted parallel data were partitioned using a method amenable to the "hungry" round robin collection that Auto selects.
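As a toy model of what "amenable" could mean here, the Python sketch below deals sorted rows out round robin and then collects them strictly round robin. This is a simplification and an assumption: the real Auto collector takes rows as they become available, and the helper names `round_robin_partition` and `round_robin_collect` are invented for the illustration. It shows one layout where the order survives and one where it does not.

```python
from itertools import zip_longest

_MISSING = object()  # filler for partitions that run out early


def round_robin_partition(rows, n):
    """Deal rows to n partitions in turn: 0, 1, ..., n-1, 0, 1, ..."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def round_robin_collect(parts):
    """Take one row from each partition in turn until all are empty."""
    out = []
    for group in zip_longest(*parts, fillvalue=_MISSING):
        out.extend(r for r in group if r is not _MISSING)
    return out


# Case 1: sorted rows dealt round robin come back in sorted order.
rows = list(range(10))
restored = round_robin_collect(round_robin_partition(rows, 3))
print(restored)

# Case 2: partitions sorted individually but split by key range do not.
scrambled = round_robin_collect([[1, 2, 3], [4, 5, 6]])
print(scrambled)
```

So a totally sorted output from a non-merging collector is possible but depends on how the rows happen to be laid out across partitions; sort merge is the method that guarantees it.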