Sort stage or Link Sort

Post by **harikhk** » Mon Oct 21, 2013 11:53 am

Hi,

I am writing data from a sequential file to a dataset.
The volume of data ranges from 8 million to 20 million records, depending on the file.

I need this data to be sorted based on a single key.

I am not sure whether a Sort stage or a link sort performs better with this volume of data.

Please help me understand which is better.

My version is 8.5

Post by **ray.wurlod** » Mon Oct 21, 2013 3:24 pm

Use an explicit Sort stage. Partition data by the sort key.
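To illustrate why partitioning by the sort key matters, here is a rough Python sketch (not DataStage code). It hash-partitions rows on the key so that all rows sharing a key value land in the same partition, then sorts each partition independently. The function names, the use of `zlib.crc32` as the hash, and the row layout are assumptions made up for this example; the engine's own hash partitioner works differently in detail.

```python
import zlib


def partition_by_key(rows, key, num_partitions):
    """Hash-partition rows so equal key values always share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is used only because it is deterministic across runs;
        # it stands in for whatever hash the real engine applies.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[h % num_partitions].append(row)
    return partitions


def sort_partitions(partitions, key):
    """Sort each partition on the key, independently of the others."""
    return [sorted(p, key=lambda r: r[key]) for p in partitions]
```

Each partition comes out sorted on the key, and no key value is split across partitions, which is what downstream key-based stages generally rely on.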

The Sort stage allows you to allocate more memory than the default to the sorting operation, which means it takes longer before the sort has to spill to scratchdisk.
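The effect of the memory limit can be sketched with a toy external sort in Python. This is only an illustration of the general spill-and-merge idea, not how the Sort stage is implemented: `external_sort` and `max_in_memory` are hypothetical names, the limit is counted in values rather than megabytes, and temporary files stand in for scratch disk.

```python
import heapq
import tempfile


def _spill(sorted_run):
    """Write one sorted run to a temporary file (stand-in for scratch disk)."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{v}\n" for v in sorted_run)
    f.seek(0)
    return f


def _read_run(f):
    """Stream integers back from a spilled run."""
    for line in f:
        yield int(line)


def external_sort(values, max_in_memory):
    """Sort integers while buffering at most max_in_memory of them.

    Returns (sorted_list, number_of_runs_spilled_to_disk).
    """
    if max_in_memory < 1:
        raise ValueError("max_in_memory must be at least 1")
    run_files = []
    buffer = []
    for v in values:
        if len(buffer) == max_in_memory:
            # Buffer is full: sort it, spill it, and start a new run.
            buffer.sort()
            run_files.append(_spill(buffer))
            buffer = []
        buffer.append(v)
    buffer.sort()
    if not run_files:
        # Everything fitted in memory, so nothing was spilled.
        return buffer, 0
    # Merge the in-memory remainder with the spilled runs.
    merged = list(heapq.merge(buffer, *(_read_run(f) for f in run_files)))
    for f in run_files:
        f.close()
    return merged, len(run_files)
```

Calling `external_sort(data, 4)` on ten values spills two runs before merging, while `external_sort(data, 16)` on the same values sorts entirely in memory. A larger buffer means fewer spills and less disk I/O, which is the trade-off the Sort stage's memory setting controls.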

You can change the default with the environment variable APT_TSORT_STRESS_BLOCKSIZE, but beware that this affects every sort within the scope where the variable is set (project or job).