Inline sort or external sort

hitmanthesilentassasin · Mon Dec 17, 2007 12:26 am

Hi,

I have few below mentioned questions. could any one help me find the answers please?
1. What is the difference between inline sort and external sort?
2. If my data is sorted and then some lookups/aggregations are performed then will my data needs to be sorted once more or will it remain sorted in the parallel environment?

Thanks

ray.wurlod · Post by **ray.wurlod** » Mon Dec 17, 2007 1:32 am

An inline sort is where you specify sorting on the Partitioning tab of a stage's input link. An external sort is, presumably, where you use an explicit Sort stage. The latter gives greater flexibility, including controlling memory allocated to sorting, generation of key change indicators, and performing sub-sorts (not sorting already sorted/grouped columns).

Lookups should not affect sorted order. Aggregations should not affect sorted order provided that it is the grouping columns that are pre-sorted and you use Sort (not Hash Table) as the aggregation method.

hitmanthesilentassasin · Mon Dec 17, 2007 4:10 am

Would that have any of the sort method have any performance issue?

ArndW · Post by **ArndW** » Mon Dec 17, 2007 4:27 am

Yes, it might have a performance impact. As Ray has already pointed out, you can change a number of parameters using an explicit sort; and these changes can impact performance.

ray.wurlod · Post by **ray.wurlod** » Mon Dec 17, 2007 5:36 am

Define "performance" in an ETL context.

just4u_sharath · Post by **just4u_sharath** » Mon Jan 14, 2008 8:11 pm

ray.wurlod wrote:An inline sort is where you specify sorting on the Partitioning tab of a stage's input link. An external sort is, presumably, where you use an explicit Sort stage. The latter gives greater flexibility, including controlling memory allocated to sorting, generation of key change indicators, and performing sub-sorts (not sorting already sorted/grouped columns).

Lookups should not affect sorted order. Aggregations should not affect sorted order provided that it is the grouping columns that are pre-sorted and you use Sort (not Hash Table) as the aggregation method.

What is the difference of using sort and hash in aggregator. i heard if there are more than 1000 rows per mb use sort. But it didnt look logical . please explain in logical terms

just4u_sharath · Post by **just4u_sharath** » Mon Jan 14, 2008 8:14 pm

ray.wurlod wrote:An inline sort is where you specify sorting on the Partitioning tab of a stage's input link. An external sort is, presumably, where you use an explicit Sort stage. The latter gives greater flexibility, including controlling memory allocated to sorting, generation of key change indicators, and performing sub-sorts (not sorting already sorted/grouped columns).

Lookups should not affect sorted order. Aggregations should not affect sorted order provided that it is the grouping columns that are pre-sorted and you use Sort (not Hash Table) as the aggregation method.

What is the difference of using sort and hash in aggregator. i heard if there are more than 1000 rows per mb use sort. But it didnt look logical . please explain in logical terms

ray.wurlod · Post by **ray.wurlod** » Mon Jan 14, 2008 9:11 pm

It's a "rule of thumb" rather than a hard and fast rule. The hash table, in which every distinct combination of key values plus all calculated/counted output columns are stored, must fit in physical memory. If they won't, use Sort. If you're not sure, use Sort. With Sort, memory can be freed (and one or more rows output) every time one of the grouping fields changes value.