Case sensitive option when sorting

shankar_ramanath · Post by **shankar_ramanath** » Fri Apr 16, 2010 1:30 pm

Please excuse my ignorance here. When I select a VARCHAR column to sort (Input-Partitioning-Hash-Sort), the option "Case-sensitive" is the default for the selected column. I would assume that "Case-sensitive" sort is more expensive than "Case-insensitive" sort. If so, why is it the default?

Thanks,

chowdhury99 · Post by **chowdhury99** » Fri Apr 16, 2010 1:40 pm

I don't think so. Unless you need case-insensitive, you may keep that option.

Thanks

chulett · Post by **chulett** » Fri Apr 16, 2010 2:21 pm

It's the default as that is the most common choice for that option.

ray.wurlod · Post by **ray.wurlod** » Fri Apr 16, 2010 3:15 pm

Case insensitive is more costly, because an internal conversion (to all upper case, as fas as I am aware) has to be performed and the sort performed on that field.

shankar_ramanath · Post by **shankar_ramanath** » Fri Apr 16, 2010 4:52 pm

Thanks Ray. I agree. I did not think about it.

chowdhury99 · Post by **chowdhury99** » Sat Apr 17, 2010 6:51 pm

Thanks Ray for your valuable opinion.

Please correct me if I am wrong. Sort stage stores all data to memory space and then sorts. The most expensive part of the sort stage is storing data into memory space. Once all data is in the memory space, sorting operation does not take that much time compared to storing into memory space. That's why if you even add one more sorting key, overall processing time does go up that much. So, case sensitive option will not decrease the performance that much.

Please give your thoughtful opinion on this matter.

Best Regards.

ray.wurlod · Post by **ray.wurlod** » Sat Apr 17, 2010 7:34 pm

This is not the whole story. Only a certain amount of memory per node is available for sorting. Once this amount is exceeded, the sort must use disk files in the scratchdisk areas. This adds cost that you have not considered, probably more cost than the other steps (depending on the source of the rows to be sorted, which may already be in memory in a virtual Data Set).

The amount of memory available to the tsort operator is governed by the environment variable APT_TSORT_STRESS_BLOCKSIZE. If this environment variable is not set, then the Sort stage has a property that allows the amount of memory (per node) to be specified.