Page 1 of 1

Case sensitive option when sorting

Posted: Fri Apr 16, 2010 1:30 pm
by shankar_ramanath
Please excuse my ignorance here. When I select a VARCHAR column to sort (Input-Partitioning-Hash-Sort), the option "Case-sensitive" is the default for the selected column. I would assume that "Case-sensitive" sort is more expensive than "Case-insensitive" sort. If so, why is it the default?

Thanks,

Posted: Fri Apr 16, 2010 1:40 pm
by chowdhury99
I don't think so. Unless you need case-insensitive, you may keep that option.

Thanks

Posted: Fri Apr 16, 2010 2:21 pm
by chulett
It's the default as that is the most common choice for that option.

Posted: Fri Apr 16, 2010 3:15 pm
by ray.wurlod
Case insensitive is more costly, because an internal conversion (to all upper case, as fas as I am aware) has to be performed and the sort performed on that field.

Posted: Fri Apr 16, 2010 4:52 pm
by shankar_ramanath
Thanks Ray. I agree. I did not think about it.

Posted: Sat Apr 17, 2010 6:51 pm
by chowdhury99
Thanks Ray for your valuable opinion.

Please correct me if I am wrong. Sort stage stores all data to memory space and then sorts. The most expensive part of the sort stage is storing data into memory space. Once all data is in the memory space, sorting operation does not take that much time compared to storing into memory space. That's why if you even add one more sorting key, overall processing time does go up that much. So, case sensitive option will not decrease the performance that much.

Please give your thoughtful opinion on this matter.

Best Regards.

Posted: Sat Apr 17, 2010 7:34 pm
by ray.wurlod
This is not the whole story. Only a certain amount of memory per node is available for sorting. Once this amount is exceeded, the sort must use disk files in the scratchdisk areas. This adds cost that you have not considered, probably more cost than the other steps (depending on the source of the rows to be sorted, which may already be in memory in a virtual Data Set).

The amount of memory available to the tsort operator is governed by the environment variable APT_TSORT_STRESS_BLOCKSIZE. If this environment variable is not set, then the Sort stage has a property that allows the amount of memory (per node) to be specified.