If data has already been sorted (Sort stage) and DataStage then auto-inserts a tsort operator (Join stage, same key as the Sort stage), does it actually sort the data again?
Notice that there are some auto-inserted tsort operators at the Join stage although the data is already sorted.
My question: at run time, will the data be re-sorted or not?
I deliberately left the auto-inserted tsort operators and auto partitioning in place just to test this case.
The question stays the same: is DataStage smart enough to check that the data is already sorted on the desired field and do nothing, or does it have to re-sort all the data?
DataStage cannot tell that data is sorted if it is sorted outside of DataStage. By default it assumes data is not sorted. If there is a stage in the job that requires sorted data such as Remove Duplicates or Join then DataStage will automatically add a tsort before that stage.
There are only two ways to prevent these tsorts from being added:
- Adding APT_NO_SORT_INSERTION to the job, which can be dangerous if the data is not sorted or is sorted incorrectly.
- Adding a Sort stage to the job and setting its key mode to indicate that the data is already sorted and should not be sorted again (for example, Sort Key Mode = "Don't Sort (Previously Sorted)").
If you do not do either of these things then you cannot avoid the tsorts.
But the thing is that the job already contains Sort stages, according to both the job description and the job score (op2 and op3).
@mfexdsx:
What I find suspicious is the difference between the APT_HashPartitioner definitions on ds0 and ds1. In the case of ds1, the subArgument {asc} is specified explicitly in the score, although it should be the default anyway. Maybe this is why DataStage believes there is a difference in partitioning, which then leads to the sort insertion.
Could you try removing the sort-order property on ds1, or setting it identically on ds0?
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
Sorry, I thought I had seen a difference in the sort order of the hash partitioner in the job score you posted, but on double-checking I can't see it now.
Concerning the environment variable: Vincent advised setting APT_NO_SORT_INSERTION, not APT_DISABLE_COMBINATION. If you do not use the Copy stage to drop any input columns, DataStage would usually ignore the stage entirely at compile time. In any case, it can combine the copy operator with the subsequent join step. By setting APT_DISABLE_COMBINATION you prevent the system from doing that, but DataStage still inserts the tsort operator.
There is another option: set APT_SORT_INSERTION_CHECK_ONLY. With this set, the automatically inserted tsort operators only verify the sort order instead of sorting, and the job aborts if data arrives at the Join stage in the wrong order.
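To make the difference between a real sort and a check-only pass more concrete, here is a minimal Python sketch. It is only an illustration of the idea, not DataStage code, and it says nothing about how tsort is actually implemented; the row layout and the names full_sort and check_only are made up for this example.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical row layout for the illustration: (join key, payload).
Row = Tuple[int, str]


def full_sort(rows: Iterable[Row]) -> List[Row]:
    """Buffer every row and sort on the key (O(n log n) time, O(n) memory)."""
    return sorted(rows, key=lambda row: row[0])


def check_only(rows: Iterable[Row]) -> Iterator[Row]:
    """Stream rows through unchanged and fail on the first out-of-order key."""
    previous_key = None
    for row in rows:
        if previous_key is not None and row[0] < previous_key:
            raise ValueError(
                f"out-of-order key {row[0]} after {previous_key}"
            )
        previous_key = row[0]
        yield row


sorted_input = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
unsorted_input = [(1, "a"), (5, "d"), (2, "b")]

# The check-only pass hands sorted input through untouched.
passed_through = list(check_only(sorted_input))

# The full sort repairs unsorted input, at the cost of buffering all rows.
repaired = full_sort(unsorted_input)
```

The check-only pass is a single comparison per row and needs no buffering, but it can only reject bad input, not repair it. Calling list(check_only(unsorted_input)) raises ValueError, which corresponds to the job abort described above.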
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
I understand that setting APT_NO_SORT_INSERTION to True would solve this problem, but that is not actually my concern.
The point is that I want to know whether, if I leave the job design at the defaults (auto partitioning, auto-inserted sorts), job performance would be as good as with manual partitioning and sorting at every stage.