Page 1 of 1

Sorting in Join stage or database

Posted: Wed Jan 05, 2011 7:02 am
by rumu
Dear All,

It might be a qs which has been discussed already but still I seek your thoughts.

I am joining two tables on different data base using join stage.Database is SQL server 2008 and we are using ODBC enterprise stage to connect it.

Out of the following three optins which one is preferable:
1)Use order by clause in database level to sort the data and use auto partinioning in Join stage.
2)Use hash partitioning and sort in Join stage only
3)Use auto partitioning in Join stage.

Our client technical lead is preferring the first option.My doubt is even if we use database sort using order by clause,dont we require a explicit sort stage to mention not to sort the data again?or Will 'auto' partitioning take care this?

Regards,
rumu

Posted: Wed Jan 05, 2011 7:05 am
by srinivas.g
First option is the best.

Posted: Wed Jan 05, 2011 7:19 am
by chulett
The answer depends on many variables so there is no blanket 'best' here. The first may very well be the best in your case, but there isn't any way for us to know.

And yes, if you go for that first option, since the job will have no idea you've done so you will more than likely need to add specific Sort stages set to "Don't sort, already sorted" so you don't waste all that effort as the job resorts everything for no good reason.

Posted: Wed Jan 05, 2011 9:36 am
by rameshrr3
Im confused a bit here, why cant we use Hash Partitioning on join stage and set the stable sort option , is it absolutely necessary to have a sort stage inserted in between? Im sure i missed something here :( ..

Posted: Wed Jan 05, 2011 1:03 pm
by jwiles
rameshrr3 wrote:Im confused a bit here, why cant we use Hash Partitioning on join stage and set the stable sort option , is it absolutely necessary to have a sort stage inserted in between? Im sure i missed something here :( ..
The purpose of Craig's suggestion was to avoid resorting the data, which has already been sorted by the database. What you mention above will fully resort the dataset within DS.

Stable Sort is not the same as Don't Sort, Already Sorted. DS,AS is only available within the Sort Stage's Properties tab. Stable Sort maintains record order within a unique sort key value, but resorts the dataset according to the other sort options specified.

When you perform partitioning, you are no longer guaranteed that your data (within a partition) has maintained it's sort order. This is especially true if your partitioning options change the distribution of the data among partitions (outside of going from one partition to many) and is largely due to the non-guaranteed arrival time of data rows among the different partitions in the job.

Posted: Thu Jan 06, 2011 3:37 am
by rumu
Dear All,

We have chosen the 'auto' partition in join stage and the env variable 'APT_NO_SRT_INSERTION' is set False hence no need to do sorting in database level as auto partition will insert by default an order by clause if source is database.


Regards
rumu

Posted: Thu Jan 06, 2011 6:38 am
by chulett
A couple of points...
rumu wrote:We have chosen the 'auto' partition in join stage and the env variable 'APT_NO_SRT_INSERTION' is set False
This is the same as not even mentioning the variable as that is its default value. Add them to give you the ability / option to override them later but don't think simply adding them like that accomplishes anything.
rumu also wrote:hence no need to do sorting in database level as auto partition will insert by default an order by clause if source is database.
Hmmm... a bit early in the morning here but I don't believe that is true. Does it actually have a clue the 'source is database'? I'll leave that to others to validate for you.

Posted: Thu Jan 06, 2011 3:36 pm
by ray.wurlod
rumu wrote: no need to do sorting in database level as auto partition will insert by default an order by clause if source is database
Completely incorrect.

Posted: Fri Jan 07, 2011 5:28 am
by rumu
Hi All,
Thanks for all your input.It seems that our decision lead to wrong path...Actually,it has been given suggested and claimed by Client Technical lead from IBM ,Denmark and he referred on edocumentation on the same.
Please refer the below guideline from IBM:

Join stage

By default, the Join stage sorts data on both input links. This sorting results in output data that is sorted on the join keys. If the InfoSphere DataStage project environment variable APT_NO_SORT_INSERTION is set, the sort is suppressed. Pushing a Join stage into a database source always generates an ORDER BY clause for the sort keys, as if APT_NO_SORT_INSERTION was not set.

InfoSphere DataStage performs implicit type conversions to make join keys of different types comparable. SQL, however, is much stricter and such implicit conversions can lead to runtime database errors. Therefore Join key columns within different type groups (for examples, numbers and strings) cannot be optimized.

Can you please then suggest the optimal solution on this requirement.


Regards,
rumu

Posted: Fri Jan 07, 2011 8:56 am
by jwiles
Rumu,

The documentation you were pointed to is specific to the InfoSphere Balanced Optimization Tool only and not to DataStage in general. Unless you are using the Balanced Optimization tool, ignore that bit of advice as it's not relevant to your situation.

As has been stated in this thread, there is no single optimal solution that covers every situation. The optimal solution for YOUR situation will depend upon the limitations you must operate under, from hardware to client-imposed rules on what you're allowed to do where. Ultimately, the jobs should meet any performance requirements (SLA) that are part of the contract. You won't know for certain, however, until you've developed and tested them. If you have to tune your jobs, that's ok...it's part of the process.

Not knowing the specifics of your data (row size, volumes, number of key columns, etc.), I would initially suggest the following as a starting point:

Extract the data (ODBC stage) into a Sort stage with hash partition on the input. From the Sort stage go into the join stage, with Auto partition set.

This is the typical method to use and will meet most performance requirements with minor tuning, depending on system configurations.

Posted: Fri Jan 07, 2011 9:11 am
by chulett
jwiles wrote:The documentation you were pointed to is specific to the InfoSphere Balanced Optimization Tool only and not to DataStage in general. Unless you are using the Balanced Optimization tool, ignore that bit of advice as it's not relevant to your situation.
Ah... was wondering where that came from.

Posted: Fri Jan 07, 2011 11:33 am
by jwiles
Yeah...it caught me off guard. The IS Balanced Optimization documentation has been incorporated into the IS Information Center. Since there's some overlap like that, you've got to make certain which portion of the InfoCenter you're looking at :roll:

Posted: Tue Jan 11, 2011 7:23 am
by rumu
Thanks all for your valuable input.
In our case,decision is pending from Client(IBM) Technical Architect.

Regards,

Rumu