Sorting in Join stage or database

rumu · Post by **rumu** » Wed Jan 05, 2011 7:02 am

Dear All,

It might be a qs which has been discussed already but still I seek your thoughts.

I am joining two tables on different data base using join stage.Database is SQL server 2008 and we are using ODBC enterprise stage to connect it.

Out of the following three optins which one is preferable:
1)Use order by clause in database level to sort the data and use auto partinioning in Join stage.
2)Use hash partitioning and sort in Join stage only
3)Use auto partitioning in Join stage.

Our client technical lead is preferring the first option.My doubt is even if we use database sort using order by clause,dont we require a explicit sort stage to mention not to sort the data again?or Will 'auto' partitioning take care this?

Regards,
rumu

srinivas.g · Post by **srinivas.g** » Wed Jan 05, 2011 7:05 am

First option is the best.

chulett · Post by **chulett** » Wed Jan 05, 2011 7:19 am

The answer depends on many variables so there is no blanket 'best' here. The first may very well be the best in your case, but there isn't any way for us to know.

And yes, if you go for that first option, since the job will have no idea you've done so you will more than likely need to add specific Sort stages set to "Don't sort, already sorted" so you don't waste all that effort as the job resorts everything for no good reason.

rameshrr3 · Post by **rameshrr3** » Wed Jan 05, 2011 9:36 am

Im confused a bit here, why cant we use Hash Partitioning on join stage and set the stable sort option , is it absolutely necessary to have a sort stage inserted in between? Im sure i missed something here

..

jwiles · Post by **jwiles** » Wed Jan 05, 2011 1:03 pm

rameshrr3 wrote:Im confused a bit here, why cant we use Hash Partitioning on join stage and set the stable sort option , is it absolutely necessary to have a sort stage inserted in between? Im sure i missed something here ..

The purpose of Craig's suggestion was to avoid resorting the data, which has already been sorted by the database. What you mention above will fully resort the dataset within DS.

Stable Sort is not the same as Don't Sort, Already Sorted. DS,AS is only available within the Sort Stage's Properties tab. Stable Sort maintains record order within a unique sort key value, but resorts the dataset according to the other sort options specified.

When you perform partitioning, you are no longer guaranteed that your data (within a partition) has maintained it's sort order. This is especially true if your partitioning options change the distribution of the data among partitions (outside of going from one partition to many) and is largely due to the non-guaranteed arrival time of data rows among the different partitions in the job.

rumu · Post by **rumu** » Thu Jan 06, 2011 3:37 am

Dear All,

We have chosen the 'auto' partition in join stage and the env variable 'APT_NO_SRT_INSERTION' is set False hence no need to do sorting in database level as auto partition will insert by default an order by clause if source is database.

Regards
rumu

chulett · Post by **chulett** » Thu Jan 06, 2011 6:38 am

A couple of points...

rumu wrote:We have chosen the 'auto' partition in join stage and the env variable 'APT_NO_SRT_INSERTION' is set False

This is the same as not even mentioning the variable as that is its default value. Add them to give you the ability / option to override them later but don't think simply adding them like that accomplishes anything.

rumu also wrote:hence no need to do sorting in database level as auto partition will insert by default an order by clause if source is database.

Hmmm... a bit early in the morning here but I don't believe that is true. Does it actually have a clue the 'source is database'? I'll leave that to others to validate for you.

ray.wurlod · Post by **ray.wurlod** » Thu Jan 06, 2011 3:36 pm

rumu wrote: no need to do sorting in database level as auto partition will insert by default an order by clause if source is database

Completely incorrect.

rumu · Post by **rumu** » Fri Jan 07, 2011 5:28 am

Hi All,
Thanks for all your input.It seems that our decision lead to wrong path...Actually,it has been given suggested and claimed by Client Technical lead from IBM ,Denmark and he referred on edocumentation on the same.
Please refer the below guideline from IBM:

Join stage

By default, the Join stage sorts data on both input links. This sorting results in output data that is sorted on the join keys. If the InfoSphere DataStage project environment variable APT_NO_SORT_INSERTION is set, the sort is suppressed. Pushing a Join stage into a database source always generates an ORDER BY clause for the sort keys, as if APT_NO_SORT_INSERTION was not set.

InfoSphere DataStage performs implicit type conversions to make join keys of different types comparable. SQL, however, is much stricter and such implicit conversions can lead to runtime database errors. Therefore Join key columns within different type groups (for examples, numbers and strings) cannot be optimized.

Can you please then suggest the optimal solution on this requirement.

Regards,
rumu

jwiles · Post by **jwiles** » Fri Jan 07, 2011 8:56 am

Rumu,

The documentation you were pointed to is specific to the InfoSphere Balanced Optimization Tool only and not to DataStage in general. Unless you are using the Balanced Optimization tool, ignore that bit of advice as it's not relevant to your situation.

As has been stated in this thread, there is no single optimal solution that covers every situation. The optimal solution for YOUR situation will depend upon the limitations you must operate under, from hardware to client-imposed rules on what you're allowed to do where. Ultimately, the jobs should meet any performance requirements (SLA) that are part of the contract. You won't know for certain, however, until you've developed and tested them. If you have to tune your jobs, that's ok...it's part of the process.

Not knowing the specifics of your data (row size, volumes, number of key columns, etc.), I would initially suggest the following as a starting point:

Extract the data (ODBC stage) into a Sort stage with hash partition on the input. From the Sort stage go into the join stage, with Auto partition set.

This is the typical method to use and will meet most performance requirements with minor tuning, depending on system configurations.

chulett · Post by **chulett** » Fri Jan 07, 2011 9:11 am

jwiles wrote:The documentation you were pointed to is specific to the InfoSphere Balanced Optimization Tool only and not to DataStage in general. Unless you are using the Balanced Optimization tool, ignore that bit of advice as it's not relevant to your situation.

Ah... was wondering where that came from.

jwiles · Post by **jwiles** » Fri Jan 07, 2011 11:33 am

Yeah...it caught me off guard. The IS Balanced Optimization documentation has been incorporated into the IS Information Center. Since there's some overlap like that, you've got to make certain which portion of the InfoCenter you're looking at

rumu · Post by **rumu** » Tue Jan 11, 2011 7:23 am

Thanks all for your valuable input.
In our case,decision is pending from Client(IBM) Technical Architect.

Regards,

Rumu