Sorting in Join stage or database

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
rumu
Participant
Posts: 286
Joined: Mon Jun 06, 2005 4:07 am

Sorting in Join stage or database

Post by rumu »

Dear All,

It might be a qs which has been discussed already but still I seek your thoughts.

I am joining two tables on different data base using join stage.Database is SQL server 2008 and we are using ODBC enterprise stage to connect it.

Out of the following three optins which one is preferable:
1)Use order by clause in database level to sort the data and use auto partinioning in Join stage.
2)Use hash partitioning and sort in Join stage only
3)Use auto partitioning in Join stage.

Our client technical lead is preferring the first option.My doubt is even if we use database sort using order by clause,dont we require a explicit sort stage to mention not to sort the data again?or Will 'auto' partitioning take care this?

Regards,
rumu
Rumu
IT Consultant
srinivas.g
Participant
Posts: 251
Joined: Mon Jun 09, 2008 5:52 am

Post by srinivas.g »

First option is the best.
Srinu Gadipudi
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The answer depends on many variables so there is no blanket 'best' here. The first may very well be the best in your case, but there isn't any way for us to know.

And yes, if you go for that first option, since the job will have no idea you've done so you will more than likely need to add specific Sort stages set to "Don't sort, already sorted" so you don't waste all that effort as the job resorts everything for no good reason.
-craig

"You can never have too many knives" -- Logan Nine Fingers
rameshrr3
Premium Member
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

Im confused a bit here, why cant we use Hash Partitioning on join stage and set the stable sort option , is it absolutely necessary to have a sort stage inserted in between? Im sure i missed something here :( ..
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

rameshrr3 wrote:Im confused a bit here, why cant we use Hash Partitioning on join stage and set the stable sort option , is it absolutely necessary to have a sort stage inserted in between? Im sure i missed something here :( ..
The purpose of Craig's suggestion was to avoid resorting the data, which has already been sorted by the database. What you mention above will fully resort the dataset within DS.

Stable Sort is not the same as Don't Sort, Already Sorted. DS,AS is only available within the Sort Stage's Properties tab. Stable Sort maintains record order within a unique sort key value, but resorts the dataset according to the other sort options specified.

When you perform partitioning, you are no longer guaranteed that your data (within a partition) has maintained it's sort order. This is especially true if your partitioning options change the distribution of the data among partitions (outside of going from one partition to many) and is largely due to the non-guaranteed arrival time of data rows among the different partitions in the job.
rumu
Participant
Posts: 286
Joined: Mon Jun 06, 2005 4:07 am

Post by rumu »

Dear All,

We have chosen the 'auto' partition in join stage and the env variable 'APT_NO_SRT_INSERTION' is set False hence no need to do sorting in database level as auto partition will insert by default an order by clause if source is database.


Regards
rumu
Rumu
IT Consultant
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

A couple of points...
rumu wrote:We have chosen the 'auto' partition in join stage and the env variable 'APT_NO_SRT_INSERTION' is set False
This is the same as not even mentioning the variable as that is its default value. Add them to give you the ability / option to override them later but don't think simply adding them like that accomplishes anything.
rumu also wrote:hence no need to do sorting in database level as auto partition will insert by default an order by clause if source is database.
Hmmm... a bit early in the morning here but I don't believe that is true. Does it actually have a clue the 'source is database'? I'll leave that to others to validate for you.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

rumu wrote: no need to do sorting in database level as auto partition will insert by default an order by clause if source is database
Completely incorrect.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rumu
Participant
Posts: 286
Joined: Mon Jun 06, 2005 4:07 am

Post by rumu »

Hi All,
Thanks for all your input.It seems that our decision lead to wrong path...Actually,it has been given suggested and claimed by Client Technical lead from IBM ,Denmark and he referred on edocumentation on the same.
Please refer the below guideline from IBM:

Join stage

By default, the Join stage sorts data on both input links. This sorting results in output data that is sorted on the join keys. If the InfoSphere DataStage project environment variable APT_NO_SORT_INSERTION is set, the sort is suppressed. Pushing a Join stage into a database source always generates an ORDER BY clause for the sort keys, as if APT_NO_SORT_INSERTION was not set.

InfoSphere DataStage performs implicit type conversions to make join keys of different types comparable. SQL, however, is much stricter and such implicit conversions can lead to runtime database errors. Therefore Join key columns within different type groups (for examples, numbers and strings) cannot be optimized.

Can you please then suggest the optimal solution on this requirement.


Regards,
rumu
Rumu
IT Consultant
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Rumu,

The documentation you were pointed to is specific to the InfoSphere Balanced Optimization Tool only and not to DataStage in general. Unless you are using the Balanced Optimization tool, ignore that bit of advice as it's not relevant to your situation.

As has been stated in this thread, there is no single optimal solution that covers every situation. The optimal solution for YOUR situation will depend upon the limitations you must operate under, from hardware to client-imposed rules on what you're allowed to do where. Ultimately, the jobs should meet any performance requirements (SLA) that are part of the contract. You won't know for certain, however, until you've developed and tested them. If you have to tune your jobs, that's ok...it's part of the process.

Not knowing the specifics of your data (row size, volumes, number of key columns, etc.), I would initially suggest the following as a starting point:

Extract the data (ODBC stage) into a Sort stage with hash partition on the input. From the Sort stage go into the join stage, with Auto partition set.

This is the typical method to use and will meet most performance requirements with minor tuning, depending on system configurations.
- james wiles


All generalizations are false, including this one - Mark Twain.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

jwiles wrote:The documentation you were pointed to is specific to the InfoSphere Balanced Optimization Tool only and not to DataStage in general. Unless you are using the Balanced Optimization tool, ignore that bit of advice as it's not relevant to your situation.
Ah... was wondering where that came from.
-craig

"You can never have too many knives" -- Logan Nine Fingers
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Yeah...it caught me off guard. The IS Balanced Optimization documentation has been incorporated into the IS Information Center. Since there's some overlap like that, you've got to make certain which portion of the InfoCenter you're looking at :roll:
- james wiles


All generalizations are false, including this one - Mark Twain.
rumu
Participant
Posts: 286
Joined: Mon Jun 06, 2005 4:07 am

Post by rumu »

Thanks all for your valuable input.
In our case,decision is pending from Client(IBM) Technical Architect.

Regards,

Rumu
Rumu
IT Consultant
Post Reply