How to preventing the join stage from sorting the records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ThilSe
Participant
Posts: 80
Joined: Thu Jun 09, 2005 7:45 am

How to preventing the join stage from sorting the records

Post by ThilSe »

Hi,

I have a job that does the sorting in Database itself.

Is there any way by which one can prevent the join stage from sorting the input data again.

Thanks
Senthil
deployDS
Premium Member
Premium Member
Posts: 45
Joined: Thu Mar 09, 2006 9:36 am

Re: How to preventing the join stage from sorting the record

Post by deployDS »

[quote="ThilSe"]Hi,

I have a job that does the sorting in Database itself.

Is there any way by which one can prevent the join stage from sorting the input data again.

Thanks
Senthil[/quote]

Join stage does not sort incoming data unless you specify it to.
clickart
Premium Member
Premium Member
Posts: 15
Joined: Tue Oct 18, 2005 10:14 pm
Location: Schaumburg, IL

Re: How to preventing the join stage from sorting the record

Post by clickart »

You can check the partitioning type specified on the link.
meena
Participant
Posts: 430
Joined: Tue Sep 13, 2005 12:17 pm

Post by meena »

Hi Senthil,
I agree with 'deployDS '

Code: Select all

Join stage does not sort incoming data unless you specify it to. 
There is option in the stage to sort. You can enable it if you want the sorting option else not...

Thank you
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

It might the obvious myth that gets in to mind that, Partion makes the data sorted. No.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
balajisr
Charter Member
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Re: How to preventing the join stage from sorting the record

Post by balajisr »

deployDS wrote: Join stage does not sort incoming data unless you specify it to.
Can you clarify this.

I think join stage does sorting implicitly.
SriKara
Premium Member
Premium Member
Posts: 30
Joined: Wed Jun 01, 2005 8:40 am
Location: UK

Post by SriKara »

I think join stage does sorting implicitly
Join stage does require both of its input datasets to be in sorted order for a correct output dataset.

However the default partitioning method which is "AUTO" ensures that sorting is done on both the input links to the stage. It is worthwile to note that "AUTO" partitioning method causes the data to be sorted. if somebody has specified another method of partition, then we need to take care that sorted data is input to join stage - either by specifying perform sort on input link partitioning method or by deploying a sort stage.
Regards,
SriKara
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

SriKara wrote:Join stage does require both of its input datasets to be in sorted order for a correct output dataset.
Is it the case - that if the data in the join is not sorted and just hash partitioned, then it will not give correct result?

As per whatever exp I have on PX with join stage, it seems this is not true.
Even if the data on which you are doing join is just hash partitioned with no sorting on keys, still you should get the correctly joined data.

So why sorting is required - may be to make search or join process faster. Its normal rule, if data is sorted then search will be faster.

So I think as long as data is hash partioned on join keys, sorting would make no diff to the output data.
Regards,
S. Kirtikumar.
SriKara
Premium Member
Premium Member
Posts: 30
Joined: Wed Jun 01, 2005 8:40 am
Location: UK

Post by SriKara »

Agree with Kirtikumar completely.
I should have said "Join stage requires both of its input datasets to be key partitioned and sorted".

Hash key partition ensures, the same keys are in one partition and are not scattered across. This would ensure a correct join. Sorted data would make things faster.

The point i was trying to make is, join stage may not sort implicitly as mentioned by balajisr.
Regards,
SriKara
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Code: Select all

The data sets input to the Join stage must be key partitioned and sorted.
is the recomendation given by Ascential, for the best results. Threre is auto insertion of tsort and psort if the data is found not in sorted order. And this can be controlled by the parameter available in DS Admnistrator. Turn this off and test with the unsorted data, you may get the undesired result.
As thumb rule, the data should be properly partitioned and sorted based on the joining key.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
balajisr
Charter Member
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

SriKara wrote:Agree with Kirtikumar completely.
I should have said "Join stage requires both of its input datasets to be key partitioned and sorted".

Hash key partition ensures, the same keys are in one partition and are not scattered across. This would ensure a correct join. Sorted data would make things faster.

The point i was trying to make is, join stage may not sort implicitly as mentioned by balajisr.
As far as i know Join stage sorts data implicitly even when you specify hash partitioning with no sorting options. Please correct me if i am wrong.
As kumar mentioned there is a parameter available which when set does not sort implicitly.
splayer
Charter Member
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

kumar_s, unless you check the Perform Sort checkbox, it should not sort, right? Well, if sorting is implicit regardless of whether you have checked Perform Sort or not, then setting this check box is useless.

Can we set "same" partitioning method to prevent sorting? Of course that would mean that it'll take the previous stage's partitioning method.

Please shed some light.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Ok after going through the Operators Guide, though it is given in osh terms, upto my understanding, the auto insertion of partiton and sort happens only if we have auto partition (Datastage). And tsort and psort are the two integrated options of the available Sort stages(Operators). The auto insertion of the sort operators and partition can be controlled by the two environment variables APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION.
To avoid the implicit interference, we can use our explict sort and partition.
Auto partition is not as such avaiable in the osh, but it was implicitly used. Fortunately we were given the explict option of seleting Auto partiton in Datastage.

IHTH
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
dsdesigner
Participant
Posts: 34
Joined: Thu Jul 29, 2004 1:03 pm

Post by dsdesigner »

Another way you could check the effects of enabling the various options (explicit sort stage, sort on partition, auto partition, hash partition, etc.) would be create jobs with the various alternatives, compile the job and then check the "Generated OSH". Of course, the ability to view the "generated osh" must have been enabled by your administrtaor".

Thanks,
Shekar
felixyong
Participant
Posts: 35
Joined: Tue Jul 22, 2003 7:24 pm
Location: Australia

Re: How to preventing the join stage from sorting the record

Post by felixyong »

ThilSe wrote:Hi,

I have a job that does the sorting in Database itself.

Is there any way by which one can prevent the join stage from sorting the input data again.

Thanks
Senthil
Hi

By default sort is added as you can see it in the score, as this is how DS EE ensure correct results in join.

The method to prevent auto insertion is the following -
Set $APT_SORT_INSERTION_CHECK_ONLY to change behavior of automatically inserted sorts

Regards
Felix
Post Reply