Join returns wrong result with Partitioning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Join returns wrong result with Partitioning

Post by monaz »

Hi All,

I am using a Join stage with Hash partitioning. The source has 80,000 records, but I am getting 90,000 records as the output from the Join stage.

Can anyone please guide me as to what mistake has occurred?

Please suggest, and let me know if you require any further details on this.
miwinter
Participant
Posts: 396
Joined: Thu Jun 22, 2006 7:00 am
Location: England, UK

Post by miwinter »

Sure, whatever you are joining to from your "source of 80,000 records" is making multiple joins to the other side of the join, at least for some of the records (i.e. a one-to-many relationship).
Last edited by miwinter on Mon Jun 29, 2009 9:00 am, edited 1 time in total.
Mark Winter
<i>Nothing appeases a troubled mind more than <b>good</b> music</i>
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Simply getting more records out of the join than went in isn't an error, per se. Please explain the nature of the "join" you are doing.
-craig

"You can never have too many knives" -- Logan Nine Fingers
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

chulett wrote:Simply getting more records out of the join than went in isn't an error, per se. Please explain the nature of the "join" you are doing. ...
Thanks a lot for your response.

I am using a left outer join wherein my source is a database and the other input is a dataset. My requirement is that, from the source database, I need to do a lookup against the dataset and populate the address field from it: if a record matches I need to populate the description, else it will be Null.

Please suggest whether this is the correct way to do it.

Also, I am using Hash partitioning on the inputs based on the 10 key columns. Should I also use a Sort in this case?

I am new to DataStage so please advise me on this.
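As background to the partitioning question, here is a minimal Python sketch (not DataStage code; the function and field names are illustrative) of why both Join inputs must be hash-partitioned on the same key, so that matching rows land in the same partition:

```python
# Sketch (not DataStage): hash-partitioning both Join inputs on the SAME
# key sends matching rows to the same partition, so a per-partition join
# can find every match. Partitioning the two inputs differently breaks this.
def hash_partition(rows, key, n_parts=4):
    parts = [[] for _ in range(n_parts)]
    for r in rows:
        parts[hash(r[key]) % n_parts].append(r)
    return parts

left = hash_partition([{"id": i} for i in range(8)], "id")
right = hash_partition([{"id": i, "addr": "addr%d" % i} for i in range(8)], "id")

# Every id in a left partition has its match in the same right partition:
for lp, rp in zip(left, right):
    assert {r["id"] for r in lp} == {r["id"] for r in rp}
```

(The Join stage additionally requires each input to be sorted on the join keys within each partition, which is what the Sort question above is about.)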
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

miwinter wrote:Sure, whatever you are joining to from your "source of 80,000 records" is making multiple joins to the other side of the join, at least for some of the records (i.e. a one-to-many relationship).
Thanks for your information.

But my requirement is not to get multiple values for a single record.

Could you please suggest how I can resolve this?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).

Imagine the left input has 80,000 records and the right input has two, with the same reference key. An inner join would be expected to return 160,000 values (records).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
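The fan-out Ray describes can be sketched outside DataStage. A minimal Python illustration (keys and field names are made up):

```python
# Sketch (not DataStage): a one-to-many join inflates the row count.
left = [{"key": k} for k in range(3)]                  # 3 "source" rows
right = [{"key": k, "addr": a}                         # 2 right rows per key
         for k in range(3) for a in ("home", "office")]

joined = [dict(l, **r) for l in left
          for r in right if l["key"] == r["key"]]
print(len(left), len(joined))  # 3 rows in, 6 rows out
```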
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

ray.wurlod wrote:Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).

Imagine the left input has 80,000 records and the right input has t ...
Ray,

With the same logic, when I do not use hash partitioning and leave the default as Auto, I get exactly the records that are present in the source.

In this case, instead of using partitioning, I am using two Sort stages to sort the data.

Please suggest what needs to be done.
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

ray.wurlod wrote:Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).

Imagine the left input has 80,000 records and the right input has t ...

Sorry, I forgot to give you some more information.

The left input is 80,000 records and the right input is 90,000 records, but my requirement is to match the 80,000 records against the 90,000 records and take values for the 80,000 records only. If multiple values match, I need to take the first one, so that there won't be any duplicates in the output.

In this scenario, what needs to be done? Please suggest.
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

ray.wurlod wrote:Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).

Imagine the left input has 80,000 records and the right input has t ...
I am very sorry for not giving full information all at once.

I have also noticed in the Join stage that if I set partitioning to Auto, then I do not get the correct description values even though the key columns match.
laknar
Participant
Posts: 162
Joined: Thu Apr 26, 2007 5:59 am
Location: Chennai

Post by laknar »

Sort the data using a Sort stage (on the join key columns) and derive the key change column.

Have a constraint (keyChange = 1, i.e. keep only the first row of each key group) in a Transformer before writing the dataset.

That way we can avoid the one-to-many relationship.
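This key-change approach can be emulated in plain Python to see what the Sort stage's key change column and the Transformer constraint do (field names are illustrative; keeping the first row per key means retaining rows where the key change column is 1):

```python
# Sketch (not DataStage): emulate Sort with a key change column plus a
# Transformer constraint that keeps only the first row per key group.
rows = [
    {"key": 1, "addr": "old"},
    {"key": 1, "addr": "new"},
    {"key": 2, "addr": "only"},
]
rows.sort(key=lambda r: r["key"])              # Sort stage on the join key

deduped, prev = [], object()                   # sentinel: no previous key
for r in rows:
    key_change = 1 if r["key"] != prev else 0  # 1 on first row of a key group
    if key_change == 1:                        # constraint keeps the first row
        deduped.append(r)
    prev = r["key"]

print(deduped)  # one row per key; the first occurrence wins
```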
jcthornton
Premium Member
Posts: 79
Joined: Thu Mar 22, 2007 4:58 pm
Location: USA

Post by jcthornton »

laknar wrote:Sort the data using a Sort stage (on the join key columns) and derive the key change column.

Have a constraint (keyChange = 1, i.e. keep only the first row of each key group) in a Transformer before writing the dataset.

That way we can avoid the one-to-many relationship.
The other option is to sort and remove duplicates on the lookup stream prior to performing the join. Just use the same key fields for all stages. The benefit here is removing records as early in the process as possible.

If you can reduce the 90k records to the 80k you need for the lookup (using an effective date or sequence number, for example) even earlier, the benefit will be greater. At sub-100k records the performance impact is not going to be massive; however, if this is a process where your record counts are going to grow, cutting volume as early as possible becomes more and more important.
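The idea of reducing the right-hand input before the join, keeping the newest row per key by an effective date, can be sketched like this (not DataStage code; the effective_date field name is hypothetical):

```python
# Sketch (not DataStage): reduce the right input before the join, keeping
# the newest row per key. ISO dates sort chronologically as strings.
rows = [
    {"key": 1, "effective_date": "2009-01-01", "addr": "old"},
    {"key": 1, "effective_date": "2009-06-01", "addr": "new"},
    {"key": 2, "effective_date": "2009-03-01", "addr": "only"},
]
rows.sort(key=lambda r: (r["key"], r["effective_date"]))

latest = {}
for r in rows:           # within a key, later (newer) rows overwrite earlier
    latest[r["key"]] = r
reduced = list(latest.values())
print(len(reduced))  # 2: one row per key, the newest of each
```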
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

The correct solution is to go back to the business and have them identify which is the correct address to use... you probably have business/home/postal addresses and some indicator to state which is which... this is a design issue, not a DataStage one.
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

jcthornton wrote:
laknar wrote:Sort the data using a Sort stage (on the join key columns) and derive the key change column.

Have a constraint (keyChange = 1, i.e. keep only the first row of each key group) in a Transformer before writing the dataset.

That way we can avoid the one-to-many relationship.
The other option is to sort and remove duplicates on the lookup stream prior to performing the join. Just use the same key fields for all stages. The benefit here is removing records as early in the process as possible.

If you can reduce the 90k records to the 80k you need for the lookup (using an effective date or sequence number for example) even earlier, the benefit will be better. At sub 100k records, the performance impact is not going to be massive. However, if this is a process where your record counts are going to grow, cutting volume as early as possible gets to be more and more important.
Thanks for your quick response.

I am already removing the duplicate records from the lookup dataset and then performing the join. The problem is that, though there are matching records, the join gives a Null value. Is it because of how the partitioning is happening?
datisaq
Participant
Posts: 154
Joined: Wed May 14, 2008 4:34 am

Post by datisaq »

I think you've set it as a "left outer join". The source records which are not matched will get propagated to the output, and the target attributes which you're taking from the reference will be null.

If you want only exact record matches, change it to an inner join.
IBM Certified - Information Server 8.1
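The difference between left outer and inner join behaviour for unmatched keys, sketched in Python (not DataStage code; field names are made up):

```python
# Sketch (not DataStage): left outer vs inner join for unmatched keys.
left = [{"key": 1}, {"key": 2}]
right = {1: "desc-one"}                               # key 2 has no match

left_outer = [dict(l, desc=right.get(l["key"])) for l in left]
inner = [dict(l, desc=right[l["key"]]) for l in left if l["key"] in right]

print(left_outer)  # key 2 still present, desc=None (NULL)
print(inner)       # unmatched key 2 dropped entirely
```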