Join returns wrong result with Partitioning
Hi All,
I am using a Join stage with Hash partitioning. If the source has 80,000 records, then I am getting 90,000 records as output from the Join stage.
Can anyone please guide me as to what mistake has occurred?
Please suggest, and let me know if you require any further details on this.
Sure, whatever you are joining to from your "source of 80,000 records" is making multiple joins to the other side of the join, at least for some of the records (i.e. a one-to-many relationship).
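To see why a one-to-many relationship inflates the output count, here is a plain-Python sketch of a left join (not DataStage itself, just the join semantics): each left row is paired with EVERY matching right row, so duplicate keys on the right multiply the output.

```python
# Toy inputs: the right side has two rows for id=1 (one-to-many).
left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
right = [{"id": 1, "addr": "x"}, {"id": 1, "addr": "y"}, {"id": 2, "addr": "z"}]

def left_join(left_rows, right_rows, key):
    # Build an index of right rows per key; a key may map to many rows.
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    out = []
    for l in left_rows:
        matches = index.get(l[key], [None])  # None -> unmatched (outer join)
        for m in matches:
            out.append({**l, "addr": m["addr"] if m else None})
    return out

rows = left_join(left, right, "id")
print(len(rows))  # 4: three left rows became four output rows, because id=1 matched twice
```

This is exactly the "80,000 in, 90,000 out" effect: roughly 10,000 of the left rows are finding more than one match on the right.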
Mark Winter
<i>Nothing appeases a troubled mind more than <b>good</b> music</i>
Thanks a lot for your response.

chulett wrote: Simply getting more records out of the join than went in isn't an error, per se. Please explain the nature of the "join" you are doing. ...

I am using a left outer join, where my source is a database and the other input is a dataset. My requirement is to look up against the dataset from the source database and populate the address field from the dataset: if a record matches, I populate the description; otherwise it will be Null.
Please suggest whether this is the correct way to do it.
Also, I am using Hash partitioning on the inputs, based on the 10 key columns. Do I also have to use a Sort in this case?
I am new to DataStage, so please advise me on this.
Thanks for your information.

miwinter wrote: Sure, whatever you are joining to from your "source of 80,000 records" is making multiple joins to the other side of the join, at least for some of the records (i.e. a one-to-many relationship).

But my requirement is not to get multiple values for a single record.
Could you please suggest how I can resolve this?
Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).
Imagine the left input has 80,000 records and the right input has two, with the same reference key. An inner join would be expected to return 160,000 values (records).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Ray,

ray.wurlod wrote: Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).
Imagine the left input has 80,000 records and the right input has t ...

With the same logic, when I do not use Hash partitioning and leave the default of Auto, I get exactly the number of records present in the source.
In this case, instead of using partitioning, I am using two Sort stages to sort the data.
Please suggest what needs to be done.
ray.wurlod wrote: Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).
Imagine the left input has 80,000 records and the right input has t ...

Sorry, I forgot to give you some more information.
The left input is 80,000 records and the right input is 90,000 records, but my requirement is to match the 80,000 records against the 90,000 and take values for the 80,000 records only. If there are multiple matches, I need to take the first one, so that there won't be any duplicates in the output.
In this scenario, what needs to be done? Please suggest.
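The "take the first match only" requirement is equivalent to de-duplicating the right (reference) input on the join key before the join, so the output row count equals the left row count. A minimal plain-Python sketch of that keep-first de-duplication (not DataStage; the rows must already be in whatever order defines "first"):

```python
# Right input with a duplicate key: id=1 appears twice.
right = [{"id": 1, "addr": "x"}, {"id": 1, "addr": "y"}, {"id": 2, "addr": "z"}]

def keep_first(rows, key):
    # Keep only the first row seen for each key value.
    seen, out = set(), []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

deduped = keep_first(right, "id")
print([r["addr"] for r in deduped])  # ['x', 'z'] - one row per key
```

In a DataStage job the same effect is usually achieved with a Remove Duplicates stage (or a Sort stage, as discussed below) on the reference link before the Join stage.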
I am very sorry for not giving the full information at once.

ray.wurlod wrote: Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).
Imagine the left input has 80,000 records and the right input has t ...

I have also noticed that if I set the Join stage's partitioning to Auto, I do not get the correct description values even though the key columns match.
The other option is to sort and remove duplicates on the lookup stream prior to performing the join. Just use the same key fields for all stages. The benefit here is removing records as early in the process as possible.

laknar wrote: Sort the data using a Sort stage (on the join key column) and derive the key change column.
Have a constraint (keychangecolumn=0) in a Transformer before writing the dataset.
We can avoid the one-to-many relationship.

If you can reduce the 90k records to the 80k you need for the lookup (using an effective date or sequence number, for example) even earlier, the benefit will be greater. At sub-100k records the performance impact is not going to be massive. However, if this is a process where your record counts are going to grow, cutting volume as early as possible becomes more and more important.
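The key-change approach above can be sketched in plain Python (not DataStage). Note that in the DataStage Sort stage the key-change column is 1 on the first row of each key group and 0 on the repeats, so the Transformer constraint should keep the rows flagged 1 in order to retain one row per key:

```python
# Rows must already be sorted on the join key ("id" here).
rows = [{"id": 1, "addr": "x"}, {"id": 1, "addr": "y"}, {"id": 2, "addr": "z"}]

# Derive the key-change flag: 1 where the key differs from the previous row.
prev = object()  # sentinel distinct from any real key
for r in rows:
    r["key_change"] = 1 if r["id"] != prev else 0
    prev = r["id"]

# Constraint: keep only the first row of each key group (key_change = 1).
firsts = [r for r in rows if r["key_change"] == 1]
print([r["addr"] for r in firsts])  # ['x', 'z'] - duplicates dropped
```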
Thanks for your quick response.

jcthornton wrote: The other option is to sort and remove duplicates on the lookup stream prior to performing the join. Just use the same key fields for all stages. The benefit here is removing records as early in the process as possible. ...
I am already removing the duplicate records from the lookup dataset and then performing the join. The problem now is that even though the records match, the join gives a Null value. Is this because of the partitioning?
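That is the classic symptom of inconsistent partitioning. A parallel join only compares rows within the same partition, so if the two inputs are partitioned by different rules (or on different key columns), matching keys can land in different partitions and come out unmatched (Null). A plain-Python sketch of the effect, with a toy two-partition "engine":

```python
left = [{"id": 1}, {"id": 2}]
right = [{"id": 1, "addr": "x"}, {"id": 2, "addr": "y"}]

def partition(rows, nparts, keyfunc):
    # Place each row in the partition chosen by keyfunc (a toy hash).
    parts = [[] for _ in range(nparts)]
    for r in rows:
        parts[keyfunc(r) % nparts].append(r)
    return parts

left_parts = partition(left, 2, lambda r: r["id"])        # hashed on the key
right_parts = partition(right, 2, lambda r: r["id"] + 1)  # a DIFFERENT rule

# Join each partition pair independently, as a parallel engine would.
matched = 0
for lp, rp in zip(left_parts, right_parts):
    right_keys = {r["id"] for r in rp}
    matched += sum(1 for l in lp if l["id"] in right_keys)

print(matched)  # 0: every key's match sits in the other partition
```

The fix is to Hash partition both join inputs on the same join key columns (and sort both on those keys), so that matching rows are guaranteed to land in the same partition.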