Join returns wrong result with Partitioning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Join returns wrong result with Partitioning

Post by monaz »

Hi All,

I am using a Join stage with Hash partitioning. The source has 80,000 records, but I am getting 90,000 records as the output from the Join stage.

Can anyone please guide me as to what mistake has occurred?

Please suggest, and let me know if you require any further details on this.
miwinter
Participant
Posts: 396
Joined: Thu Jun 22, 2006 7:00 am
Location: England, UK

Post by miwinter »

Sure, whatever you are joining to from your "source of 80,000 records" is making multiple joins to the other side of the join, at least for some of the records (i.e. a one-to-many relationship).
Last edited by miwinter on Mon Jun 29, 2009 9:00 am, edited 1 time in total.
Mark Winter
<i>Nothing appeases a troubled mind more than <b>good</b> music</i>
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Simply getting more records out of the join than went in isn't an error, per se. Please explain the nature of the "join" you are doing.
-craig

"You can never have too many knives" -- Logan Nine Fingers
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

chulett wrote:Simply getting more records out of the join than went in isn't an error, per se. Please explain the nature of the "join" you are doing. ...
Thanks a lot for your response.

I am using a left outer join wherein my source is a database and the other input is a dataset. My requirement is that, from the source database, I need to do a lookup against the dataset and populate the address field from it: if a record matches I need to populate the description, else it will be Null.

Please suggest whether this is the correct way to do it.

Also, I am using Hash partitioning on the inputs based on the 10 key columns. Should I also use a Sort in this case?

I am new to DataStage so please advise me on this.
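As background to the partitioning question, here is a minimal Python sketch (not DataStage code; the function and field names are illustrative) of why both Join inputs must be hash-partitioned on the same key, so that matching rows land in the same partition:

```python
# Sketch (not DataStage): hash-partitioning both Join inputs on the SAME
# key sends matching rows to the same partition, so a per-partition join
# can find every match. Partitioning the two inputs differently breaks this.
def hash_partition(rows, key, n_parts=4):
    parts = [[] for _ in range(n_parts)]
    for r in rows:
        parts[hash(r[key]) % n_parts].append(r)
    return parts

left = hash_partition([{"id": i} for i in range(8)], "id")
right = hash_partition([{"id": i, "addr": "addr%d" % i} for i in range(8)], "id")

# Every id in a left partition has its match in the same right partition:
for lp, rp in zip(left, right):
    assert {r["id"] for r in lp} == {r["id"] for r in rp}
```

(The Join stage additionally requires each input to be sorted on the join keys within each partition, which is what the Sort question above is about.)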
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

miwinter wrote:Sure, whatever you are joining to from your "source of 80,000 records" is making multiple joins to the other side of the join, at least for some of the records (i.e. a one-to-many relationship).
Thanks for your information.

But my requirement is not to get multiple values for a single record.

Could you please suggest how I can resolve this?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).

Imagine the left input has 80,000 records and the right input has two, with the same reference key. An inner join would be expected to return 160,000 values (records).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
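The fan-out Ray describes can be sketched outside DataStage. A minimal Python illustration (keys and field names are made up):

```python
# Sketch (not DataStage): a one-to-many join inflates the row count.
left = [{"key": k} for k in range(3)]                  # 3 "source" rows
right = [{"key": k, "addr": a}                         # 2 right rows per key
         for k in range(3) for a in ("home", "office")]

joined = [dict(l, **r) for l in left
          for r in right if l["key"] == r["key"]]
print(len(left), len(joined))  # 3 rows in, 6 rows out
```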
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

ray.wurlod wrote:Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).

Imagine the left input has 80,000 records and the right input has t ...
Ray,

With the same logic, when I do not use hash partitioning and leave the default as Auto, I get exactly the records that are present in the source.

In this case, instead of using partitioning, I am using two Sort stages to sort the data.

Please suggest what needs to be done.
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

ray.wurlod wrote:Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).

Imagine the left input has 80,000 records and the right input has t ...

Sorry, I forgot to give you some more information.

The left input is 80,000 records and the right input is 90,000 records, but my requirement is to match the 80,000 records against the 90,000 records and take values for the 80,000 records only. If multiple values match, I need to take the first one, so that there won't be any duplicates in the output.

In this scenario, what needs to be done? Please suggest.
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

ray.wurlod wrote:Stop fretting that you are getting multiple values for single records (which is the "more than 80,000 records" being manifested).

Imagine the left input has 80,000 records and the right input has t ...
I am very sorry for not giving full information all at once.

I have also noticed in the Join stage that if I set partitioning to Auto, then I do not get the correct description values even though the key columns match.
laknar
Participant
Posts: 162
Joined: Thu Apr 26, 2007 5:59 am
Location: Chennai

Post by laknar »

Sort the data using a Sort stage (on the join key columns) and derive the key change column.

Have a constraint (keyChange = 1, i.e. keep only the first row of each key group) in a Transformer before writing the dataset.

That way we can avoid the one-to-many relationship.
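This key-change approach can be emulated in plain Python to see what the Sort stage's key change column and the Transformer constraint do (field names are illustrative; keeping the first row per key means retaining rows where the key change column is 1):

```python
# Sketch (not DataStage): emulate Sort with a key change column plus a
# Transformer constraint that keeps only the first row per key group.
rows = [
    {"key": 1, "addr": "old"},
    {"key": 1, "addr": "new"},
    {"key": 2, "addr": "only"},
]
rows.sort(key=lambda r: r["key"])              # Sort stage on the join key

deduped, prev = [], object()                   # sentinel: no previous key
for r in rows:
    key_change = 1 if r["key"] != prev else 0  # 1 on first row of a key group
    if key_change == 1:                        # constraint keeps the first row
        deduped.append(r)
    prev = r["key"]

print(deduped)  # one row per key; the first occurrence wins
```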
jcthornton
Premium Member
Posts: 79
Joined: Thu Mar 22, 2007 4:58 pm
Location: USA

Post by jcthornton »

laknar wrote:Sort the data using a Sort stage (on the join key columns) and derive the key change column.

Have a constraint (keyChange = 1, i.e. keep only the first row of each key group) in a Transformer before writing the dataset.

That way we can avoid the one-to-many relationship.
The other option is to sort and remove duplicates on the lookup stream prior to performing the join. Just use the same key fields for all stages. The benefit here is removing records as early in the process as possible.

If you can reduce the 90k records to the 80k you need for the lookup (using an effective date or sequence number, for example) even earlier, the benefit will be greater. At sub-100k records the performance impact is not going to be massive; however, if this is a process where your record counts are going to grow, cutting volume as early as possible becomes more and more important.
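The idea of reducing the right-hand input before the join, keeping the newest row per key by an effective date, can be sketched like this (not DataStage code; the effective_date field name is hypothetical):

```python
# Sketch (not DataStage): reduce the right input before the join, keeping
# the newest row per key. ISO dates sort chronologically as strings.
rows = [
    {"key": 1, "effective_date": "2009-01-01", "addr": "old"},
    {"key": 1, "effective_date": "2009-06-01", "addr": "new"},
    {"key": 2, "effective_date": "2009-03-01", "addr": "only"},
]
rows.sort(key=lambda r: (r["key"], r["effective_date"]))

latest = {}
for r in rows:           # within a key, later (newer) rows overwrite earlier
    latest[r["key"]] = r
reduced = list(latest.values())
print(len(reduced))  # 2: one row per key, the newest of each
```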
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

The correct solution is to go back to the business and have them identify which is the correct address to use... you probably have business/home/postal addresses and some indicator to state which is which... this is a design issue, not a DataStage one.
monaz
Participant
Posts: 98
Joined: Sat Aug 23, 2008 3:14 am

Post by monaz »

jcthornton wrote:
laknar wrote:Sort the data using a Sort stage (on the join key columns) and derive the key change column.

Have a constraint (keyChange = 1, i.e. keep only the first row of each key group) in a Transformer before writing the dataset.

That way we can avoid the one-to-many relationship.
The other option is to sort and remove duplicates on the lookup stream prior to performing the join. Just use the same key fields for all stages. The benefit here is removing records as early in the process as possible.

If you can reduce the 90k records to the 80k you need for the lookup (using an effective date or sequence number for example) even earlier, the benefit will be better. At sub 100k records, the performance impact is not going to be massive. However, if this is a process where your record counts are going to grow, cutting volume as early as possible gets to be more and more important.
Thanks for your quick response.

I am already removing the duplicate records from the lookup dataset and then performing the join. The problem is that, though there are matching records, the join gives a Null value. Is it because of how the partitioning is happening?
datisaq
Participant
Posts: 154
Joined: Wed May 14, 2008 4:34 am

Post by datisaq »

I think you've set it as a "left outer join". The source records which are not matched will get propagated to the output, and the target attributes which you're taking from the reference will be null.

If you want only exact record matches, change it to an inner join.
IBM Certified - Information Server 8.1
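The difference between left outer and inner join behaviour for unmatched keys, sketched in Python (not DataStage code; field names are made up):

```python
# Sketch (not DataStage): left outer vs inner join for unmatched keys.
left = [{"key": 1}, {"key": 2}]
right = {1: "desc-one"}                               # key 2 has no match

left_outer = [dict(l, desc=right.get(l["key"])) for l in left]
inner = [dict(l, desc=right[l["key"]]) for l in left if l["key"] in right]

print(left_outer)  # key 2 still present, desc=None (NULL)
print(inner)       # unmatched key 2 dropped entirely
```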