Help on JOIN stage

kumarjit · Post by **kumarjit** » Fri Feb 08, 2013 5:24 am

Let's us consider that I need to implement the following database logic in Datastage . How would I go for it ?

The requirement is to JOIN two data sources based on TWO KEY COLUMNS

soemthing like :
select A.COL1 , A.COL2 , B.COL1, B.COL2
from
table1 A
JOIN table2 B
on A.COL3=B.COL3 and A.COL4=B.COl4 ;

Question 1. Do I need to specify TWO DIFFERENT KEY's in the JOIN STAGE , one for COL3 and the other for COL4 ? Is that possible ?

Question 2 : If multiple key's are specified within a SINGLE Join stage , does the inputs needs to be HASH PARTITIONED on all the key columns ?
If yes , will that not create erroneous results , because so far as I know , for a single stage , partitioning could be done on ONLY ONE COLUMN of the INPUT LINK.
If multiple keys are specified as partitioning keys , results may be incorrect.

My workaround:

Create a SINGLE join key field by using COL3|COL4 and partitioning the inputs based on this surrogate join column . The results remian same as the database
join .

Thanks
Kumarjit.

prasson_ibm · Post by **prasson_ibm** » Fri Feb 08, 2013 6:00 am

Hi Kumar,

Answer1:-Yes if you are joining on Two columns you need to specify both the columns into a join stage.

Answer 2:- It depends upon your data.If you define Hash partition on only one key column and it has few groupings,may be all your data will land up to one or two nodes and other nodes will be useless as a result performance will be effected.We should always try to split the records to all the partitions.

kumarjit · Post by **kumarjit** » Fri Feb 08, 2013 7:18 am

THE ONE thing I forgot to mention in my post,
If the fields on which join is performed is Nullable, and if they actually contain NULL values , will the output of the database join match the output of the Datastage JOIN stage ( using multiple join keys ) ?

I often faced this problem , that if the fields on which the data sources are joined contain NULL values, the database query output does not match with the datastage's output.

Can anyone please guide me on this ?

Thanks
Kumarjit.

prasson_ibm · Post by **prasson_ibm** » Fri Feb 08, 2013 8:26 am

Hi Kumar,
Conceptually it should match,since once you apply HASH partition on the input of key columns,it generates hash value and it will make sure that it lands up to same partition.With partition you should sort the inputs with null first.In case all joining columns are null,it is pointless to perform join.

zulfi123786 · Post by **zulfi123786** » Fri Feb 08, 2013 8:37 am

There could be differences when the output columns are datatyped differently for instance a left outer joined column from left link is supposed to be NULLABLE YES and if its defined as NULLABLE NO the join operator defaults the values per the datatype.