Help on JOIN stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Help on JOIN stage

Post by kumarjit »

Let's us consider that I need to implement the following database logic in Datastage . How would I go for it ?

The requirement is to JOIN two data sources based on TWO KEY COLUMNS

soemthing like :
select A.COL1 , A.COL2 , B.COL1, B.COL2
from
table1 A
JOIN table2 B
on A.COL3=B.COL3 and A.COL4=B.COl4 ;

Question 1. Do I need to specify TWO DIFFERENT KEY's in the JOIN STAGE , one for COL3 and the other for COL4 ? Is that possible ?

Question 2 : If multiple key's are specified within a SINGLE Join stage , does the inputs needs to be HASH PARTITIONED on all the key columns ?
If yes , will that not create erroneous results , because so far as I know , for a single stage , partitioning could be done on ONLY ONE COLUMN of the INPUT LINK.
If multiple keys are specified as partitioning keys , results may be incorrect.


My workaround:

Create a SINGLE join key field by using COL3|COL4 and partitioning the inputs based on this surrogate join column . The results remian same as the database
join .



Thanks
Kumarjit.
Pain is the best teacher, but very few attend his class..
prasson_ibm
Premium Member
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Post by prasson_ibm »

Hi Kumar,


Answer1:-Yes if you are joining on Two columns you need to specify both the columns into a join stage.

Answer 2:- It depends upon your data.If you define Hash partition on only one key column and it has few groupings,may be all your data will land up to one or two nodes and other nodes will be useless as a result performance will be effected.We should always try to split the records to all the partitions.
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Post by kumarjit »

THE ONE thing I forgot to mention in my post,
If the fields on which join is performed is Nullable, and if they actually contain NULL values , will the output of the database join match the output of the Datastage JOIN stage ( using multiple join keys ) ?

I often faced this problem , that if the fields on which the data sources are joined contain NULL values, the database query output does not match with the datastage's output.

Can anyone please guide me on this ?

Thanks
Kumarjit.
Pain is the best teacher, but very few attend his class..
prasson_ibm
Premium Member
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Post by prasson_ibm »

Hi Kumar,
Conceptually it should match,since once you apply HASH partition on the input of key columns,it generates hash value and it will make sure that it lands up to same partition.With partition you should sort the inputs with null first.In case all joining columns are null,it is pointless to perform join.
zulfi123786
Premium Member
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

There could be differences when the output columns are datatyped differently for instance a left outer joined column from left link is supposed to be NULLABLE YES and if its defined as NULLABLE NO the join operator defaults the values per the datatype.
- Zulfi
Post Reply