join datasets

PeterPol · Post by **PeterPol** » Thu Nov 23, 2006 8:34 am

Hi all,

I have 2 datasets that I want to combine as follows:

Dataset 1:
col1
1
2

Dataset 2:
col2
1
2
3

result set:
col1 col2
1 1
1 2
1 3
2 1
2 2
2 3

I think the join/lookup stage is not suitable because you need to specify which keys in both sets you want to link.
How to do this?

ray.wurlod · Post by **ray.wurlod** » Thu Nov 23, 2006 10:24 am

You can use a Join stage, specify a full outer join. You may need to rename some columns so that the "key" column names match on the left and right inputs - use upstream Copy or Modify stage(s) for this.

tejaswini · Post by **tejaswini** » Thu Nov 23, 2006 11:45 pm

It is possible in a 'join' stage.
For the first dataset you add one more dummy column with a dummy value say char(1) with value 'A' using 'column generator'. this new dummy column should have the same value for all the records from the first dataset.
Now repeat the same thing for the other dataset. For the second dataset also, the dummy column name and the value should be same as that of the first dataset.
You have now got the common field to perform 'join'. in 'join' do an 'inner' join by the dummy column. This join should give you the cartesian product of the two input datasets. And I suppose, that is what you need also. Try this and let me know.

PeterPol · Post by **PeterPol** » Fri Nov 24, 2006 2:02 am

hi all,

thanks for your advices. i create a testjob with a dataset containing 110 records and a dataset containing 5 records. I joined both datasets on a dummycolumn (both filled with 1) that i added to each dataset using transformers. The result: 550 records !
However, when I implemented the same solution to my job, the join only generated a cartesian output by taking 2 instead of 5 records of dataset2. So, instead of joining 4000 * 5 records, only 8000 records were actually joined??
The log gives me no clue why the last 3 records of dataset 2 were not taken...

Any explanation for this?
There was no difference between inner join and full outer join

Peter

tejaswini · Post by **tejaswini** » Fri Nov 24, 2006 3:10 am

Are you mentioning any specific partitioning type in or before join?
If so, what partition and on which keys?

Nageshsunkoji · Post by **Nageshsunkoji** » Fri Nov 24, 2006 3:14 am

Hi Peter,

Rather than creating dummy column and perofrm inner join. You, can do one thing, which is already suggested by ray. Rename the column name that should be same in the both the datastes. Use copy stage to rename the columns. For example, you have a key with the name 'Amount' in one dataset and 'Amount_1' is the column name in the other dataset. rename the column name of second dataset as 'Amount'. But, remember that datatype should be the same in both columns and perform Full outer join and you will get the expected result. For this approach you will get duplicates also, perform Remove duplicates to eliminate those.
Let us know your output.

tejaswini · Post by **tejaswini** » Fri Nov 24, 2006 3:52 am

If you do a 'full outer join' if there is same value in both sides, only then they will join else the other value will be zero.

the result of 'full outer join' will be like

1 1
2 2
3 0

Can you please explain how the full outer join gives the expected result?

PeterPol · Post by **PeterPol** » Fri Nov 24, 2006 4:05 am

hi all,

I changed the execution mode of all stages up to the join stage to sequential and ... got the right output!!

It's not clear why parallelism fails here. Maybe you guys have some ideas about this.

Thanks anyway.
Peter

ray.wurlod · Post by **ray.wurlod** » Fri Nov 24, 2006 12:57 pm

Usually because the input data sets are not identically partitioned and sorted on the join keys.