Page 1 of 1

join datasets

Posted: Thu Nov 23, 2006 8:34 am
by PeterPol
Hi all,

I have 2 datasets that I want to combine as follows:

Dataset 1:
col1
1
2

Dataset 2:
col2
1
2
3

result set:
col1 col2
1 1
1 2
1 3
2 1
2 2
2 3

I think the join/lookup stage is not suitable because you need to specify which keys in both sets you want to link.
How to do this?

Posted: Thu Nov 23, 2006 10:24 am
by ray.wurlod
You can use a Join stage, specify a full outer join. You may need to rename some columns so that the "key" column names match on the left and right inputs - use upstream Copy or Modify stage(s) for this.

Posted: Thu Nov 23, 2006 11:45 pm
by tejaswini
It is possible in a 'join' stage.
For the first dataset you add one more dummy column with a dummy value say char(1) with value 'A' using 'column generator'. this new dummy column should have the same value for all the records from the first dataset.
Now repeat the same thing for the other dataset. For the second dataset also, the dummy column name and the value should be same as that of the first dataset.
You have now got the common field to perform 'join'. in 'join' do an 'inner' join by the dummy column. This join should give you the cartesian product of the two input datasets. And I suppose, that is what you need also. Try this and let me know.

Posted: Fri Nov 24, 2006 2:02 am
by PeterPol
hi all,

thanks for your advices. i create a testjob with a dataset containing 110 records and a dataset containing 5 records. I joined both datasets on a dummycolumn (both filled with 1) that i added to each dataset using transformers. The result: 550 records !
However, when I implemented the same solution to my job, the join only generated a cartesian output by taking 2 instead of 5 records of dataset2. So, instead of joining 4000 * 5 records, only 8000 records were actually joined??
The log gives me no clue why the last 3 records of dataset 2 were not taken...

Any explanation for this?
There was no difference between inner join and full outer join

Peter

Posted: Fri Nov 24, 2006 3:10 am
by tejaswini
Are you mentioning any specific partitioning type in or before join?
If so, what partition and on which keys?

Posted: Fri Nov 24, 2006 3:14 am
by Nageshsunkoji
Hi Peter,

Rather than creating dummy column and perofrm inner join. You, can do one thing, which is already suggested by ray. Rename the column name that should be same in the both the datastes. Use copy stage to rename the columns. For example, you have a key with the name 'Amount' in one dataset and 'Amount_1' is the column name in the other dataset. rename the column name of second dataset as 'Amount'. But, remember that datatype should be the same in both columns and perform Full outer join and you will get the expected result. For this approach you will get duplicates also, perform Remove duplicates to eliminate those.
Let us know your output.

Posted: Fri Nov 24, 2006 3:52 am
by tejaswini
If you do a 'full outer join' if there is same value in both sides, only then they will join else the other value will be zero.

the result of 'full outer join' will be like

1 1
2 2
3 0

Can you please explain how the full outer join gives the expected result?

Posted: Fri Nov 24, 2006 4:05 am
by PeterPol
hi all,

I changed the execution mode of all stages up to the join stage to sequential and ... got the right output!!

It's not clear why parallelism fails here. Maybe you guys have some ideas about this.

Thanks anyway.
Peter

Posted: Fri Nov 24, 2006 12:57 pm
by ray.wurlod
Usually because the input data sets are not identically partitioned and sorted on the join keys.