join datasets

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
PeterPol
Premium Member
Premium Member
Posts: 73
Joined: Wed Mar 08, 2006 8:59 am

join datasets

Post by PeterPol »

Hi all,

I have 2 datasets that I want to combine as follows:

Dataset 1:
col1
1
2

Dataset 2:
col2
1
2
3

result set:
col1 col2
1 1
1 2
1 3
2 1
2 2
2 3

I think the join/lookup stage is not suitable because you need to specify which keys in both sets you want to link.
How to do this?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can use a Join stage, specify a full outer join. You may need to rename some columns so that the "key" column names match on the left and right inputs - use upstream Copy or Modify stage(s) for this.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

It is possible in a 'join' stage.
For the first dataset you add one more dummy column with a dummy value say char(1) with value 'A' using 'column generator'. this new dummy column should have the same value for all the records from the first dataset.
Now repeat the same thing for the other dataset. For the second dataset also, the dummy column name and the value should be same as that of the first dataset.
You have now got the common field to perform 'join'. in 'join' do an 'inner' join by the dummy column. This join should give you the cartesian product of the two input datasets. And I suppose, that is what you need also. Try this and let me know.
PeterPol
Premium Member
Premium Member
Posts: 73
Joined: Wed Mar 08, 2006 8:59 am

Post by PeterPol »

hi all,

thanks for your advices. i create a testjob with a dataset containing 110 records and a dataset containing 5 records. I joined both datasets on a dummycolumn (both filled with 1) that i added to each dataset using transformers. The result: 550 records !
However, when I implemented the same solution to my job, the join only generated a cartesian output by taking 2 instead of 5 records of dataset2. So, instead of joining 4000 * 5 records, only 8000 records were actually joined??
The log gives me no clue why the last 3 records of dataset 2 were not taken...

Any explanation for this?
There was no difference between inner join and full outer join

Peter
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

Are you mentioning any specific partitioning type in or before join?
If so, what partition and on which keys?
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

Hi Peter,

Rather than creating dummy column and perofrm inner join. You, can do one thing, which is already suggested by ray. Rename the column name that should be same in the both the datastes. Use copy stage to rename the columns. For example, you have a key with the name 'Amount' in one dataset and 'Amount_1' is the column name in the other dataset. rename the column name of second dataset as 'Amount'. But, remember that datatype should be the same in both columns and perform Full outer join and you will get the expected result. For this approach you will get duplicates also, perform Remove duplicates to eliminate those.
Let us know your output.
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
tejaswini
Participant
Posts: 19
Joined: Thu Aug 26, 2004 5:40 am

Post by tejaswini »

If you do a 'full outer join' if there is same value in both sides, only then they will join else the other value will be zero.

the result of 'full outer join' will be like

1 1
2 2
3 0

Can you please explain how the full outer join gives the expected result?
PeterPol
Premium Member
Premium Member
Posts: 73
Joined: Wed Mar 08, 2006 8:59 am

Post by PeterPol »

hi all,

I changed the execution mode of all stages up to the join stage to sequential and ... got the right output!!

It's not clear why parallelism fails here. Maybe you guys have some ideas about this.

Thanks anyway.
Peter
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Usually because the input data sets are not identically partitioned and sorted on the join keys.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply