Lookup vs Join

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
shankar_ramanath
Premium Member
Premium Member
Posts: 67
Joined: Thu Aug 09, 2007 7:51 pm

Lookup vs Join

Post by shankar_ramanath »

I see comparisions of Lookup and Join in several discussions and in the manuals.

I think one of the fundamental differences is the inability of Join to provide multiple output streams (one output and one reject) whereas Lookup stage provides for this functionality.

Am I correct? I use Lookup stage currently in one of my jobs only because I need the ability to get two output streams. I could do much better with Join except for this requirement.

Many thanks in advance!
datisaq
Participant
Posts: 154
Joined: Wed May 14, 2008 4:34 am

Post by datisaq »

You can use the Outer join in Join stage and filter it out.
IBM Certified - Information Server 8.1
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There's no such thing as a reject from a join, in exactly the same way that there's no "reject" when you perform a join in SQL.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
datastage_learner
Participant
Posts: 15
Joined: Thu May 07, 2009 9:50 pm

Re: Lookup vs Join

Post by datastage_learner »

shankar_ramanath wrote:I see comparisions of Lookup and Join in several discussions and in the manuals.

I think one of the fundamental differences is the inability of Join to provide multiple output streams (one output and one reject) whereas Lookup stage provides for this functionality.
If I am getting you correctly, then there is no inability... but rather a behavior. Most of the Stages are specific in their behavior.
shankar_ramanath wrote:Am I correct? I use Lookup stage currently in one of my jobs only because I need the ability to get two output streams. I could do much better with Join except for this requirement.
To understand your thought process in much better way, pls. describe the activity that you are doing with either lookup or join. Also, there are different ways to address problems.

Joins and lookups can be used for varied input and reference data streams and their memory needs. Pls. describe the circumstances you are using join or lookup.
shankar_ramanath
Premium Member
Premium Member
Posts: 67
Joined: Thu Aug 09, 2007 7:51 pm

Re: Lookup vs Join

Post by shankar_ramanath »

Thanks to all!

[quote]
If I am getting you correctly, then there is no inability... but rather a behavior. Most of the Stages are specific in their behavior.
[/quote]

I am probably constrained in my thought process because I consider providing two output streams as an "ability". That said, I do understand that the stages have specific behavior. My question was more about any behavior that I did not decipher in Join stage. Ray's answer clarifies the question.

[quote]
To understand your thought process in much better way, pls. describe the activity that you are doing with either lookup or join. Also, there are different ways to address problems.

Joins and lookups can be used for varied input and reference data streams and their memory needs. Pls. describe the circumstances you are using join or lookup.[/quote]

Sure. I have data in the following format

Col1 Col2 Col3 Col4
ABC DEF GHI JKL
XYZ YXZ BED ABC

I need to match "ABC" in the first record with "ABC" in the second record and the difference between the two records is in Col2. Hence I create two output streams. The first output stream has a filter for Col2="DEF" and the second output stream has a filter for Col2="YXZ". The output of the first filter (Col2="DEF") is then passed as a reference to the lookup stage and the output of the second filter (Col2="YXZ") is passed as the primary link to the lookup stage. In the lookup stage, I check to see if Col1 of the first record matches Col4 of the second record. The records that DO NOT match are sent to the reject stream and I am interested in getting only those records that do not match.

Using a Lookup stage is sub-optimal because of the huge number of records that go to the reference link. The input feed has close to 40M records on average. I was thinking that Join would be better to use because of the memory usage. However, since I am in need of the data from the reject link, I could not use Join.

I will try to use Left Outer Join. May be that will come to the rescue. Thanks to datisaq. Please do suggest if there are other options.

Many Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That is a particularly difficult requirement. Most joins know in advance which columns are to be compared. Your requirement, if I understand correctly, is to find a match arbitrarily in any column. Neither a Join stage nor a Lookup stage can help you here, unless you perform every possible join/lookup (one for each pair of columns). Yuk.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply