Difference Between join, Lookup and merge

anilkumark802 · Post by **anilkumark802** » Fri Nov 26, 2010 9:54 am

Hi I have some problem with join,lookup and merge stages. I am giving my requirement below.

I have two seq files each of which are having one column with name ROWNUM, and in both the file data is 3 records like below
100
100
100

if i use join stage to join these two files and performing innerjoin and left outerjoin, for both operations it is giving out put 9 records like.
100
100
100
100
100
100
100
100
100

if i use Lookup stage and given options as drop, continue. for both the operations it is giving out put 3 records.
100
100
100

and if use merge stage and given option as drop then it is giving 3 records as output like
100
100
100

and if i use option as keep then it is giving 5 records as output like.
100
100
100
100
100

I think all the 3 stages should produce same result why the 3 stages join, lookup and merge are giving different results.
Can any one help me please.

ArndW · Post by **ArndW** » Fri Nov 26, 2010 10:25 am

How many nodes are you running these jobs on, are they using the same configuration file and on which column are you partitioning your data?

chulett · Post by **chulett** » Fri Nov 26, 2010 11:04 am

Never mind the fact that, because of the nature of your data, the join is giving you a cartesian product.

ray.wurlod · Post by **ray.wurlod** » Fri Nov 26, 2010 2:16 pm

For starters, Merge consumes rows from its reference (update) input(s). The other two stage types don't. Your Keep rule iss giving variations because you end up with unmatched masters after all the reference rows have been consumed.

So what do you actually want the output to be? Both results you got are correct, because the specifications were different.

swapnilverma · Post by **swapnilverma** » Sat Nov 27, 2010 12:19 am

to simplify you are joining , merging , performing lookup with two identical files..... or two same files ...

if u use two different files you will see some more magic ..

if they produce same results then why we had three diff stages...
in some cases they may produce same results ..

Join -> will give you cartesian product( set of matching recs ). as ur file is identical its gives you 3*3 = 9 rows.

Merge-> reference record is consumed ( once a matching ip found )
typically used to populate master -child records ... and when you need non matching records from reference ( as it allows a reject link )

Lookup -> no cartesian product ... to populate a column based on matching keys

arunpramanik · Post by **arunpramanik** » Sun Nov 28, 2010 10:43 pm

Best thing will to add non key fields in both of your source, and get the output containing key and non-key fields from these three stage.

Hope that will make you understand how these three stage works

Umbix0562 · Post by **Umbix0562** » Tue Nov 30, 2010 6:16 am

anilkumark802 wrote:Hi I have some problem with join,lookup and merge stages. I am giving my requirement below.

I have two seq files each of which are having one column with name ROWNUM, and in both the file data is 3 records like below
100
100
100

if i use join stage to join these two files and performing innerjoin and left outerjoin, for both operations it is giving out put 9 records like.
100
100
100
100
100
100
100
100
100

if i use Lookup stage and given options as drop, continue. for both the operations it is giving out put 3 records.7
100
100
100

and if use merge stage and given option as drop then it is giving 3 records as output like
100
100
100

and if i use option as keep then it is giving 5 records as output like.
100
100
100
100
100

I think all the 3 stages should produce same result why the 3 stages join, lookup and merge are giving different results.
Can any one help me please.

-------------

Good morning my dear friend

I think that it's only a logical problem and I try to explain it by an example. We have two tables with tree rows in the first and three rows in the second one.

Table One Table Two
Column A Column B
100 100
100 100
100 100

I develop the following SQL query

SELECT COLUMN A FROM One, Two WHERE A = B

For each existing row in Table One the query macth three rows satisfating the predicate A=B. So for the first row in table one match three rows in table two. This is the reason explaining
an output with nine rows. The Join Stage works in the same way (this is true both for the inner and left join).

The lookup stage instead works differently. It reads n rows and produce exactly the same number of row in output (if they exist in the reference link).

About the tirth Stage, the Merge one, I don't know it very well but I suppose that in the first case it distincts (aggregates) the rows and not in the second.

I hope that my explanation are clear.

Have a nice day

Umberto