how to decide using lookup or joiner

Rameshgoldenhill · Post by **Rameshgoldenhill** » Sun Mar 14, 2010 12:29 pm

How can we decide whether to use a lookup or a joiner when we are loading a large amount of data? Thanks.

cdp · Post by **cdp** » Sun Mar 14, 2010 4:14 pm

Hi,

There are quite a few posts out there on this subject.

But I think it comes down to the size of the two input sources that you are trying to combine and what you are wanting to do with the output.

For instance, joins can only be across two inputs and there is no reject link, while lookups are really designed to read a reference source into memory, so the reference input should be smaller than the amount of memory available on the box. So basically, if you have two large inputs then a join would probably be better!
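To make the memory trade-off concrete, here is a minimal conceptual sketch in Python. It is not DataStage code: the function names `lookup` and `sorted_inner_join`, the row layout, and the sample data are all hypothetical, and the sketch assumes unique keys on the reference side and inputs already sorted on the key for the join.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def lookup(stream: Iterable[Row], reference: Iterable[Row], key: str) -> Iterator[Tuple[str, Row]]:
    """Lookup-style: hold the whole reference in memory, stream the main input past it."""
    # The entire reference input is materialised here, so it has to fit in memory.
    table = {row[key]: row for row in reference}
    for row in stream:
        match = table.get(row[key])
        if match is None:
            yield ("reject", row)  # unmatched rows can be routed elsewhere
        else:
            yield ("output", {**row, **match})


def sorted_inner_join(left: Iterable[Row], right: Iterable[Row], key: str) -> Iterator[Row]:
    """Join-style: walk two inputs that are ALREADY sorted on key, one row of each at a time."""
    left_iter, right_iter = iter(left), iter(right)
    l = next(left_iter, None)
    r = next(right_iter, None)
    while l is not None and r is not None:
        if l[key] < r[key]:
            l = next(left_iter, None)   # no partner on the right: drop (inner join)
        elif l[key] > r[key]:
            r = next(right_iter, None)
        else:
            yield {**l, **r}
            l = next(left_iter, None)   # keep r: right keys are assumed unique


# Hypothetical sample data, sorted on "cust".
orders: List[Row] = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}, {"cust": 4, "amt": 40}]
customers: List[Row] = [{"cust": 1, "name": "A"}, {"cust": 2, "name": "B"}, {"cust": 3, "name": "C"}]

lookup_result = list(lookup(orders, customers, "cust"))
join_result = list(sorted_inner_join(orders, customers, "cust"))
```

The lookup needs memory proportional to the reference input, while the sorted join only ever holds one row from each side, which is why two large inputs tend to favour a join.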

There is also the MERGE stage. Again, it all depends on what your expected output looks like!

ray.wurlod · Post by **ray.wurlod** » Sun Mar 14, 2010 4:16 pm

A joiner is someone who assembles wooden furniture, particularly cabinetry. Your choice is therefore clear.

Unless, of course, you are loading large amounts of wooden objects...

When's the interview?

ray.wurlod · Post by **ray.wurlod** » Sun Mar 14, 2010 4:19 pm

cdp wrote: For instance joins can only be across two inputs and there is no reject link, ...

This is not the case. A Join stage can have more than two inputs. In this case pairwise joins are created as intermediate results, the same way that databases do it. The "other" inputs are referred to as Intermediate. I prefer to use cascaded two-input joins to make it clearer what's happening to the next developer.
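As a rough illustration of the idea that a multi-input join amounts to a chain of pairwise joins, here is a small Python sketch. It is conceptual only and not how DataStage implements the Join stage: `inner_join`, `multi_join`, and the sample rows are hypothetical, and the pairwise step uses an in-memory index purely for brevity.

```python
from functools import reduce
from typing import Dict, List

Row = Dict[str, object]


def inner_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Pairwise inner join on key (in-memory index, used here only to keep the sketch short)."""
    index: Dict[object, List[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def multi_join(inputs: List[List[Row]], key: str) -> List[Row]:
    """Join N inputs by folding pairwise joins left to right: ((a JOIN b) JOIN c) ..."""
    return reduce(lambda acc, nxt: inner_join(acc, nxt, key), inputs)


# Hypothetical sample inputs.
a: List[Row] = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
b: List[Row] = [{"id": 1, "b": "p"}, {"id": 2, "b": "q"}]
c: List[Row] = [{"id": 2, "c": "z"}]

three_way = multi_join([a, b, c], "id")
cascaded = inner_join(inner_join(a, b, "id"), c, "id")
```

For an inner join, the single multi-input call and the explicitly cascaded two-input joins produce the same rows; the cascaded form just makes each intermediate result visible.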

John Smith · Post by **John Smith** » Sun Mar 14, 2010 4:52 pm

Loading large amounts of data and performing lookups/joins are distinct operations, meaning you can load large amounts of data with EITHER joins OR lookups. Doesn't matter.

cdp · Post by **cdp** » Sun Mar 14, 2010 6:29 pm

ray.wurlod wrote: This is not the case. A Join stage can have more than two inputs. In this case pairwise joins are created as intermediate results, the same way that databases do it. The "other" inputs are referred to as Intermediate. I prefer to use cascaded two-input joins to make it clearer what's happening to the next developer.

Do you know what, you are absolutely correct. Maybe I was confusing should with could, but I was always told not to. Sorry for the incorrect advice.

ray.wurlod · Post by **ray.wurlod** » Sun Mar 14, 2010 6:53 pm

Of course you may have been confusing should with wood.

(See my earliest post on this thread.)

chulett · Post by **chulett** » Sun Mar 14, 2010 9:15 pm

[groan]