Limitation of lookup

daisy
Participant
Posts: 9
Joined: Mon Sep 08, 2008 8:12 pm

Post by daisy »

The Lookup stage has a limitation on the data size it can handle. Someone told me that the problem can be solved by partitioning, so I have tried changing a Lookup stage input to use hash / round robin partitioning. But I find that the result is not the same. Why, and how can I solve it?

Thank you in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Both inputs identically partitioned using a key-based partitioning algorithm (hash or modulus) on the key column(s) used for the join.
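
In sketch form (a toy Python model of the idea, not anything DataStage executes), "key-based" means equal keys always map to the same node:

Code:
# Toy model of key-based partitioning (illustrative only, not DataStage).
NODES = 4

def hash_partition(key, nodes=NODES):
    # Hash partitioning: works for any key type; equal keys -> same node.
    return hash(str(key)) % nodes

def modulus_partition(key, nodes=NODES):
    # Modulus partitioning: integer keys only; same guarantee.
    return key % nodes

# Partitioning BOTH inputs with the same function on the same key column(s)
# guarantees a stream row and its reference row land on the same node.
assert hash_partition("CUST_042") == hash_partition("CUST_042")
assert modulus_partition(42) == modulus_partition(42)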
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
daisy
Participant
Posts: 9
Joined: Mon Sep 08, 2008 8:12 pm

Post by daisy »

ray.wurlod wrote:Both inputs identically partitioned using a key-based partitioning algorithm (hash or modulus) on the key column(s) used for the join. ...
I am not changing the Lookup to a Join. I have changed the input partitioning of a Lookup stage from Auto to Round Robin, but the result is different...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A Lookup stage still performs a join - by default a left outer join, but this can be an inner join if the Lookup Failed rule is set to "Drop". Had I meant a Join stage the word would have had a capital "J".

In that context please re-read my earlier post, particularly about using a key-based partitioning algorithm.

Auto gives you Entire on the reference input. In a multi-machine environment this is less efficient than partitioning the reference input using an algorithm based on the join (lookup) key. The stream input must be partitioned the same way for this to work properly.

Round Robin is just bad in this context.
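
As a rough sketch of the memory arithmetic behind this (the row and node counts below are assumed, purely for illustration):

Code:
# Assumed numbers, purely for illustration.
reference_rows = 40_000_000
nodes = 4

# Entire (what Auto gives the reference input):
# every node holds a full copy of the reference data.
rows_per_node_entire = reference_rows          # 40,000,000 rows per node

# Hash on the lookup key: the reference data is split across the nodes.
rows_per_node_hash = reference_rows // nodes   # 10,000,000 rows per node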
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
daisy
Participant
Posts: 9
Joined: Mon Sep 08, 2008 8:12 pm

Post by daisy »

ray.wurlod wrote:A Lookup stage still performs a join - by default a left outer join, but this can be an inner join if the Lookup Failed rule is set to "Drop". Had I meant a Join stage the word would have had a capit ...
The Lookup stage loads all the reference data into memory, so if the data size is very large it will fail. That won't happen with the Join stage, but Join has to perform a sort first. I want to avoid the sort, so I am working with the Lookup stage to see what I can do.
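
That trade-off can be sketched as follows (illustrative Python, not how DataStage is implemented; rows reduced to key/value pairs):

Code:
def lookup_style(stream, reference):
    # Lookup stage: build the ENTIRE reference in memory, then probe per row.
    table = dict(reference)                 # fails once the reference won't fit
    for key, val in stream:
        yield key, val, table.get(key)      # left outer join: None on no match

def join_style(stream, reference):
    # Join stage: sort both inputs first (the step daisy wants to avoid),
    # then merge them, holding only one reference row at a time.
    ref_iter = iter(sorted(reference))
    cur = next(ref_iter, None)
    for key, val in sorted(stream):
        while cur is not None and cur[0] < key:
            cur = next(ref_iter, None)
        yield key, val, (cur[1] if cur and cur[0] == key else None)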

I have tried changing a Lookup stage from Auto partitioning to Round Robin (it is the only change I made to the job), and the final record count changed. Something about the partitioning must be affecting the result...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Do get yourself a premium membership. I discuss this fully in my earlier posts.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

Naturally, because records partitioned across nodes via Round Robin may end up on "unmatched" (different) nodes when the lookup is performed. This means record "A" on node 2 will not find a hit in the lookup if lookup record "A" is on node 1, whereas if you key-partitioned both datasets, like records would be on the same nodes and the lookup would not fail. This is a fundamental concept of EE and partitioning. You need to do as Ray suggests.
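
A toy demonstration of exactly this effect (a two-node Python model with assumed data, not DataStage itself):

Code:
NODES = 2
stream    = ["A", "B", "C", "D"]
reference = ["D", "A", "B", "C"]   # same keys, different arrival order

def matches(partition):
    s_nodes = [[] for _ in range(NODES)]
    r_nodes = [[] for _ in range(NODES)]
    for i, key in enumerate(stream):
        s_nodes[partition(i, key)].append(key)
    for i, key in enumerate(reference):
        r_nodes[partition(i, key)].append(key)
    # A stream row only finds its reference row on its OWN node.
    return sum(key in r_nodes[n] for n in range(NODES) for key in s_nodes[n])

round_robin = lambda i, key: i % NODES           # arrival position decides the node
keyed       = lambda i, key: hash(key) % NODES   # the key decides the node

print(matches(round_robin))   # 0 with this data: every pair was split up
print(matches(keyed))         # 4: equal keys always land together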