Limitation of lookup

daisy
Participant
Posts: 9
Joined: Mon Sep 08, 2008 8:12 pm

Post by daisy »

The Lookup stage has a limitation on the data size it can handle. Someone told me that the problem can be solved by partitioning, so I have tried changing a Lookup stage input to use hash / round robin partitioning. But I find that the result is not the same. Why, and how can I solve it?

Thank you in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Both inputs identically partitioned using a key-based partitioning algorithm (hash or modulus) on the key column(s) used for the join.
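
In sketch form (a toy Python model of the idea, not anything DataStage executes), "key-based" means equal keys always map to the same node:

Code:
# Toy model of key-based partitioning (illustrative only, not DataStage).
NODES = 4

def hash_partition(key, nodes=NODES):
    # Hash partitioning: works for any key type; equal keys -> same node.
    return hash(str(key)) % nodes

def modulus_partition(key, nodes=NODES):
    # Modulus partitioning: integer keys only; same guarantee.
    return key % nodes

# Partitioning BOTH inputs with the same function on the same key column(s)
# guarantees a stream row and its reference row land on the same node.
assert hash_partition("CUST_042") == hash_partition("CUST_042")
assert modulus_partition(42) == modulus_partition(42)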
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
daisy
Participant
Posts: 9
Joined: Mon Sep 08, 2008 8:12 pm

Post by daisy »

ray.wurlod wrote:Both inputs identically partitioned using a key-based partitioning algorithm (hash or modulus) on the key column(s) used for the join. ...
I am not changing the Lookup to a Join. I have changed the input partitioning of a Lookup stage from Auto to Round Robin, but the result is different...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A Lookup stage still performs a join - by default a left outer join, but this can be an inner join if the Lookup Failed rule is set to "Drop". Had I meant a Join stage the word would have had a capital "J".

In that context please re-read my earlier post, particularly about using a key-based partitioning algorithm.

Auto gives you Entire on the reference input. In a multi-machine environment this is less efficient than partitioning the reference input using an algorithm based on the join (lookup) key. The stream input must be partitioned the same way for this to work properly.

Round Robin is just bad in this context.
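
As a rough sketch of the memory arithmetic behind this (the row and node counts below are assumed, purely for illustration):

Code:
# Assumed numbers, purely for illustration.
reference_rows = 40_000_000
nodes = 4

# Entire (what Auto gives the reference input):
# every node holds a full copy of the reference data.
rows_per_node_entire = reference_rows          # 40,000,000 rows per node

# Hash on the lookup key: the reference data is split across the nodes.
rows_per_node_hash = reference_rows // nodes   # 10,000,000 rows per node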
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
daisy
Participant
Posts: 9
Joined: Mon Sep 08, 2008 8:12 pm

Post by daisy »

ray.wurlod wrote:A Lookup stage still performs a join - by default a left outer join, but this can be an inner join if the Lookup Failed rule is set to "Drop". Had I meant a Join stage the word would have had a capit ...
The Lookup stage loads all the reference data into memory, so if the data size is very large it will fail. That won't happen with the Join stage, but Join has to perform a sort first. I want to avoid the sort, so I am working with the Lookup stage to see what I can do.
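
That trade-off can be sketched as follows (illustrative Python, not how DataStage is implemented; rows reduced to key/value pairs):

Code:
def lookup_style(stream, reference):
    # Lookup stage: build the ENTIRE reference in memory, then probe per row.
    table = dict(reference)                 # fails once the reference won't fit
    for key, val in stream:
        yield key, val, table.get(key)      # left outer join: None on no match

def join_style(stream, reference):
    # Join stage: sort both inputs first (the step daisy wants to avoid),
    # then merge them, holding only one reference row at a time.
    ref_iter = iter(sorted(reference))
    cur = next(ref_iter, None)
    for key, val in sorted(stream):
        while cur is not None and cur[0] < key:
            cur = next(ref_iter, None)
        yield key, val, (cur[1] if cur and cur[0] == key else None)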

I have tried changing a Lookup stage from Auto partitioning to Round Robin (it is the only change I made to the job), and the final record count changed. Something about the partitioning must be affecting the result...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Do get yourself a premium membership. I discuss this fully in my earlier posts.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

Naturally, because records partitioned across nodes via Round Robin may end up on "unmatched" (different) nodes when the lookup is performed. This means record "A" on node 2 will not find a hit in the lookup if lookup record "A" is on node 1, whereas if you key-partitioned both datasets, like records would be on the same nodes and the lookup would not fail. This is a fundamental concept of EE and partitioning. You need to do as Ray suggests.
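
A toy demonstration of exactly this effect (a two-node Python model with assumed data, not DataStage itself):

Code:
NODES = 2
stream    = ["A", "B", "C", "D"]
reference = ["D", "A", "B", "C"]   # same keys, different arrival order

def matches(partition):
    s_nodes = [[] for _ in range(NODES)]
    r_nodes = [[] for _ in range(NODES)]
    for i, key in enumerate(stream):
        s_nodes[partition(i, key)].append(key)
    for i, key in enumerate(reference):
        r_nodes[partition(i, key)].append(key)
    # A stream row only finds its reference row on its OWN node.
    return sum(key in r_nodes[n] for n in range(NODES) for key in s_nodes[n])

round_robin = lambda i, key: i % NODES           # arrival position decides the node
keyed       = lambda i, key: hash(key) % NODES   # the key decides the node

print(matches(round_robin))   # 0 with this data: every pair was split up
print(matches(keyed))         # 4: equal keys always land together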