sparse lookup vs Join

mystuff · Post by **mystuff** » Wed Aug 05, 2009 10:48 am

Hi,

I have been always reading that if secondary link (i.e. the data being referenced is huge), then go for Join rather than Lookup because of the heavy usage of memory.

But what factor determines in choosing between sparse lookup and Join. As sparse lookup does not require heavy usage of memory.

Thanks

ray.wurlod · Post by **ray.wurlod** » Wed Aug 05, 2009 6:27 pm

Sparse lookup is substantially slower than normal lookup (the latter uses an in-memory virtual Data Set).

mystuff · Post by **mystuff** » Thu Aug 06, 2009 7:45 pm

ray.wurlod wrote:Sparse lookup is substantially slower than normal lookup (the latter uses an in-memory virtual Data Set). ...

How can we decide between Sparse Lookup and Join. As both are preferred when normal Lookup is not suitable for the task i.e. heavy on memory.

krishna81 · Post by **krishna81** » Fri Aug 07, 2009 12:13 am

Sparse lookups should be used in the case of the input data being significantly lesser than the reference data. This will fire a query to the database for every input record and hence would be performance issue for increased number of input records.

The Join stage uses hardly any memory at all. It takes in one row from its left input, and only those rows from the right input for which the join keys match.
if you have large data in look up table it is better to use join instead of look up stage.
One thing a join will do that a lookup wont do is insist both the input and lookup streams are sorted. If your lookup volume is low (say less than 10 million rows) it may pay to use a lookup instead to remove this sort requirement. This is easier than pushing 3 billion rows into sort scratch space.

ray.wurlod · Post by **ray.wurlod** » Fri Aug 07, 2009 12:21 am

You could push the sort out into the database server and use a Sort stage to assert that the data are already sorted. This can take advantage of any index on the sort key column(s).