Selection of Lookup and Join

Posted: Wed Mar 17, 2010 5:07 am
by ReachKumar
Hi,

Performance-wise, which is the better choice between the Join and Lookup stages in DataStage parallel jobs, and why?

Can someone explain in which scenarios we should use Join and in which we should use Lookup?

Re: Selection of Lookup and Join

Posted: Wed Mar 17, 2010 5:23 am
by surajkumar
In all cases the deciding factor is the size of the reference datasets. A
lookup stage holds its reference data in memory. If that data is large
relative to the physical RAM of the machine you are running on, the
lookup stage may thrash, because the reference datasets may not fit in
RAM along with everything else that has to be there. Performance then
becomes very slow, since each lookup operation can, and typically does,
cause a page fault and an I/O operation.
So, if the reference datasets are big enough to cause that kind of trouble,
use a join. A join first does a high-speed sort on the driving and reference
datasets. The sort can involve I/O if the data is big enough, but that I/O is
highly optimized and sequential. Once the sort is over, the join itself is a
single sequential pass over both sorted inputs, so it is very fast and avoids
the random paging that hurts a lookup. If the reference data fits comfortably
in memory, a lookup is fine and skips the cost of the sort.
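
To make the difference concrete, here is a rough sketch of the two strategies in plain Python. This is not DataStage code and not how the stages are implemented internally; it only illustrates the access patterns described above. The function names (lookup_join, sort_merge_join), the list-of-dicts row format, and the inner-join behaviour are all assumptions made for the illustration.

```python
def lookup_join(driving, reference, key):
    """Lookup-style: load the WHOLE reference set into an in-memory table."""
    table = {}
    for row in reference:
        table.setdefault(row[key], []).append(row)
    out = []
    for row in driving:
        # Random access into the table; if the table does not fit in RAM,
        # each probe like this one can turn into a page fault.
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out


def sort_merge_join(driving, reference, key):
    """Join-style: sort both inputs, then merge them in one sequential pass."""
    left = sorted(driving, key=lambda r: r[key])
    right = sorted(reference, key=lambda r: r[key])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of reference rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every driving row with that key against the run.
            while i < len(left) and left[i][key] == lk:
                for match in right[j:j_end]:
                    out.append({**match, **left[i]})
                i += 1
            j = j_end
    return out


driving = [{"id": 2, "amt": 10}, {"id": 1, "amt": 5}, {"id": 3, "amt": 7}]
reference = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

by_lookup = lookup_join(driving, reference, "id")
by_merge = sort_merge_join(driving, reference, "id")
```

Both functions return the same matched rows (only the output order differs). The lookup version needs the entire reference table resident in memory at once, while the merge version only ever looks at the current position in each sorted input, which is why it copes better when the reference data is large.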

Posted: Wed Mar 17, 2010 6:04 am
by ReachKumar
Informative.. Thanks Suraj..