sparse lookup vs Join

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
mystuff
Premium Member
Premium Member
Posts: 200
Joined: Wed Apr 11, 2007 2:06 pm

sparse lookup vs Join

Post by mystuff »

Hi,

I have been always reading that if secondary link (i.e. the data being referenced is huge), then go for Join rather than Lookup because of the heavy usage of memory.

But what factor determines in choosing between sparse lookup and Join. As sparse lookup does not require heavy usage of memory.

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Sparse lookup is substantially slower than normal lookup (the latter uses an in-memory virtual Data Set).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mystuff
Premium Member
Premium Member
Posts: 200
Joined: Wed Apr 11, 2007 2:06 pm

Post by mystuff »

ray.wurlod wrote:Sparse lookup is substantially slower than normal lookup (the latter uses an in-memory virtual Data Set). ...
How can we decide between Sparse Lookup and Join. As both are preferred when normal Lookup is not suitable for the task i.e. heavy on memory.
krishna81
Premium Member
Premium Member
Posts: 78
Joined: Tue May 16, 2006 8:01 am
Location: USA

Post by krishna81 »

Sparse lookups should be used in the case of the input data being significantly lesser than the reference data. This will fire a query to the database for every input record and hence would be performance issue for increased number of input records.


The Join stage uses hardly any memory at all. It takes in one row from its left input, and only those rows from the right input for which the join keys match.
if you have large data in look up table it is better to use join instead of look up stage.
One thing a join will do that a lookup wont do is insist both the input and lookup streams are sorted. If your lookup volume is low (say less than 10 million rows) it may pay to use a lookup instead to remove this sort requirement. This is easier than pushing 3 billion rows into sort scratch space.
Datastage User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You could push the sort out into the database server and use a Sort stage to assert that the data are already sorted. This can take advantage of any index on the sort key column(s).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply