Page 1 of 1

LOOKUP HELP

Posted: Fri Feb 04, 2011 1:54 pm
by kumar444
I have 26 million rows in oracle refrence table and around 2 lakh rows in a source (in my case ouput from a join). I need to implement a left outer join with driving table as a source(in my case ouput from a join). Is it better to go for a join stage with left outer condition or a lookup stage? If lookup stage it should be a sparse lookup or normal lookup?

Appreciate any help to sovle this issue.

Posted: Fri Feb 04, 2011 2:35 pm
by ThilSe
I would suggest you to go thorugh the DS manual for these stages.

To answer your question Lookup stage is used only when reference data is small (less than 20000 records in my dictionary) and can fit in memory.

Thanks,
Senthil

Posted: Fri Feb 04, 2011 2:45 pm
by kumar444
Thanks Senthil. But how abt sparse lookup?
Is it possible to make reference table as a source and vice versa and do a lookup.

Posted: Fri Feb 04, 2011 2:49 pm
by mavrick21
kumar444,

I would use sparse lookup.

Posted: Fri Feb 04, 2011 4:55 pm
by ray.wurlod
Join is likely the better approach. Sparse lookup is slow, particularly when the database server is not on the same machine as the DataStage server.

Posted: Fri Feb 04, 2011 5:30 pm
by mavrick21
Ray,

The reference table has 26 million rows. Based on your reply I suggested Sparse lookup

Posted: Fri Feb 04, 2011 5:35 pm
by kumar444
Thanks Guys.

Posted: Fri Feb 04, 2011 11:47 pm
by jwiles
200000 sparse lookups (2 lakh) is quite a lot. If you follow that route, you should ensure that the reference table is properly indexed and that statistics are kept up to date.

What would the size of one of your lookup records be (average size of it contains varchars)? Size of key columns + size of data columns. You may be able to support a standard lookup instead of a sparse and possibly see better overall performance. Only way to know for certain is to try...it'll basically be down to the time it takes to extract the table compared to performing 200000 sparse lookups.

Regards,

Posted: Sat Feb 05, 2011 12:24 am
by ray.wurlod
Join (preceded by "don't sort, already sorted" Sort stage and relying on sorted data from the database only grabs one key value at a time from each link. Indeed, that's why it mandates sorted inputs; it would be horribly memory hungry otherwise, needing to load the entire right input into memory and probably build an index on the join keys.

Posted: Sat Feb 05, 2011 12:02 pm
by kumar444
I have tested this case with join(left outer) and with lookup stage
(condition not met=continue,lookup failure=continue) .
In both of these cases the resulting records is not equal.

Suppose if have 1000 records in driving table and 2000 records in the right table. Join with left outer condition returns more than 1000 records (this is possible) but with lookup its returning only 1000 records. Do we have to change any lookup property to get desired result as in the case of left outer join.

Posted: Sat Feb 05, 2011 12:18 pm
by mavrick21
Yes. Sorry I don't have access to DataStage Parallel jobs but here is what I remember - It's in the link properties where you can specify if lookup fails.

Posted: Sat Feb 05, 2011 1:45 pm
by ray.wurlod
Lookup with Continue will only return one row from the reference input unless you enable "multiple rows returned" on that input. This can only be enabled on one reference input per Lookup stage.

Posted: Thu Feb 10, 2011 2:23 pm
by kumar444
Thanks Ray.

Posted: Thu Feb 10, 2011 4:01 pm
by ThilSe
Hi Kumar,

Can you please let us know if you found any performance differences between the lookup and join solutions?

Thanks,
Senthil

Posted: Thu Feb 10, 2011 5:12 pm
by kumar444
Senthil,
In my case join won the race. Lookup consumed more time then join.
As per the volume of data mentioned above in my topic, join proved to be efficient.