Query on Sparse lookup

dsscholar · Post by **dsscholar** » Tue Jun 14, 2011 1:01 pm

a)
My input link is having 100 million records and reference link is having 50 millions records.So i thought to use sparse lookup as the data is very large and in-memory lookup will result in slow performance.All the key columns are indexed. I chose sparse lookup in the ref stage. And in the lookup stage i couldnt match the key columns from both the links. Just i am able to drag columns to output link.

inputlink.keycolumn1 = reflink.keycolumn1 ( unable to do)

We normally drag the column from input link to be matched with key column in ref link. but i am unable to do it. How does the matching happens in sparse lookup then?

b) For the volume of records mentioned above for i/p and ref,if i go for join or merge stage, which will be better in performance

Join do sort on both the links. So merge stage is better? Please clarify on this.

Thanks in Advance

jwiles · Post by **jwiles** » Tue Jun 14, 2011 1:12 pm

A sparse lookup for 100 million records will be very slow. Usually you will use a sparse lookup when the input volume is much, much SMALLER than the reference volume.

A join or merge is recommended for these volumes. Your key column names must match on both input links, and your data should be partitioned and sorted prior to the join or merge.

Performance for merge and join will typically be roughly equal. Read the documentation for both stages to understand the difference in operation, as well as searching the forum here for discussion on both.

Regards,

singhald · Post by **singhald** » Tue Jun 14, 2011 11:59 pm

in sparse lookup , you dont need to drag a key from input to reference link, just you need to have same cloumn name in both input links .

dsscholar · Post by **dsscholar** » Wed Jun 15, 2011 12:03 pm

Thanks Singhald.

James,

Join stage itself will sort both the input and reference links and hash partitioning will be done when we give auto itself. As sorting is done on both the tables, it will take much time. So merge stage will be better in this case i think. What say you ?

Thanks in advance

jwiles · Post by **jwiles** » Wed Jun 15, 2011 2:57 pm

What say I? Merge has the same requirement for sort and partitioning as does Join. Lookup is the only one which does not require a sort.

FYI...Join is not performing the sort and partition. The engine is inserting a sort operator and partition operator when the job is run to meet the requirements of your join (or merge), so long as Sort Insertion Partition Insertion have not been disabled.

You have a large amount of data, it is going to take time to process it no matter what you do. You can work to optimize sorting by adjusting memory usage to larger values (default is 20MB, you can try 200 instead) either within the sort stage or an environment variable (APT_TSORT_STRESS_BLOCKSIZE IIRC).

Regarding your reference data and lookup: Why do you think an in-memory lookup will "result in slow performance"? If your 50 million rows are small enough, they should fit in memory. The time will be spent loading the data (which will happen no matter which join method you are using).

Regards,

dsscholar · Post by **dsscholar** » Fri Jun 17, 2011 12:52 am

How to increase the sorting memory to 200MB. You mean the buffer size shown in advanced tab? And another question, is it possible to get max(date column) in aggregator stage. I have issue with that.

priyadarshikunal · Post by **priyadarshikunal** » Fri Jun 17, 2011 8:06 am

Check the option's sub property i.e. restrict sort memory or something like that. the other option suggested is changing the environment variable.

The second question should have been in a seperate post. However to give a start, use preserve type option in column for calculation sub-property.