Query on Sparse lookup

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
dsscholar
Premium Member
Premium Member
Posts: 195
Joined: Thu Oct 19, 2006 2:45 pm

Query on Sparse lookup

Post by dsscholar »

a)
My input link is having 100 million records and reference link is having 50 millions records.So i thought to use sparse lookup as the data is very large and in-memory lookup will result in slow performance.All the key columns are indexed. I chose sparse lookup in the ref stage. And in the lookup stage i couldnt match the key columns from both the links. Just i am able to drag columns to output link.

inputlink.keycolumn1 = reflink.keycolumn1 ( unable to do)

We normally drag the column from input link to be matched with key column in ref link. but i am unable to do it. How does the matching happens in sparse lookup then?

b) For the volume of records mentioned above for i/p and ref,if i go for join or merge stage, which will be better in performance

Join do sort on both the links. So merge stage is better? Please clarify on this.

Thanks in Advance
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

A sparse lookup for 100 million records will be very slow. Usually you will use a sparse lookup when the input volume is much, much SMALLER than the reference volume.

A join or merge is recommended for these volumes. Your key column names must match on both input links, and your data should be partitioned and sorted prior to the join or merge.

Performance for merge and join will typically be roughly equal. Read the documentation for both stages to understand the difference in operation, as well as searching the forum here for discussion on both.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
singhald
Participant
Posts: 180
Joined: Tue Aug 23, 2005 2:50 am
Location: Bangalore
Contact:

Post by singhald »

in sparse lookup , you dont need to drag a key from input to reference link, just you need to have same cloumn name in both input links .
Regards,
Deepak Singhal
Everything is okay in the end. If it's not okay, then it's not the end.
dsscholar
Premium Member
Premium Member
Posts: 195
Joined: Thu Oct 19, 2006 2:45 pm

Post by dsscholar »

Thanks Singhald.

James,

Join stage itself will sort both the input and reference links and hash partitioning will be done when we give auto itself. As sorting is done on both the tables, it will take much time. So merge stage will be better in this case i think. What say you ?

Thanks in advance
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

What say I? Merge has the same requirement for sort and partitioning as does Join. Lookup is the only one which does not require a sort.

FYI...Join is not performing the sort and partition. The engine is inserting a sort operator and partition operator when the job is run to meet the requirements of your join (or merge), so long as Sort Insertion Partition Insertion have not been disabled.

You have a large amount of data, it is going to take time to process it no matter what you do. You can work to optimize sorting by adjusting memory usage to larger values (default is 20MB, you can try 200 instead) either within the sort stage or an environment variable (APT_TSORT_STRESS_BLOCKSIZE IIRC).

Regarding your reference data and lookup: Why do you think an in-memory lookup will "result in slow performance"? If your 50 million rows are small enough, they should fit in memory. The time will be spent loading the data (which will happen no matter which join method you are using).

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
dsscholar
Premium Member
Premium Member
Posts: 195
Joined: Thu Oct 19, 2006 2:45 pm

Post by dsscholar »

How to increase the sorting memory to 200MB. You mean the buffer size shown in advanced tab? And another question, is it possible to get max(date column) in aggregator stage. I have issue with that.
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Check the option's sub property i.e. restrict sort memory or something like that. the other option suggested is changing the environment variable.


The second question should have been in a seperate post. However to give a start, use preserve type option in column for calculation sub-property.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
Post Reply