a)
My input link is having 100 million records and reference link is having 50 millions records.So i thought to use sparse lookup as the data is very large and in-memory lookup will result in slow performance.All the key columns are indexed. I chose sparse lookup in the ref stage. And in the lookup stage i couldnt match the key columns from both the links. Just i am able to drag columns to output link.
inputlink.keycolumn1 = reflink.keycolumn1 ( unable to do)
We normally drag the column from input link to be matched with key column in ref link. but i am unable to do it. How does the matching happens in sparse lookup then?
b) For the volume of records mentioned above for i/p and ref,if i go for join or merge stage, which will be better in performance
Join do sort on both the links. So merge stage is better? Please clarify on this.
Thanks in Advance
Query on Sparse lookup
Moderators: chulett, rschirm, roy
A sparse lookup for 100 million records will be very slow. Usually you will use a sparse lookup when the input volume is much, much SMALLER than the reference volume.
A join or merge is recommended for these volumes. Your key column names must match on both input links, and your data should be partitioned and sorted prior to the join or merge.
Performance for merge and join will typically be roughly equal. Read the documentation for both stages to understand the difference in operation, as well as searching the forum here for discussion on both.
Regards,
A join or merge is recommended for these volumes. Your key column names must match on both input links, and your data should be partitioned and sorted prior to the join or merge.
Performance for merge and join will typically be roughly equal. Read the documentation for both stages to understand the difference in operation, as well as searching the forum here for discussion on both.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.
What say I? Merge has the same requirement for sort and partitioning as does Join. Lookup is the only one which does not require a sort.
FYI...Join is not performing the sort and partition. The engine is inserting a sort operator and partition operator when the job is run to meet the requirements of your join (or merge), so long as Sort Insertion Partition Insertion have not been disabled.
You have a large amount of data, it is going to take time to process it no matter what you do. You can work to optimize sorting by adjusting memory usage to larger values (default is 20MB, you can try 200 instead) either within the sort stage or an environment variable (APT_TSORT_STRESS_BLOCKSIZE IIRC).
Regarding your reference data and lookup: Why do you think an in-memory lookup will "result in slow performance"? If your 50 million rows are small enough, they should fit in memory. The time will be spent loading the data (which will happen no matter which join method you are using).
Regards,
FYI...Join is not performing the sort and partition. The engine is inserting a sort operator and partition operator when the job is run to meet the requirements of your join (or merge), so long as Sort Insertion Partition Insertion have not been disabled.
You have a large amount of data, it is going to take time to process it no matter what you do. You can work to optimize sorting by adjusting memory usage to larger values (default is 20MB, you can try 200 instead) either within the sort stage or an environment variable (APT_TSORT_STRESS_BLOCKSIZE IIRC).
Regarding your reference data and lookup: Why do you think an in-memory lookup will "result in slow performance"? If your 50 million rows are small enough, they should fit in memory. The time will be spent loading the data (which will happen no matter which join method you are using).
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI
Check the option's sub property i.e. restrict sort memory or something like that. the other option suggested is changing the environment variable.
The second question should have been in a seperate post. However to give a start, use preserve type option in column for calculation sub-property.
The second question should have been in a seperate post. However to give a start, use preserve type option in column for calculation sub-property.
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.![Wink :wink:](./images/smilies/icon_wink.gif)
Genius may have its limitations, but stupidity is not thus handicapped.
![Wink :wink:](./images/smilies/icon_wink.gif)