difference betwn look up stage and join satge in PX
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 497
- Joined: Sun Dec 17, 2006 11:52 pm
- Location: Kolkata
- Contact:
difference betwn look up stage and join satge in PX
Hi All,
Can anyone tell me the difference betwn look up stage and join satge in PX. I presume the look up stage basically acts as the inner join property of join stage and the manual says that performance wise join is faster than a look up stage.
Can anyone tell me the difference betwn look up stage and join satge in PX. I presume the look up stage basically acts as the inner join property of join stage and the manual says that performance wise join is faster than a look up stage.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Don't believe everything you read about "performance", which is ill-defined in an ETL context.
There are three stages that effect horizontal combination of data, each subtly different - Lookup, Join and Merge.
Only the Lookup stage allows conditional lookups. The Lookup stage has a single reject link - the Merge stage can have one reject link for each reference input link.
The Lookup stage does not require sorted input, whereas the Join and Merge stages do. So the alleged "performance" advantage of the Join stage might be consumed by the overhead of sorting.
The Lookup stage supports "sparse" lookup.
There are more differences. Read the three relevant chapters from the Parallel Job Developer's Guide and post any specific questions that may remain in your mind.
There are three stages that effect horizontal combination of data, each subtly different - Lookup, Join and Merge.
Only the Lookup stage allows conditional lookups. The Lookup stage has a single reject link - the Merge stage can have one reject link for each reference input link.
The Lookup stage does not require sorted input, whereas the Join and Merge stages do. So the alleged "performance" advantage of the Join stage might be consumed by the overhead of sorting.
The Lookup stage supports "sparse" lookup.
There are more differences. Read the three relevant chapters from the Parallel Job Developer's Guide and post any specific questions that may remain in your mind.
Last edited by ray.wurlod on Sun Apr 22, 2007 10:47 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
The answers not quite as simple as that, it depends on the volumes of data you have in your two datasets, what the distinct set of key values are, and what type of actions you want when matches aren't found. Depending on the answers to these questions you can use the JOIN, LOOKUP or MERGE stage to join two datasets together.
Have a look at the manual pages for these stages, it helps to explain the subtle differences between them. Pay attention to the different reject handling each stage can provide, i.e. the join stage just does the join without reporting records that cannot match, much like a join in SQL so you may loose records without knowing about it.
The lookup stage (If used in normal mode rather than sparse) will use more memory that the other two, but can be quicker if you have large number of driving records and a small set of update/lookup values, sa these can be loaded into memory and the throughput of the job is continuous.
Have a look at the manual pages for these stages, it helps to explain the subtle differences between them. Pay attention to the different reject handling each stage can provide, i.e. the join stage just does the join without reporting records that cannot match, much like a join in SQL so you may loose records without knowing about it.
The lookup stage (If used in normal mode rather than sparse) will use more memory that the other two, but can be quicker if you have large number of driving records and a small set of update/lookup values, sa these can be loaded into memory and the throughput of the job is continuous.
Regards,
Nick.
Nick.
-
- Premium Member
- Posts: 497
- Joined: Sun Dec 17, 2006 11:52 pm
- Location: Kolkata
- Contact:
Please can you throw some light on sparse lookup,we tend to use normal lookup always.ray.wurlod wrote:Don't believe everything you read about "performance", which is ill-defined in an ETL context.
There are three stages that effect horizontal combination of data, each subtly different - Lookup, Join and Merge.
Only the Lookup stage allows conditional lookups. The Lookup stage has a single reject link - the Merge stage can have one reject link for each reference input link.
The Lookup stage does not require sorted input, whereas the Join and Merge stages do. So the alleged "performance" advantage of the Join stage might be consumed by the overhead of sorting.
The Lookup stage supports "sparse" lookup.
There are more differences. Read the three relevant chapters from the Parallel Job Developer's Guide and post any specific questions that may remain in your mind.
Usually Lookup loads the lookup data in the physical memory. Which turned out to be fastest among all. Provided the data volume that stored in memroy is managalble.
Sparse look is done at database for each input record. This avoids the overload of loading the whole data into memrory. But the same time, need to compromise the preformance, since each query is executed at database and the data need to be trasfered through network.
Sparse look is done at database for each input record. This avoids the overload of loading the whole data into memrory. But the same time, need to compromise the preformance, since each query is executed at database and the data need to be trasfered through network.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
-
- Premium Member
- Posts: 497
- Joined: Sun Dec 17, 2006 11:52 pm
- Location: Kolkata
- Contact:
So in sparse lookup the data is not stored in the memory , but will the output result differ from that of normal look up.kumar_s wrote:Usually Lookup loads the lookup data in the physical memory. Which turned out to be fastest among all. Provided the data volume that stored in memroy is managalble.
Sparse look is done at database for each input record. This avoids the overload of loading the whole data into memrory. But the same time, need to compromise the preformance, since each query is executed at database and the data need to be trasfered through network.
No, the output will be identical providing you don't change the lookup data while the job is running, in which case the sparse will show the new value and a normal one, which uses a memory image of the data as it was when the job began, will show the old value.
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
Difference between lookup and join
I have often heard lookup should be used for low volume of data. Is there any limit to this 'volume' say 50K or something? How do we determine when the lookup stage should be replaced by join, if we consider only the number of records?ray.wurlod wrote:Don't believe everything you read about "performance", which is ill-defined in an ETL context.
There are three stages that effect horizontal combination of data, each subtly different - Lookup, Joi ...
Re: Difference between lookup and join
To simply use the comment "low volume of data" is inaccurate and limiting because it doesn't indicate which data is being described. You can process any number of records through a lookup stage. The limitation is the size of your reference data (lookup table)---how much memory will the lookup table consume when loaded and how much memory is available on your servers?
Part of your choice in which stage to use should take into consideration the purpose of the join operation: Lookups are typically (but not always) used to apply small data values to incoming records in a one-to-many situation (such as comparing to a list of valid values for a data column). Joins and merges are typically used to match two or more data records together, often in a one-to-one situation (such as updating address data for existing accounts during a database update) . These typical uses are not hard-and-fast rules.
If your reference data is very large is size (total bytes)--say nearing 1GB--you may want to consider a join instead of a lookup especially if your server/engine is limited in available memory. If you're performing multiple lookups then consider the size of all of your reference data sources.
As was suggested earlier in the thread, read the documentation for join, merge and lookup stages. As you learn how each operates and their limitations and advantages, you'll know better when to use which one in your jobs.
Part of your choice in which stage to use should take into consideration the purpose of the join operation: Lookups are typically (but not always) used to apply small data values to incoming records in a one-to-many situation (such as comparing to a list of valid values for a data column). Joins and merges are typically used to match two or more data records together, often in a one-to-one situation (such as updating address data for existing accounts during a database update) . These typical uses are not hard-and-fast rules.
If your reference data is very large is size (total bytes)--say nearing 1GB--you may want to consider a join instead of a lookup especially if your server/engine is limited in available memory. If you're performing multiple lookups then consider the size of all of your reference data sources.
As was suggested earlier in the thread, read the documentation for join, merge and lookup stages. As you learn how each operates and their limitations and advantages, you'll know better when to use which one in your jobs.