difference betwn look up stage and join satge in PX

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

difference betwn look up stage and join satge in PX

Post by pravin1581 »

Hi All,

Can anyone tell me the difference betwn look up stage and join satge in PX. I presume the look up stage basically acts as the inner join property of join stage and the manual says that performance wise join is faster than a look up stage.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Don't believe everything you read about "performance", which is ill-defined in an ETL context.

There are three stages that effect horizontal combination of data, each subtly different - Lookup, Join and Merge.

Only the Lookup stage allows conditional lookups. The Lookup stage has a single reject link - the Merge stage can have one reject link for each reference input link.

The Lookup stage does not require sorted input, whereas the Join and Merge stages do. So the alleged "performance" advantage of the Join stage might be consumed by the overhead of sorting.

The Lookup stage supports "sparse" lookup.

There are more differences. Read the three relevant chapters from the Parallel Job Developer's Guide and post any specific questions that may remain in your mind.
Last edited by ray.wurlod on Sun Apr 22, 2007 10:47 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nick.bond
Charter Member
Charter Member
Posts: 230
Joined: Thu Jan 15, 2004 12:00 pm
Location: London

Post by nick.bond »

The answers not quite as simple as that, it depends on the volumes of data you have in your two datasets, what the distinct set of key values are, and what type of actions you want when matches aren't found. Depending on the answers to these questions you can use the JOIN, LOOKUP or MERGE stage to join two datasets together.

Have a look at the manual pages for these stages, it helps to explain the subtle differences between them. Pay attention to the different reject handling each stage can provide, i.e. the join stage just does the join without reporting records that cannot match, much like a join in SQL so you may loose records without knowing about it.

The lookup stage (If used in normal mode rather than sparse) will use more memory that the other two, but can be quicker if you have large number of driving records and a small set of update/lookup values, sa these can be loaded into memory and the throughput of the job is continuous.
Regards,

Nick.
nick.bond
Charter Member
Charter Member
Posts: 230
Joined: Thu Jan 15, 2004 12:00 pm
Location: London

Post by nick.bond »

The answers not quite as simple as that
Was commenting on initial post not your response Ray!
Regards,

Nick.
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

ray.wurlod wrote:Don't believe everything you read about "performance", which is ill-defined in an ETL context.

There are three stages that effect horizontal combination of data, each subtly different - Lookup, Join and Merge.

Only the Lookup stage allows conditional lookups. The Lookup stage has a single reject link - the Merge stage can have one reject link for each reference input link.

The Lookup stage does not require sorted input, whereas the Join and Merge stages do. So the alleged "performance" advantage of the Join stage might be consumed by the overhead of sorting.

The Lookup stage supports "sparse" lookup.

There are more differences. Read the three relevant chapters from the Parallel Job Developer's Guide and post any specific questions that may remain in your mind.
Please can you throw some light on sparse lookup,we tend to use normal lookup always.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Usually Lookup loads the lookup data in the physical memory. Which turned out to be fastest among all. Provided the data volume that stored in memroy is managalble.
Sparse look is done at database for each input record. This avoids the overload of loading the whole data into memrory. But the same time, need to compromise the preformance, since each query is executed at database and the data need to be trasfered through network.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pravin1581
Premium Member
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

kumar_s wrote:Usually Lookup loads the lookup data in the physical memory. Which turned out to be fastest among all. Provided the data volume that stored in memroy is managalble.
Sparse look is done at database for each input record. This avoids the overload of loading the whole data into memrory. But the same time, need to compromise the preformance, since each query is executed at database and the data need to be trasfered through network.
So in sparse lookup the data is not stored in the memory , but will the output result differ from that of normal look up.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

No, the output will be identical providing you don't change the lookup data while the job is running, in which case the sparse will show the new value and a normal one, which uses a memory image of the data as it was when the job began, will show the old value.
sweety123
Participant
Posts: 18
Joined: Thu Dec 10, 2009 5:28 am

Difference between lookup and join

Post by sweety123 »

ray.wurlod wrote:Don't believe everything you read about "performance", which is ill-defined in an ETL context.

There are three stages that effect horizontal combination of data, each subtly different - Lookup, Joi ...
I have often heard lookup should be used for low volume of data. Is there any limit to this 'volume' say 50K or something? How do we determine when the lookup stage should be replaced by join, if we consider only the number of records?
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Re: Difference between lookup and join

Post by jwiles »

To simply use the comment "low volume of data" is inaccurate and limiting because it doesn't indicate which data is being described. You can process any number of records through a lookup stage. The limitation is the size of your reference data (lookup table)---how much memory will the lookup table consume when loaded and how much memory is available on your servers?

Part of your choice in which stage to use should take into consideration the purpose of the join operation: Lookups are typically (but not always) used to apply small data values to incoming records in a one-to-many situation (such as comparing to a list of valid values for a data column). Joins and merges are typically used to match two or more data records together, often in a one-to-one situation (such as updating address data for existing accounts during a database update) . These typical uses are not hard-and-fast rules.

If your reference data is very large is size (total bytes)--say nearing 1GB--you may want to consider a join instead of a lookup especially if your server/engine is limited in available memory. If you're performing multiple lookups then consider the size of all of your reference data sources.

As was suggested earlier in the thread, read the documentation for join, merge and lookup stages. As you learn how each operates and their limitations and advantages, you'll know better when to use which one in your jobs.
Post Reply