Which stage is more efficient: Lookup, Join or Merge?

akrzy · Post by **akrzy** » Thu Feb 24, 2005 3:46 am

Could you tell me which stage is more efficient:
Lookup, Join or Merge?

We're going to process large volumes of data.

elavenil · Post by **elavenil** » Thu Feb 24, 2005 4:38 am

Since you are dealing with a large volume of data, Lookup should not be used. You can use Merge or Join depending on your requirements: use Merge if rejects need to be captured; otherwise Join can be used.

Hope this helps.

Regards
Saravanan

vigneshra · Post by **vigneshra** » Thu Feb 24, 2005 5:00 am

akrzy,

Can you tell us approximately what volume of data your job is going to handle? The answer can be more precise if you provide that detail. But generally, what Saravanan says will apply.

vmcburney · Post by **vmcburney** » Thu Feb 24, 2005 5:26 am

As a rule of thumb if I have under 50,000 reference rows I try to use the Lookup stage. It has the best interface of the three and has good reject and conditional lookup capabilities.

If I have a very large number of reference rows but only need to use a subset of them, I use the Join stage. For example, 10 million reference rows where I join to just 1 million of them. While it cannot do rejects, I can use a Left Outer join and a Filter stage to capture rejects.

If I need to keep track of every primary row and every reference row I use the Merge stage. The merge stage can output to a reject file the reference rows that are not matched as well as handle the master rows that are not matched.
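To illustrate the Left Outer join plus filter idea for capturing rejects, here is a rough sketch in plain Python. It is not DataStage code; the function names, the column names, and the assumption that the checked reference column is never null in the reference data are all made up for the example.

```python
def left_outer_join(primary_rows, reference_rows, key, ref_columns):
    """Keep every primary row; fill reference columns with None when unmatched."""
    # Assumption for this sketch: at most one reference row per key value.
    reference_by_key = {row[key]: row for row in reference_rows}
    joined = []
    for row in primary_rows:
        ref = reference_by_key.get(row[key], {})
        out = dict(row)
        for col in ref_columns:
            out[col] = ref.get(col)  # None plays the role of a null column
        joined.append(out)
    return joined


def split_rejects(joined_rows, ref_column):
    """Filter step: rows whose reference column is still None are the rejects."""
    # Assumption: ref_column is never None in genuine reference data.
    matched = [r for r in joined_rows if r[ref_column] is not None]
    rejects = [r for r in joined_rows if r[ref_column] is None]
    return matched, rejects


primary = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 30}]
reference = [{"id": 1, "name": "a"}, {"id": 3, "name": "c"}]

joined = left_outer_join(primary, reference, "id", ["name"])
matched, rejects = split_rejects(joined, "name")
```

Here the row with id 2 has no reference match, so it ends up in `rejects` while the other two rows carry their `name` value through.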

mandyli · Post by **mandyli** » Thu Feb 24, 2005 5:53 am

Hi

What vmcburney says is correct. If you need to handle more data, go for the Join stage.

//................................................................

**Join V Lookup**

DataStage doesn't know how large your data is, so cannot make an informed choice whether to combine data using a join stage or a lookup stage. Here's how to decide which to use:

There are two data sets being combined. One is the primary or driving dataset, sometimes called the left of the join. The other data set(s) are the reference datasets, or the right of the join.

In all cases we are concerned with the size of the reference datasets. If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage may thrash because the reference datasets may not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation.

So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O.

//.....................................................................................................
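The difference described in the quoted text can be sketched in plain Python. This is only an illustration of the two general approaches (an in-memory keyed table versus sorting both inputs and walking them in order), not how DataStage implements either stage; the function names are hypothetical, and the sketch assumes unique keys on the reference side and an inner-join result.

```python
def lookup_combine(primary_rows, reference_rows, key):
    """Lookup style: the whole reference set sits in memory as a keyed table."""
    table = {row[key]: row for row in reference_rows}  # must fit in RAM
    for row in primary_rows:
        ref = table.get(row[key])  # random access for every primary row
        if ref is not None:
            yield {**ref, **row}


def sort_merge_join(primary_rows, reference_rows, key):
    """Join style: sort both inputs on the key, then read them sequentially."""
    # A real engine could spill these sorts to disk; sorted() here is in memory.
    left = sorted(primary_rows, key=lambda r: r[key])
    right = sorted(reference_rows, key=lambda r: r[key])
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1  # primary row without a match, skipped in an inner join
        elif left_key > right_key:
            j += 1  # reference row that no primary row uses
        else:
            yield {**right[j], **left[i]}
            i += 1  # keep j in place so repeated primary keys still match


primary = [
    {"id": 3, "amt": 30},
    {"id": 1, "amt": 10},
    {"id": 2, "amt": 20},
    {"id": 1, "amt": 11},
]
reference = [{"id": 1, "name": "a"}, {"id": 3, "name": "c"}, {"id": 4, "name": "d"}]

by_lookup = sorted(lookup_combine(primary, reference, "id"),
                   key=lambda r: (r["id"], r["amt"]))
by_join = list(sort_merge_join(primary, reference, "id"))
```

Both functions produce the same matched rows. The point is the access pattern: the lookup version probes the table at random for every primary row, which is where paging hurts once the table outgrows RAM, while the join version only moves forward through two sorted inputs.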

Thanks
Man

T42 · Post by **T42** » Tue Mar 01, 2005 2:59 pm

vmcburney wrote: As a rule of thumb if I have under 50,000 reference rows I try to use the Lookup stage. It has the best interface of the three and has good reject and conditional lookup capabilities.

Only 50k? Geez, a bit stingy with your Lookup use. I have seen excellent (read: better than Join) performance with upward of 10 million small rows of reference data.

Again, this is heavily dependent on available resources, particularly memory. If you MUST sort, then a lookup is a waste.