which stage is more efficient: Lookup, Join or Marge

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
akrzy
Participant
Posts: 121
Joined: Wed Dec 08, 2004 4:46 am

which stage is more efficient: Lookup, Join or Marge

Post by akrzy »

Could you tell me which stage is more efficient:
Lookup, Join or Marge?

We're going process large voulmes of data.
elavenil
Premium Member
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

Since you are going for large volume of data, lookup should not be used. You can use merge or join depends on your req in the sense, use Merge if rejections to be captured otherwise join can be used.

Hope this would help.

Regards
Saravanan
vigneshra
Participant
Posts: 86
Joined: Wed Jun 09, 2004 6:07 am
Location: Chennai

Post by vigneshra »

akrzy,

Can you approximately tell what is the volume of the data your job is going to handle? The answer will be more crisp if you provide the detail of the volume of data. But generally, what Saravanan says will be more applicable.
Vignesh.

"A conclusion is simply the place where you got tired of thinking."
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

As a rule of thumb if I have under 50,000 reference rows I try to use the Lookup stage. It has the best interface of the three and has good reject and conditional lookup capabilities.

If I have a very large number of reference rows but only need to use a subset of them I use the join stage. For example 10 million reference rows where I join to just 1 million of them. While it cannot do rejects I can use a Left Outer join and a filter stage to capture rejects.

If I need to keep track of every primary row and every reference row I use the Merge stage. The merge stage can output to a reject file the reference rows that are not matched as well as handle the master rows that are not matched.
mandyli
Premium Member
Premium Member
Posts: 898
Joined: Wed May 26, 2004 10:45 pm
Location: Chicago

Post by mandyli »

Hi

what vmcburney says correct. If you want to handle more data go for Join stage..

//................................................................

[b]Join V Lookup[/b]
DataStage doesn't know how large your data is, so cannot make an
informed choice whether to combine data using a join stage or a lookup
stage. Here's how to decide which to use:
There are two data sets being combined. One is the primary or driving
dataset, sometimes called the left of the join. The other data set(s) are the
reference datasets, or the right of the join.
In all cases we are concerned with the size of the reference datasets. If
these take up a large amount of memory relative to the physical RAM
memory size of the computer you are running on, then a lookup stage
may thrash because the reference datasets may not fit in RAM along with
everything else that has to be in RAM. This results in very slow
performance since each lookup operation can, and typically does, cause a
page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A
join does a high-speed sort on the driving and reference datasets. This can
involve I/O if the data is big enough, but the I/O is all highly optimized
and sequential. Once the sort is over the join processing is very fast and
never involves paging or other I/O.


//.....................................................................................................

Thanks
Man
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

vmcburney wrote:As a rule of thumb if I have under 50,000 reference rows I try to use the Lookup stage. It has the best interface of the three and has good reject and conditional lookup capabilities.
Only 50k? Geez, a bit stingy with your Lookup use. I have seen excellent (read: better than Join) performance upward to 10 million small rows of reference data.

Again, this is heavily dependent on available resources, particularly memory. If you MUST sort, then a lookup is a waste.
Post Reply