Could you tell me which stage is more efficient:
Lookup, Join or Marge?
We're going process large voulmes of data.
which stage is more efficient: Lookup, Join or Marge
Moderators: chulett, rschirm, roy
akrzy,
Can you approximately tell what is the volume of the data your job is going to handle? The answer will be more crisp if you provide the detail of the volume of data. But generally, what Saravanan says will be more applicable.
Can you approximately tell what is the volume of the data your job is going to handle? The answer will be more crisp if you provide the detail of the volume of data. But generally, what Saravanan says will be more applicable.
Vignesh.
"A conclusion is simply the place where you got tired of thinking."
"A conclusion is simply the place where you got tired of thinking."
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
As a rule of thumb if I have under 50,000 reference rows I try to use the Lookup stage. It has the best interface of the three and has good reject and conditional lookup capabilities.
If I have a very large number of reference rows but only need to use a subset of them I use the join stage. For example 10 million reference rows where I join to just 1 million of them. While it cannot do rejects I can use a Left Outer join and a filter stage to capture rejects.
If I need to keep track of every primary row and every reference row I use the Merge stage. The merge stage can output to a reject file the reference rows that are not matched as well as handle the master rows that are not matched.
If I have a very large number of reference rows but only need to use a subset of them I use the join stage. For example 10 million reference rows where I join to just 1 million of them. While it cannot do rejects I can use a Left Outer join and a filter stage to capture rejects.
If I need to keep track of every primary row and every reference row I use the Merge stage. The merge stage can output to a reject file the reference rows that are not matched as well as handle the master rows that are not matched.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Hi
what vmcburney says correct. If you want to handle more data go for Join stage..
//................................................................
[b]Join V Lookup[/b]
DataStage doesn't know how large your data is, so cannot make an
informed choice whether to combine data using a join stage or a lookup
stage. Here's how to decide which to use:
There are two data sets being combined. One is the primary or driving
dataset, sometimes called the left of the join. The other data set(s) are the
reference datasets, or the right of the join.
In all cases we are concerned with the size of the reference datasets. If
these take up a large amount of memory relative to the physical RAM
memory size of the computer you are running on, then a lookup stage
may thrash because the reference datasets may not fit in RAM along with
everything else that has to be in RAM. This results in very slow
performance since each lookup operation can, and typically does, cause a
page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A
join does a high-speed sort on the driving and reference datasets. This can
involve I/O if the data is big enough, but the I/O is all highly optimized
and sequential. Once the sort is over the join processing is very fast and
never involves paging or other I/O.
//.....................................................................................................
Thanks
Man
what vmcburney says correct. If you want to handle more data go for Join stage..
//................................................................
[b]Join V Lookup[/b]
DataStage doesn't know how large your data is, so cannot make an
informed choice whether to combine data using a join stage or a lookup
stage. Here's how to decide which to use:
There are two data sets being combined. One is the primary or driving
dataset, sometimes called the left of the join. The other data set(s) are the
reference datasets, or the right of the join.
In all cases we are concerned with the size of the reference datasets. If
these take up a large amount of memory relative to the physical RAM
memory size of the computer you are running on, then a lookup stage
may thrash because the reference datasets may not fit in RAM along with
everything else that has to be in RAM. This results in very slow
performance since each lookup operation can, and typically does, cause a
page fault and an I/O operation.
So, if the reference datasets are big enough to cause trouble, use a join. A
join does a high-speed sort on the driving and reference datasets. This can
involve I/O if the data is big enough, but the I/O is all highly optimized
and sequential. Once the sort is over the join processing is very fast and
never involves paging or other I/O.
//.....................................................................................................
Thanks
Man
Only 50k? Geez, a bit stingy with your Lookup use. I have seen excellent (read: better than Join) performance upward to 10 million small rows of reference data.vmcburney wrote:As a rule of thumb if I have under 50,000 reference rows I try to use the Lookup stage. It has the best interface of the three and has good reject and conditional lookup capabilities.
Again, this is heavily dependent on available resources, particularly memory. If you MUST sort, then a lookup is a waste.