Lookup issue

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsscholar
Premium Member
Posts: 195
Joined: Thu Oct 19, 2006 2:45 pm

Lookup issue

Post by dsscholar »

One source, 4 references and one target. All are Oracle.

Query:

If I use a single Lookup stage, will performance suffer because all of the reference data is loaded into cache memory before processing begins? How does this actually work, and which is the better design: separate Lookup stages for each of the 4 reference tables, or a single stage with 4 reference links? In the case of 4 Lookup stages, only one reference table at a time will be loaded into cache memory, right? If so, will processing be faster in each Lookup stage?

And in the case of Entire partitioning, the reference data will be copied to all 4 partitions (4 nodes defined). Will the output of the lookup then be 4 records with the same content?

Please clarify.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

How big are your reference inputs? Primarily number of rows, but also row size?

Will performance be low as compared to doing what? Whether you use join, lookup or merge, there will be some pre-processing occurring before the data streams are brought together. With join and merge, all of the input streams must be partitioned and sorted at some point before they are brought together. With lookup, the reference data must be loaded into the lookup hash tables before the main data stream is brought in.
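That pre-loading step can be sketched in a few lines of plain Python (a toy illustration of the hash-table idea only, not DataStage internals; every name here is invented for the sketch):

```python
# Toy sketch of why a Lookup must pre-load its reference data before the
# main stream flows: reference rows are read once into an in-memory hash
# table, then each main-stream row is probed against it.
def build_lookup_table(reference_rows, key):
    """Load every reference row into a hash table keyed on `key`."""
    return {row[key]: row for row in reference_rows}

def lookup_stream(main_rows, table, key):
    """Probe the pre-built table once per incoming main-stream row."""
    for row in main_rows:
        ref = table.get(row[key])        # O(1) probe per row
        yield {**row, **(ref or {})}     # enrich the row when a match exists

ref = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
main = [{"id": 1, "qty": 10}, {"id": 3, "qty": 5}]
out = list(lookup_stream(main, build_lookup_table(ref, "id"), "id"))
# out[0] is enriched with "name"; out[1] had no match and passes through
```

The point of the sketch: the table build is a fixed up-front cost proportional to the reference volume, after which each main-stream row costs a single probe.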

Which will be faster---four lookup stages or one with four reference inputs? The answer: "it depends". For relatively small volumes of data (your main data stream here), there will likely be very little difference. For very large volumes of data (many millions of rows), you might see better performance (due to pipeline parallelism) with separate stages that are prevented from combining at runtime. Only testing the different scenarios will tell you for certain which is quickest in your situation, and every situation is different.

Four lookups = one reference table at a time? No. Each stage operates independently of the others, so each will load its own reference table into memory when the parallel engine tells it to initialize.

Will the output of the lookup be four identical records? No, unless you ALSO performed entire partitioning on your main data stream.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
dsscholar
Premium Member
Posts: 195
Joined: Thu Oct 19, 2006 2:45 pm

Post by dsscholar »

I asked for a performance comparison between two lookup designs:

4 lookup stages for 4 reference tables
1 lookup stage for 4 reference tables

22 million records are coming from the source. You mentioned that, for separate lookup stages, the reference data will be loaded separately. I know that, but what about the case of a single lookup stage? Will all 4 reference tables be loaded into memory in the case of pipeline parallelism? What about sequential mode? If the data is large, there is a chance the lookup will run out of memory, right? Is the memory the Lookup stage uses RAM or scratch disk?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

James's questions are pertinent, for all that you seem to want to avoid them. For example 22 million rows can be quite small (for example two bytes per row) or huge in terms of memory demand. But you do not provide guidance, even though memory demand is the principal determinant of what I believe you understand to be meant by "performance".

Your two scenarios won't differ much. In both cases all four reference data sets will need to be loaded into memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Sorry...in the case of multiple lookups in a single stage, yes, all of the reference tables will be loaded. The only other options available would seriously reduce job performance, due either to excessive buffering of your main input stream or to excessive loading/unloading/reloading of reference tables as blocks of your input data are processed.

To sum up, performance would most likely be similar between your two scenarios. However, because each situation is unique, we can't predict with 100% certainty which would be the better performer in your situation...there are too many variables, from the data itself as Ray discusses to the performance of your overall job design to the performance of your data sources and targets.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Pipeline parallelism != partition parallelism

Pipeline parallelism:

source----->stage 1----->stage 2----->stage 3----->stage 4----->target

Separate (non-combined) processes/operators in the OS: stage 1 receives a block of data from the source, processes it, passes it to stage 2, then receives another block from the source and begins processing that. This repeats until no more data is available from the source. Stages 2-4 are doing the same thing at the same time with their respective input and output links. A single pipeline like the one above is essentially a sequential job (single partition). Pipeline parallelism doesn't happen within an operator (unless it is specifically written that way, which most if not all standard operators are not).
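A rough illustration of that block-at-a-time flow, using Python generators as stand-ins for the stages (names and block sizes are invented for the sketch; real operators are separate OS processes, not generators in one process):

```python
# Each "stage" pulls one block from its upstream, transforms it, and yields
# it downstream immediately -- it never waits for the whole dataset.
# Chaining generators mimics the one-block-at-a-time overlap described above.
def source(blocks):
    for block in blocks:
        yield block

def stage(upstream, fn):
    for block in upstream:
        yield [fn(record) for record in block]

pipeline = stage(stage(source([[1, 2], [3, 4]]), lambda r: r * 10),
                 lambda r: r + 1)
result = list(pipeline)   # blocks flow through the chain one at a time
```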

Partition parallelism:

Multiple copies of the pipeline above, with data from source split to the different pipelines. Each pipeline is a logical partition/node of your job. The pipelines are working in parallel with each other (partition parallelism) and are using pipeline parallelism within the pipeline.
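The same idea in miniature: one copy of the stage logic per logical partition, each fed its own slice of the source (a round-robin split is used here purely for illustration; real jobs choose a partitioning method per link, and all names below are made up):

```python
# Split the source round-robin across N logical partitions, then run an
# identical copy of the stage logic over each slice independently.
def round_robin(rows, nodes):
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

def run_partitioned(rows, nodes, stage_fn):
    # Each partition applies the same transformation, independently of the rest.
    return [[stage_fn(r) for r in part] for part in round_robin(rows, nodes)]

out = run_partitioned([1, 2, 3, 4, 5], nodes=2, stage_fn=lambda r: r * 10)
```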

The four reference tables will be loaded in either situation and in both examples of parallelism.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

lookup occupies scratch or RAM?

Post by sendmkpk »

Hi

I just want to know whether the Lookup stage uses scratch/resource disk or RAM for loading the reference data.

And does it load the same reference data into all partitions, parts of the data into different partitions, or all of the reference data into shared RAM, or .....
Praveen
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The reference data for a Lookup stage is loaded into memory.

Which rows are loaded to which node depends on the partitioning algorithm; Entire places all rows on every node, while key-based algorithms (Hash, for example) place a discrete subset of rows on each node.
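The two behaviours can be sketched in a few lines of toy Python (illustrative only, not the engine's implementation; integer keys are used so the hash result is deterministic, and all names are invented):

```python
# Entire: every node receives a full copy of the reference rows.
def entire_partition(rows, nodes):
    return [list(rows) for _ in range(nodes)]

# Hash: each row goes to exactly one node, chosen by hashing its key, so
# every node holds a discrete subset and equal keys always land together.
def hash_partition(rows, key, nodes):
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(row[key]) % nodes].append(row)
    return parts

rows = [{"k": 1}, {"k": 2}, {"k": 5}, {"k": 1}]
full_copies = entire_partition(rows, 3)    # 3 nodes, 4 rows on each
subsets = hash_partition(rows, "k", 3)     # 4 rows spread across 3 nodes
```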
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

Post by sendmkpk »

Thx a lot for reply

When partitioning is Auto, how/where is the reference data placed?
Praveen
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Most often, the auto partitioning on the reference data will resolve to Entire partitioning. I don't believe I've yet seen a situation where it hasn't, but maybe someone else has.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Maybe not, if automatic partition insertion is disabled via the environment variable.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

Post by sendmkpk »

So, let's say we have 2 GB of RAM,

and 2.2 GB is the total memory the key columns' data needs to occupy.

Does that mean the lookup stage goes for a toss here?
Praveen
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

With 2GB RAM I'd be surprised if Information Server even installed!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

Post by sendmkpk »

Perhaps it won't.

My question is with respect to the scenario.
Praveen
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

"Goes for a toss"? Is that a new technical term? I'm going to assume you weren't talking about definition #3. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers