Lookup issue

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsscholar
Premium Member
Posts: 195
Joined: Thu Oct 19, 2006 2:45 pm

Lookup issue

Post by dsscholar »

One source, 4 references and one target. All are Oracle.

Query:

If I use a single Lookup stage, will performance suffer because all of the reference data is loaded into cache memory before processing begins? How does this actually work, and which is the better design: separate Lookup stages for each of the 4 reference tables, or a single stage with 4 reference links? In the case of 4 Lookup stages, only one reference table at a time will be loaded into cache memory, right? If so, will processing be faster in each Lookup stage?

And in the case of Entire partitioning, the reference data will be copied to all 4 partitions (4 nodes defined). Will the output of the lookup then be 4 records with the same content?

Please clarify.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

How big are your reference inputs? Primarily number of rows, but also row size?

Will performance be low as compared to doing what? Whether you use join, lookup or merge, there will be some pre-processing occurring before the data streams are brought together. With join and merge, all of the input streams must be partitioned and sorted at some point before they are brought together. With lookup, the reference data must be loaded into the lookup hash tables before the main data stream is brought in.
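That pre-loading step can be sketched in a few lines of plain Python (a toy illustration of the hash-table idea only, not DataStage internals; every name here is invented for the sketch):

```python
# Toy sketch of why a Lookup must pre-load its reference data before the
# main stream flows: reference rows are read once into an in-memory hash
# table, then each main-stream row is probed against it.
def build_lookup_table(reference_rows, key):
    """Load every reference row into a hash table keyed on `key`."""
    return {row[key]: row for row in reference_rows}

def lookup_stream(main_rows, table, key):
    """Probe the pre-built table once per incoming main-stream row."""
    for row in main_rows:
        ref = table.get(row[key])        # O(1) probe per row
        yield {**row, **(ref or {})}     # enrich the row when a match exists

ref = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
main = [{"id": 1, "qty": 10}, {"id": 3, "qty": 5}]
out = list(lookup_stream(main, build_lookup_table(ref, "id"), "id"))
# out[0] is enriched with "name"; out[1] had no match and passes through
```

The point of the sketch: the table build is a fixed up-front cost proportional to the reference volume, after which each main-stream row costs a single probe.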

Which will be faster---four lookup stages or one with four reference inputs? The answer: "it depends". For relatively small volumes of data (your main data stream here), there will likely be very little difference. For very large volumes of data (many millions of rows), you might see better performance (due to pipeline parallelism) with separate stages that are prevented from combining at runtime. Only testing the different scenarios will tell you for certain which is quickest in your situation, and every situation is different.

Four lookups = one reference table at a time? No. Each stage operates independently of the others, so each will load its own reference table into memory when the parallel engine tells it to initialize.

Will the output of the lookup be four identical records? No, unless you ALSO performed entire partitioning on your main data stream.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
dsscholar
Premium Member
Posts: 195
Joined: Thu Oct 19, 2006 2:45 pm

Post by dsscholar »

I asked for a performance comparison between two lookup designs:

4 lookup stages for 4 reference tables
1 lookup stage for 4 reference tables

22 million records are coming from the source. You mentioned that, for separate lookup stages, the reference data will be loaded separately. I know that, but what about the case of a single lookup stage? Will all 4 reference tables be loaded into memory in the case of pipeline parallelism? What about sequential mode? If the data is large, there is a chance the lookup will run out of memory, right? Is the memory the Lookup stage uses RAM or scratch disk?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

James's questions are pertinent, for all that you seem to want to avoid them. For example 22 million rows can be quite small (for example two bytes per row) or huge in terms of memory demand. But you do not provide guidance, even though memory demand is the principal determinant of what I believe you understand to be meant by "performance".

Your two scenarios won't differ much. In both cases all four reference data sets will need to be loaded into memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Sorry...in the case of multiple lookups in a single stage, yes, all of the reference tables will be loaded. The only other options available would seriously reduce job performance, due either to excessive buffering of your main input stream or to excessive loading/unloading/reloading of reference tables as blocks of your input data are processed.

To sum up, performance would most likely be similar between your two scenarios. However, because each situation is unique, we can't predict with 100% certainty which would be the better performer in your situation...there are too many variables, from the data itself as Ray discusses to the performance of your overall job design to the performance of your data sources and targets.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Pipeline parallelism != partition parallelism

Pipeline parallelism:

source----->stage 1----->stage 2----->stage 3----->stage 4----->target

Separate (non-combined) processes/operators in the OS: stage 1 receives a block of data from the source, processes it, passes it to stage 2, then receives another block from the source and begins processing that. This repeats until no more data is available from the source. Stages 2-4 are doing the same thing at the same time with their respective input and output links. A single pipeline like the one above is essentially a sequential job (single partition). Pipeline parallelism doesn't happen within an operator (unless it is specifically written that way, which most if not all standard operators are not).
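A rough illustration of that block-at-a-time flow, using Python generators as stand-ins for the stages (names and block sizes are invented for the sketch; real operators are separate OS processes, not generators in one process):

```python
# Each "stage" pulls one block from its upstream, transforms it, and yields
# it downstream immediately -- it never waits for the whole dataset.
# Chaining generators mimics the one-block-at-a-time overlap described above.
def source(blocks):
    for block in blocks:
        yield block

def stage(upstream, fn):
    for block in upstream:
        yield [fn(record) for record in block]

pipeline = stage(stage(source([[1, 2], [3, 4]]), lambda r: r * 10),
                 lambda r: r + 1)
result = list(pipeline)   # blocks flow through the chain one at a time
```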

Partition parallelism:

Multiple copies of the pipeline above, with data from source split to the different pipelines. Each pipeline is a logical partition/node of your job. The pipelines are working in parallel with each other (partition parallelism) and are using pipeline parallelism within the pipeline.
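The same idea in miniature: one copy of the stage logic per logical partition, each fed its own slice of the source (a round-robin split is used here purely for illustration; real jobs choose a partitioning method per link, and all names below are made up):

```python
# Split the source round-robin across N logical partitions, then run an
# identical copy of the stage logic over each slice independently.
def round_robin(rows, nodes):
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

def run_partitioned(rows, nodes, stage_fn):
    # Each partition applies the same transformation, independently of the rest.
    return [[stage_fn(r) for r in part] for part in round_robin(rows, nodes)]

out = run_partitioned([1, 2, 3, 4, 5], nodes=2, stage_fn=lambda r: r * 10)
```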

The four reference tables will be loaded in either situation and in both examples of parallelism.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

lookup occupies scratch or RAM?

Post by sendmkpk »

Hi

I just want to know whether the Lookup stage uses scratch/resource disk or RAM for loading the reference data.

And does it load the same reference data into all partitions, parts of the data into different partitions, or all of the reference data into shared RAM, or .....
Praveen
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The reference data for a Lookup stage is loaded into memory.

Which rows are loaded to which node depends on the partitioning algorithm; Entire places all rows on every node, while key-based algorithms (Hash, for example) place a discrete subset of rows on each node.
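The two behaviours can be sketched in a few lines of toy Python (illustrative only, not the engine's implementation; integer keys are used so the hash result is deterministic, and all names are invented):

```python
# Entire: every node receives a full copy of the reference rows.
def entire_partition(rows, nodes):
    return [list(rows) for _ in range(nodes)]

# Hash: each row goes to exactly one node, chosen by hashing its key, so
# every node holds a discrete subset and equal keys always land together.
def hash_partition(rows, key, nodes):
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(row[key]) % nodes].append(row)
    return parts

rows = [{"k": 1}, {"k": 2}, {"k": 5}, {"k": 1}]
full_copies = entire_partition(rows, 3)    # 3 nodes, 4 rows on each
subsets = hash_partition(rows, "k", 3)     # 4 rows spread across 3 nodes
```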
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

Post by sendmkpk »

Thx a lot for reply

When partitioning is Auto, how/where is the reference data placed?
Praveen
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Most often, the auto partitioning on the reference data will resolve to Entire partitioning. I don't believe I've yet seen a situation where it hasn't, but maybe someone else has.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Maybe not, if automatic partition insertion is disabled via the environment variable.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

Post by sendmkpk »

So, let's say we have 2 GB of RAM,

and 2.2 GB is the total memory the key columns' data needs to occupy.

Does that mean the lookup stage goes for a toss here?
Praveen
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

With 2GB RAM I'd be surprised if Information Server even installed!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

Post by sendmkpk »

Perhaps it won't.

My question is with respect to the scenario.
Praveen
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

"Goes for a toss"? Is that a new technical term? I'm going to assume you weren't talking about definition #3. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers