Lookup issue
Moderators: chulett, rschirm, roy
One source, 4 reference tables and one target. All are Oracle.
Query:
If I use a single lookup, will performance be low because all of the reference data is loaded into cache memory before processing begins? How does this actually happen? Which is the better design: separate Lookup stages for each of the 4 reference tables, or a single stage? With 4 Lookup stages, only one reference table at a time is loaded into cache memory, right? In that case, will processing be faster in each Lookup stage?
And in the case of Entire partitioning, the reference data will be copied to all 4 partitions (4 nodes defined). Will the output of the lookup then be 4 records with the same content?
Please clarify.
How big are your reference inputs? Primarily number of rows, but also row size?
Will performance be low as compared to doing what? Whether you use join, lookup or merge, there will be some pre-processing occurring before the data streams are brought together. With join and merge, all of the input streams must be partitioned and sorted at some point before they are brought together. With lookup, the reference data must be loaded into the lookup hash tables before the main data stream is brought in.
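The load-then-stream behavior described above can be sketched in plain Python. This is an illustrative model only, not DataStage internals: the lookup first loads every reference row into an in-memory hash table, and only then streams the main input past it.

```python
# Illustrative sketch (Python, not DataStage internals): a Lookup stage
# first loads all reference rows into a hash table, then streams the
# main input and probes the table one row at a time.

def build_lookup(reference_rows, key):
    """Pre-processing step: load every reference row into a hash table."""
    return {row[key]: row for row in reference_rows}

def lookup_stream(main_rows, table, key):
    """The main stream is processed only after the table is fully loaded."""
    for row in main_rows:
        match = table.get(row[key])          # O(1) probe per input row
        yield {**row, **(match or {})}       # unmatched rows pass through

reference = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
main = [{"id": 1, "qty": 10}, {"id": 3, "qty": 5}]

table = build_lookup(reference, "id")        # must complete before streaming
results = list(lookup_stream(main, table, "id"))
```

The point of the sketch is the ordering: the whole reference set is memory-resident before the first main-stream row is processed, which is exactly the pre-processing cost being discussed.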
Which will be faster---four lookup stages or one with four reference inputs? The answer: "it depends". For relatively small volumes of data (your main data stream here), there will likely be very little difference. For very large volumes of data (many millions of rows), you might see better performance (due to pipeline parallelism) with separate stages that are prevented from combining at runtime. Only testing the different scenarios will tell you for certain which is quickest in your situation, and every situation is different.
Four lookups = one reference table at a time? No. Each stage operates independently of the others, and each will load its own reference table into memory when told to initialize by the parallel engine.
Will the output of the lookup be four identical records? No, unless you ALSO performed entire partitioning on your main data stream.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
I asked for a performance comparison between these two lookup cases:
4 lookup stages for 4 reference tables
1 lookup stage for 4 reference tables
22 million records are coming from the source. You mentioned that, with separate Lookup stages, the data will be loaded separately. I know that, but what about the case of a single Lookup stage? Will all 4 reference tables be loaded into memory in the case of pipeline parallelism? What about sequential mode? If the data is large, there is a chance the lookup memory will run out of space, right? Does the Lookup stage use RAM or scratch disk?
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
James's questions are pertinent, even though you seem to want to avoid them. For example, 22 million rows can be quite small (say, two bytes per row) or huge in terms of memory demand. But you provide no guidance, even though memory demand is the principal determinant of what I believe you mean by "performance".
Your two scenarios won't differ much. In both cases all four reference data sets will need to be loaded into memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sorry... in the case of multiple lookups in a single stage, yes, all of the reference tables will be loaded. The only other options available would seriously reduce job performance, due either to excessive buffering of your main input stream or to excessive loading/unloading/reloading of reference tables as blocks of your input data are processed.
To sum up, performance would most likely be similar between your two scenarios. However, because each situation is unique, we can't predict with 100% certainty which would be the better performer in your situation...there are too many variables, from the data itself as Ray discusses to the performance of your overall job design to the performance of your data sources and targets.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
Pipeline parallelism != partition parallelism
Pipeline parallelism:
source----->stage 1----->stage 2----->stage 3----->stage 4----->target
Separate (non-combined) processes/operators in the OS: stage 1 receives a block of data from the source, processes it, passes it to stage 2, then receives another block from the source and begins processing it. This repeats until no more data is available from the source. Stages 2-4 do the same thing at the same time with their respective input and output links. A single pipeline like the one above is essentially a sequential job (a single partition). Pipeline parallelism doesn't happen within an operator (unless it is specifically written that way, which most if not all standard operators are not).
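The stage-to-stage data flow above can be mimicked with Python generators. This is only an illustrative sketch: generators model the block-at-a-time hand-off, but in DataStage each stage is a separate OS process, so all stages really are busy on different blocks simultaneously.

```python
# Illustrative sketch (Python generators): a chain of stages, each
# processing one block at a time and passing it downstream.
# Generators mimic the data flow only, not the real concurrency.

def source():
    """Emit four blocks of data."""
    for block in range(4):
        yield [block * 10 + i for i in range(3)]

def stage(blocks):
    """Some per-block transformation (here: add 1 to every value)."""
    for block in blocks:
        yield [x + 1 for x in block]

# Build source -> stage1 -> stage2 -> stage3 -> target
pipeline = source()
for _ in range(3):
    pipeline = stage(pipeline)

target = [value for block in pipeline for value in block]
# Each value has passed through three stages, gaining +1 at each.
```

Because each `stage` pulls one block at a time, downstream stages start working before the source is exhausted, which is the essence of pipeline parallelism.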
Partition parallelism:
Multiple copies of the pipeline above, with data from the source split across the different pipelines. Each pipeline is a logical partition/node of your job. The pipelines work in parallel with each other (partition parallelism) and use pipeline parallelism internally.
The four reference tables will be loaded in either situation and in both examples of parallelism.
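Partition parallelism can be sketched the same way (illustrative Python, not DataStage internals): the source is split into per-node slices, and each partition runs its own copy of the lookup, loading its own copy of the reference data, which is why the reference tables are loaded in every case.

```python
# Illustrative sketch (Python): partition parallelism as multiple copies
# of the same pipeline, each fed a key-partitioned slice of the source.
# With Entire partitioning on the reference link, every partition loads
# the full reference set, as described above.

def run_partition(main_slice, reference):
    """One logical node: load the (full) reference set, then stream."""
    table = {r["id"]: r for r in reference}   # full copy per partition
    return [{**row, **table.get(row["id"], {})} for row in main_slice]

reference = [{"id": i, "name": f"ref{i}"} for i in range(4)]
source = [{"id": i % 4} for i in range(8)]

num_nodes = 4
# Key-based split of the main stream: each row goes to exactly one node.
slices = [[r for r in source if r["id"] % num_nodes == n]
          for n in range(num_nodes)]
outputs = [run_partition(s, reference) for s in slices]
```

Each slice is disjoint, so the main rows are processed once each, even though the reference data is duplicated on every node.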
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
Does Lookup occupy scratch or RAM?
Hi
I just want to know whether the Lookup stage uses scratch/resource disk or RAM for loading reference data into.
And does it load the same reference data into all partitions, parts of the data into different partitions, or all reference data into shared RAM, or .....
Praveen
The reference data for a Lookup stage is loaded into memory.
Which rows are loaded to which node depends on the partitioning algorithm: Entire places all rows on every node; key-based algorithms (Hash, for example) place a discrete subset of rows on each node.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 1735
- Joined: Thu Mar 01, 2007 5:44 am
- Location: Troy, MI
"Goes for a toss"? Is that a new technical term? I'm going to assume you weren't talking about definition #3. ![Wink :wink:](./images/smilies/icon_wink.gif)
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers