Join stage memory usage
Hi All,
Can anybody please explain the join operation to me in terms of memory usage?
As per my understanding, Join doesn't take all the data into memory the way Lookup does; instead it goes to the table (say both input links are Oracle sources) and reads the data page-wise. But if the join's input streams are hash partitioned and sorted, how will it use memory / scratch space?
Currently I am getting the following error with 3 billion rows in one input link:
Tsort merger aborting: Scratch space full
Please suggest!
Thanks,
Ankita
The Join stage uses hardly any memory at all. It takes in one row from its left input, and only those rows from the right input for which the join keys match.
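Conceptually it behaves like the merge phase of a sort-merge join. Here is a rough Python sketch of that pattern (my own illustration, not the actual operator code; all names are mine):
Code: Select all
import itertools

def merge_join(left, right, key):
    # Inner join of two inputs already sorted on the join key.
    # At any moment only one left row plus the group of right
    # rows sharing the current key are held in memory.
    right_groups = itertools.groupby(right, key)
    rk, grp = next(right_groups, (None, None))
    buffered = []                      # right rows for the current key
    for l in left:
        lk = key(l)
        while rk is not None and rk < lk:
            rk, grp = next(right_groups, (None, None))
            buffered = []
        if rk == lk:
            if not buffered:
                buffered = list(grp)   # materialise this key group once
            for r in buffered:
                yield (l, r)

left  = [(1, 'a'), (2, 'b'), (2, 'c'), (4, 'd')]
right = [(2, 'X'), (2, 'Y'), (3, 'Z')]
print(list(merge_join(left, right, key=lambda row: row[0])))
The memory footprint is one key group, not a whole dataset - which is why the cost you are seeing sits in the upstream sort, not in the join itself.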
Your error actually came from an (inserted?) tsort operator that precedes the join operator. Dump the score to see how this is connected. So it is the sorting process - in particular the merging of the heaps - where your job ran out of scratch space.
Configure more scratch space by adding more file systems as resource scratchdisk (for all partitions) in your configuration file. Adding more directories on the existing file systems will not help - it is the file systems that fill.
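For example, a node entry in the configuration file might gain an extra scratch file system like this (the host name and paths here are hypothetical; substitute your own):
Code: Select all
{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/scratch_fs1" {pools ""}
        resource scratchdisk "/scratch_fs2" {pools ""}
    }
}
Add the extra resource scratchdisk entry to every node so all partitions benefit, and make sure each entry really is a separate file system rather than another directory on the one that is filling.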
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
One thing a Join will do that a Lookup won't is insist that both input streams are sorted on the join keys. If your reference volume is low (say, less than 10 million rows) it may pay to use a Lookup instead and remove the sort requirement altogether; holding 10 million rows in memory is easier than pushing 3 billion rows through sort scratch space. A rough sketch of the difference follows.
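For contrast with the merge-join sketch above, a Lookup conceptually builds an in-memory table from the reference link and probes it while streaming the big link. Again, only a Python illustration, not the operator's code:
Code: Select all
def hash_lookup(stream, reference, key):
    # Build a hash table from the small reference input, then
    # probe it row by row from the big streaming input. Memory
    # cost is the reference data; the stream is never buffered.
    table = {}
    for r in reference:
        table.setdefault(key(r), []).append(r)
    for s in stream:
        for match in table.get(key(s), []):
            yield (s, match)
Neither input needs to be sorted, which is why the inserted tsort (and its scratch space demand) disappears.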
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
Thanks for your replies, but I am still not able to solve it.
Since the data volume is more than 10 million rows, Join is used. Now I need to estimate the additional scratch space needed to sort a specific volume of records (say 5 billion rows in one input stream and 25 million in the other).
I need to understand how much data DataStage will pull into scratch / memory at a time in order to sort the input streams of the join.
You can calculate the disk requirement for your job based on the record length from each source. This is a rough calculation, but it will at least give you some idea. The procedure is as follows:
1. Record length calculation - For each source, add up the lengths of all columns. If a column's datatype is variable length, add 1 byte of overhead. For example, if your source has 3 columns:
Code: Select all
ID NUMBER (9)
NAME VARCHAR2(20)
STATUS CHAR(4)
then the record length is 9 + (20 + 1) + 4 = 34 B.
2. Based on the above calculation, let us assume record lengths of:
Source A - 100 B
Source B - 200 B
3. The total numbers of records in your sources are:
Source A - 5 billion = 5000000000
Source B - 25 million = 25000000
4. Size estimation:
Source A = 5000000000 * 100 = 500000000000 B = 465.66 GB
Source B = 25000000 * 200 = 5000000000 B = 4.66 GB
Total = 470.32 GB
Conclusion - To execute this job you need to have at least 470.32 GB of disk space available for scratch.
If you are running this job 4 ways, and assuming your data is evenly partitioned, you should have at least 117.58 GB of disk space available on each node for scratch.
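If you want to plug in your own numbers, here is the same arithmetic as a small Python helper (the figures shown are just the ones from this thread):
Code: Select all
GB = 1024 ** 3

def scratch_estimate(sources, nodes=1):
    # sources: list of (row_count, record_length_in_bytes) pairs
    # nodes:   degree of parallelism, assuming even partitioning
    total_bytes = sum(rows * length for rows, length in sources)
    return total_bytes / GB, total_bytes / GB / nodes

total, per_node = scratch_estimate(
    [(5_000_000_000, 100),   # Source A: 5 billion rows at 100 B
     (25_000_000, 200)],     # Source B: 25 million rows at 200 B
    nodes=4)
print(f"total {total:.2f} GB, {per_node:.2f} GB per node")
# -> total 470.32 GB, 117.58 GB per node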
Hope it helps....
Assume everything I say or do is positive
Hi,
Ours is an item scalability project where we are testing with huge volumes of data. The objective is to find the breakpoints and then scale up the system accordingly. That's why I wanted to know how to estimate scratch space when doing a join.
As per my experience, in normal scenarios this space problem should not occur unless:
1. The lookup reference stream is huge and a sort is performed
2. The hash partitioning is skewed, so data is not evenly partitioned (a quick way to check this is sketched below)
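If you suspect the skew case, one quick sanity check outside DataStage is to count how a sample of join keys would spread across hash buckets (a rough Python illustration; set the bucket count to your node count):
Code: Select all
from collections import Counter

def skew_report(keys, nodes=4):
    # Count rows per hash partition for a sample of join keys.
    # A lopsided result means one node's scratch space fills up
    # long before the others do.
    # (Python's hash is only a stand-in for the real partitioner.)
    buckets = Counter(hash(k) % nodes for k in keys)
    return {p: buckets.get(p, 0) for p in range(nodes)}

print(skew_report(['A', 'A', 'A', 'B', 'C', 'A']))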
Thanks,
Ankita