Join stage memory usage
Hi All,
Can anybody please explain the join operation to me in terms of memory usage?
As per my understanding, Join doesn't take all the data into memory the way Lookup does; instead it goes to the table (say both input links are Oracle sources) and reads the data page-wise. But if the join's input streams are hash partitioned and sorted, how will it use memory / scratch space?
Currently I am getting the following error with 3 billion rows in one input link:
Tsort merger aborting: Scratch space full
Please suggest!
Thanks,
Ankita
The Join stage uses hardly any memory at all. It takes in one row from its left input, and only those rows from the right input for which the join keys match.
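Conceptually it behaves like the merge phase of a sort-merge join. Here is a rough Python sketch of that pattern (my own illustration, not the actual operator code; all names are mine):
Code: Select all
import itertools

def merge_join(left, right, key):
    # Inner join of two inputs already sorted on the join key.
    # At any moment only one left row plus the group of right
    # rows sharing the current key are held in memory.
    right_groups = itertools.groupby(right, key)
    rk, grp = next(right_groups, (None, None))
    buffered = []                      # right rows for the current key
    for l in left:
        lk = key(l)
        while rk is not None and rk < lk:
            rk, grp = next(right_groups, (None, None))
            buffered = []
        if rk == lk:
            if not buffered:
                buffered = list(grp)   # materialise this key group once
            for r in buffered:
                yield (l, r)

left  = [(1, 'a'), (2, 'b'), (2, 'c'), (4, 'd')]
right = [(2, 'X'), (2, 'Y'), (3, 'Z')]
print(list(merge_join(left, right, key=lambda row: row[0])))
The memory footprint is one key group, not a whole dataset - which is why the cost you are seeing sits in the upstream sort, not in the join itself.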
Your error actually came from an (inserted?) tsort operator that precedes the join operator. Dump the score to see how this is connected. So it is the sorting process - in particular the merging of the heaps - where your job ran out of scratch space.
Configure more scratch space by adding more file systems as resource scratchdisk (for all partitions) in your configuration file. Adding more directories on the existing file systems will not help - it is the file systems that fill.
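For example, a node entry in the configuration file might gain an extra scratch file system like this (the host name and paths here are hypothetical; substitute your own):
Code: Select all
{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/scratch_fs1" {pools ""}
        resource scratchdisk "/scratch_fs2" {pools ""}
    }
}
Add the extra resource scratchdisk entry to every node so all partitions benefit, and make sure each entry really is a separate file system rather than another directory on the one that is filling.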
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
One thing a Join will do that a Lookup won't is insist that both input streams are sorted on the join keys. If your reference volume is low (say, less than 10 million rows) it may pay to use a Lookup instead and remove the sort requirement altogether; holding 10 million rows in memory is easier than pushing 3 billion rows through sort scratch space. A rough sketch of the difference follows.
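For contrast with the merge-join sketch above, a Lookup conceptually builds an in-memory table from the reference link and probes it while streaming the big link. Again, only a Python illustration, not the operator's code:
Code: Select all
def hash_lookup(stream, reference, key):
    # Build a hash table from the small reference input, then
    # probe it row by row from the big streaming input. Memory
    # cost is the reference data; the stream is never buffered.
    table = {}
    for r in reference:
        table.setdefault(key(r), []).append(r)
    for s in stream:
        for match in table.get(key(s), []):
            yield (s, match)
Neither input needs to be sorted, which is why the inserted tsort (and its scratch space demand) disappears.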
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
Thanks for your replies, but I am still not able to solve it.
Since the data volume is more than 10 million rows, Join is used. Now I need to estimate the additional scratch space needed to sort a specific volume of records (say 5 billion rows in one input stream and 25 million in the other).
I need to understand how much data DataStage will pull into scratch / memory at a time in order to sort the input streams of the join.
You can calculate the disk requirement for your job based on the record length from each source. This is a rough calculation, but it will at least give you some idea. The procedure is as follows:
1. Record length calculation - For each source, add up the lengths of all columns. If a column's datatype is variable length, add 1 byte of overhead. For example, if your source has 3 columns:
Code: Select all
ID NUMBER (9)
NAME VARCHAR2(20)
STATUS CHAR(4)
then the record length is 9 + (20 + 1) + 4 = 34 B.
2. Based on the above calculation, let us assume record lengths of:
Source A - 100 B
Source B - 200 B
3. The total numbers of records in your sources are:
Source A - 5 billion = 5000000000
Source B - 25 million = 25000000
4. Size estimation:
Source A = 5000000000 * 100 = 500000000000 B = 465.66 GB
Source B = 25000000 * 200 = 5000000000 B = 4.66 GB
Total = 470.32 GB
Conclusion - To execute this job you need to have at least 470.32 GB of disk space available for scratch.
If you are running this job 4 ways, and assuming your data is evenly partitioned, you should have at least 117.58 GB of disk space available on each node for scratch.
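If you want to plug in your own numbers, here is the same arithmetic as a small Python helper (the figures shown are just the ones from this thread):
Code: Select all
GB = 1024 ** 3

def scratch_estimate(sources, nodes=1):
    # sources: list of (row_count, record_length_in_bytes) pairs
    # nodes:   degree of parallelism, assuming even partitioning
    total_bytes = sum(rows * length for rows, length in sources)
    return total_bytes / GB, total_bytes / GB / nodes

total, per_node = scratch_estimate(
    [(5_000_000_000, 100),   # Source A: 5 billion rows at 100 B
     (25_000_000, 200)],     # Source B: 25 million rows at 200 B
    nodes=4)
print(f"total {total:.2f} GB, {per_node:.2f} GB per node")
# -> total 470.32 GB, 117.58 GB per node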
Hope it helps....
Assume everything I say or do is positive
Hi,
Ours is an item scalability project where we are testing with huge volumes of data. The objective is to find the breakpoints and then scale up the system accordingly. That's why I wanted to know how to estimate scratch space when doing a join.
As per my experience, in normal scenarios this space problem should not occur unless:
1. The lookup reference stream is huge and a sort is performed
2. The hash partitioning is skewed, so data is not evenly partitioned (a quick way to check this is sketched below)
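If you suspect the skew case, one quick sanity check outside DataStage is to count how a sample of join keys would spread across hash buckets (a rough Python illustration; set the bucket count to your node count):
Code: Select all
from collections import Counter

def skew_report(keys, nodes=4):
    # Count rows per hash partition for a sample of join keys.
    # A lopsided result means one node's scratch space fills up
    # long before the others do.
    # (Python's hash is only a stand-in for the real partitioner.)
    buckets = Counter(hash(k) % nodes for k in keys)
    return {p: buckets.get(p, 0) for p in range(nodes)}

print(skew_report(['A', 'A', 'A', 'B', 'C', 'A']))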
Thanks,
Ankita