Join stage memory usage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ankita
Participant
Posts: 57
Joined: Sun Nov 13, 2005 11:17 pm

Join stage memory usage

Post by ankita »

Hi All,
Can anybody please explain the join operation to me, in terms of memory usage?
As per my understanding, join doesn't take all the data into memory the way lookup does; instead it goes to the table (say both input links are Oracle sources) and reads the data page-wise. But if the join's input streams are hash partitioned and sorted, how will it use memory / scratch space?

Currently I am getting the following error with 3 billion rows on one input link:
Tsort merger aborting: Scratch space full

Please suggest!

Thanks,
Ankita
samsuf2002
Premium Member
Posts: 397
Joined: Wed Apr 12, 2006 2:28 pm
Location: Tennessee

Post by samsuf2002 »

You need to increase your scratch space. If you have a large volume of data in the lookup table, it is better to use a Join stage instead of a Lookup stage.
hi sam here
ankita
Participant
Posts: 57
Joined: Sun Nov 13, 2005 11:17 pm

Post by ankita »

Yes, I understand that. I need to do some capacity planning for that.

But how much data will be in memory while using a join, with the inputs hash partitioned and sorted? With 4 nodes, is it the input stream divided by 4?

Please let me know if you have any suggestions regarding capacity planning.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The Join stage uses hardly any memory at all. It takes in one row from its left input, and only those rows from the right input for which the join keys match.

Your error actually came from an (inserted?) tsort operator that precedes the join operator. Dump the score to see how this is connected. So it is the sorting process - in particular the merging of the heaps - where your job ran out of scratch space.

Configure more scratch space by adding more file systems as resource scratchdisk (for all partitions) in your configuration file. Adding more directories on the existing file systems will not help - it is the file systems that fill.
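To illustrate, a sketch of what one node entry in the parallel configuration file (the file pointed to by APT_CONFIG_FILE) might look like; the host name and paths here are purely illustrative assumptions, not values from this thread:

```
{
    node "node1" {
        fastname "etl_server"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/scratch_fs1/ds" {pools ""}
        resource scratchdisk "/scratch_fs2/ds" {pools ""}
    }
}
```

The point, per the advice above, is that each extra resource scratchdisk entry should sit on a separate file system (and appear in every node's entry); adding more directories on a file system that is already full gains nothing.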
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

One thing a join does that a lookup won't is insist that both input streams are sorted. If your lookup volume is low (say, fewer than 10 million rows) it may pay to use a lookup instead, to remove this sort requirement. That is easier than pushing 3 billion rows through sort scratch space.
ankita
Participant
Posts: 57
Joined: Sun Nov 13, 2005 11:17 pm

Post by ankita »

Thanks for your replies, but I am still not able to solve it.

Since the data volume is more than 10 million rows, a Join is used. Now I need to estimate the additional scratch space needed to sort a specific volume of records (say 5 billion rows in one input stream and 25 million in the other).
I need to understand how much data DataStage will pull into scratch / memory at a time in order to sort the input streams of the join.
csrazdan
Participant
Posts: 127
Joined: Wed May 12, 2004 6:03 pm
Location: Chicago IL

Post by csrazdan »

ankita wrote:Thanks for your replies, but I am still not able to solve it.

Since the data volume is more than 10 million rows, a Join is used. Now I need to estimate the additional scratch space needed to sort a specific volume of records (say 5 billion rows in one input stream and 25 million in the other).
I need to understand how much data DataStage will pull into scratch / memory at a time in order to sort the input streams of the join.
You can calculate disk requirement for your job based on the record length from each source. This is a rough calculation but at least will give you some idea. Following is the procedure:

1. Record length calculation - For each source, add up the lengths of all columns. If the datatype of a column is variable length, add 1 for overhead. For example, if your source has 3 columns:

Code: Select all

           ID         NUMBER (9)
           NAME    VARCHAR2(20)
           STATUS CHAR(4) 
     
The record length of this source will be 9 + 20 + 4 = 33 B, plus 1 B of overhead for the variable-length NAME column = 34 B
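That step can be sketched in a few lines of Python; the column list mirrors the example above, and the +1 variable-length overhead is the rough rule of thumb from this post, not an exact DataStage figure:

```python
# Rough record-length estimate: fixed-width columns contribute their
# declared length; variable-length columns add 1 byte of overhead.
# Column spec: (name, declared_length_bytes, is_variable_length)
columns = [
    ("ID", 9, False),       # NUMBER(9)
    ("NAME", 20, True),     # VARCHAR2(20) -> +1 overhead
    ("STATUS", 4, False),   # CHAR(4)
]

record_length = sum(length + (1 if variable else 0)
                    for _name, length, variable in columns)
print(record_length)  # 34
```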

2. Based on the above calculation, let us assume the record lengths are:
Source A - 100 B
Source B - 200 B

3. The total number of records in your sources:
Source A - 5 billion = 5,000,000,000
Source B - 25 million = 25,000,000

4. Size estimation:
Source A = 5,000,000,000 * 100 = 500,000,000,000 B = 465.66 GB
Source B = 25,000,000 * 200 = 5,000,000,000 B = 4.66 GB

Total = 470.32 GB

Conclusion - To execute this job you need at least 470.32 GB of disk space available for scratch.

If you are running this job 4 ways, and assuming your data is evenly partitioned, you should have at least 117.58 GB of disk space available on each node for scratch.
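The same estimate as a minimal Python sketch, using the figures from this thread (5 billion rows at 100 B, 25 million rows at 200 B); the 2^30-byte GB and the even 4-way split are assumptions:

```python
def scratch_gb(rows, record_length_bytes):
    """Rough scratch-space estimate for sorting one source, in GB (2**30 B)."""
    return rows * record_length_bytes / 1024 ** 3

source_a = scratch_gb(5_000_000_000, 100)  # ~465.66 GB
source_b = scratch_gb(25_000_000, 200)     # ~4.66 GB
total = source_a + source_b                # ~470.32 GB
per_node = total / 4                       # ~117.58 GB per node, 4-way even split
print(round(total, 2), round(per_node, 2))
```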

Hope it helps....
Assume everything I say or do is positive
ankita
Participant
Posts: 57
Joined: Sun Nov 13, 2005 11:17 pm

Post by ankita »

Thanks for the explanation ! I got the idea .
Ankita
pravin1581
Premium Member
Posts: 497
Joined: Sun Dec 17, 2006 11:52 pm
Location: Kolkata
Contact:

Post by pravin1581 »

ankita wrote:Thanks for the explanation ! I got the idea .
Hi Ankita, please share the idea regarding resolving this problem; we are also facing the same problem.
ankita
Participant
Posts: 57
Joined: Sun Nov 13, 2005 11:17 pm

Post by ankita »

Hi,
Ours is an item scalability project where we are testing with huge volumes of data. The objective is to find the breaking points and then scale up the system accordingly. That's why I wanted to know how to estimate scratch space when doing a join.
In my experience, in normal scenarios this space problem should not occur unless:
1. The lookup/join reference stream is huge and a sort is performed
2. The hash partitioning is skewed, so data is not evenly distributed
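Condition 2 can be checked quantitatively from per-partition row counts (e.g. as read off the job monitor); the counts below are made-up illustrative figures:

```python
# Per-partition row counts for one input link (illustrative figures only).
counts = [1_210_000_000, 1_260_000_000, 1_250_000_000, 1_280_000_000]

avg = sum(counts) / len(counts)
skew = max(counts) / avg  # 1.0 means a perfectly even spread
print(f"skew factor: {skew:.3f}")  # skew factor: 1.024
```

A skew factor well above 1 means the busiest partition needs proportionally more scratch space than the even-split estimate suggests.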

Thanks,
Ankita