Join of tables with huge volume of data - scratch space
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 59
- Joined: Sat Jul 05, 2008 11:32 am
Join of tables with huge volume of data - scratch space
Hi,
I am doing a full outer join of two tables: table1 has 20 million records and table2 has 1,000 million records. [Production data volumes]
Job Design:
SrcTbl1--->Join(with hash and sort by join keys) -->Transformer-->Tgt Tbl
|
SrcTbl2 ------
I have only one scratch disk with 35GB of space.
Please advise whether this is the right approach, given the huge volume of data.
- What are the implications -- will I run out of scratch space (the app server is Unix)?
- How do you monitor scratch space usage while the job is running? [Any Unix commands to monitor it?]
- Is there a way to stream records out to the Transformer, and subsequently to the target table, as they are joined -- instead of waiting for all records to be joined -- so that scratch space is freed up? Any environment variables to set?
- Would it help if I used a multi-node configuration?
- Or should I consider the alternative of joining the tables with custom SQL?
Any help is highly appreciated.
Thank you!
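On the monitoring question: from a Unix shell you can watch the scratch filesystem with `df -k` or `du -sk` on the scratch directory in a loop while the job runs. A minimal sketch of the same check in Python (the default path here is a stand-in; substitute your actual scratch directory, e.g. /DSproj/dev/Scratch):

```python
# Minimal sketch: report free space on the scratch filesystem.
# "/tmp" is a stand-in path for illustration; pass the real scratch
# directory (e.g. "/DSproj/dev/Scratch") while the job is running.
import shutil

def scratch_free_gb(path: str = "/tmp") -> float:
    """Return free space on the filesystem holding `path`, in GB."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)

if __name__ == "__main__":
    print(f"free scratch: {scratch_free_gb():.1f} GB")
```

Running this on an interval (or simply `df -k /DSproj/dev/Scratch` under `watch`) shows whether the sort phase of the Join is eating into the 35 GB.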
Re: Join of tables with huge volume of data - scratch space
Once the data for the left and right inputs of the Join is sorted on the join keys, the join itself won't need excessive amounts of scratch space; the scratch space is needed for the sorting. You didn't mention how many MB/GB your two tables are.
-
- Premium Member
- Posts: 783
- Joined: Mon Jan 16, 2006 10:17 pm
- Location: Sydney, Australia
-
- Participant
- Posts: 59
- Joined: Sat Jul 05, 2008 11:32 am
keshav0307 wrote: for a large volume of data to join, use more nodes. Can you post the content of the configuration you are using for this job?

Here is my maximum 4x4 configuration (8 nodes in total), but please note that we have only one scratch disk with 35GB:
{
node "node1"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node2"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node3"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node4"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node5"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node6"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node7"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node8"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
}
Thanks for the reply.
Unless your Database is partitioned, going for more nodes in your configuration file is not going to make a positive difference.
So far the only size you've mentioned is the 35Gb of scratch. Since we have no idea of the sizes of the two data sources it is impossible to make any suggestions or prognostications.
-
- Premium Member
- Posts: 892
- Joined: Thu Oct 16, 2003 5:18 am
-
- Participant
- Posts: 59
- Joined: Sat Jul 05, 2008 11:32 am
ArndW wrote: Unless your Database is partitioned, going for more nodes in your configuration file is not going to make a positive difference. So far the only size you've mentioned is the 35Gb of scratch. Since we have no idea of the sizes of the two data sources it is impossible to make any suggestions or prognostications.

ArndW, here are some details based on production volume estimates:
Source table1: 20 million rows = about 5 GB [data size]
Source table2: 1,000 million rows = about 250 GB [data size]
Both source tables are in Teradata.
Please let me know if you need any more details.
Can you confirm whether DataStage needs roughly 255 GB of scratch space to join this data at full volume? I cannot test this with the full volume of data in development, but I do not want to design something that may fail in production.
Thank you for all your help; please suggest some options.
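Back-of-the-envelope arithmetic on the figures above, assuming (as a conservative rule of thumb, not a DataStage-documented formula) that the on-engine sort of each join input can spill roughly its full data size to scratch:

```python
# Rough estimate: an external sort of each Join input can spill
# approximately the full input size to scratch disk (assumption,
# conservative rule of thumb).
table1_gb = 5      # 20 million rows
table2_gb = 250    # 1,000 million rows
scratch_available_gb = 35

scratch_needed_gb = table1_gb + table2_gb
print(scratch_needed_gb)                         # 255
print(scratch_needed_gb > scratch_available_gb)  # True: 35 GB is not enough
```

On that estimate a single 35 GB scratch area would be exhausted long before the sort completes; delivering the data already sorted on the join keys (for example, sorted by the database) avoids most of that spill, as noted earlier in the thread.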