
Join of tables with huge volume of data - scratch space

Posted: Wed Feb 17, 2010 11:14 am
by Chandrathdsx
Hi,
I am doing a full outer join of two tables: table1 has 20 million records and table2 has 1,000 million records. [Production data volumes]
Job Design:
SrcTbl1 ---> Join (hash partition and sort on join keys) ---> Transformer ---> Tgt Tbl
                |
SrcTbl2 --------
I have only one scratch disk with 35GB of space.
Please advise whether this is the right approach, given the huge volume of data:
- What are the implications -- will I run out of scratch space (the app server is Unix)?
- How do you monitor scratch space usage while the job is running? [Any Unix commands to monitor?]
- Is there a way to stream records out to the Transformer, and on to the target table, as they are joined instead of waiting for the whole join to finish, to free up scratch space? Any environment variables to set?
- Would it help if I use a multi-node configuration?
- Or do I need to consider the alternative of joining the tables with custom SQL?

Any help is highly appreciated.

Thank you!

Re: Join of tables with huge volume of data - scratch space

Posted: Wed Feb 17, 2010 11:40 am
by ArndW
Once the data for the left and right sides of the join is sorted on the join keys, the join itself won't need excessive amounts of scratch space; it is the sorting that needs scratch space. You didn't mention how many MB/GB your two data tables are.
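
As for watching scratch consumption while the job runs, ordinary Unix commands are enough. A minimal sketch, assuming your scratch filesystem is mounted at /DSproj/dev/Scratch (substitute your own path):

# Free space remaining on the scratch filesystem
df -k /DSproj/dev/Scratch

# How much has actually been written under the scratch directory, in KB
du -sk /DSproj/dev/Scratch

# Poll every 10 seconds for a rough live view while the job runs
while true; do df -k /DSproj/dev/Scratch; sleep 10; done

If df shows the filesystem filling toward its limit during the sort phase, you have your answer before the job aborts.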

Posted: Wed Feb 17, 2010 11:07 pm
by keshav0307
For joining a large volume of data, use more nodes.

Can you post the content of the configuration you are using for this job?

Posted: Thu Feb 18, 2010 10:53 am
by Chandrathdsx
keshav0307 wrote:For joining a large volume of data, use more nodes.

Can you post the content of the configuration you are using for this job?
Here is my largest configuration, 4 nodes on each of two servers (8 nodes in total). But please note that we have only one scratch disk with 35GB.
{
node "node1"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node2"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node3"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node4"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node5"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node6"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node7"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node8"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
}

Thanks for the reply.

Posted: Thu Feb 18, 2010 11:23 am
by ArndW
Unless your database is partitioned, going for more nodes in your configuration file is not going to make a positive difference.

So far the only size you've mentioned is the 35GB of scratch. Since we have no idea of the sizes of the two data sources, it is impossible to make any suggestions or prognostications.

Posted: Thu Feb 18, 2010 11:38 am
by Sreenivasulu
Hi ArndW,
Is it table partitioning or index partitioning in Oracle? I have not heard of database partitioning, so I may be wrong here.

Regards
Sreeni

Posted: Thu Feb 18, 2010 3:31 pm
by Chandrathdsx
ArndW wrote:Unless your database is partitioned, going for more nodes in your configuration file is not going to make a positive difference.

So far the only size you've mentioned is the 35GB of scratch. Since we have no idea of the sizes of the two data sources, it is impossible to make any suggestions or prognostications.
ArndW,
Here are some details based on production volume estimates:
Source table1: 20 million rows = 5 GB [data size]
Source table2: 1,000 million rows = around 250 GB [data size]

Both source tables are in Teradata.

Please let me know if you need any further details.

Please confirm: does it need 255 GB (5 GB + 250 GB) of scratch space to join this data in DataStage at full volume? I cannot test this with the full volume of data in development, but I do not want to design something that may fail in production.

Thank you for all your help, and please suggest some options.

Posted: Sun Feb 28, 2010 9:23 pm
by Chandrathdsx
Any suggestions on this please?

Thank you!

Posted: Sun Feb 28, 2010 10:23 pm
by hamzaqk
If I were you, I would do it by writing SQL for the join in Teradata.
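
A minimal sketch of what that could look like; the table and column names below (tbl1, tbl2, key1, key2, col_a, col_b) are placeholders for the real Teradata objects and join keys from the job:

-- Full outer join pushed down into Teradata; all names are placeholders.
SELECT COALESCE(t1.key1, t2.key1) AS key1,
       t1.col_a,
       t2.col_b
FROM tbl1 t1
FULL OUTER JOIN tbl2 t2
  ON t1.key1 = t2.key1
 AND t1.key2 = t2.key2;

That way the heavy sort happens in Teradata's own spool space rather than on the 35GB DataStage scratch disk, and the DataStage job only has to read the joined result.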