Join of tables with huge volume of data - scratch space
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 59
- Joined: Sat Jul 05, 2008 11:32 am
Join of tables with huge volume of data - scratch space
Hi,
I am doing a full outer join of two tables: table1 has 20 million records and table2 has 1,000 million records. [Production data volumes]
Job Design:
SrcTbl1--->Join(with hash and sort by join keys) -->Transformer-->Tgt Tbl
|
SrcTbl2 ------
I have only one scratch disk with 35GB of space.
Please advise whether this is the right approach, given the huge volume of data.
- What are the implications -- will I run out of scratch space (the app server is Unix)?
- How do you monitor scratch space usage while the job is running? [Any Unix commands to monitor it?]
- Is there a way to stream records out to the Transformer, and subsequently to the target table, as they are joined -- instead of waiting for all records to be joined -- so that scratch space is freed up? Any environment variables to set?
- Would it help if I used a multi-node configuration?
- Or should I consider the alternative of joining the tables with custom SQL?
Any help is highly appreciated.
Thank you!
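On the monitoring question: from a Unix shell you can watch the scratch filesystem with `df -k` or `du -sk` on the scratch directory in a loop while the job runs. A minimal sketch of the same check in Python (the default path here is a stand-in; substitute your actual scratch directory, e.g. /DSproj/dev/Scratch):

```python
# Minimal sketch: report free space on the scratch filesystem.
# "/tmp" is a stand-in path for illustration; pass the real scratch
# directory (e.g. "/DSproj/dev/Scratch") while the job is running.
import shutil

def scratch_free_gb(path: str = "/tmp") -> float:
    """Return free space on the filesystem holding `path`, in GB."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)

if __name__ == "__main__":
    print(f"free scratch: {scratch_free_gb():.1f} GB")
```

Running this on an interval (or simply `df -k /DSproj/dev/Scratch` under `watch`) shows whether the sort phase of the Join is eating into the 35 GB.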
Re: Join of tables with huge volume of data - scratch space
Once the data for the left and right inputs of the Join is sorted on the join keys, the join itself won't need excessive amounts of scratch space; the scratch space is needed for the sorting. You didn't mention how many MB/GB your two tables are.
-
- Premium Member
- Posts: 783
- Joined: Mon Jan 16, 2006 10:17 pm
- Location: Sydney, Australia
-
- Participant
- Posts: 59
- Joined: Sat Jul 05, 2008 11:32 am
keshav0307 wrote: for a large volume of data to join, use more nodes. Can you post the content of the configuration you are using for this job?

Here is my maximum 4x4 configuration (8 nodes in total), but please note that we have only one scratch disk with 35GB:
{
node "node1"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node2"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node3"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node4"
{
fastname "ds01lp01"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node5"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node6"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node7"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
node "node8"
{
fastname "ds01lp02"
pools ""
resource disk "/DSproj/dev/Datasets" {pools ""}
resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
}
}
Thanks for the reply.
Unless your Database is partitioned, going for more nodes in your configuration file is not going to make a positive difference.
So far the only size you've mentioned is the 35Gb of scratch. Since we have no idea of the sizes of the two data sources it is impossible to make any suggestions or prognostications.
-
- Premium Member
- Posts: 892
- Joined: Thu Oct 16, 2003 5:18 am
-
- Participant
- Posts: 59
- Joined: Sat Jul 05, 2008 11:32 am
ArndW wrote: Unless your Database is partitioned, going for more nodes in your configuration file is not going to make a positive difference. So far the only size you've mentioned is the 35Gb of scratch. Since we have no idea of the sizes of the two data sources it is impossible to make any suggestions or prognostications.

ArndW, here are some details based on production volume estimates:
Source table1: 20 million rows = about 5 GB [data size]
Source table2: 1,000 million rows = about 250 GB [data size]
Both source tables are in Teradata.
Please let me know if you need any more details.
Can you confirm whether DataStage needs roughly 255 GB of scratch space to join this data at full volume? I cannot test this with the full volume of data in development, but I do not want to design something that may fail in production.
Thank you for all your help; please suggest some options.
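Back-of-the-envelope arithmetic on the figures above, assuming (as a conservative rule of thumb, not a DataStage-documented formula) that the on-engine sort of each join input can spill roughly its full data size to scratch:

```python
# Rough estimate: an external sort of each Join input can spill
# approximately the full input size to scratch disk (assumption,
# conservative rule of thumb).
table1_gb = 5      # 20 million rows
table2_gb = 250    # 1,000 million rows
scratch_available_gb = 35

scratch_needed_gb = table1_gb + table2_gb
print(scratch_needed_gb)                         # 255
print(scratch_needed_gb > scratch_available_gb)  # True: 35 GB is not enough
```

On that estimate a single 35 GB scratch area would be exhausted long before the sort completes; delivering the data already sorted on the join keys (for example, sorted by the database) avoids most of that spill, as noted earlier in the thread.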