Join of tables with huge volume of data - scratch space

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Chandrathdsx
Participant
Posts: 59
Joined: Sat Jul 05, 2008 11:32 am

Join of tables with huge volume of data - scratch space

Post by Chandrathdsx »

Hi,
I am doing a full outer join of two tables, where table1 has 20 million records and table2 has 1,000 million records [production data volumes].
Job Design:
SrcTbl1 ---> Join (hash-partitioned, sorted on join keys) ---> Transformer ---> TgtTbl
                ^
                |
SrcTbl2 --------+
I have only one scratch disk with 35GB of space.
Please advise whether this is the right approach given the huge volume of data:
- What are the implications -- will I run out of scratch space (the app server is Unix)?
- How do you monitor scratch space usage while the job is running? [any Unix commands? -- see the sketch after this list]
- Is there a way to stream records out to the Transformer and on to the target table instead of waiting for all records to be joined, so scratch space is freed up? Any environment variables to set?
- Would it help if I used a multi-node configuration?
- Or do I need to consider the alternative of joining the tables with custom SQL?
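
A minimal sketch of how scratch usage could be watched from the shell while the job runs; the mount point is taken from the configuration posted later in this thread, so treat the path as an assumption and substitute your own:

# Report free space and consumption of the scratch filesystem every 10 seconds.
while true
do
    df -k /DSproj/dev/Scratch    # free space left on the filesystem
    du -sk /DSproj/dev/Scratch   # space currently used under the scratch directory
    sleep 10
done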

Any help is highly appreciated.

Thank you!
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Re: Join of tables with huge volume of data - scratch space

Post by ArndW »

Once the data on the left and right sides of the join is sorted on the join keys, the join itself won't need excessive amounts of scratch space; it is the sorting that needs it. You didn't mention how many MB/GB your two tables are.
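
A rough rule of thumb for sizing that scratch space (an assumption about how the parallel sort spills to disk, not a documented guarantee): when the data does not fit in sort memory, each node spills roughly its share of the data being sorted, i.e.

    scratch per node ≈ (left input volume + right input volume) / number of nodes

so the total scratch consumed across all nodes approaches the combined volume of both join inputs.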
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

For joining large volumes of data, use a higher number of nodes.

Can you post the configuration file you are using for this job?
Chandrathdsx
Participant
Posts: 59
Joined: Sat Jul 05, 2008 11:32 am

Post by Chandrathdsx »

keshav0307 wrote: For joining large volumes of data, use a higher number of nodes.

Can you post the configuration file you are using for this job?
Here is my configuration (4 nodes on each of two servers, 8 nodes in total); please note that all nodes share a single scratch disk with 35 GB:
{
    node "node1"
    {
        fastname "ds01lp01"
        pools ""
        resource disk "/DSproj/dev/Datasets" {pools ""}
        resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "ds01lp01"
        pools ""
        resource disk "/DSproj/dev/Datasets" {pools ""}
        resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
    }
    node "node3"
    {
        fastname "ds01lp01"
        pools ""
        resource disk "/DSproj/dev/Datasets" {pools ""}
        resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
    }
    node "node4"
    {
        fastname "ds01lp01"
        pools ""
        resource disk "/DSproj/dev/Datasets" {pools ""}
        resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
    }
    node "node5"
    {
        fastname "ds01lp02"
        pools ""
        resource disk "/DSproj/dev/Datasets" {pools ""}
        resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
    }
    node "node6"
    {
        fastname "ds01lp02"
        pools ""
        resource disk "/DSproj/dev/Datasets" {pools ""}
        resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
    }
    node "node7"
    {
        fastname "ds01lp02"
        pools ""
        resource disk "/DSproj/dev/Datasets" {pools ""}
        resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
    }
    node "node8"
    {
        fastname "ds01lp02"
        pools ""
        resource disk "/DSproj/dev/Datasets" {pools ""}
        resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
    }
}
Thanks for the reply.
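
One way to ease the single-scratch-disk constraint is to declare several scratchdisk resources per node, sketched below in the same configuration syntax; the extra mount points /scratch2 and /scratch3 are hypothetical and would each need real space behind them:

node "node1"
{
    fastname "ds01lp01"
    pools ""
    resource disk "/DSproj/dev/Datasets" {pools ""}
    resource scratchdisk "/DSproj/dev/Scratch" {pools ""}
    resource scratchdisk "/scratch2" {pools ""}
    resource scratchdisk "/scratch3" {pools ""}
}

The engine can then spread its temporary files across all scratchdisk entries, so the usable spill capacity becomes the sum of those filesystems rather than a single 35 GB disk.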
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Unless your database is partitioned, going to more nodes in your configuration file is not going to make a positive difference.

So far the only size you've mentioned is the 35 GB of scratch. Since we have no idea of the sizes of the two data sources, it is impossible to make any suggestions or prognostications.
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Hi ArndW,
Do you mean table partitioning or index partitioning in Oracle? I have not heard of database partitioning; I may be wrong here.

Regards
Sreeni
Chandrathdsx
Participant
Posts: 59
Joined: Sat Jul 05, 2008 11:32 am

Post by Chandrathdsx »

ArndW wrote: Unless your database is partitioned, going to more nodes in your configuration file is not going to make a positive difference. So far the only size you've mentioned is the 35 GB of scratch. Since we have no idea of the sizes of the two data sources, it is impossible to make any suggestions or prognostications.

ArndW,
Here are some details based on production volume estimates:
Source table1: 20 million rows, about 5 GB [data size]
Source table2: 1,000 million rows, about 250 GB [data size]

Both source tables are in Teradata.

Please let me know if you need any more details.

Please confirm: does it need 255 GB of scratch space to join this data in DataStage at full volume? I cannot test with the full volume of data in development, but I do not want to design something that may fail in production.

Thank you for all your help; please suggest options.
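
Applying the rough rule of thumb from earlier in the thread to these numbers (an estimate, not a measurement):

    data to be sorted:  5 GB + 250 GB = 255 GB
    scratch available:  35 GB (per the original post)

The sort phase alone could therefore need on the order of seven times the scratch space available, so either the scratch space has to grow toward the combined input volume, or the sort and join have to be pushed down into the database.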
Chandrathdsx
Participant
Posts: 59
Joined: Sat Jul 05, 2008 11:32 am

Post by Chandrathdsx »

Any suggestions on this please?

Thank you!
hamzaqk
Participant
Posts: 249
Joined: Tue Apr 17, 2007 5:50 am
Location: Islamabad

Post by hamzaqk »

If I were you, I would do it by writing the SQL for the join in Teradata.
Teradata Certified Master V2R5
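
A minimal sketch of that approach, assuming hypothetical names (tbl1, tbl2, join_key, and the work table work_joined); the real keys and columns would come from the actual schemas:

CREATE TABLE work_joined AS
(
    SELECT COALESCE(a.join_key, b.join_key) AS join_key,
           a.col_from_tbl1,
           b.col_from_tbl2
    FROM tbl1 a
    FULL OUTER JOIN tbl2 b
        ON a.join_key = b.join_key
) WITH DATA;

Teradata then performs the sort and merge on its own AMPs, and the DataStage job only reads the pre-joined rows, so the 35 GB scratch disk never sees the 255 GB sort.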