85% Scratch disk usage on head node in Grid environment

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

DataStage_Sterling
Participant
Posts: 26
Joined: Wed Jul 17, 2013 9:00 am

85% Scratch disk usage on head node in Grid environment

Post by DataStage_Sterling »

We are migrating jobs from 7.5.1 to an 8.7 grid environment.

Job Design
Oracle Connector -> Sort (2 key columns) -> Transformer (record comparison, transformations and 4 links out) -> Funnel -> Remove Duplicates (based on 4 columns) -> Sequential File



Problem
1. The job processes about 120 million records and takes 8 hours to complete in a 2x2 grid environment
2. It fills 85% of the scratch space on the head node

Is there any way to avoid filling up the scratch disk? I would also like to know how to build scratch disk pools on the compute nodes.

Thank you
DataStage Sterling
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Can you show the APT file as created by the job?

There should be a setting in global_grid_values in your $GRIDHOME (or overridden at your project level) that deals with executing on the conductor. Can you tell us what that is? (Going off memory on that one, might be wrong.)

Did you disable the Head Node from accepting Grid jobs?

If you are using Platform LSF, type: bhosts
Then look to see whether that Head Node shows as "closed".
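For illustration only (the host names are just reused from the configuration file below and the numbers are made up), the check and the LSF admin command that closes a host to new jobs might look like this:

Code:

$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
zzz_ServerName     ok              -      8      5      5      0      0      0
xxx_ServerName     ok              -      8      3      3      0      0      0

$ badmin hclose zzz_ServerName

A host whose STATUS shows "closed" is not dispatched any new LSF jobs, so closing the head node keeps grid work off it (badmin hclose needs LSF administrator rights).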
DataStage_Sterling
Participant
Posts: 26
Joined: Wed Jul 17, 2013 9:00 am

Post by DataStage_Sterling »

Configuration File

Code:

IIS-DSEE-DYNG0014 <Dynamic_grid.sh>Information: SEQFILE Host(s): xxx_ServerName: xxx_ServerName:
{
 node "Conductor"
 {
  fastname "zzz_ServerName"
  pools "conductor"
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
 node "Compute1"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
 node "Compute2"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
 node "Compute3"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
 node "Compute4"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
}
IIS-DSEE-OSHC0007 <osh_conductor>Information: Authorized to proceed.
Grid values from the job log

Code:

APT_GRID_COMPUTENODES=2
APT_GRID_CONFIG=
APT_GRID_ENABLE=YES
APT_GRID_IDENTIFIER=
APT_GRID_OPTS=
APT_GRID_PARTITIONS=2
APT_GRID_QUEUE=
APT_GRID_SCRIPTPOST=
APT_GRID_SCRIPTPRE=
APT_GRID_SEQFILE_HOST=
APT_GRID_SEQFILE_HOST2=
APT_GRID_STAT_CMD=
Yes, we are using Platform LSF, and the head node is not closed.
Last edited by DataStage_Sterling on Mon Mar 03, 2014 10:50 am, edited 1 time in total.
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

The scratch space on the compute nodes, minimum 25 GB, must be a local disk, not NAS- or NFS-mounted. It seems like your job's scratch processing is all being done on the head node. No wonder the job takes so long to finish.
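As a rough sketch only (the node name, fastname, path, and pool name are placeholders), a compute-node entry whose scratch points at local disk, optionally grouped into a named scratch pool, could look like this:

Code:

 node "Compute1"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/local/scratch" {pools "" "sort"}
 }

The important part is that /local/scratch is a local filesystem on the compute node. If memory serves, scratchdisk pools named "sort" and "buffer" are picked up first by the sort and buffering operators respectively, falling back to the default ("") pool otherwise.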
PaulVL
Premium Member
Premium Member
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Well, your APT_GRID_CONFIG must be set; otherwise the grid enablement toolkit will use your default.apt, which is NOT grid friendly. Didn't IBM explain that to your admins?

That is why you ran out of scratch. It also looks like the scratch and dataset areas you've been using are under the tool installation mount (which would also indicate that you may be using the default.apt).

I'm surprised you don't have APT_GRID_QUEUE defined either. It's not a requirement, but it's a best practice to set it; otherwise you'll be submitting to whatever the default queue is, probably the NORMAL queue.
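Purely as an illustration (both values are placeholders to be replaced with whatever your admins actually configured), the two settings might look like this when set at the project or job level:

Code:

APT_GRID_CONFIG=grid_global_config.apt
APT_GRID_QUEUE=datastage

With those in place the toolkit builds a grid-friendly configuration for each run instead of falling back to default.apt, and jobs are submitted to the named LSF queue rather than whatever the default queue happens to be.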

PM me who from IBM services helped you guys set up that grid. I don't think you got your money's worth.
At least they put the Conductor node in the "conductor" pool and not a blank one.

Are you on Platform LSF or LoadLeveler?
DataStage_Sterling
Participant
Posts: 26
Joined: Wed Jul 17, 2013 9:00 am

Post by DataStage_Sterling »

lstsaur wrote:The scratch space on the compute nodes, minimum 25 GB, must be a local disk, not NAS- or NFS-mounted. It seems like your job's scratch processing is all being done on the head node. No wonder the job takes so long to finish.
It seems that it was a local disk before, but for better maintenance and performance it was NFS-mounted.