85% Scratch disk usage on head node in Grid environment

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

DataStage_Sterling
Participant
Posts: 26
Joined: Wed Jul 17, 2013 9:00 am

85% Scratch disk usage on head node in Grid environment

Post by DataStage_Sterling »

We are migrating jobs from 7.5.1 to an 8.7 grid environment.

Job Design
Oracle Connector -> Sort (2 key columns) -> Transformer (record comparison, transformations and 4 links out) -> Funnel -> Remove Duplicates (based on 4 columns) -> Sequential File



Problem
1. The job processes about 120 million records and takes 8 hours to complete in a 2x2 grid environment
2. It fills 85% of the scratch space on the head node

Is there any way to avoid filling up the scratch disk? I would also like to know how to build scratch disk pools on the compute nodes.

Thank you
DataStage Sterling
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Can you show the APT file as created by the job?

There should be a setting in global_grid_values in your $GRIDHOME (or overridden at your project level) that deals with executing on the conductor. Can you tell us what that is? (Going off memory on that one, might be wrong.)

Did you disable the Head Node from accepting Grid jobs?

If you are using Platform LSF, type: bhosts
Then look to see whether that Head Node shows as "closed".
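For illustration only (the host names are just reused from the configuration file below and the numbers are made up), the check and the LSF admin command that closes a host to new jobs might look like this:

Code:

$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
zzz_ServerName     ok              -      8      5      5      0      0      0
xxx_ServerName     ok              -      8      3      3      0      0      0

$ badmin hclose zzz_ServerName

A host whose STATUS shows "closed" is not dispatched any new LSF jobs, so closing the head node keeps grid work off it (badmin hclose needs LSF administrator rights).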
DataStage_Sterling
Participant
Posts: 26
Joined: Wed Jul 17, 2013 9:00 am

Post by DataStage_Sterling »

Configuration File

Code:

IIS-DSEE-DYNG0014 <Dynamic_grid.sh>Information: SEQFILE Host(s): xxx_ServerName: xxx_ServerName:
{
 node "Conductor"
 {
  fastname "zzz_ServerName"
  pools "conductor"
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
 node "Compute1"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
 node "Compute2"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
 node "Compute3"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
 node "Compute4"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/opt/<scratch disk>" {pools ""}
 }
}
IIS-DSEE-OSHC0007 <osh_conductor>Information: Authorized to proceed.
Grid values from the job log

Code:

APT_GRID_COMPUTENODES=2
APT_GRID_CONFIG=
APT_GRID_ENABLE=YES
APT_GRID_IDENTIFIER=
APT_GRID_OPTS=
APT_GRID_PARTITIONS=2
APT_GRID_QUEUE=
APT_GRID_SCRIPTPOST=
APT_GRID_SCRIPTPRE=
APT_GRID_SEQFILE_HOST=
APT_GRID_SEQFILE_HOST2=
APT_GRID_STAT_CMD=
Yes, we are using Platform LSF, and the head node is not closed.
Last edited by DataStage_Sterling on Mon Mar 03, 2014 10:50 am, edited 1 time in total.
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

The scratch space on the compute nodes, minimum 25 GB, must be a local disk, not NAS- or NFS-mounted. It seems like your job's scratch processing is all being done on the head node. No wonder the job takes so long to finish.
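As a rough sketch only (the node name, fastname, path, and pool name are placeholders), a compute-node entry whose scratch points at local disk, optionally grouped into a named scratch pool, could look like this:

Code:

 node "Compute1"
 {
  fastname "xxx_ServerName"
  pools ""
  resource disk "/opt/<resource disk>" {pools ""}
  resource scratchdisk "/local/scratch" {pools "" "sort"}
 }

The important part is that /local/scratch is a local filesystem on the compute node. If memory serves, scratchdisk pools named "sort" and "buffer" are picked up first by the sort and buffering operators respectively, falling back to the default ("") pool otherwise.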
PaulVL
Premium Member
Premium Member
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Well, your APT_GRID_CONFIG must be set; otherwise the grid enablement toolkit will use your default.apt, which is NOT grid friendly. Didn't IBM explain that to your admins?

That is why you ran out of scratch. It also looks like the scratch and dataset areas you've been using are under the tool installation mount (which would also indicate that you may be using the default.apt).

I'm surprised you don't have APT_GRID_QUEUE defined either. It's not a requirement, but it's a best practice to set it; otherwise you'll be submitting to whatever the default queue is, probably the NORMAL queue.
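Purely as an illustration (both values are placeholders to be replaced with whatever your admins actually configured), the two settings might look like this when set at the project or job level:

Code:

APT_GRID_CONFIG=grid_global_config.apt
APT_GRID_QUEUE=datastage

With those in place the toolkit builds a grid-friendly configuration for each run instead of falling back to default.apt, and jobs are submitted to the named LSF queue rather than whatever the default queue happens to be.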

PM me who from IBM services helped you guys set up that grid. I don't think you got your money's worth.
At least they put the Conductor node in the "conductor" pool and not a blank one.

Are you on Platform LSF or LoadLeveler?
DataStage_Sterling
Participant
Posts: 26
Joined: Wed Jul 17, 2013 9:00 am

Post by DataStage_Sterling »

lstsaur wrote:The scratch space on the compute nodes, minimum 25 GB, must be a local disk, not NAS- or NFS-mounted. It seems like your job's scratch processing is all being done on the head node. No wonder the job takes so long to finish.
It seems that it was a local disk before, but for better maintenance and performance it was NFS-mounted.