NODES AND SCRATCH ON SAN DISKS

srireddypunuru · Post by **srireddypunuru** » Mon Feb 06, 2012 2:05 pm

Team,

We are running IS 8.5 on a 4-CPU Windows box. The D and E drives are both on SAN, and IS 8.5 is installed on D.

The E drive has 200 GB of free space, and our 4-node config file points to it.

The issues we are facing:

1) Lots of OSH processes are being created, and jobs are failing with errors such as "Unable to allocate resources".
2) IBM gave us a patch, JR41358_PXE_windows_8501.

They are also suggesting that we reduce the config file to 2 nodes.

Below is our config file. Our job designs aren't that complex, but reducing the nodes to 2 will increase the execution time of the jobs.

Any thoughts on our situation would be really appreciated.

main_program: APT configuration file: D:/IBM/InformationServer/Server/Configurations/default.apt


{	node "node1"
	{
		fastname "ms"
		pools ""
		resource disk "E:/IBM/InformationServer/Server/Datasets" {pools ""}
		resource scratchdisk "E:/IBM/InformationServer/Server/Scratch" {pools ""}
	}

	node "node2"
	{
		fastname "ms979"
		pools ""
		resource disk "E:/IBM/InformationServer/Server/Datasets" {pools ""}
		resource scratchdisk "E:/IBM/InformationServer/Server/Scratch" {pools ""}
	}
	node "node3"
	{
		fastname "ms979"
		pools ""
		resource disk "E:/IBM/InformationServer/Server/Datasets" {pools ""}
		resource scratchdisk "E:/IBM/InformationServer/Server/Scratch" {pools ""}
	}
	node "node4"
	{
		fastname "ms979"
		pools ""
		resource disk "E:/IBM/InformationServer/Server/Datasets" {pools ""}
		resource scratchdisk "E:/IBM/InformationServer/Server/Scratch" {pools ""}
	}
}


qt_ky · Post by **qt_ky** » Mon Feb 06, 2012 8:24 pm

The key would be finding out which resources it is unable to allocate.

Using SAN should not be an issue.

If your disk is full, the error makes sense. If not, it could be referring to memory. Do you have a lot of old, large files filling up your 200GB?

Find out why they recommend 2 nodes.

Are there any clues in the README from the patch they gave you?

srireddypunuru · Post by **srireddypunuru** » Mon Feb 06, 2012 9:09 pm

Thanks Eric,

1) I have cleaned up the disk; out of 200 GB, I now have 150 GB free.

Please find the README file from the patch below.

PATCH FOR APAR : JR41358
PATCH NAME : patch_JR41358_PXE_windows_8501
ENGINEERING TEAM : Parallel Framework
COMPONENT : PXEngine
TIERS : Engine
OPERATING SYSTEM : windows 32 & 64 bit
SUITE VERSION* : 8.5.0.1
UNINSTALL** : Supported
RECOMPILE JOBS : None Required

* This patch requires that IBM Information Server suite and component be
installed at the exact level shown and no other.

** If the patch can be uninstalled (see above) and you need to uninstall it,
see the patch installation instructions for information on uninstalling.

PROBLEM:
Intermittent job failure when using shared memory for interprocess communication.
Jobs fail with the following fatal error:
Unable to initialize communication channel on XXXX. This is typically caused by a
configuration problem. Examples of typical problems include:
1) The temporary directory, identified by $TMPDIR and/or the scratch disks in your
ORCHESTRATE configuration, is located on a non-local file system (e. g. mounted over NFS).
2) The temporary directory is located on a file system with insufficient space.

RESOLUTION:
Fixed an issue with shared memory file name string handling.

I have 4 CPUs, and they recommended 2 nodes, saying that the rule of thumb is N-2 nodes for N CPUs in general for small development systems like ours.

Thanks
Sri
Cincinnati OH

vmcburney · Post by **vmcburney** » Tue Feb 07, 2012 12:02 am

How much RAM do you have? Are you doing RAM-intensive jobs such as large Lookups? Are you running low-volume jobs against that config file? That results in a hell of a lot of useless data partitioning and repartitioning and lots of processes you don't need. Consider having a single-node config file that is used for low-volume jobs - that may free up resources to run your high-volume jobs across four nodes. I would expect that only a small percentage of your jobs need to run on a 4-node config.
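As a rough illustration of the single-node suggestion, here is a minimal Python sketch that renders a config file with a chosen number of logical nodes, using the same layout as the file posted above. The function name `render_config` is hypothetical, and the host name and paths are simply copied from the original post; adjust them to your own server.

```python
def render_config(fastname, dataset_dir, scratch_dir, nodes=1):
    """Return APT configuration text with `nodes` logical nodes on one host.

    Hypothetical helper for illustration; every node shares the same
    dataset and scratch directories, as in the posted 4-node file.
    """
    blocks = []
    for i in range(1, nodes + 1):
        blocks.append(
            f'\tnode "node{i}"\n'
            "\t{\n"
            f'\t\tfastname "{fastname}"\n'
            '\t\tpools ""\n'
            f'\t\tresource disk "{dataset_dir}" {{pools ""}}\n'
            f'\t\tresource scratchdisk "{scratch_dir}" {{pools ""}}\n'
            "\t}\n"
        )
    return "{\n" + "".join(blocks) + "}\n"


if __name__ == "__main__":
    # Values taken from the config file posted earlier in the thread.
    print(render_config(
        "ms979",
        "E:/IBM/InformationServer/Server/Datasets",
        "E:/IBM/InformationServer/Server/Scratch",
        nodes=1,
    ))
```

The output could be saved next to default.apt under a separate name, and low-volume jobs pointed at it through the APT_CONFIG_FILE environment variable, leaving the 4-node file for the few jobs that need it.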

srireddypunuru · Post by **srireddypunuru** » Tue Feb 07, 2012 9:53 am

The IBM team was suggesting that we create 4 mount points on 4 LUNs:


{
node "node1"                                                  SAN
 {
     fastname "servername"
     pools ""
     resource disk "/datasets/d1" {pools ""}         -------->  MOUNT POINT 1
     resource scratchdisk "/scratch/s1" {pools ""}   -------->  MOUNT POINT 2
 }
node "node2"
 {
     fastname "servername"
     pools ""
     resource disk "/datasets/d2" {pools ""}         -------->  MOUNT POINT 3
     resource scratchdisk "/scratch/s2" {pools ""}   -------->  MOUNT POINT 4
 }
}

Not sure what they mean. Vincent, can you throw some light on this?

kwwilliams · Post by **kwwilliams** » Tue Feb 07, 2012 11:14 am

Take a step back and look at your development environment as a whole -- is the setup adequate to meet your needs and was it architected correctly?

1. How many developers do you have working concurrently (don't care about total number, but how many are working at the same time)?

2. How many jobs do you have currently? How many jobs do you anticipate will be created per month?

3. Is this happening in all of your environments or just one in particular? Your last note seems to state this is a development environment, which is why I am asking -- I size them differently because they have different needs.

4. Piggybacking on Vincent's comments - do you have large normal lookups in your jobs that require a lot of memory?

5. Do you have large sorts that require copious amounts of scratch space?

6. Do you have large datasets that require large amounts of resource disk?

The questions could go on, but those are the things that I look for when getting the approximate size of a system. You could also ask your IBM representative to bring someone in to evaluate your needs.

I'm just outside of Cincinnati - Let me know if you can't clear your issues up. I would be happy to talk by phone - or drop by at some point to discuss.

kwwilliams · Post by **kwwilliams** » Tue Feb 07, 2012 11:28 am

Think of a LUN as a device. They want your storage admins to create four different LUNs (think devices) that you will then mount on your Windows server. The idea is that with four different devices you will not have contention between scratch (node1), scratch (node2), resource (node1), and resource (node2), which will make your system faster.

I doubt it is causing your allocation issue.
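To see where a config file has several nodes pointing at the same location, a small Python sketch along these lines may help. It is an assumption-laden illustration: `shared_paths` is a hypothetical helper, it only does simple line-by-line pattern matching, not a full parse of the config syntax, and identical paths are only a proxy for a shared device, since different paths can still sit on one LUN.

```python
import re
from collections import defaultdict

# Simple patterns for the two kinds of line we care about (not a full parser).
NODE = re.compile(r'\bnode\s+"([^"]+)"')
RESOURCE = re.compile(r'\bresource\s+(disk|scratchdisk)\s+"([^"]+)"', re.IGNORECASE)


def shared_paths(config_text):
    """Map (kind, path) to the node names using it, for paths used by 2+ nodes."""
    users = defaultdict(list)
    current = None
    for line in config_text.splitlines():
        node = NODE.search(line)
        if node:
            current = node.group(1)
            continue
        res = RESOURCE.search(line)
        if res and current is not None:
            users[(res.group(1).lower(), res.group(2))].append(current)
    return {key: names for key, names in users.items() if len(names) > 1}


if __name__ == "__main__":
    # Path to the config file as given in the original post.
    with open("D:/IBM/InformationServer/Server/Configurations/default.apt") as f:
        for (kind, path), names in shared_paths(f.read()).items():
            print(f"{kind} {path} is shared by {', '.join(names)}")
```

Run against the 4-node file from the first post, this would list the single Datasets and Scratch directories as shared by all four nodes, which is the situation the separate mount points are meant to avoid.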

ray.wurlod · Post by **ray.wurlod** » Tue Feb 07, 2012 6:30 pm

I reckon the second suggestion in the error message warrants closer examination. What is your temporary directory? How much free space exists on its file system?

Change the value of TMPDIR environment variable so that it points to a directory on a file system with lots more space than that of /tmp.

If you're running server jobs, also change UVTEMP in the uvconfig file and regen the shared memory image.
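A quick way to answer the free-space question is to check the temporary directory and the scratch directories together. The following is a minimal Python sketch, assuming it is run on the engine server; the helper names are hypothetical, and the script sees the TMPDIR of its own process, which may differ from the one the DataStage engine uses at run time.

```python
import os
import shutil
import tempfile


def temp_and_scratch_paths(scratch_dirs):
    """Return TMPDIR (or the platform default temp dir) plus the given scratch dirs."""
    tmpdir = os.environ.get("TMPDIR") or tempfile.gettempdir()
    return [tmpdir, *scratch_dirs]


def free_space_report(paths):
    """Return {path: free space in GiB, or None if the path is not accessible}."""
    report = {}
    for path in paths:
        try:
            report[path] = shutil.disk_usage(path).free / 2**30
        except OSError:
            report[path] = None
    return report


if __name__ == "__main__":
    # Scratch path taken from the config file posted earlier in the thread.
    paths = temp_and_scratch_paths(["E:/IBM/InformationServer/Server/Scratch"])
    for path, free_gib in free_space_report(paths).items():
        status = "not accessible" if free_gib is None else f"{free_gib:.1f} GiB free"
        print(f"{path}: {status}")
```

If the temporary directory turns out to be on a small or nearly full file system, that would fit the second cause listed in the patch README.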

qt_ky · Post by **qt_ky** » Fri Feb 10, 2012 1:59 pm

I'm just outside of Cincinnati too (about an hour).

Check TMPDIR like Ray and the README both suggested. The paths in your config file are not the only paths used by the application when creating temporary files.

Also install the patch if you can, then give an update.

DSXchange