Configuring a DataStage cluster

ppgoml · Post by **ppgoml** » Fri Jan 21, 2011 1:01 am

I am configuring a DataStage cluster.
I have two machines, A and B.
The OS is Red Hat Enterprise Linux 4.8 32-bit, and the DataStage version is 8.1 Fix Pack 2.
I installed the DataStage server on machine A
and set up trusted rsh on both A and B.

According to the "Planning, Installation, and Configuration Guide", there are two ways to add B as a processing node:

There are two ways to make the parallel engine available to all the nodes in an
MPP system:
1. You can globally cross-mount, typically via NFS, a single directory on a single
system containing the parallel engine software. This configuration makes
software upgrades more convenient than if the parallel engine components are
installed on all processing systems. If you are using NFS to globally mount the
directory, mount it using the hard or the hard, intr option. Do not mount it by
using the soft option. Start up times are faster if you copy the engine to each
node, however.
2. You can use a script to copy the parallel engine components to a directory with
the same path name on all processing systems that you designate for processing
parallel jobs.
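
The mount-option advice in option 1 can be checked mechanically. The following is a minimal sketch, assuming a Linux host whose mounts are listed in the `/proc/mounts` format (device, mount point, file system type, options). The function names, the sample NFS server name `nfshost`, and the export path are hypothetical and only serve the illustration.

```python
def nfs_mount_options(mounts_text, mount_point):
    """Return the option set of the NFS mount at mount_point, or None if absent."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if (len(fields) >= 4
                and fields[1] == mount_point
                and fields[2].startswith("nfs")):
            return set(fields[3].split(","))
    return None


def is_hard_mounted(mounts_text, mount_point):
    """True if mount_point is NFS-mounted with 'hard' and without 'soft'."""
    options = nfs_mount_options(mounts_text, mount_point)
    if options is None:
        return False
    return "hard" in options and "soft" not in options


# Hypothetical sample line; on a real node, read the text from /proc/mounts.
SAMPLE_MOUNTS = (
    "nfshost:/export/ds /opt/IBM/InformationServer nfs rw,hard,intr 0 0\n"
)
```

On a processing node, the same check could be run against the real mount table by passing the contents of `/proc/mounts` and the directory that holds the parallel engine.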

If I use copy-orchdist to copy the parallel engine components from A to B, I get the errors below when checking the configuration file in Designer.

##I IIS-DSEE-TFCN-00001 17:46:21(000) <main_program> 
IBM WebSphere DataStage Enterprise Edition 8.1.0.5809 
Copyright (c) 2001, 2005-2008 IBM Corporation. All rights reserved
 


##I IIS-DSEE-TFCN-00006 17:46:21(001) <main_program> conductor uname: -s=Linux; -r=2.6.18-194.el5; -v=#1 SMP Tue Mar 16 21:52:43 EDT 2010; -n=dstage2; -m=i686
##I IIS-DSEE-TCOA-00067 17:46:21(002) <main_program> OS charset: UTF-8.
##I IIS-DSEE-TCOA-00068 17:46:21(003) <main_program> Input charset: UTF-8.
##I IIS-DSEE-TFSC-00001 17:46:21(004) <main_program> APT configuration file: /opt/IBM/InformationServer/Server/Configurations/2-NODES.apt
##E IIS-DSEE-TFPM-00330 17:46:21(005) <main_program> The Section Leader on node node2 has terminated unexpectedly.
##F IIS-DSEE-TFPM-00113 17:49:07(000) <APT_CheckConfigOperator,0> Fatal Error: Unable to start ORCHESTRATE network connection on node node1(dstage2): COMPLETEWAIT failed: parallel APT_CheckConfigOperator(0,0)
##F IIS-DSEE-TFPM-00114 17:49:07(000) <APT_RealFileExportOperator in APT_FileExportOperator,0> Fatal Error: Unable to start ORCHESTRATE network connection on node node1 (dstage2):  APT_PMConnectionSetup:: operator 1(sequential APT_RealFileExportOperator in APT_FileExportOperator)timed out with 1 incomplete incoming connections.
##E IIS-DSEE-TFPM-00192 17:49:08(000) <node_node1> Player 1 terminated unexpectedly.
##E IIS-DSEE-TFPM-00338 17:49:08(000) <main_program> APT_PMsectionLeader(1, node1), player 1 - Unexpected exit status 1.
##E IIS-DSEE-TFPM-00192 17:49:08(001) <node_node1> Player 2 terminated unexpectedly.
##E IIS-DSEE-TFPM-00338 17:49:08(001) <main_program> APT_PMsectionLeader(1, node1), player 2 - Unexpected exit status 1.
##W IIS-DSEE-TFPM-00091 17:49:13(000) <main_program> APT_PMpollUntilZero: WARNING: called with counter = 0
##E IIS-DSEE-TFSC-00011 17:49:18(000) <main_program> Step execution finished with status = FAILED.
##E IIS-DSEE-TCOA-00069 17:49:18(001) <main_program> ERROR: check configuration file failed.

If I globally cross-mount via NFS, it works fine.

My customer dislikes using NFS. Is there a way to make the copy-orchdist approach work?

Thanks.

--
Jack

dougcl · Post by **dougcl** » Thu Jan 27, 2011 2:07 pm

Hi, we are also interested in this subject. We currently have one UNIX machine as the DS engine, which runs everything off a mounted directory.

Everything is in that directory: the install files, the project (we have one project), the dataset ds files, the dataset node files, the scratch area, and so on.

We would like to introduce a second engine, mount the same directory, share the same install, share the same dataset ds and node files, share the same project, and allocate jobs to it via a second config file. Has anyone done this? Since we are already coordinating job dependencies, I don't see why this shouldn't work, and it would be very simple, assuming our file system can keep up.

Thanks,
Doug

lstsaur · Post by **lstsaur** » Thu Jan 27, 2011 2:20 pm

Doug,
If you want to share everything between all nodes, you have to put them in a grid instead of a cluster.

ray.wurlod · Post by **ray.wurlod** » Thu Jan 27, 2011 3:57 pm

Can't you simply cross-mount, and set APT_ORCHHOME (and maybe other environment variables) on each machine?

dougcl · Post by **dougcl** » Fri Jan 28, 2011 11:29 am

ray.wurlod wrote: Can't you simply cross-mount, and set APT_ORCHHOME (and maybe other environment variables) on each machine?

Cross mounting the project is suggested here:

http://publib.boulder.ibm.com/infocente ... _dist.html

Sorry, it's unclear whether this is referring to the install, the project or both. I think it's referring to both. Also, I believe (from reading the documentation) that NFS is sufficient to support these shared filesystems. It is not a high throughput connection.

I am having some trouble understanding how to tie a job to a particular fastname. I want a job to run all its nodes on a given engine. I am assuming this is supported. Do we create multiple config files, each with different fastnames, or do we create one config file, omit the fastname and use APT_PM_CONDUCTOR_HOSTNAME to allocate jobs to engines?
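
For reference, the first option (one config file per engine) could be sketched as follows. This is only an illustration, assuming the usual parallel configuration file layout of `node` blocks with `fastname`, `pools`, and `resource` entries. The host names `engineA` and `engineB`, the directory paths, and the helper name `render_apt_config` are hypothetical. It does not settle which of the two options is preferable.

```python
def render_apt_config(fastname, node_count, disk_dir, scratch_dir):
    """Render a configuration file whose logical nodes all share one fastname."""
    blocks = []
    for i in range(1, node_count + 1):
        blocks.append(
            f'  node "node{i}"\n'
            "  {\n"
            f'    fastname "{fastname}"\n'
            '    pools ""\n'
            f'    resource disk "{disk_dir}" {{pools ""}}\n'
            f'    resource scratchdisk "{scratch_dir}" {{pools ""}}\n'
            "  }\n"
        )
    return "{\n" + "".join(blocks) + "}\n"


# Hypothetical engines and paths: build one configuration text per engine.
configs = {
    host: render_apt_config(host, 2, "/shared/datasets", "/shared/scratch")
    for host in ("engineA", "engineB")
}
```

Each generated text would be saved as its own `.apt` file, and a job would then select one of them through `APT_CONFIG_FILE`, so that all of its nodes resolve to a single engine.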