
How to set up DataStage for an MPP system

Posted: Tue Dec 06, 2005 11:22 pm
by dsusr
Hi All,

We have three Linux servers, each with 4 CPUs. DataStage PX is installed on all of them. I want to connect these servers and use them together for parallel processing.

From the Manager guide I gathered that to prepare DataStage to work on an MPP system we just need to change the configuration file. I just want to know what type of connectivity needs to be open between the servers, since as of now only ssh is enabled.

Do we need to have telnet enabled between these systems for connectivity?


Posted: Tue Dec 06, 2005 11:43 pm
by ray.wurlod
You don't need telnet, but you do need TCP/IP, and the ability for processes to communicate via sockets. You should use as fast a network connectivity as you can get - not less than 1 Gbit imho.

You didn't actually need the full DataStage PX installed on each, unless you want to run separate server jobs on each or use BASIC Transformer stage or server Shared Containers on all nodes.
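
A quick way to confirm that processes on one machine can reach another over TCP sockets is to attempt a connection from a small script. This is only a minimal sketch: the function name `can_connect` is made up for illustration, and the host name and port in the usage note are placeholders, not values DataStage itself requires.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and opens the socket in one step
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False
```

For example, `can_connect("server2", 22)` would check the ssh port on a hypothetical second server; a False result only says that this one port was unreachable from this one machine.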

Posted: Wed Dec 07, 2005 4:45 am
by dsusr

I have modified the configuration file and added one node that represents the second server. But when I run my job I get the following error:

main_program: The section leader on <server2> died

main_program: **** Parallel startup failed ****
This is usually due to a configuration error, such as not having the Orchestrate install directory properly mounted on all nodes, rsh permissions not correctly set (via /etc/hosts.equiv or .rhosts), or running from a directory that is not mounted on all nodes. Look for error messages in the preceding output.

I have even checked rsh, and it is properly installed on the system.


Posted: Wed Dec 07, 2005 6:10 am
by Eric
1) Have you installed DataStage PX into the same path on all machines?
2) Have you tested rsh using the machine names in your APT_CONFIG_FILE configuration file?
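
For the second check, the machine names can be pulled straight out of the configuration file so that rsh is tested with exactly the names the engine will use. The following is a rough sketch, not an official tool: the helper names are invented here, and it assumes each node entry carries a `fastname "..."` line as parallel configuration files normally do.

```python
import re

def fastnames(config_text):
    """Return the fastname values in the order they appear in the file text."""
    return re.findall(r'fastname\s+"([^"]+)"', config_text)

def rsh_test_commands(config_text, remote_command="uptime"):
    """Build one rsh command line per distinct fastname, preserving order."""
    seen = []
    for name in fastnames(config_text):
        if name not in seen:
            seen.append(name)
    return [f"rsh {name} {remote_command}" for name in seen]
```

Running each returned command as the DataStage user, from every machine in the configuration, shows whether rsh works for the names as written rather than for some alias.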

Posted: Wed Dec 07, 2005 1:18 pm
by ray.wurlod
Read the error message carefully, then read the Install and Upgrade Guide and complete any steps you missed (such as permitting processes on one machine to execute on another as "trusted").

You really should test configuration files before attempting to use them. There is a test utility on the Configuration File editor in Manager.

Posted: Thu Dec 08, 2005 1:08 am
by dsusr

Yes, I have installed DataStage on both servers at the same location. I have also tested rsh on both nodes by running the commands
rsh server1name uptime and rsh server2name uptime, and both give the correct result.

My configuration file is as follows:-

{
    node "node1"
    {
        fastname "PUN020"
        pools ""
        resource disk "/opt/datastage/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/opt/datastage/Ascential/DataStage/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "PUN040"
        pools ""
        resource disk "/home/dsadm" {pools ""}
        resource scratchdisk "/Scratch" {pools ""}
    }
}

When I test this configuration file using the check utility in Manager, it gives the following error:

##E TFIO 000211 14:49:16(000) <APT_RealFileExportOperator in APT_FileExportOperator,0> APT_Communicator::connectTo: connect() failed due to Unix error = 111 (Connection refused) on node PUN020 using ConnectionInfo object 'TCP, connection Host: PUN020 (, TCP port number: 11001', RETRYING connect()

##E TFIO 000211 14:49:16(001) <APT_RealFileExportOperator in APT_FileExportOperator,0> APT_Communicator::connectTo: connect() failed due to Unix error = 111 (Connection refused) on node PUN020 using ConnectionInfo object 'TCP, connection Host: PUN020 (, TCP port number: 11001', RETRYING connect()

##F TFIO 000112 14:49:16(002) <APT_RealFileExportOperator in APT_FileExportOperator,0> Fatal Error: APT_Communicator::pmSendPartitionInfo() failed on node PUN020 for partition 0 of dataset 0 (write failed to handle 14) Bad file descriptor

##E TFPM 000192 14:49:16(000) <node_node1> Player 2 terminated unexpectedly.

##E TFPM 000338 14:49:16(004) <main_program> Unexpected exit status 1

##E TFSR 000011 14:49:21(000) <main_program> Step execution finished with status = FAILED.

##E TOCK 000000 14:49:21(001) <main_program> ERROR: check configuration file failed.

One important point to note is that this configuration file is on the PUN020 server, and the check fails while trying to use its own node.

Please let me know if I need to make any other change.


Posted: Thu Dec 08, 2005 9:27 pm
by dsusr
Does anyone have any idea what this APT_Communicator means? Also, is there any documentation for setting up an MPP system?


Posted: Fri Dec 09, 2005 5:54 pm
by ray.wurlod
"Connection refused" usually indicates that the two machines are not in a trusted relationship. Have you made entries in the appropriate files, such as lmhosts, to enable this?

Posted: Mon Dec 12, 2005 2:25 am
by daniel0623
First, make sure the rsh service has been started.
Second, add a .rhosts file to the dsadm home directory on each machine, and add the user dsadm to that .rhosts file.
Third, test whether rsh connects.

Posted: Mon Dec 12, 2005 11:55 pm
by dsusr
daniel0623 wrote: First, make sure the rsh service has been started.
Second, add a .rhosts file to the dsadm home directory on each machine, and add the user dsadm to that .rhosts file.
Third, test whether rsh connects.

Hi daniel/Ray,

Pardon me for replying late. Yes, I have tested rsh from both servers, and I am able to log in to either server using rsh. I tried this using the dsadm user ID only.

The issue is that the server from which I am running the job does not recognize the node name of its own node. If I test the configuration file with only that single node, it does not give any error.

Thanks & Regards

Posted: Tue Dec 13, 2005 3:17 pm
by ray.wurlod
Have you checked the hosts files on all systems? How exactly is name to IP address resolution performed?
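
One way to answer the resolution question on each machine is to ask the resolver directly what address a node name maps to. The sketch below is an illustration only (the function name is made up); it reports whether a name resolves to a loopback address, which would mean other machines told to connect to that name end up connecting to themselves.

```python
import ipaddress
import socket

def resolves_to_loopback(hostname):
    """Return True if hostname resolves to a loopback address such as 127.0.0.1."""
    # gethostbyname consults the system resolver, which normally reads /etc/hosts first
    address = socket.gethostbyname(hostname)
    return ipaddress.ip_address(address).is_loopback
```

Calling this with each fastname from the configuration file, on every machine, gives a quick picture of how names are actually being resolved there.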

Posted: Wed Dec 14, 2005 2:04 am
by dsusr
ray.wurlod wrote:Have you checked the hosts files on all systems? How exactly is name to IP address resolution performed?

Hi Ray,

Yes, in the hosts file the host name is mapped to the correct IP address.

If I try to log in to the server using rsh and the hostname, it works. Also, the error is coming for its own node, the node that works fine with the other config file.

Thanks & Regards


Posted: Fri Dec 16, 2005 4:55 am
by chenxs
Hi, have you solved this problem?

We have hit this issue too; please tell me how to solve it.

Thanks a lot~

Posted: Tue Dec 20, 2005 1:55 am
by weela_lee
I met the same problem today. After I made the following settings, it works!
1. Set the user dsadm's user ID and its group's ID to the same values on all machines in the cluster;
2. Set up not only the .rhosts file in the home directory but also /etc/hosts with the information for all cluster machines.

Hope it helps :D
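
The first setting can be checked by collecting the output of `id dsadm` from each machine and comparing the numeric IDs. This is a rough sketch under assumptions: the helper names are invented, the group name `dstage` in the example is hypothetical, and the input is assumed to be the usual `uid=...(name) gid=...(name)` format printed by `id`.

```python
import re

def parse_id_output(text):
    """Extract (uid, gid) as integers from the output of the `id` command."""
    match = re.search(r"uid=(\d+).*?gid=(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised id output: {text!r}")
    return int(match.group(1)), int(match.group(2))

def mismatched_hosts(id_outputs):
    """Given {host: `id` output}, list hosts whose uid/gid differ from the first host's."""
    items = list(id_outputs.items())
    if not items:
        return []
    reference = parse_id_output(items[0][1])
    return [host for host, text in items[1:] if parse_id_output(text) != reference]
```

An empty list means every machine reports the same user and group IDs as the first one; any host named in the result needs its IDs aligned.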

Posted: Thu Jul 13, 2006 7:54 pm
by ib_icf
I met the same problem and have resolved it now:
##E TFIO 000211 14:49:16(000) <APT_RealFileExportOperator in APT_FileExportOperator,0> APT_Communicator::connectTo: connect() failed due to Unix error = 111 (Connection refused) on node PUN020 using ConnectionInfo object 'TCP, connection Host: PUN020 (, TCP port number: 11001', RETRYING connect()
This problem is caused by a wrong IP address configuration. According to the error message above, there must be a line in /etc/hosts like this:

PUN020 localhost

Change it to:

localhost
realip PUN020

Hope it's helpful...
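
To spot this kind of entry without reading the file by eye, the hosts file can be scanned for machine names attached to a loopback address. The sketch below is illustrative only. It assumes the problem line maps the machine name to a loopback address such as 127.0.0.1 (the addresses in the post above did not survive), and the function name and the `allowed` list are made up here.

```python
import ipaddress

def loopback_aliases(hosts_text, allowed=("localhost", "localhost.localdomain")):
    """Return (line_number, name) pairs where a loopback address carries a name not in allowed."""
    findings = []
    for number, line in enumerate(hosts_text.splitlines(), start=1):
        # Drop trailing comments, then split into address and names
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a valid address line, skip it
        if address.is_loopback:
            findings.extend((number, name) for name in fields[1:] if name not in allowed)
    return findings
```

Passing in the contents of /etc/hosts from each machine returns an empty list when no node name sits on a loopback line. The default `allowed` tuple may need extending for names such as IPv6 localhost aliases.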