I am trying to run 25+ jobs in about 5 job streams. More than 20 of the jobs abort with a parallel startup failure.
Here is the sequence of messages:
WARNING:
main_program: Ignoring message with bad cookie; expected 1234567890.123456.2de5, received 1234567888.123455.2d3d
WARNING:
main_program: Accept timed out retries = 16
FATAL ERROR:
main_program: The section leader on tste3ftp001.ihop.local died
FATAL ERROR:
main_program: **** Parallel startup failed ****
This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.
INFORMATIONAL MESSAGE (later):
main_program: rsh issued, no response received
------------------------------------------------------------------------------------
I did a search for 'Parallel startup failed' and found some posts.
Some of the solutions I came across are:
1) Do you have a firewall enabled?
Question: How do I check that on my Linux box?
2) Are all ports used by Information Server open (there are about 20 of them)?
Question: How do I list the ports in use on my Linux box, and how do I check whether they are open?
3) rsh has to be configured for password-less login.
Question: How do I check this?
4) Does the configuration file contain nodes external to the underlying host?
Answer: No, mine does not.
5) Another solution from this post:
/etc/hosts.equiv or .rhosts
a) Need to enable rsh on the server.
Question: How do I enable rsh?
b) This I don't understand:
Provide entries in the configuration file; it must contain node entries for the 3 servers, on all machines.
Ex:
{
    node "node1"
    {
        fastname "hostname1"
        ***********
    }
    node "node2"
    {
        fastname "hostname2"
    }
    node "node3"
    {
        fastname "hostname3"
    }
}
c) Create startup.apt and add the file path in Administrator.
Question: This I don't understand at all. Should I create a config file called startup.apt? How do I add the file path in Administrator?
6) Have you done this step?
On the primary computer, create the remsh file in the /Server/PXEngine/etc/ directory with the following content:
#!/bin/sh
exec /usr/bin/ssh "$@"
Question: What is the consequence of this step?
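For questions 2 and 3 above, here is a minimal sketch of how such checks could be scripted. It assumes bash (for the /dev/tcp redirection) and the coreutils timeout command. The function names check_port and check_passwordless are hypothetical helpers made up for illustration; they are not part of Information Server.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative only.

# check_port HOST PORT
# Prints "open" if a TCP connection succeeds within 3 seconds, otherwise "closed".
check_port() {
    local host="$1" port="$2"
    # Inside the inner bash, $0 is the host and $1 is the port.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# check_passwordless REMOTE_SHELL HOST
# Runs a harmless command ("true") on HOST through the given remote shell
# (for example rsh or ssh). If the remote shell stops to ask for a password,
# the 10-second timeout expires and the check reports failure.
check_passwordless() {
    local remote_shell="$1" host="$2"
    if timeout 10 "$remote_shell" "$host" true </dev/null >/dev/null 2>&1; then
        echo "passwordless login works"
    else
        echo "passwordless login failed"
    fi
}
```

Usage would look like `check_port tste3ftp001.ihop.local 22` or `check_passwordless rsh tste3ftp001.ihop.local`; port 22 is only a placeholder, not one of the Information Server ports. For question 1, on many Linux systems `iptables -L -n` (run as root) lists the active firewall rules, though the exact tool depends on the distribution.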
MOST INTERESTING:
a) The exact same sequence and jobs run perfectly fine on a different Linux box.
b) On the problem box, not all jobs abort, but a majority do.
c) All of these jobs run fine when run individually.
MORE INFO: Everything runs on the same Linux box; only the database resides elsewhere.
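To double-check point 4 (no nodes external to the host), a sketch like the following lists the distinct fastname values in a parallel configuration file. The function name list_fastnames is hypothetical, and the sketch assumes each fastname entry sits on its own line, as in the example above.

```shell
#!/bin/bash
# list_fastnames CONFIG_FILE
# Prints each distinct fastname value found in the file, one per line, sorted.
list_fastnames() {
    sed -n 's/.*fastname[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}
```

Run as `list_fastnames "$APT_CONFIG_FILE"`, assuming that variable points at the configuration file the jobs use. If exactly one host name is printed, every node in the file refers to the same machine.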