
Ignoring message with bad cookie. Has anybody seen this?

Posted: Sat Dec 10, 2011 10:59 am
by abc123
Here is the full warning:

Ignoring message with bad cookie; expected 1234567890.123456.2de5, received 1234567888.123455.2d3d

Any ideas?

Posted: Sat Dec 10, 2011 11:11 am
by chulett
And you got this where? From what? Doing what, exactly? :?

FYI, a search reveals yours is the only post with that message; you may be on your own here.

Posted: Sat Dec 10, 2011 12:33 pm
by qt_ky
What does your cookie have to do with Parallel jobs?

Posted: Sat Dec 10, 2011 6:33 pm
by abc123
I am trying to run 25+ jobs in about 5 job streams. About 20+ of the jobs abort.

Here is the sequence of messages:

WARNING:
main_program: Ignoring message with bad cookie; expected 1234567890.123456.2de5, received 1234567888.123455.2d3d

WARNING:
main_program: Accept timed out retries = 16

FATAL ERROR:
main_program: The section leader on tste3ftp001.ihop.local died

FATAL ERROR:
main_program: **** Parallel startup failed ****
This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.

INFORMATIONAL MESSAGE(later):
main_program: rsh issued, no response received
------------------------------------------------------------------------------------

I did a search for 'Parallel startup failed' and found some posts.

Some of the solutions I came across are:

1)Do you have a firewall enabled?
Question: How do I check that on my Linux box?

2)Are all ports used by Information Server open (There are about 20 of them)?
Question: How do I check ports I have on my Linux box and how do I check if they are open?

3)rsh has to be configured to do password-less login.
Question: How do I check this?

4)Does the configuration file contain nodes external to underlying host?
Answer: Not my config file.

5)Another solution from this post:
/etc/hosts.equiv or .rhosts

a)Need to enable rsh on the server
Question: How do I enable rsh?

b) This I don't understand.

Provide entries in configuration file , must contain node entries of 3 servers on all machines

Ex:


{
node "node1"
  {
    fastname "hostname1"
    ***********
  }
node "node2"
  {
    fastname "hostname2"
  }
node "node3"
  {
    fastname "hostname3"
  }
} 
c) Create startup.apt and add the file path in the Administrator.
Question: This I don't understand at all. Should I create a config file called startup.apt? How do I add the file path in the Administrator?

6)Have you done this step?
On the primary computer, create the remsh file in the /Server/PXEngine/etc/ directory with the following content.
#!/bin/sh
exec /usr/bin/ssh "$@"

Question: What is the consequence of this step?
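
For context, step 6 just drops a one-line wrapper script in place. My understanding (hedged, based on the step itself) is that when this etc/remsh file exists, the parallel engine invokes it instead of rsh, so section-leader startup on the nodes goes over ssh. A sketch, using a stand-in directory instead of the real PXEngine install path:

```shell
# Sketch of step 6: wrap rsh with ssh for the PX engine.
# PXETC below is a stand-in path -- on a real system this would be
# the /Server/PXEngine/etc/ directory of your install.
PXETC=./pxengine_etc_demo
mkdir -p "$PXETC"

# Create the remsh wrapper exactly as the step describes.
cat > "$PXETC/remsh" <<'EOF'
#!/bin/sh
exec /usr/bin/ssh "$@"
EOF

# It must be executable or the engine cannot call it.
chmod 755 "$PXETC/remsh"
ls -l "$PXETC/remsh"
```

Note the wrapper only matters if password-less ssh between the nodes is already working; otherwise startup fails the same way it would with broken rsh.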

MOST INTERESTING:
a)The exact same sequence and jobs run perfectly fine on a different Linux box.
b)On the problem box, not all jobs abort but a majority do.
c)All of these jobs run fine when run individually.

MORE INFO: Everything is happening on the same Linux box except where the database resides.

Posted: Sat Dec 10, 2011 6:40 pm
by qt_ky
What version of DataStage is on the problem server? What version of OS? How long has the problem server been up and running successfully before getting the warnings and aborts? What is your job design for the job logging the cookie warning?

Posted: Sat Dec 10, 2011 9:29 pm
by abc123
1)What version of DataStage is on the problem server?
Ans: 8.1 on both. Exact same patches on both.

2)What version of OS?
Ans: Linux 2.6.18-274 ... x86_64 GNU/Linux

3)How long has the problem server been up and running successfully before getting the warnings and aborts?
Ans: Several years. No other problems other than the Datastage job errors.

4)What is your job design for the job logging the cookie warning?
Ans: Two source OraEnterprise stages feeding a Change Capture stage, then a Transformer, then a Sequential File stage. It is the same job design in both environments.

Posted: Sat Dec 10, 2011 10:50 pm
by qt_ky
I'm drawing a blank on the whole "cookie" thing from your job log.

You might also check this topic:

viewtopic.php?t=141488

Every now and then the network can have problems or an admin could change something unintentionally to cause mysterious problems. I've seen it happen many times.

It sounds like you're just going to have to start double-checking all the settings and comparing them against the working server. Hopefully someone else will have a better idea.

Posted: Sun Dec 11, 2011 8:55 am
by chulett
This is classic 'involve your support provider' territory to me.

Posted: Sun Dec 11, 2011 9:16 am
by abc123
But can either of you answer any of my questions in my second post? At least then I can try out a few things.

Posted: Sun Dec 11, 2011 10:05 am
by qt_ky
Firewalls are most often external to your server. You would have to ask your Firewall or Network team what firewalls may be in place.

To check if a specific port is open, I usually run this from the command line: telnet server/IP port. Example: telnet 12.34.56.78 13401

What happens next depends on the telnet command and whether you're on Windows or Unix. Compare the telnet results between a port you know is open vs. the one you're testing. If the port is not open, telnet will usually just hang for some time.
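
If telnet isn't available, or you want something scriptable, here is a rough equivalent using bash's /dev/tcp redirection (the host and port below are just examples, same as the telnet one above):

```shell
# Scripted port check using bash's /dev/tcp, which avoids an
# interactive telnet session hanging on a filtered port.
port_open () {
    # Attempt a TCP connect to $1:$2, capped at 3 seconds.
    timeout 3 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null
}

# Example: check one of the Information Server ports locally.
if port_open 127.0.0.1 13401; then
    echo "13401 open"
else
    echo "13401 closed or filtered"
fi
```

As with telnet, compare the result against a port you know is open on the same box.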

If the server has been up and running just fine for years, then I wouldn't go creating new *.apt config files trying to make something new work. Rather, I would start narrowing down the problem and comparing PX and Unix settings between servers. If nothing is obviously different fairly quickly, then open a support case.

Posted: Sun Dec 11, 2011 10:10 am
by chulett
IMHO, those kinds of questions are relevant when you can't get anything to run, which is obviously not the case here. Ignoring the oddball cookie message, what happens when you run the jobs individually? Do they run ok? When you run them all and most of them fail, is it always the same set, or does the collection of failed jobs change run over run? I'm wondering if you are simply overloading the system with these "25+" jobs.

Posted: Thu Dec 15, 2011 11:00 am
by _chamak
I think you are reaching the maximum user processes limit. Try increasing the maximum user processes. You can check using the command below:
lsattr -E -l sys0

Hope it helps.

Posted: Thu Dec 15, 2011 11:48 pm
by abc123
My lsattr doesn't have an -E option.

What is sys0?

My ulimit -a shows a max user processes limit of 2097.

I am definitely not exceeding that many processes. Isn't there a way to count how many processes are being spawned during the run?
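
In case it's useful, here's a rough way to count them during a run (the account name is a guess; substitute whichever user actually runs the jobs):

```shell
# Count the processes owned by the DataStage run user. DSUSER
# defaults to the current user here; on a real server it would be
# whatever account the jobs run under (e.g. dsadm -- an assumption).
DSUSER="${DSUSER:-$(id -un)}"

# ps -u <user> -o pid= prints one bare PID per process for that user.
count=$(ps -u "$DSUSER" -o pid= | wc -l)
echo "processes for $DSUSER: $count"
```

Running that in a loop with sleep while the job streams execute would show how close the count gets to the 2097 limit.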

Posted: Fri Dec 16, 2011 10:42 am
by abc123
By the way, our server is on vmware. Could that be an issue?

Posted: Fri Dec 16, 2011 7:34 pm
by qt_ky
Please show full ulimit -a output from both servers. I usually go with the unlimited setting for most of those settings shown by ulimit -a. Depending on number of nodes used and number of simultaneous jobs, you could generate a large number of processes.
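
If it helps, here is the kind of side-by-side check I mean (server_b is just a placeholder for the working server's hostname):

```shell
# Capture this server's limits, then diff against the working server.
ulimit -a > /tmp/ulimit_this_server.txt

# Run these once the other box is reachable over ssh (placeholder host):
# ssh server_b 'ulimit -a' > /tmp/ulimit_other_server.txt
# diff /tmp/ulimit_this_server.txt /tmp/ulimit_other_server.txt

# The setting under suspicion in this thread:
echo "max user processes: $(ulimit -u)"
```

Any line that differs between the two files is a candidate for the mismatch causing the aborts.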