Ignoring message with bad cookie. Has anybody seen this?

abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

Here is the full warning:

Ignoring message with bad cookie; expected 1234567890.123456.2de5, received 1234567888.123455.2d3d

Any ideas?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

And you got this where? From what? Doing what, exactly?

FYI, a search reveals yours is the only post with that message, so you may be on your own here.
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

What does your cookie have to do with Parallel jobs?
Choose a job you love, and you will never have to work a day in your life. - Confucius
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

I am trying to run 25+ jobs in about 5 job streams. About 20+ of them abort.

Here is the sequence of messages:

WARNING:
main_program: Ignoring message with bad cookie; expected 1234567890.123456.2de5, received 1234567888.123455.2d3d

WARNING:
main_program: Accept timed out retries = 16

FATAL ERROR:
main_program: The section leader on tste3ftp001.ihop.local died

FATAL ERROR:
main_program: **** Parallel startup failed ****
This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.

INFORMATIONAL MESSAGE(later):
main_program: rsh issued, no response received
------------------------------------------------------------------------------------

I did a search for 'Parallel startup failed' and found some posts.

Some of the solutions I came across are:

1) Do you have a firewall enabled?
Question: How do I check that on my Linux box?

2) Are all ports used by Information Server open? (There are about 20 of them.)
Question: How do I check which ports are in use on my Linux box, and how do I check whether they are open?

3) rsh has to be configured for password-less login.
Question: How do I check this?

4) Does the configuration file contain nodes external to the underlying host?
Answer: Not my config file.

5) Another solution from this post:
/etc/hosts.equiv or .rhosts

a) Need to enable rsh on the server.
Question: How do I enable rsh?

b) I don't understand this one:

Provide entries in the configuration file; it must contain node entries for the 3 servers on all machines.

Ex:

Code:

{
node "node1"
  {
    fastname "hostname1"
    ***********
  }
node "node2"
  {
    fastname "hostname2"
  }
node "node3"
  {
    fastname "hostname3"
  }
}

c) Create startup.apt and add the file path in Administrator.
Question: I don't understand this at all. Should I create a config file called startup.apt? How do I add the file path in the Administrator?

6) Have you done this step?
On the primary computer, create the remsh file in the /Server/PXEngine/etc/ directory with the following content:
#!/bin/sh
exec /usr/bin/ssh "$@"

Question: What is the consequence of this step?

MOST INTERESTING:
a) The exact same sequence and jobs run perfectly fine on a different Linux box.
b) On the problem box, not all jobs abort but a majority do.
c) All of these jobs run fine when run individually.

MORE INFO: Everything is happening on the same Linux box except where the database resides.
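
For question 3, if the remsh wrapper from item 6 is in place (so the engine goes through ssh rather than rsh), password-less login can be checked without sitting at a prompt. This is only a minimal sketch under that assumption, written in Python: `BatchMode=yes` tells ssh to fail instead of asking for a password, and `ConnectTimeout` stops it hanging on an unreachable host. The function names and the 5-second timeout are illustrative choices, not anything from DataStage.

```python
import subprocess


def ssh_check_command(host, timeout=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never prompt; fail if a password is needed
        "-o", f"ConnectTimeout={timeout}",  # give up on unreachable hosts
        host,
        "true",                             # trivial remote command
    ]


def passwordless_login_works(host, timeout=5):
    """Return True if `ssh host true` succeeds without any prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(host, timeout),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # the ssh client is not installed
    return result.returncode == 0
```

Run it as the user that runs the jobs, for example `passwordless_login_works("tste3ftp001.ihop.local")`. A False result can also mean the host key has not been accepted yet, so it narrows things down rather than proving the cause.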
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

What version of DataStage is on the problem server? What version of OS? How long has the problem server been up and running successfully before getting the warnings and aborts? What is your job design for the job logging the cookie warning?
Choose a job you love, and you will never have to work a day in your life. - Confucius
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

1) What version of DataStage is on the problem server?
Ans: 8.1 on both. Exact same patches on both.

2) What version of OS?
Ans: Linux 2.6.18-274 ... x86_64 GNU/Linux

3) How long has the problem server been up and running successfully before getting the warnings and aborts?
Ans: Several years. No problems other than the DataStage job errors.

4) What is your job design for the job logging the cookie warning?
Ans: Two source OraEnterprise stages going to a Change Capture stage, then to a Transformer, then to a sequential file. It is the same job in both environments.
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

I'm drawing a blank on the whole "cookie" thing from your job log.

You might also check this topic:

viewtopic.php?t=141488

Every now and then the network can have problems or an admin could change something unintentionally to cause mysterious problems. I've seen it happen many times.

It sounds like you're just going to have to start double-checking all the settings and comparing them against the working server. Hopefully someone else will have a better idea.
Choose a job you love, and you will never have to work a day in your life. - Confucius
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

This is classic 'involve your support provider' territory to me.
-craig

"You can never have too many knives" -- Logan Nine Fingers
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

But can either of you answer any of my questions in my second post? At least, I can try out a few things then.
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Firewalls are most often external to your server. You would have to ask your Firewall or Network team what firewalls may be in place.

To check if a specific port is open, I usually run this from the command line: telnet server/IP port. Example: telnet 12.34.56.78 13401

What happens next depends on the telnet command and whether or not you're on Windows or Unix. Compare the telnet results between a port you know is open vs. the one you're testing. If the port is not open, telnet will usually just hang for some time.
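
With about 20 ports to compare between two servers, the telnet test can also be scripted. A minimal sketch in Python, assuming a plain TCP connect is a fair stand-in for the telnet test; the function name `port_is_open` and the 2-second timeout are illustrative choices.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection makes the same kind of TCP connect that telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Using the example values above, the call would be `port_is_open("12.34.56.78", 13401)`. A port that telnet would hang on shows up here as False after the timeout.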

If the server has been up and running just fine for years, then I wouldn't go creating new *.apt config files trying to make something new work. Rather, I would start narrowing down the problem and comparing PX and Unix settings between servers. If nothing is obviously different fairly quickly, then open a support case.
Choose a job you love, and you will never have to work a day in your life. - Confucius
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

IMHO, those kinds of questions are relevant when you can't get anything to run, which is obviously not the case here. Ignoring the oddball cookie message, what happens when you run the jobs individually? Do they run ok? When you run them all and most of them fail, is it always the same set, or does the collection of failed jobs change from run to run? I'm wondering if you are simply overloading the system with these "25+" jobs.
-craig

"You can never have too many knives" -- Logan Nine Fingers
_chamak
Premium Member
Posts: 29
Joined: Tue Aug 24, 2010 10:29 am

Post by _chamak »

I think you are reaching the maximum number of user processes. Try increasing that limit. You can check the current setting using the command below:
lsattr -E -l sys0

Hope it helps.
-Thanks
Chamak
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

My lsattr doesn't have -E option.

What is sys0?

My ulimit -a shows 2097 (max user processes).

I am definitely not exceeding that many processes. Isn't there a way to count how many processes are being spawned during the run?
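
On Linux, one way to count them is to walk /proc while the job streams are running. This is a minimal sketch in Python, assuming a standard Linux /proc filesystem; the function name `count_user_tasks` is hypothetical. It reports threads as well as processes because, as far as I know, Linux counts threads against the max user processes limit. Ownership of /proc/<pid> is used as an approximation of which user a process belongs to.

```python
import os


def count_user_tasks(uid):
    """Return (processes, threads) currently owned by uid, read from /proc."""
    processes = 0
    threads = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # only numeric entries are process IDs
        try:
            if os.stat(os.path.join("/proc", entry)).st_uid != uid:
                continue
            # Each subdirectory of /proc/<pid>/task is one thread.
            task_count = len(os.listdir(os.path.join("/proc", entry, "task")))
        except OSError:
            continue  # the process exited while we were looking
        processes += 1
        threads += task_count
    return processes, threads
```

Run as the user that runs the jobs, `count_user_tasks(os.getuid())` can be called every few seconds during the run and the peak compared against the ulimit value.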
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

By the way, our server is on VMware. Could that be an issue?
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Please show full ulimit -a output from both servers. I usually go with the unlimited setting for most of those settings shown by ulimit -a. Depending on number of nodes used and number of simultaneous jobs, you could generate a large number of processes.
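
Once both outputs are in hand, comparing them by eye is easy to get wrong. A minimal sketch in Python that reports only the settings that differ, assuming the bash layout of `ulimit -a` (a description, then a parenthesised unit and flag, then the value); other shells print a different layout and would need a different pattern. The function names are hypothetical.

```python
import re

# Matches bash "ulimit -a" lines such as: "max user processes   (-u) 2097"
LINE = re.compile(r"^(?P<name>.+?)\s+\((?P<detail>[^()]*)\)\s+(?P<value>\S+)\s*$")


def parse_ulimit(text):
    """Map each limit description to its value from `ulimit -a` output."""
    limits = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            limits[match.group("name")] = match.group("value")
    return limits


def diff_ulimits(text_a, text_b):
    """Return {name: (value_a, value_b)} for every limit that differs."""
    a, b = parse_ulimit(text_a), parse_ulimit(text_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Save the `ulimit -a` output from each server to a file, read both files, and pass the two strings to `diff_ulimits`; an empty result means the limits match and the difference lies elsewhere.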
Choose a job you love, and you will never have to work a day in your life. - Confucius