Troubleshooting unresponsive server

VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

Our DataStage server has stopped responding to client connections. When I attempt to connect to the server, the client hangs on the login screen indefinitely.

This is a GRID server that is fairly new (just development projects/jobs).
I can telnet/ssh to the head node and other nodes in the grid.
I can telnet to the various ports that are required for the server connection.

After searching the various threads, I found that the RPC daemon (dsrpcd) is something to check.

I ran the following:
netstat -a | grep dsrpc
and got a LISTEN status.

I also ran ps -ef | grep dsrpc, which returned an output line for dsrpcd.
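
For anyone who wants to repeat the two checks above in one go, here is a minimal sketch. The function names (has_dsrpc_listener, has_dsrpcd_process, check_dsrpc) are made up for illustration and are not part of DataStage. It assumes, as in the commands above, that netstat -a shows the listener under the service name dsrpc.

```shell
#!/bin/sh
# Hypothetical helper: reads `netstat -a` output on stdin and succeeds
# if a dsrpc line in LISTEN state is present.
has_dsrpc_listener() {
    grep dsrpc | grep -q LISTEN
}

# Hypothetical helper: reads `ps -ef` output on stdin and succeeds if a
# dsrpcd process is listed, ignoring the grep command itself.
has_dsrpcd_process() {
    grep -v grep | grep -q dsrpcd
}

# Runs both checks against the live system and prints one line per check.
check_dsrpc() {
    if netstat -a 2>/dev/null | has_dsrpc_listener; then
        echo "dsrpc listener: LISTEN"
    else
        echo "dsrpc listener: not found"
    fi
    if ps -ef 2>/dev/null | has_dsrpcd_process; then
        echo "dsrpcd process: running"
    else
        echo "dsrpcd process: not found"
    fi
}
```

Source the file on the head node and call check_dsrpc. As this thread shows, both checks can pass while the login still hangs, so a clean result only rules out a dead daemon.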

I have submitted this to the Admin group for resolution, but thought I might ask the gurus here to see if there might be some common items to watch for.

Is there a list of services that one should look for?

What else would cause such a serious condition?

Thanks for your time and input
-V
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Ordinarily, in the case of a hang, I'd suggest looking at server name resolution, but the fact that telnet is OK pre-empts that suggestion. Can you try leaving one of the hung connection attempts for, say, at least ten minutes to see whether a timeout error occurs? That may contain a useful diagnostic error code. While it is hanging you might like to try a netstat command on the server.

The next step would be to start dsrpcd with logging, but let's not go there quite yet.
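
One possible way to capture the netstat output while a connection attempt is hanging is sketched below. The function name snapshot_netstat and the log path in the usage note are assumptions for illustration only. The sketch appends timestamped netstat -an snapshots to a file so they can be compared afterwards.

```shell
#!/bin/sh
# Hypothetical helper: append COUNT netstat snapshots to OUTFILE,
# waiting INTERVAL seconds between snapshots.
# Usage: snapshot_netstat OUTFILE COUNT INTERVAL
snapshot_netstat() {
    outfile=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        echo "=== snapshot $i at $(date) ===" >> "$outfile"
        # -a lists all sockets; -n keeps addresses numeric so the
        # snapshot itself does not wait on name resolution.
        netstat -an >> "$outfile" 2>&1 || echo "netstat failed" >> "$outfile"
        # Sleep only between snapshots, not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, snapshot_netstat /tmp/ds_hang_netstat.log 20 30 would cover roughly the ten-minute wait suggested above, with one snapshot every 30 seconds.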
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

VCInDSX,
First, go to the grid_enabled directory on the head node and run the test.sh script to verify that your grid environment is working. If the job runs successfully, you will see information such as the number of nodes (with the compute node names) and partitions. If the job fails, then you know the grid is not even set up correctly, let alone able to run DS.

Let me know the result; I will let you know what's next to check.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Surely clients make a connection only to the head node?
VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

Hi Ray & lstsaur,
Thanks for your invaluable time and input. This had a premature ending today. The admin folks restarted the DS Server last night (this is a new Dev server and not much impact) which "solved" the issue. I am not sure if we might have lost an opportunity to find some issue that might still be lingering in the background.

However, to answer your questions.
The clients connect to the head node - in our setup.

I left one login attempt hanging for as long as it took. It had not returned after 2 hours, so I had to kill that instance. This was before I posted my query in this forum. By the time I came back here, the restart had completed. I will use this tip the next time, if there is one.

As for test.sh, just for the sake of my own learning, I ran it and could see the output of the test job it ran, including all the Peek output. I will try this again if we run into the same issue. Thanks for the tip.

Thanks again for your help
-V
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

VCInDSX,
Is PBSPro (Resource Manager) used in your grid environment? It would tell you right away if your server were not up and running.
VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

Hi lstsaur,
Thanks for the followup.
I don't think PBSPro is on this server. However, I recall from an earlier chat with the Admin folks that Ganglia is installed and set up on this box. Do you think that might help if we run into this issue again?

Thanks,
-V
cppwiz
Participant
Posts: 135
Joined: Tue Sep 04, 2007 11:27 am

Post by cppwiz »

We have experienced this same issue (hung on login screen) several times with our recently upgraded v8 system with the same resolution each time - restart the server. I'm interested in a better resolution also.