Job monitor stops every day

ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Job monitor stops every day

Post by ivannavi »

Every morning we find that JobMonApp is not running. Then we start it. Then it runs.

The last lines in the log read one of two things, depending on the day:
"Couldn't read from Socket: Connection reset"
and
"Could not write to subscriber socket."
I think that's when it dies. How do I find out what's killing it?
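
One way to narrow this down is to record the exact moment the process disappears, so it can be matched against cron jobs, backups, or other scheduled activity. The following is a minimal sketch, not an Ascential tool: the function names `is_alive` and `watch_pid` and the log format are made up for illustration, and it assumes the watcher runs as the same user as JobMonApp (otherwise `kill -0` can fail on permissions even though the process exists).

```shell
#!/bin/sh
# Succeeds while a process with the given PID exists.
# Signal 0 delivers nothing; it only probes for the process.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Poll the PID every <interval> seconds (default 1). When the process
# is gone, append one timestamped line to the log file and return.
watch_pid() {
    pid=$1 log=$2 interval=${3:-1}
    while is_alive "$pid"
    do
        sleep "$interval"
    done
    echo "pid $pid gone at $(date +'%Y%m%d %H:%M:%S')" >> "$log"
}

# Usage (hypothetical PID and log path):
#   watch_pid 29022 /tmp/jobmon_death.log &
```

Pass the PID of the java process that runs JobMonApp. The timestamp in the log is accurate to within one polling interval.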
gbusson
Participant
Posts: 98
Joined: Fri Oct 07, 2005 2:50 am
Location: France

Post by gbusson »

hi,

Check there is no other application running on the ports taken by the JobMonApp.

The ports are listed in the file $APT_ORCHHOME/etc/jobmon_ports
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Another possibility

Post by asorrell »

I was at a client that had a similar problem. We re-indexed all of his projects and the problem went away. Not sure what was corrupted, but jobmon halted with the same error message.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Do 'they' stop DataStage, for example for backups, in the middle of the night?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Post by ivannavi »

No, we don't stop DataStage. Who is 'they'?
ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Post by ivannavi »

jobmon_ports file says:
APT_JOBMON_PORT1=13400
APT_JOBMON_PORT2=13401
As for "Check there is no other application running on the ports taken by the JobMonApp", I did this:
$ netstat -a | grep 13400
tcp 0 0 *.13400 *.* LISTEN
$ netstat -a | grep 13401
tcp 0 0 *.13401 *.* LISTEN
Is this OK?
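
A plain `grep 13400` also matches the number anywhere else on a netstat line (a remote port, for example). The sketch below reads the ports from the file and matches only the local-address column. It assumes the file contains lines of the form shown above and that the local address is the fourth netstat column, as in the output quoted here; the function names `jobmon_ports` and `lines_for_port` are invented for this example.

```shell
#!/bin/sh
# Print the port numbers defined in a jobmon_ports-style file
# (lines of the form APT_JOBMON_PORTn=NNNNN).
jobmon_ports() {
    sed -n 's/^APT_JOBMON_PORT[0-9]*=\([0-9][0-9]*\).*$/\1/p' "$1"
}

# Read netstat output on stdin and print only the lines whose local
# address (4th column) ends in ".PORT" or ":PORT".
lines_for_port() {
    awk -v port="$1" '$4 ~ ("[.:]" port "$")'
}

# Usage:
#   for p in $(jobmon_ports "$APT_ORCHHOME/etc/jobmon_ports")
#   do
#       netstat -an | lines_for_port "$p"
#   done
```

A single LISTEN line per port only shows that something is listening. To see which process owns the socket, a tool such as lsof is still needed.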
ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Post by ivannavi »

I reindexed all projects. I still find Job Monitor not running almost every day. Any ideas?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

By "they" I mean operations staff, who may shut things down in order to take backups, etc.
ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Post by ivannavi »

Well, they don't stop anything without us knowing.
ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Post by ivannavi »

gbusson wrote:
Check there is no other application running on the ports taken by the JobMonApp
So I installed lsof, because it can take a port number as an input parameter and show what is using that port. Then I wrote a script that captures this into a log file every second. It is pasted below in case someone can use it. I was hoping to spot the killer application by examining the file when the job monitor dies. I watched for it for more than two weeks.
Nothing unusual. When no jobs were running, only two lines were present for each "query":
from lsof____ *:13400 29022 (LISTEN) from ps____ dsadm 29022 1 0 Jun 12 ? 8:09 /asc/Ascential/DataStage/DSEngine/java/jre/bin/PA_RISC2.0/java time____ 20070613 13:51:02
from lsof____ *:13401 29022 (LISTEN) from ps____ dsadm 29022 1 0 Jun 12 ? 8:09 /asc/Ascential/DataStage/DSEngine/java/jre/bin/PA_RISC2.0/java time____ 20070613 13:51:03
Additional lines were present when jobs were running (various phantoms, osh, sqlldr etc...), but nothing suspicious. At the time jobmonitor died this log also stopped receiving entries.
So I guess my problem is neither in the repository (which I reindexed) nor in some other application using the ports. What else should I try?
:evil:

The script:
#!/bin/sh
# Every second, log which processes hold the JobMonApp ports.
LSOF=/home/dsadm/lsof-4.77/lsof

while true
do
    sleep 1
    for port in 13400 13401
    do
        # lsof columns: $9 = NAME, $2 = PID, $10 = state, e.g. (LISTEN).
        # NR > 1 skips the lsof header line.
        "$LSOF" -i :"$port" | awk 'NR > 1 {print $9, $2, $10}' |
        while read name pid state
        do
            # sed 1d drops the ps header line.
            echo 'from lsof____' $name $pid $state ' from ps____' `ps -f -p $pid | sed 1d` ' time____' `date +'%Y%m%d %H:%M:%S'`
        done
    done
done
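
Because the capture log stops receiving entries when the job monitor dies, its last timestamp approximates the time of death. Here is a small sketch for pulling that timestamp out, assuming the log lines end in `time____ YYYYMMDD HH:MM:SS` as in the samples above; the function name `last_seen` is made up for illustration.

```shell
#!/bin/sh
# Print the timestamp (YYYYMMDD HH:MM:SS) of the last entry in a
# capture log whose lines end in "time____ YYYYMMDD HH:MM:SS".
last_seen() {
    sed -n 's/.* time____ *\([0-9]\{8\} [0-9:]\{8\}\)$/\1/p' "$1" | tail -n 1
}

# Usage (hypothetical log path):
#   last_seen /home/dsadm/jobmon_ports.log
```

Collecting this value over several days shows whether the monitor dies at roughly the same time each night, which would point to a scheduled activity.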
ralleo
Premium Member
Posts: 21
Joined: Mon Dec 11, 2006 9:05 am
Location: London

Post by ralleo »

Restarting DataStage doesn't always clear the sockets.

Before restarting DataStage, issue a uv -admin -clearsockets command. Sockets can get blocked and then have to be cleared by force.

See if this works.






-----------------------------------
Many ways to solve a problem
ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Post by ivannavi »

Hey Ralleo, I'm not quite sure I understand. So let me go through this step by step:

1) Initially both DataStage and JobMonApp are up and running.
2) Then JobMonApp stops. DataStage is still up.
3) Then I stop DataStage.
4) Then I issue uv -admin -clearsockets command.
5) Then I start DataStage.
6) Then I hope JobMonApp won't die again.

Is that what you had in mind? And is this "clearsockets" necessary if I always wait until "netstat -a | grep dsr" shows nothing, or at least no FIN_WAIT_2 entries?
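
The waiting step between stopping and starting DataStage can be automated. The sketch below polls until no socket line matches a pattern, with a limit on the number of tries. It only automates the netstat check described here and says nothing about whether clearing sockets is also needed. The names `wait_until_clear` and `NETSTAT_CMD` are assumptions of this example, not part of DataStage.

```shell
#!/bin/sh
# Command used to list sockets. Overridable, which also makes the
# function easy to try out without a real server (sketch assumption).
: "${NETSTAT_CMD:=netstat -a}"

# Return 0 once no socket line matches the pattern, or 1 after
# max_tries polls (default 60), sleeping one second between polls.
wait_until_clear() {
    pattern=$1 max_tries=${2:-60}
    tries=0
    while $NETSTAT_CMD | grep "$pattern" >/dev/null
    do
        tries=$((tries + 1))
        [ "$tries" -ge "$max_tries" ] && return 1
        sleep 1
    done
    return 0
}

# Usage, between stopping and starting DataStage:
#   wait_until_clear dsr 120 || echo "sockets still present"
```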
ralleo
Premium Member
Posts: 21
Joined: Mon Dec 11, 2006 9:05 am
Location: London

Post by ralleo »

Normally, when both DataStage and JobMonApp are running and JobMonApp stops, you don't have to stop DataStage. You can restart JobMonApp on its own.

I suggested the clear sockets command because you stopped and restarted DataStage. Before restarting DataStage it is good practice to clear the sockets, so that no blocked sockets affect the running of DataStage.

As for having to restart JobMonApp every day, it seems to me that some activity takes place on the server between certain times and causes JobMonApp to fail.

If you log in to UV and type "STATUS", it will list all activities on the server. Review these and see which activities could cause this to happen.
Also, if any large log files are generated at certain times on the same filesystem as JobMonApp, this can cause JobMonApp to stop.
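
To check the large-file theory, one could list files that grew recently on the filesystem in question. This is a minimal sketch using only standard find options; the function name `recent_large_files` is made up, and the directory and size threshold are whatever fits the installation.

```shell
#!/bin/sh
# List regular files under a directory that are larger than the given
# size (in 512-byte blocks) and were modified within the last day.
recent_large_files() {
    find "$1" -type f -size +"$2" -mtime -1 -print
}

# Usage (hypothetical directory; 204800 blocks is 100 MB):
#   recent_large_files /some/dir/on/that/filesystem 204800
```

Running it shortly after the monitor is found dead, together with a look at free space on that filesystem, shows whether a file filled the disk around that time.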



------------------------------------------------------------------------------------

"Walking on water and developing software from a specification are easy if both are frozen." (Edward V Berard)
ivannavi
Premium Member
Posts: 120
Joined: Mon Mar 07, 2005 9:49 am
Location: Croatia

Post by ivannavi »

Thanks. I'll try that. :)