Job monitor stops every day

Posted: Fri May 11, 2007 1:43 am
by ivannavi
Every morning we find that JobMonApp is not running. Then we start it. Then it runs.

In the logs, the last lines read one of two things, depending on the day:
"Couldn't read from Socket: Connection reset"
and
"Could not write to subscriber socket."
I think it's when it dies. How do I find what's killing it?

Posted: Mon May 14, 2007 3:47 am
by gbusson
hi,

Check there is no other application running on the ports taken by the JobMonApp.

The ports are listed in the file $APT_ORCHHOME/etc/jobmon_ports

Another possibility

Posted: Mon May 14, 2007 2:39 pm
by asorrell
I was at a client that had a similar problem. We re-indexed all of his projects and the problem went away. Not sure what was corrupted, but jobmon halted with the same error message.

Posted: Mon May 14, 2007 4:45 pm
by ray.wurlod
Do "they" stop DataStage - for example for backups - in the middle of the night?

Posted: Wed May 16, 2007 1:28 am
by ivannavi
No, we don't stop DataStage. Who is 'they'?

Posted: Wed May 16, 2007 2:05 am
by ivannavi
jobmon_ports file says:
APT_JOBMON_PORT1=13400
APT_JOBMON_PORT2=13401
As for "Check there is no other application running on the ports taken by the JobMonApp", I did this:
$ netstat -a | grep 13400
tcp 0 0 *.13400 *.* LISTEN
$ netstat -a | grep 13401
tcp 0 0 *.13401 *.* LISTEN
Is this OK?
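
A possible way to repeat this check without typing the port numbers by hand is sketched below. It assumes the jobmon_ports file contains lines of the form APT_JOBMON_PORT1=13400, as quoted above; the function names jobmon_ports and check_jobmon_ports are made up for this example. Note that netstat alone only shows that something is listening, not which process owns the socket.

```shell
#!/bin/sh
# Print the port numbers found in a jobmon_ports-style file, one per line.
# Assumes lines of the form APT_JOBMON_PORT1=13400, as quoted in the post above.
jobmon_ports() {
    sed -n 's/^APT_JOBMON_PORT[0-9]*=\([0-9][0-9]*\).*$/\1/p' "$1"
}

# Show the listening sockets for each port named in the file.
# The separator before the port is "." on some systems and ":" on others,
# so the pattern accepts both.
check_jobmon_ports() {
    for port in $(jobmon_ports "$1"); do
        echo "port $port:"
        netstat -an | grep "[.:]$port " | grep LISTEN
    done
}
```

It could be called as: check_jobmon_ports $APT_ORCHHOME/etc/jobmon_ports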

Posted: Sun May 27, 2007 2:52 am
by ivannavi
I reindexed all projects. I still find Job Monitor not running almost every day. Any ideas?

Posted: Sun May 27, 2007 7:20 pm
by ray.wurlod
By "they" I mean operations staff, who may shut things down in order to take backups, etc.

Posted: Mon May 28, 2007 2:29 am
by ivannavi
Well, they don't stop anything without us knowing.

Posted: Thu Jun 14, 2007 7:49 am
by ivannavi
gbusson wrote:
Check there is no other application running on the ports taken by the JobMonApp
So I installed lsof, because it accepts a port number as an input parameter and shows what is using that port. Then I wrote a script to capture this into a log file every second. It is pasted below in case someone finds it useful. I was hoping to spot the killer application by examining the file when Job Monitor dies. I watched it for more than two weeks.
Nothing unusual. When no jobs were running, only two lines were present for each "query":
from lsof____ *:13400 29022 (LISTEN) from ps____ dsadm 29022 1 0 Jun 12 ? 8:09 /asc/Ascential/DataStage/DSEngine/java/jre/bin/PA_RISC2.0/java time____ 20070613 13:51:02
from lsof____ *:13401 29022 (LISTEN) from ps____ dsadm 29022 1 0 Jun 12 ? 8:09 /asc/Ascential/DataStage/DSEngine/java/jre/bin/PA_RISC2.0/java time____ 20070613 13:51:03
Additional lines were present when jobs were running (various phantoms, osh, sqlldr etc...), but nothing suspicious. At the time jobmonitor died this log also stopped receiving entries.
So I guess my problem is neither in the repository (which I reindexed) nor in some other application using the ports. What else should I try?
:evil:

The script:
#!/bin/sh
# Once per second, log which processes hold the Job Monitor ports.
default_ifs=$IFS
while true
do
    sleep 1
    for PORT in 13400 13401
    do
        # Split the lsof output on newlines only, so that each $I is one
        # whole line of the form: NAME PID (STATE)
        IFS="
"
        for I in $(/home/dsadm/lsof-4.77/lsof -i :$PORT | awk '{print $9, $2, $10}' | grep -v 'NAME PID')
        do
            IFS=$default_ifs
            # brojac = counter; the PID is the second field of $I
            brojac=1
            for J in $I
            do
                if [ $brojac = 2 ]
                then
                    echo 'from lsof____' $I ' from ps____' `ps -f -p $J | grep -v 'UID PID PPID C STIME TTY TIME COMMAND'` ' time____' `date +'%Y%m%d %H:%M:%S'`
                fi
                brojac=`expr $brojac + 1`
            done
        done
    done
done

Posted: Thu Jun 14, 2007 10:35 am
by ralleo
Sometimes restarting DataStage doesn't clear the sockets.

Before restarting DataStage, issue a uv -admin -clearsockets command. Sockets can get blocked and then have to be cleared by force.

See if this works.

Posted: Tue Jun 19, 2007 2:11 am
by ivannavi
Hey Ralleo, I'm not quite sure I understand. So let me go through this step by step:

1) Initially both DataStage and JobMonApp are up and running.
2) Then JobMonApp stops. DataStage is still up.
3) Then I stop DataStage.
4) Then I issue uv -admin -clearsockets command.
5) Then I start DataStage.
6) Then I hope JobMonApp won't die again.

Is that what you had in mind? Is this "clearsockets" necessary if I always wait for "netstat -a | grep dsr" to show nothing, at least not to show FIN_WAIT_2?
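
For reference, steps 3 to 5 might be scripted roughly as below. This is only a sketch under assumptions: it expects $DSHOME to be set (for example by sourcing dsenv), it must run as a user allowed to stop the engine, and the -clearsockets option is taken from this thread as written, not verified. The function names are made up for the example, and the wait loop has no timeout.

```shell
#!/bin/sh
# Succeeds when the netstat output given on stdin has no "dsr" lines left,
# i.e. no dsrpc connections in any state (including FIN_WAIT_2).
dsr_sockets_clear() {
    ! grep dsr >/dev/null
}

# Stop the engine, wait for its sockets to drain, clear sockets, start again.
restart_datastage() {
    "$DSHOME/bin/uv" -admin -stop

    # Step 3 is only complete once netstat shows no dsr sockets.
    until netstat -a | dsr_sockets_clear; do
        sleep 5
    done

    # Option name as quoted in this thread (assumption, not verified).
    "$DSHOME/bin/uv" -admin -clearsockets

    "$DSHOME/bin/uv" -admin -start
}
```

Nothing runs until restart_datastage is called explicitly.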

Posted: Tue Jun 19, 2007 5:28 am
by ralleo
Normally, when both DataStage and JobMonApp are running and JobMonApp stops, you don't have to stop DataStage. You can restart JobMonApp on its own.

I suggested the clear sockets command because you stopped and restarted DataStage. Before restarting DataStage it is good practice to clear the sockets, so that no blocked sockets affect it.

As for having to restart JobMonApp every day, it seems to me that some activity taking place on the server at certain times causes JobMonApp to fail.

If you log in to UV and type "STATUS", it will list all activity on the server. Review it and see which activities could cause this.
Also, if any large log files are generated at certain times on the same filesystem as JobMonApp, this can cause JobMonApp to stop.
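
One way to test the filesystem theory is to record disk usage with a timestamp and compare it with the time JobMonApp dies. The sketch below assumes a df that supports the POSIX -P option (one line per filesystem); the function names and the example directory are illustrative only.

```shell
#!/bin/sh
# Print the "Capacity" column (e.g. 87%) from POSIX-format df output on stdin.
# With "df -P" the header is line 1 and the filesystem is line 2.
df_capacity() {
    awk 'NR == 2 { print $5 }'
}

# Print one timestamped line with the usage of the filesystem holding $1.
log_disk_usage() {
    dir=$1
    echo "$(date +'%Y%m%d %H:%M:%S') $dir $(df -Pk "$dir" | df_capacity)"
}
```

Run from cron every few minutes, for example as log_disk_usage /asc >> /tmp/jobmon_df.log (both paths are placeholders), this leaves a trail that shows whether the filesystem filled up shortly before the monitor stopped.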

Posted: Wed Jun 20, 2007 3:57 am
by ivannavi
Thanks. I'll try that. :)