
AIX Unable To Detect DataStage Process

Posted: Fri Apr 19, 2013 5:59 am
by jerome_rajan
As part of our control framework, we are trying to design a flow in which a job does not start until the previous job completes. The script we have written is:

while [ `ps -ef|grep "phantom DSD.RUN job_jobName"|grep -iv -e "grep" -e "SH -c"|wc -l` -ne 0  ]; do sleep 2; done
However, the loop exits even while the process is still running. Why could this be happening?

Posted: Sun Apr 21, 2013 9:11 pm
by jwiles
Is your jobname stored in a variable, or is it hardcoded in the command (as in your example)?

Add -x to the shell interpreter line (#!) at the top of your script:

#!/usr/bin/ksh -x
to enable the shell to trace the execution of your script.
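If tracing the whole script is too noisy, a possible alternative (a sketch, not something suggested in the thread) is to switch tracing on only around the wait loop with set -x / set +x. The PS4 prompt shown here is an addition: ksh93 and bash expand it before each traced command, so including $LINENO tags every trace line with its script line number.

```shell
#!/usr/bin/ksh
# Sketch: trace only the wait loop instead of the whole script.
# PS4 is printed before each traced command; single quotes defer the
# expansion of $LINENO until trace time.
PS4='+ line $LINENO: '
set -x
while [ `ps -ef|grep "phantom DSD.RUN job_jobName"|grep -iv -e "grep" -e "SH -c"|wc -l` -ne 0 ]; do sleep 2; done
set +x
echo "wait loop finished"
```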

Regards,

Posted: Mon Apr 22, 2013 2:49 am
by jerome_rajan
Not much help. We could still see the job running after the while loop exited :(. Is DataStage refreshing the process, causing the loop to exit at that instant? What exactly is causing the shell to miss the process?

Posted: Mon Apr 22, 2013 9:06 am
by jwiles
Why wasn't it (the -x, I assume) much help? -x should have shown you how the commands executed and what the results resolved to, similar to the following:

++ ps -ef
++ grep 'phantom DSD.RUN job_jobName'
++ grep -iv -e grep -e 'SH -c'
++ wc -l
+ '[' 0 -ne 0 ']'
You see each command that is executed. Do they match your expectations? Does the grep argument exactly match what you see if you do this manually?

Is your jobname stored in a variable, or is it hardcoded in the command (as in your example)?

Regards,

Posted: Mon Apr 22, 2013 9:12 am
by jerome_rajan
The trace output is exactly what it should be. My concern, however, is that although the loop exits, the job in question is still running, in which case the script should not have exited. The job name is currently hardcoded for debugging.

Posted: Mon Apr 22, 2013 10:23 am
by priyadarshikunal
Did it go to sleep even a single time?

Posted: Mon Apr 22, 2013 10:47 am
by jerome_rajan
Yes, it did, but a random number of times in each run.

Posted: Mon Apr 22, 2013 10:48 am
by jwiles
Does the command line resolve to a 1 or 0 in the trace?

'[' 1 -ne 0 ']'

or

'[' 0 -ne 0 ']'

If it's resolving to 0, then the grep argument is probably not matching what ps -ef is actually printing, and that's what you will need to concentrate on. The logic itself works (I can do the same with running processes and it loops until I kill them), so what you're searching for is not quite right.
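One way to check the grep argument separately from the live process table is to wrap the filter in a function that reads ps -ef text on standard input. You can then run the identical filter against live output and against a saved snapshot. A minimal sketch; count_job_procs is a hypothetical helper name, and the job name is passed as the first argument.

```shell
#!/bin/sh
# Hypothetical helper: reads ps -ef style text on stdin and prints how many
# lines look like the job's phantom process ($1 = job name).
count_job_procs() {
  grep "phantom DSD.RUN $1" |
    grep -iv -e "grep" -e "SH -c" |
    wc -l |
    tr -d ' '   # AIX wc pads the count with leading spaces
}

# Live:             ps -ef | count_job_procs job_jobName
# Saved snapshot:   count_job_procs job_jobName < ./psef.out
```

Because the filter is identical in both cases, a count of 0 against a snapshot that visibly contains the job points at the pattern (or at the ps output itself) rather than at the loop logic.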

Regards,

Posted: Mon Apr 22, 2013 1:53 pm
by jwiles
To answer your earlier question: no, the job would not be "refreshing". But that does raise the question of the job you're testing this with: How long does it run? How are you starting or restarting it when it ends?

Regards,

Posted: Mon Apr 22, 2013 10:47 pm
by jerome_rajan
jwiles wrote: Does the command line resolve to a 1 or 0 in the trace?
It resolves to 0, and consequently the loop exits. But the pertinent question here is why the process/job that the shell presumed completed is still running. A ps -eaf immediately after the while loop exits still shows DSD.RUN job_Jobname in the list.
jwiles wrote: How long does it run? How are you starting/restarting it when it ends?
The original job can run anywhere from one to two hours. We created a test job that runs for much less time to aid in debugging the issue. The test job was supplied enough data to run for 5-10 minutes. It is a parallel job and has been running stand-alone thus far. The job has not been designed to restart after completion.

On that note, I'm beginning to think it's more of an AIX issue. I'm pretty sure that a parallel job wouldn't just break and disappear and then reappear just like that!

Posted: Tue Apr 23, 2013 8:47 am
by PaulVL
Please post the dsjob -run command that you are using.

Posted: Tue Apr 23, 2013 8:48 am
by jwiles
In order to debug this further, I would suggest breaking the command string into its separate commands and storing results in either files or variables so that you can see what is being seen at the time by the commands. Something like the following would be one way of accomplishing this:

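MyWordCount=1
while [ $MyWordCount -ne 0 ]
do
  # Save a snapshot of the process table so it can be examined after the loop exits
  ps -ef >./psef.out
  MyWordCount=`grep "phantom DSD.RUN job_jobName" ./psef.out|grep -iv -e "grep" -e "SH -c"|wc -l`
  # Poll every 2 seconds, as in the original loop
  sleep 2
done
echo "ps -ef output at time of death:"
cat ./psef.out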
This will place the output of the ps -ef command into a file, which you can then examine when the script exits to see if ps's output is the trigger. You can break the command string down further as necessary.
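A possible extension of this idea (a sketch, not from the original post) is to keep two snapshots: the last one in which the job was matched and the one that ended the loop. Searching both with the shorter pattern "DSD.RUN" can reveal a job line that is still present but was cut off part-way through the job name. The file name psef.prev is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: keep two snapshots.
# psef.prev = last snapshot in which the job was matched
# psef.out  = the snapshot that made the loop exit
JOB=job_jobName               # hardcoded, as in the thread
: > ./psef.prev               # start with an empty "previous" snapshot
while :
do
  ps -ef > ./psef.out
  count=`grep "phantom DSD.RUN $JOB" ./psef.out | grep -iv -e "grep" -e "SH -c" | wc -l`
  [ $count -eq 0 ] && break
  cp ./psef.out ./psef.prev
  sleep 2
done
# Shorter pattern: still matches lines truncated part-way through the job name
echo "DSD.RUN lines in the last matching snapshot:"
grep "DSD.RUN" ./psef.prev || echo "  (none)"
echo "DSD.RUN lines in the snapshot that ended the loop:"
grep "DSD.RUN" ./psef.out || echo "  (none)"
```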

Regards,

Posted: Tue Apr 23, 2013 4:52 pm
by PaulVL
One thing to remember is that a LONG "ps -ef" line may get truncated due to your shell's default COLUMNS setting, in which case you might not be able to grep for the exact job name.
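A rough way to look for this kind of truncation in a saved snapshot (a heuristic sketch, not a definitive test) is to check the length of the longest line. If many lines stop at exactly the same width, for example the value of COLUMNS or 80, the command column is probably being cut off. max_line_len is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the length of the longest line on stdin (0 if none).
max_line_len() {
  awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Usage against a snapshot taken by the wait loop:
#   ps -ef > ./psef.out
#   echo "longest line: `max_line_len < ./psef.out`, COLUMNS=${COLUMNS:-not set}"
```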

Posted: Tue Apr 23, 2013 10:59 pm
by jerome_rajan
Thanks James & Paul. Will try your suggestions and report back shortly. Paul's idea seems reasonable. I had initially suspected that the page display limit was affecting the grep output (which, as I found, was not the case).

Posted: Wed Apr 24, 2013 5:13 pm
by PaulVL
I mention it because I ran into that problem with the Grid Enablement Toolkit (a few releases ago). They also parsed the ps line and were limited to the COLUMNS setting of the shell script that was used to bounce the DataStage engine (yes, it was that tricky).

So it was a lesson learned for me: expand the ps output fully if I want to grep or parse something from it.

On Linux (SUSE) I would run "ps -efww | grep ..." to get the full-width ps text. The ww option is not supported on AIX, but something equivalent is out there.
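Putting these suggestions together, the original wait loop could read untruncated ps output instead. This is a sketch under assumptions: full_ps and wait_for_job are hypothetical names, and the AIX branch assumes that the Berkeley-style "ps auxww" (no leading dash) gives unlimited-width output on your AIX level, which you should confirm in the ps man page before relying on it. The loop tests grep's exit status directly, so the wc -l count is no longer needed.

```shell
#!/bin/sh
# Print the process list without truncating the command column.
# Linux (procps): "ps -efww", as in the post above.
# AIX: ASSUMPTION, Berkeley-style "ps auxww" is used as the wide equivalent.
full_ps() {
  case "$(uname -s)" in
    AIX) ps auxww ;;
    *)   ps -efww ;;
  esac
}

# Block until no line of the full process list shows the job's phantom.
# $1 = job name (e.g. job_jobName), $2 = poll interval in seconds (default 2).
wait_for_job() {
  # grep -v exits 0 only if at least one line survives the filter
  while full_ps | grep "phantom DSD.RUN $1" | grep -iv -e "grep" -e "SH -c" >/dev/null
  do
    sleep "${2:-2}"
  done
}
```

Usage would be along the lines of `wait_for_job job_jobName 2` before starting the next job.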