LSF and Datastage

skathaitrooney · Post by **skathaitrooney** » Sat Mar 28, 2015 2:26 am

Hello Experts,

I have faced some problems while using LSF as a resource manager.
Many times I have noticed in LSF that my job is in PENDING or RUNNING status. A few minutes later the job disappears from LSF, but the job's phantom process keeps running on my server. When I opened Director, I saw that the job had hung after only the first 3 lines of its log.
Is there a parameter that has to be defined in LSF, or something similar?

Could anyone suggest something?

Thanks Already!

PaulVL · Post by **PaulVL** » Mon Mar 30, 2015 7:18 am

Check your ability to execute your jobs on the compute nodes. My guess is that you lack SSH keys to them.

Do this:

bhosts

Take that list of servers and SSH down to them one at a time from your head node.

Ensure that each compute node has the mount for your JOBDIR as defined in the grid_global_values file in your $GRIDHOME path. Make sure that the user id you are running has rw access to that path.
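
The per-host check described here can be scripted. This is a minimal sketch, assuming `bhosts -w` prints a header row followed by one host per line with the host name in the first column. The function names `list_lsf_hosts` and `check_hosts` are made up for illustration, and the JOBDIR path is passed in by hand rather than read from grid_global_values.

```shell
# Print the host names reported by LSF (first column, header row skipped).
list_lsf_hosts() {
    bhosts -w | awk 'NR > 1 { print $1 }'
}

# For each LSF host, attempt a non-interactive SSH login and test that the
# given directory exists there and is readable and writable.
# Assumes the path contains no single quotes.
check_hosts() {
    jobdir=$1
    for lsf_host in $(list_lsf_hosts); do
        rc=0
        ssh -n -o BatchMode=yes -o ConnectTimeout=10 "$lsf_host" \
            "test -d '$jobdir' && test -r '$jobdir' && test -w '$jobdir'" || rc=$?
        case $rc in
            0)   echo "OK $lsf_host" ;;
            255) echo "SSH_FAIL $lsf_host" ;;     # ssh itself could not connect or log in
            *)   echo "JOBDIR_FAIL $lsf_host" ;;  # login worked, directory test failed
        esac
    done
}
```

Run it from the head node as the same user id the job runs as, for example `check_hosts /path/to/jobdir` (the path is a placeholder). An `SSH_FAIL` line points at keys or the account on that host; a `JOBDIR_FAIL` line points at the mount or its permissions.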

skathaitrooney · Post by **skathaitrooney** » Fri Apr 17, 2015 4:28 am

Hi Paul,

SSH is working fine, and each compute node has the JOBDIR mounted correctly.

I debugged a little more and found that the sequencer.sh script is still running under the phantom process, so a compute node never gets assigned to the job.

The sequencer.sh process keeps creating a child process "sleep 5".
This is why my sequence is stuck at the first 3 log lines and not progressing.

Any idea where the hang could be?

PaulVL · Post by **PaulVL** » Fri Apr 17, 2015 8:07 am

I'm not a big fan of the sequence.sh process. I prefer to load balance each job to the grid. I do understand the thought process behind the sequence.sh script... I just don't have to agree with it.

</rant>

You're going to have to debug your stuff by tracing the pids back to owner, then introducing debug statements to sequence.sh.

Get out of datastage and run the test.sh script from GRIDHOME using the same user id that your job would run as.

Ensure that your JOBDIR is the same as the one used by your project.

Ensure that the APT_GRID_QUEUE is the same too. Any additional APT_GRID_OPT values should also be introduced into your test.sh script.

If that works (after a few executions) then your issue is at least not something basic.
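
Repeating the test a few times is easy to script. A rough sketch: `run_repeatedly` is a hypothetical helper, and it assumes the command you give it (for example test.sh in GRIDHOME, whose arguments are not shown in this thread) signals failure through a non-zero exit code.

```shell
# Run a command a given number of times and report how many runs failed.
# Usage: run_repeatedly COUNT command [args...]
run_repeatedly() {
    count=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        if "$@"; then
            echo "run $i: ok"
        else
            # $? here is still the exit status of the command that just failed
            echo "run $i: failed (exit $?)"
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "failures: $failures of $count"
    # Return success only if every run succeeded
    [ "$failures" -eq 0 ]
}
```

For example, `run_repeatedly 5 "$GRIDHOME/test.sh"`, run as the job's user id after test.sh has been made to match the project's JOBDIR, APT_GRID_QUEUE and any APT_GRID_OPT values.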

Are ALL datastage grid jobs hanging? Or just sequence.sh ones?

Upon which log message is the job hanging?

<Waiting to be released from queue> ??

Or prior to that?

Maybe you are not passing the output from your sequence.sh into the APT_CONFIG parm of your subsequent jobs. Your APT file would then be nonexistent, and that might cause it to hang.

Look at the job parms of that job that is hanging. Are they what you expect?

skathaitrooney · Post by **skathaitrooney** » Fri Apr 17, 2015 8:30 am

Thanks Paul for reviewing this.
Here are the answers to your questions.

The job is hanging here:

Starting Job seq_DL_MA_PF_USER
Environment variable settings:
seq_DL_MA_PF_USER: Set NLS locale to US-ENGLISH,US-ENGLISH,US-ENGLISH,US-ENGLISH,US-ENGLISH
seq_DL_MA_PF_USER..JobControl (@Coordinator): Starting new run of checkpointed Sequence job

That means sequencer.sh is running.

Also, all of my sequences use the sequencer.sh script, which returns the APT_CONFIG_FILE that is passed to all the parallel jobs in that sequence.

This issue does not come up for every job. It randomly affects 4-5 jobs. When I kill an affected job and rerun it, it runs fine.

As I understand it, sequencer.sh is spawning a "sleep 5" process. I have tried strace and gdb but couldn't trace what is actually happening here.

PaulVL · Post by **PaulVL** » Fri Apr 17, 2015 11:07 am

Did the sequencer.sh script start?
Is that PID active?
Did the script submit a job to the grid via DynamicGrid.sh?
Did you look at that output?
What is the content of your JOBDIR path for that execution?
Your SLEEP 5 command has a parent pid. Probably the /jobdirpath/...../blahblah.wait job. See what it's looping on. Most likely the PID of a job.
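
To see what owns the sleep, you can walk the parent chain of its PID. A small sketch using only standard `ps` options; `pid_chain` is a name invented here.

```shell
# Print "pid ppid command" for a process and each of its ancestors,
# stopping at PID 1 or as soon as a process can no longer be found.
pid_chain() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
        # Separate -o options with empty headers suppress the header line
        ps -o pid= -o ppid= -o args= -p "$pid" || break
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    done
}
```

Call it as `pid_chain <pid of the sleep 5>`. Because the sleep is respawned every few seconds, `pid_chain "$(pgrep -n -x sleep)"` may be easier; that picks the newest process named sleep, so check that it is the right one.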

Are you messing with PATH or JAVAPATH in your sequencer?

Can you go out to your compute node and see if something is running there?

Intermittent, huh... hmmm.

Check your SSH configuration and look at MaxStartups.
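
One way to see what MaxStartups is set to, as a sketch: it reads the first uncommented `MaxStartups` line from an sshd_config file (sshd uses the first value it finds for a keyword). The function name and the default path `/etc/ssh/sshd_config` are assumptions; your sshd may read a different file or include others.

```shell
# Report the MaxStartups value configured in an sshd_config file.
# Assumes the "Keyword value" form separated by whitespace; commented
# lines start with "#" and therefore never match.
show_maxstartups() {
    conf=${1:-/etc/ssh/sshd_config}
    value=$(awk 'tolower($1) == "maxstartups" { print $2; exit }' "$conf")
    if [ -n "$value" ]; then
        echo "MaxStartups $value"
    else
        echo "MaxStartups not set in $conf (sshd built-in default applies)"
    fi
}
```

Run it on the head node and on each compute node, since each host has its own sshd configuration.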

Do you have other intermittent problems in your environment? Intermittent errors with LDAP authentication, maybe?

skathaitrooney · Post by **skathaitrooney** » Mon Apr 20, 2015 7:09 am

Paul,

sequencer.sh is running nonstop and its PID is active. Ideally it should run for only 4-5 seconds, as we see for other jobs.

In sequencer.sh, DynamicGrid.sh is sourced.

And sleep 5... its parent id is sequencer.sh

JAVAPATH is not being messed with.

And no process related to the job is running on the compute node, probably because the config file is not generated.

In the SSH configuration the MaxStartups line is commented out. Also, I don't think MaxStartups should be a problem.

Couldn't find any errors related to LDAP authentication.

As far as I can tell, sleep 5 is being executed again and again, since its PID changes every 5 seconds. This is why sequencer.sh is not able to complete.

PaulVL · Post by **PaulVL** » Mon Apr 20, 2015 8:08 am

There is no SLEEP 5 seconds in sequence.sh nor in Dynamic_grid.sh. There will be a SLEEP 5 in the shell script that the java code creates on the fly.

Look for a PID with the following "blah blah.<jobname>_<a pid#>.wait".
It will have a loop, checking for the existence of a PID (one of your OSH processes).
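
A sketch for spotting that process: it lists every process whose command line contains `<jobname>_<digits>.wait`. The function name is hypothetical, and the pattern is only inferred from the naming described in this post, so adjust it if your toolkit version names the file differently.

```shell
# List "pid ppid command" for processes whose command line contains
# <jobname>_<digits>.wait. Assumes the job name has no regex metacharacters.
find_wait_procs() {
    job=$1
    ps -e -o pid= -o ppid= -o args= |
        awk -v pat="${job}_[0-9]+[.]wait" '$0 ~ pat'
}
```

For the sequence in this thread that would be `find_wait_procs seq_DL_MA_PF_USER`. No output means no such wait script is currently running for that job.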

Here's what I would do next:

Look at the QUEUE you are executing against.
Find all of the hosts accessible in that QUEUE.
Use the same user id that you are using to run the job and manually SSH/RSH to each box one at a time. If you lack an SSH key to that box (or if your user id doesn't even exist there) you would see this type of behavior.
When on the box, look to see if the user ID has RW permissions to JOBDIR.

The random nature of the failure implies that 1 host out of your pool may be tripping you up. When you resubmit you hit a different server (thus it works).

skathaitrooney · Post by **skathaitrooney** » Tue Apr 21, 2015 5:03 am

Paul,

SSH is confirmed to be enabled for all compute nodes.
Also the JOBDIR is shared and the user id has RW permissions.

I am waiting for the next time this issue occurs.

Had a chat with IBM; they asked me to check the following:

1. Check if any hung sockets are present on the head node:

netstat -a|grep dsrpc

2. They also asked me to monitor the &PH& directory when the issue occurs, to see whether any file is being generated at that time.

3. They also asked to capture the following:

ps -ef|grep dsapi
ps -ef|grep dscs
ps -ef|grep phantom
cd $DSHOME (i.e. DSEngine directory)
. ./dsenv (source the dsenv)
./bin/smat -a
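
Since the hang is intermittent, it may help to have these captures ready in one script to run the moment a job hangs. A sketch only: `capture_diag` and the output file naming are invented here, the `&PH&` directory is not covered, and the smat step is skipped unless `DSHOME` is set and contains `bin/smat`.

```shell
# Collect the requested captures into one timestamped file and print its path.
capture_diag() {
    outdir=${1:-.}
    out="$outdir/hang_diag_$(date +%Y%m%d_%H%M%S).txt"
    {
        echo "== date =="
        date
        echo "== netstat -a | grep dsrpc =="
        netstat -a 2>/dev/null | grep dsrpc || true
        for name in dsapi dscs phantom; do
            echo "== ps -ef | grep $name =="
            ps -ef | grep "$name" | grep -v grep || true
        done
        echo "== smat -a =="
        if [ -n "${DSHOME:-}" ] && [ -x "$DSHOME/bin/smat" ]; then
            # Subshell so that cd and dsenv do not alter the calling shell
            ( cd "$DSHOME" && . ./dsenv && ./bin/smat -a ) || true
        else
            echo "skipped: DSHOME not set or bin/smat not found"
        fi
    } > "$out" 2>&1
    echo "$out"
}
```

Running `capture_diag /tmp` on the head node writes a file such as `/tmp/hang_diag_<timestamp>.txt` that can be attached to the support case.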

Let's see if we find anything. I'll share the findings, if any.

PaulVL · Post by **PaulVL** » Tue Apr 21, 2015 8:02 am

What version of the Grid Enablement Toolkit are you using?

skathaitrooney · Post by **skathaitrooney** » Tue Apr 21, 2015 8:34 am

It's 4.3.3

PaulVL · Post by **PaulVL** » Tue Apr 21, 2015 10:42 am

Current release is 5.0.7. I do recommend upgrading.

Lots of fixes, including testing of return codes (rolls eyes) from java calls.

New JOBDIR directory structure.
Different form of handshaking with compute nodes to coordinate jobs.
Will create named pipes on Head Node for that handshake.
Makes Julian fries (or maybe Julius fries).

New requirement is a LOCAL mount for those FIFO files.

I'd also double your max SSH settings per user id just to be safe.

Just as easy to install as the other versions.

skathaitrooney · Post by **skathaitrooney** » Tue Apr 28, 2015 12:37 am

Hey Paul,

I am confused now about what is actually causing the issue. Earlier I thought sequencer.sh was causing the problems. We were using an Execute Command activity stage to trigger sequencer.sh.

But just yesterday another job hung, and in this case a script was being called via an Execute Command activity stage. This job also got stuck at the same point as the previous job. Is there a chance that the problem is in the Execute Command activity itself?

PaulVL · Post by **PaulVL** » Tue Apr 28, 2015 8:50 am

*shrug*

Looks like you have some debugging to do.

Write a diagnostic job that loops. Open an execute stage and execute a script within it. The script should have mundane commands in it: ls, pwd, hostname, etc.

Basically the contents of the script should never fail.

If the looping activity causes a hang, then you know you've got bigger issues. Loop a large number of times, 500 or 1000. Make that job run for at least an hour.
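
A script along these lines could serve as the body of that diagnostic job. It is only a sketch: the function name `mundane_check` and the log path are placeholders, and it appends one line per call so that after a hang you can see the last iteration that completed.

```shell
# Run a few commands that should never fail and append one log line per call.
mundane_check() {
    log=${1:-/tmp/mundane_check.log}
    {
        printf '%s ' "$(date '+%Y-%m-%d %H:%M:%S')"
        printf 'host=%s ' "$(hostname)"
        printf 'pwd=%s ' "$(pwd)"
        # Count of entries in the current directory, just to exercise ls
        printf 'files=%s\n' "$(ls | wc -l | tr -d ' ')"
    } >> "$log"
}
```

Calling it from a loop in a plain shell (`i=0; while [ "$i" -lt 1000 ]; do mundane_check; i=$((i + 1)); done`) gives a baseline outside DataStage to compare against the looping sequence.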

Intermittent hangs are hard to track. When you do get a hang, dig deep into your cranium to understand exactly which resources are being used when that activity happens. (Remember that .profile is often overlooked when opening an execute stage.)

Good luck.