Semaphores are building up to a very high number

Posted: Tue Aug 13, 2013 4:15 am
by ulab
Hello DS Friends,

We faced an issue where jobs were running long on the XYZ server. On checking, we found that the jobs were not going to the Grid queue, and we saw this error: Port Library failed to initialize, Could not create the Java virtual machine.
After googling the issue, we learned that the problem is with the semaphores when the count reaches 1000000 (ipcs -rsa | wc -l). Now my question to all my DS friends is:

What causes the semaphores to keep building up? We are badly impacted by this issue, and it happens every week in every environment. The only workaround we have now is to fail over to the secondary node/server and reboot the primary server. Please share your thoughts and experiences on the root cause of the semaphore build-up.

NOTE: We submitted a PMR to IBM, but there is no resolution or other workaround from them yet.

Posted: Tue Aug 13, 2013 4:35 am
by ray.wurlod
An internet search will reveal to you that a semaphore is a place for a program to wait for some event to occur (even for an amount of time to elapse). Semaphores are implemented in different ways on different platforms, but in all cases they should be released once finished with. If they aren't, then there's some problem with the application - you really will need to wait for your official support to help to diagnose, since there are many different processes making up most DataStage applications.

Posted: Mon Aug 19, 2013 9:46 am
by PaulVL
What version of DataStage are you running?

The ODBC manager on a particular release had some semaphore issues which caused them to never get released.

I ran into that at my previous client outside of Boston. 9.1 FP2 resolved the issue, I believe, but I left before we applied it to PROD (the only environment that had the issue). We had to execute a manual cleanup until FP2 was installed.

Posted: Mon Aug 19, 2013 9:58 am
by asorrell
Are you on AIX? My current client has this issue (they are running 8.7 on AIX 6.1) and we are working with IBM to get a fix. Apparently the issue has been identified, but there is no resolution at this time.

We investigated using ipcrm to remove the semaphores, but there were considerable drawbacks / risks involved. Frequent system re-boots were deemed the safest workaround.

I wrote a script that checks the ipcs count every 30 minutes and writes it out to disk so we can see how the count is creeping up. On our very busy system the ipcs command will occasionally time out, causing it to return an error instead of a semaphore list, but other than that it works well.

Code:

# Append a timestamp and the ipcs line count to a log file every 30 minutes.
while true
  do
    echo $(date) $(ipcs -rs | wc -l) >> /home/asorrell/ipcs.out
    sleep 1800
  done
I just submit this in the background with the nohup option and let it run till we reboot.
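
Since ipcs can occasionally time out on a busy system, one possible variation is to log an explicit ERROR marker instead of a misleading count. This is only a sketch: the function name log_sem_count and the IPCS_CMD override are hypothetical, and it assumes ipcs exits with a non-zero status when it fails, which should be verified on your platform.

```shell
# Hypothetical helper: print "<timestamp> <line count>" or "<timestamp> ERROR".
# IPCS_CMD is an assumed override hook (not part of ipcs) so a stub can be
# substituted; by default the same "ipcs -rs" command as above is used.
log_sem_count() {
    stamp=$(date '+%Y-%m-%d %H:%M:%S')
    # Assumption: a timed-out ipcs returns a non-zero exit status.
    if out=$(${IPCS_CMD:-ipcs -rs} 2>/dev/null); then
        # Like the original loop, this counts every output line,
        # including any header lines ipcs prints.
        count=$(printf '%s\n' "$out" | wc -l | tr -d ' ')
        echo "$stamp $count"
    else
        echo "$stamp ERROR"
        return 1
    fi
}

# Possible use inside the 30-minute loop:
#   while true; do log_sem_count >> /home/asorrell/ipcs.out; sleep 1800; done
```

With this shape, gaps caused by timeouts show up as ERROR lines in the log rather than as sudden drops in the count.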

Posted: Mon Aug 19, 2013 10:27 am
by PaulVL
ipcrm -ruo <userid #1>
ipcrm -ruo <userid #2>
etc...

That command will clear out the unreleased semaphores. Should leave the currently used ones unaffected.

My old client was on AIX as well. IBM had indicated that the issue was associated with a bad ODBC driver manager. "Should be fixed" in FP2 for 9.1. But I never got a chance to validate that.

I can ask my old teammates whether they have applied the patch yet.

Posted: Mon Aug 19, 2013 1:32 pm
by asorrell
We thought about using the same ipcrm command but the majority of our jobs are run by dsadm. IBM Engineering said that they could not guarantee that using ipcrm against dsadm on a running system would not have any side-effects, so we decided to skip that and reboot. I still think it would be safe, but if IBM said no, I wasn't willing to put my neck on the line to try it.

Now, if you are on a system where most of the jobs are run under individual user IDs, using the ipcrm command against user IDs that are currently not logged in (but have left semaphores behind) should be 100% safe.
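
To see which user IDs are holding the semaphores before deciding on any cleanup, a per-owner tally can help. The sketch below is an illustration under assumptions, not a tested procedure: the function name count_sems_by_owner is hypothetical, and it assumes the AIX-style ipcs layout where semaphore rows start with "s" in the first column and the owner name is in a column you pass in (check your own ipcs output first).

```shell
# Hypothetical helper: tally semaphore rows per owner from ipcs output on stdin.
# $1 is the 1-based column that holds the owner name (an assumption about the
# output layout; it differs between platforms and ipcs options).
count_sems_by_owner() {
    # Assumption: semaphore rows have "s" in the first column (AIX-style "T" column);
    # header and other lines are skipped.
    awk -v col="$1" '$1 == "s" { n[$col]++ } END { for (o in n) print n[o], o }' |
        sort -rn
}

# Possible use, assuming the owner is in column 5:
#   ipcs -s | count_sems_by_owner 5
```

The output is one "count owner" line per user ID, largest first, which makes it easier to tell whether the build-up sits under dsadm or under individual users.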