Unable to start ORCHESTRATE job

Posted: Tue Jun 21, 2011 2:08 pm
by JPalatianos
Hi,
A few months back, many of our parallel jobs were hanging on the OSH script (...) step. We opened a PMR with IBM, and they suggested installing Fix Pack 3; we were running version 8.0.1 Fix Pack 1 at the time. We installed Fix Pack 3 and, for the past month, have been receiving the following error:

main_program: Fatal Error: Unable to start ORCHESTRATE job:
APT_PMwaitForPlayersToStart failed while waiting for players to confirm
startup. This likely indicates a network problem.
Status from APT_PMpoll is 0; node name is node0

Once one parallel job gets this error, no other parallel job will run until we bounce the server. (I have a dummy job that tests this with a rowgen going to a peek.) We are currently bouncing the server 3 to 4 times a day to allow our ETL processes to run in production.
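
For anyone wanting to automate a check like the dummy job above, here is a minimal sketch in Python. It is not what we actually run. It assumes the canary job can be launched from a command line, that exit code 0 means success (adjust this for whatever launcher you use), and that a run exceeding the timeout counts as a hang. The `run_canary` name, the placeholder command, and the 300-second timeout are all hypothetical.

```python
import subprocess
import sys


def run_canary(cmd, timeout_s):
    """Run a canary command and classify the outcome.

    Returns "ok" if the command exits with code 0, "failed" if it exits
    with any other code, and "hung" if it does not finish within
    timeout_s seconds. Treating exit code 0 as success is an assumption.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Placeholder command: replace with whatever launches your canary job.
    canary_cmd = [sys.executable, "-c", "print('canary ran')"]
    print(run_canary(canary_cmd, timeout_s=300))
```

A scheduler could call this every few minutes and raise an alert on "hung" or "failed", so the stuck state is noticed before production jobs start failing.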

This has no effect on our server jobs. A new PMR has been opened with IBM, but we are getting nowhere; their only suggestion is that upgrading to 8.5 may resolve the issue. They also had us turn McAfee virus scanning off on certain directories, thinking that might be the culprit, but that did not help.

I have read the other posts for this error and did not find much.

Any suggestions would be appreciated.
Thanks - -John

Posted: Wed Jun 22, 2011 7:09 am
by mhester
John,

I might have IBM support focus on the dsdlockd process. It can cause hangs like the ones you are experiencing. This may not be your issue, but it is worth the time to investigate.

We had this very issue, and it was related to the ownership of the dsdlockd process. We are on Unix, but maybe this is a Windows issue too.

Worth a shot.

Posted: Wed Jun 22, 2011 8:16 am
by JPalatianos
Mike,
Thank you for that information. I will let IBM know and see what they come back with.
Thanks - - John

Posted: Wed Jun 22, 2011 8:22 am
by mhester
You are most welcome - report back and let us know what they find.

Posted: Wed Jun 22, 2011 11:44 am
by JPalatianos
IBM just got back with the following:

1) Patch being created to backport the environment variables - APT_PM_PLAYER_CONNECT_TIMEOUT, APT_PM_PLAYER_TIMEOUT - which will let the system keep running jobs even though it is overloaded
2) Get ETA for patch
3) Set up conference call with Prudential to explain what this patch will actually do and what suggestions we have going forward.


I will update when we have the patch available and installed.
Thanks - - John

Posted: Wed Jun 22, 2011 1:18 pm
by mhester
John,

Thanks for the update!

Posted: Mon Aug 22, 2011 1:33 pm
by JPalatianos
The latest on this issue: IBM has narrowed the problem to the MKS Toolkit. They had us generate many logs for them to analyze, and they should be getting back to us with a resolution.

Posted: Fri Sep 16, 2011 8:42 am
by JPalatianos
Hi,
Per IBM, we have applied an upgrade to our MKS Toolkit (via a patch supplied to us by IBM), and 2 weeks later it seems to have taken care of our Orchestrate issues.
Thanks - - John