Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForP
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForP
We are getting this error randomly:
Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForPlayersToStart failed while waiting for players to confirm startup. This likely indicates a network problem.
When I searched for this error on Google, I found the following solution on the IBM site:
Resolving the problem
Generally you will see this error when you have reached the limit on the number of processes that can be created (MAXUPROC). You should run this command to see the maximum number of processes defined:
lsattr -E -l sys0 | grep maxuproc
To monitor actual usage, you can try this command while the jobs are running to see the number of processes that are running for the given user:
ps -ef|grep |wc -l
You will need to consider increasing the maxuproc value to accommodate the workload expected on your system.
But it looks like this works only on UNIX/Linux. Our DataStage server is installed on Windows Server.
So is there an equivalent command on Windows to check the maxuproc value and increase it?
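For reference, the per-user process count that the quoted `ps -ef | grep | wc -l` pipeline approximates can be sketched in Python. This is only a sketch under stated assumptions: it expects `ps -ef`-style text with the user in the first column, so it applies to UNIX-like systems only, and the function name, the `dsadm` user, and the sample data are hypothetical. Counting on the first column avoids counting the `grep` process itself or unrelated lines that merely contain the user name.

```python
from collections import Counter

def count_processes_by_user(ps_output: str) -> Counter:
    """Count processes per user from `ps -ef`-style text.

    Assumes the first line is a header and the first column is the user.
    """
    counts = Counter()
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts

# Hypothetical sample data, not taken from a real system.
SAMPLE = """\
UID   PID  PPID  C STIME TTY  TIME     CMD
dsadm 101  1     0 10:00 ?    00:00:01 osh
dsadm 102  101   0 10:00 ?    00:00:00 osh
root  1    0     0 09:00 ?    00:00:05 init
"""

if __name__ == "__main__":
    for user, n in count_processes_by_user(SAMPLE).items():
        print(user, n)
```

The count for the job-running user can then be compared with the maxuproc value reported by the `lsattr` command above.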
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
-
- Participant
- Posts: 102
- Joined: Thu Sep 17, 2009 1:23 am
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
Lookup stages load data to memory, so make sure there is enough physical memory to support all concurrent lookups.
Lookup stages also use temp space. Make sure you have enough temp space to support all concurrent lookups.
If you are short on memory or temp space and cannot get more, you could replace the lookup with a left outer join. The Join stage requires sorted inputs, so the inserted sort operation(s) will use scratch disk space.
Are your resource disk, scratch disk and temp space all on separate mount points / file systems?
Mike
I've seen this problem before on Windows and a combination of changing the settings mentioned earlier and otherwise reducing the load on the server made a big difference, although we did still occasionally get the same error message.
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
Scratch space and temp space are all on the same D: drive, but we monitored the space utilization. Almost 50% of the space on the D: drive is free at peak times.
So physically there is no space issue. Is there something to modify in the uvconfig file, or do the values of some environment variables need to be increased?
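A free-space figure like the one quoted above can be sampled from a script while jobs are running. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the function name is hypothetical, and the path passed in (for example the drive that holds scratch and temp space) is an assumption to adjust for the actual server.

```python
import shutil

def free_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of free space on the volume holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total

if __name__ == "__main__":
    # "." is a placeholder; on the server described above this might be "D:\\".
    print(f"{free_fraction('.'):.0%} free")
```

Sampling this at intervals during peak load gives a record of the lowest free-space point rather than a single reading.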
Actually, I do not believe that the Lookup stages have anything to do with your issue.
The error names APT_PMwaitForPlayersToStart, which implies that your jobs are failing in the startup phase, not the execution phase.
Look at your APT configuration file. If your nodes are dispatched to different fastname hosts, then you are in a clustered/grid environment.
Set APT_PM_PLAYER_CONNECT_TIMEOUT and APT_PM_PLAYER_TIMEOUT to 300 if not already done. These two timeout values will not affect job execution; the engine will just wait a little longer before giving up on a job and causing it to abort. The values are in seconds.
Based upon your findings in the APT file, you might see if the failure always happens on the same target node (host).
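To see whether nodes are spread across hosts, the node-to-fastname mapping can be pulled out of the configuration file with a short script. This is a sketch only: it assumes the file is the parallel configuration file that the APT_CONFIG_FILE environment variable usually points to, that every node block contains a fastname entry, and that the layout resembles the hypothetical sample below. The function names, host names, and paths are made up for illustration.

```python
import re

# Matches: node "<name>" { ... fastname "<host>"
# Assumes every node block contains a fastname entry.
NODE_PATTERN = re.compile(
    r'\bnode\s+"([^"]+)"\s*\{.*?\bfastname\s+"([^"]+)"', re.DOTALL
)

def fastnames_by_node(config_text: str) -> dict:
    """Map each node name to its fastname host."""
    return dict(NODE_PATTERN.findall(config_text))

def is_multi_host(config_text: str) -> bool:
    """True if the nodes are dispatched to more than one distinct host."""
    return len(set(fastnames_by_node(config_text).values())) > 1

# Hypothetical two-node configuration, for illustration only.
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "hostA"
    pools ""
    resource disk "D:/data" {pools ""}
    resource scratchdisk "D:/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "hostB"
    pools ""
    resource disk "D:/data" {pools ""}
    resource scratchdisk "D:/scratch" {pools ""}
  }
}
'''

if __name__ == "__main__":
    for node, host in fastnames_by_node(SAMPLE_CONFIG).items():
        print(node, "->", host)
```

If every node maps to the same host, the job runs on a single machine and the failure cannot be tied to one particular remote node.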
-
- Premium Member
- Posts: 17
- Joined: Wed Mar 19, 2014 3:53 pm
We have increased the values below from 60 to 120:
APT_PM_CONDUCTOR_TIMEOUT=120
APT_PM_PLAYER_TIMEOUT=120
APT_PM_PLAYER_CONNECT_TIMEOUT=120
APT_PM_NODE_TIMEOUT=120
After this there have been no failures for the last 5 days, but it is too early to draw a conclusion.
Where can I find the APT file, i.e. what is its path? Is an APT file created per job, or is there one file per project?
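The four settings listed above could be verified in the environment a job actually runs in with something like the following. It is a sketch, assuming the variables are visible as ordinary environment variables to the process running the check; the function name and the threshold are hypothetical.

```python
import os

# Variable names taken from the list above.
TIMEOUT_VARS = (
    "APT_PM_CONDUCTOR_TIMEOUT",
    "APT_PM_PLAYER_TIMEOUT",
    "APT_PM_PLAYER_CONNECT_TIMEOUT",
    "APT_PM_NODE_TIMEOUT",
)

def timeouts_below(minimum: int, env=None) -> dict:
    """Return the variables that are unset, non-numeric, or below `minimum` seconds.

    Maps each offending variable name to its raw value (None if unset).
    """
    env = os.environ if env is None else env
    report = {}
    for name in TIMEOUT_VARS:
        raw = env.get(name)
        try:
            value = int(raw) if raw is not None else None
        except ValueError:
            value = None  # treat non-numeric values as not set correctly
        if value is None or value < minimum:
            report[name] = raw
    return report

if __name__ == "__main__":
    # 120 mirrors the values above; use 300 to check the earlier suggestion.
    for name, raw in timeouts_below(120).items():
        print(f"{name} is {raw!r}, expected at least 120")
```

An empty result means all four variables are set to at least the requested number of seconds.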