Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForP

hexaware_tmk · Post by **hexaware_tmk** » Tue Nov 04, 2014 3:31 pm

We are getting this error randomly
Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForPlayersToStart failed while waiting for players to confirm startup. This likely indicates a network problem.

When I searched for this error on Google, I found the solution below on the IBM site:

Resolving the problem
Generally you will see this error when you have reached the limit on number of processes that can be created (MAXUPROC). You should run this command to see the max number of processes defined:

lsattr -E -l sys0 | grep maxuproc

To monitor actual usage, you can try this command while the jobs are running to see the number of processes that are running for the given user:

ps -ef|grep |wc -l

You will need to consider increasing the maxuproc value to accommodate the workload expected on your system.

But it looks like this works only on UNIX/Linux. Our DataStage server is installed on Windows Server.

Is there an equivalent command in Windows to check the maxuproc value and increase it?

PaulVL · Post by **PaulVL** » Tue Nov 04, 2014 5:24 pm

Not sure why you have that do-nothing grep.

I think you wanted:

ps -ef | wc -l
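
Note that `ps -ef | wc -l` counts every process on the machine plus the header line. If the goal is the per-user count that the IBM note describes, a minimal sketch is to filter on the first (UID) column of the `ps -ef` output. The function name `count_user_procs` and the user `dsadm` are placeholders, not anything from the IBM note, and the sketch assumes the first column shows the user name exactly as typed (some systems print a numeric ID for long names).

```shell
#!/bin/sh
# Count the processes owned by one user in `ps -ef` style output read
# from stdin. Matching on the first column avoids counting unrelated
# lines that merely contain the user name somewhere in the command text.
# Usage (dsadm is a placeholder user name):
#   ps -ef | count_user_procs dsadm
count_user_procs() {
    # NR > 1 skips the header line; n + 0 prints 0 when nothing matches.
    awk -v user="$1" 'NR > 1 && $1 == user { n++ } END { print n + 0 }'
}
```

This applies to UNIX/Linux hosts only; it does not answer the Windows side of the question.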

Anyhow.... I think you might want to increase the timeout values instead of increasing max procs.

Are you in a cluster/grid environment?

hexaware_tmk · Post by **hexaware_tmk** » Wed Nov 05, 2014 11:52 am

Thanks for the reply

Sorry, I am not sure about the environment. How can I figure out whether it is a cluster/grid?

Anyhow, I'll check with my administrator.

hexaware_tmk · Post by **hexaware_tmk** » Wed Nov 05, 2014 11:54 am

Are these the two environment variables whose values should be increased to raise the timeout: APT_PM_PLAYER_CONNECT_TIMEOUT and APT_PM_PLAYER_TIMEOUT?

rohitagarwal15 · Post by **rohitagarwal15** » Fri Nov 07, 2014 4:44 am

You say you are getting this problem randomly. Is there any specific job where you get this error?
What is the job design? Are you fetching records from a database or writing to a database?
What is the load on the DataStage server when you hit this issue?

hexaware_tmk · Post by **hexaware_tmk** » Mon Nov 10, 2014 2:32 pm

It's not one specific job. We have almost 400 jobs and the error occurs randomly. One thing these jobs have in common is a Lookup stage, so is there any Lookup stage setting I should look at?

Mike · Post by **Mike** » Tue Nov 11, 2014 9:15 am

Lookup stages load data to memory, so make sure there is enough physical memory to support all concurrent lookups.

Lookup stages also use temp space. Make sure you have enough temp space to support all concurrent lookups.

If you are short on memory or temp space and cannot get more, you could replace the lookup with a left outer join. The join stage requires sorted inputs, so the inserted sort operation(s) will utilize scratch disk space.

Are your resource disk, scratch disk and temp space all on separate mount points / file systems?

Mike

ArndW · Post by **ArndW** » Tue Nov 11, 2014 12:41 pm

I've seen this problem before on Windows and a combination of changing the settings mentioned earlier and otherwise reducing the load on the server made a big difference, although we did still occasionally get the same error message.

hexaware_tmk · Post by **hexaware_tmk** » Tue Nov 11, 2014 4:33 pm

Scratch space and temp space are all on the same D: drive, but we monitored the space utilization. Almost 50% of the D: drive is free at peak times.

So physically there is no space issue. Is there something to modify in the uvconfig file, or do the values of some environment variables need to be increased?

PaulVL · Post by **PaulVL** » Wed Nov 12, 2014 8:47 am

I do not believe that the Lookup stages have anything to do with your issue actually.

The error names APT_PMwaitForPlayersToStart, which implies that your jobs are failing in the startup phase, not the execution phase.

Look at your APT file. If you are dispatching your nodes to different fastname hosts, then you are in a clustered/grid environment.

Set APT_PM_PLAYER_CONNECT_TIMEOUT, APT_PM_PLAYER_TIMEOUT to 300 if not already done. These two timeout values will not affect job execution, they will just wait a little longer before giving up on a job and causing it to abort. The value is in seconds.

Based upon your findings in the APT file, you might see if the failure always happens on the same target node (host).

hexaware_tmk · Post by **hexaware_tmk** » Wed Nov 12, 2014 10:49 am

We have increased the values below from 60 to 120:

APT_PM_CONDUCTOR_TIMEOUT=120
APT_PM_PLAYER_TIMEOUT=120
APT_PM_PLAYER_CONNECT_TIMEOUT=120
APT_PM_NODE_TIMEOUT=120
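
One way to confirm which values a job actually picked up is to scan a NAME=value listing of its environment, for example a copy of the environment variable settings from the job log saved to a text file. The sketch below is only an illustration: the function name `report_timeouts`, the file name in the usage comment, and the choice of threshold are assumptions, and the variable names are simply the four listed above.

```shell
#!/bin/sh
# Report the four startup timeout variables named in this thread from a
# NAME=value listing on stdin, flagging any that are missing or below a
# minimum number of seconds.
# Usage (env_dump.txt is a hypothetical file name):
#   report_timeouts 120 < env_dump.txt
report_timeouts() {
    awk -F= -v min="$1" '
        BEGIN {
            n = split("APT_PM_CONDUCTOR_TIMEOUT APT_PM_PLAYER_TIMEOUT APT_PM_PLAYER_CONNECT_TIMEOUT APT_PM_NODE_TIMEOUT", names, " ")
        }
        {
            # Trim blanks (and a trailing carriage return) around name and value.
            key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key)
            val = $2; gsub(/^[ \t]+|[ \t\r]+$/, "", val)
            seen[key] = val
        }
        END {
            for (i = 1; i <= n; i++) {
                name = names[i]
                if (!(name in seen))
                    print name ": not set"
                else if (seen[name] + 0 < min + 0)
                    print name ": " seen[name] " (below " min ")"
                else
                    print name ": " seen[name] " (ok)"
            }
        }'
}
```

The report always prints the four names in the same order, so two runs can be compared with a plain diff.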

Since then there has been no failure for the last 5 days, but it is too early to draw a conclusion.

Where can I find the APT file (its path)? Is an APT file created per job, or is there one file per project?

PaulVL · Post by **PaulVL** » Wed Nov 12, 2014 11:41 am

Look in your execution (Director) log to see the APT_CONFIGFILE that was used or generated (if using a GRID).

It will show you the hosts your job executed on (fastname entries).
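
Once the path from APT_CONFIGFILE is known, the fastname check can be sketched in a few lines of shell. This assumes the usual configuration file layout in which each node block contains a line such as `fastname "hostname"`. The function names `list_fastnames` and `is_multi_host` and the path in the usage comment are hypothetical, and the sketch does not attempt to skip comments inside the file.

```shell
#!/bin/sh
# Print the distinct fastname hosts found in a parallel configuration file.
# Usage (the path is a placeholder):
#   list_fastnames /path/to/config.apt
list_fastnames() {
    awk '{
        # Look for the word fastname and take the quoted token after it.
        for (i = 1; i < NF; i++)
            if ($i == "fastname") {
                host = $(i + 1)
                gsub(/"/, "", host)
                print host
            }
    }' "$1" | sort -u
}

# Succeed (exit status 0) when the file names more than one distinct host,
# which is the sign of a clustered/grid layout described in this thread.
is_multi_host() {
    count=$(list_fastnames "$1" | wc -l | tr -d '[:space:]')
    [ "$count" -gt 1 ]
}
```

If `list_fastnames` prints a single host, every node runs on the same machine and the job is not being dispatched across a cluster or grid.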