Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForP

hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

We are getting this error randomly:
Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForPlayersToStart failed while waiting for players to confirm startup. This likely indicates a network problem.

When I searched for this error on Google, I found the solution below on the IBM site:

Resolving the problem
Generally you will see this error when you have reached the limit on the number of processes that can be created (MAXUPROC). You should run this command to see the max number of processes defined:

lsattr -E -l sys0 | grep maxuproc

To monitor actual usage, you can try this command while the jobs are running to see the number of processes that are running for the given user:

ps -ef|grep |wc -l

You will need to consider increasing the maxuproc value to accommodate the workload expected on your system.


But it looks like this works only on UNIX/Linux. Our DataStage server is installed on Windows Server.

So is there an equivalent command on Windows to check the maxuproc value and increase it?
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Not sure why you have that do-nothing grep.

I think you wanted:

ps -ef | wc -l
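
Note that this counts every process on the box plus the header line. If the goal is the per-user count the IBM note describes, something like the sketch below may be closer. It assumes the grep was meant to carry a user name, that the user name appears untruncated in the first column of `ps -ef`, and that a POSIX-style shell with awk is available; `count_user_procs` and the user `dsadm` are illustrative names only.

```shell
# Count the lines of `ps -ef` output whose first column (UID) equals the given user.
# Reads the ps output on stdin, so it can also be tried on captured output.
count_user_procs() {
    awk -v user="$1" '$1 == user { n++ } END { print n + 0 }'
}

# Example (hypothetical user name):
#   ps -ef | count_user_procs dsadm
```

Matching on the first column avoids counting the grep itself or unrelated processes that merely mention the user name on their command line.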

Anyhow.... I think you might want to increase the timeout values instead of increasing max procs.

Are you on a cluster / grid environment?
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

Thanks for the reply.

Sorry, I am not sure about the environment. How can I figure out whether it's cluster/grid?

Anyhow, I'll check with my administrator.
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

Are these the two environment variables whose values should be increased to raise the timeout: APT_PM_PLAYER_CONNECT_TIMEOUT and APT_PM_PLAYER_TIMEOUT?
rohitagarwal15
Participant
Posts: 102
Joined: Thu Sep 17, 2009 1:23 am

Post by rohitagarwal15 »

As you say you are getting this problem randomly, is there any specific job where you are getting this error?
What is the job design? Are you fetching records from a database or writing to a database?
What is the load on the DataStage server when you are facing this issue?
Rohit
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

It's not one specific job; we have almost 400 jobs and it occurs randomly. One thing these jobs have in common is a Lookup stage, so is there any setting related to the Lookup stage that I should look at?
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Lookup stages load data to memory, so make sure there is enough physical memory to support all concurrent lookups.

Lookup stages also use temp space. Make sure you have enough temp space to support all concurrent lookups.

If you are short on memory or temp space and can not get more, you could replace the lookup with a left outer join. The join stage requires sorted inputs, so the inserted sort operation(s) will utilize scratch disk space.

Are your resource disk, scratch disk and temp space all on separate mount points / file systems?

Mike
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I've seen this problem before on Windows and a combination of changing the settings mentioned earlier and otherwise reducing the load on the server made a big difference, although we did still occasionally get the same error message.
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

Scratch space, temp space, everything is on the same D: drive, but we monitored the space utilization. Almost 50% of the D: drive is free at peak times.

So physically there is no space issue. Is there something to modify in the uvconfig file, or do the values of some environment variables need to be increased?
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

I do not believe that the Lookup stages have anything to do with your issue actually.

APT_PMwaitForPlayersToStart appears in the error, which implies that your jobs are failing in the startup phase, not the execution phase.

Look at your APT file. If you are dispatching your nodes to different fastname hosts, then you are in a clustered/grid environment.

Set APT_PM_PLAYER_CONNECT_TIMEOUT and APT_PM_PLAYER_TIMEOUT to 300 if not already done. These two timeout values will not affect job execution; they just make the job wait a little longer before giving up and aborting. The values are in seconds.

Based upon your findings in the APT file, you might see if the failure always happens on the same target node (host).
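
As a rough way to pull the hosts out of the APT file, a sketch like the one below might help. It assumes a POSIX-style shell with sed is available and that each node block contains a line of the form fastname "hostname"; the function names `list_fastnames` and `count_fastnames` and the example path are made up for illustration.

```shell
# Print the distinct fastname hosts found in a parallel configuration (APT) file.
# Assumes each node block has a line like:  fastname "hostname"
list_fastnames() {
    sed -n 's/.*fastname[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Print how many distinct hosts the file names.
# More than one suggests the nodes are spread over a cluster/grid.
count_fastnames() {
    list_fastnames "$1" | wc -l | tr -d ' '
}

# Example (hypothetical path):
#   list_fastnames /path/to/default.apt
```

If the count comes back as 1, all nodes run on a single host, and comparing failures per target host will not tell you much.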
hexaware_tmk
Premium Member
Posts: 17
Joined: Wed Mar 19, 2014 3:53 pm

Post by hexaware_tmk »

We have increased the values below from 60 to 120:

APT_PM_CONDUCTOR_TIMEOUT=120
APT_PM_PLAYER_TIMEOUT=120
APT_PM_PLAYER_CONNECT_TIMEOUT=120
APT_PM_NODE_TIMEOUT=120

After this there have been no failures for the last 5 days, but it is too early to conclude.

Where can I find the APT file (its path)? Is an APT file created per job, or is there one file for a project?
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Look in your execution (Director) log to see the APT_CONFIGFILE that was used or generated (if using a GRID).

It will show you the hosts your job executed on (fastname entries).