APT_PMwaitForPlayersToStart failed while waiting for player

ajaykumar
Participant
Posts: 49
Joined: Tue Sep 01, 2009 7:56 am

APT_PMwaitForPlayersToStart failed while waiting for player

Post by ajaykumar »

Hi,

One of our jobs aborted with this error:
Event Id: 24327
Time : Thu Aug 26 11:14:32 2010
Type : FATAL
User : dsadm
Message :
main_program: Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForPlayersToStart failed while waiting for player count. This likely indicates a network problem.
Status from APT_PMpoll is 0; node name is node1

We opened a PMR with IBM, and they asked us to tune some parameters in the uvconfig file.
Here are the parameters I tuned in uvconfig:
MFILES=100
T30FILE=400
GLTABSZ=75
RLTABSZ=150
RLOWNER=150
MAXRLOCK=149

I also added the variable APT_PM_PLAYER_TIMEOUT=120.

But the error is still occurring very frequently.

Our version is 8.1.0 (no FP1) on zLinux.

Any suggestions?
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

Based on your job score (if it gets that far) how many processes on how many nodes are there?

This error kind of makes me think that the section leaders are waiting for a response from the players (node specific processes) and this is where the issue is.

Almost seems like there are not enough resources available or something like that. Is the job big?

Let us know

Thanks!
ajaykumar
Participant
Posts: 49
Joined: Tue Sep 01, 2009 7:56 am

Post by ajaykumar »

Hi, can you tell me how to find the job score? It's not a big job, I guess. As for resources, the CPU is 98% free, only this job and a few others are running, and it uses a 2-node configuration file. I guess there are plenty of resources available. Do we have to add any environment variable?
ajaykumar
Participant
Posts: 49
Joined: Tue Sep 01, 2009 7:56 am

hi

Post by ajaykumar »

I guess we have 65k processes available.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

You can get the score either by setting APT_DUMP_SCORE = True at the project level or by adding it to your job and setting it to True. The score then appears in the Director log and looks something like this:
main_program: This step has 10 datasets:
Can you please include a copy of your configuration file?

What exactly does this do?
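
Once APT_DUMP_SCORE is on, one quick way to confirm that the score actually reached the log is to look for its header line. A minimal Python sketch, assuming the log has been saved as text and that the header has the same shape as the single line quoted above; `dataset_count` is a made-up helper name, not part of DataStage:

```python
import re

# The one score line quoted in the post above; a real Director log
# would contain many more lines after it.
LOG_TEXT = "main_program: This step has 10 datasets:"


def dataset_count(log_text):
    """Return the dataset count from a score header line, or None if absent."""
    match = re.search(r"This step has (\d+) datasets?:", log_text)
    return int(match.group(1)) if match else None


print(dataset_count(LOG_TEXT))
```

If the function returns None, the score was probably not dumped for that run.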
ajaykumar
Participant
Posts: 49
Joined: Tue Sep 01, 2009 7:56 am

Here are the details

Post by ajaykumar »

APT_DUMP_SCORE is set to False, so there is no information about that.

Here is my configuration file.

main_program: APT configuration file: /opt/IBM/InformationServer/Server/Configurations/default_ipt1.apt
{
    node "node1"
    {
        fastname "vmzldliis07"
        pools ""
        resource disk "/IISData/Dataset1" {pools ""}
        resource scratchdisk "/IISWork/Scratch1" {pools ""}
    }
    node "node2"
    {
        fastname "vmzldliis07"
        pools ""
        resource disk "/IISData/Dataset2" {pools ""}
        resource scratchdisk "/IISWork/Scratch2" {pools ""}
    }
}
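
Both nodes in this file use the same fastname, so the two logical nodes run on a single host. A minimal Python sketch of that check, assuming a regular expression is enough for a file shaped like this one (it is not a full parser for the configuration syntax); `nodes_by_host` is a made-up name for illustration:

```python
import re
from collections import defaultdict

# Configuration text copied from the post above.
CONFIG_TEXT = '''
{
node "node1"
{
fastname "vmzldliis07"
pools ""
resource disk "/IISData/Dataset1" {pools ""}
resource scratchdisk "/IISWork/Scratch1" {pools ""}
}
node "node2"
{
fastname "vmzldliis07"
pools ""
resource disk "/IISData/Dataset2" {pools ""}
resource scratchdisk "/IISWork/Scratch2" {pools ""}
}
}
'''


def nodes_by_host(config_text):
    """Map each fastname (host) to the logical nodes defined on it."""
    hosts = defaultdict(list)
    # Assumes fastname is the first entry inside each node block,
    # as it is in the file shown above.
    pattern = re.compile(r'node\s+"([^"]+)"\s*\{\s*fastname\s+"([^"]+)"')
    for node_name, fastname in pattern.findall(config_text):
        hosts[fastname].append(node_name)
    return dict(hosts)


for host, nodes in nodes_by_host(CONFIG_TEXT).items():
    print(f"{host}: {', '.join(nodes)}")
# prints "vmzldliis07: node1, node2"
```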
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Re: Here are the details

Post by vivekgadwal »

ajaykumar wrote: APT_DUMP_SCORE is set to False. So no Info about that.
Please turn on APT_DUMP_SCORE (set it to True) and run the job, then post the results here for review. This is what mhester was suggesting.

Were you able to replicate this problem in other environments?
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
ajaykumar
Participant
Posts: 49
Joined: Tue Sep 01, 2009 7:56 am

Re: Here are the details

Post by ajaykumar »

Yes, we are not able to replicate this problem in other environments. If I remember correctly, we got this type of error on our prod server about 50 days ago. It happens randomly, once in a while. Here is the attached log:

Server Name:vmzldliis07 Project Name:cfpat600 Job Name:CFPCOMM_ClearDS_BalancingRecordCount.PPGLBL

**************************************************
STATUS REPORT FOR JOB: CFPCOMM_ClearDS_BalancingRecordCount.PPGLBL
Generated: 2010-08-26 11:22:30
Job start time=2010-08-26 11:11:28
Job end time=2010-08-26 11:14:33
Job elapsed time=00:03:05
Job status=3 (Aborted)
Stage: CFPCOMM_RG_OneRow, 0 rows input
Stage start time=, end time=, elapsed=00:00:00
Link: LNK_GenerateFiles, 0 rows
Stage: CFPCOMM_GenerateFiles, 0 rows input
Stage start time=, end time=, elapsed=00:00:00
Link: LNK_GenerateFiles, 0 rows
Link: LNK_out_ALL, 0 rows
****************************************************
*** LOG DETAILS ***
Event Id: 24317
Time : Thu Aug 26 11:11:29 2010
Type : STARTED
User : dsadm
Message :
Starting Job CFPCOMM_ClearDS_BalancingRecordCount.PPGLBL.
pFileDate = 20091231081599
pProcessAcronym = PPGLBL
pWorkDir = /IISData/cfp/cfpat600/Work
$APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/default_ipt1.apt
DSJobController = JsCFPCOMM_Common_Interface.PPGLBL
-----------------------------------------------------------------------
Event Id: 24318
Time : Thu Aug 26 11:11:50 2010
Type : INFO
User : dsadm
Message :
Environment variable settings:
_=/usr/bin/nohup
A__z=! PROFILEREAD
ACLOCAL_FLAGS=-I /opt/gnome/share/aclocal
AGENTWORKS_DIR=/opt/tng/aw
APT_COMPILEOPT=-O -fPIC -Wno-deprecated -c
APT_COMPILER=g++
APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/default_ipt1.apt
APT_ERROR_CONFIGURATION=severity, !vseverity, !jobid, moduleid, errorIndex, timestamp, !ipaddr, !nodeplayer, !nodename, opid, message
APT_LINKER=g++
APT_LINKOPT=-shared
APT_MONITOR_MINTIME=10
APT_NO_JOBMON=1
APT_NO_ONE_NODE_COMBINING_OPTIMIZATION=1
APT_NO_PART_INSERTION=1
APT_NO_SORT_INSERTION=1
APT_OPERATOR_REGISTRY_PATH=/IISProjects/cfp/cfpat600/buildop
APT_ORCHHOME=/opt/IBM/InformationServer/Server/PXEngine
APT_PM_NODE_TIMEOUT=420
APT_PM_PLAYER_TIMEOUT=120
ASBHOME=/opt/IBM/InformationServer/ASBNode
BELL=^G
CALIB=/opt/CA/SharedComponents/lib
CASHCOMP=/opt/CA/SharedComponents
COLORTERM=1
CPU=s390x
CSHEDIT=emacs
DS_ENABLE_RESERVED_CHAR_CONVERT=0
DS_OPERATOR_BUILDOP_DIR=buildop
DS_OPERATOR_WRAPPED_DIR=wrapped
DS_TDM_PIPE_OPEN_TIMEOUT=720
DS_TDM_TRACE_SUBROUTINE_CALLS=0
DS_USERNO=-29521
DSHOME=/opt/IBM/InformationServer/Server/DSEngine
DSIPC_OPEN_TIMEOUT=30
DSWaitForJob=300
DSWaitStartup=300
FLAVOR=-1
FROM_HEADER=
G_BROKEN_FILENAMES=1
G_FILENAME_ENCODING=@locale,UTF-8,ISO-8859-15,CP1252
GNOME2_PATH=/usr/local:/opt/gnome:/usr
GROFF_NO_SGR=yes
HISTSIZE=1000
HOME=/home/dsadm
HOST=vmzldliis07
HOSTNAME=vmzldliis07
HOSTTYPE=s390x
INFODIR=/usr/local/info:/usr/share/info:/usr/info
INFOPATH=/usr/local/info:/usr/share/info:/usr/info:/opt/gnome/share/info
INPUTRC=/etc/inputrc
JAVA_BINDIR=/opt/IBMJava2-s390x-142/bin
JAVA_CLASSPATH=/opt/IBMJava2-s390x-142/bin
JAVA_HOME=/opt/IBMJava2-s390x-142
JAVA_ROOT=/opt/IBMJava2-s390x-142
JRE_HOME=/opt/IBMJava2-s390x-142
LANG=en_US.UTF-8
LD_LIBRARY_PATH=/IISProjects/cfp/cfpat600/RT_BP4.O:/opt/IBM/InformationServer/Server/DSComponents/lib:/opt/IBM/InformationServer/Server/DSComponents/bin:/opt/IBM/InformationServer/Server/DSParallel:/opt/IBM/InformationServer/Server/PXEngine/user_lib:/opt/IBM/InformationServer/Server/PXEngine/lib:/IISProjects/cfp/cfpat600/buildop:/opt/IBM/InformationServer/Server/branded_odbc/lib:/opt/IBM/InformationServer/Server/DSEngine/lib:/opt/IBM/InformationServer/Server/DSEngine/uvdlls:/opt/IBM/InformationServer/ASBNode/apps/jre/bin:/opt/IBM/InformationServer/ASBNode/apps/jre/bin/classic:/opt/IBM/InformationServer/ASBNode/lib/cpp:/opt/IBM/InformationServer/ASBNode/apps/proxy/cpp/linux-all-s390x_64:/opt/IBM/db2/V9.5/lib64:/usr/lib:.:/lib
LESS=-M -I
LESS_ADVANCED_PREPROCESSOR=no
LESSCLOSE=lessclose.sh %s %s
LESSKEY=/etc/lesskey.bin
LESSOPEN=lessopen.sh %s
LIC_ECHO=echo -e
LOGNAME=dsadm
LS_COLORS=no=00:fi=00:di=01;34:ln=00;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=41;33;01:ex=00;32:*.cmd=00;32:*.exe=01;32:*.com=01;32:*.bat=01;32:*.btm=01;32:*.dll=01;32:*.tar=00;31:*.tbz=00;31:*.tgz=00;31:*.rpm=00;31:*.deb=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.zoo=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.tb2=00;31:*.tz2=00;31:*.tbz2=00;31:*.avi=01;35:*.bmp=01;35:*.fli=01;35:*.gif=01;35:*.jpg=01;35:*.jpeg=01;35:*.mng=01;35:*.mov=01;35:*.mpg=01;35:*.pcx=01;35:*.pbm=01;35:*.pgm=01;35:*.png=01;35:*.ppm=01;35:*.tga=01;35:*.tif=01;35:*.xbm=01;35:*.xpm=01;35:*.dl=01;35:*.gl=01;35:*.wmv=01;35:*.aiff=00;32:*.au=00;32:*.mid=00;32:*.mp3=00;32:*.ogg=00;32:*.voc=00;32:*.wav=00;32:
LS_OPTIONS=-N --color=tty -T 0
MACHTYPE=s390x-suse-linux
MAIL=/var/spool/mail/dsadm
MANPATH=/usr/local/man:/usr/share/man:/usr/X11R6/man:/opt/gnome/share/man
MINICOM=-c on
MORE=-sl
NNTPSERVER=news
ODBCINI=/opt/IBM/InformationServer/Server/DSEngine/.odbc.ini
OLDPWD=/
OSH_STDOUT_MSG=1
OSTYPE=linux
PAGER=less
PATH=/IISProjects/cfp/cfpat600/wrapped:/IISProjects/cfp/cfpat600/buildop:/IISProjects/cfp/cfpat600/RT_BP4.O:/opt/IBM/InformationServer/Server/DSComponents/lib:/opt/IBM/InformationServer/Server/DSComponents/bin:/opt/IBM/InformationServer/Server/DSParallel:/opt/IBM/InformationServer/Server/PXEngine/user_osh_wrappers:/opt/IBM/InformationServer/Server/PXEngine/osh_wrappers:/opt/IBM/InformationServer/Server/PXEngine/bin:/bin:/usr/bin:/opt/IBMJava2-s390x-142/bin:/opt/tng/aw/services/bin:/opt/tng/aw/services/tools:/opt/tng/aw/agents/bin:/bin:/bin:/usr/kerberos/bin:/usr/local/bin:/usr/X11R6/bin:.
PKG_CONFIG_PATH=/opt/gnome/lib64/pkgconfig:/opt/gnome/share/pkgconfig
PROFILEREAD=true
PWD=/opt/IBM/InformationServer/Server/DSEngine
PX_DBCONNECTHOME=/opt/IBM/InformationServer/Server/DSComponents
PYTHONSTARTUP=/etc/pythonstart
RFC_CONNECTION_TIMEOUT=600
RFC_NO_COMPRESS=0
RFC_TRACE=0
RFC_TRACE_DIR=/IISData
SAPINST_JRE_HOME=/opt/IBMJava2-s390x-142
SHELL=/bin/ksh
SHLVL=2
TERM=
TEXINPUTS=:/home/dsadm/.TeX:/usr/share/doc/.TeX:/usr/doc/.TeX
TMPDIR=/IISWork/cfp/tmp
UDTBIN=/opt/IBM/InformationServer/Server/DSEngine/ud41/bin
UDTHOME=/opt/IBM/InformationServer/Server/DSEngine/ud41
USER=dsadm
WHO=cfpat600
WINDOWMANAGER=gnome: not found
XDG_CONFIG_DIRS=/usr/local/etc/xdg/:/etc/xdg/:/etc/opt/gnome/xdg/
XDG_DATA_DIRS=/usr/local/share/:/usr/share/:/etc/opt/kde3/share/:/opt/kde3/share/:/opt/gnome/share/
XKEYSYMDB=/usr/X11R6/lib/X11/XKeysymDB
XNLSPATH=/usr/X11R6/lib/X11/nls
-----------------------------------------------------------------------
Event Id: 24319
Time : Thu Aug 26 11:11:50 2010
Type : INFO
User : dsadm
Message :
Parallel job initiated
-----------------------------------------------------------------------
Event Id: 24320
Time : Thu Aug 26 11:11:50 2010
Type : INFO
User : dsadm
Message :
OSH script
# OSH / orchestrate script for Job CFPCOMM_ClearDS_BalancingRecordCount compiled at 19:52:54 17 SEP 2009
#################################################################
#### STAGE: CFPCOMM_RG_OneRow
## Operator
generator
## Operator options
-schema record
(
row1:string[max=1] {cycle={value=1}};
)
-records 1
## General options
[ident('CFPCOMM_RG_OneRow'); jobmon_ident('CFPCOMM_RG_OneRow')]
## Outputs
0> [] 'CFPCOMM_RG_OneRow:LNK_GenerateFiles.v'
;
#################################################################
#### STAGE: CFPCOMM_GenerateFiles
## Operator
transform
## Operator options
-flag run
-name 'V0S1_CFPCOMM_ClearDS_BalancingRecordCount_CFPCOMM_GenerateFiles'
## General options
[ident('CFPCOMM_GenerateFiles'); jobmon_ident('CFPCOMM_GenerateFiles')]
## Inputs
0< [] 'CFPCOMM_RG_OneRow:LNK_GenerateFiles.v'
## Outputs
0> [] 'CFPCOMM_GenerateFiles:LNK_out_ALL.v'
;
#### STAGE: COMM_DS_BalancingAndRecordCount_OutputFile.LNK_out_ALL_Part
## Operator
entire
## General options
[ident('COMM_DS_BalancingAndRecordCount_OutputFile.LNK_out_ALL_Part')]
## Inputs
0< [] 'CFPCOMM_GenerateFiles:LNK_out_ALL.v'
## Outputs
0> [] 'CFPCOMM_GenerateFiles:LNK_out_ALL_Part.v'
;
#################################################################
#### STAGE: COMM_DS_BalancingAndRecordCount_OutputFile
## Operator
copy
## General options
[ident('COMM_DS_BalancingAndRecordCount_OutputFile')]
## Inputs
0< [] 'CFPCOMM_GenerateFiles:LNK_out_ALL_Part.v'
## Outputs
0>| [ds] '[&"pWorkDir"]/Data_Sets/d_CFP[&"pProcessAcronym"]_BalancingAndRecordCount_OutputFile_[&"pFileDate"].ds'
;
# End of OSH code
-----------------------------------------------------------------------
Event Id: 24321
Time : Thu Aug 26 11:11:50 2010
Type : INFO
User : dsadm
Message :
Parallel job default NLS map UTF-8, default locale OFF
-----------------------------------------------------------------------
Event Id: 24322
Time : Thu Aug 26 11:12:23 2010
Type : INFO
User : dsadm
Message :
main_program: IBM WebSphere DataStage Enterprise Edition 8.1.0.5182
Copyright (c) 2001, 2005-2008 IBM Corporation. All rights reserved
-----------------------------------------------------------------------
Event Id: 24323
Time : Thu Aug 26 11:12:23 2010
Type : INFO
User : dsadm
Message :
main_program: The open files limit is 10240; raising to 65536.
-----------------------------------------------------------------------
Event Id: 24324
Time : Thu Aug 26 11:12:23 2010
Type : INFO
User : dsadm
Message :
main_program: conductor uname: -s=Linux; -r=2.6.16.54-0.2.5-default; -v=#1 SMP Mon Jan 21 13:29:51 UTC 2008; -n=vmzldliis07; -m=s390x
-----------------------------------------------------------------------
Event Id: 24325
Time : Thu Aug 26 11:12:23 2010
Type : INFO
User : dsadm
Message :
main_program: orchgeneral: loaded
orchsort: loaded
orchstats: loaded
-----------------------------------------------------------------------
Event Id: 24326
Time : Thu Aug 26 11:14:32 2010
Type : INFO
User : dsadm
Message :
main_program: APT configuration file: /opt/IBM/InformationServer/Server/Configurations/default_ipt1.apt
{
node "node1"
{
fastname "vmzldliis07"
pools ""
resource disk "/IISData/Dataset1" {pools ""}
resource scratchdisk "/IISWork/Scratch1" {pools ""}
}

node "node2"
{
fastname "vmzldliis07"
pools ""
resource disk "/IISData/Dataset2" {pools ""}
resource scratchdisk "/IISWork/Scratch2" {pools ""}
}

}
-----------------------------------------------------------------------
Event Id: 24327
Time : Thu Aug 26 11:14:32 2010
Type : FATAL
User : dsadm
Message :
main_program: Fatal Error: Unable to start ORCHESTRATE job: APT_PMwaitForPlayersToStart failed while waiting for player count. This likely indicates a network problem.
Status from APT_PMpoll is 0; node name is node1
-----------------------------------------------------------------------
Event Id: 24328
Time : Thu Aug 26 11:14:33 2010
Type : STARTED
User : dsadm
Message :
Job CFPCOMM_ClearDS_BalancingRecordCount.PPGLBL aborted.
-----------------------------------------------------------------------
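
The event timestamps in the log above can be compared to see where the startup time went. A minimal Python sketch, using a subset of the event IDs and times copied from the log; `largest_gap` is a made-up helper name, and the `%a`/`%b` parsing assumes an English locale. The longest pause works out to 129 seconds just before the fatal event, slightly more than the APT_PM_PLAYER_TIMEOUT=120 shown in the environment dump. Whether the two are related is an assumption, not something the log confirms.

```python
from datetime import datetime

# (event id, timestamp) pairs copied from the log above.
EVENTS = [
    (24317, "Thu Aug 26 11:11:29 2010"),
    (24318, "Thu Aug 26 11:11:50 2010"),
    (24322, "Thu Aug 26 11:12:23 2010"),
    (24325, "Thu Aug 26 11:12:23 2010"),
    (24326, "Thu Aug 26 11:14:32 2010"),
    (24327, "Thu Aug 26 11:14:32 2010"),
]

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def largest_gap(events):
    """Return (previous id, current id, seconds) for the longest pause."""
    parsed = [(eid, datetime.strptime(ts, TIME_FORMAT)) for eid, ts in events]
    best = None
    for (prev_id, prev_t), (cur_id, cur_t) in zip(parsed, parsed[1:]):
        seconds = int((cur_t - prev_t).total_seconds())
        if best is None or seconds > best[2]:
            best = (prev_id, cur_id, seconds)
    return best


prev_id, cur_id, seconds = largest_gap(EVENTS)
print(f"Largest gap: {seconds} s between events {prev_id} and {cur_id}")
# prints "Largest gap: 129 s between events 24325 and 24326"
```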
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

Check with your Unix/Linux admin to see if anything unusual is getting logged on the OS side.