Fatal : node_node1: Player 2 terminated unexpectedly.

Post questions here relating to DataStage Enterprise/PX Edition, covering areas such as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Is your job failing every time with the same information, i.e. node1 with player 2?
Try a run in sequential mode, or with a single-node config file.
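Something like this is enough for a single-node config - the fastname and the two paths below are just placeholders, substitute your own - and then point $APT_CONFIG_FILE at it:

Code:

   {
       node "node1"
       {
           fastname "your_hostname"
           pools ""
           resource disk "/your/dataset/path" {pools ""}
           resource scratchdisk "/your/scratch/path" {pools ""}
       }
   }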
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Rajesh_kr82
Participant
Posts: 24
Joined: Sat Oct 15, 2005 1:09 pm

Post by Rajesh_kr82 »

I am also facing the same problem of SIGKILL. My jobs are getting aborted even with small amounts of data. My job is a multiple-instance job; I am running one instance at a time, and every time I supply different input files. Out of 50-60 runs the job fails maybe once or twice. Was anyone able to successfully get rid of this SIGKILL problem?

My buffers are set to default values:
APT_BUFFERING_POLICY Automatic buffering
APT_BUFFER_DISK_WRITE_INCREMENT 1048576
APT_BUFFER_FREE_RUN 0.5
APT_BUFFER_MAXIMUM_MEMORY 3145728
APT_BUFFER_MAXIMUM_TIMEOUT 1

Could anyone point me to the patch required for AIX?
Regards,
Rajesh
yakiku
Premium Member
Posts: 23
Joined: Thu May 13, 2004 7:14 am

Post by yakiku »

Kumar:
It is not always failing with the same error.
For example, here is the error from another failure:

"node_node1: Player 5 terminated unexpectedly."
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

To narrow it down, try running the job with a single-node config file. That lowers the number of processes running for that user, so you can confirm whether the failure is related to a per-user process limit.
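If you suspect a per-user limit, it is also worth checking the limits on the server itself. On AIX, something along these lines should do it (a sketch only - 'dsadm' stands for whichever user actually runs the jobs):

Code:

   # per-user process limit configured on the box
   lsattr -El sys0 -a maxuproc

   # resource limits in effect for the shell/user
   ulimit -a

   # rough count of processes that user is running right now
   ps -u dsadm | wc -l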
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

More fun with SIGKILL

Post by jgreve »

kumar_s wrote:To narrow it down, try running the job with a single-node config file. That lowers the number of processes running for that user, so you can confirm whether the failure is related to a per-user process limit.
I'm hitting SIGKILL problems now, so I'll chime in here.
Things run fine for dozens or hundreds of invocations,
then Boom! "Here, have a nice SIGKILL."
I've started running on a 1-node config file; still blows up once in a while.
The OS in question is AIX:

Code:

   $ uname -a
   AIX wi1023 1 5 0022C74B4C00
As per an earlier post, these happen to be the project's defaults.

Code:

   APT_BUFFERING_POLICY Automatic buffering
   APT_BUFFER_DISK_WRITE_INCREMENT 1048576
   APT_BUFFER_FREE_RUN 0.5
   APT_BUFFER_MAXIMUM_MEMORY 3145728
   APT_BUFFER_MAXIMUM_TIMEOUT 1
A question re: an earlier post speculating on why SIGKILL
was happening:
If a system is maxed out on processes (the process table is full,
or some per-user quota is reached), PLEASE tell me that the OS will
fail on fork() - vs. sending SIGKILL to random process(es)
in hopes of freeing up a slot in the process table.

Anyway, I would expect something like this inside of the
parallel engine; please let me know if my mental model
needs revising:

Code:

#include <stdlib.h>     /* exit() */
#include <sys/types.h>  /* pid_t */
#include <unistd.h>     /* fork(), getpid() */

void launch_player(void) {
   /* the same should apply to the conductor launching section leaders & so on */
   pid_t pid;
   switch (pid = fork()) {
   case -1:
       /* fork() failed - e.g. errno == EAGAIN when the per-user process limit is hit */
       log_error( "SectLeader %d: unable to launch player, aborting.", getpid() );
       /* set some kind of stop-running flag. */
       break;
   case 0: /* pid of zero: we really are the new child process */
       do_player_things(..);
       exit(0);
   default:
       log_msg("SectLeader %d launched player %d", getpid(), pid);
       break;
   }
}

Which means it is time for me to go look for logs...
I mean, how does a program not write an error message if it dies?
That is so... so... 1980's!
John G.
pavankvk
Participant
Posts: 202
Joined: Thu Dec 04, 2003 7:54 am

Post by pavankvk »

Can you post the complete log? I've gotten this error many times.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

jgreve wrote:I mean, how does a program not write an error message if it dies?
If it's killed with SIGKILL it may not get the opportunity to.

Further, all player processes redirect their stdout and stderr to their node's section leader process, which transmits anything to be logged back to the conductor process. A failure might occur anywhere along this path.
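As an aside, SIGKILL (and SIGSTOP) are the two signals a process is never permitted to catch, so there is simply no chance to write anything on the way down. A minimal C sketch of that standard POSIX behaviour (nothing DataStage-specific):

Code:

   #include <errno.h>
   #include <signal.h>
   #include <stdio.h>
   #include <string.h>

   static void handler(int sig) { (void)sig; }

   int main(void)
   {
       struct sigaction act;
       memset(&act, 0, sizeof(act));
       act.sa_handler = handler;

       /* Installing a handler for SIGKILL always fails with EINVAL,
          so a SIGKILL'd process dies without logging a thing. */
       if (sigaction(SIGKILL, &act, NULL) == -1)
           printf("cannot catch SIGKILL: %s\n", strerror(errno));

       /* SIGTERM, by contrast, can be caught and logged before exit. */
       if (sigaction(SIGTERM, &act, NULL) == 0)
           printf("SIGTERM handler installed\n");
       return 0;
   }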
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

kill, kill, kill

Post by jgreve »

ray.wurlod wrote:
jgreve wrote:I mean, how does a program not write an error message if it dies?
If it's killed with SIGKILL it may not get the opportunity to.
That is a good point...
I'm used to SIGKILL coming from humans, e.g. $ kill -9 pid

If something external to DataStage is killing children, I need to have a talk with the individual doing the killing (or owning the process that is doing the killing). My question about the "fork() / case" code was whether or not the operating system would kill processes of its own volition (I'm hoping the answer is 'no').

If something internal to DataStage were to issue a SIGKILL I hope it would
sort of log something like "CONDUCTOR'S DIARY: Jan 26, Today I had to KILL
my children (#'s 12032, 3029123, and 2931239) because they didn't write home for over 800 msec."

I can understand the receiver of a SIGKILL just falling over dead without saying goodbye in a log or stderr. I would hope the sender of the SIGKILL made note of it somewhere (or am I just being naive?).
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

the illusion of solidity

Post by jgreve »

pavankvk wrote:Can u post the complete log..i got this error many times
*sigh* no, I've run it since then and auto-clean wiped it.
If it blows up again, I'll do that.
I almost want it to blow up now; it has been 200+ invocations
since the last abort, and it is now presenting the illusion of being solid :?
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

My SIGKILL has returned - "complete" log posted, 2

Post by jgreve »

jgreve wrote:
pavankvk wrote:Can u post the complete log..i got this error many times
My SIGKILL has returned.
I think this is the "complete" log; with any luck you'll look at this and say "Don't do thing X, that is stupid" and I'll come away from this being smarter.

dsjob -logdetail mdirqa flatcols04b 5816 5837 > temp.dat
The range of event #'s goes from the last clean run to the final entry of the aborted run. I've split the log lines at around 80 chars because
the [ code ] [ /code ] formatting looks kind of bad with 2km-wide lines.

That was apparently (still) too long; I've tried trimming down
the log (truncating longer lines to 80 chars). Let me know if I have
removed an important bit - I will post that part again.

Code:

Event Id: 5816 Time	: Sun Jan 28 19:26:57 2007 Type	: INFO User	: xyzuser
Message	: main_program: Step execution finished with status = OK.
Event Id: 5817 Time	: Sun Jan 28 19:26:57 2007 Type	: INFO User	: xyzuser
Message	: main_program: Startup time, 0:14; production run time, 0:02.
Event Id: 5818 Time	: Sun Jan 28 19:26:57 2007 Type	: INFO User	: xyzuser
Message	: Parallel job reports successful completion
Event Id: 5819 Time	: Sun Jan 28 19:26:57 2007 Type	: STARTED User	: xyzuser
Message	: Finished Job flatcols04b.
Event Id: 5820 Time	: Sun Jan 28 19:30:41 2007 Type	: STARTED User	: xyzuser
Message	: Starting Job flatcols04b.
	$APT_CONFIG_FILE = /home/dasadm/Ascential/DataStage/Configurations/mdirqa1.apt
	$APT_NO_SORT_INSERTION = True
	$OSH_ECHO = True
	$OSH_DUMP = True
	$APT_DUMP_SCORE = True
	$OSH_PRINT_SCHEMAS = True
	$APT_DISABLE_COMBINATION = True
	SRC_DB_NAME = edwq0
	SRC_DB_UNAME = mdrdev1
	SRC_DB_PASSWORD = ********
	TGT_DB_NAME = edwq0
	TGT_DB_UNAME = mdrdev1
	TGT_DB_PASSWORD = ********
	SEQ_PATH = /data/datastage/mdir/qa/datasets
Event Id: 5821 Time	: Sun Jan 28 19:30:42 2007 Type	: INFO User	: xyzuser
Message	: Environment variable settings: _=/usr/bin/nohup LANG=en_US LOGIN=dasa
Event Id: 5822 Time	: Sun Jan 28 19:30:43 2007 Type	: INFO User	: xyzuser
Message	: Parallel job initiated
	# OSH / orchestrate script for Job flatcols04b compiled at 22:40:59 24 JAN 200
Event Id: 5823 Time	: Sun Jan 28 19:30:43 2007 Type	: INFO User	: xyzuser
Message	: main_program: Ascential DataStage(tm) Enterprise Edition 7.5.1A
	Copyright (c) 2004, 1997-2004 Ascential Software Corporation.
	All Rights Reserved
Event Id: 5824 Time	: Sun Jan 28 19:30:43 2007 Type	: INFO User	: xyzuser
Message	: main_program: orchgeneral: loaded
	orchsort: loaded
	orchstats: loaded
Event Id: 5825 Time	: Sun Jan 28 19:30:43 2007 Type	: INFO User	: xyzuser
Message	: main_program: APT configuration file: /home/dasadm/Ascential/DataStag
	{
		node "node1"
		{
			fastname "wi1023"
			pools ""
			resource disk "/data/datastage/mdir/qa/datasets/data" {pools ""}
			resource scratchdisk "/data/datastage/mdir/qa/scratch" {pools ""}
		}

	}
Event Id: 5826 Time	: Sun Jan 28 19:30:44 2007 Type	: INFO User	: xyzuser
Message	: main_program: This step has 1 dataset:
ds0: {op0[1p] (parallel APT_LicenseCountOp in APT_LicenseOperator) ->eCollectAn
It has 2 operators:
op0[1p] {(parallel APT_LicenseCountOp in APT_LicenseOperator) on nodes ( node1[
op1[1p] {(sequential APT_LicenseCheckOp in APT_LicenseOperator) on nodes ( node
It runs 2 processes on 1 node.

Event Id: 5827 Time	: Sun Jan 28 19:30:45 2007 Type	: INFO User	: xyzuser
Message	: main_program: Echo:
...deleted...

Event Id: 5828 Time	: Sun Jan 28 19:30:48 2007 Type	: INFO User	: xyzuser
Message	: main_program: Dump: { 
text="dscapiop\n-stageprops flatcols04b.db2S_inCOP '{\n   SERVER=\\'edwq0\\', \
op={ text="\n\ntsort\n-stable\n-key 'PROD_HOLD_SEQ'\n-asc\n-sorted\n-key 'cntry
op={ text="\n\ntransform\n-flag run\n-name 'V0S381_flatcols04b_txCOP'\n\n[ident
op={ text="\n\nremdup\n-keep last\n-key 'seq'\n\n[ident('rdCOP'); jobmon_ident(
op={ text="\n\ncopy\n[ident('cpRdCOP'); jobmon_ident('cpRdCOP')]\n0< 'rdCOP:cpR
op={ text="\n\npeek\n-nrecs 10\n-name\n\n[ident('peekRdCOP'); jobmon_ident('pee
op={ text="\n\ndscapiop\n-stageprops flatcols04b.db2S_inCOS '{\n   SERVER=\\'ed
op={ text="\n\ntsort\n-stable\n-key 'PROD_HOLD_SEQ'\n-asc\n-sorted\n-key 'CNTRY
op={ text="\n\ntransform\n-flag run\n-name 'V22S0_flatcols04b_txCOS'\n\n[ident(
op={ text="\n\nremdup\n-keep last\n-key 'seq'\n\n[ident('rdCOS'); jobmon_ident(
op={ text="\n\npeek\n-nrecs 10\n-name\n\n[ident('peekRdCOS'); jobmon_ident('pee
op={ text="\n\ncopy\n[ident('cpRdCOS'); jobmon_ident('cpRdCOS')]\n0< 'rdCOS:cpR
op={ text="\n\nleftouterjoin\n-key 'seq'\n\n[ident('join'); jobmon_ident('join'
op={ text="\n\npeek\n-nrecs 10\n-name\n\n[ident('peek_flatcols'); jobmon_ident(
op={ text="\n\ntsort\n-stable\n-key 'seq'\n-asc\n-sorted\n\n[ident('fauxSortPRO
op={ text="\n\ndscapiop\n-stageprops flatcols04b.db2S_inPROD '{\n   SERVER=\\'e
op={ text="\n\ncopy\n[ident('cpPKLANG'); jobmon_ident('cpPKLANG')]\n0< 'rdPKLAN
op={ text="\n\npeek\n-nrecs 10\n-name\n\n[ident('peekRdPKLANG'); jobmon_ident('
op={ text="\n\nremdup\n-keep last\n-key 'seq'\n\n[ident('rdPKLANG'); jobmon_ide
op={ text="\n\ntransform\n-flag run\n-name 'V27S4_flatcols04b_txPKLANG'\n\n[ide
op={ text="\n\ntsort\n-stable\n-key 'PROD_HOLD_SEQ'\n-asc\n-sorted\n-key 'PACK_
op={ text="\n\ndscapiop\n-stageprops flatcols04b.db2S_inPKLANG '{\n   SERVER=\\
op={ text="\n\ndscapiop\n-stageprops flatcols04b.db2W_inNOURISH '{\n   SERVER=\
op={ text="\n\ntsort\n-stable\n-key 'PROD_HOLD_SEQ'\n-asc\n-sorted\n-key 'SSN_C
op={ text="\n\ntransform\n-flag run\n-name 'V28S2_flatcols04b_txNOURISH'\n\n[id
op={ text="\n\nremdup\n-keep last\n-key 'seq'\n\n[ident('rdNOURISH'); jobmon_id
op={ text="\n\npeek\n-nrecs 10\n-name\n\n[ident('peekRdNOURISH'); jobmon_ident(
op={ text="\n\ncopy\n[ident('cpRdNOURISH'); jobmon_ident('cpRdNOURISH')]\n0< 'r
op={ text="\n\ncopy\n[ident('cpRdREDUCED'); jobmon_ident('cpRdREDUCED')]\n0< 'r
op={ text="\n\npeek\n-nrecs 10\n-name\n\n[ident('peekRdREDUCED'); jobmon_ident(
op={ text="\n\nremdup\n-keep last\n-key 'seq'\n\n[ident('rdREDUCED'); jobmon_id
op={ text="\n\ntransform\n-flag run\n-name 'V29S4_flatcols04b_txREDUCED'\n\n[id
op={ text="\n\ntsort\n-stable\n-key 'PROD_HOLD_SEQ'\n-asc\n-sorted\n-key 'SSR_C
op={ text="\n\ndscapiop\n-stageprops flatcols04b.db2W_inREDUCED '{\n   SERVER=\
op={ text="\n\ncopy\n[ident('cpJoin'); jobmon_ident('cpJoin')]\n0< 'join:join.v
data={ text="\n0> [-pp] 'db2S_inCOP:db2S_inCOP.v'", line=32, column=1, name="db
data={ text="\n0> [modify(keep PROD_HOLD_SEQ,cntry_prod;)] 'fauxSortCOP:fauxSor
data={ text="\n0> [] 'txCOP:rdCOP.v'", line=64, column=1, name="txCOP:rdCOP.v",
data={ text="\n0> [modify(keep seq,cop_ctry;)] 'rdCOP:cpRdCOP.v'", line=73, col
data={ text="\n0> [modify(keep seq,cop_ctry;)] 'cpRdCOP:peekRdCOP.v'", line=82,
data={ text="\n1> [modify(keep seq;flat_cop=cop_ctry;)] 'cpRdCOP:joinCOP.v'", l
data={ text="\n0> [-pp] 'db2S_inCOS:db2S_inCOS.v'", line=133, column=1, name="d
data={ text="\n0> [modify(keep PROD_HOLD_SEQ,CNTRY_SALE;)] 'fauxSortCOS:txCOS.v
data={ text="\n0> [] 'txCOS:rdCOP.v'", line=165, column=1, name="txCOS:rdCOP.v"
data={ text="\n0> [modify(keep seq,cos_ctry;)] 'rdCOS:cpRdCOS.v'", line=180, co
data={ text="\n0< 'cpRdCOS:peekRdCOS.v'", line=191, column=1, name="cpRdCOS:pee
data={ text="\n1> [modify(keep seq;flat_cos=cos_ctry;)] 'cpRdCOS:joinCOS.v'", l
data={ text="\n0< 'fauxSortPROD:fauxSortPROD.v'", line=212, column=1, name="fau
data={ text="\n3< 'cpPKLANG:joinPKLANG.v'", line=215, column=1, name="cpPKLANG:
data={ text="\n4< 'cpRdNOURISH:joinNOURISH.v'", line=216, column=1, name="cpRdN
data={ text="\n5< 'cpRdREDUCED:joinREDUCED.v'", line=217, column=1, name="cpRdR
data={ text="\n0> [modify(keep seq,flat_cop,flat_cos,flat_packlang,flat_ssn,fla
data={ text="\n0< 'cpJoin:peek_flatcols.v'", line=230, column=1, name="cpJoin:p
data={ text="\n0< 'db2S_inPROD:db2S_inPROD.v'", line=235, column=1, name="db2S_
data={ text="\n0< 'rdPKLANG:cpRdPKLANG.v'", line=288, column=1, name="rdPKLANG:
data={ text="\n0> [modify(keep seq;pklang=flat_packlang;)] 'cpPKLANG:peekRdPKLA
data={ text="\n0< 'txPKLANG:rdPKLANG.v'", line=310, column=1, name="txPKLANG:rd
data={ text="\n0< 'fauxSortPKLANG:txPKLANG.v'", line=331, column=1, name="fauxS
data={ text="\n0< 'db2S_inPKLANG:db2S_inPKLANG.v'", line=337, column=1, name="d
data={ text="\n0> [-pp] 'db2W_inNOURISH:db2W_inNOURISH.v'", line=427, column=1,
data={ text="\n0> [modify(keep PROD_HOLD_SEQ;SSN_CODE:string[max=780]=SSN_CODE;
data={ text="\n0> [] 'txNOURISH:rdNOURISH.v'", line=460, column=1, name="txNOUR
data={ text="\n0> [modify(keep seq,flat_ssn;)] 'rdNOURISH:cpRdNOURISH.v'", line
data={ text="\n0< 'cpRdNOURISH:peekRdNOURISH.v'", line=486, column=1, name="cpR
data={ text="\n0< 'rdREDUCED:cpRdREDUCED.v'", line=504, column=1, name="rdREDUC
data={ text="\n0> [modify(keep seq,flat_ssr;)] 'cpRdREDUCED:peekRdREDUCED.v'", 
data={ text="\n0< 'txREDUCED:rdREDUCED.v'", line=525, column=1, name="txREDUCED
data={ text="\n0< 'fauxSortREDUCED:txREDUCED.v'", line=546, column=1, name="fau
data={ text="\n0< 'db2W_inREDUCED:db2W_inREDUCED.v'", line=552, column=1, name=
data={ text="\n0>| [ds] '/data/datastage/mdir/qa/datasets/flatcols.ds'", line=6
}
.
...[Message split, exceeds 64000 characters]...

Event Id: 5829 Time	: Sun Jan 28 19:30:49 2007 Type	: INFO User	: xyzuser
Message	: main_program: Requesting delayed metadata.
	Requesting delayed metadata.  Requesting delayed metadata.  Requesting delayed
Event Id: 5830 Time	: Sun Jan 28 19:30:49 2007 Type	: INFO User	: xyzuser
Message	: main_program: This step has 40 datasets:
ds0: {op0[1p] (sequential db2S_inCOP) eOther(APT_HashPartitioner { key={ va
ds1: {op1[1p] (parallel fauxSortCOP) [pp] eSame->eCollectAny op12[1p] (para
ds2: {op2[1p] (sequential db2S_inCOS) eOther(APT_HashPartitioner { key={ va
ds3: {op3[1p] (parallel fauxSortCOS) [pp] eSame->eCollectAny op16[1p] (para
ds4: {op4[1p] (sequential db2S_inPROD) eOther(APT_HashPartitioner { key={ v
ds5: {op5[1p] (parallel fauxSortPROD) [pp] eSame->eCollectAny op32[1p] (par
ds6: {op6[1p] (sequential db2S_inPKLANG) eOther(APT_HashPartitioner { key={
ds7: {op7[1p] (parallel fauxSortPKLANG) [pp] eSame->eCollectAny op20[1p] (p
ds8: {op8[1p] (sequential db2W_inNOURISH) eOther(APT_HashPartitioner { key=
ds9: {op9[1p] (parallel fauxSourtNOURISH) [pp] eSame->eCollectAny op24[1p] 
ds10: {op10[1p] (sequential db2W_inREDUCED) eOther(APT_HashPartitioner { ke
ds11: {op11[1p] (parallel fauxSortREDUCED) [pp] eSame->eCollectAny op28[1p]
ds12: {op12[1p] (parallel APT_TransformOperatorImplV0S381_flatcols04b_txCOP
ds13: {op13[1p] (parallel rdCOP) [pp] eSame->eCollectAny op14[1p] (parallel
ds14: {op14[1p] (parallel cpRdCOP) [pp] eSame->eCollectAny op15[1p] (parall
ds15: {op14[1p] (parallel cpRdCOP) [pp] eSame->eCollectAny op32[1p] (parall
ds16: {op16[1p] (parallel APT_TransformOperatorImplV22S0_flatcols04b_txCOS 
ds17: {op17[1p] (parallel rdCOS) [pp] eSame->eCollectAny op18[1p] (parallel
ds18: {op18[1p] (parallel cpRdCOS) [pp] eSame->eCollectAny op19[1p] (parall
ds19: {op18[1p] (parallel cpRdCOS) [pp] eSame->eCollectAny op33[1p] (parall
ds20: {op20[1p] (parallel APT_TransformOperatorImplV27S4_flatcols04b_txPKLA
ds21: {op21[1p] (parallel rdPKLANG) [pp] eSame->eCollectAny op22[1p] (paral
ds22: {op22[1p] (parallel cpPKLANG) [pp] eSame->eCollectAny op23[1p] (paral
ds23: {op22[1p] (parallel cpPKLANG) [pp] eSame->eCollectAny op34[1p] (paral
ds24: {op24[1p] (parallel APT_TransformOperatorImplV28S2_flatcols04b_txNOUR
ds25: {op25[1p] (parallel rdNOURISH) [pp] eSame->eCollectAny op26[1p] (para
ds26: {op26[1p] (parallel cpRdNOURISH) [pp] eSame->eCollectAny op27[1p] (pa
ds27: {op26[1p] (parallel cpRdNOURISH) [pp] eSame->eCollectAny op35[1p] (pa
ds28: {op28[1p] (parallel APT_TransformOperatorImplV29S4_flatcols04b_txREDU
ds29: {op29[1p] (parallel rdREDUCED) [pp] eSame->eCollectAny op30[1p] (para
ds30: {op30[1p] (parallel cpRdREDUCED) [pp] eSame->eCollectAny op31[1p] (pa
ds31: {op30[1p] (parallel cpRdREDUCED) [pp] eSame->eCollectAny op36[1p] (pa
ds32: {op32[1p] (parallel APT_JoinSubOperator(0) in join) [pp] eSame->eColl
ds33: {op33[1p] (parallel APT_JoinSubOperator(1) in join) [pp] eSame->eColl
ds34: {op34[1p] (parallel APT_JoinSubOperator(2) in join) [pp] eSame->eColl
ds35: {op35[1p] (parallel APT_JoinSubOperator(3) in join) [pp] eSame->eColl
ds36: {op36[1p] (parallel APT_JoinSubOperator(4) in join) [pp] eSame->eColl
ds37: {op37[1p] (parallel cpJoin) [pp] eSame->eCollectAny op38[1p] (paralle
ds38: {op39[1p] (parallel delete data files in delete /data/datastage/mdir/
ds39: {op37[1p] (parallel cpJoin) [pp] -> /data/datastage/mdir/qa/datasets/

It has 41 operators:
op0[1p] {(sequential db2S_inCOP) on nodes ( node1[op0,p0])}
op1[1p] {(parallel fauxSortCOP) on nodes ( node1[op1,p0])}
op2[1p] {(sequential db2S_inCOS) on nodes ( node1[op2,p0])}
op3[1p] {(parallel fauxSortCOS) on nodes ( node1[op3,p0])}
op4[1p] {(sequential db2S_inPROD) on nodes ( node1[op4,p0])}
op5[1p] {(parallel fauxSortPROD) on nodes ( node1[op5,p0])}
op6[1p] {(sequential db2S_inPKLANG) on nodes ( node1[op6,p0])}
op7[1p] {(parallel fauxSortPKLANG) on nodes ( node1[op7,p0])}
op8[1p] {(sequential db2W_inNOURISH) on nodes ( node1[op8,p0])}
op9[1p] {(parallel fauxSourtNOURISH) on nodes ( node1[op9,p0])}
op10[1p] {(sequential db2W_inREDUCED) on nodes ( node1[op10,p0])}
op11[1p] {(parallel fauxSortREDUCED) on nodes ( node1[op11,p0])}
op12[1p] {(parallel APT_TransformOperatorImplV0S381_flatcols04b_txCOP in txCOP)
op13[1p] {(parallel rdCOP) on nodes ( node1[op13,p0])}
op14[1p] {(parallel cpRdCOP) on nodes ( node1[op14,p0])}
op15[1p] {(parallel peekRdCOP) on nodes ( node1[op15,p0])}
op16[1p] {(parallel APT_TransformOperatorImplV22S0_flatcols04b_txCOS in txCOS) 
op17[1p] {(parallel rdCOS) on nodes ( node1[op17,p0])}
op18[1p] {(parallel cpRdCOS) on nodes ( node1[op18,p0])}
op19[1p] {(parallel peekRdCOS) on nodes ( node1[op19,p0])}
op20[1p] {(parallel APT_TransformOperatorImplV27S4_flatcols04b_txPKLANG in txPK
op21[1p] {(parallel rdPKLANG) on nodes ( node1[op21,p0])}
op22[1p] {(parallel cpPKLANG) on nodes ( node1[op22,p0])}
op23[1p] {(parallel peekRdPKLANG) on nodes ( node1[op23,p0])}
op24[1p] {(parallel APT_TransformOperatorImplV28S2_flatcols04b_txNOURISH in txN
op25[1p] {(parallel rdNOURISH) on nodes ( node1[op25,p0])}
op26[1p] {(parallel cpRdNOURISH) on nodes ( node1[op26,p0])}
op27[1p] {(parallel peekRdNOURISH) on nodes ( node1[op27,p0])}
op28[1p] {(parallel APT_TransformOperatorImplV29S4_flatcols04b_txREDUCED in txR
op29[1p] {(parallel rdREDUCED) on nodes ( node1[op29,p0])}
op30[1p] {(parallel cpRdREDUCED) on nodes ( node1[op30,p0])}
op31[1p] {(parallel peekRdREDUCED) on nodes ( node1[op31,p0])}
op32[1p] {(parallel APT_JoinSubOperator(0) in join) on nodes ( node1[op32,p0])}
op33[1p] {(parallel APT_JoinSubOperator(1) in join) on nodes ( node1[op33,p0])}
op34[1p] {(parallel APT_JoinSubOperator(2) in join) on nodes ( node1[op34,p0])}
op35[1p] {(parallel APT_JoinSubOperator(3) in join) on nodes ( node1[op35,p0])}
op36[1p] {(parallel APT_JoinSubOperator(4) in join) on nodes ( node1[op36,p0])}
op37[1p] {(parallel cpJoin) on nodes ( node1[op37,p0])}
op38[1p] {(parallel peek_flatcols) on nodes ( node1[op38,p0])}
op39[1p] {(parallel delete data files in delete /data/datastage/mdir/qa/dataset
op40[1p] {(sequential delete descriptor file in delete /data/datastage/mdir/qa/
It runs 41 processes on 1 node.

Event Id: 5831 Time	: Sun Jan 28 19:30:57 2007 Type	: INFO User	: xyzuser
Message	: main_program: Schemas:
Data set "db2S_inCOP:db2S_inCOP.v": record ( PROD_HOLD_SEQ: int32; cntry_prod:
Data set "fauxSortCOP:fauxSortCOP.v": record ( PROD_HOLD_SEQ: int32; cntry_pro
Data set "txCOP:rdCOP.v": record ( cop_ctry: string[max=280]; seq: int32;)
Data set "rdCOP:cpRdCOP.v": record ( cop_ctry: string[max=280]; seq: int32;)
Data set "cpRdCOP:peekRdCOP.v": record ( cop_ctry: string[max=280]; seq: int3
Data set "cpRdCOP:joinCOP.v": record ( flat_cop: string[max=280]; seq: int32;)
Data set "db2S_inCOS:db2S_inCOS.v": record ( PROD_HOLD_SEQ: int32; CNTRY_SALE
Data set "fauxSortCOS:txCOS.v": record ( PROD_HOLD_SEQ: int32; CNTRY_SALE: st
Data set "txCOS:rdCOP.v": record ( cos_ctry: string[max=400]; seq: int32;)
Data set "rdCOS:cpRdCOS.v": record ( cos_ctry: string[max=400]; seq: int32;)
Data set "cpRdCOS:peekRdCOS.v": record ( cos_ctry: string[max=400]; seq: int3
Data set "cpRdCOS:joinCOS.v": record ( flat_cos: string[max=400]; seq: int32;)
Data set "fauxSortPROD:fauxSortPROD.v": record ( seq: int32;)
Data set "cpPKLANG:joinPKLANG.v": record ( flat_packlang: string[max=240]; seq
Data set "cpRdNOURISH:joinNOURISH.v": record ( flat_ssn: string[max=780]; seq:
Data set "cpRdREDUCED:joinREDUCED.v": record ( flat_ssr: string[max=300]; seq:
Data set "join:join.v": record ( seq: int32; flat_cop: string[max=280]; flat_c
Data set "cpJoin:peek_flatcols.v": record ( PRODUCT_H_PK: int32; flat_cop: str
Data set "db2S_inPROD:db2S_inPROD.v": record ( seq: int32;) Data set "rdPKLANG
Data set "cpPKLANG:peekRdPKLANG.v": record ( pklang: string[max=240]; seq: int
Data set "txPKLANG:rdPKLANG.v": record ( flat_packlang: string[max=240]; seq: 
Data set "fauxSortPKLANG:txPKLANG.v": record ( PROD_HOLD_SEQ: int32; PACK_LANG
Data set "db2S_inPKLANG:db2S_inPKLANG.v": record ( PROD_HOLD_SEQ: int32; PACK_
Data set "db2W_inNOURISH:db2W_inNOURISH.v": record ( PROD_HOLD_SEQ: int32; SSN
Data set "fauxSourtNOURISH:txNOURISH.v": record ( PROD_HOLD_SEQ: int32; SSN_CO
Data set "txNOURISH:rdNOURISH.v": record ( flat_ssn: string[max=780]; seq: int
Data set "rdNOURISH:cpRdNOURISH.v": record ( flat_ssn: string[max=780]; seq: i
Data set "cpRdNOURISH:peekRdNOURISH.v": record ( flat_ssn: string[max=780]; se
Data set "rdREDUCED:cpRdREDUCED.v": record ( flat_ssr: string[max=300]; seq: i
Data set "cpRdREDUCED:peekRdREDUCED.v": record ( flat_ssr: string[max=300]; se
Data set "txREDUCED:rdREDUCED.v": record ( flat_ssr: string[max=300]; seq: int
Data set "fauxSortREDUCED:txREDUCED.v": record ( PROD_HOLD_SEQ: int32; SSR_COD
Data set "db2W_inREDUCED:db2W_inREDUCED.v": record ( PROD_HOLD_SEQ: int32; SSR
Data set "/data/datastage/mdir/qa/datasets/flatcols.ds": record ( PRODUCT_H_PK
Operator "db2S_inCOP": output 0 interface: record ( PROD_HOLD_SEQ: int32; cntr
Operator "fauxSortCOP": input 0 interface: record ( PROD_HOLD_SEQ: int32; cntr
Operator "txCOP": input 0 interface: record ( PROD_HOLD_SEQ: int32; cntry_prod
Operator "rdCOP": input 0 interface: record ( seq: int32; inRec: *;) output 0 
Operator "cpRdCOP": input 0 interface: record ( inRec: *;) output 0 interface:
Operator "peekRdCOP": input 0 interface: record ( cop_ctry: string[max=280]; s
Operator "db2S_inCOS": output 0 interface: record ( PROD_HOLD_SEQ: int32; CNTR
Operator "fauxSortCOS": input 0 interface: record ( PROD_HOLD_SEQ: int32; CNTR
Operator "txCOS": input 0 interface: record ( PROD_HOLD_SEQ: int32; CNTRY_SALE
Operator "rdCOS": input 0 interface: record ( seq: int32; inRec: *;) output 0 
Operator "peekRdCOS": input 0 interface: record ( cos_ctry: string[max=400]; s
Operator "cpRdCOS": input 0 interface: record ( inRec: *;) output 0 interface:
Operator "join": input 0 interface: record ( seq: int32; leftRec: *;) input 1 
Operator "peek_flatcols": input 0 interface: record ( PRODUCT_H_PK: int32; fla
Operator "fauxSortPROD": input 0 interface: record ( seq: int32; sortRec: *;) 
Operator "db2S_inPROD": output 0 interface: record ( seq: int32;)
Operator "cpPKLANG": input 0 interface: record ( inRec: *;) output 0 interface
Operator "peekRdPKLANG": input 0 interface: record ( pklang: string[max=240]; 
Operator "rdPKLANG": input 0 interface: record ( seq: int32; inRec: *;) output
Operator "txPKLANG": input 0 interface: record ( PROD_HOLD_SEQ: int32; PACK_LA
Operator "fauxSortPKLANG": input 0 interface: record ( PROD_HOLD_SEQ: int32; P
Operator "db2S_inPKLANG": output 0 interface: record ( PROD_HOLD_SEQ: int32; P
Operator "db2W_inNOURISH": output 0 interface: record ( PROD_HOLD_SEQ: int32; 
Operator "fauxSourtNOURISH": input 0 interface: record ( PROD_HOLD_SEQ: int32;
Operator "txNOURISH": input 0 interface: record ( PROD_HOLD_SEQ: int32; SSN_CO
Operator "rdNOURISH": input 0 interface: record ( seq: int32; inRec: *;) outpu
Operator "peekRdNOURISH": input 0 interface: record ( flat_ssn: string[max=780
Operator "cpRdNOURISH": input 0 interface: record ( inRec: *;) output 0 interf
Operator "cpRdREDUCED": input 0 interface: record ( inRec: *;) output 0 interf
Operator "peekRdREDUCED": input 0 interface: record ( flat_ssr: string[max=300
Operator "rdREDUCED": input 0 interface: record ( seq: int32; inRec: *;) outpu
Operator "txREDUCED": input 0 interface: record ( PROD_HOLD_SEQ: int32; SSR_CO
Operator "fauxSortREDUCED": input 0 interface: record ( PROD_HOLD_SEQ: int32; 
Operator "db2W_inREDUCED": output 0 interface: record ( PROD_HOLD_SEQ: int32; 
Operator "cpJoin": input 0 interface: record ( inRec: *;) output 0 interface: 
.
Event Id: 5832 Time	: Sun Jan 28 19:30:57 2007 Type	: FATAL User	: xyzuser
Message	: node_node1: Player 35 terminated unexpectedly.
Event Id: 5833 Time	: Sun Jan 28 19:30:57 2007 Type	: FATAL User	: xyzuser
Message	: main_program: Unexpected termination by Unix signal 9(SIGKILL)
Event Id: 5834 Time	: Sun Jan 28 19:31:03 2007 Type	: FATAL User	: xyzuser
Message	: main_program: Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
	Unexpected exit status 1
Event Id: 5835 Time	: Sun Jan 28 19:31:03 2007 Type	: FATAL User	: xyzuser
Message	: main_program: Step execution finished with status = FAILED.
Event Id: 5836 Time	: Sun Jan 28 19:31:03 2007 Type	: INFO User	: xyzuser
Message	: main_program: Startup time, 0:20; production run time, 0:00.
Event Id: 5837 Time	: Sun Jan 28 19:31:03 2007 Type	: STARTED User	: xyzuser
Message	: Job flatcols04b aborted.

asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

SIGKILL - possibly caused by JobMon

Post by asorrell »

I was at a client site where we had this problem (various random SIGKILLs - re-run the job and it runs fine).

We tried changing buffers, we tried changing scratch / resource, we tried changing to single-node configs. Nothing seemed to get rid of it.

We got really desperate, so we called IBM Customer Service. After several days of diagnosis they felt that it was being caused by the Job Monitoring process itself. Apparently this is a known PX issue where on certain systems (this was AIX) the monitoring process can cause a job to randomly abort.

Initially they asked us to set the Monitor timeslice very high (update once every few minutes). This reduced the occurrences but didn't eliminate them. So on our production system we set APT_NO_JOBMON (I think that is right) to True - to totally disable jobmon on that system. Since then, no SIGKILLs.
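For anyone wanting to try the same thing, we simply defaulted it in the project environment (Administrator) or in dsenv so every job picks it up - roughly like this (treat it as a sketch and double-check the variable name, since as I said I'm going from memory):

Code:

   # in $DSHOME/dsenv on the server, or as a project-level
   # environment variable defaulted to True in Administrator
   APT_NO_JOBMON=True
   export APT_NO_JOBMON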

Upside - the system works fine now. Downside - you have to go check the Director log after the run to see final record counts.

My understanding is that with the totally new performance monitoring in release 8.0 this problem goes away.

If this fixes your problem please post here to confirm...
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

That's an excellent find. I think you should post your entry in the FAQ forum, as SIGKILL is among the most dreaded issues for many developers.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

Post by pneumalin »

Search for the word APT_NO_JOBMON. It's been discussed several times over the last year... The latest version, 7.5.2, actually resolves this issue, or get the patch for 7.5.1A from IBM Support to fix it.
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

followup coming Re: SIGKILL - possibly caused by JobMon

Post by jgreve »

Thank you. I will plug APT_NO_JOBMON into
my dsjobs and let everyone know how things
turn out in a few days (we have a big run coming
up this weekend!).
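(For the record, the plan is to expose $APT_NO_JOBMON as a job parameter and have the driver pass it on every run, roughly like this - assuming the jobs actually expose that parameter:)

Code:

   dsjob -run -wait -param '$APT_NO_JOBMON=True' mdirqa flatcols04b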

John G.
asorrell wrote:So on our production system we set APT_NO_JOBMON (I think that is right) to True - to totally disable jobmon on that system. Since then, no SIGKILLs.

Upside - system works fine now. Downside - you have to go check the director log after the run to see final record counts.

My understanding is that with the totally new performance monitoring in release 8.0 this problem goes away.

If this fixes your problem please post here to confirm...
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

looking much better Re: SIGKILL - possibly caused by JobMon

Post by jgreve »

asorrell wrote:If this fixes your problem please post here to confirm...
So far, so good.
I'm running a large batch of records (200,000 main records, each with about 30 to 60 associated helper records in various tables).
Anyway, for the last month I would launch this batch to run overnight.
It would usually abort after about 15,000 to 30,000 of the main records.

This last week, I've launched it twice and it was still running when I came in the next day. Out Standing! It is much more stable now.

As an aside, I'm not throwing all of the records at my jobs in one shot. There is a driver script that shovels chunks of 1000 main records into a set of jobs + an SQL script (roughly sketched below). If I bite off too many records at once, the SQL goes exponential. This also makes error recovery / troubleshooting much more manageable.
I mention this because each DS job was actually invoked only about 15 to 30 times before the most complex one would abort. And it was almost always the most complex job; twice I had other jobs abort, but that appeared to be related to database-availability issues.
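(Roughly the shape of the driver, sketched in ksh - START_SEQ/END_SEQ are made-up parameter names, just to show the chunking idea:)

Code:

   #!/bin/ksh
   # Feed the jobs 1000 main records at a time instead of all 200,000 at once.
   CHUNK=1000
   START=1
   TOTAL=200000

   while [ $START -le $TOTAL ]; do
       END=`expr $START + $CHUNK - 1`
       dsjob -run -wait \
           -param "START_SEQ=$START" -param "END_SEQ=$END" \
           mdirqa flatcols04b || exit 1    # stop on the first abort
       START=`expr $END + 1`
   done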

So I have gone from only 2 or 3 dozen dsjob invocations to 200+.
I am a happy camper! :)

John G.