Parallel job reports failure (code 139)

priyadarshikunal · Post by **priyadarshikunal** » Mon Feb 18, 2008 10:27 am

Hi,

I went through all the related posts and tried rebooting, deleting all datasets, and the other things suggested there,
but that hasn't solved my problem.

I have around 1200 jobs to populate the warehouse.

I combined all the jobs into sub-sequences and then into a master sequence. But when I run the master sequence, at least one job fails with the fatal error below. It is not the same job each time, so the failure is not tied to a particular job or set of jobs.

Parallel job reports failure (code 139)

Is there any setting I missed that should have been made before running this many jobs, or is something else causing the problem?

The maximum number of jobs running at one time may reach 50,
but I am getting this error when only around 4-5 jobs are running at a time.

Additional Information:
RCP is disabled

Regards,

ArndW · Post by **ArndW** » Mon Feb 18, 2008 11:43 am

There should be more to the error message (something with segmentation fault). Can you post that? Also, if you reset the job do you get an entry in your log with "from previous run"?

priyadarshikunal · Post by **priyadarshikunal** » Tue Feb 19, 2008 12:22 am

ArndW wrote:There should be more to the error message (something with segmentation fault). Can you post that? Also, if you reset the job do you get an entry in your log with "from previous run"? ...

I sometimes, but not always, get one warning:


main_program: Received SIGPIPE signal caused by closing of the socket on port 13400.
No output will be sent to port 13400 for the rest of the job.
RT_SC675/OshExecuter.sh[25]: 1114310 Memory fault(coredump)

And no, I cannot find any entry with "from previous run" in the log.

priyadarshikunal · Post by **priyadarshikunal** » Tue Feb 19, 2008 12:37 am

Additionally, this problem started when we moved our code to a superior server.

On the older server this error occurred far less often.

ray.wurlod · Post by **ray.wurlod** » Tue Feb 19, 2008 1:49 am

This is some new meaning of "superior", then?

Do you know what this port number is used for?

priyadarshikunal · Post by **priyadarshikunal** » Tue Feb 19, 2008 2:12 am

ray.wurlod wrote:This is some new meaning of "superior", then?

Do you know what this port number is used for?

Ans 1:


I meant the configuration is higher (almost double) than the older one.

And yes, apart from the run times, I haven't seen anything better about it.

Ans 2:


Port 13400 is used by Oracle Application Server as the container for J2EE services.

13400 is the default port for IIOP (Oracle Container for J2EE).

IIOP (Internet Inter-ORB Protocol) is a protocol that makes it possible for distributed programs written in different programming languages to communicate over the Internet or an intranet.

However, this warning does not appear every time; as mentioned earlier, it only comes up sometimes.

Still, yes, it is also an issue.

ray.wurlod · Post by **ray.wurlod** » Tue Feb 19, 2008 2:34 am

Sounds like you need to get your Oracle DBA on board, and maybe your network administrator as well, to find out whether it's Oracle that's raising the SIGPIPE signals and, if so, for what reason (for example unexpected delays).

That OshExecuter.sh has core dumped may indicate that something at the DataStage end is to blame; can you take a look in or around line 25 in RT_SC675/OshExecuter.sh in your project directory to see what it was trying to do? Preserve the core file in your project directory; IBM support may want to analyze it.

priyadarshikunal · Post by **priyadarshikunal** » Tue Feb 19, 2008 6:50 am

ray.wurlod wrote:Sounds like you need to get your Oracle DBA on board, and maybe your network administrator as well, to find out whether it's Oracle that's raising the SIGPIPE signals and, if so, for what reason .

I went through OshExecuter.sh but I cannot find anything unusual.

I am posting the script in case I missed something.


1	#!/bin/sh
2	# Shell script for Datastage to execute an osh script, generated at 2008-02-14 14:20:49
3	# Compiler runtime stamp 8.0///55/C
4	#
5	# Parameters:
6	# $1 - Run indicator: R=normal run, P=performance wrappered run
7	# $2 - Environment variable file name - dummy
8	# $3... - Osh / performance checker command line arguments
9	RunIndicator=$1
10	DummyEnv=$2
11	shift
12	shift
13	if test ! -x "$APT_ORCHHOME/bin/osh"
14	  then echo '##OSHRETVAL NOOSH'
15	  exit 1
16	fi
17	# Test for resource estimation option.
18	if test $RunIndicator = P
19	  then $APT_ORCHHOME/bin/orchresest "$@" 2>&1 &
20	  else $APT_ORCHHOME/bin/osh "$@" 2>&1 &
21	fi
22	oshpid=$!
23	# Write the pid of the conductor process
24	echo '##OSHPID' $oshpid
25	wait $oshpid
26	# Write the terminating string
27	echo '##OSHRETVAL' $?
28	# end of script
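
For anyone reading the script alongside the error: line 25 is `wait $oshpid`, so the `Memory fault(coredump)` message at `OshExecuter.sh[25]` is most likely the shell reporting that the osh child was killed by a signal, not the wrapper script itself crashing. By shell convention a child killed by signal N gives `wait` a status of 128 + N, and SIGSEGV is signal 11 on common Unix platforms, so 128 + 11 = 139. That the "code 139" in the DataStage log is produced this way is presumed here, not confirmed. A minimal sketch of the same wait pattern, with a self-killing child shell standing in for osh (the stand-in and the `status` and `signum` variables are illustrative, not part of the generated script):

```sh
#!/bin/sh
# Stand-in for the osh conductor: a child shell that kills itself with SIGSEGV.
# Illustration only; the real script starts $APT_ORCHHOME/bin/osh here.
# ulimit -c 0 keeps the demo from leaving a core file behind.
sh -c 'ulimit -c 0 2>/dev/null; kill -s SEGV $$' 2>/dev/null &
oshpid=$!

# Same pattern as lines 22-27 of the generated script.
wait $oshpid
status=$?

# A status above 128 means the child was terminated by signal (status - 128).
signum=0
if [ "$status" -gt 128 ]; then
  signum=$((status - 128))
  echo "child terminated by signal $signum (exit status $status)"
else
  echo "child exited normally with status $status"
fi
```

Read this way, the script itself looks fine, and the thing to investigate is why the osh process receives the memory fault.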

priyadarshikunal · Post by **priyadarshikunal** » Tue Feb 19, 2008 7:09 am

One more thing:

When I run that job on its own, it runs fine,

but when I run it from the sequence again, it fails with the same error.

I don't know why this is happening.

Regards.

priyadarshikunal · Post by **priyadarshikunal** » Wed Feb 27, 2008 4:27 am

Once again the error was due to the "Time Based Job Monitor".

Setting

APT_MONITOR_SIZE=100000 and
APT_MONITOR_TIME=5 resolved that issue. :D (only the error with code 139)

Thanks to all
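
For anyone landing here with the same error, the fix reported above could be applied roughly as follows. This is only a sketch: the values are the ones given in the post, but where they were set is not stated. Putting them in the engine's `dsenv` file is an assumption; the same variables could instead be set per project or as job parameters. As far as I know, 13400 is also the default port of the DataStage job monitor, which would fit both the earlier SIGPIPE warning and this resolution, but treat that as unconfirmed for this installation.

```sh
# Hypothetical placement: lines like these could go in the engine's dsenv file.
# Values are taken from the post above.
APT_MONITOR_SIZE=100000
APT_MONITOR_TIME=5
export APT_MONITOR_SIZE APT_MONITOR_TIME

# Confirm what a child process (such as a job started from this shell) would inherit.
env | grep '^APT_MONITOR_'
```

If the variables are set in a file read at engine startup, the change would presumably only take effect after the engine is restarted.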