Parallel job reports failure (code 139)

priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

Hi,

I went through all the related posts.
I tried rebooting, deleting all datasets, and the other things suggested,
but that hasn't solved my problem.

I have around 1200 jobs to populate the warehouse.

I combined all the jobs into sub-sequences and then into a master sequence. But when I run the master sequence, at least one job fails with the fatal error below. It is not always the same job; the failure is not tied to any particular job or set of jobs.

Parallel job reports failure (code 139)

Is there any setting I missed that should have been made before running this many jobs, or is something else causing the problem?

Up to 50 jobs may be running at a time,
but I get this error even when only around 4-5 jobs are running.

Additional Information:
RCP is disabled

Regards,
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

There should be more to the error message (something with segmentation fault). Can you post that? Also, if you reset the job do you get an entry in your log with "from previous run"?

Post by priyadarshikunal »

ArndW wrote:There should be more to the error message (something with segmentation fault). Can you post that? Also, if you reset the job do you get an entry in your log with "from previous run"? ...
I sometimes get one warning, but not every time:

Code: Select all

main_program: Received SIGPIPE signal caused by closing of the socket on port 13400.
No output will be sent to port 13400 for the rest of the job.
RT_SC675/OshExecuter.sh[25]: 1114310 Memory fault(coredump) 
And no, I cannot find any entry with "from previous run" in the log.

Post by priyadarshikunal »

Additionally, this problem started when we moved our code to a superior server.

On the older server this error occurred far less often.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

This is some new meaning of "superior", then?
:lol:

Do you know what this port number is used for?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by priyadarshikunal »

ray.wurlod wrote:This is some new meaning of "superior", then?
:lol:

Do you know what this port number is used for?
Ans 1:

I meant that the configuration is higher (almost double) than that of the older server.
Yes, I haven't seen anything better so far except the timings. :?

Ans 2:

Port 13400 is used by Oracle Application Server as the container for J2EE services.

13400 is the default port for IIOP (Oracle container for J2EE).

IIOP (Internet Inter-ORB Protocol) is a protocol that makes it possible for distributed programs written in different programming languages to communicate over the Internet or an intranet.

This warning does not appear every time, only sometimes, as mentioned earlier.

But yes, it is also an issue.

Post by ray.wurlod »

Sounds like you need to get your Oracle DBA on board, and maybe your network administrator as well, to find out whether it's Oracle that's raising the SIGPIPE signals and, if so, for what reason (for example unexpected delays).

That OshExecuter.sh has core dumped may indicate that something at the DataStage end is to blame; can you take a look at or around line 25 of RT_SC675/OshExecuter.sh in your project directory to see what it was trying to do? Preserve the core file in your project directory; IBM support may want to analyze it.

Post by priyadarshikunal »

ray.wurlod wrote:Sounds like you need to get your Oracle DBA on board, and maybe your network administrator as well, to find out whether it's Oracle that's raising the SIGPIPE signals and, if so, for what reason .
I went through the file OshExecuter.sh but I cannot find anything unusual.

I am posting the script; maybe I missed something.

Code: Select all

1	#!/bin/sh
2	# Shell script for Datastage to execute an osh script, generated at 2008-02-14 14:20:49
3	# Compiler runtime stamp 8.0///55/C
4	#
5	# Parameters:
6	# $1 - Run indicator: R=normal run, P=performance wrappered run
7	# $2 - Environment variable file name - dummy
8	# $3... - Osh / performance checker command line arguments
9	RunIndicator=$1
10	DummyEnv=$2
11	shift
12	shift
13	if test ! -x "$APT_ORCHHOME/bin/osh"
14	  then echo '##OSHRETVAL NOOSH'
15	  exit 1
16	fi
17	# Test for resource estimation option.
18	if test $RunIndicator = P
19	  then $APT_ORCHHOME/bin/orchresest "$@" 2>&1 &
20	  else $APT_ORCHHOME/bin/osh "$@" 2>&1 &
21	fi
22	oshpid=$!
23	# Write the pid of the conductor process
24	echo '##OSHPID' $oshpid
25	wait $oshpid
26	# Write the terminating string
27	echo '##OSHRETVAL' $?
28	# end of script
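
A note on reading the script above: line 25 is the `wait $oshpid`, so the "Memory fault(coredump)" message at `OshExecuter.sh[25]` is the shell reporting that the process it was waiting on died, which suggests the osh conductor dumped core rather than the script itself (that reading is an inference, not something confirmed in this thread). In most shells a status above 128 means "killed by signal (status - 128)", so 139 corresponds to signal 11, SIGSEGV. A minimal sketch of the same wait-then-report pattern, assuming a POSIX shell on a platform where SIGSEGV is signal 11, with a hypothetical child standing in for osh:

```shell
#!/bin/sh
# Hypothetical stand-in for osh: a child shell that kills itself with SIGSEGV.
# Core dumps are disabled in the child so the demo leaves no core file behind.
sh -c 'ulimit -c 0; kill -s SEGV $$' &
childpid=$!

# Same idea as lines 22-27 of OshExecuter.sh: wait for the child, then read
# its status. The "|| status=$?" form keeps the demo safe under "set -e".
status=0
wait "$childpid" || status=$?

# By shell convention, a status above 128 means the child was killed by
# signal number (status - 128).
if [ "$status" -gt 128 ]; then
  signum=$((status - 128))
else
  signum=0
fi
echo "status=$status signal=$signum"
```

The shell may also print its own "Segmentation fault" or "Memory fault" line on stderr when the child is reaped, which is the counterpart of the message in the job log.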

Post by priyadarshikunal »

One more thing: when I run that job on its own, it runs fine,
but when I run it from the sequence again, it fails with the same error.

I don't know why this is happening.

Regards.

Post by priyadarshikunal »

Once again, the error was due to the "Time Based Job Monitor". :x

Changing

APT_MONITOR_SIZE=100000 and
APT_MONITOR_TIME=5 resolved the issue. :D (only the code 139 error)
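
For anyone who finds this thread later, here is a sketch of how those two values might be applied at the shell level. It assumes the variables are exported in the environment the parallel jobs start from (for example a dsenv-style file that the engine sources). Where they are actually set is not stated in this thread, and they could equally be defined as project-level or job-level environment variables.

```shell
# Values reported in this thread as resolving the code 139 failures.
# Assumption: this fragment is sourced by whatever environment launches
# the parallel jobs; the location is not given in the thread.
APT_MONITOR_SIZE=100000   # value from the post
APT_MONITOR_TIME=5        # value from the post
export APT_MONITOR_SIZE APT_MONITOR_TIME
```

After a change like this, the effective values should show up in the environment dump at the start of a job's log, which is a quick way to check that the setting was picked up.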

Thanks to all