run time fatal error : Player 12 terminated unexpectedly

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

mctny
Charter Member
Posts: 166
Joined: Thu Feb 02, 2006 6:55 am

run time fatal error : Player 12 terminated unexpectedly

Post by mctny »

Hi everyone,

I was wondering if you could help me investigate a problem I run into every two or three days, with different jobs that run nightly in production.
I have no idea whether it is a bug in DataStage EE or something related to the OS or the database. There is no warning beforehand; the log just says a player (2, 6, 12, etc.) terminated unexpectedly. My managers keep asking me why jobs fail every two or three days. I did not design these jobs, but I cannot use that as an answer; I have to find the cause, and the DataStage error log is not descriptive at all.

I asked the DBA, and he said there was no issue with UNIX at the time the jobs failed.

I would appreciate it if you could share your ideas.
Thanks
Cetin


Event #:2641
Timestamp:6/24/2006 4:35:35 AM
Event type:Fatal
User:dsadm
Message:
node_node1: Player 12 terminated unexpectedly.
Thanks,
Chad
__________________________________________________________________
"There are three kinds of people in this world; Ones who know how to count and the others who don't know how to count !"
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I've got no direct ideas, but I was going to suggest you search the forum for keywords in your message. When I searched for 'Player terminated unexpectedly' I found quite a number of posts, including some from you with what I'm guessing are earlier attempts to solve this problem.

I still think you should do that and see if anything there helps. What are the other messages in the log that precede this one? If you Reset the job after it aborts, do you get anything labelled 'From previous run...'? :?

You've also not stated (in this post) any specifics. Please post your versions of DataStage, operating system and database, and anything you can see that the aborting jobs have in common... use of a Transformer, or certain logic, for instance. There's no way anyone can provide specific assistance without information like that.
-craig

"You can never have too many knives" -- Logan Nine Fingers
mctny
Charter Member
Posts: 166
Joined: Thu Feb 02, 2006 6:55 am

Post by mctny »

Thanks Craig. Yes, I searched the forum before I posted this, but the other posts about this didn't help me much; since the error log is vague, so are the answers.
I am using DataStage EE 7.5.1.A on AIX with 2 nodes. The target database is Oracle 10g and the sources are SQL Server and Oracle.
When I reset the job and rerun it, it runs successfully. There are no errors or warnings prior to the fatal error. Yes, the jobs that fail all have Transformer stages, but then almost all of our jobs have Transformer stages.

thanks again
Thanks,
Chad
__________________________________________________________________
"There are three kinds of people in this world; Ones who know how to count and the others who don't know how to count !"
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Is there any possibility that (a) the node on which the error was reported is overloaded, or (b) that someone is using kill to knock out processes?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

I used to get a similar kind of message, along with a couple of other messages of course, but in my case I was joining two huge tables and also sorting them, so the disk space filled up and the job used to terminate with a similar kind of message. You might want to check disk space; just a guess.
Kris

Where's the "Any" key?-Homer Simpson
mctny
Charter Member
Posts: 166
Joined: Thu Feb 02, 2006 6:55 am

Post by mctny »

There is no possibility that someone is using the kill command. The node being overloaded might be a possibility, although the DBA/UNIX admin says that is not the case; our jobs run after midnight, so the server should not be busy at all.

The jobs are not processing a huge number of rows; the maximum that any of the jobs handles is around 100K rows.

I don't know how to check disk space, or how busy the UNIX server was at the time.


thanks again
Cetin
Thanks,
Chad
__________________________________________________________________
"There are three kinds of people in this world; Ones who know how to count and the others who don't know how to count !"
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

Try to use

Code: Select all

du scratchdiskspace
and issue the command every 30 seconds, observing how the used and available space change. Even with only 100K rows, if the records are big enough and memory is also being used by some other process, the disk might still fill up.
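For instance, a quick loop along these lines would print the usage every 30 seconds (the path is just a placeholder; point it at your actual scratch disk directory):

Code: Select all

while true
do
    date
    du /path/to/scratchdisk     # placeholder - substitute your scratch disk directory
    sleep 30
done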
Kris

Where's the "Any" key?-Homer Simpson
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'd suggest taking Kris's suggestion to your UNIX SA (not your DBA, unless they are one and the same, which would be... unusual) and presenting it to them. Tell them you need them to monitor disk usage during your job runs. They should be more than happy to help.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The -s option is appropriate (you only need the total usage figure). For each directory mentioned as a scratch disk resource in the configuration file, monitor with regular executions of

Code: Select all

du -s pathname
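For instance, a rough sketch, assuming the configuration file names /scratch0 and /scratch1 as scratch disk resources (substitute the directories from your own config file):

Code: Select all

# report total scratch usage for each scratch disk resource once a minute
while true
do
    date
    for dir in /scratch0 /scratch1     # example paths only
    do
        du -s $dir
    done
    sleep 60
done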
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Also check your paging space usage and its pattern. Was the job triggered in parallel?
Try running the same job sequentially.
Try setting the environment variable DISABLE_JOBMON to TRUE. Try it for one or two cycles, and if you don't get any player terminations, please report back.
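As a rough sketch, on AIX something like the following would let you watch paging activity (the pi/po columns) while the jobs run; the interval and count here are arbitrary:

Code: Select all

# sample virtual memory / paging statistics every 30 seconds for an hour
vmstat 30 120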
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Klaus Schaefer
Participant
Posts: 94
Joined: Wed May 08, 2002 8:44 am
Location: Germany
Contact:

Re: run time fatal error : Player 12 terminated unexpectedly

Post by Klaus Schaefer »

"This looks similar to problems relating to Time Based Job Monitoring that have been experienced by other customers and is also documented in the datastage readme.txt (see extract below):

Time Based Job Monitoring - Intermittent Problems
-------------------------------------------------
Intermittent problems have been observed while running jobs on the
Parallel canvas when time based job monitoring is enabled (the default).
Time based job monitoring can be disabled in favor of size based job
monitoring. This is done by unsetting the APT_MONITOR_TIME environment
variable and setting the APT_MONITOR_SIZE variable to a suitable number,
e.g. 1000000. This will cause the job to update row count information
every 1000000 rows. The environment variables can be set in the Project
Properties (in the Administrator) - this will affect all jobs.
Alternatively, they can be set for an individual job using the Job
Properties screen in the Designer.

To fully resolve the problem, physically remove the entry for APT_MONITOR_TIME='' from the DSParams file within the project directory. Then run a simple test job to ensure that the APT_MONITOR_TIME environment variable no longer appears in the job log and has been removed."

Quoted from a support answer that is usually given in similar situations ;-)
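For instance, something along these lines should confirm whether the entry is still present (the project path is only an example; use your own project directory):

Code: Select all

grep APT_MONITOR /opt/Ascential/DataStage/Projects/MyProject/DSParams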

Klaus
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

For this "Intermittent Problems" a patch has been relased by Ascential. But that too only for customers who asks for it. Else the general solution from the support team would be to turn of the monitor.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It would be nice to know what kind of "intermittent problems" have been observed when the job monitor was executing at an interval of too few rows. I can't see that "unexpected termination" of a single player process would be caused by the job monitor. I'd be more likely to suspect totally exhausting some resource, generating a null pointer in the code, or something similarly fatal. Was there a core file produced on the node on which the unexpected termination was reported?
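For instance, a rough way to look for recent core files (the paths are only examples; check the directories your jobs actually run in):

Code: Select all

# list core files modified within the last day
find /path/to/Projects /path/to/scratch -name 'core*' -mtime -1 -ls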
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Post by sud »

Hi,

We used to have the same problem in our previous project. Most of the time, simply resetting and re-running would resolve the issue. Generally this happens (as everyone has already worked out) because the CPU is overloaded. You will face this only on boxes used by many users and when many jobs are triggered simultaneously. I feel there is something like a response-wait limit for a DataStage job, beyond which DataStage sends the job a SIGKILL. :roll:
It took me fifteen years to discover I had no talent for ETL, but I couldn't give it up because by that time I was too famous.
mctny
Charter Member
Posts: 166
Joined: Thu Feb 02, 2006 6:55 am

Post by mctny »

Thank you all for the responses. Yes, simply resetting and re-running the jobs solves the problem most of the time, but when a job fails at night, our tester sees that the tables are not populated and sends an email to everyone early in the morning, so everyone knows that our jobs failed that night. I want to resolve it before anyone notices.

Yes, it is an intermittent problem, which makes it hard to solve. It happens every 2-3 days in different jobs, which causes the sequence to abort. I don't know whether a core file was produced; if so, how can I check it?

I doubt this problem is really related to APT_MONITOR_TIME or APT_MONITOR_SIZE; I don't understand why they would cause job failures. There are not many users connected to the UNIX box at the time the runs happen. It could be a DataStage bug, a UNIX issue, or something not related to DS at all.

Yesterday I set the MONITOR_TIME parameter to nothing, and ALL the jobs failed last night. The error was again the same for most of them, i.e. "Player terminated unexpectedly". One of the errors was different; I will post a new topic for that. It is a SIGSEGV error.

I still haven't figured out the cause, hence I can't suggest a solution for it. I would appreciate it if you could help in one way or another to diagnose the cause and resolve it permanently.


Thank you again
Thanks,
Chad
__________________________________________________________________
"There are three kinds of people in this world; Ones who know how to count and the others who don't know how to count !"