run time fatal error : Player 12 terminated unexpectedly

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

mctny
Charter Member
Posts: 166
Joined: Thu Feb 02, 2006 6:55 am

run time fatal error : Player 12 terminated unexpectedly

Post by mctny »

Hi everyone,

I was wondering if you could help me investigate a problem I run into every two or three days, with different jobs that run nightly in production.
I have no idea whether it is a bug in DataStage EE or something related to the OS or the database. There is no warning beforehand; the log just says a player (2, 6, 12, etc.) terminated unexpectedly. My managers keep asking me why jobs fail every two or three days. I did not design these jobs, but I cannot use that as an answer; I have to find the cause, and the DataStage error log is not descriptive at all.

I asked the DBA, and he said there was no issue with UNIX at the time the jobs failed.

I would appreciate it if you could share your ideas.
Thanks
Cetin


Event #:2641
Timestamp:6/24/2006 4:35:35 AM
Event type:Fatal
User:dsadm
Message:
node_node1: Player 12 terminated unexpectedly.
Thanks,
Chad
__________________________________________________________________
"There are three kinds of people in this world; Ones who know how to count and the others who don't know how to count !"
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I've got no direct ideas, but I was going to suggest you search the forum for keywords in your message. When I searched for 'Player terminated unexpectedly' I found quite a number of posts, including some from you with what I'm guessing are earlier attempts to solve this problem.

I still think you should do that and see if anything there helps. What are the other messages in the log that precede this one? If you Reset the job after it aborts, do you get anything labelled 'From previous run...'? :?

You've also not stated (in this post) any specifics. Please post your versions of DataStage, operating system and database, and anything you can see that the aborting jobs have in common... use of a Transformer, or certain logic, for instance. There's no way anyone can provide specific assistance without information like that.
-craig

"You can never have too many knives" -- Logan Nine Fingers
mctny
Charter Member
Posts: 166
Joined: Thu Feb 02, 2006 6:55 am

Post by mctny »

Thanks Craig. Yes, I searched the forum before I posted this, but the other posts about this didn't help me much; since the error log is vague, so are the answers.
I am using DataStage EE 7.5.1.A on AIX with 2 nodes. The target database is Oracle 10g and the sources are SQL Server and Oracle.
When I reset the job and rerun it, it runs successfully. There are no errors or warnings prior to the fatal error. Yes, the jobs that fail all have Transformer stages, but then almost all of our jobs have Transformer stages.

thanks again
Thanks,
Chad
__________________________________________________________________
"There are three kinds of people in this world; Ones who know how to count and the others who don't know how to count !"
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Is there any possibility that (a) the node on which the error was reported is overloaded, or (b) that someone is using kill to knock out processes?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

I used to get a similar kind of message, along with a couple of other messages of course, but in my case I was joining two huge tables and also sorting them, so the disk space filled up and the job used to terminate with a similar kind of message. You might want to check disk space; just a guess.
Kris

Where's the "Any" key?-Homer Simpson
mctny
Charter Member
Posts: 166
Joined: Thu Feb 02, 2006 6:55 am

Post by mctny »

There is no possibility that someone is using the kill command. The node being overloaded might be a possibility, although the DBA/UNIX admin says that is not the case; our jobs run after midnight, so the server should not be busy at all.

The jobs are not processing a huge number of rows; the maximum that any of the jobs handles is around 100K rows.

I don't know how to check disk space, or how busy the UNIX server was at the time.


thanks again
Cetin
Thanks,
Chad
__________________________________________________________________
"There are three kinds of people in this world; Ones who know how to count and the others who don't know how to count !"
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

Try to use

Code: Select all

du scratchdiskspace
and issue the command every 30 seconds, observing how the used and available space change. Even with only 100K rows, if the records are big enough and memory is also being used by some other process, the disk might still fill up.
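For instance, a quick loop along these lines would print the usage every 30 seconds (the path is just a placeholder; point it at your actual scratch disk directory):

Code: Select all

while true
do
    date
    du /path/to/scratchdisk     # placeholder - substitute your scratch disk directory
    sleep 30
done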
Kris

Where's the "Any" key?-Homer Simpson
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'd suggest taking Kris's suggestion to your UNIX SA (not your DBA, unless they are one and the same, which would be... unusual) and presenting it to them. Tell them you need them to monitor disk usage during your job runs. They should be more than happy to help.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The -s option is appropriate (you only need the total usage figure). For each directory mentioned as a scratch disk resource in the configuration file, monitor with regular executions of

Code: Select all

du -s pathname
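For instance, a rough sketch, assuming the configuration file names /scratch0 and /scratch1 as scratch disk resources (substitute the directories from your own config file):

Code: Select all

# report total scratch usage for each scratch disk resource once a minute
while true
do
    date
    for dir in /scratch0 /scratch1     # example paths only
    do
        du -s $dir
    done
    sleep 60
done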
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Also check your paging space usage and its pattern. Was the job triggered in parallel?
Try running the same job sequentially.
Try setting the environment variable DISABLE_JOBMON to TRUE. Try it for one or two cycles, and if you don't get any player terminations, please report back.
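As a rough sketch, on AIX something like the following would let you watch paging activity (the pi/po columns) while the jobs run; the interval and count here are arbitrary:

Code: Select all

# sample virtual memory / paging statistics every 30 seconds for an hour
vmstat 30 120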
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Klaus Schaefer
Participant
Posts: 94
Joined: Wed May 08, 2002 8:44 am
Location: Germany
Contact:

Re: run time fatal error : Player 12 terminated unexpectedly

Post by Klaus Schaefer »

"This looks similar to problems relating to Time Based Job Monitoring that have been experienced by other customers and is also documented in the datastage readme.txt (see extract below):

Time Based Job Monitoring - Intermittent Problems
-------------------------------------------------
Intermittent problems have been observed while running jobs on the
Parallel canvas when time based job monitoring is enabled (the default).
Time based job monitoring can be disabled in favor of size based job
monitoring. This is done by unsetting the APT_MONITOR_TIME environment
variable and setting the APT_MONITOR_SIZE variable to a suitable number,
e.g. 1000000. This will cause the job to update row count information
every 1000000 rows. The environment variables can be set in the Project
Properties (in the Administrator) - this will affect all jobs.
Alternatively, they can be set for an individual job using the Job
Properties screen in the Designer.

To fully resolve the problem, physically remove the entry for APT_MONITOR_TIME='' from the DSParams file within the project directory. Then run a simple test job to ensure that the APT_MONITOR_TIME environment variable no longer appears in the job log and has been removed."

Quoted from a support answer that is usually given in similar situations ;-)
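For instance, something along these lines should confirm whether the entry is still present (the project path is only an example; use your own project directory):

Code: Select all

grep APT_MONITOR /opt/Ascential/DataStage/Projects/MyProject/DSParams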

Klaus
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

For this "Intermittent Problems" a patch has been relased by Ascential. But that too only for customers who asks for it. Else the general solution from the support team would be to turn of the monitor.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It would be nice to know what kind of "intermittent problems" have been observed when the job monitor was executing at an interval of too few rows. I can't see that "unexpected termination" of a single player process would be caused by the job monitor. I'd be more likely to suspect totally exhausting some resource, generating a null pointer in the code, or something similarly fatal. Was there a core file produced on the node on which the unexpected termination was reported?
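For instance, a rough way to look for recent core files (the paths are only examples; check the directories your jobs actually run in):

Code: Select all

# list core files modified within the last day
find /path/to/Projects /path/to/scratch -name 'core*' -mtime -1 -ls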
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Post by sud »

Hi,

We used to have the same problem in our previous project. Most of the time, simply resetting and re-running would resolve the issue. Generally this happens (as everyone has already worked out) because the CPU is overloaded. You will face this only on boxes used by many users and when many jobs are triggered simultaneously. I feel there is something like a response-wait limit for a DataStage job, beyond which DataStage sends the job a SIGKILL. :roll:
It took me fifteen years to discover I had no talent for ETL, but I couldn't give it up because by that time I was too famous.
mctny
Charter Member
Posts: 166
Joined: Thu Feb 02, 2006 6:55 am

Post by mctny »

Thank you all for the responses. Yes, simply resetting and re-running the jobs solves the problem most of the time, but when a job fails at night, our tester sees that the tables are not populated and sends an email to everyone early in the morning, so everyone knows that our jobs failed that night. I want to resolve it before anyone notices.

Yes, it is an intermittent problem, which makes it hard to solve. It happens every 2-3 days in different jobs, which causes the sequence to abort. I don't know whether a core file was produced; if so, how can I check it?

I doubt this problem is really related to APT_MONITOR_TIME or APT_MONITOR_SIZE; I don't understand why they would cause job failures. There are not many users connected to the UNIX box at the time the runs happen. It could be a DataStage bug, a UNIX issue, or something not related to DS at all.

Yesterday I set the MONITOR_TIME parameter to nothing, and ALL the jobs failed last night. The error was again the same for most of them, i.e. "Player terminated unexpectedly". One of the errors was different; I will post a new topic for that. It is a SIGSEGV error.

I still haven't figured out the cause, hence I can't suggest a solution for it. I would appreciate it if you could help in one way or another to diagnose the cause and resolve it permanently.


Thank you again
Thanks,
Chad
__________________________________________________________________
"There are three kinds of people in this world; Ones who know how to count and the others who don't know how to count !"