Jobs fail intermittently

gagan8877
Premium Member
Posts: 77
Joined: Mon Jun 19, 2006 1:30 pm


Post by gagan8877 »

We have PROD, ACCP and DEV projects on the same server. A scheduler sequence that calls other sequences is scheduled to run nightly. Intermittently, the sequences/jobs being called fail, and each time it can be a different job. Some sequences/jobs are called in parallel, others one after the other. The failure was on Dec 2, 2009. It has been running smoothly since then, but it can happen again, and there doesn't seem to be any pattern to the failures: they don't happen on a particular day, they happen randomly. The server was rebooted on the 2nd, and there have been no failures since. Here's the log:

Occurred: 1:10:01 AM On date: 12/2/2009 Type: Control
Event: Starting Job jb_arch_parse_process_msgs._SYSPROC_ADMIN_js_code_master_maintenance. (...)

Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: Environment variable settings: (...)

Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: Parallel job initiated

Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: OSH script (...)

Occurred: 1:10:04 AM On date: 12/2/2009 Type: Info
Event: main_program: IBM WebSphere DataStage Enterprise Edition 8.0.1.4843 (...)

Occurred: 1:10:04 AM On date: 12/2/2009 Type: Info
Event: main_program: orchgeneral: loaded (...)

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: APT configuration file: C:/IBM/InformationServer/Server/Configurations/default.apt (...)

{
node "node1"
{
fastname "QSIS"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
}

}

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: APT_PMConnectionRecord::start: cmsg read returned 28, expected 40, Invalid argument

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: **** Parallel startup failed **** (...)

This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.


Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: Step started on node QSIS; it uses 1 nodes. (...)
The program running the step is /C=/IBM/InformationServer/Server/PXEngine/bin/osh.exe.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: The ORCHESTRATE startup program in /C=/IBM/InformationServer/Server/PXEngine/etc/standalone.exe is being used.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: A startup script is not being used.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: The TCP port being used for startup is 10,000; the associated socket number is 4.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: Node status:

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: QSIS -

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: startup script failed or hung

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: APT_PMConnectionRecord:: kill for rsh process failed, Operation not permitted

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: Unable to contact one or more Section Leaders. (...)
Probable configuration problem; contact Orchestrate system administrator.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: Step execution finished with status = FAILED.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: Startup time, 0:06; production run time, 0:00.

Could it be a configuration problem that can occur intermittently? Could it be contention? Where should I begin the investigation? What should I look at? Any help will be greatly appreciated.
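
One possible starting point is to export the job logs from several failed runs and line up the Fatal events by timestamp, to see whether they cluster around the same time of night or the same message. A minimal sketch, assuming the log has been saved as plain text in the same "Occurred: ... On date: ... Type: ..." layout shown above and that dates are month/day/year; the function names and the file name in the usage comment are hypothetical, not part of DataStage:

```python
import re
from datetime import datetime

# Header line of each event, as it appears in the exported log text above.
EVENT_HEADER = re.compile(
    r"Occurred: (?P<time>\d{1,2}:\d{2}:\d{2} [AP]M) "
    r"On date: (?P<date>\d{1,2}/\d{1,2}/\d{4}) "
    r"Type: (?P<type>\w+)"
)


def parse_events(log_text):
    """Return (timestamp, type, message) tuples, one per log event."""
    events = []
    current = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = EVENT_HEADER.match(line)
        if match:
            # Assumption: dates are month/day/year, times are 12-hour clock.
            stamp = datetime.strptime(
                match["date"] + " " + match["time"], "%m/%d/%Y %I:%M:%S %p"
            )
            current = [stamp, match["type"], []]
            events.append(current)
        elif current is not None and line:
            # Continuation lines belong to the most recent event.
            if line.startswith("Event: "):
                line = line[len("Event: "):]
            current[2].append(line)
    return [(stamp, kind, " ".join(body)) for stamp, kind, body in events]


def fatal_events(events):
    """Keep only the events whose type is Fatal."""
    return [event for event in events if event[1] == "Fatal"]


# Usage (hypothetical file name):
#   with open("failed_job_log.txt") as handle:
#       for stamp, kind, message in fatal_events(parse_events(handle.read())):
#           print(stamp, message)
```

Comparing the first Fatal message across failed runs shows whether it is always the same startup error, whichever job happens to fail.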

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Is anyone monitoring system resources, particularly CPU, memory and disk space usage? If so, these logs may prove a fruitful source of diagnostic information. I would guess you have a contention for resources issue.
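
If no monitoring is in place yet, something as simple as the following could be left running across the nightly schedule window. This is only a rough sketch, not a DataStage facility: it assumes Python is available on the server, free disk space comes from the standard library, and CPU and memory figures are recorded only if the third-party psutil package happens to be installed. The function names, output file name and sampling interval are all hypothetical.

```python
import csv
import shutil
import time
from datetime import datetime

try:
    import psutil  # third-party and optional; assumed, not required
except ImportError:
    psutil = None


def sample(paths):
    """Take one reading of free disk space (plus CPU/memory if psutil exists)."""
    row = {"timestamp": datetime.now().isoformat(timespec="seconds")}
    if psutil is not None:
        row["cpu_percent"] = psutil.cpu_percent(interval=0.5)
        row["mem_percent"] = psutil.virtual_memory().percent
    for path in paths:
        usage = shutil.disk_usage(path)
        row["free_gb:" + path] = round(usage.free / 1024 ** 3, 2)
    return row


def monitor(paths, out_file, interval_seconds=60, samples=3):
    """Append `samples` readings to a CSV file, one every `interval_seconds`."""
    with open(out_file, "w", newline="") as handle:
        writer = None
        for index in range(samples):
            row = sample(paths)
            if writer is None:
                writer = csv.DictWriter(handle, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
            handle.flush()  # keep the file readable even if the run is cut short
            if index < samples - 1:
                time.sleep(interval_seconds)


# Usage, with the resource disk and scratch disk from the configuration file
# in the log above (output file name is hypothetical):
#   monitor(
#       ["C:/IBM/InformationServer/Server/Datasets",
#        "C:/IBM/InformationServer/Server/Scratch"],
#       "resource_usage.csv",
#       interval_seconds=60,
#       samples=480,
#   )
```

Lining the CSV timestamps up against the time of a failed job start would show whether the machine was short of memory or disk at that moment.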
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
gagan8877
Premium Member
Posts: 77
Joined: Mon Jun 19, 2006 1:30 pm

Post by gagan8877 »

ray.wurlod wrote:Is anyone monitoring system resources, particularly CPU, memory and disk space usage? If so, these logs may prove a fruitful source of diagnostic information. I would guess you have a contention for resources issue.
Thanks Ray. We haven't monitored the system resources yet, but we are planning to do that in a few days. I will post the results here once I have them. Thanks again.
Gary
"A journey of a thousand miles, begins with one step"