Jobs fail intermittently

gagan8877 · Post by **gagan8877** » Mon Dec 07, 2009 2:03 pm

We have PROD, ACCP and DEV projects on the same server. A scheduler sequence that calls other sequences is scheduled to run nightly. Intermittently, the sequences/jobs being called fail, and each time it could be a different job. Some sequences/jobs are called in parallel, others one after the other. The latest failure was on Dec 2, 2009. It has been running smoothly since then, but it can happen again, and there doesn't seem to be any pattern to the failures: they don't happen on a particular day, they happen randomly. The server was rebooted on the 2nd and there have been no failures since. Here's the log:

Occurred: 1:10:01 AM On date: 12/2/2009 Type: Control
Event: Starting Job jb_arch_parse_process_msgs._SYSPROC_ADMIN_js_code_master_maintenance. (...)

Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: Environment variable settings: (...)

Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: Parallel job initiated

Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: OSH script (...)

Occurred: 1:10:04 AM On date: 12/2/2009 Type: Info
Event: main_program: IBM WebSphere DataStage Enterprise Edition 8.0.1.4843 (...)

Occurred: 1:10:04 AM On date: 12/2/2009 Type: Info
Event: main_program: orchgeneral: loaded (...)

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: APT configuration file: C:/IBM/InformationServer/Server/Configurations/default.apt (...)

{
node "node1"
{
fastname "QSIS"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
}

}

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: APT_PMConnectionRecord::start: cmsg read returned 28, expected 40, Invalid argument

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: **** Parallel startup failed **** (...)

This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: Step started on node QSIS; it uses 1 nodes. (...)
The program running the step is /C=/IBM/InformationServer/Server/PXEngine/bin/osh.exe.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: The ORCHESTRATE startup program in /C=/IBM/InformationServer/Server/PXEngine/etc/standalone.exe is being used.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: A startup script is not being used.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: The TCP port being used for startup is 10,000; the associated socket number is 4.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: Node status:

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: QSIS -

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: startup script failed or hung

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: APT_PMConnectionRecord:: kill for rsh process failed, Operation not permitted

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: Unable to contact one or more Section Leaders. (...)
Probable configuration problem; contact Orchestrate system administrator.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: Step execution finished with status = FAILED.

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: Startup time, 0:06; production run time, 0:00.

Could it be a configuration problem that can occur intermittently? Could it be contention? Where should I begin the investigation? What should I look at? Any help will be greatly appreciated.

Thanks
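One possible starting point is to pull only the Fatal entries out of an exported job log, so that failures from different nights can be compared side by side. The following is a minimal sketch, assuming the log has been saved as plain text in the same "Occurred: ... On date: ... Type: ..." layout shown above; the function names (`parse_events`, `fatal_events`) and the `SAMPLE` text are illustrative only and are not part of DataStage.

```python
import re

# Header line format as it appears in the log above, e.g.
# "Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal"
HEADER = re.compile(
    r"^Occurred: (?P<time>\d{1,2}:\d{2}:\d{2} [AP]M) "
    r"On date: (?P<date>\d{1,2}/\d{1,2}/\d{4}) "
    r"Type: (?P<type>\w+)$"
)


def parse_events(text):
    """Split exported log text into a list of event dicts."""
    events = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # A new header starts a new event.
            current = dict(match.groupdict(), lines=[])
            events.append(current)
        elif current is not None and line:
            # Any other non-blank line belongs to the current event.
            if line.startswith("Event: "):
                line = line[len("Event: "):]
            current["lines"].append(line)
    for event in events:
        event["message"] = " ".join(event.pop("lines"))
    return events


def fatal_events(events):
    """Keep only the events whose type is Fatal."""
    return [e for e in events if e["type"] == "Fatal"]


# Two entries copied from the log above, used as sample input.
SAMPLE = """Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: APT_PMConnectionRecord::start: cmsg read returned 28, expected 40, Invalid argument

Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: A startup script is not being used.
"""

if __name__ == "__main__":
    for event in fatal_events(parse_events(SAMPLE)):
        print(event["date"], event["time"], event["message"])
```

Running this over the logs of each failed night should show whether the first Fatal message is always the same `APT_PMConnectionRecord::start` error or whether it varies from failure to failure.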

ray.wurlod · Post by **ray.wurlod** » Mon Dec 07, 2009 4:07 pm

Is anyone monitoring system resources, particularly CPU, memory and disk space usage? If so, these logs may prove a fruitful source of diagnostic information. I would guess you have a contention for resources issue.
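If nothing is monitoring the server yet, even a small script that records free disk space at intervals during the nightly window would give something to compare against the failure times. The following is a rough sketch using only the Python standard library; it covers disk space only (CPU and memory would need the operating system's own performance tools or a third-party package). The function names, the sampling interval and the use of the resource disk and scratch disk paths from the configuration file in the log above are assumptions made for illustration.

```python
import shutil
import time
from datetime import datetime


def sample_disk(path):
    """Return one timestamped disk usage reading for the volume holding path."""
    usage = shutil.disk_usage(path)
    free_pct = round(100.0 * usage.free / usage.total, 1) if usage.total else 0.0
    return {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "path": path,
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        "free_bytes": usage.free,
        "free_pct": free_pct,
    }


def monitor(paths, interval_s=60, samples=3, out=print):
    """Take `samples` readings of every path, waiting interval_s between rounds."""
    for i in range(samples):
        for path in paths:
            out(sample_disk(path))
        if i < samples - 1:
            time.sleep(interval_s)


if __name__ == "__main__":
    # Paths taken from the configuration file in the log above; adjust as needed.
    monitor(
        [
            "C:/IBM/InformationServer/Server/Datasets",
            "C:/IBM/InformationServer/Server/Scratch",
        ],
        interval_s=60,
        samples=3,
    )
```

Redirecting the output to a file and running it across the scheduled window would show whether the scratch or dataset disk is running low around the time a job fails.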

gagan8877 · Post by **gagan8877** » Tue Dec 08, 2009 7:57 pm

ray.wurlod wrote:Is anyone monitoring system resources, particularly CPU, memory and disk space usage? If so, these logs may prove a fruitful source of diagnostic information. I would guess you have a contention for resources issue.

Thanks Ray. We haven't monitored the system resources yet, but we are planning to do that in a few days. I will post the results here once I have them. Thanks again.