Jobs fail intermittently
Posted: Mon Dec 07, 2009 2:03 pm
We have PROD, ACCP and DEV projects on the same server. A scheduler sequence, scheduled to run nightly, calls other sequences. Intermittently, the sequences/jobs being called fail, and each time it can be a different job. Some sequences/jobs are called in parallel, others one after the other. The latest failure was on Dec 2, 2009. The server was rebooted that day and there have been no failures since, but it could happen again. There is no apparent pattern to the failures: they don't fall on a particular day and seem to occur randomly. Here's the log:
Occurred: 1:10:01 AM On date: 12/2/2009 Type: Control
Event: Starting Job jb_arch_parse_process_msgs._SYSPROC_ADMIN_js_code_master_maintenance. (...)
Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: Environment variable settings: (...)
Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: Parallel job initiated
Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: OSH script (...)
Occurred: 1:10:04 AM On date: 12/2/2009 Type: Info
Event: main_program: IBM WebSphere DataStage Enterprise Edition 8.0.1.4843 (...)
Occurred: 1:10:04 AM On date: 12/2/2009 Type: Info
Event: main_program: orchgeneral: loaded (...)
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: APT configuration file: C:/IBM/InformationServer/Server/Configurations/default.apt (...)
{
node "node1"
{
fastname "QSIS"
pools ""
resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
}
}
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: APT_PMConnectionRecord::start: cmsg read returned 28, expected 40, Invalid argument
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: **** Parallel startup failed **** (...)
This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: Step started on node QSIS; it uses 1 nodes. (...)
The program running the step is /C=/IBM/InformationServer/Server/PXEngine/bin/osh.exe.
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: The ORCHESTRATE startup program in /C=/IBM/InformationServer/Server/PXEngine/etc/standalone.exe is being used.
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: A startup script is not being used.
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: The TCP port being used for startup is 10,000; the associated socket number is 4.
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: Node status:
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: QSIS -
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: startup script failed or hung
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: APT_PMConnectionRecord:: kill for rsh process failed, Operation not permitted
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: Unable to contact one or more Section Leaders. (...)
Probable configuration problem; contact Orchestrate system administrator.
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: Step execution finished with status = FAILED.
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Info
Event: main_program: Startup time, 0:06; production run time, 0:00.
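To compare failures across nights, it may help to pull only the Fatal events out of logs like the one above. Below is a minimal Python sketch, assuming the log has been exported as plain text in exactly this "Occurred: ... On date: ... Type: ..." / "Event: ..." layout. The function names (parse_events, fatal_events) and the sample text are illustrative only, not part of DataStage.

```python
import re

# Header line format as it appears in the log excerpt above.
HEADER = re.compile(
    r"^Occurred: (?P<time>\d{1,2}:\d{2}:\d{2} [AP]M) "
    r"On date: (?P<date>\d{1,2}/\d{1,2}/\d{4}) "
    r"Type: (?P<type>\w+)$"
)


def parse_events(text):
    """Split exported log text into events (header fields plus message lines)."""
    events = []
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # A new header starts a new event.
            events.append({**match.groupdict(), "message": []})
        elif events and line:
            # Any other non-empty line belongs to the current event.
            if line.startswith("Event: "):
                line = line[len("Event: "):]
            events[-1]["message"].append(line)
    return events


def fatal_events(events):
    """Keep only the events whose Type is Fatal."""
    return [e for e in events if e["type"] == "Fatal"]


# Hypothetical sample: three entries copied from the log above.
sample = """\
Occurred: 1:10:02 AM On date: 12/2/2009 Type: Info
Event: Parallel job initiated
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: APT_PMConnectionRecord::start: cmsg read returned 28, expected 40, Invalid argument
Occurred: 1:10:10 AM On date: 12/2/2009 Type: Fatal
Event: main_program: Step execution finished with status = FAILED.
"""

for event in fatal_events(parse_events(sample)):
    print(event["date"], event["time"], event["message"][0])
```

Running this over the logs from each failed night would show whether the first Fatal message is always the same "cmsg read" error or varies from job to job.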
Could it be a configuration problem that can occur intermittently? Could it be contention? Where should I begin the investigation? What should I look at? Any help will be greatly appreciated.
Thanks