QualityStage Job aborts intermittently

Infosphere's Quality Product

Moderators: chulett, rschirm

DataQuality_IS8.1
Premium Member
Posts: 17
Joined: Wed Jun 24, 2009 11:18 am

QualityStage Job aborts intermittently

Post by DataQuality_IS8.1 »

Hi,

We are seeing a particular job abort intermittently, both when it is run on its own and when it is run from a Sequencer.

This is a parallel job running on 12 nodes. When the job is run repeatedly without changing anything, it sometimes runs fine and sometimes fails; the behaviour is very inconsistent. However, I have not been able to run it successfully more than 3 times in a row.

Every time the job aborts, a core dump is generated. The error is shown below, and the core dump is attached for your reference.

Core was generated by `/d01/is/IBM/InformationServer/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag'.
Program terminated with signal 11, Segmentation fault.
#0 0x00f6a630 in ?? ()

While researching the error I found that setting APT_DISABLE_COMBINATION = True might help, but the job still aborts even with the variable set to True.

Thanks,
Karuna
ragasambath
Participant
Posts: 12
Joined: Wed Oct 03, 2007 9:11 am
Location: London

Re: QualityStage Job aborts intermittently

Post by ragasambath »

The problem is due to the memory allocated to your user ID/group.

Discuss it with your UNIX admin.

Thanks
Regards

Raga
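
For anyone who wants to look at the per-user limits before talking to the UNIX admin, here is a minimal Python sketch. It assumes it is run as the same OS user that runs the job, on a UNIX-like system with Python 3 available. The choice of limits to print is an assumption, not something taken from the core dump, and the names report_limits and fmt are purely illustrative.

```python
import resource

# Limits that are commonly checked first; which ones actually matter
# for this job is an assumption, not a finding from the core dump.
LIMIT_NAMES = {
    "core file size": "RLIMIT_CORE",
    "data segment": "RLIMIT_DATA",
    "stack size": "RLIMIT_STACK",
    "open files": "RLIMIT_NOFILE",
    "address space": "RLIMIT_AS",
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits():
    """Return (name, soft, hard) rows for each limit this platform defines."""
    rows = []
    for name, attr in LIMIT_NAMES.items():
        res = getattr(resource, attr, None)  # skip limits missing on this OS
        if res is None:
            continue
        soft, hard = resource.getrlimit(res)
        rows.append((name, fmt(soft), fmt(hard)))
    return rows


if __name__ == "__main__":
    for name, soft, hard in report_limits():
        print(f"{name:16} soft={soft:>12} hard={hard:>12}")
```

Comparing this output between the failing environment and one where the job runs cleanly would show whether the limits differ at all.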
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The fact that it's -APT_PMsectionLeaderFlag suggests there's a problem starting or communicating with processes on other nodes, possibly during startup. To investigate further, enable the reporting environment variable that shows startup. I can't see any evidence (yet) that it's memory related. APT_DISABLE_COMBINATION is only of use when errors are thrown by APT_CombinedOperatorController, which is not the case here.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DataQuality_IS8.1
Premium Member
Posts: 17
Joined: Wed Jun 24, 2009 11:18 am

Post by DataQuality_IS8.1 »

Hi Ray,

Thank you. Yes, it seems to be related to communication with the nodes. I confirmed this by taking the failing job, which was originally designed to run on 12 nodes, reducing the number of nodes first to 1 and then to 3, and running it again. The job finished successfully on both 1 and 3 nodes.


Can you please let me know the exact name of the reporting variable that shows startup?

Thanks a lot for your help

regards,
Karuna
DataQuality_IS8.1
Premium Member
Posts: 17
Joined: Wed Jun 24, 2009 11:18 am

Post by DataQuality_IS8.1 »

Hi,

Does the number of parallel nodes have any relation to the number of physical CPUs/cores on the application server? We are able to run the same job on 12 nodes in 2 of our other environments, which are dual-processor (2 CPU) machines, whereas the environment we are having issues with has a single dual-core CPU.

Thanks,
Karuna
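
To put numbers on the comparison between environments, a rough Python sketch like the one below counts the node entries in a parallel configuration file and compares that with the CPU count the OS reports. It assumes the usual configuration file layout of node "name" { ... } blocks and that APT_CONFIG_FILE points at the file in use. SAMPLE and the names in it (etlhost, the /data and /scratch paths) are made up for illustration. It only reports the ratio; it does not establish that the node count is what causes the abort.

```python
import os
import re

# Matches lines such as:  node "node1"
# Anchored at line start so 'fastname "..."' lines are not counted.
NODE_RE = re.compile(r'^\s*node\s+"([^"]+)"', re.MULTILINE)

# Hypothetical two-node configuration, for illustration only.
SAMPLE = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds" {pools ""}
    resource scratchdisk "/scratch/ds" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds" {pools ""}
    resource scratchdisk "/scratch/ds" {pools ""}
  }
}
'''


def node_names(config_text):
    """Return the logical node names declared in the configuration text."""
    return NODE_RE.findall(config_text)


def compare(config_text, cpu_count=None):
    """Return (node count, CPU count, nodes per CPU)."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    nodes = len(node_names(config_text))
    return nodes, cpus, nodes / cpus


if __name__ == "__main__":
    path = os.environ.get("APT_CONFIG_FILE")
    if path and os.path.isfile(path):
        with open(path) as fh:
            text = fh.read()
    else:
        text = SAMPLE  # fall back to the illustrative sample
    nodes, cpus, ratio = compare(text)
    print(f"logical nodes={nodes} cpus={cpus} nodes per cpu={ratio:.1f}")
```

Running it on each server gives a like-for-like figure (12 nodes on 2 cores versus 12 nodes on the other machines) to include when raising the question with support.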