QualityStage Job aborts intermittently

Posted: Thu Apr 08, 2010 8:29 am
by DataQuality_IS8.1
Hi,

We are seeing a particular job abort intermittently, both when it is run individually and when it is run from a Sequencer.

This is a parallel job running on 12 nodes. When the job is run repeatedly without any changes, it sometimes runs fine and sometimes fails; the behavior is very inconsistent. However, I have not been able to run it successfully more than 3 times in a row.

Every time the job aborts, a core dump is generated. The error is shown below, and the core dump is attached for your reference.

Core was generated by `/d01/is/IBM/InformationServer/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag'.
Program terminated with signal 11, Segmentation fault.
#0 0x00f6a630 in ?? ()

While researching the error, I found that setting APT_DISABLE_COMBINATION = True might help, but the job still aborted after I set the variable to True.

Thanks,
Karuna

Re: QualityStage Job aborts intermittently

Posted: Thu Apr 08, 2010 9:10 am
by ragasambath
The problem is due to the memory allocation for your user ID/group.

Discuss this with your UNIX admin.
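
Before that conversation it helps to know which limits the job's user actually runs under. A minimal sketch, assuming Python is available on the engine host and the script is run as the same user that runs the job. The list of limits is just a reasonable starting set, not something taken from the error message:

```python
import resource

# Limits that commonly matter for a multi-process parallel job.
# Not every platform defines every constant, so missing ones are skipped.
LIMIT_NAMES = [
    "RLIMIT_DATA",    # data segment size
    "RLIMIT_STACK",   # stack size
    "RLIMIT_AS",      # total address space
    "RLIMIT_CORE",    # core file size
    "RLIMIT_NOFILE",  # open file descriptors
    "RLIMIT_NPROC",   # processes per user
]


def fmt(value):
    """Render a raw limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {limit name: (soft, hard)} for the calling user's process."""
    limits = {}
    for name in LIMIT_NAMES:
        const = getattr(resource, name, None)
        if const is None:
            continue  # this platform does not expose the limit
        soft, hard = resource.getrlimit(const)
        limits[name] = (fmt(soft), fmt(hard))
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name:14} soft={soft:>12} hard={hard:>12}")
```

Running it in the environment that fails and in one that works, then comparing the two outputs, gives the admin something concrete to look at.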

Thanks

Posted: Thu Apr 08, 2010 4:15 pm
by ray.wurlod
The fact that it's -APT_PMsectionLeaderFlag suggests there's a problem starting or communicating with processes on other nodes, possibly during startup. Enable the reporting environment variable that shows startup to investigate further. I can't see any evidence (yet) that it's memory related. APT_DISABLE_COMBINATION is only of use when errors are thrown by APT_CombinedOperatorController, which is not the case here.
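
A rough sketch of how such diagnostics could be switched on, with assumptions stated up front: the startup-reporting variable is most likely APT_STARTUP_STATUS, and APT_PM_SHOW_PIDS and APT_DUMP_SCORE are two further diagnostic variables added here for illustration. The value "1" is also an assumption; check the parallel engine documentation for your release before relying on any of this.

```python
# Assumption: APT_STARTUP_STATUS is the startup-reporting variable meant above.
# APT_PM_SHOW_PIDS and APT_DUMP_SCORE are extra diagnostics added for illustration.
DIAGNOSTIC_VARS = {
    "APT_STARTUP_STATUS": "1",  # report progress of job startup
    "APT_PM_SHOW_PIDS": "1",    # have player processes report their PIDs
    "APT_DUMP_SCORE": "1",      # write the job score to the log
}


def export_lines(variables):
    """Render the variables as shell 'export' lines, sorted by name."""
    return [f"export {name}={value}" for name, value in sorted(variables.items())]


if __name__ == "__main__":
    print("\n".join(export_lines(DIAGNOSTIC_VARS)))
```

The same names can instead be added as job-level or project-level environment variables; the export form is only one convenient way to write them down.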

Posted: Mon Apr 12, 2010 8:14 am
by DataQuality_IS8.1
Hi Ray,

Thank you. Yes, it seems to be related to communication with the nodes. I confirmed this by taking the failing job, which was originally designed to run on 12 nodes, reducing the number of nodes to 1 and then to 3, and running it again. The job finished successfully on both 1 and 3 nodes.
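
For anyone repeating this test, a minimal sketch of generating configuration files with different node counts, so the job can be pointed at each one through APT_CONFIG_FILE. The host name and the disk and scratch paths below are hypothetical placeholders, and all nodes are assumed to sit on the same server:

```python
def make_config(node_count, fastname, disk_dir, scratch_dir):
    """Build the text of a parallel configuration file with node_count
    logical nodes, all on the same host and sharing the same directories."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    lines = ["{"]
    for i in range(1, node_count + 1):
        lines += [
            f'    node "node{i}"',
            "    {",
            f'        fastname "{fastname}"',
            '        pools ""',
            f'        resource disk "{disk_dir}" {{pools ""}}',
            f'        resource scratchdisk "{scratch_dir}" {{pools ""}}',
            "    }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host and paths; replace them with the values from a
    # configuration file that already works in your environment.
    for count in (1, 3, 6, 12):
        text = make_config(count, "etlhost", "/data/ds/disk", "/data/ds/scratch")
        print(f"--- {count} node(s), {len(text.splitlines())} lines ---")
```

Stepping through intermediate counts such as 6 would show whether the abort starts at a particular number of nodes.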


Can you please let me know the exact name of the reporting variable that shows startup?

Thanks a lot for your help

regards,
Karuna

Posted: Mon Apr 12, 2010 11:47 am
by DataQuality_IS8.1
Hi,

Is the number of parallel nodes related to the number of physical CPUs/cores on the application server? We are able to run the same job on 12 nodes in two of our other environments, which are dual-processor (2 CPU) machines, whereas the environment with the issue has a single dual-core CPU.
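
To make the comparison between environments concrete, here is a small sketch that counts the logical nodes in a configuration file and relates that to the number of cores the operating system reports. It assumes the usual `node "name"` layout of the configuration file and that Python runs on the engine host; the ratio is only a number to compare across servers, not a rule.

```python
import os
import re

# Matches the start of each node definition, e.g.:  node "node1"
NODE_RE = re.compile(r'^\s*node\s+"([^"]+)"', re.MULTILINE)


def node_names(config_text):
    """Return the names of the logical nodes defined in the config text."""
    return NODE_RE.findall(config_text)


def nodes_per_core(config_text, cores=None):
    """Logical nodes divided by CPU cores (cores defaults to os.cpu_count())."""
    cores = cores or os.cpu_count() or 1
    return len(node_names(config_text)) / cores


if __name__ == "__main__":
    path = os.environ.get("APT_CONFIG_FILE")
    if path:
        with open(path) as handle:
            text = handle.read()
        print(f"nodes: {len(node_names(text))}, cores: {os.cpu_count()}, "
              f"nodes per core: {nodes_per_core(text):.1f}")
    else:
        print("APT_CONFIG_FILE is not set in this shell")
```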

Thanks,
Karuna