Hi,
We are seeing a particular job abort intermittently, both when it is run individually and when it is run from a Sequencer.
This job is a parallel job running on 12 nodes. When the job is run repeatedly without changing anything, it sometimes runs fine and sometimes fails; the behaviour is very inconsistent. I have not been able to run it successfully more than 3 times in a row.
Every time the job aborts, a core dump is generated. The error is shown below, and the core dump is attached for your reference.
Core was generated by `/d01/is/IBM/InformationServer/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag'.
Program terminated with signal 11, Segmentation fault.
#0 0x00f6a630 in ?? ()
While researching the error, I found that setting APT_DISABLE_COMBINATION=True might help, but the job still aborts with the variable set to True.
Thanks,
Karuna
QualityStage Job aborts intermittently
-
- Premium Member
- Posts: 17
- Joined: Wed Jun 24, 2009 11:18 am
-
- Participant
- Posts: 12
- Joined: Wed Oct 03, 2007 9:11 am
- Location: London
Re: QualityStage Job aborts intermittently
The problem is due to memory allocation for your userid/group.
Discuss it with your UNIX admin.
Thanks
Regards
Raga
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
The fact that it's -APT_PMsectionLeaderFlag suggests there's a problem starting or communicating with processes on other nodes, possibly during startup. Enable the reporting environment variable that shows startup to investigate further. I can't see any evidence (yet) that it's memory related. APT_DISABLE_COMBINATION is only of use when errors are thrown by APT_CombinedOperatorController, which is not the case here.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 17
- Joined: Wed Jun 24, 2009 11:18 am
Hi Ray,
Thank you. Yes, it seems to be related to communication with the nodes. I confirmed this by taking the failing job, which was originally designed to run on 12 nodes, reducing the number of nodes to 1 and then to 3, and running it again. The job finished successfully on both 1 and 3 nodes.
Can you please let me know the exact name of the reporting variable that shows startup?
Thanks a lot for your help
regards,
Karuna
-
- Premium Member
- Posts: 17
- Joined: Wed Jun 24, 2009 11:18 am
Hi,
Does the number of parallel nodes have any relation to the number of physical CPUs/cores on the application server? We are able to run the same job on 12 nodes in two of our other environments, which are dual-processor (2 CPU) machines, whereas the environment we are having issues with has a single dual-core CPU.
Thanks,
Karuna
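To put numbers on the nodes-versus-CPUs question in each environment, one option is to count the logical nodes declared in the parallel configuration file and set that against the CPU count the operating system reports. The following is only a minimal sketch: it assumes the file named by APT_CONFIG_FILE uses the usual `node "name" { ... }` layout, the helper names `node_names` and `nodes_per_cpu` are made up for this example, and the sample configuration (host name and paths included) is hypothetical and used only when no real file is found.

```python
import os
import re

# Matches the opening of each logical node definition, e.g.  node "node1" {
NODE_PATTERN = re.compile(r'\bnode\s+"([^"]+)"\s*\{')


def node_names(config_text):
    """Return the logical node names declared in a configuration file's text."""
    return NODE_PATTERN.findall(config_text)


def nodes_per_cpu(config_text, cpu_count=None):
    """Ratio of logical nodes to CPUs (uses os.cpu_count() if none is given)."""
    cpus = cpu_count or os.cpu_count() or 1
    return len(node_names(config_text)) / cpus


# Hypothetical two-node configuration, for illustration only.
SAMPLE_CONFIG = '''
{
  node "node1" {
    fastname "etlhost"
    pools ""
    resource disk "/tmp/ds/data" {pools ""}
    resource scratchdisk "/tmp/ds/scratch" {pools ""}
  }
  node "node2" {
    fastname "etlhost"
    pools ""
    resource disk "/tmp/ds/data" {pools ""}
    resource scratchdisk "/tmp/ds/scratch" {pools ""}
  }
}
'''

if __name__ == "__main__":
    # Prefer the real configuration file if the variable is set and readable.
    path = os.environ.get("APT_CONFIG_FILE")
    if path and os.path.exists(path):
        with open(path) as fh:
            text = fh.read()
    else:
        text = SAMPLE_CONFIG
    print("logical nodes:", len(node_names(text)))
    print("CPUs reported by OS:", os.cpu_count())
    print("nodes per CPU:", nodes_per_cpu(text))
```

Running this on the failing server and on one of the working servers would show whether the 12-node configuration is much more heavily oversubscribed on the single dual-core machine. The sketch does not strip comments from the configuration file, so commented-out node entries would be counted too.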