received signal SIGSEGV

ThilSe · Post by **ThilSe** » Fri Dec 16, 2005 12:08 am

Hi,

I had to execute a job 11 times. It ran successfully 7 times and failed 4 times, with no particular pattern to the failures.
Each run handles approximately 40 million records.

I got the following error:

APT_CombinedOperatorController,0: signalHandler__Fi() at 0xd272ac3c
APT_CombinedOperatorController,0: processInputRecord__29APT_ParallelSortMergeOperatorFi() at 0xd26a0570
APT_CombinedOperatorController,0: runLocally__30APT_CombinedOperatorControllerFv() at 0xd2675afc
APT_CombinedOperatorController,0: run__15APT_OperatorRepFv() at 0xd259edd4
APT_CombinedOperatorController,0: runLocally__14APT_OperatorSCFv() at 0xd258b954
APT_CombinedOperatorController,0: runLocally__Q2_6APT_SC8OperatorFUi() at 0xd26162c4
APT_CombinedOperatorController,0: runLocally__Q2_6APT_IR7ProcessFv() at 0xd26928a0
APT_CombinedOperatorController,0: executePlayer__18APT_ProcessManagerFP16APT_ScoreProcess() at 0xd265ab2c
APT_CombinedOperatorController,0: executeStep__FPC19APT_PMMessageHeader() at 0xd2727ad8
APT_CombinedOperatorController,0: dispatch__30APT_PMcontrolServiceTableClassCFPC19APT_PMMessageHeader() at 0xd2390e2c
APT_CombinedOperatorController,0: Operator terminated abnormally: received signal SIGSEGV
main_program: Unexpected termination by Unix signal 9(SIGKILL)
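A trace like this can be easier to read once it is split into fields. The following is only a sketch: it assumes each frame line has the form `operator,node: symbol() at address`, and that the symbols follow the older cfront-style mangling `method__<length><ClassName>F<args>`, which is what the names above appear to use. The helper names `parse_frame` and `faulting_frame` are made up for illustration. It also assumes the frame directly below `signalHandler` is where the fault was raised.

```python
import re

# One stack frame line: "<operator>,<node>: <symbol>() at <address>"
FRAME = re.compile(
    r"^(?P<operator>[^,]+),(?P<node>\d+): (?P<symbol>\S+)\(\) at (?P<addr>0x[0-9a-fA-F]+)$"
)
# Simple member function: "<method>__<length><ClassName>F<args>" (assumed mangling)
MEMBER = re.compile(r"^(?P<method>\w+?)__(?P<length>\d+)(?P<rest>.+)$")


def parse_frame(line):
    """Return a dict describing one frame line, or None if the line is not a frame."""
    m = FRAME.match(line.strip())
    if m is None:
        return None
    symbol = m.group("symbol")
    method, sep, _ = symbol.partition("__")
    frame = {
        "operator": m.group("operator"),
        "node": int(m.group("node")),
        "symbol": symbol,
        "addr": m.group("addr"),
        "method": method if sep else symbol,
        "cls": None,  # stays None for free functions and qualified (Q2_...) names
    }
    member = MEMBER.match(symbol)
    if member is not None:
        length = int(member.group("length"))
        frame["cls"] = member.group("rest")[:length]
    return frame


def faulting_frame(lines):
    """Return the first parsed frame that is not the signal handler itself."""
    for line in lines:
        frame = parse_frame(line)
        if frame is not None and frame["method"] != "signalHandler":
            return frame
    return None


log = [
    "APT_CombinedOperatorController,0: signalHandler__Fi() at 0xd272ac3c",
    "APT_CombinedOperatorController,0: processInputRecord__29APT_ParallelSortMergeOperatorFi() at 0xd26a0570",
    "APT_CombinedOperatorController,0: runLocally__30APT_CombinedOperatorControllerFv() at 0xd2675afc",
]
top = faulting_frame(log)
print(top["cls"], top["method"], "on node", top["node"])
```

Read this way, the frame under the signal handler is `processInputRecord` in `APT_ParallelSortMergeOperator`, which may help narrow down which part of the combined operator was running when the signal arrived.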

When I ran the job with a smaller configuration, it worked fine for the remaining 4 runs.

I want to know whether this error is caused by bad job design or a system error, so that I can prevent it in the future.

Thanks
Senthil

salil · Post by **salil** » Fri Dec 16, 2005 3:11 am

ThilSe wrote:
APT_CombinedOperatorController,0: Operator terminated abnormally: received signal SIGSEGV
main_program: Unexpected termination by Unix signal 9(SIGKILL)

When I ran the job with a smaller configuration, it worked fine for the remaining 4 runs.

I want to know whether this error is caused by bad job design or a system error, so that I can prevent it in the future.

SIGSEGV is the Unix signal for a segmentation violation. This generally happens because of a stack overflow. The cause can be anything that does not fit within the defined stack limit, such as incompatible datatypes for a particular data item.
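To see what the stack limit and related limits actually are, they can be read from a short script. This is a minimal sketch using Python's standard `resource` module on a Unix system. The names `LIMITS`, `fmt` and `current_limits` are made up for illustration, and the choice of which limits to show is an assumption. It reports the limits of the process that runs it, so it would need to be run as the same user and in the same environment as the job for the numbers to be relevant.

```python
import resource

# Limits that seem most relevant here; this selection is an assumption.
LIMITS = {
    "stack": resource.RLIMIT_STACK,
    "data": resource.RLIMIT_DATA,
    "open files": resource.RLIMIT_NOFILE,
    "core": resource.RLIMIT_CORE,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits listed above."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


for name, (soft, hard) in current_limits().items():
    print(f"{name:12s} soft={fmt(soft)} hard={fmt(hard)}")
```

The soft value is the one currently enforced; the hard value is the ceiling to which an unprivileged process may raise it.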

ArndW · Post by **ArndW** » Fri Dec 16, 2005 3:29 am

A segmentation violation is most often caused by bad pointer addresses in a program: either trying to write to a null pointer address or trying to read or write a protected address. You are seeing this happen sporadically, so it is due either to your data contents or to a system resource restriction. If the error remains sporadic with the exact same data, then it is most likely due to a system resource issue. That conclusion is reinforced by the fact that the error goes away with a configuration using fewer nodes.
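For anyone who wants to see the mechanism in isolation, here is a small sketch that has nothing to do with DataStage itself. It starts a child Python process that reads through a null pointer via `ctypes`, then decodes the child's exit status into a signal name. The helper names `crash_child` and `describe` are made up for illustration, and the sketch assumes a Unix-like system where reading address 0 is not permitted.

```python
import signal
import subprocess
import sys


def crash_child():
    """Run a child Python that reads a C string at address 0; return its exit status."""
    code = "import ctypes; ctypes.string_at(0)"  # deliberate null pointer read
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def describe(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On Unix, a negative return code means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return f"exited with status {returncode}"


print(describe(crash_child()))
```

The same `describe` helper also decodes the number in the last line of the log: signal 9 is SIGKILL, which suggests the main program was killed from outside rather than faulting itself.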

How many designer canvas stages does this job have and how many nodes does your normal and reduced APT.CONFIG file(s) have? Is it always the controller for node 0 that fails? Does it fail at job start or after processing for a while?

ThilSe · Post by **ThilSe** » Fri Dec 16, 2005 4:17 am

ArndW,

ArndW wrote: how many nodes does your normal and reduced APT.CONFIG file(s) have?

The original configuration file was a 4-node config file and the reduced one is a 2-node config file.

ArndW wrote: Is it always the controller for node 0 that fails?

It is always the controller for node 0 that fails.

ArndW wrote: Does it fail at job start or after processing for a while?

The job normally takes around 5 minutes.
It fails after running for some time, at about the 3-minute mark.

ArndW wrote: How many designer canvas stages does this job have

The job has 2 Lookup stages, 1 Transformer stage, 1 Modify stage, 1 Join stage, and 4 DB2 source/target stages.

Thanks
Senthil

ArndW · Post by **ArndW** » Fri Dec 16, 2005 4:30 am

With that small number of stages and a 4-node configuration you won't have too many jobs flooding the system, even if you have partitioned DB/2 tables and are using the enterprise partitioning.

A SIGSEGV after an appreciable runtime is either being triggered by a combination of data values or by something overflowing. The most common place for either of these to happen is in the transform stage. Can you remove this stage for test purposes and see if your job runs without error?

If you change to a 6- or 8-node configuration, does the problem still happen? Does it happen more often (i.e. is it reproducible every time)? Does the error always occur at the same row? Is it the same data each time, including the lookups?

ThilSe · Post by **ThilSe** » Fri Dec 16, 2005 7:03 am

ArndW wrote:
If you change to a 6- or 8-node configuration, does the problem still happen? Does it happen more often (i.e. is it reproducible every time)? Does the error always occur at the same row? Is it the same data each time, including the lookups?

The job also fails with the 6-node configuration.

With the same data it fails with 4 or 6 nodes but succeeds with 2 nodes.
I don't know whether it occurs at the same row.

I will try with a different set of data and post an update.

Thanks
Senthil
