Hi,
I had to execute a job 11 times. It executed successfully 7 times but failed 4 times. There is no particular order of failure.
Each time the job handles approximately 40 million records.
I got the following error:
APT_CombinedOperatorController,0: signalHandler__Fi() at 0xd272ac3c
APT_CombinedOperatorController,0: processInputRecord__29APT_ParallelSortMergeOperatorFi() at 0xd26a0570
APT_CombinedOperatorController,0: runLocally__30APT_CombinedOperatorControllerFv() at 0xd2675afc
APT_CombinedOperatorController,0: run__15APT_OperatorRepFv() at 0xd259edd4
APT_CombinedOperatorController,0: runLocally__14APT_OperatorSCFv() at 0xd258b954
APT_CombinedOperatorController,0: runLocally__Q2_6APT_SC8OperatorFUi() at 0xd26162c4
APT_CombinedOperatorController,0: runLocally__Q2_6APT_IR7ProcessFv() at 0xd26928a0
APT_CombinedOperatorController,0: executePlayer__18APT_ProcessManagerFP16APT_ScoreProcess() at 0xd265ab2c
APT_CombinedOperatorController,0: executeStep__FPC19APT_PMMessageHeader() at 0xd2727ad8
APT_CombinedOperatorController,0: dispatch__30APT_PMcontrolServiceTableClassCFPC19APT_PMMessageHeader() at 0xd2390e2c
APT_CombinedOperatorController,0: Operator terminated abnormally: received signal SIGSEGV
main_program: Unexpected termination by Unix signal 9(SIGKILL)
When I executed the job with a smaller configuration, it worked fine for the remaining 4 times.
I want to know whether this error is caused by bad job design or a system error, so that I can prevent it in the future.
Thanks
Senthil
received signal SIGSEGV
Re: received signal SIGSEGV
SIGSEGV denotes a Unix segmentation violation. This generally happens because of a stack overflow. The cause can be anything that does not fit within the defined stack limit, such as incompatible data types for a particular data item.
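Since the stack limit comes up here, a quick way to see which limits a process actually inherits is to print them. This is only a minimal sketch using Python's standard `resource` module on a Unix system; the choice of limits and the names `LIMITS`, `fmt` and `current_limits` are my own for illustration, and the values only mean something if the script runs as the same user, and from the same environment, that starts the job.

```python
import resource

# Limits chosen for illustration only; which ones matter for a given job is an assumption.
LIMITS = {
    "stack": resource.RLIMIT_STACK,
    "data": resource.RLIMIT_DATA,
    "open files": resource.RLIMIT_NOFILE,
    "core dump": resource.RLIMIT_CORE,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits listed above."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name:12s} soft={fmt(soft)} hard={fmt(hard)}")
```

Comparing the output between the account that runs the job and an interactive login can show whether the job is started with tighter limits than expected.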
A segmentation violation is most often caused by bad pointer addresses in a program: either trying to write to a null pointer address or trying to read/write a protected address. You are seeing this happen sporadically, so it is due either to your data contents or to a system resource restriction. If the error remains sporadic with exactly the same data, then it is most likely a system resource issue; that conclusion is reinforced by the fact that the error goes away with a configuration using fewer nodes.
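To make the bad-pointer case concrete, here is a small sketch that is unrelated to DataStage itself. It starts a child Python process that reads from address 0 through `ctypes` and then reports which signal ended the child. It assumes a Linux system, where that read is expected to be killed with SIGSEGV; other Unix platforms may behave differently. The names `CHILD` and `signal_that_killed` are hypothetical.

```python
import signal
import subprocess
import sys

# Child program: reads a C string from address 0 (a null pointer) via ctypes.
CHILD = "import ctypes; ctypes.string_at(0)"


def signal_that_killed(code):
    """Run `code` in a child interpreter; return the name of the signal that killed it, or None."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # On POSIX, subprocess reports death by signal N as returncode == -N.
    if proc.returncode < 0:
        return signal.Signals(-proc.returncode).name
    return None


if __name__ == "__main__":
    print(signal_that_killed(CHILD))
```

The parent survives because the fault happens in the child, which is the same reason the job log can still report the operator's signal after the operator process has died.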
How many designer canvas stages does this job have and how many nodes does your normal and reduced APT.CONFIG file(s) have? Is it always the controller for node 0 that fails? Does it fail at job start or after processing for a while?
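One way to answer the "is it always node 0" question across several failed runs is to tally the signal lines from the saved job logs. The sketch below assumes log lines shaped like the ones posted in this thread, and it treats the number after the comma as the node, which is how it is read here. The regular expression and the names `SIGNAL_LINE` and `tally_signals` are my own and not part of any DataStage tooling.

```python
import re
from collections import Counter

# Matches lines such as:
# "APT_CombinedOperatorController,0: Operator terminated abnormally: received signal SIGSEGV"
SIGNAL_LINE = re.compile(
    r"^(?P<operator>[\w.]+),(?P<node>\d+): .*received signal (?P<signal>SIG[A-Z0-9]+)"
)


def tally_signals(log_lines):
    """Count (operator, node, signal) triples found in the given log lines."""
    counts = Counter()
    for line in log_lines:
        match = SIGNAL_LINE.match(line.strip())
        if match:
            key = (match["operator"], int(match["node"]), match["signal"])
            counts[key] += 1
    return counts


# Sample lines copied from the error log posted at the top of the thread.
sample = [
    "APT_CombinedOperatorController,0: run__15APT_OperatorRepFv() at 0xd259edd4",
    "APT_CombinedOperatorController,0: Operator terminated abnormally: received signal SIGSEGV",
    "main_program: Unexpected termination by Unix signal 9(SIGKILL)",
]

if __name__ == "__main__":
    for key, count in tally_signals(sample).items():
        print(key, count)
```

Feeding it the concatenated logs of all failed runs would show at a glance whether any node other than 0 ever reports the signal.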
ArndW,
ArndW wrote: how many nodes does your normal and reduced APT.CONFIG file(s) have?
The original configuration file was a 4-node config file and the reduced one is a 2-node config file.
ArndW wrote: Is it always the controller for node 0 that fails?
It is always the controller for node 0 that fails.
ArndW wrote: Does it fail at job start or after processing for a while?
The job takes around 5 minutes in total. It fails after running for some time, at about 3 minutes.
ArndW wrote: How many designer canvas stages does this job have
The job has 2 Lookup stages, 1 Transformer stage, 1 Modify stage, 1 Join stage and 4 DB2 source/target stages.
Thanks
Senthil
With that small number of stages and a 4-node configuration you won't have too many jobs flooding the system, even if you have partitioned DB/2 tables and are using the enterprise partitioning.
A SIGSEGV after an appreciable runtime is either being triggered by a combination of data values or by something overflowing. The most common place for either of these to happen is in the transform stage. Can you remove this stage for test purposes and see if your job runs without error?
If you change to a 6- or 8-node configuration does the problem still happen? Does it happen more often (i.e. is it reproducible every time)? Does the error always occur at the same row? Is it the same data each time, including the lookups?
ArndW wrote: If you change to a 6- or 8-node configuration does the problem still happen? Does it happen more often (i.e. is it reproducible every time)? Does the error always occur at the same row? Is it the same data each time, including the lookups?
The job also fails with the 6-node configuration.
With the same data it fails with the 4- or 6-node configuration but succeeds with the 2-node one.
I don't know whether it occurs at the same row.
I will try with a different set of data and post an update.
Thanks
Senthil