
Dataset corruption, SIGSEGV while reading

Posted: Wed Feb 16, 2011 11:39 am
by niremy
Hello,

I'm facing an odd problem and need your enlightenment:
I have a job that fails to read a dataset with the following error:

Code:

Event Id: 5834
Time    : Wed Feb 16 18:03:24 2011
Type    : FATAL
User    : ...
Message :
        DS_001,0: Unable to map file /.../dataset/node1/DS_001.ds...0000.0000.0000.7080.cf254e0d.0005.ce3515ca: Invalid argument
        The error occurred on Orchestrate node node1 (hostname ...)
Event Id: 5835
Time    : Wed Feb 16 18:03:24 2011
Type    : FATAL
User    : ...
Message :
        DS_001,1: Unable to map file /.../dataset/node2/DS_001.ds...0000.0001.0000.7080.cf254e0d.0006.9934fd0a: Invalid argument
        The error occurred on Orchestrate node node2 (hostname ...)
Event Id: 5836
Time    : Wed Feb 16 18:03:25 2011
Type    : WARNING
User    : ...
Message :
        DS_001,0: /bin/echo: write error: Broken pipe
Event Id: 5837
Time    : Wed Feb 16 18:03:25 2011
Type    : FATAL
User    : ...
Message :
        DS_001,1: Operator terminated abnormally: received signal SIGSEGV
Event Id: 5838
Time    : Wed Feb 16 18:03:30 2011
Type    : FATAL
User    : ...
Message :
        DS_001,0: Operator terminated abnormally: received signal SIGSEGV
I checked disk space during execution and nothing seems to be consuming much space on the disks.

The source file is 84 lines long and weight 20K.

I tried to run the same job with the same file on my test server and everything ran smoothly.

I tried rerunning the job several times with the same file, but each time it fails with the very same error.

I also searched this forum and couldn't find any clue to the source of my problem.

Thanks in advance for any remarks that could lead me to a resolution of this issue :wink:

Posted: Thu Feb 17, 2011 1:42 am
by Sreenivasulu
What is the meaning of 'weight 20K' :)

Posted: Thu Feb 17, 2011 1:50 am
by gssr
The dataset was not properly loaded. Check the job that creates the dataset.

Posted: Thu Feb 17, 2011 2:52 am
by niremy
Sreenivasulu wrote:What is the meaning of 'weight 20K' :)
20 KBytes
It was to prevent the response "The file is too big" :wink:

Posted: Thu Feb 17, 2011 2:54 am
by niremy
gssr wrote:The dataset was not properly loaded. Check the job that creates the dataset.
How come the job works perfectly on another server?

I have already checked it multiple times and it doesn't differ from my other dataset creation jobs :(

Posted: Thu Feb 17, 2011 2:57 am
by Vidyut
Are you using the same dataset created in your test environment?

Posted: Thu Feb 17, 2011 3:18 am
by niremy
Vidyut wrote:Are you using the same dataset created in your test environment?
In fact I have a job sequence that runs a first job, which creates the dataset from the flat file, and then a second job, which reads the dataset.

The dataset is clearly corrupted on one of the servers, as even the orchadmin dump command fails to read it properly.

I'm puzzled because I don't have any warnings with the creation of the dataset :(
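For what it's worth, the "Unable to map file ... Invalid argument" message is an mmap() failure with EINVAL, and one classic cause is a zero-length data segment file (POSIX mmap rejects a mapping of length 0). A small illustration of that failure mode, not a claim about what actually happened on this server:

```python
import mmap
import os
import tempfile

# An empty file stands in for a hypothetical corrupted zero-byte data segment.
fd, path = tempfile.mkstemp()
error = None
with os.fdopen(fd, "rb") as f:
    try:
        mmap.mmap(f.fileno(), 0)  # mapping an empty file is invalid
    except ValueError as exc:
        error = exc
os.remove(path)
print("mmap failed:", error)
```

If any of the segment files listed in the error are 0 bytes on disk, that would be consistent with the messages above.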

Posted: Thu Feb 17, 2011 3:19 am
by devesh_ssingh
Check the environment in which you are reading it.
Since a dataset is partitioned, it won't work across two different environments unless the configuration file is the same for both.

I mean, a dataset created on an 8-node configuration can't be read on a 4-node server.

For that, you should create a new dataset on the 4-node server.
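That point can be checked mechanically: the degree of parallelism is the number of node entries in the APT configuration file, and it must line up between the job that wrote the dataset and the job that reads it. A minimal sketch, with the parsing based on the standard APT config syntax (the sample configs inlined here are made up for illustration):

```python
import re

def node_count(config_text: str) -> int:
    """Count logical node definitions in an APT configuration file."""
    # Each logical node is declared as: node "name" { ... }
    return len(re.findall(r'\bnode\s+"[^"]+"', config_text))

writer_cfg = '''{
    node "node1" { fastname "hostA" pools "" }
    node "node2" { fastname "hostA" pools "" }
}'''
reader_cfg = writer_cfg  # in this thread both jobs point at the same file

if node_count(writer_cfg) != node_count(reader_cfg):
    print("node counts differ: rebuild the dataset under the reader's config")
else:
    print("both configs define", node_count(writer_cfg), "nodes")
```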

Posted: Thu Feb 17, 2011 3:36 am
by niremy
devesh_ssingh wrote:Check the environment in which you are reading it.
Since a dataset is partitioned, it won't work across two different environments unless the configuration file is the same for both.

I mean, a dataset created on an 8-node configuration can't be read on a 4-node server.

For that, you should create a new dataset on the 4-node server.
Thanks for the hint ...

For the job creating the dataset:

Code:

Environment variable settings: 
APT_CONFIG_FILE=/app/EQOPIGL/ISF/Projects/EQOPIGL1/EQOPIGL1_INIT/apt_config_file_2_nodes.apt
For the job reading the dataset:

Code:

Environment variable settings: 
APT_CONFIG_FILE=/app/EQOPIGL/ISF/Projects/EQOPIGL1/EQOPIGL1_INIT/apt_config_file_2_nodes.apt
So no luck ...
I forgot to mention that the very same jobs work fine with a tiny file of 2 or 3 lines.

Posted: Fri Feb 18, 2011 10:28 am
by PaulVL
Show us the content of your APT file.

I'd be interested to see whether your data segment paths are valid on the server you are executing on.

Also, do you have proper read/write authority to those paths?
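Both checks can be scripted. A hypothetical sketch that pulls the resource disk and scratchdisk paths out of an APT config file and verifies that each directory exists and is readable/writable by the current user (the inlined config and its paths are invented for the example):

```python
import os
import re

def resource_paths(config_text: str) -> list[str]:
    """Extract resource disk / scratchdisk paths from an APT config file."""
    return re.findall(r'resource\s+(?:disk|scratchdisk)\s+"([^"]+)"', config_text)

config_text = '''{
    node "node1" {
        resource disk "/tmp/demo/dataset/node1" {pools ""}
        resource scratchdisk "/tmp/demo/scratch/node1" {pools ""}
    }
}'''

for path in resource_paths(config_text):
    exists = os.path.isdir(path)
    rw = exists and os.access(path, os.R_OK | os.W_OK)
    print(f"{path}: exists={exists} read/write={rw}")
```

Running this as the same user the jobs run under would surface a missing directory or a permission gap immediately.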

Posted: Fri Feb 18, 2011 12:05 pm
by niremy
PaulVL wrote:Show us the content of your APT file.

I'd be interested to see whether your data segment paths are valid on the server you are executing on.

Also, do you have proper read/write authority to those paths?
Here is the content of the APT_CONFIG_FILE:

Code:

 cat /app/EQOPIGL/ISF/Projects/EQOPIGL1/EQOPIGL1_INIT/apt_config_file_2_nodes.apt
{
        node "node1"
        {
                fastname "slxd2003.app.eiffage.loc"
                pools ""
                resource disk "/app/EQOPIGL/ISF/Files/EQOPIGL1/dataset/node1" {pools ""}
                resource scratchdisk "/app/EQOPIGL/ISF/Files/EQOPIGL1/scratch/node1" {pools ""}
        }
        node "node2"
        {
                fastname "slxd2003.app.eiffage.loc"
                pools ""
                resource disk "/app/EQOPIGL/ISF/Files/EQOPIGL1/dataset/node2" {pools ""}
                resource scratchdisk "/app/EQOPIGL/ISF/Files/EQOPIGL1/scratch/node2" {pools ""}
        }
}

Code:

tree -dpugfDi /app/EQOPIGL/ISF/Files/EQOPIGL1/dataset
/app/EQOPIGL/ISF/Files/EQOPIGL1/dataset
[drwxrwxr-x eqopigl1 eqopigl1 Feb 18 15:05]  /app/EQOPIGL/ISF/Files/EQOPIGL1/dataset/node1
[drwxrwxr-x eqopigl1 eqopigl1 Feb 18 15:05]  /app/EQOPIGL/ISF/Files/EQOPIGL1/dataset/node2
And my user is eqopigl1, of course.

As a reminder, using the same APT_CONFIG_FILE and the same job with a very small input file, everything works flawlessly.
I'm thinking more of a server misconfiguration, whereas you seem to suspect bad job design :wink:

Nevertheless, I appreciate your efforts in helping me find the problem :)

Posted: Mon Feb 21, 2011 11:34 am
by niremy
May I ask for some more comments?
I'm stuck with this problem and can't see any solution ... :?

Posted: Tue Feb 22, 2011 1:19 am
by kshah9
Hey buddy,

Just contact your admin team once. I can see the error "DS_001,0: /bin/echo: write error: Broken pipe"; I have faced the same issue, and contacting the DataStage admin (server team) resolved it. So just a suggestion: contact the admin team, mentioning the error message.

Not sure whether it will resolve the problem, but you can try.

Regards,
Kunal Shah

Posted: Tue Feb 22, 2011 5:43 am
by niremy
kshah9 wrote: Just contact your admin team once. I can see the error "DS_001,0: /bin/echo: write error: Broken pipe"; I have faced the same issue, and contacting the DataStage admin (server team) resolved it. So just a suggestion: contact the admin team, mentioning the error message.
Thanks, but I'm posting here on behalf of my admin team; we have the same level of knowledge on this issue :roll:
So again, any tips will help :wink:

Posted: Tue Feb 22, 2011 7:56 am
by chulett
Have you involved your official support provider yet?