The partition was evidently corrupted

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


nirav.parikh
Participant
Posts: 14
Joined: Thu Dec 13, 2007 2:57 am

The partition was evidently corrupted

Post by nirav.parikh »

We are on Information Server V11.3. We are facing an issue where some of the DataStage jobs that read from a dataset fail randomly with the error:

Fatal Error: I/O subsystem: partition 0 must be a multiple of 131072 in size (was 851968). The partition was evidently corrupted.

(The reported size is different with every failed job.)

I have searched the forums for this problem and found one post suggesting that this issue can happen when you run out of disk space.

I don't think that is the case in our situation: the resource disk has 1.5 TB allocated to it, of which only about 100-300 GB is in use at any given time.

Scratch space has about 100 GB, the TMPDIR environment variable points to a disk location that also has about 100 GB, and /tmp has about 17 GB and is cleaned up daily. The problem is that the failures happen randomly: different jobs fail on different days, and we see about 7-10 failures a week. The only workaround for now is to rerun the job that creates the dataset and then execute the job that reads it again, after which the issue goes away.

We are on DataStage V11.3 and the OS is Linux. One other thing that may be useful in finding a solution: our resource disk locations are on NFS and scratch is on SAN. It would be great if anyone could provide pointers/ideas/solutions for this issue.
Thanks & Regards
Nirav
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Re: The partition was evidently corrupted

Post by samyamkrishna »

Has the job which creates this dataset run fine?
Try running it again and then run the jobs that read it.
Cheers,
Samyam
nirav.parikh
Participant
Posts: 14
Joined: Thu Dec 13, 2007 2:57 am

Re: The partition was evidently corrupted

Post by nirav.parikh »

Yes, the job which creates the dataset completes successfully each time; only the jobs that read these datasets fail randomly. The creating job always overwrites the previous run's dataset, so we have even tried deleting the dataset with orchadmin so that the next run starts clean. The delete works correctly, but after a few days the issue crops up again.
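
For reference, a rough sketch of the cleanup step we run between loads; the dataset path is just an example, and this assumes orchadmin is on the PATH with the usual DataStage environment sourced (on some installs the subcommand is delete rather than rm):

    import subprocess

    # Example descriptor path, not our real one
    DATASET = "/projects/etl/datasets/feed.ds"

    # orchadmin removes the descriptor file and the data (segment) files it points
    # to on the resource disks; deleting only the .ds file with plain rm would
    # leave the segment files behind.
    subprocess.run(["orchadmin", "rm", DATASET], check=True)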
Thanks & Regards
Nirav
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

I would think this type of problem is caused outside of DataStage, such as with the file system or perhaps faulty hardware. I vaguely recall that NFS was to be avoided. Maybe someone else will have a better idea.
Choose a job you love, and you will never have to work a day in your life. - Confucius
nirav.parikh
Participant
Posts: 14
Joined: Thu Dec 13, 2007 2:57 am

Post by nirav.parikh »

While we wait for the experts to provide some guidance, here are some more observations.

As I mentioned before, the job that creates this dataset never fails; only the job that reads it fails. The pattern we have observed is that the data file on the failing partition is always 65536 bytes short of a multiple of 131072. For example, with the default block size of 131072, if the dataset description shows 7 blocks on a partition, that partition should be 131072 * 7 = 917504 bytes, but when the job fails the partition is 851968 bytes, as in the original post; 851968 is 65536 bytes, or half a block, less than 917504. The figures in this example are taken from the original post, but the gist is that whenever the reading job fails with the "corrupted" error, the partition size is half a block less than what it should be.
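
To make the check concrete, here is a minimal sketch of the test we do by hand when a read fails; the segment-file location and naming pattern are assumptions for illustration (the real file names come from the dataset descriptor and the configuration file):

    import glob, os

    BLOCK = 131072  # default dataset block size
    # Example location of segment files on the resource disk (assumption)
    PATTERN = "/resource/disk1/*.feed.ds.*"

    for path in sorted(glob.glob(PATTERN)):
        size = os.path.getsize(path)
        blocks, remainder = divmod(size, BLOCK)
        if remainder:
            # e.g. 851968 = 6 * 131072 + 65536, i.e. half a block short of 7 blocks
            print(f"BAD {path}: {size} bytes = {blocks} blocks + {remainder} bytes")
        else:
            print(f"OK  {path}: {size} bytes = {blocks} blocks")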

Originally we thought that, since this happens randomly, it might be caused by data issues, but we ruled that out because when we rerun the job with the same data it creates the dataset correctly. So it seems to be an environmental issue, but one we have not been able to isolate. Also note that this job runs every 15 minutes. Any help is much appreciated.
Thanks & Regards
Nirav
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Involve your official support provider.
-craig

"You can never have too many knives" -- Logan Nine Fingers
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Also keep an eye on the other jobs that run between the time the read job was working and the time it started failing, and check whether any of them could be corrupting the dataset. Likewise, keep track of activity on the server to see whether any other process is touching the dataset files.
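
One low-tech way to do that, sketched below with placeholder paths and interval: log the size and modification time of the segment files every few minutes, then compare the log against the failure time of the reading job to see when, and by what, the files changed.

    import glob, os, time
    from datetime import datetime

    PATTERN = "/resource/disk1/*.feed.ds.*"   # example segment files (assumption)
    LOGFILE = "/tmp/dataset_watch.log"

    while True:
        with open(LOGFILE, "a") as log:
            for path in sorted(glob.glob(PATTERN)):
                st = os.stat(path)
                log.write(f"{datetime.now().isoformat()} {path} "
                          f"size={st.st_size} mtime={st.st_mtime}\n")
        time.sleep(300)  # the job runs every 15 minutes, so sample every 5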
Cheers,
Samyam