We are on Information Server V11.3. We are facing an issue where some of our DataStage jobs that read from a dataset fail randomly with the error:
Fatal Error: I/O subsystem: partition 0 must be a multiple of 131072 in size (was 851968). The partition was evidently corrupted.
(The reported size is different with every failed job.)
I have searched the forums for this problem and found one post suggesting it can happen when you run out of disk space, but I do not think that is the case here: the resource disk has 1.5 TB allocated to it, of which only about 100-300 GB is used at any given time.
Scratch space has about 100 GB. The TMPDIR environment variable points to a disk location that also has about 100 GB, and /tmp has about 17 GB and is cleaned up daily. The failures happen randomly: different jobs fail on different days, and we see about 7-10 failures in a week. The only workaround so far is to rerun the job that creates the dataset and then rerun the job that reads it, after which the issue goes away.
We are at DataStage V11.3 and the OS is Linux. One other detail that might be useful in finding a solution: our resource disk locations are on NFS and scratch is on SAN. It would be great if anyone could provide any pointers/ideas/solutions for this issue.
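One way to catch this before the reading job runs would be a pre-flight size check on the dataset's partition data files. This is only a sketch of the idea, not something from our environment: the function, the sample sizes, and the check itself are illustrative, and it assumes the default 131072-byte block size mentioned in the error.

```python
# Hypothetical pre-flight check: flag any partition data file whose size
# is not an exact multiple of the dataset block size. Sample sizes below
# are illustrative (917504 = 7 full blocks, 851968 = 6.5 blocks).

BLOCK_SIZE = 131072  # default parallel dataset block size


def check_partition_sizes(sizes, block_size=BLOCK_SIZE):
    """Return (partition, size, shortfall) for every bad partition."""
    bad = []
    for part, size in enumerate(sizes):
        remainder = size % block_size
        if remainder:
            # bytes missing to reach the next full block boundary
            shortfall = block_size - remainder
            bad.append((part, size, shortfall))
    return bad


print(check_partition_sizes([917504, 851968]))
# -> [(1, 851968, 65536)]  i.e. partition 1 is half a block short
```

In practice the sizes would come from os.path.getsize() on each partition's data file; a non-empty result would mean the dataset is not safe to read and the creating job should be rerun.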
The partition was evidently corrupted
Thanks & Regards
Nirav
Re: The partition was evidently corrupted
Has the job which creates this dataset run fine?
Try running that again and then run the jobs that read it.
Cheers,
Samyam
Re: The partition was evidently corrupted
Yes, the job which creates the dataset completes successfully each time; only the jobs that read these datasets fail randomly. The creating job always overwrites the previous run's dataset, and we have even tried deleting the dataset with orchadmin so that the next run starts clean. The delete works correctly, but after a few days the issue randomly crops up again.
Thanks & Regards
Nirav
While we wait for the experts to provide some guidance, here is some more information and an observation.
As I mentioned before, the job that creates this dataset never fails; only the job that reads it does. The pattern we have observed is that the data file size on the failing partition is always 65536 bytes less than a multiple of 131072. For example, with the default block size of 131072, if the dataset description shows 7 blocks on a partition, the partition size should be 131072 * 7 = 917504 bytes; but when the job fails, that partition's size is 851968, which is 65536 bytes, or half a block, less than 917504. These figures are taken from the numbers in the original post, but the gist is that whenever the reading job fails with the dataset-corrupted error, the partition size is half a block less than what it should be.
Originally we thought that, since this happens randomly, it might be caused by data issues, but we ruled that out because rerunning the job with the same data creates the dataset correctly. So it seems to be an environmental issue, but we have not been able to isolate what it is. Also please note that this job runs every 15 minutes. Any help is much appreciated.
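The half-a-block pattern above can be verified with a quick arithmetic check. This is just a sanity check on the figures quoted in the thread (7 blocks, 131072-byte block size, 851968-byte observed file), not output from our system:

```python
# Sanity check on the figures from the post: the corrupted partition is
# exactly half a block short of the size implied by its block count.

BLOCK_SIZE = 131072

blocks_reported = 7                       # from the dataset description
expected = blocks_reported * BLOCK_SIZE   # 917504 bytes
observed = 851968                         # size of the failing partition

shortfall = expected - observed
print(expected, observed, shortfall)      # -> 917504 851968 65536
print(shortfall == BLOCK_SIZE // 2)       # -> True: exactly half a block
```

The consistent half-block shortfall is what makes a truncated final write (e.g. a partially flushed last block over NFS) look more plausible than random data-dependent corruption.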
Thanks & Regards
Nirav