We are on Information Server V11.3. We are facing an issue where some of our DataStage jobs that read from a dataset fail randomly with the error:
Fatal Error: I/O subsystem: partition 0 must be a multiple of 131072 in size (was 851968). The partition was evidently corrupted.
(The reported size is different with every failed job.)
I have searched the forums for this problem and found one post suggesting it can happen when you run out of disk space, but I do not think that is the case here: the resource disk has 1.5 TB allocated to it, of which only about 100-300 GB is used at any given time.
Scratch space has about 100 GB. The TMPDIR environment variable points to a disk location that also has about 100 GB, and /tmp has about 17 GB and is cleaned up daily. The failures happen randomly: different jobs fail on different days, and we see about 7-10 failures in a week. The only workaround so far is to rerun the job that creates the dataset and then rerun the job that reads it, after which the issue goes away.
We are at DataStage V11.3 and the OS is Linux. One other detail that might be useful in finding a solution: our resource disk locations are on NFS and scratch is on SAN. It would be great if anyone could provide any pointers/ideas/solutions for this issue.
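One way to catch this before the reading job runs would be a pre-flight size check on the dataset's partition data files. This is only a sketch of the idea, not something from our environment: the function, the sample sizes, and the check itself are illustrative, and it assumes the default 131072-byte block size mentioned in the error.

```python
# Hypothetical pre-flight check: flag any partition data file whose size
# is not an exact multiple of the dataset block size. Sample sizes below
# are illustrative (917504 = 7 full blocks, 851968 = 6.5 blocks).

BLOCK_SIZE = 131072  # default parallel dataset block size


def check_partition_sizes(sizes, block_size=BLOCK_SIZE):
    """Return (partition, size, shortfall) for every bad partition."""
    bad = []
    for part, size in enumerate(sizes):
        remainder = size % block_size
        if remainder:
            # bytes missing to reach the next full block boundary
            shortfall = block_size - remainder
            bad.append((part, size, shortfall))
    return bad


print(check_partition_sizes([917504, 851968]))
# -> [(1, 851968, 65536)]  i.e. partition 1 is half a block short
```

In practice the sizes would come from os.path.getsize() on each partition's data file; a non-empty result would mean the dataset is not safe to read and the creating job should be rerun.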
The partition was evidently corrupted
Thanks & Regards
Nirav
Re: The partition was evidently corrupted
Has the job which creates this dataset run fine?
Try running that again and then run the jobs that read it.
Cheers,
Samyam
Re: The partition was evidently corrupted
Yes, the job which creates the dataset completes successfully each time; only the jobs that read these datasets fail randomly. The creating job always overwrites the previous run's dataset, and we have even tried deleting the dataset with orchadmin so that the next run starts clean. The delete works correctly, but after a few days the issue randomly crops up again.
Thanks & Regards
Nirav
While we wait for the experts to provide some guidance, here is some more information and an observation.
As I mentioned before, the job that creates this dataset never fails; only the job that reads it does. The pattern we have observed is that the data file size on the failing partition is always 65536 bytes less than a multiple of 131072. For example, with the default block size of 131072, if the dataset description shows 7 blocks on a partition, the partition size should be 131072 * 7 = 917504 bytes; but when the job fails, that partition's size is 851968, which is 65536 bytes, or half a block, less than 917504. These figures are taken from the numbers in the original post, but the gist is that whenever the reading job fails with the dataset-corrupted error, the partition size is half a block less than what it should be.
Originally we thought that, since this happens randomly, it might be caused by data issues, but we ruled that out because rerunning the job with the same data creates the dataset correctly. So it seems to be an environmental issue, but we have not been able to isolate what it is. Also please note that this job runs every 15 minutes. Any help is much appreciated.
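The half-a-block pattern above can be verified with a quick arithmetic check. This is just a sanity check on the figures quoted in the thread (7 blocks, 131072-byte block size, 851968-byte observed file), not output from our system:

```python
# Sanity check on the figures from the post: the corrupted partition is
# exactly half a block short of the size implied by its block count.

BLOCK_SIZE = 131072

blocks_reported = 7                       # from the dataset description
expected = blocks_reported * BLOCK_SIZE   # 917504 bytes
observed = 851968                         # size of the failing partition

shortfall = expected - observed
print(expected, observed, shortfall)      # -> 917504 851968 65536
print(shortfall == BLOCK_SIZE // 2)       # -> True: exactly half a block
```

The consistent half-block shortfall is what makes a truncated final write (e.g. a partially flushed last block over NFS) look more plausible than random data-dependent corruption.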
Thanks & Regards
Nirav