MTBF of Ext3 and Partition Size

Ric Wheeler rwheeler at redhat.com
Thu Apr 16 16:55:37 UTC 2009


Theodore Tso wrote:
> On Thu, Apr 16, 2009 at 07:53:59AM -0400, Kyle Brandt wrote:
>> On several of my servers I seem to have a high rate of server crashes due to
>> file system errors.  So I have some questions related to this:
>>
>> Is there any Mean Time Between Failure ( MTBF) data for the ext3
>> file-system?
>>
>> Does increased partition size cause a higher risk of the partition being
>> corrupted? If so, is there any data on the ratio between partition size and
>> the likelihood of failure?
> 
> The probability of these sorts of filesystem problems is going to be
> dominated by hardware-induced corruption --- so it's not going to
> make a lot of sense to talk about MTBF without having a
> specific hardware context in mind.  If you have lousy memory, or a
> lousy disk controller cable, or a loose cable connector, then
> corruption will happen often.  If you are located some place
> where there is a strong alpha particle source, then you will have a
> much greater chance of bit flips.  If you use ECC memory,
> and do very careful hardware selection, with enterprise-quality disks
> that trade off disk capacity for a much stronger level of ECC,
> then of course the MTBF will be much higher.
> 
> (For example, there was the infamous story in the early 1990s when
> Sun had a spate of bad memory; I think it was ultimately traced to
> radioactive contamination of the ceramic materials used to make their
> memory chips; the resulting alpha particles caused "bit flips",
> which had the result of making their customers rather antsy,
> especially since Sun tried to deny there was even a problem for quite
> some time.)
> 
> So if you are having a high rate of server crashes, the first thing I
> would do is to make sure you have the latest distribution updates;
> it's possible it's caused by a kernel bug that has since been fixed,
> but it's somewhat unlikely.  The next thing I would do is take one of
> the machines that has been crashing offline, and run a 36-48
> hour memory test.
> 
>> Does ext3 on hardware raid (10) increase the possibility of file system
>> corruption?
> 
> No, it shouldn't --- unless you have a buggy or otherwise dodgy
> hardware raid controller.
> 
> 						- Ted
> 

One note is that the file system will often be the first notification that your 
hardware RAID has done something wrong - you should have a careful look at any 
logs/errors/etc that your storage maintains for you.
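
As a starting point for that log review, here is a minimal Python sketch that
pulls file system and block-layer error lines out of a kernel log. The two
patterns ("EXT3-fs error" and "I/O error") and the function names scan_log and
scan_file are assumptions for illustration; the exact wording of these
messages varies by kernel version and driver, and a hardware RAID controller
usually keeps its own event log that this will not see.

```python
import re

# Assumed patterns: kernel logs commonly contain strings of this shape,
# but the exact wording depends on kernel version and driver.
PATTERNS = {
    "ext3": re.compile(r"EXT3-fs error"),
    "io": re.compile(r"I/O error"),
}


def scan_log(lines):
    """Return {category: [matching lines]} for an iterable of log lines."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


def scan_file(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as f:
        return scan_log(f)
```

Calling scan_file on whatever file your distribution writes kernel messages
to (often /var/log/messages, but that path is an assumption too) gives you the
matching lines grouped by category. I/O errors that show up shortly before the
ext3 errors would point at the storage rather than the file system.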

Can you share specifics of your system - what is the storage, which kernel, etc?

Regards,

Ric



More information about the Ext3-users mailing list