MTBF of Ext3 and Partition Size

Theodore Tso tytso at mit.edu
Thu Apr 16 16:51:36 UTC 2009


On Thu, Apr 16, 2009 at 07:53:59AM -0400, Kyle Brandt wrote:
> 
> On several of my servers I seem to have a high rate of server crashes due to
> file system errors.  So I have some questions related to this:
> 
> Is there any Mean Time Between Failure (MTBF) data for the ext3
> file-system?
> 
> Does increased partition size cause a higher risk of the partition being
> corrupted? If so, is there any data on the ratio between partition size and
> the likelihood of failure?

The probability of these sorts of filesystem problems is going to be
dominated by hardware induced corruptions --- so it's not going to
make a lot of sense to talk about MTBF without having a
specific hardware context in mind.  If you have lousy memory, or a
lousy disk controller cable, or a cable connector which is loose, then
corruptions will happen often.  If you are located some place
where there is a strong alpha particle source, then you will have a
much greater chance of bit flips.  If you use ECC memory,
and do very careful hardware selection, with enterprise-quality disks
that trade off disk capacity for a much stronger level of ECC codes,
then of course the MTBF will be much higher.
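
As a rough illustration of why the hardware context matters, here is a
minimal sketch in Python.  It assumes failures arrive at a constant rate
(a simple exponential model, which is an assumption and not something
measured for ext3), and the MTBF figures in it are hypothetical, chosen
only to show how far apart different hardware setups can land.

```python
import math

HOURS_PER_YEAR = 24 * 365


def prob_at_least_one_failure(mtbf_hours: float, period_hours: float) -> float:
    """Chance of at least one failure within period_hours.

    Assumes a constant failure rate of 1/MTBF (exponential model).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-period_hours / mtbf_hours)


# Hypothetical MTBF values; these are NOT measurements of any real system.
scenarios = {
    "marginal memory or loose cable": 500.0,
    "commodity hardware": 50_000.0,
    "ECC memory, enterprise disks": 1_000_000.0,
}

for name, mtbf in scenarios.items():
    p = prob_at_least_one_failure(mtbf, HOURS_PER_YEAR)
    print(f"{name}: MTBF {mtbf:.0f} h, P(failure within a year) = {p:.4f}")
```

Under this model the filesystem is the same in every row; only the
assumed hardware MTBF changes, and that alone moves the yearly failure
probability from near zero to near certain.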

(For example, there was the infamous story in the early 1990s when
Sun had a spate of bad memory; I think it was ultimately traced to
radioactive contamination of the ceramic materials used to make their
memory chips.  The resulting alpha particles caused "bit flips", which
made their customers rather antsy, especially since Sun tried to deny
there was even a problem for quite some time.)

So if you are having a high rate of server crashes, the first thing I
would do is to make sure you have the latest distribution updates;
it's possible it's caused by a kernel bug that has since been fixed,
but it's somewhat unlikely.  The next thing I would do is take one of
the machines that has been crashing offline, and try running a 36-48
hour memory test.

> Does ext3 on hardware raid (10) increase the possibility of file system
> corruption?

No, it shouldn't --- unless you have a buggy or otherwise dodgy
hardware raid controller.

						- Ted



