
Re: Ext3 strangeness data loss


On Tue, 2003-02-04 at 16:43, Bodrogi Viktor wrote:

> This really breaks my confidence in RAID-1 mirrors.

Why?  RAID1 is there to deal with disk errors.  Not controller errors or
memory errors. 

The data on the disk is protected by a CRC, so disk reads themselves
should not fail silently --- i.e. if the disk reads back bad data, you'll
get a CRC failure and RAID1 will fail over to the other disk.
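
To make the failover behaviour concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how the md driver or any drive firmware is written: the names store, read_sector and raid1_read are hypothetical, and zlib.crc32 merely stands in for whatever check code the drive really uses.

```python
import zlib

def store(payload):
    # Hypothetical "sector": the payload plus a checksum written with it.
    return (payload, zlib.crc32(payload))

def read_sector(sector):
    # A read that fails loudly: a checksum mismatch becomes an I/O error
    # instead of bad data being handed back to the caller.
    payload, crc = sector
    if zlib.crc32(payload) != crc:
        raise IOError("sector CRC mismatch")
    return payload

def raid1_read(mirrors, index):
    # Try each mirror in turn; a detected error on one disk simply
    # moves the read to the next copy.
    for disk in mirrors:
        try:
            return read_sector(disk[index])
        except IOError:
            continue
    raise IOError("all mirrors failed")
```

Note that the check only covers the data up to the point where it is verified. If the payload is corrupted afterwards, say in main memory, nothing in this path notices.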

For SCSI, you have cable parity; for IDE, you have CRCs again (at least
in UDMA mode), so the transfer from disk to controller is once again
protected against silent data loss.

So if something goes wrong there, the OS is likely to hear about it,
take the disk offline, and fail over transparently to the other disk.

I've found that in the vast majority of cases where you get silent data
corruption, the corruption is occurring in system memory.  It's either
between the controller and the CPU, or between the CPU and main memory,
or it's bad memory on the motherboard or in the CPU cache.  And you just
can't protect against that, short of going for one of the massively
redundant fault-tolerant systems like Himalaya, which run multiple
instances of the CPU in lock-step and use majority voting to detect

> Would the situation get better with a four disk RAID-5?

No, RAID-5 is sometimes even more sensitive to such problems, because it
has the ability to reconstruct one disk from the contents of the others
--- and so, if one disk goes offline, then silent corruptions of the
other disks can cause it to reconstruct the wrong data for the missing
disk.
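
A small Python sketch shows why. It assumes one stripe of a four-disk RAID-5 with three data blocks and one XOR parity block; the helper names xor_blocks, parity and reconstruct are made up for illustration and are not taken from any real RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity(data_blocks):
    # RAID-5 parity for one stripe is the XOR of its data blocks.
    return xor_blocks(data_blocks)

def reconstruct(surviving_blocks):
    # The missing block is the XOR of every surviving block in the
    # stripe (remaining data blocks plus parity).
    return xor_blocks(surviving_blocks)

data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Disk 1 goes offline; with honest survivors the rebuild is exact.
rebuilt = reconstruct([data[0], data[2], p])

# Now assume disk 2 silently returns one flipped bit. No error is
# raised, and the same flipped bit lands in the rebuilt block.
bad = bytes([data[2][0] ^ 0x01]) + data[2][1:]
rebuilt_wrong = reconstruct([data[0], bad, p])
```

The rebuild has no way to tell the two cases apart: XOR will happily produce a block either way, so a silent error on a surviving disk becomes a silent error in the reconstructed one.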

Remember, there's a huge difference between silent errors, and errors
which are detected and dealt with intelligently.  RAID of all varieties
assumes that when data on disk gets corrupted, you get to hear about it
rather than silently being given bad data; and because of sector CRCs on
disk, that is usually a valid assumption.

> I prefer definitive errors than unknown failures.
> Then it gets show up as a disk error, not as random segfaults.
> If this phenomena is HW error, should it be logged anywhere?
> I didn't find anything in syslog...

You tell me!  It _could_ be just about anything.  If it's a sector IO
failure, it will be logged.  If it's main memory silently corrupting
data because of bad RAM, it won't be --- run memtest86 to try to locate

