[linux-lvm] Random file system errors

f-lvm at media.mit.edu f-lvm at media.mit.edu
Tue Apr 28 03:32:12 UTC 2009


I suspect two things:  RAM and one of your disk controllers.

Going for the latter first---when you created non-LVM tests, were you
using the same disk?  Probably not.  Same IDE or SATA channels?  Maybe
you only get random errors from one of your IDE channels, or only one
of your SATA channels, or perhaps everything that passes through the
Northbridge, or something like that.  You may have to swap disks
around to do fault isolation.  The fact that the errors -stick- once
the data's in RAM makes me think that it's getting trashed on the way
in but that otherwise your RAM might be good.
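
A rough sketch of that kind of fault isolation, assuming you have put a
copy of the same reference file on each disk or channel you want to
compare.  The names file_md5 and mismatched_copies are made up for
illustration.  Since errors stick once the data is in RAM, each copy
should be read cold (for example after a remount) for the comparison to
mean anything.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def mismatched_copies(reference, copies):
    """Return the paths in `copies` whose MD5 differs from `reference`.

    Hypothetical helper: `copies` would be the same file placed on
    different disks or controller channels.
    """
    expected = file_md5(reference)
    return [path for path in copies if file_md5(path) != expected]
```

If only the copies behind one controller or channel come back in the
list, that path is the first thing to suspect.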

But maybe it's not.  I had a bizarre failure once where I thought I
had a network problem, since I detected it when dd'ing 500GB from one
machine to another.  Turns out the problem was bad behavior in my RAM,
but ONLY WHEN the CPU was throttled down!  Once I turned off CPU
throttling, the random errors went away.  And, of course, memtest86+
never saw it, because it -always- nails the CPU...

(In my case, I saw bit flips via md5sum whether the data was coming
from IDE, SATA, or even a USB stick---and there's very little hardware
in common there.  A tight loop md5summing the same file (one which fit
in RAM) got wrong, wrong, right, right, right, right, and the "rights"
suddenly started getting spit out much faster---it was at that moment
that I realized the wrong values were probably being produced while
the CPU was throttled.  [And putting a 10-second sleep in between each
md5sum got mostly wrong values---but as soon as I started nailing the
CPU in another process, the values being spit out by the loop, even
with the pauses, became correct...].  The dd didn't use enough CPU
time to throttle up, so I saw errors---and at about the same rate as
you, maybe one bit flip every few gigabytes.  And I -knew- the data coming
in from the disks -had- to be good because it was a crypto filesystem
---bit errors there would trash entire blocks, and that wasn't
happening.)
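
The md5sum loop described above can be sketched in Python roughly as
follows.  This is not the exact loop I ran, just an illustration; the
names file_md5, repeat_hash and digest_counts are made up.  It hashes
the same file over and over, optionally sleeping between runs, and
records each digest along with how long the run took.

```python
import hashlib
import time
from collections import Counter


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeat_hash(path, runs=20, pause=0.0):
    """Hash the same file `runs` times.

    Returns a list of (digest, seconds) pairs.  `pause` is the number
    of seconds to sleep after each run, giving the CPU a chance to
    throttle back down.
    """
    results = []
    for _ in range(runs):
        start = time.monotonic()
        digest = file_md5(path)
        results.append((digest, time.monotonic() - start))
        if pause:
            time.sleep(pause)
    return results


def digest_counts(results):
    """Map each distinct digest to how many runs produced it."""
    return dict(Counter(digest for digest, _ in results))
```

On healthy hardware digest_counts should come back with exactly one
entry.  More than one entry for an unchanged file that fits in RAM
points at the machine rather than the disk, and comparing a run with
pause=0 against one with pause=10 gives a rough version of the
throttling experiment.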



