[Linux-cluster] About GFS1 and I/O barriers.

Wed Apr 2 09:53:34 UTC 2008

Hi,

On Mon, 2008-03-31 at 15:16 +0200, Mathieu Avila wrote:
> On Mon, 31 Mar 2008 11:54:20 +0100,
> Steven Whitehouse <swhiteho at redhat.com> wrote:
> 
> > Hi,
> > 
> 
> Hi,
> 
> > Both GFS1 and GFS2 are safe from this problem since neither of them
> > uses barriers. Instead we do a flush at the critical points to ensure
> > that all data is on disk before proceeding with the next stage.
> > 
> 
> I don't think this solves the problem.
> 
> Consider a cheap iSCSI disk (no NVRAM, no UPS) accessed by all my GFS
> nodes; this disk has a write cache enabled, which means it will reply
> that write requests are performed even if they are not really written
> on the platters. The disk (like most disks nowadays) has some logic
> that allows it to optimize writes by re-scheduling them. It is possible
> that all writes are ACK'd before the power failure, but only a fraction
> of them were really performed: some from before the flush, some from
> after the flush.
> --Not all block writes before the flush were performed, but other
> blocks after the flush were written -> the FS is corrupted.--
> So, after the power failure all data in the disk's write cache is
> lost. If the journal data was in the disk cache, the journal was
> not written to disk, but other metadata has been written, so there are
> metadata inconsistencies.
> 
I don't agree that write caching implies that I/O must be acked before
it has hit disk. It might well be reordered (which is ok), but if we
wait for all outstanding I/O completions, then we ought to be able to be
sure that all I/O is actually on disk, or at the very least that further
I/O will not be reordered with already ACKed data. If devices are
sending ACKs in advance of the I/O hitting disk then I think that's
broken behaviour.
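
A rough userspace sketch of the flush-then-proceed ordering described
above, for illustration only. It is not GFS code: the function name and
the two-file layout (a journal file and a data file) are assumptions, and
os.fsync() stands in for "wait for all outstanding I/O completions". It
is only as strong as the completion semantics of the underlying device,
which is exactly the point under discussion.

```python
import os

def commit_with_flush(journal_fd, data_fd, journal_record, offset, block):
    """Hypothetical helper: order two writes by flushing, not by barriers."""
    # Stage 1: write the journal record, then wait until the kernel
    # reports that it has been pushed out to the device.
    os.write(journal_fd, journal_record)
    os.fsync(journal_fd)
    # Stage 2: only after the flush has returned, write the block in place.
    os.pwrite(data_fd, block, offset)
    os.fsync(data_fd)
```

If the device ACKs writes that are still sitting in a volatile cache, the
first fsync() can return before the journal record is on the platters, and
the ordering above no longer guarantees anything.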

Consider what happens if a device were to send an ACK for a write and
then discovered an uncorrectable error during the write - how would it
then be able to report it since it had already sent an "ok"? So far as I
can see the only reason for having the drive send an I/O completion back
is to report the success or otherwise of the operation, and if that
operation hasn't been completed, then we might just as well not wait for
ACKs.

> This is the problem that I/O barriers try to solve, by really forcing
> the block device (and the block layer) to write all blocks issued before
> the barrier before any block issued after the barrier starts
> being written.
> 
> The other solution is to completely disable the write cache of the
> disks, but this leads to dramatically bad performance.
> 
If it's a choice between poor performance that's correct and good
performance which might lose data, then I know which I would choose :-)
Not all devices support barriers, so it always has to be an option; ext3
uses the barrier=1 mount option for this reason, and if it fails (e.g.
if the underlying device doesn't support barriers) it falls back to the
same technique which we are using in gfs1/2.
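
The "try a barrier, fall back to a flush" pattern can be sketched as a
toy model. Everything here is hypothetical: FakeDevice, its methods and
ordered_commit() are illustrative names, not the ext3 or block layer
interfaces, and the device only records the order of operations.

```python
class BarrierNotSupported(Exception):
    """Raised by the toy device when it cannot honour a barrier."""

class FakeDevice:
    """Hypothetical stand-in for a block device; logs operations in order."""
    def __init__(self, supports_barriers):
        self.supports_barriers = supports_barriers
        self.log = []

    def write(self, block):
        self.log.append(("write", block))

    def barrier(self):
        if not self.supports_barriers:
            raise BarrierNotSupported()
        self.log.append(("barrier",))

    def flush(self):
        # Models "wait for all outstanding I/O to complete".
        self.log.append(("flush",))

def ordered_commit(dev, journal_blocks, metadata_blocks):
    """Write the journal, then the metadata, keeping the two ordered."""
    for b in journal_blocks:
        dev.write(b)
    used_barrier = True
    try:
        dev.barrier()
    except BarrierNotSupported:
        # Fall back to the flush-and-wait technique.
        used_barrier = False
        dev.flush()
    for b in metadata_blocks:
        dev.write(b)
    return used_barrier
```

Calling ordered_commit() on a FakeDevice(supports_barriers=False) leaves a
flush entry between the journal and metadata writes in dev.log, while a
barrier-capable device gets a barrier entry instead.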

The other thing to bear in mind is that barriers, as currently
implemented, are not really that great either. It would be nice to
replace them with something that allows better performance with (for
example) mirrors where the only current method of implementing the
barrier is to wait for all the I/O completions from all the disks in the
mirror set (and thus we are back to waiting for outstanding I/O again).
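
As a minimal sketch of why a barrier on a mirror degenerates into waiting
for every disk, assume each leg of the mirror is modelled as a plain
Python list and each write as an append run on a worker thread. The
function name mirrored_barrier() and this model are assumptions made for
illustration, not how any real mirror driver is written.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def mirrored_barrier(legs, blocks_before, blocks_after):
    """legs: list of lists, each standing in for one disk of the mirror set."""
    with ThreadPoolExecutor(max_workers=len(legs)) as pool:
        # Issue the pre-barrier writes to every leg in parallel.
        pending = [pool.submit(leg.extend, blocks_before) for leg in legs]
        # The "barrier": block until every leg has reported completion.
        done, _ = wait(pending)
        for f in done:
            f.result()  # re-raise any error from a leg
        # Only then issue the post-barrier writes.
        pending = [pool.submit(leg.extend, blocks_after) for leg in legs]
        done, _ = wait(pending)
        for f in done:
            f.result()
    return legs
```

The slowest leg determines when the post-barrier writes can start, which
is the performance cost referred to above.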

Steve.