[Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing

Mon Dec 17 17:15:41 UTC 2018

On Mon, Dec 17, 2018 at 09:58:47AM -0500, Bob Peterson wrote:
> Dave Teigland recommended. Unless I'm mistaken, Dave has said that 
> GFS2 should never withdraw; it should always just kernel panic (Dave, 
> correct me if I'm wrong). At least this patch confines that behavior 
> to a small subset of withdraws.

The basic idea is that you want to get a malfunctioning node out of the way as quickly as possible so others can recover and carry on.  Escalating a partial failure into a total node failure is the best way to do that in this case.  Specialized recovery paths run from a partially failed node won't be as reliable, and are prone to blocking all the nodes.

I think a reasonable alternative to this is to just sit in an infinite retry loop until the i/o succeeds.

Dave
[Mark Syms] I would hope that this code would only trigger after some effort has been put into retrying, as panicking the host on the first I/O failure seems like a surefire way to get unhappy users (and, in our case, paying customers). As Edvin points out, there may be other filesystems that could unmount cleanly and thus avoid having to check everything on restart.
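
For illustration only, the two policies discussed in this thread (retrying indefinitely until the I/O succeeds, versus retrying a bounded number of times before escalating to a node failure) can be sketched in userspace C. This is not GFS2 code: `retry_io`, `io_op_fn`, `io_outcome`, and the `flaky_dev` test double are hypothetical names invented for this sketch, and a real implementation would need to sleep or back off between attempts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success, nonzero on an I/O error. */
typedef int (*io_op_fn)(void *ctx);

enum io_outcome {
    IO_OK,        /* the operation eventually succeeded */
    IO_ESCALATE   /* retries exhausted; caller should fail the node */
};

/*
 * Retry op until it succeeds.
 * max_attempts == 0 means retry forever (the infinite retry loop);
 * otherwise give up after max_attempts failed tries and report
 * IO_ESCALATE so the caller can decide to panic or withdraw.
 */
enum io_outcome retry_io(io_op_fn op, void *ctx, unsigned int max_attempts)
{
    unsigned int attempts = 0;

    assert(op != NULL);
    for (;;) {
        if (op(ctx) == 0)
            return IO_OK;
        attempts++;
        if (max_attempts != 0 && attempts >= max_attempts)
            return IO_ESCALATE;
        /* A real implementation would sleep or back off here. */
    }
}

/* Test double: fails a fixed number of times, then succeeds. */
struct flaky_dev {
    unsigned int failures_left;
    unsigned int calls;
};

int flaky_write(void *ctx)
{
    struct flaky_dev *dev = ctx;

    dev->calls++;
    if (dev->failures_left > 0) {
        dev->failures_left--;
        return -1;
    }
    return 0;
}
```

With a bound of 0 the helper never returns until the write succeeds, which matches the infinite retry suggestion; with a nonzero bound the caller gets an explicit signal to escalate only after some retry effort has been made.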