[Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing

Tue Dec 18 16:09:27 UTC 2018

Thanks Bob,

We believe we have seen these issues from time to time in our automated testing but I suspect that they're indicating a configuration problem with the backing storage. For flexibility a proportion of our purely functional testing will use storage provided by a VM running a software iSCSI target and these tests seem to be somewhat susceptible to getting I/O errors, some of which will inevitably end up being in the journal. If we start to see a lot we'll need to look at the config of the VMs first I think.

	Mark.

-----Original Message-----
From: Bob Peterson <rpeterso at redhat.com> 
Sent: 18 December 2018 15:52
To: Mark Syms <Mark.Syms at citrix.com>
Cc: cluster-devel at redhat.com
Subject: Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing

----- Original Message -----
> Hi Bob,
> 
> I agree, it's a hard problem. I'm just trying to understand that we've 
> done the absolute best we can and that if this condition is hit then 
> the best solution really is to just kill the node. I guess it's also a 
> question of how common this actually ends up being. We have now got 
> customers starting to use GFS2 for VM storage on XenServer so I guess 
> we'll just have to see how many support calls we get in on it.
> 
> Thanks,
> 
> Mark.

Hi Mark,

I don't expect the problem to be very common in the real world. 
The user has to get IO errors while writing to the GFS2 journal, which is not very common. The patch is basically reacting to a phenomenon we recently started noticing in which the HBA (qla2xxx) driver shuts down and stops accepting requests when you do abnormal reboots (which we sometimes do to test node recovery). In these cases, the node doesn't go down right away.
It stays up just long enough to cause IO errors with subsequent withdraws, which, we discovered, results in file system corruption.
Normal reboots, "/sbin/reboot -fin", and "echo b > /proc/sysrq-trigger" should not have this problem, nor should node fencing, etc.

And like I said, I'm open to suggestions on how to fix it. I wish there was a better solution.

As it is, I'd kind of like to get something into this merge window for the upstream kernel, but I'll need to submit the pull request for that probably tomorrow or Thursday. If we find a better solution, we can always revert these changes and implement a new one.

Regards,

Bob Peterson