[Linux-cluster] GFS2 corruption/withdrawal/crash

Wendell Dingus wendell at bisonline.com
Tue Jul 28 15:34:04 UTC 2009


> I'd be very interested to know about the circumstances which led up to 
> this file getting into that state. Was it always a zero byte file, or 
> has it been truncated at some stage from some larger size? Was there a 
> prior fs crash at some time, or has the fs been otherwise reliable since 
> mkfs time? 
> 
> Anything that you can tell us about the history of this file would be 
> very interesting to know. 

Sure... This particular file is one that gets recreated each night. It's the
result of a "find /path -print > file" that starts back at zero bytes each
night and ends up as a list of the filenames found under a particular
directory. It is currently ~150MB in size.
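
In other words, the job is basically just a shell redirect (the path and
output name here are the same placeholders as above, not the real ones):

    # nightly cron job, roughly; the ">" redirect truncates the file back to
    # zero bytes before find repopulates it, one filename per line
    find /path -print > file

So the file legitimately shrinks to zero and grows back to ~150MB once a day.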

> Ideally it wouldn't crash. In reality there are cases where what we need 
> to do in order to recover from an error gracefully cannot be done in the 
> context in which the error has occurred. The context in this case 
> usually means the locks which are being held at the time. There is some 
> ongoing work to try and improve on this, particularly wrt corrupt 
> on-disk structures. In some cases we can now just return -EIO to the 
> user and carry on rather than withdrawing from the cluster. 
> 
> The interesting thing in this case is that if the file is zero length, 
> it shouldn't have any indirect blocks at all, so it looks like the inode 
> height might have become corrupt. If you are able to save the metadata 
> from this fs, then that is something which we would find very helpful to 
> have a look at, 
> 
> Steve. 
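
(On saving the metadata: I assume the way to do that is gfs2_edit's savemeta
mode from gfs2-utils, something along the lines of

    # device path and output file below are placeholders, not our real names
    gfs2_edit savemeta /dev/myvg/gfs2lv /tmp/gfs2.meta

Let me know if there's a preferred procedure and I'll try to capture it.)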

This equipment was far less stable initially than I'd have liked, though apart
from this oops we appear to be in pretty good shape now. Physical hard power
cycles happened many times during our initial setup, which, while I'm on the
subject, might be worth explaining here along with what we did to stabilize
things.

Three servers, a few large GFS2 filesystems, Xen kernels, CLVMD, RHCS 
controlling a bunch of VMs. We were having lots of problems with the cluster
becoming inquorate and nodes being fenced every time a non-member node booted
and joined. 

We figured out that as a machine booted and started networking, started cman
and related components, and then started xend, there were more pauses than
there should have been, and the delay was long enough to trip RHCS into
thinking a node had died. So we renamed /etc/rc3.d/S98xend to S17xend, so that
xend would fire up and do its NIC moving/renaming before cluster suite started.
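
Roughly, the change was just reordering the runlevel 3 symlinks (the detail in
the comments about what xend's network setup does is from memory and may not
match every configuration):

    # make xend (and its network-bridge setup, which renames the physical NIC
    # and creates the Xen bridge) run before cman and the rest of cluster
    # suite start during boot
    cd /etc/rc3.d
    mv S98xend S17xend

The same thing could presumably be done by editing the chkconfig start
priority in the xend init script, but the rename was the quick fix.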

Thanks.

