[Linux-cluster] GFS2 fatal: invalid metadata block

Steven Whitehouse swhiteho at redhat.com
Wed Sep 30 09:24:41 UTC 2009


Hi,

On Tue, 2009-09-29 at 13:54 -0600, Kai Meyer wrote:
> Steven Whitehouse wrote:
> > Hi,
> >
> > You seem to have a number of issues here....
> >
> > On Fri, 2009-09-25 at 12:31 -0600, Kai Meyer wrote:
> >   
> >> Sorry for the slow response. We ended up (finally) purchasing some RHEL 
> >> licenses to try and get some phone support for this problem, and came up 
> >> with a plan to salvage what we could. I'll try to offer a brief history 
> >> of the problem in hope you can help me understand this issue a little 
> >> better.
> >> I've posted the relevant logfile entries to the events described here : 
> >> http://kai.gnukai.com/gfs2_meltdown.txt
> >> All the nodes send syslog to a remote server named pxe, so the combined 
> >> syslog for all the nodes plus the syslog server is here: 
> >> http://kai.gnukai.com/messages.txt
> >> We started with a 4 node cluster (nodes 1, 2, 4, 5). The GFS2 filesystem 
> >> was created with the latest CentOS 5.3 had to offer when it was 
> >> released. Node 3 was off at the time the errors occurred, and not part 
> >> of the cluster.
> >> First issue I can recover from syslog is from node 5 (192.168.100.105) 
> >> on Sep 8 14:11:27 was a 'fatal: invalid metadata block' error that 
> >> resulted in the file system being withdrawn.
> >>     
> >
> > Ok. So let's start with that message. Once that message has appeared, it
> > means that something on disk has been corrupted. The only way in which
> > that can be fixed is to unmount on all nodes and run fsck.gfs2 on the
> > filesystem. The other nodes will only carry on working until they too
> > read the same erroneous block.
> >
> > These issues are usually very tricky to track down. The main reason is
> > that the event which caused the corruption usually happened long before
> > the issue is discovered. Often there has been so much activity in the
> > meantime that it's impossible to attribute it to any particular event.
> >
> > That said, we are very interested to receive reports of such corruption
> > in case we can figure out the common factors between such reports.
> >
> >   
> Is there any more information I can provide that would be useful? At 
> this point, I don't have the old disk array anymore. Once the data was 
> recovered (as far as it was possible), the boss had me run smart checks 
> on the disks, and then he re-sold them to a customer.

There are a number of useful bits of information which we tend to ask for
to try to narrow down such issues, including:

1. How was the filesystem created?
 - Was it created with mkfs.gfs2, or upgraded from a GFS filesystem?
 - Was it grown with gfs2_grow at any stage?
2. Recovery
 - Was a failed node or node(s) recovered at some stage since the fs was
created?
 - What kind of fencing was used?
3. General usage pattern
 - What applications were running?
 - What kind of files were in use (large/small) ?
 - How were the files arranged? (all in one directory, a few directories
or many directories)
 - Was the usage heavy or light?
 - Was the fs using quota?
 - Was the system using selinux? (even if not in enforcing mode)
4. Hardware
 - What was the array in use? (make/model)
 - How was it configured? (RAID level)
 - How was it connected to the nodes? (fibre channel, AoE, etc)
5. Manual intervention
 - Was fsck.gfs2 run on the filesystem at any stage?
 - Did it find/repair any problems? (if so, what?)
 - Were there any log messages which struck you as odd?
 - Did you use manual fencing at any time? (not recommended, but
possible)
 - Did you notice any operations which seemed to run unusually
fast/slow?

I do realise that in many cases there will be only partial information
for a lot of the above questions, but that's the kind of information that
is very helpful to us in figuring these things out.
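Some of the above can be read straight off the superblock. For example
(device path is illustrative, and which tool is available depends on your
gfs2-utils version):

```shell
# Example commands for gathering filesystem details; the path is illustrative.
FS_DEV=/dev/clustervg/gfs2lv

# Dump the superblock (lock protocol, lock table name, on-disk format):
#      gfs2_tool sb /dev/clustervg/gfs2lv all
# On newer gfs2-utils the rough equivalent is:
#      tunegfs2 -l /dev/clustervg/gfs2lv
# The resource group index can hint at whether gfs2_grow was ever run:
#      gfs2_edit -p rindex /dev/clustervg/gfs2lv | head

echo "superblock dump: gfs2_tool sb $FS_DEV all"
```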

> > The current behaviour of withdrawing a node in the event of a disk error
> > is not ideal. In reality there is often little other choice though, as
> > letting the node continue to operate risks possible greater corruption
> > of data due to the potential for it to be working on incorrect data from
> > the original problem.
> >
> > On recent upstream kernels we've tried to be a bit better about handling
> > such errors by turning off use of individual resource groups in some
> > cases, so that at least some filesystem activity can carry on.
> >
> >   
> Is there a bug or something I can follow to see updates on this issue?
> 
There is bz #519049. There are a couple of others which might possibly
be the same thing, but might just as easily be configuration issues with
faulty fencing.

> >> Next was node 4 (192.168.100.104) to hit a 'fatal: filesystem 
> >> consistency error' that also resulted in the file system being 
> >> withdrawn. On the systems themselves, any attempt to access the 
> >> filesystem would result in a I/O error response. At the prospect of 
> >> rebooting 2 of the 4 nodes in my cluster, I brought node 3 
> >> (192.168.100.103) online first. Then I power cycled nodes 4 and 5 one at 
> >> a time and let them come back online. These nodes are running Xen, so I 
> >> started to bring the VMs that were on nodes 4 and 5 online on nodes 3-5 
> >> after all 3 had joined the cluster.
> >> Shortly thereafter, node 3 encounters the 'fatal: invalid metadata 
> >> block', and withdraws the file system. Then node 2 (.102) encounters 
> >> 'fatal: invalid metadata block' also, and withdraws the filesystem. So I 
> >> reboot them.
> >> During their reboot, nodes 1 (.101) and 5 hit the same 'fatal: invalid 
> >> metadata block' error. I waited for nodes 2 and 3 to come back online to 
> >> preserve the cluster. At this point, node 4 was the only node that still 
> >> had the filesystem mounted. After I had rebooted the other 4 nodes, none 
> >> of them could mount the filesystem after joining the cluster, and node 
> >> 4 was spinning on the error:
> >> Sep  8 16:54:22 192.168.100.104 kernel: GFS2: 
> >> fsid=xencluster1:xenclusterfs1.0: jid=4: Trying to acquire journal lock...
> >> Sep  8 16:54:22 192.168.100.104 kernel: GFS2: 
> >> fsid=xencluster1:xenclusterfs1.0: jid=4: Busy
> >> It wasn't until this point that we suspected the SAN. We discovered that 
> >> the SAN had marked a drive as "failed" but did not remove it from the 
> >> array and begin to rebuild on the hot spare. When we physically removed 
> >> the failed drive, the hot spare was picked up and put into the array.
> >> The VMs on node 4 were the only ones "running" but they had all crashed 
> >> because their disk was unavailable. I decided to reboot all the nodes to 
> >> try and re-establish the cluster. We were able to get all the VMs turned 
> >> back on, and we thought we were out of the dark, with the exception of 
> >> the high level of filesystem corruption we caused inside 30% of the VMs' 
> >> filesystems. We ran them through their ext3 filesystem checks, and got 
> >> them all running again.
> >>
> >>     
> > ext3 or gfs2? I assume you mean the latter
> >
> >   
> I did mean ext3. The filesystems I was running fsck on were inside each 
> individual VM's disk image. At this point, we had not attempted a gfs2_fsck.
Ah, now I see. Sorry I didn't follow that the first time.

> >> Then at the time I send the original email, we were encountering the 
> >> same invalid metadata block errors on the VMs at different points.
> >>
> >> With Redhat on the phone, we decided to migrate as much data as we could 
> >> from the original production SAN to a new SAN, and bring the VMs online 
> >> on the new SAN. There were a total of 3 VM disk images that would not 
> >> copy because they would trigger the invalid metadata block error every 
> >> time. After the migration, we tried 3 filesystem checks, all of which 
> >> failed, leaving the fsck_dlm mechanism configured on the filesystem. We 
> >> were able to override the lock with the instructions here:
> >> http://kbase.redhat.com/faq/docs/DOC-17402
> >>
> >>     
> > Was that reported as a bugzilla? fsck.gfs2 should certainly not fail in
> > that way. Although, bearing in mind what you've said about bad hardware,
> > that might be the reason. 
> >
> >   
> I didn't do any reporting via bugzilla. Redhat tech support intimated 
> that a bug report from CentOS servers wouldn't get much attention. 
> Another reason we are very interested in moving to RHEL 5.4.
Well, it's not going to get as much attention as a RHEL bug, but all
reports are useful. It may give us a hint which we'd not otherwise have,
and sometimes the only way to solve an issue is to look at lots of
reports and find the common factors. So please don't let that put you off
reporting it.
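
For the record, the override from the kbase document mentioned earlier boils
down to resetting the superblock lock protocol that an interrupted fsck
leaves behind (the device path below is an example; do check the article
before running this on a real filesystem):

```shell
# An interrupted fsck.gfs2 can leave the lock protocol set to "fsck_dlm",
# which prevents mounting. Resetting it to lock_dlm clears that state.
FS_DEV=/dev/clustervg/gfs2lv   # example device; substitute your own

# Inspect the current protocol first:
#      gfs2_tool sb /dev/clustervg/gfs2lv proto
# Then reset it (the filesystem must be unmounted on all nodes):
#      gfs2_tool sb /dev/clustervg/gfs2lv proto lock_dlm

echo "reset with: gfs2_tool sb $FS_DEV proto lock_dlm"
```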

Steve.





