[Linux-cluster] GFS2 fatal: invalid metadata block

Kai Meyer kai at fiber.net
Mon Oct 19 22:30:59 UTC 2009


Ok, so our lab test results have turned up some fun events.

Firstly, we were able to duplicate the invalid metadata block error 
exactly under the following circumstances:

We wanted to monkey with the VLAN that fenced/openais ran on. We failed 
miserably, leaving all three of my test nodes believing they had become 
lone islands in the cluster, each without enough votes to fence anybody. 
So we chose to simply power cycle the nodes without trying to gracefully 
leave the cluster or reboot (they are diskless servers with NFS root 
filesystems, so the GFS2 filesystem was the only thing we risked 
corrupting). After the nodes came back online, we began to see the same 
random reboots and filesystem withdrawals within 24 hours. The 
filesystem that eventually hit these errors in production was likely not 
reformatted just before it went into production, and I believe it is 
highly likely that the last format of that production filesystem was 
done while we were still testing. I hope that as we continue in our lab 
we can reproduce the same circumstances and give you a step-by-step 
procedure that causes this issue. That would make me feel much better 
about our current GFS2 filesystem, which was created and unmounted 
cleanly by a single node, put straight into production, and has been 
mounted only once by our current production servers since it was 
formatted.

Secondly, given the way our VMs are doing I/O, we have found that the 
cluster.conf configuration settings:
<dlm plock_ownership="1" plock_rate_limit="0"/>
<gfs_controld plock_rate_limit="0"/>
have lowered our %wa times from ~60% to ~30% utilization. I am curious 
why the locking daemon's rate limit defaults to such a low number 
(100). Adding these two parameters to cluster.conf raised our locks per 
second, as measured with the ping_pong binary, from 93 to 3000+ in our 
5 node cluster. Our throughput doesn't seem to improve by either 
raising the locking limit or setting up jumbo frames, but processes 
spend much less time in I/O wait than before (if my munin graphs are 
believable). How likely is it that the low locking rate had a hand in 
causing the filesystem withdrawals and 'invalid metadata block' errors?
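
For reference, this is roughly where those two elements sit in our 
cluster.conf and how we measure the locking rate. The cluster name is 
ours, but the config_version, mount point and test file below are 
placeholders rather than our real values:

  <cluster name="xencluster1" config_version="42">
    <!-- plock rate limiting off, plock ownership on -->
    <dlm plock_ownership="1" plock_rate_limit="0"/>
    <gfs_controld plock_rate_limit="0"/>
    <!-- clusternodes, fencedevices, rm sections omitted -->
  </cluster>

  # run from each node against the same file on the GFS2 mount;
  # the second argument is typically the node count plus one
  ping_pong /gfs2/xen/ping_pong.tmp 6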

I'm still not completely confident I won't see this happen again on my 
production servers. I'm hoping you can help me with that.

-Kai Meyer

On 09/30/2009 10:36 AM, Kai Meyer wrote:
> Steven Whitehouse wrote:
>> Hi,
>>
>> On Tue, 2009-09-29 at 13:54 -0600, Kai Meyer wrote:
>>> Steven Whitehouse wrote:
>>>> Hi,
>>>>
>>>> You seem to have a number of issues here....
>>>>
>>>> On Fri, 2009-09-25 at 12:31 -0600, Kai Meyer wrote:
>>>>> Sorry for the slow response. We ended up (finally) purchasing some 
>>>>> RHEL licenses to try and get some phone support for this problem, 
>>>>> and came up with a plan to salvage what we could. I'll try to 
>>>>> offer a brief history of the problem in the hope that you can 
>>>>> help me understand this issue a little better.
>>>>> I've posted the relevant logfile entries to the events described 
>>>>> here : http://kai.gnukai.com/gfs2_meltdown.txt
>>>>> All the nodes send syslog to a remote server named pxe, so the 
>>>>> combined syslog for all the nodes plus the syslog server is here: 
>>>>> http://kai.gnukai.com/messages.txt
>>>>> We started with a 4 node cluster (nodes 1, 2, 4, 5). The GFS2 
>>>>> filesystem was created with the latest packages CentOS 5.3 had to 
>>>>> offer when it was released. Node 3 was off at the time the errors 
>>>>> occurred, and not part of the cluster.
>>>>> The first issue I can recover from syslog is from node 5 
>>>>> (192.168.100.105) on Sep 8 14:11:27: a 'fatal: invalid metadata 
>>>>> block' error that resulted in the filesystem being withdrawn.
>>>> OK, so let's start with that message. Once that message has 
>>>> appeared, it
>>>> means that something on disk has been corrupted. The only way in which
>>>> that can be fixed is to unmount on all nodes and run fsck.gfs2 on the
>>>> filesystem. The other nodes will only carry on working until they too
>>>> read the same erroneous block.
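>>>>
>>>> In practice the sequence looks something like the following; the 
>>>> mount point and device path are only examples, not taken from your 
>>>> setup:
>>>>
>>>>   # on every node in the cluster
>>>>   umount /mnt/gfs2
>>>>
>>>>   # then, from one node only, check and repair the filesystem
>>>>   fsck.gfs2 -y /dev/mapper/example-gfs2-lv
>>>>
>>>>   # remount on each node once the check completes cleanly
>>>>   mount /mnt/gfs2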
>>>>
>>>> These issues are usually very tricky to track down. The main reason 
>>>> for
>>>> that is that the event which caused the corruption is usually long in
>>>> the past before the issue is discovered. Often there has been so much
>>>> activity that it's impossible to attribute it to any particular event.
>>>>
>>>> That said, we are very interested to receive reports of such 
>>>> corruption
>>>> in case we can figure out the common factors between such reports.
>>>>
>>> Is there any more information I can provide that would be useful? At 
>>> this point, I don't have the old disk array anymore. Once the data 
>>> was recovered (as far as it was possible), the boss had me run smart 
>>> checks on the disks, and then he re-sold them to a customer.
>>
>> There are a number of useful bits of info which we tend to ask for to
>> try and narrow down such issues; these include:
>>
>> 1. How was the filesystem created?
>>  - Was it with mkfs.gfs2 or an upgrade from a GFS2 filesystem
> It was created with mkfs.gfs2
>>  - Was it grown with gfs2_grow at any stage?
> Nope
>> 2. Recovery
>>  - Was a failed node or node(s) recovered at some stage since the fs was
>> created?
> Nodes in the cluster never failed due to this bug. The filesystem 
> would withdraw, but the OS would continue to list the filesystem as 
> mounted, so I could never leave the fence domain gracefully, nor do a 
> graceful reboot. 'service cman stop' would always fail. If I killed 
> off the fenced process the node would get fenced, but then I would 
> have to remove the /etc/fstab entry for the gfs2 filesystem in order 
> for the server to do a graceful reboot. I feel lucky that all my 
> servers run diskless with a read-only NFS root, so getting a node back 
> into the cluster was a matter of a hard reboot. The only filesystems 
> mounted on the nodes are over NFS and GFS2. With GFS2 withdrawn, there 
> are no filesystems left to corrupt if I simply shut down the server.
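>
> For what it's worth, recovering a withdrawn node boiled down to 
> something like this (the sed pattern is only an illustration of 
> "comment out the gfs2 entry", not our exact fstab line):
>
>   # killing fenced gets this node fenced by the rest of the cluster
>   killall fenced
>
>   # comment out the gfs2 mount so a graceful reboot doesn't hang on it
>   sed -i '/gfs2/s/^/#/' /etc/fstab
>
>   reboot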
>>  - What kind of fencing was used?
> We already had an API for managing ports and VLANs on our Cisco 
> routers, so we wrote a short bash script that would shut down the port 
> specified in the cluster.conf, which we configured to be the port 
> dedicated to the iSCSI connection to the SAN. The fencing worked great 
> every time, meaning that once a server was fenced, its ability to 
> connect to the SAN was severed at the switch level. However, fencing 
> was never triggered until I power cycled a server. Nodes remained in 
> the cluster after the filesystem was withdrawn.
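>
> The script itself is little more than a wrapper around that API. A 
> minimal sketch of the idea (the API endpoint and parameter names below 
> are made up for illustration, not our real ones):
>
>   #!/bin/bash
>   # fenced passes its options as key=value lines on stdin
>   while read line; do
>       case "$line" in
>           port=*)   port="${line#port=}" ;;
>           action=*) action="${line#action=}" ;;
>       esac
>   done
>
>   # disable or re-enable the node's dedicated iSCSI switch port
>   case "$action" in
>       on) curl -s "https://switch-api.example.net/port/$port/enable" ;;
>       *)  curl -s "https://switch-api.example.net/port/$port/shutdown" ;;
>   esac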
>> 3. General usage pattern
>>  - What applications were running?
> Only xen HVM with vanilla loopback drivers for the sparse disk images.
>>  - What kind of files were in use (large/small) ?
> The sparse disk images start around 1-3GB and some were as large as 
> 50GB. The small plain text xen configuration files were also on the 
> GFS2 filesystem.
>>  - How were the files arranged? (all in one directory, a few directories
>> or many directories)
> At the root of the GFS2 filesystem, there was a folder for each VM, 
> each folder containing sparse disk image files and one plain text 
> configuration file.
>>  - Was the usage heavy or light?
> I'm not sure how it compares to others. The throughput is around 
> 6MB/sec (bytes not bits) as reported from /proc/diskstats, and it 
> correlates nearly exactly with bandwidth graphs.
>>  - Was the fs using quota?
> No, we mount it with the quota=off option.
>>  - Was the system using selinux? (even if not in enforcing mode)
> It's currently set to 'disabled'.
>> 4. Hardware
>>  - What was the array in use? (make/model)
> Promise VTrak M610i
>>  - How was it configured? (RAID level)
> 14 disk RAID10
>>  - How was it connected to the nodes? (fibre channel, AoE, etc)
> iSCSI
>> 5. Manual intervention
>>  - Was fsck.gfs2 run on the filesystem at any stage?
> After we moved off all the data we could, we ran gfs2_fsck 3 times on 
> the entire array. All three runs failed to restore the lock_dlm setting.
>>  - Did it find/repair any problems? (if so, what?)
> I have a typescript log floating around on one of the test servers, I 
> think. It didn't seem to show any errors.
>>  - Were there any log messages which struck you as odd?
> I've posted all the syslog messages in a previous post. Nothing looked 
> terribly fishy outside of what we've discussed.
>>  - Did you use manual fencing at any time? (not recommended, but
>> possible)
> No manual fencing.
>>  - Did you notice any operations which seemed to run unusually
>> fast/slow?
> Nothing outstanding.
>> I do realise that in many cases there will be only partial information
>> for a lot of the above questions, but that's the kind of information that
>> is very helpful to us in figuring these things out.
>>
>>>> The current behaviour of withdrawing a node in the event of a disk 
>>>> error is not ideal. In reality there is often little other choice, 
>>>> though, as letting the node continue to operate risks greater 
>>>> corruption of data, since it may already be working from incorrect 
>>>> data left by the original problem.
>>>>
>>>> On recent upstream kernels we've tried to be a bit better about 
>>>> handling
>>>> such errors by turning off use of individual resource groups in some
>>>> cases, so that at least some filesystem activity can carry on.
>>>>
>>> Is there a bug or something I can follow to see updates on this issue?
>>>
>> There is bz #519049; there are a couple of others which might possibly
>> be the same thing, but they might just as easily be configuration issues
>> with faulty fencing.
> Thanks, I'll try to keep an eye on it. I really appreciate the time 
> you've put in to help me understand what's going on.
>
>>>>> Next, node 4 (192.168.100.104) hit a 'fatal: filesystem 
>>>>> consistency error' that also resulted in the filesystem being 
>>>>> withdrawn. On the systems themselves, any attempt to access the 
>>>>> filesystem would result in an I/O error response. At the prospect 
>>>>> of rebooting 2 of the 4 nodes in my cluster, I brought node 3 
>>>>> (192.168.100.103) online first. Then I power cycled nodes 4 and 5 
>>>>> one at a time and let them come back online. These nodes are 
>>>>> running Xen, so I started to bring the VMs that were on nodes 4 and 
>>>>> 5 online on nodes 3-5 after all 3 had joined the cluster.
>>>>> Shortly thereafter, node 3 encountered the 'fatal: invalid metadata 
>>>>> block' error and withdrew the filesystem. Then node 2 (.102) 
>>>>> encountered 'fatal: invalid metadata block' as well and withdrew 
>>>>> the filesystem, so I rebooted them.
>>>>> During their reboot, nodes 1 (.101) and 5 hit the same 'fatal: 
>>>>> invalid metadata block' error. I waited for nodes 2 and 3 to come 
>>>>> back online to preserve the cluster. At this point, node 4 was the 
>>>>> only node that still had the filesystem mounted. After I had 
>>>>> rebooted the other 4 nodes, none of them could mount the filesystem 
>>>>> after joining the cluster, and node 4 was spinning on the error:
>>>>> Sep  8 16:54:22 192.168.100.104 kernel: GFS2: 
>>>>> fsid=xencluster1:xenclusterfs1.0: jid=4: Trying to acquire journal 
>>>>> lock...
>>>>> Sep  8 16:54:22 192.168.100.104 kernel: GFS2: 
>>>>> fsid=xencluster1:xenclusterfs1.0: jid=4: Busy
>>>>> It wasn't until this point that we suspected the SAN. We 
>>>>> discovered that the SAN had marked a drive as "failed" but did not 
>>>>> remove it from the array and begin to rebuild on the hot spare. 
>>>>> When we physically removed the failed drive, the hot spare was 
>>>>> picked up and put into the array.
>>>>> The VMs on node 4 were the only ones "running" but they had all 
>>>>> crashed because their disk was unavailable. I decided to reboot 
>>>>> all the nodes to try and re-establish the cluster. We were able to 
>>>>> get all the VMs turned back on, and we thought we were out of the 
>>>>> woods, with the exception of the high level of filesystem 
>>>>> corruption we had caused inside 30% of the VMs' filesystems. We ran 
>>>>> them through their ext3 filesystem checks, and got them all 
>>>>> running again.
>>>>>
>>>> ext3 or gfs2? I assume you mean the latter
>>>>
>>> I did mean ext3. The filesystems I was running fsck on were inside 
>>> each individual VM's disk image. At this point, we had not attempted 
>>> a gfs2_fsck.
>> Ah, now I see. Sorry I didn't follow that the first time.
>>
>>>>> Then, by the time I sent the original email, we were encountering 
>>>>> the same invalid metadata block errors on the VMs at different 
>>>>> points.
>>>>>
>>>>> With Red Hat on the phone, we decided to migrate as much data as we 
>>>>> could from the original production SAN to a new SAN, and bring the 
>>>>> VMs online on the new SAN. There were a total of 3 VM disk images 
>>>>> that would not copy because they would trigger the invalid 
>>>>> metadata block error every time. After the migration, we tried 3 
>>>>> filesystem checks, all of which failed, leaving the fsck_dlm 
>>>>> mechanism configured on the filesystem. We were able to override 
>>>>> the lock with the instructions here:
>>>>> http://kbase.redhat.com/faq/docs/DOC-17402
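>>>>>
>>>>> The override amounted to setting the superblock's locking protocol 
>>>>> back to lock_dlm; from memory it was something along these lines 
>>>>> (the device path is only illustrative):
>>>>>
>>>>>   # with the filesystem unmounted everywhere, restore the lock 
>>>>>   # protocol that the interrupted fsck left overridden
>>>>>   gfs2_tool sb /dev/mapper/example-gfs2-lv proto lock_dlm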
>>>>>
>>>> Was that reported as a bugzilla? fsck.gfs2 should certainly not 
>>>> fail in 
>>>> that way. Although, bearing in mind what you've said about bad 
>>>> hardware,
>>>> that might be the reason.
>>> I didn't do any reporting via Bugzilla. Red Hat tech support 
>>> intimated that a bug report from CentOS servers wouldn't get much 
>>> attention. That's another reason we are very interested in moving to 
>>> RHEL 5.4.
>> Well, it's not going to get as much attention as a RHEL bug, but all
>> reports are useful. It may give us a hint which we'd not otherwise have,
>> and sometimes the only way to solve an issue is to look at lots of
>> reports to find the common factors. So please don't let that put you off
>> reporting it.
>>
>> Steve.
>>
>>
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> Ok, if I end up being able to find the fsck typescript log, I'll 
> create a bug and include all the information I've provided here so 
> far.
>
> Thanks so much for your attention.
>
> -Kai Meyer
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




More information about the Linux-cluster mailing list