[Linux-cluster] GFS 1.04: fatal: assertion "x <= length" failed
Frederik Schueler
fs at lowpingbastards.de
Tue Oct 9 15:26:19 UTC 2007
Hello Bob,
First of all, thanks for the detailed answer.
On Mon, Oct 08, 2007 at 03:26:56PM -0500, Bob Peterson wrote:
> This is odd. What it means is this: GFS was searching for a free block
> to allocate. The resource group ("RG"--not to be confused with
> rgmanager's resource groups) indicated at least one free block for that
> section of the file system, but there were no free blocks to be found in
> the bitmap for that section (a direct contradiction). Therefore, the
> file system was determined to be corrupt.
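The check Bob describes can be sketched minimally. This is a hypothetical illustration, not GFS code: real GFS bitmaps use two bits per block, while this sketch uses one bit (0 = free, 1 = in use) to keep it readable. The RG header's advertised free count is compared against what the bitmap actually contains:

```shell
#!/bin/sh
# Hypothetical illustration of the consistency check that failed:
# the resource-group header advertises a free-block count, while the
# bitmap is the ground truth.  All values here are made up.
bitmap="0110 1111 1000"   # 12 blocks, 5 actually free (the zeros)
advertised_free=6         # what the (corrupt) RG header claims

# Count the zero bits in the bitmap string.
actual_free=$(printf '%s' "$bitmap" | tr -cd '0' | wc -c)

if [ "$actual_free" -lt "$advertised_free" ]; then
    # the "direct contradiction" case: fewer free blocks than promised
    echo "corrupt: header claims $advertised_free free, bitmap has $actual_free"
else
    echo "consistent"
fi
```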
What explains the files in l+f?
The files were created a day before the crash; they disappeared some time
later and were recreated, but I was not informed about this before the
crash happened.
In this particular setup, the "storage" is a two-node heartbeat (2.1.2)
failover cluster exporting a drbd 8.2.0 device via iSCSI (iscsitarget
0.4.15).
The systems are RHEL5 with the stock 2.6.18-8.1.14.el5 kernel; drbd and
heartbeat are self-compiled.
The iSCSI target is exported without header and data digests; I will
switch it to crc32c now to rule out the network.
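For reference, an ietd.conf fragment with both digests set to crc32c might look like the following sketch; the target IQN and LUN path are placeholders, not our actual values:

```
Target iqn.2007-10.de.example:storage.gfs0
        Lun 0 Path=/dev/drbd0,Type=blockio
        HeaderDigest CRC32C
        DataDigest CRC32C
```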
> (3) Another possibility is a hardware problem
> with your media--the hard drives, FC switch, HBAs, etc. This could
> happen, for example, if GFS read the bitmap(s) from disk and the disk
> returned the wrong information. We've seen a lot of that, and the best
> thing to do is test the media (but it's a tedious and sometimes
> destructive task).
I wonder if a drbd failover might cause something like this. Manually
switching the resources to the backup node works as follows: the drbd
share is made secondary, iscsi-target is stopped, and the service IP is
removed; on the other node the IP is activated, the drbd device is made
primary, and iscsitarget is started. The clients log a
connection0:0: iscsi: detected conn error (1011)
and everything continues after the switch, which takes 4-5 seconds.
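The switch-over steps above can be sketched as a dry-run script. The drbd resource name (r0), the service IP, and the interface below are placeholder assumptions, not our actual configuration:

```shell
#!/bin/sh
# Dry-run sketch of the manual failover described above.  RUN=echo
# prints each step instead of executing it; clearing RUN would run
# the commands for real.  r0, 10.0.0.10/24 and eth0 are placeholders.
RUN=echo

release_service() {     # on the node giving up the service
    $RUN drbdadm secondary r0
    $RUN /etc/init.d/iscsi-target stop
    $RUN ip addr del 10.0.0.10/24 dev eth0
}

take_over_service() {   # on the node taking over
    $RUN ip addr add 10.0.0.10/24 dev eth0
    $RUN drbdadm primary r0
    $RUN /etc/init.d/iscsi-target start
}

release_service
take_over_service
```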
OTOH, there has been no failover since the gfs filesystem was formatted
and populated, only some scheduled node reboots.
I wonder if broken memory in one of the nodes or the iSCSI servers could
be to blame, but I did not see segfaults or MCEs anywhere. GFS does
direct I/O, and we had iscsitarget set to blockio, which is uncached
direct I/O too.
> If you had not run gfs_fsck, we might have been able to tell a little
> bit more about what happened from the contents of the journals.
> For example, in RHEL5 and equivalent, you can use gfs2_edit to save
> off the file system metadata and send it in for analysis.
> (gfs2_edit can operate on gfs1 file systems as well as gfs2). However,
> since gfs_fsck clears the journals, that information is now long gone.
Oh, thanks for the hint, I'll do this in case this happens again.
Best regards
Frederik Schüler
--
ENOSIG