[Linux-cluster] GFS 1.04: fatal: assertion "x <= length" failed
Frederik Schueler
fs at lowpingbastards.de
Tue Oct 9 15:26:19 UTC 2007
Hello Bob,
First of all, thanks for the detailed answer.
On Mon, Oct 08, 2007 at 03:26:56PM -0500, Bob Peterson wrote:
> This is odd. What it means is this: GFS was searching for a free block
> to allocate. The resource group ("RG"--not to be confused with
> rgmanager's resource groups) indicated at least one free block for that
> section of the file system, but there were no free blocks to be found in
> the bitmap for that section (a direct contradiction). Therefore, the
> file system was determined to be corrupt.
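The check Bob describes can be sketched minimally. This is a hypothetical illustration, not GFS code: real GFS bitmaps use two bits per block, while this sketch uses one bit (0 = free, 1 = in use) to keep it readable. The RG header's advertised free count is compared against what the bitmap actually contains:

```shell
#!/bin/sh
# Hypothetical illustration of the consistency check that failed:
# the resource-group header advertises a free-block count, while the
# bitmap is the ground truth.  All values here are made up.
bitmap="0110 1111 1000"   # 12 blocks, 5 actually free (the zeros)
advertised_free=6         # what the (corrupt) RG header claims

# Count the zero bits in the bitmap string.
actual_free=$(printf '%s' "$bitmap" | tr -cd '0' | wc -c)

if [ "$actual_free" -lt "$advertised_free" ]; then
    # the "direct contradiction" case: fewer free blocks than promised
    echo "corrupt: header claims $advertised_free free, bitmap has $actual_free"
else
    echo "consistent"
fi
```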
What explains the files in l+f?
The files were created a day before the crash; they disappeared some time
later and were recreated, but I was not informed about this before the
crash happened.
In this particular setup, the "storage" is a two-node heartbeat (2.1.2)
failover cluster exporting a drbd 8.2.0 device via iSCSI (iscsitarget
0.4.15).
The systems are RHEL5 with the stock 2.6.18-8.1.14.el5 kernel; drbd and
heartbeat are self-compiled.
The iSCSI target is exported without header and data digests; I will
switch it to crc32c now to rule out the network.
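For reference, an ietd.conf fragment with both digests set to crc32c might look like the following sketch; the target IQN and LUN path are placeholders, not our actual values:

```
Target iqn.2007-10.de.example:storage.gfs0
        Lun 0 Path=/dev/drbd0,Type=blockio
        HeaderDigest CRC32C
        DataDigest CRC32C
```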
> (3) Another possibility is a hardware problem
> with your media--the hard drives, FC switch, HBAs, etc. This could
> happen, for example, if GFS read the bitmap(s) from disk and the disk
> returned the wrong information. We've seen a lot of that, and the best
> thing to do is test the media (but it's a tedious and sometimes
> destructive task).
I wonder if a drbd failover might cause something like this. Manually
switching the resources to the backup node works as follows: the drbd
share is made secondary, iscsi-target is stopped, and the service IP is
removed; on the other node the IP is activated, the drbd device is made
primary, and iscsitarget is started. The clients log a
connection0:0: iscsi: detected conn error (1011)
and everything continues after the switch, which takes 4-5 seconds.
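The switch-over steps above can be sketched as a dry-run script. The drbd resource name (r0), the service IP, and the interface below are placeholder assumptions, not our actual configuration:

```shell
#!/bin/sh
# Dry-run sketch of the manual failover described above.  RUN=echo
# prints each step instead of executing it; clearing RUN would run
# the commands for real.  r0, 10.0.0.10/24 and eth0 are placeholders.
RUN=echo

release_service() {     # on the node giving up the service
    $RUN drbdadm secondary r0
    $RUN /etc/init.d/iscsi-target stop
    $RUN ip addr del 10.0.0.10/24 dev eth0
}

take_over_service() {   # on the node taking over
    $RUN ip addr add 10.0.0.10/24 dev eth0
    $RUN drbdadm primary r0
    $RUN /etc/init.d/iscsi-target start
}

release_service
take_over_service
```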
OTOH, there has been no failover since the gfs filesystem was formatted
and populated, only some scheduled node reboots.
I wonder if broken memory in one of the nodes or the iSCSI servers could
be to blame, but I did not see segfaults or MCEs anywhere. GFS does
direct I/O, and we had iscsitarget set to blockio, which is uncached
direct I/O too.
> If you had not run gfs_fsck, we might have been able to tell a little
> bit more about what happened from the contents of the journals.
> For example, in RHEL5 and equivalent, you can use gfs2_edit to save
> off the file system metadata and send it in for analysis.
> (gfs2_edit can operate on gfs1 file systems as well as gfs2). However,
> since gfs_fsck clears the journals, that information is now long gone.
Oh, thanks for the hint, I'll do this in case this happens again.
Best regards
Frederik Schüler
--
ENOSIG