[Linux-cluster] Cluster1, RHEL4 gfs_fsck

Tue May 20 16:40:14 UTC 2008

On Mon, 2008-05-19 at 16:18 -0400, Wes Young wrote:
> I'm having a little trouble with an older installation of RHEL4,  
> cluster/GFS.
> 
> One of my cluster nodes crashed the other day, when I brought it back  
> up I got the error:
> 
> GFS: Trying to join cluster "lock_dlm", "oss:mydisk"
> GFS: fsid=oss:mydisk.0: Joined cluster. Now mounting FS...
> GFS: fsid=oss:mydisk.0: jid=0: Trying to acquire journal lock...
> GFS: fsid=oss:mydisk.0: jid=0: Looking at journal...
> attempt to access beyond end of device
> sdb: rw=0, want=19149432840, limit=858673152
> GFS: fsid=oss:mydisk.0: fatal: I/O error

Hi Wes,

Sorry for the long post, but this needs some explanation.

From your email, it sounds like you have corruption in your
resource group index file (rindex).  You might be the victim 
of this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=436383

If so, there's a fix to gfs_fsck to repair the damage.  This is
associated with this bug record:
https://bugzilla.redhat.com/show_bug.cgi?id=440896

While working on that bug, I discovered some kinds of
corruption that confuse gfs_fsck's rindex repair code.
That's described in bug: 
https://bugzilla.redhat.com/show_bug.cgi?id=442271

I don't think any of these fixes are generally available
yet, except in patch form; I think they're scheduled for
4.7.  The last one, 442271, is only written against RHEL5
at the moment, so I don't have plans to fix it in RHEL4 yet.

So here's what I recommend:

First, determine for sure if this is the problem by doing
something like this:

mount the file system
gfs_tool rindex /mnt/gfs | grep "4294967292"
(where /mnt/gfs is your mount point)
umount the file system
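
As a rough sketch of that check: the helper below only counts lines
containing the suspect value, reading gfs_tool rindex output on stdin so
it can also be run against a saved copy of the output.  The function name
and the DEVICE and MOUNT_POINT placeholders are made up for illustration;
they are not part of any GFS tool.

```shell
#!/bin/sh
# Sketch only: count_bad_rindex is a hypothetical helper name.
# It counts input lines that contain the suspect value 4294967292.
count_bad_rindex() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when there are no matches (grep returns 1 then).
    grep -c "4294967292" || true
}

# Possible use, as root (DEVICE and MOUNT_POINT are placeholders you
# would set yourself, e.g. MOUNT_POINT=/mnt/gfs):
#   mount -t gfs "$DEVICE" "$MOUNT_POINT"
#   gfs_tool rindex "$MOUNT_POINT" | count_bad_rindex
#   umount "$MOUNT_POINT"
```

A count of 0 corresponds to the "no output" case; anything greater
than 0 means the value was found.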

If it comes back with "ri_data = 4294967292" then that IS the
problem, in which case you need to acquire the fixes to
the first two bugs listed.  You can do this a number of
ways: (1) wait until 4.7 comes out, (2) get the patches from
the bugzilla and build them from the source tree, (3) grab the
RHEL4 branch from the cluster git tree and build from there,
because it should include those two fixes.  IIRC,
the fix to gfs_grow (the original cause of this corruption)
has been released as a z-stream fix for 4.6 too, but I don't
think we did that for gfs_fsck.

If it comes back with no output, then there's a
different kind of corruption in your rindex.
You could try to build a RHEL4 version of the patch
from bug 442271 and see if it fixes your corruption.
Do this at your own risk; we cannot be responsible for
your data.  I recommend making a full backup before trying
anything.  Depending on the size of the file system and
your amount of free storage, you could dd the entire GFS
device to a file you can restore.
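
A minimal sketch of such a dd backup, assuming the file system is
unmounted on every node and the target has enough free space.  The
helper name and the image path are hypothetical; /dev/sdb is only a
guess based on the "sdb" in your log, so check the device name first.

```shell
#!/bin/sh
# Sketch only: backup_device is a hypothetical helper name.
# Copies a block device (or file) to an image file, then verifies the copy.
backup_device() {
    src=$1   # source device or file, e.g. /dev/sdb (assumption)
    img=$2   # destination image file, e.g. /backup/gfs-sdb.img (assumption)

    # Raw copy in 1 MiB blocks; fail if dd reports an error.
    dd if="$src" of="$img" bs=1048576 2>/dev/null || return 1

    # Byte-for-byte comparison of source and image; non-zero on mismatch.
    cmp -s "$src" "$img"
}

# Possible use, as root:
#   backup_device /dev/sdb /backup/gfs-sdb.img
# To restore later, swap the direction:
#   dd if=/backup/gfs-sdb.img of=/dev/sdb bs=1048576
```

The cmp step reads both the device and the image in full, so it
roughly doubles the time taken, but it confirms the image is usable
before you try any repair.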

You could also save off your file system metadata and
put it on an ftp server or web server so I can grab it.
I'd then use it "in the name of 442271" to figure out
whether the most recent patch in the bz fixes the corruption
and, if not, adjust the 442271 patch so it does.
The problem with that is: there is no code in RHEL4 to
do this either.  I built a RHEL4 version of a tool
(gfs2_edit) that can save off your metadata, but I may need
to bring it up to date with recent changes first.
Either way, this might take some time to resolve.

Regards,

Bob Peterson
Red Hat Clustering & GFS