[Linux-cluster] Re: GFS2 corruption/withdrawal/crash

Mon Aug 10 14:05:29 UTC 2009

----- "Steven Whitehouse" <swhiteho at redhat.com> wrote:
| Hi,
| 
| On Sat, 2009-08-08 at 19:19 -0400, Wendell Dingus wrote:
| > Well, I just ran fsck.gfs2 against this filesystem twice with a
| > 10-minute pause between them. As such:
| > # fsck -C -t gfs2 -y /dev/mapper/VGIMG0-LVIMG0
| > 
| > Output of second run:
| > fsck 1.39 (29-May-2006)
| > Initializing fsck
| > Recovering journals (this may take a while)...
| > Journal recovery complete.
| > Validating Resource Group index.
| > Level 1 RG check.
| > (level 1 passed)
| > Starting pass1
| > Pass1 complete
| > Starting pass1b
| > Pass1b complete
| > Starting pass1c
| > Pass1c complete
| > Starting pass2
| > Pass2 complete
| > Starting pass3
| > Pass3 complete
| > Starting pass4
| > Pass4 complete
| > Starting pass5
| > Unlinked block found at block 37974707 (0x24372b3), left unchanged.
| > ..snip about 30 total of these..
| > Unlinked block found at block 96603710 (0x5c20e3e), left unchanged.
| > Pass5 complete
| > Writing changes to disk
| > gfs2_fsck complete
| > 
| > When it was done I remounted the filesystem and tried to
| > "rm -rf /raid1/bad" which is a subdir in the root of this
| > filesystem that contains the zero-byte file that was the focal
| > point of this grief to start with.
| > 
| > Results:
| > 
| That looks like a bug in fsck at least as it should be dealing with
| the unlinked blocks that it finds, not ignoring them. Chances are
| that the block which is causing the issues belongs to one of the
| unlinked blocks (inodes I think it should say).
| 
| Steve.

Hi,

The "Unlinked block found...left unchanged." messages are harmless.
They merely mean that fsck.gfs2 found some blocks marked as
"unlinked metadata", which gfs2's kernel code should automatically
reassign when needed.  At some point, we decided not to fix the
bitmaps in fsck.gfs2, for various reasons.  I don't remember the
details, but I do remember discussing it.  Lately I've been thinking
that we made the wrong decision and that I should make fsck.gfs2 fix
them rather than ignore them.

In theory, those blocks should not have caused the kernel withdraw
problem you saw.  Do any of the blocks in the fsck output correspond
to the block complained about in the kernel withdraw output?  That
might be an important clue as to what happened.
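
If the fsck output is long, that comparison could be scripted. What
follows is only a rough sketch, assuming the output was saved to a text
file and that every line of interest has the "Unlinked block found at
block N (0xHEX)" form shown above. The function names are made up for
illustration, and the withdraw block number has to be copied by hand
from the kernel log, since that message is not shown in this thread.

```python
import re

# Matches fsck.gfs2 lines such as:
#   Unlinked block found at block 37974707 (0x24372b3), left unchanged.
# search() is used so that mail quoting prefixes like "| > " are ignored.
UNLINKED_RE = re.compile(r"Unlinked block found at block (\d+) \(0x[0-9a-fA-F]+\)")


def unlinked_blocks(fsck_output):
    """Return the set of block numbers reported as unlinked."""
    blocks = set()
    for line in fsck_output.splitlines():
        match = UNLINKED_RE.search(line)
        if match:
            blocks.add(int(match.group(1)))
    return blocks


def block_was_reported(fsck_output, withdraw_block):
    """True if the block from the withdraw message is among the unlinked ones.

    withdraw_block is a string, either decimal ("37974707") or
    0x-prefixed hex ("0x24372b3"); int(..., 0) accepts both.
    """
    return int(withdraw_block, 0) in unlinked_blocks(fsck_output)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_blocks.py fsck.log 0x24372b3
    with open(sys.argv[1]) as log:
        reported = block_was_reported(log.read(), sys.argv[2])
    print("match" if reported else "no match")
```

Accepting both decimal and hex is deliberate, since fsck.gfs2 prints
each block both ways and the kernel log may use either.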

Regards,

Bob Peterson
Red Hat File Systems