[Linux-cluster] failure during gfs2_grow caused node crash & data loss
bergman at merctech.com
Mon Mar 22 16:43:38 UTC 2010
In the message dated: Mon, 22 Mar 2010 09:52:21 EDT,
The pithy ruminations from Bob Peterson on
<Re: [Linux-cluster] failure during gfs2_grow caused node crash & data loss> were:
=> ----- bergman at merctech.com wrote:
=> | I just had a serious problem with gfs2_grow which caused a loss of
=> | data and a
=> | cluster node reboot.
=> |
=> | I was attempting to grow a gfs2 volume from 50GB => 145GB. The volume
=> | was
=> | mounted on both cluster nodes at the start of running "gfs2_grow".
=> | When I
=> | umounted the volume from _one_ node (not where gfs2_grow was running),
=> | the
=> | machine running gfs2_grow rebooted and the filesystem is damaged.
=> |
=> | The sequence of commands was as follows. Each command was successful
=> | until the
=> | "umount".
=> (snip)
=> | Mark
=>
=> Hi Mark,
Thanks for getting back to me.
=>
=> There's a good chance this was caused by bugzilla bug #546683 which
=> is scheduled to be released in 5.5. However, I've also seen some
=> problems like this when a logical volume in LVM isn't marked as
=> clustered. Make sure it is with the "vgs" command (check if the flags
=> end with a "c") and if not, do vgchange -cy <volgrp>
Yes, the volume group is clustered (it contains 5 other filesystems, some of
which are gfs2 clustered) and works fine.
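For reference, I checked it roughly like this (global_vg is the VG that holds
this LV; the attr string does end in "c"):

    # show the VG attributes; a trailing "c" in vg_attr means clustered
    vgs -o vg_name,vg_attr global_vg
    # if it weren't clustered, this would mark it:
    #   vgchange -cy global_vg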
=>
=> As for fsck.gfs2, it should never segfault. IMHO, this is a bug
=> so please open a bugzilla record: Product: "Red Hat Enterprise Linux 5"
Can I paraphrase that when I talk to our developers? I've been trying to
convince them that (in most cases) segfault == bug. :)
=> and component "gfs2-utils". Assign it to me.
Will do...once my Bugzilla account is set up.
=>
=> As for recovering your volume, you can try this but it's not guaranteed
=> to work:
=> (1) Reduce the volume to its size from before the gfs2_grow.
That claims to be successful. The 'lvs' command shows the volume at its
previous size.
=> (2) Mount it from one node only, if you can (it may crash).
I'm unable to mount the volume:
/sbin/mount.gfs2: error mounting /dev/mapper/global_vg-legacy on /legacy: No such file or directory
An fsck.gfs2 at this point reports:
Initializing fsck
Recovering journals (this may take a while)...
Journal recovery complete.
Validating Resource Group index.
Level 1 RG check.
(level 1 failed)
Level 2 RG check.
L2: number of rgs in the index = 85.
WARNING: rindex file is corrupt.
(level 2 failed)
Level 3 RG check.
RG 1 at block 0x11 intact [length 0x3b333]
RG 2 at block 0x3B344 intact [length 0x3b32f]
RG 3 at block 0x76673 intact [length 0x3b32f]
RG 4 at block 0xB19A2 intact [length 0x3b32f]
RG 5 at block 0xECCD1 intact [length 0x3b32f]
RG 6 at block 0x128000 intact [length 0x3b32f]
* RG 7 at block 0x16332F *** DAMAGED *** [length 0x3b32f]
* RG 8 at block 0x19E65E *** DAMAGED *** [length 0x3b32f]
* RG 9 at block 0x1D998D *** DAMAGED *** [length 0x3b32f]
* RG 10 at block 0x214CBC *** DAMAGED *** [length 0x3b32f]
Error: too many bad RGs.
Error rebuilding rg list.
(level 3 failed)
RG recovery impossible; I can't fix this file system.
=> (3) If it lets you mount it, run gfs2_grow again.
=> (4) Unmount the volume.
=> (5) Mount the volume from both nodes.
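Just to make sure I'm reading that sequence right, in rough commands it would
be something like this (assuming the LV is /dev/global_vg/legacy, the mount
point is /legacy, and 50G really was the pre-grow size):

    lvreduce -L 50G /dev/global_vg/legacy          # (1) back to the pre-grow size
                                                   #     (lvreduce prompts before shrinking)
    mount -t gfs2 /dev/global_vg/legacy /legacy    # (2) on one node only
    gfs2_grow /legacy                              # (3) re-run the grow
    umount /legacy                                 # (4)
    # (5) then mount the volume from both nodes as usual

As noted above, step (2) is where I'm stuck right now.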
=>
=> If that doesn't work or if the system can't properly mount the volume
=> your choices are either (1) reformat the volume and restore from
I figured I'd have to do that...so I'll keep working through the alternatives
first.
=> backup, (2) Use gfs2_edit to patch the i_size field of the rindex file
Do you mean "di_size"?
=> to be a fairly small multiple of 96 then repeat steps 1 through 4.
According to "gfs2_edit -p rindex", the initial value of di_size is:
di_size 8192 0x2000
Does that give any indication of an appropriate "fairly small multiple"?
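If I'm reading the rindex layout right, each entry is 96 bytes, so with fsck
reporting only the first 6 RGs intact I'm guessing something on the order of
6 x 96 = 576 might be the value to try? (The current 8192 isn't itself a
multiple of 96, which I assume doesn't help.)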
Thanks,
Mark
=>
=> Regards,
=>
=> Bob Peterson
=> Red Hat File Systems
=>