[Linux-cluster] failure during gfs2_grow caused node crash & data loss
bergman at merctech.com
Sat Mar 20 01:50:58 UTC 2010
I just had a serious problem with gfs2_grow which caused a loss of data and a
cluster node reboot.
I was attempting to grow a gfs2 volume from 50GB => 145GB. The volume was
mounted on both cluster nodes at the start of running "gfs2_grow". When I
umounted the volume from _one_ node (not the one where gfs2_grow was running), the
machine running gfs2_grow rebooted and the filesystem is now damaged.
The sequence of commands was as follows. Each command was successful until the
"umount".
node2:
gfs_fsck /dev/mapper/legacy
gfs2_convert /dev/mapper/legacy
fsck.gfs2 /dev/mapper/legacy
mount /dev/mapper/legacy /shared/legacy
lvextend vg /dev/mapper/pv_vol10
node1:
mount /dev/mapper/legacy /shared/legacy
node2:
gfs2_grow /shared/legacy
node1:
umount /shared/legacy
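
For reference, the resize procedure as I understood it from the docs is roughly
the following (volume group, LV name, and size here are placeholders, not the
exact commands I ran above):

```shell
# Sketch of the intended online-grow procedure; names and size are
# placeholders for illustration only.
lvextend -L 145G /dev/vg/legacy    # extend the LV under the filesystem
gfs2_grow /shared/legacy           # grow gfs2 while it is mounted
df -h /shared/legacy               # confirm the new size
```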
Immediately after umounting the volume from node1, node2 hung and then rebooted
(probably after being fenced by node1).
node2:
Upon reboot, /dev/mapper/legacy appeared to be mounted (it is
listed in /etc/fstab), but "df" still showed the previous size.
I did not examine the contents of /shared/legacy;
I umounted it and attempted to run fsck.gfs2 on the volume.
Running fsck.gfs2 reports:
Initializing fsck
Recovering journals (this may take a while)...
Validating Resource Group index.
Level 1 RG check.
(level 1 failed)
Level 2 RG check.
L2: number of rgs in the index = 85.
WARNING: rindex file is corrupt.
(level 2 failed)
Level 3 RG check.
RG 1 at block 0x11 intact [length 0x3b333]
RG 2 at block 0x3B344 intact [length 0x3b32f]
RG 3 at block 0x76673 intact [length 0x3b32f]
RG 4 at block 0xB19A2 intact [length 0x3b32f]
RG 5 at block 0xECCD1 intact [length 0x3b32f]
RG 6 at block 0x128000 intact [length 0x3b32f]
* RG 7 at block 0x16332F *** DAMAGED *** [length 0x3b32f]
* RG 8 at block 0x19E65E *** DAMAGED *** [length 0x3b32f]
* RG 9 at block 0x1D998D *** DAMAGED *** [length 0x3b32f]
* RG 10 at block 0x214CBC *** DAMAGED *** [length 0x3b32f]
Error: too many bad RGs.
Error rebuilding rg list.
(level 3 failed)
RG recovery impossible; I can't fix this file system.
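
As an aside, the RG starting blocks in the level-3 listing are evenly spaced, so
the printed values themselves look arithmetically consistent, damaged or not.
This just replays the arithmetic from the listing above; it is not a recovery
tool:

```python
# RG starting blocks exactly as printed by fsck.gfs2 above (RG 1..10).
offsets = [0x11, 0x3B344, 0x76673, 0xB19A2, 0xECCD1,
           0x128000, 0x16332F, 0x19E65E, 0x1D998D, 0x214CBC]

# RG 1 has length 0x3b333; the remaining RGs have length 0x3b32f.
lengths = [0x3B333] + [0x3B32F] * 9

# Each RG should start exactly one RG-length after the previous one.
for prev, nxt, length in zip(offsets, offsets[1:], lengths):
    assert nxt - prev == length, hex(nxt)

print("all printed RG offsets match the printed RG lengths")
```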
Attempting to fsck the filesystem with the "experimental" version
announced by Bob Peterson on Monday produces a segfault as soon as it
gets to the Level 3 RG check.
Running "gfs2_edit savemeta" seemed to work...at least it didn't exit with an
error, and it produced an approximately 540MB file of type "GLS_BINARY_MSB_FIRST".
Environment:
CentOS 5.4 (2.6.18-164.11.1.el5)
gfs2-utils-0.1.62-1.el5
cman-2.0.115-1.el5_4.9
lvm2-cluster-2.02.46-8.el5_4.1
lvm2-2.02.46-8.el5_4.2
Any suggestions or hope of data recovery before I reformat the volume?
Thanks,
Mark