[Linux-cluster] failure during gfs2_grow caused node crash & data loss

bergman at merctech.com
Sat Mar 20 01:50:58 UTC 2010


I just had a serious problem with gfs2_grow which caused a loss of data and a 
cluster node reboot.

I was attempting to grow a gfs2 volume from 50GB => 145GB. The volume was
mounted on both cluster nodes at the start of running "gfs2_grow". When I
umounted the volume from _one_ node (not the one where gfs2_grow was running),
the machine running gfs2_grow rebooted and the filesystem was left damaged.

The sequence of commands was as follows. Each command was successful until the 
"umount".

node2:
	gfs_fsck /dev/mapper/legacy
	gfs2_convert /dev/mapper/legacy
	fsck.gfs2 /dev/mapper/legacy
	mount /dev/mapper/legacy /shared/legacy
	lvextend vg /dev/mapper/pv_vol10
	
node1:
	mount /dev/mapper/legacy /shared/legacy

node2:
	gfs2_grow /shared/legacy

node1:
	umount /shared/legacy

Immediately after umounting the volume from node1, node2 hung and then
rebooted (probably fenced by node1).

node2:
	upon reboot, /dev/mapper/legacy seemed to be mounted (it was
		listed in /etc/fstab), but "df" still showed the old,
		pre-grow size. I did not examine the contents of
		/shared/legacy; I umounted it and ran fsck.gfs2 on the
		volume.


Running fsck.gfs2 reports:
		Initializing fsck
		Recovering journals (this may take a while)...
		Validating Resource Group index.
		Level 1 RG check.
		(level 1 failed)
		Level 2 RG check.
		L2: number of rgs in the index = 85.
		WARNING: rindex file is corrupt.
		(level 2 failed)
		Level 3 RG check.

		  RG 1 at block 0x11 intact [length 0x3b333]
		  RG 2 at block 0x3B344 intact [length 0x3b32f]
		  RG 3 at block 0x76673 intact [length 0x3b32f]
		  RG 4 at block 0xB19A2 intact [length 0x3b32f]
		  RG 5 at block 0xECCD1 intact [length 0x3b32f]
		  RG 6 at block 0x128000 intact [length 0x3b32f]
		* RG 7 at block 0x16332F *** DAMAGED *** [length 0x3b32f]
		* RG 8 at block 0x19E65E *** DAMAGED *** [length 0x3b32f]
		* RG 9 at block 0x1D998D *** DAMAGED *** [length 0x3b32f]
		* RG 10 at block 0x214CBC *** DAMAGED *** [length 0x3b32f]
		Error: too many bad RGs.
		Error rebuilding rg list.
		(level 3 failed)
		RG recovery impossible; I can't fix this file system.


Attempting to fsck the filesystem with the "experimental" fsck.gfs2
announced by Bob Peterson on Monday produces a segfault as soon as it
reaches the Level 3 RG check.

Running "gfs2_edit savemeta" seemed to work...at least it didn't exit with an 
error, and it produced about 540MB of file of type "GLS_BINARY_MSB_FIRST".
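
For reference, the savemeta invocation has this general form (the output
path below is a placeholder, not the exact filename I used):

	gfs2_edit savemeta /dev/mapper/legacy /tmp/legacy.meta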

Environment:
	CentOS 5.4 (2.6.18-164.11.1.el5)
	gfs2-utils-0.1.62-1.el5
	cman-2.0.115-1.el5_4.9
	lvm2-cluster-2.02.46-8.el5_4.1
	lvm2-2.02.46-8.el5_4.2

Any suggestions or hope of data recovery before I reformat the volume?

Thanks,

Mark




