[Linux-cluster] gfs2 filesystem crash with no recovery?

Douglas O'Neal oneal at dbi.udel.edu
Thu Mar 18 18:29:29 UTC 2010


On 03/18/2010 10:04 AM, Steven Whitehouse wrote:
> Hi,
>
> On Thu, 2010-03-18 at 09:18 -0400, Douglas O'Neal wrote:
>   
>> On 03/15/2010 09:55 AM, Douglas O'Neal wrote:
>>     
>>> I have a problem with a gfs2 filesystem that is (was) being mounted 
>>> from a single host.  The system appeared to have hung over the weekend 
>>> so I unmounted and remounted the disk.  After a couple of minutes I 
>>> received this in the kernel logs:
>>>
>>> Mar 15 08:28:50 localhost kernel: GFS2: fsid=: Trying to join cluster "lock_nolock", "sde1"
>>> Mar 15 08:28:50 localhost kernel: GFS2: fsid=sde1.0: Now mounting FS...
>>> Mar 15 08:28:50 localhost kernel: GFS2: fsid=sde1.0: jid=0, already locked for use
>>> Mar 15 08:28:50 localhost kernel: GFS2: fsid=sde1.0: jid=0: Looking at journal...
>>> Mar 15 08:28:50 localhost kernel: GFS2: fsid=sde1.0: jid=0: Done
>>> Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0: fatal: invalid metadata block
>>> Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0:   bh = 4294972166 (type: exp=3, found=2)
>>> Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0:   function = gfs2_rgrp_bh_get, file = fs/gfs2/rgrp.c, line = 759
>>> Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0: about to withdraw this file system
>>> Mar 15 08:43:37 localhost kernel: GFS2: fsid=sde1.0: withdrawn
>>> Mar 15 08:43:37 localhost kernel: Pid: 3687, comm: cp Not tainted 2.6.32-gentoo-r7 #2
>>> Mar 15 08:43:37 localhost kernel: Call Trace:
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa03b285d>] ? gfs2_lm_withdraw+0x12d/0x160 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff813bf22b>] ? io_schedule+0x4b/0x70
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff810cc560>] ? sync_buffer+0x0/0x50
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff813bf7a9>] ? out_of_line_wait_on_bit+0x79/0xa0
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff8104e740>] ? wake_bit_function+0x0/0x30
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff810cb162>] ? submit_bh+0x112/0x140
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa03b2947>] ? gfs2_metatype_check_ii+0x47/0x60 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa03ae40b>] ? gfs2_rgrp_bh_get+0x1db/0x300 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa0397d86>] ? do_promote+0x116/0x200 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa03992a5>] ? finish_xmote+0x1a5/0x3a0 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa0398fcd>] ? do_xmote+0xfd/0x230 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa039986d>] ? gfs2_glock_nq+0x13d/0x320 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa03aea2d>] ? gfs2_inplace_reserve_i+0x1ed/0x7b0 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa0399581>] ? run_queue+0xe1/0x210 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa039986d>] ? gfs2_glock_nq+0x13d/0x320 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffffa03a1f92>] ? gfs2_write_begin+0x272/0x480 [gfs2]
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff8106df04>] ? generic_file_buffered_write+0x114/0x290
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff8106e4a8>] ? __generic_file_aio_write+0x278/0x450
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff8106e6d5>] ? generic_file_aio_write+0x55/0xb0
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff810a6a1b>] ? do_sync_write+0xdb/0x120
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff8104e710>] ? autoremove_wake_function+0x0/0x30
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff8108511f>] ? handle_mm_fault+0x1bf/0x850
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff8108b5cc>] ? mmap_region+0x23c/0x5d0
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff810a752b>] ? vfs_write+0xcb/0x160
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff810a76c3>] ? sys_write+0x53/0xa0
>>> Mar 15 08:43:37 localhost kernel: [<ffffffff8100b2ab>] ? system_call_fastpath+0x16/0x1b
>>>
>>> I unmounted the disk again, but now when I try to fsck the
>>> filesystem I get:
>>> urania# fsck.gfs2 -v /dev/sde1
>>> Initializing fsck
>>> Initializing lists...
>>> Either the super block is corrupted, or this is not a GFS2 filesystem
>>>
>>> The server is running kernel 2.6.32, 64-bit.  The array is a 
>>> Jetstore 516iS with a single 28TB iSCSI volume defined.  The relevant 
>>> line from the fstab is
>>> /dev/sde1        /illumina    gfs2    _netdev,rw,lockproto=lock_nolock
>>>
>>> gfs2_tool isn't much help, nor is gfs2_edit:
>>> urania# gfs2_tool sb /dev/sde1 all
>>> /usr/src/cluster-3.0.7/gfs2/tool/../libgfs2/libgfs2.h: there isn't a GFS2 filesystem on /dev/sde1
>>> urania# gfs2_edit -p sb /dev/sde1
>>> bad seek: Invalid argument from gfs2_load_inode:416: block 3747350044811107074 (0x34014302ee029b02)
>>>
>>> Is there an alternate superblock that I can use to mount the disk to 
>>> at least get the last couple of days of data off of it?
>>>
>>>       
>> Anybody?
>>
>>     
> What version of the userland tools are you using? There has been a
> recent update to fsck designed to solve a number of problems. I've
> never before seen a filesystem so badly corrupted that the super block
> is unrecognisable. The super block is never altered during normal fs
> usage.
>
> Are you 100% certain that this volume was not being accessed by another
> node on the network?
>
> If you can save off the metadata then we can take a look at it. That
> might not be possible with a corrupt superblock though, so an
> alternative is to make it available somehow for us to look at,
>
> Steve.
>
>
The userland tools are version 3.0.7.  The iSCSI array is on a closed network and is 
protected by a CHAP login. No other system has been configured to access 
the array. I have the first 1MB of the disk available at 
http://urania.dbi.udel.edu/sde.block.bz2 if you want to see the actual 
data. gfs2_edit will not pull the metadata off:

urania ~ # gfs2_edit savemeta /dev/sde /tmp/metasave
Segmentation fault
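
In case it helps anyone reproduce this: GFS2 keeps its single superblock at a fixed 64KiB offset from the start of the device, and an intact one begins with the magic bytes 01 16 19 70, so the region can be inspected by hand even when the tools refuse to run. A rough check, assuming the standard on-disk layout rather than anything version-specific:

# assumes the standard GFS2 layout: the lone superblock at byte offset 65536
dd if=/dev/sde1 bs=64k skip=1 count=1 2>/dev/null | hexdump -C | head -4

If those first four bytes read as anything else, the superblock itself has been overwritten, which would explain why every tool bails out immediately. As far as I know GFS2 keeps no alternate superblock copies, so the fixed fields would have to be rebuilt by hand (or perhaps by the updated fsck Steve mentions) before anything else can proceed.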


Doug