[Linux-cluster] CLVM/GFS hangs while rebuilding an AoE RAID5 LUN, anyone experienced that before? any clue?

Abraham Alawi a.alawi at auckland.ac.nz
Tue Jul 6 23:57:58 UTC 2010


Thanks Steve. Yes, sadly I can confirm the file system has been corrupted, but I still don't understand why I/O stops flowing at the LVM level (and doesn't fence it either), or why fsck keeps crashing without a useful error message. Is there any signal I can send to gfs_fsck to bypass certain stages?

Also, to speed up the fsck process, I was thinking of making use of the RAM and increasing the read-ahead parameter (hdparm -a) of the PV device (an AoE device) by 1GB, since that should greatly speed up sequential reads, and fsck is mostly a sequential read process with very little writing. What do you think?
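
As a rough sketch of the arithmetic behind that read-ahead idea: hdparm -a takes its value as a count of 512-byte sectors, so a size in bytes has to be converted first. The Python below only does that conversion and prints a candidate command line. The helper name is made up for illustration, the device path is a placeholder for the AoE PV, and whether the kernel honours a read-ahead this large on an AoE device is untested here.

```python
SECTOR_BYTES = 512  # hdparm -a (like blockdev --setra) counts 512-byte sectors


def readahead_sectors(size_bytes: int) -> int:
    """Convert a read-ahead size in bytes to whole 512-byte sectors."""
    if size_bytes < 0:
        raise ValueError("read-ahead size must not be negative")
    return size_bytes // SECTOR_BYTES


one_gib = 1024 ** 3
sectors = readahead_sectors(one_gib)

# /dev/etherd/eX.Y is a placeholder, not the real PV path.
print(f"hdparm -a {sectors} /dev/etherd/eX.Y")
```

Running hdparm -a with no value on the same device prints the current setting, which is worth noting down before changing anything.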

Here is the tail of the last fsck log file:
(metawalk.c:516)	Extended attributes exist for inode #34020861.
(metawalk.c:413)	Checking EA leaf block #34020862.
(pass1.c:485)	Setting block #34020862 to eattr block
(pass1.c:907)	Checking metadata block 34020862
(pass1.c:923)	Metadata block 34020862 not an inode or free metadata
(pass1.c:907)	Checking metadata block 34020863
(link.c:22)	Setting link count to 1 for 34020863
(metawalk.c:516)	Extended attributes exist for inode #34020863.
(metawalk.c:413)	Checking EA leaf block #34020864.
(pass1.c:485)	Setting block #34020864 to eattr block
(pass1.c:907)	Checking metadata block 34020864
(pass1.c:923)	Metadata block 34020864 not an inode or free metadata
(pass1.c:907)	Checking metadata block 34020865
(link.c:22)	Setting link count to 1 for 34020865
(pass1.c:213)	Setting 34020917 to data block
(pass1.c:213)	Setting 34020918 to data block
(pass1.c:213)	Setting 34020919 to data block
(pass1.c:213)	Setting 34020920 to data block
(metawalk.c:516)	Extended attributes exist for inode #34020865.
(metawalk.c:413)	Checking EA leaf block #34020866.
(pass1.c:485)	Setting block #34020866 to eattr block
(pass1.c:907)	Checking metadata block 34020866
(pass1.c:923)	Metadata block 34020866 not an inode or free metadata
(pass1.c:907)	Checking metadata block 34020867
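
To make a log like this easier to scan, a small Python sketch along these lines could tally which source locations emit messages, list the blocks reported as "not an inode or free metadata", and show the last block that was being checked when the log ends. The regular expressions are assumptions based only on the lines shown above, not on the gfs_fsck source, and the function name is hypothetical.

```python
import re
from collections import Counter

# "(file.c:NNN)<whitespace>message", as seen in the log tail above.
LINE_RE = re.compile(r"^\((?P<src>[\w.]+):(?P<lineno>\d+)\)\s+(?P<msg>.*)$")
BAD_RE = re.compile(r"^Metadata block (\d+) not an inode or free metadata$")
CHECK_RE = re.compile(r"^Checking metadata block (\d+)$")


def summarize(lines):
    """Return (messages per source location, flagged blocks, last block checked)."""
    per_source = Counter()
    flagged = []
    last_checked = None
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip anything that is not a debug line
        per_source[f"{m.group('src')}:{m.group('lineno')}"] += 1
        msg = m.group("msg")
        bad = BAD_RE.match(msg)
        if bad:
            flagged.append(int(bad.group(1)))
        chk = CHECK_RE.match(msg)
        if chk:
            last_checked = int(chk.group(1))
    return per_source, flagged, last_checked


# Last three lines of the log above, used as sample input.
sample = [
    "(pass1.c:907)\tChecking metadata block 34020866",
    "(pass1.c:923)\tMetadata block 34020866 not an inode or free metadata",
    "(pass1.c:907)\tChecking metadata block 34020867",
]
per_source, flagged, last_checked = summarize(sample)
print("flagged blocks:", flagged)
print("last block checked:", last_checked)
```

For a full log, passing an open file object to summarize() in place of the sample list would work the same way, since it only iterates over lines.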


Thanks,

  -- Abraham

On 6/07/2010, at 8:22 PM, Steven Whitehouse wrote:

> Hi,
> 
> It looks to me as if the fs is corrupt in some manner. Try unmounting on
> all nodes and running fsck on one node on the filesystem. Make sure you
> save the output of fsck in case that is useful for future debugging and
> make sure you have a backup of the data in question first.
> 
> It's tricky to say exactly what might have gone wrong (the fsck output
> might give a clue) but you will certainly need fsck to fix whatever the
> problem is.
> 
> Steve.
> 
> On Tue, 2010-07-06 at 13:22 +1200, Abraham Alawi wrote:
>> The system was running well for a while, but recently we had a flaky disk in the RAID array, which we replaced with a healthy one. Suddenly CLVM/GFS became unusable: we can mount GFS, but listing it recursively with 'ls -R' hangs with an Input/output error, and we can't even read the cLVM LUN directly using 'dd', although we can still read the LVM PV devices using 'dd'. Reconfiguring the LVM volume as a local one and accessing it exclusively from one node doesn't make a difference.
>> 
>> RHEL5: 2.6.18-164.11.1.el5
>> # modinfo gfs
>> filename:       /lib/modules/2.6.18-164.11.1.el5/weak-updates/gfs/gfs.ko
>> license:        GPL
>> author:         Red Hat, Inc.
>> description:    Global File System 0.1.34-2.el5
>> srcversion:     3B1BAC4069F1A4B556A958A
>> depends:        dlm
>> vermagic:       2.6.18-159.el5 SMP mod_unload gcc-4.1
>> 
>> # uname -r
>> 2.6.18-164.11.1.el5
>> 
>> # modinfo /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
>> filename:       /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
>> description:    AoE block/char driver for 2.6.2 and newer 2.6 kernels
>> author:         Sam Hopkins <sah at coraid.com>
>> license:        GPL
>> srcversion:     42BF122979AC807F2BB50E6
>> depends:        
>> vermagic:       2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
>> parm:           aoe_iflist:aoe_iflist=dev1[,dev2...]
>> (string)
>> parm:           version:aoe module version 74
>> (string)
>> parm:           aoe_dyndevs:Use dynamic minor numbers for devices. (int)
>> parm:           aoe_deadsecs:After aoe_deadsecs seconds, give up and fail dev. (int)
>> parm:           aoe_maxout:Only aoe_maxout outstanding packets for every MAC on eX.Y. (int)
>> parm:           aoe_maxsectors:When nonzero, set the maximum number of sectors per I/O request in new devices. (int)
>> 
>> # modinfo dlm
>> filename:       /lib/modules/2.6.18-164.11.1.el5/kernel/fs/dlm/dlm.ko
>> license:        GPL
>> author:         Red Hat, Inc.
>> description:    Distributed Lock Manager
>> srcversion:     E768995007648CA8DB078AE
>> depends:        configfs
>> vermagic:       2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
>> module_sig:	883f3504b56fe19c59c69348c13cf1f1126a509f6ddaee3965ee8b5fcd04163669647a889a9801e09f722187d1de068c0d52cd2b99bc3d475cb6ca1a0
>> 
>> 
>> 
>> Here is what the kernel spits out:
>> 
>> Jul  6 11:27:36 kiwiland kernel: GFS 0.1.34-2.el5 (built Sep  9 2009 06:54:42) installed
>> Jul  6 11:27:36 kiwiland kernel: Lock_DLM (built Sep  9 2009 06:54:38) installed
>> Jul  6 11:27:36 kiwiland kernel: Lock_Nolock (built Sep  9 2009 06:54:37) installed
>> Jul  6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm", "FSC:files"
>> Jul  6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Trying to acquire journal lock...
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Looking at journal...
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Acquiring the transaction lock...
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replaying journal...
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replayed 0 of 11 blocks
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: replays = 0, skips = 4, sames = 7
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Journal replayed in 1s
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Done
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Trying to acquire journal lock...
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Looking at journal...
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Done
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Scanning for log elements...
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found 2 unlinked inodes
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found quota changes for 2 IDs
>> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Done
>> Jul  6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm", "FSC:webcluster"
>> Jul  6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
>> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Trying to acquire journal lock...
>> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Looking at journal...
>> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Done
>> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Scanning for log elements...
>> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found 0 unlinked inodes
>> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found quota changes for 0 IDs
>> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Done
>> Jul  6 11:27:37 kiwiland kernel: Installing knfsd (copyright (C) 1996 okir at monad.swb.de).
>> Jul  6 11:27:39 kiwiland kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
>> Jul  6 11:27:39 kiwiland kernel: NFSD: starting 90-second grace period
>> Jul  6 11:32:21 kiwiland kernel: dlm: closing connection to node 1
>> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Trying to acquire journal lock...
>> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block
>> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   bh = 1432543247 (magic)
>> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   function = gfs_rgrp_read
>> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/rgrp.c, line = 830
>> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   time = 1278372781
>> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: about to withdraw from the cluster
>> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: telling LM to withdraw
>> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Looking at journal...
>> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Acquiring the transaction lock...
>> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Replaying journal...
>> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Replayed 0 of 0 blocks
>> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: replays = 0, skips = 0, sames = 0
>> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Journal replayed in 1s
>> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Done
>> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:files.0: withdrawn
>> Jul  6 11:33:02 kiwiland kernel: 
>> Jul  6 11:33:02 kiwiland kernel: Call Trace:
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff88805018>] :gfs:gfs_lm_withdraw+0xc4/0xd3
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff80063a36>] __wait_on_bit+0x60/0x6e
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8001538b>] sync_buffer+0x0/0x3f
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8881cc97>] :gfs:gfs_meta_check_ii+0x32/0x3e
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff88819439>] :gfs:gfs_rgrp_read+0x139/0x225
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff887fb8e8>] :gfs:glock_wait_internal+0x229/0x2c3
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff887fbd17>] :gfs:gfs_glock_nq+0x395/0x3d6
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff887fbd6e>] :gfs:gfs_glock_nq_init+0x16/0x2a
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff88817466>] :gfs:gfs_rgrp_lvb_init+0x1e/0x3f
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8881a46f>] :gfs:gfs_stat_gfs+0x213/0x273
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8881353d>] :gfs:gfs_statfs+0x67/0xea
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff800deba3>] vfs_statfs+0x63/0x7f
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886d2ce>] :nfsd:nfsd_statfs+0x28/0x38
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff888745f8>] :nfsd:nfsd3_proc_fsstat+0x3f/0x54
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a1db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff886e0529>] :sunrpc:svc_process+0x454/0x71b
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff80064644>] __down_read+0x12/0x92
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a746>] :nfsd:nfsd+0x1a5/0x2cb
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
>> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
>> Jul  6 11:33:02 kiwiland kernel: 
>> 
>> 
>> Another kernel log excerpt:
>> Jul  5 02:01:19 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start time = 1278252079
>> Jul  5 03:01:16 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start time = 1278255676
>> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block
>> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   bh = 86700288 (magic)
>> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   function = gfs_get_meta_buffer
>> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/dio.c, line = 1225
>> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   time = 1278255737
>> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: about to withdraw from the cluster
>> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: telling LM to withdraw
>> Jul  5 03:02:21 Hercules kernel: GFS: fsid=FSC:files.0: withdrawn
>> Jul  5 03:02:21 Hercules kernel: 
>> Jul  5 03:02:21 Hercules kernel: Call Trace:
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8880a018>] :gfs:gfs_lm_withdraw+0xc4/0xd3
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8001538b>] sync_buffer+0x0/0x3f
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88821c97>] :gfs:gfs_meta_check_ii+0x32/0x3e
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff887f7717>] :gfs:gfs_get_meta_buffer+0x1d1/0x247
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88804193>] :gfs:gfs_copyin_dinode+0x1d/0x12f
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88800d6e>] :gfs:gfs_glock_nq_init+0x16/0x2a
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff888043e3>] :gfs:inode_create+0x13e/0x1df
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88804a5d>] :gfs:gfs_inode_get+0x9d/0xba
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff888053bb>] :gfs:gfs_lookupi+0x33d/0x3df
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff887fce57>] :gfs:ea_find_i+0x0/0x6b
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff888172af>] :gfs:gfs_lookup+0x363/0x41a
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff80025426>] igrab+0x25/0x34
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff888055a0>] :gfs:gfs_iget+0x3d/0x1f1
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88801224>] :gfs:gfs_glock_dq+0x13c/0x14b
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000cf01>] do_lookup+0xe5/0x1e6
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000a22b>] __link_path_walk+0xa01/0xf42
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000e9cc>] link_path_walk+0x42/0xb2
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000cc9c>] do_path_lookup+0x275/0x2f1
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff80012752>] getname+0x15b/0x1c2
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff800236ba>] __user_walk_fd+0x37/0x4c
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8003f235>] vfs_lstat_fd+0x18/0x47
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8002a95a>] sys_newlstat+0x19/0x31
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
>> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8005d116>] system_call+0x7e/0x83
>> 
>> 
>> Thanks in advance,
>> 
>>  -- Abraham
>> 
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>> Abraham Alawi
>> 
>> Unix/Linux Systems Administrator
>> Science IT
>> University of Auckland
>> e: a.alawi at auckland.ac.nz
>> p: +64-9-373 7599, ext#: 87572
>> 
>> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>> 
>> 
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster





