[Linux-cluster] CLVM/GFS hangs while rebuilding an AoE RAID5 LUN, anyone experienced that before? any clue?

Steven Whitehouse swhiteho at redhat.com
Tue Jul 6 08:22:01 UTC 2010


Hi,

It looks to me as if the fs is corrupt in some manner. Try unmounting it
on all nodes and then running fsck on the filesystem from a single node.
Save the fsck output in case it is useful for future debugging, and make
sure you have a backup of the data in question first.
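
As a rough sketch of that sequence (not a tested procedure), the Python
snippet below only prints the commands in the order they would be run.
gfs_fsck is the fsck tool for GFS on RHEL5; the device path, mount point
and log file name are placeholders, and the read-only "-n" pass before
the repairing "-y" pass is an extra precaution assumed here.

```python
import shlex


def fsck_plan(device, mount_point, log_path="gfs_fsck.log"):
    """Return the shell commands, in order, for an offline GFS check.

    All three arguments are hypothetical placeholders, not values
    taken from this thread.
    """
    dev, mnt, log = (shlex.quote(s) for s in (device, mount_point, log_path))
    return [
        # Run on EVERY node first: the fs must not be mounted anywhere.
        f"umount {mnt}",
        # Then on ONE node only: read-only pass ("-n" answers no to every
        # repair prompt), keeping the output for later debugging.
        f"gfs_fsck -n {dev} 2>&1 | tee {log}",
        # Only once a backup exists: repair pass ("-y" answers yes to all),
        # appending to the same log.
        f"gfs_fsck -y {dev} 2>&1 | tee -a {log}",
    ]


if __name__ == "__main__":
    for cmd in fsck_plan("/dev/VG/LV", "/mnt/gfs"):
        print(cmd)
```

Printing the plan rather than executing it is deliberate: the unmount
has to happen on every node, so the steps cannot be driven blindly from
one host.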

It's tricky to say exactly what might have gone wrong (the fsck output
might give a clue), but you will certainly need fsck to fix whatever the
problem is.

Steve.

On Tue, 2010-07-06 at 13:22 +1200, Abraham Alawi wrote:
> The system had been running well for a while, but we recently had a flaky disk in the RAID array, which we replaced with a healthy one. Since then CLVM/GFS has been unusable: we can mount the GFS filesystem, but a recursive listing ('ls -R') hangs with an Input/output error. We cannot even read the CLVM LUN directly with 'dd', although we can still read the LVM PV devices with 'dd'. Reconfiguring the LVM volume as a local one and accessing it exclusively from one node makes no difference.
> 
> RHEL5: 2.6.18-164.11.1.el5
> # modinfo gfs
> filename:       /lib/modules/2.6.18-164.11.1.el5/weak-updates/gfs/gfs.ko
> license:        GPL
> author:         Red Hat, Inc.
> description:    Global File System 0.1.34-2.el5
> srcversion:     3B1BAC4069F1A4B556A958A
> depends:        dlm
> vermagic:       2.6.18-159.el5 SMP mod_unload gcc-4.1
> 
> # uname -r
> 2.6.18-164.11.1.el5
> 
> # modinfo /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
> filename:       /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
> description:    AoE block/char driver for 2.6.2 and newer 2.6 kernels
> author:         Sam Hopkins <sah at coraid.com>
> license:        GPL
> srcversion:     42BF122979AC807F2BB50E6
> depends:        
> vermagic:       2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
> parm:           aoe_iflist:aoe_iflist=dev1[,dev2...]
>  (string)
> parm:           version:aoe module version 74
>  (string)
> parm:           aoe_dyndevs:Use dynamic minor numbers for devices. (int)
> parm:           aoe_deadsecs:After aoe_deadsecs seconds, give up and fail dev. (int)
> parm:           aoe_maxout:Only aoe_maxout outstanding packets for every MAC on eX.Y. (int)
> parm:           aoe_maxsectors:When nonzero, set the maximum number of sectors per I/O request in new devices. (int)
> 
> # modinfo dlm
> filename:       /lib/modules/2.6.18-164.11.1.el5/kernel/fs/dlm/dlm.ko
> license:        GPL
> author:         Red Hat, Inc.
> description:    Distributed Lock Manager
> srcversion:     E768995007648CA8DB078AE
> depends:        configfs
> vermagic:       2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
> module_sig:	883f3504b56fe19c59c69348c13cf1f1126a509f6ddaee3965ee8b5fcd04163669647a889a9801e09f722187d1de068c0d52cd2b99bc3d475cb6ca1a0
> 
> 
> 
> Here is what the kernel logs:
> 
> Jul  6 11:27:36 kiwiland kernel: GFS 0.1.34-2.el5 (built Sep  9 2009 06:54:42) installed
> Jul  6 11:27:36 kiwiland kernel: Lock_DLM (built Sep  9 2009 06:54:38) installed
> Jul  6 11:27:36 kiwiland kernel: Lock_Nolock (built Sep  9 2009 06:54:37) installed
> Jul  6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm", "FSC:files"
> Jul  6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Trying to acquire journal lock...
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Looking at journal...
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Acquiring the transaction lock...
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replaying journal...
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replayed 0 of 11 blocks
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: replays = 0, skips = 4, sames = 7
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Journal replayed in 1s
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Done
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Trying to acquire journal lock...
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Looking at journal...
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Done
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Scanning for log elements...
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found 2 unlinked inodes
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found quota changes for 2 IDs
> Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Done
> Jul  6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm", "FSC:webcluster"
> Jul  6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Trying to acquire journal lock...
> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Looking at journal...
> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Done
> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Scanning for log elements...
> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found 0 unlinked inodes
> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found quota changes for 0 IDs
> Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Done
> Jul  6 11:27:37 kiwiland kernel: Installing knfsd (copyright (C) 1996 okir at monad.swb.de).
> Jul  6 11:27:39 kiwiland kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> Jul  6 11:27:39 kiwiland kernel: NFSD: starting 90-second grace period
> Jul  6 11:32:21 kiwiland kernel: dlm: closing connection to node 1
> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Trying to acquire journal lock...
> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block
> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   bh = 1432543247 (magic)
> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   function = gfs_rgrp_read
> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/rgrp.c, line = 830
> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   time = 1278372781
> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: about to withdraw from the cluster
> Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: telling LM to withdraw
> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Looking at journal...
> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Acquiring the transaction lock...
> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Replaying journal...
> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Replayed 0 of 0 blocks
> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: replays = 0, skips = 0, sames = 0
> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Journal replayed in 1s
> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Done
> Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:files.0: withdrawn
> Jul  6 11:33:02 kiwiland kernel: 
> Jul  6 11:33:02 kiwiland kernel: Call Trace:
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff88805018>] :gfs:gfs_lm_withdraw+0xc4/0xd3
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff80063a36>] __wait_on_bit+0x60/0x6e
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8001538b>] sync_buffer+0x0/0x3f
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8881cc97>] :gfs:gfs_meta_check_ii+0x32/0x3e
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff88819439>] :gfs:gfs_rgrp_read+0x139/0x225
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff887fb8e8>] :gfs:glock_wait_internal+0x229/0x2c3
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff887fbd17>] :gfs:gfs_glock_nq+0x395/0x3d6
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff887fbd6e>] :gfs:gfs_glock_nq_init+0x16/0x2a
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff88817466>] :gfs:gfs_rgrp_lvb_init+0x1e/0x3f
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8881a46f>] :gfs:gfs_stat_gfs+0x213/0x273
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8881353d>] :gfs:gfs_statfs+0x67/0xea
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff800deba3>] vfs_statfs+0x63/0x7f
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886d2ce>] :nfsd:nfsd_statfs+0x28/0x38
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff888745f8>] :nfsd:nfsd3_proc_fsstat+0x3f/0x54
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a1db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff886e0529>] :sunrpc:svc_process+0x454/0x71b
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff80064644>] __down_read+0x12/0x92
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a746>] :nfsd:nfsd+0x1a5/0x2cb
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
> Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> Jul  6 11:33:02 kiwiland kernel: 
> 
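The withdraw report above carries a few fields that are worth pulling
out before running fsck: the buffer number, the function and source line
that rejected it, and the time. A minimal Python sketch for that follows.
The field layout is taken from the log lines quoted above; reading "bh"
as the block number of the rejected buffer is an assumption, and the
helper name parse_withdraw is made up for illustration.

```python
import re
from datetime import datetime, timezone

# Matches the indented "key = value" lines of a GFS withdraw report.
FIELD = re.compile(
    r"GFS: fsid=(?P<fsid>\S+):\s+(?P<key>bh|function|file|time) = (?P<val>.+)$"
)

SAMPLE = """\
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   bh = 1432543247 (magic)
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   function = gfs_rgrp_read
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/rgrp.c, line = 830
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   time = 1278372781
"""


def parse_withdraw(lines):
    """Collect the fields of one withdraw report into a dict."""
    report = {}
    for line in lines:
        m = FIELD.search(line)
        if not m:
            continue  # e.g. the "fatal: invalid metadata block" header
        report["fsid"] = m.group("fsid")
        key, val = m.group("key"), m.group("val").strip()
        if key == "bh":
            # "1432543247 (magic)" -> number, plus the failed check name
            number, _, check = val.partition(" ")
            report["block"] = int(number)
            report["check"] = check.strip("()")
        elif key == "time":
            # Seconds since the epoch, converted to an aware UTC datetime
            report["time_utc"] = datetime.fromtimestamp(int(val), tz=timezone.utc)
        elif key == "file":
            path, _, lineno = val.partition(", line = ")
            report["file"], report["line"] = path, int(lineno)
        else:
            report["function"] = val
    return report


if __name__ == "__main__":
    for name, value in parse_withdraw(SAMPLE.splitlines()).items():
        print(f"{name}: {value}")
```

Converting the "time" field to UTC makes it easier to line up reports
from nodes whose syslog timestamps are in local time.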
> 
> Kernel output from another node (Hercules):
> Jul  5 02:01:19 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start time = 1278252079
> Jul  5 03:01:16 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start time = 1278255676
> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block
> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   bh = 86700288 (magic)
> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   function = gfs_get_meta_buffer
> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/dio.c, line = 1225
> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   time = 1278255737
> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: about to withdraw from the cluster
> Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: telling LM to withdraw
> Jul  5 03:02:21 Hercules kernel: GFS: fsid=FSC:files.0: withdrawn
> Jul  5 03:02:21 Hercules kernel: 
> Jul  5 03:02:21 Hercules kernel: Call Trace:
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8880a018>] :gfs:gfs_lm_withdraw+0xc4/0xd3
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8001538b>] sync_buffer+0x0/0x3f
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88821c97>] :gfs:gfs_meta_check_ii+0x32/0x3e
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff887f7717>] :gfs:gfs_get_meta_buffer+0x1d1/0x247
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88804193>] :gfs:gfs_copyin_dinode+0x1d/0x12f
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88800d6e>] :gfs:gfs_glock_nq_init+0x16/0x2a
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff888043e3>] :gfs:inode_create+0x13e/0x1df
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88804a5d>] :gfs:gfs_inode_get+0x9d/0xba
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff888053bb>] :gfs:gfs_lookupi+0x33d/0x3df
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff887fce57>] :gfs:ea_find_i+0x0/0x6b
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff888172af>] :gfs:gfs_lookup+0x363/0x41a
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff80025426>] igrab+0x25/0x34
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff888055a0>] :gfs:gfs_iget+0x3d/0x1f1
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff88801224>] :gfs:gfs_glock_dq+0x13c/0x14b
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000cf01>] do_lookup+0xe5/0x1e6
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000a22b>] __link_path_walk+0xa01/0xf42
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000e9cc>] link_path_walk+0x42/0xb2
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000cc9c>] do_path_lookup+0x275/0x2f1
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff80012752>] getname+0x15b/0x1c2
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff800236ba>] __user_walk_fd+0x37/0x4c
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8003f235>] vfs_lstat_fd+0x18/0x47
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8002a95a>] sys_newlstat+0x19/0x31
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
> Jul  5 03:02:21 Hercules kernel:  [<ffffffff8005d116>] system_call+0x7e/0x83
> 
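To compare the two call traces it can help to reduce each one to a list
of (module, symbol) pairs. The sketch below does that for the frame
format shown above; the sample is a shortened copy of the Hercules
trace, and the helper name parse_frames is hypothetical. Frames printed
by this kernel can include stale stack entries, so the result is a hint
about the code path, not a verified backtrace.

```python
import re

# "[<addr>] :module:symbol+0xoff/0xsize"; the ":module:" part is absent
# for symbols that live in the core kernel.
FRAME = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

TRACE = """\
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8880a018>] :gfs:gfs_lm_withdraw+0xc4/0xd3
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8001538b>] sync_buffer+0x0/0x3f
Jul  5 03:02:21 Hercules kernel:  [<ffffffff88821c97>] :gfs:gfs_meta_check_ii+0x32/0x3e
Jul  5 03:02:21 Hercules kernel:  [<ffffffff887f7717>] :gfs:gfs_get_meta_buffer+0x1d1/0x247
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8002a95a>] sys_newlstat+0x19/0x31
"""


def parse_frames(lines):
    """Return (module, symbol) for every line that looks like a frame."""
    frames = []
    for line in lines:
        m = FRAME.search(line)
        if m:
            # Label core-kernel symbols "kernel" so every frame has a module
            frames.append((m.group("module") or "kernel", m.group("symbol")))
    return frames


if __name__ == "__main__":
    for module, symbol in parse_frames(TRACE.splitlines()):
        print(f"{module:8s} {symbol}")
```

Filtering the result on module == "gfs" leaves only the GFS functions,
which is usually the part worth quoting in a bug report.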
> 
> Thanks in advance,
> 
>   -- Abraham
> 
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> Abraham Alawi
> 
> Unix/Linux Systems Administrator
> Science IT
> University of Auckland
> e: a.alawi at auckland.ac.nz
> p: +64-9-373 7599, ext#: 87572
> 
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''