[Linux-cluster] CLVM/GFS hangs while rebuilding an AoE RAID5 LUN, anyone experienced that before? any clue?

Abraham Alawi a.alawi at auckland.ac.nz
Tue Jul 6 01:22:47 UTC 2010


The system had been running well for a while, but we recently had a flaky disk in the RAID array, which we replaced with a healthy one. Since then CLVM/GFS has become unusable. We can mount GFS, but a recursive listing ('ls -R') hangs with an Input/output error. We cannot even read the CLVM/LVM LUN directly with 'dd', BUT we can still read the LVM PV devices with 'dd'. Reconfiguring the LVM volume as a local one and accessing it exclusively from one node makes no difference.
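
The 'dd' comparison above (logical volume unreadable, PVs readable) can be repeated in a scripted form. The following is only a minimal sketch: the function name probe_reads is made up for illustration, and any device paths passed to it are placeholders, not the real names on this cluster. It reads a device sequentially and reports how far it got before the first read error. Reads go through the page cache, so a result may differ from a direct-I/O 'dd'.

```python
import os

def probe_reads(path, block_size=1 << 20, max_blocks=None):
    """Read `path` sequentially, like a plain `dd if=path of=/dev/null`.

    Returns (bytes_read, error): error is None if every read succeeded,
    otherwise the OSError raised by the first failing read (e.g. EIO).
    Errors from opening the device are not caught and propagate.
    """
    bytes_read = 0
    blocks = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while max_blocks is None or blocks < max_blocks:
            try:
                chunk = os.read(fd, block_size)
            except OSError as exc:
                # First failing read: report how much was readable before it.
                return bytes_read, exc
            if not chunk:
                break  # end of device or file
            bytes_read += len(chunk)
            blocks += 1
    finally:
        os.close(fd)
    return bytes_read, None
```

Running it once against each PV device and once against the logical volume, with the same block_size, would show whether the failure appears only at the device-mapper layer and at which offset the first error occurs.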

RHEL5: 2.6.18-164.11.1.el5
# modinfo gfs
filename:       /lib/modules/2.6.18-164.11.1.el5/weak-updates/gfs/gfs.ko
license:        GPL
author:         Red Hat, Inc.
description:    Global File System 0.1.34-2.el5
srcversion:     3B1BAC4069F1A4B556A958A
depends:        dlm
vermagic:       2.6.18-159.el5 SMP mod_unload gcc-4.1

# uname -r
2.6.18-164.11.1.el5
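
Note that gfs.ko above is loaded from weak-updates and reports a vermagic of 2.6.18-159.el5, while the running kernel is 2.6.18-164.11.1.el5. On RHEL 5 the weak-updates directory is the normal mechanism for reusing kABI-compatible modules, so this difference is not necessarily a problem in itself. A minimal sketch for checking it across nodes follows; the function names are made up for illustration, and the sample text is an excerpt of the modinfo output above.

```python
import re

def vermagic_kernel(modinfo_text):
    """Return the kernel release named on the 'vermagic:' line, or None."""
    match = re.search(r"^vermagic:\s+(\S+)", modinfo_text, re.MULTILINE)
    return match.group(1) if match else None

def built_for_running_kernel(modinfo_text, running_release):
    """True if the module was built for exactly the running kernel release."""
    return vermagic_kernel(modinfo_text) == running_release

# Excerpt of the `modinfo gfs` output quoted in this post.
GFS_MODINFO = """\
filename:       /lib/modules/2.6.18-164.11.1.el5/weak-updates/gfs/gfs.ko
vermagic:       2.6.18-159.el5 SMP mod_unload gcc-4.1
"""

if __name__ == "__main__":
    running = "2.6.18-164.11.1.el5"  # value of `uname -r` from this post
    print(vermagic_kernel(GFS_MODINFO), built_for_running_kernel(GFS_MODINFO, running))
```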

# modinfo /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
filename:       /lib/modules/2.6.18-164.11.1.el5/kernel/drivers/block/aoe/aoe.ko
description:    AoE block/char driver for 2.6.2 and newer 2.6 kernels
author:         Sam Hopkins <sah at coraid.com>
license:        GPL
srcversion:     42BF122979AC807F2BB50E6
depends:        
vermagic:       2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
parm:           aoe_iflist:aoe_iflist=dev1[,dev2...]
 (string)
parm:           version:aoe module version 74
 (string)
parm:           aoe_dyndevs:Use dynamic minor numbers for devices. (int)
parm:           aoe_deadsecs:After aoe_deadsecs seconds, give up and fail dev. (int)
parm:           aoe_maxout:Only aoe_maxout outstanding packets for every MAC on eX.Y. (int)
parm:           aoe_maxsectors:When nonzero, set the maximum number of sectors per I/O request in new devices. (int)

# modinfo dlm
filename:       /lib/modules/2.6.18-164.11.1.el5/kernel/fs/dlm/dlm.ko
license:        GPL
author:         Red Hat, Inc.
description:    Distributed Lock Manager
srcversion:     E768995007648CA8DB078AE
depends:        configfs
vermagic:       2.6.18-164.11.1.el5 SMP mod_unload gcc-4.1
module_sig:	883f3504b56fe19c59c69348c13cf1f1126a509f6ddaee3965ee8b5fcd04163669647a889a9801e09f722187d1de068c0d52cd2b99bc3d475cb6ca1a0



Here is what the kernel logs on one host (kiwiland):

Jul  6 11:27:36 kiwiland kernel: GFS 0.1.34-2.el5 (built Sep  9 2009 06:54:42) installed
Jul  6 11:27:36 kiwiland kernel: Lock_DLM (built Sep  9 2009 06:54:38) installed
Jul  6 11:27:36 kiwiland kernel: Lock_Nolock (built Sep  9 2009 06:54:37) installed
Jul  6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm", "FSC:files"
Jul  6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Trying to acquire journal lock...
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Looking at journal...
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Acquiring the transaction lock...
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replaying journal...
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Replayed 0 of 11 blocks
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: replays = 0, skips = 4, sames = 7
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Journal replayed in 1s
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=0: Done
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Trying to acquire journal lock...
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Looking at journal...
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: jid=1: Done
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Scanning for log elements...
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found 2 unlinked inodes
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Found quota changes for 2 IDs
Jul  6 11:27:36 kiwiland kernel: GFS: fsid=FSC:files.0: Done
Jul  6 11:27:36 kiwiland kernel: Trying to join cluster "lock_dlm", "FSC:webcluster"
Jul  6 11:27:36 kiwiland kernel: Joined cluster. Now mounting FS...
Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Trying to acquire journal lock...
Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Looking at journal...
Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=1: Done
Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Scanning for log elements...
Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found 0 unlinked inodes
Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Found quota changes for 0 IDs
Jul  6 11:27:37 kiwiland kernel: GFS: fsid=FSC:webcluster.1: Done
Jul  6 11:27:37 kiwiland kernel: Installing knfsd (copyright (C) 1996 okir at monad.swb.de).
Jul  6 11:27:39 kiwiland kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Jul  6 11:27:39 kiwiland kernel: NFSD: starting 90-second grace period
Jul  6 11:32:21 kiwiland kernel: dlm: closing connection to node 1
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Trying to acquire journal lock...
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   bh = 1432543247 (magic)
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   function = gfs_rgrp_read
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/rgrp.c, line = 830
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   time = 1278372781
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: about to withdraw from the cluster
Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: telling LM to withdraw
Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Looking at journal...
Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Acquiring the transaction lock...
Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Replaying journal...
Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Replayed 0 of 0 blocks
Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: replays = 0, skips = 0, sames = 0
Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Journal replayed in 1s
Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:webcluster.1: jid=0: Done
Jul  6 11:33:02 kiwiland kernel: GFS: fsid=FSC:files.0: withdrawn
Jul  6 11:33:02 kiwiland kernel: 
Jul  6 11:33:02 kiwiland kernel: Call Trace:
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff88805018>] :gfs:gfs_lm_withdraw+0xc4/0xd3
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff80063a36>] __wait_on_bit+0x60/0x6e
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8001538b>] sync_buffer+0x0/0x3f
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8881cc97>] :gfs:gfs_meta_check_ii+0x32/0x3e
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff88819439>] :gfs:gfs_rgrp_read+0x139/0x225
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff887fb8e8>] :gfs:glock_wait_internal+0x229/0x2c3
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff887fbd17>] :gfs:gfs_glock_nq+0x395/0x3d6
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff887fbd6e>] :gfs:gfs_glock_nq_init+0x16/0x2a
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff88817466>] :gfs:gfs_rgrp_lvb_init+0x1e/0x3f
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8881a46f>] :gfs:gfs_stat_gfs+0x213/0x273
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8881353d>] :gfs:gfs_statfs+0x67/0xea
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff800deba3>] vfs_statfs+0x63/0x7f
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886d2ce>] :nfsd:nfsd_statfs+0x28/0x38
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff888745f8>] :nfsd:nfsd3_proc_fsstat+0x3f/0x54
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a1db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff886e0529>] :sunrpc:svc_process+0x454/0x71b
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff80064644>] __down_read+0x12/0x92
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a746>] :nfsd:nfsd+0x1a5/0x2cb
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8886a5a1>] :nfsd:nfsd+0x0/0x2cb
Jul  6 11:33:02 kiwiland kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jul  6 11:33:02 kiwiland kernel: 


Kernel output from another host (Hercules):
Jul  5 02:01:19 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start time = 1278252079
Jul  5 03:01:16 Hercules kernel: GFS: fsid=FSC:files.0: fast statfs start time = 1278255676
Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block
Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   bh = 86700288 (magic)
Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   function = gfs_get_meta_buffer
Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/dio.c, line = 1225
Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0:   time = 1278255737
Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: about to withdraw from the cluster
Jul  5 03:02:17 Hercules kernel: GFS: fsid=FSC:files.0: telling LM to withdraw
Jul  5 03:02:21 Hercules kernel: GFS: fsid=FSC:files.0: withdrawn
Jul  5 03:02:21 Hercules kernel: 
Jul  5 03:02:21 Hercules kernel: Call Trace:
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8880a018>] :gfs:gfs_lm_withdraw+0xc4/0xd3
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8001538b>] sync_buffer+0x0/0x3f
Jul  5 03:02:21 Hercules kernel:  [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
Jul  5 03:02:21 Hercules kernel:  [<ffffffff800a00e5>] wake_bit_function+0x0/0x23
Jul  5 03:02:21 Hercules kernel:  [<ffffffff88821c97>] :gfs:gfs_meta_check_ii+0x32/0x3e
Jul  5 03:02:21 Hercules kernel:  [<ffffffff887f7717>] :gfs:gfs_get_meta_buffer+0x1d1/0x247
Jul  5 03:02:21 Hercules kernel:  [<ffffffff88804193>] :gfs:gfs_copyin_dinode+0x1d/0x12f
Jul  5 03:02:21 Hercules kernel:  [<ffffffff88800d6e>] :gfs:gfs_glock_nq_init+0x16/0x2a
Jul  5 03:02:21 Hercules kernel:  [<ffffffff888043e3>] :gfs:inode_create+0x13e/0x1df
Jul  5 03:02:21 Hercules kernel:  [<ffffffff88804a5d>] :gfs:gfs_inode_get+0x9d/0xba
Jul  5 03:02:21 Hercules kernel:  [<ffffffff888053bb>] :gfs:gfs_lookupi+0x33d/0x3df
Jul  5 03:02:21 Hercules kernel:  [<ffffffff887fce57>] :gfs:ea_find_i+0x0/0x6b
Jul  5 03:02:21 Hercules kernel:  [<ffffffff888172af>] :gfs:gfs_lookup+0x363/0x41a
Jul  5 03:02:21 Hercules kernel:  [<ffffffff80025426>] igrab+0x25/0x34
Jul  5 03:02:21 Hercules kernel:  [<ffffffff888055a0>] :gfs:gfs_iget+0x3d/0x1f1
Jul  5 03:02:21 Hercules kernel:  [<ffffffff88801224>] :gfs:gfs_glock_dq+0x13c/0x14b
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000cf01>] do_lookup+0xe5/0x1e6
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000a22b>] __link_path_walk+0xa01/0xf42
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000e9cc>] link_path_walk+0x42/0xb2
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8000cc9c>] do_path_lookup+0x275/0x2f1
Jul  5 03:02:21 Hercules kernel:  [<ffffffff80012752>] getname+0x15b/0x1c2
Jul  5 03:02:21 Hercules kernel:  [<ffffffff800236ba>] __user_walk_fd+0x37/0x4c
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8003f235>] vfs_lstat_fd+0x18/0x47
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8002a95a>] sys_newlstat+0x19/0x31
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x84
Jul  5 03:02:21 Hercules kernel:  [<ffffffff8005d116>] system_call+0x7e/0x83
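
Both withdraw reports above carry the same fields (bh, function, file, time), so they can be pulled out of the syslog and compared between hosts. The following is a minimal sketch, not a GFS tool: the names parse_withdraw and byte_offset are made up for illustration. It assumes that the "bh" value is the file system block number of the rejected buffer and that the file system uses a 4096-byte block size; both should be checked against the actual file system before using the offset with 'dd'.

```python
import re
from datetime import datetime, timezone

# Matches the indented detail lines of a GFS "fatal:" report.
FIELD = re.compile(
    r"GFS: fsid=(?P<fsid>\S+):\s+(?P<key>bh|function|file|time) = (?P<value>.+)$"
)

def parse_withdraw(lines):
    """Collect the bh/function/file/time fields of one report into a dict."""
    report = {}
    for line in lines:
        m = FIELD.search(line)
        if not m:
            continue
        report["fsid"] = m.group("fsid")
        report[m.group("key")] = m.group("value").strip()
    if "bh" in report:
        # "1432543247 (magic)" -> 1432543247
        report["block"] = int(report["bh"].split()[0])
    if "time" in report:
        # The time field is seconds since the Unix epoch.
        report["utc"] = datetime.fromtimestamp(int(report["time"]), tz=timezone.utc)
    return report

def byte_offset(block, block_size=4096):
    """Byte offset of a block; block_size=4096 is an assumption, not from the log."""
    return block * block_size

# Lines copied from the kiwiland log in this post.
SAMPLE = [
    "Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0: fatal: invalid metadata block",
    "Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   bh = 1432543247 (magic)",
    "Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   function = gfs_rgrp_read",
    "Jul  6 11:33:01 kiwiland kernel: GFS: fsid=FSC:files.0:   time = 1278372781",
]

if __name__ == "__main__":
    r = parse_withdraw(SAMPLE)
    print(r["fsid"], r["function"], r["block"], r["utc"].isoformat())
```

Comparing the block numbers reported by the two hosts (1432543247 in gfs_rgrp_read, 86700288 in gfs_get_meta_buffer) shows that the bad metadata is not confined to a single block.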


Thanks in advance,

  -- Abraham

''''''''''''''''''''''''''''''''''''''''''''''''''''''
Abraham Alawi

Unix/Linux Systems Administrator
Science IT
University of Auckland
e: a.alawi at auckland.ac.nz
p: +64-9-373 7599, ext#: 87572

''''''''''''''''''''''''''''''''''''''''''''''''''''''




