[Linux-cluster] gfs2 blocking tasks

Bart Verwilst lists at verwilst.be
Tue Aug 21 10:08:02 UTC 2012


As yet another reply to my own post: I found this on the node where it
hangs (this time it's vm01, and /var/lib/libvirt/sanlock that's hanging):


[ 1219.640653] GFS2: fsid=: Trying to join cluster "lock_dlm", "kvm:sanlock"
[ 1219.660035] GFS2: fsid=kvm:sanlock.0: Joined cluster. Now mounting FS...
[ 1219.720108] GFS2: fsid=kvm:sanlock.0: jid=0, already locked for use
[ 1219.720113] GFS2: fsid=kvm:sanlock.0: jid=0: Looking at journal...
[ 1219.772606] GFS2: fsid=kvm:sanlock.0: jid=0: Acquiring the transaction lock...
[ 1219.772659] GFS2: fsid=kvm:sanlock.0: jid=0: Replaying journal...
[ 1219.772675] GFS2: fsid=kvm:sanlock.0: jid=0: Replayed 0 of 0 blocks
[ 1219.772679] GFS2: fsid=kvm:sanlock.0: jid=0: Found 1 revoke tags
[ 1219.782611] init: libvirt-bin main process ended, respawning
[ 1219.784161] GFS2: fsid=kvm:sanlock.0: jid=0: Journal replayed in 1s
[ 1219.784268] GFS2: fsid=kvm:sanlock.0: jid=0: Done
[ 1219.784329] GFS2: fsid=kvm:sanlock.0: jid=1: Trying to acquire journal lock...
[ 1219.788349] GFS2: fsid=kvm:sanlock.0: jid=1: Looking at journal...
[ 1219.886047] GFS2: fsid=kvm:sanlock.0: jid=1: Done
[ 1219.886110] GFS2: fsid=kvm:sanlock.0: jid=2: Trying to acquire journal lock...
[ 1219.891121] GFS2: fsid=kvm:sanlock.0: jid=2: Looking at journal...
[ 1219.943994] init: ttyS1 main process (20318) terminated with status 1
[ 1219.944037] init: ttyS1 main process ended, respawning
[ 1219.967054] init: ttyS0 main process (20320) terminated with status 1
[ 1219.967100] init: ttyS0 main process ended, respawning
[ 1219.972037] ttyS0: LSR safety check engaged!
[ 1220.226027] GFS2: fsid=kvm:sanlock.0: jid=2: Acquiring the transaction lock...
[ 1220.226160] GFS2: fsid=kvm:sanlock.0: jid=2: Replaying journal...
[ 1220.311801] GFS2: fsid=kvm:sanlock.0: jid=2: Replayed 3 of 106 blocks
[ 1220.311805] GFS2: fsid=kvm:sanlock.0: jid=2: Found 4487 revoke tags
[ 1220.322148] GFS2: fsid=kvm:sanlock.0: jid=2: Journal replayed in 1s
[ 1220.322253] GFS2: fsid=kvm:sanlock.0: jid=2: Done
[ 1221.457120] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 1221.457508] IP: [<ffffffffa04f800a>] gfs2_unlink+0x8a/0x220 [gfs2]
[ 1221.457958] PGD 2170f2067 PUD 20c0ff067 PMD 0
[ 1221.458197] Oops: 0000 [#1] SMP
[ 1221.458374] CPU 0
[ 1221.458460] Modules linked in: gfs2 dlm ipmi_si mptctl mptbase ipmi_devintf ipmi_msghandler dell_rbu ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables kvm_intel kvm ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs dm_round_robin ib_iser bridge rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc 8021q garp stp joydev dm_multipath dcdbas mac_hid i7core_edac edac_core acpi_power_meter lp parport usbhid hid bnx2 mpt2sas scsi_transport_sas e1000e raid_class scsi_dh_rdac usb_storage [last unloaded: ipmi_si]
[ 1221.463058]
[ 1221.463146] Pid: 20611, comm: libvirtd Not tainted 3.2.0-26-generic #41-Ubuntu Dell Inc. PowerEdge R310/05XKKK
[ 1221.463626] RIP: 0010:[<ffffffffa04f800a>]  [<ffffffffa04f800a>] gfs2_unlink+0x8a/0x220 [gfs2]
[ 1221.464100] RSP: 0018:ffff88020cfe1d28  EFLAGS: 00010296
[ 1221.464344] RAX: 0000000000000000 RBX: ffff88021ef58000 RCX: ffff88020cfe1d40
[ 1221.464662] RDX: 0000000000000000 RSI: 00000000000183f3 RDI: ffff88022efa2440
[ 1221.464979] RBP: ffff88020cfe1e38 R08: 4000000000000000 R09: 0000000000000000
[ 1221.465297] R10: fdd50265775f720a R11: 0000000000000003 R12: ffff88021ef50000
[ 1221.465615] R13: ffff88020cfe1d80 R14: ffff8802178980c0 R15: ffff88022efa2000
[ 1221.466115] FS:  00007f4d2c0e7700(0000) GS:ffff880237200000(0000) knlGS:0000000000000000
[ 1221.466487] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1221.466748] CR2: 0000000000000018 CR3: 00000002175e4000 CR4: 00000000000006f0
[ 1221.467066] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1221.467384] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1221.467702] Process libvirtd (pid: 20611, threadinfo ffff88020cfe0000, task ffff8802241cdbc0)
[ 1221.468091] Stack:
[ 1221.468199]  0000000000000003 ffff88022f108048 ffff88020cfe1d58 ffff88020cfe1d40
[ 1221.468581]  ffff88020cfe1d40 ffff88022f108000 ffff88022354aa00 0000000000000001
[ 1221.468963]  0000000000000000 0000000000000000 ffffffffa04f7fda ffff88020cfe1d80
[ 1221.469346] Call Trace:
[ 1221.469486]  [<ffffffffa04f7fda>] ? gfs2_unlink+0x5a/0x220 [gfs2]
[ 1221.469955]  [<ffffffffa04f7ff4>] ? gfs2_unlink+0x74/0x220 [gfs2]
[ 1221.470236]  [<ffffffff8129cb2c>] ? security_inode_permission+0x1c/0x30
[ 1221.470536]  [<ffffffff81184e70>] vfs_unlink.part.26+0x80/0xf0
[ 1221.470802]  [<ffffffff81184f1c>] vfs_unlink+0x3c/0x60
[ 1221.471040]  [<ffffffff8118758a>] do_unlinkat+0x1aa/0x1d0
[ 1221.471290]  [<ffffffff81177fc0>] ? vfs_write+0x110/0x180
[ 1221.471538]  [<ffffffff811782a7>] ? sys_write+0x67/0x90
[ 1221.471780]  [<ffffffff81188106>] sys_unlink+0x16/0x20
[ 1221.472019]  [<ffffffff81661fc2>] system_call_fastpath+0x16/0x1b
[ 1221.472290] Code: 00 00 49 83 c5 40 31 d2 4c 89 e9 be 01 00 00 00 e8 fc 1e ff ff 48 8b b3 28 02 00 00 4c 89 ff e8 ad 7e 00 00 48 8d 8d 08 ff ff ff <48> 8b 78 18 31 d2 be 01 00 00 00 48 83 e9 80 e8 d2 1e ff ff 48
[ 1221.473936] RIP  [<ffffffffa04f800a>] gfs2_unlink+0x8a/0x220 [gfs2]
[ 1221.474240]  RSP <ffff88020cfe1d28>
[ 1221.474408] CR2: 0000000000000018
[ 1221.474959] ---[ end trace f7df780fd45600a8 ]---
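
For what it's worth, decoding the oops by hand: the faulting instruction marked <48> in the Code: line decodes as mov 0x18(%rax),%rdi, and RAX is 0000000000000000 in the register dump, so gfs2_unlink is dereferencing a member at offset 0x18 of a NULL struct pointer (which also matches CR2 = 0000000000000018). To map that offset back to a source line, something like the following should work. This is only a sketch, assuming the debug-symbol (ddeb) package for 3.2.0-26-generic is installed; the gfs2.ko path below is a guess from my setup and may differ:

  # open the gfs2 module with debug info and resolve the faulting offset
  gdb /usr/lib/debug/lib/modules/3.2.0-26-generic/kernel/fs/gfs2/gfs2.ko
  (gdb) list *(gfs2_unlink+0x8a)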


Bart Verwilst wrote on 20.08.2012 09:50:
> Nothing out of the ordinary, should have mentioned that!
>
> <snip>
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: read 20 bytes from fd 17
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: client command is 7
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: About to process command
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb: command to process is 7
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb: get_all_members: allocated new buffer (retsize=1024)
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb: get_all_members: retlen = 1760
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb: command return code is 4
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: Returning command data. length = 1760
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: sending reply 40000007 to fd 17
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: read 20 bytes from fd 17
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: client command is 800000b7
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: About to process command
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb: command to process is 800000b7
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb: command return code is 0
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: Returning command data. length = 0
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon: sending reply c00000b7 to fd 17
> Aug 19 00:08:01 vm02-test corosync[7394]:   [CMAN  ] daemon: read 20 bytes from fd 17
> Aug 19 00:08:01 vm02-test corosync[7394]:   [CMAN  ] daemon: client command is 7
> Aug 19 00:08:01 vm02-test corosync[7394]:   [CMAN  ] daemon: About to process command
> </snip>
>
> Digimer wrote on 20.08.2012 00:01:
>> On 08/19/2012 05:34 PM, Bart Verwilst wrote:
>>> Aug 19 00:10:01 vm02-test kernel: [282120.240067] INFO: task kworker/1:0:3117 blocked for more than 120 seconds.
>>> Aug 19 00:10:01 vm02-test kernel: [282120.240182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
>> Nothing around Aug 19 00:08:00 ?
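
Regarding the hung-task messages quoted just above: next time the mount blocks, I'll try to capture the stacks of all uninterruptible (D-state) tasks, so we can see where kworker/1:0 and friends are actually stuck. A minimal sketch, assuming magic sysrq is enabled (kernel.sysrq=1):

  # write blocked-task backtraces into the kernel log, then read them back
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200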



