[Linux-cluster] Cluster death

Kaloyan Kovachev kkovachev at varna.net
Sat Nov 14 11:24:45 UTC 2009


Hello,
 yesterday the cluster died after a short network outage i guess and Node2
(which is the only one with single NIC without bonding) have been fenced, but
services not relocated and after rebooted was unable to mount the GFS2 shares
and i had to reboot the entire cluster.
 It's a 4 node cluster with Node1 and Node2 booting from the network (from
Stor1 and Stor2). From the logs i can see that both Stor1 and Stor2 have the
same trace (below in csv format from the rsyslog database as nodes don't have
local storage) which is probably why Node2 was unable to rejoin the cluster
even it was fenced.

"2009-11-13 16:41:12","Stor1","kernel","emerg","kernel:"," [773501.547726]
------------[ cut here ]------------"
"2009-11-13 16:41:12","Stor1","kernel","critical","kernel:"," [773501.559005]
kernel BUG at fs/inode.c:1323!"
"2009-11-13 16:41:12","Stor1","kernel","emerg","kernel:"," [773501.559251]
invalid opcode: 0000 [#1] SMP "
"2009-11-13 16:41:12","Stor1","kernel","emerg","kernel:"," [773501.559516]
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.559946]
CPU 6 "
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.560173]
Modules linked in: gfs2 dlm configfs iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi iscsi_trgt drbd 

e1000e [last unloaded: configfs]"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.560690] Pid:
7742, comm: dlm_send Not tainted 2.6.31.4 #13 X8STi"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.560910] RIP:
0010:[<ffffffff810fa032>]  [<ffffffff810fa032>] iput+0x1b/0x65"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.561391] RSP:
0018:ffff8801a85fdc60  EFLAGS: 00010246"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.561625] RAX:
0000000000000000 RBX: ffff8801af7cb488 RCX: ffff8801af7cb440"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.562053] RDX:
ffff8801b7e25480 RSI: ffffea0005e633c0 RDI: ffff8801af7cb488"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.562545] RBP:
ffff8801a85fdc70 R08: ffff8801a85fdbe0 R09: 8000000000000000"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.562976] R10:
0000000000000000 R11: ffff8801a85fdbc0 R12: ffff8801af7cb440"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.563480] R13:
ffff8801add24010 R14: ffff8801add24030 R15: ffff8801a8600000"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.563920] FS:
 0000000000000000(0000) GS:ffff8800280e8000(0000) knlGS:0000000000000000"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.564401] CS:
 0010 DS: 0018 ES: 0018 CR0: 000000008005003b"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.564652] CR2:
00007f258933d9c8 CR3: 00000001bcd03000 CR4: 00000000000006e0"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.565082] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.565568] DR3:
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.565992]
Process dlm_send (pid: 7742, threadinfo ffff8801a85fc000, task ffff8801a8600000)"
"2009-11-13 16:41:12","Stor1","kernel","emerg","kernel:"," [773501.566487] Stack:"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.566698] 
0000000000000000 0000000000000000 ffff8801a85fdc90 ffffffff815dedc0"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.566932]
<0> 0000000000000000 ffff8801add24000 ffff8801a85fddd0 ffffffffa00e2f0f"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.567418]
<0> 0000000000000000 0000000000000000 0000000000000000 0000000000000000"
"2009-11-13 16:41:12","Stor1","kernel","emerg","kernel:"," [773501.568081]
Call Trace:"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.568347] 
[<ffffffff815dedc0>] sock_release+0x5c/0x6c"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.568609] 
[<ffffffffa00e2f0f>] tcp_connect_to_sock+0x1f4/0x247 [dlm]"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.568834] 
[<ffffffff810444d8>] ? __wake_up+0x43/0x50"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.569063] 
[<ffffffffa00e3e5e>] process_send_sockets+0x31/0x1a5 [dlm]"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.569334] 
[<ffffffff81654294>] ? tcp_sendpage+0x0/0x46e"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.569592] 
[<ffffffff81062540>] worker_thread+0x17d/0x222"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.569821] 
[<ffffffffa00e3e2d>] ? process_send_sockets+0x0/0x1a5 [dlm]"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.570044] 
[<ffffffff81066968>] ? autoremove_wake_function+0x0/0x38"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.570305] 
[<ffffffff810623c3>] ? worker_thread+0x0/0x222"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.570551] 
[<ffffffff810665ce>] kthread+0x8f/0x97"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.570769] 
[<ffffffff81011bea>] child_rip+0xa/0x20"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.570989] 
[<ffffffff810408d1>] ? resched_task+0x6a/0x6e"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.571236] 
[<ffffffff8106653f>] ? kthread+0x0/0x97"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.571490] 
[<ffffffff81011be0>] ? child_rip+0x0/0x20"
"2009-11-13 16:41:12","Stor1","kernel","emerg","kernel:"," [773501.571721]
Code: e8 28 96 00 00 eb df 48 83 c4 20 5b 41 5c c9 c3 55 48 89 e5 53 48 89 fb
48 83 ec 08 48 85 ff 

74 50 48 83 bf 10 02 00 00 40 75 04 <0f> 0b eb fe 48 8d 7f 48 48 c7 c6 10 b9
d5 81 e8 42 0b 24 00 85 "
"2009-11-13 16:41:12","Stor1","kernel","alert","kernel:"," [773501.572592] RIP
 [<ffffffff810fa032>] iput+0x1b/0x65"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.572824] 
RSP <ffff8801a85fdc60>"
"2009-11-13 16:41:12","Stor1","kernel","warning","kernel:"," [773501.573739]
---[ end trace 209b2e1c9e1ac145 ]---"
"2009-11-13 16:41:12","Stor2","kernel","emerg","kernel:"," [775403.412503]
------------[ cut here ]------------"
"2009-11-13 16:41:12","Stor2","kernel","critical","kernel:"," [775403.412887]
kernel BUG at fs/inode.c:1323!"
"2009-11-13 16:41:12","Stor2","kernel","emerg","kernel:"," [775403.413262]
invalid opcode: 0000 [#1] SMP "
"2009-11-13 16:41:12","Stor2","kernel","emerg","kernel:"," [775403.413387]
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387]
CPU 3 "
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387]
Modules linked in: gfs2 dlm configfs iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi iscsi_trgt drbd igb 

[last unloaded: configfs]"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] Pid:
5478, comm: dlm_send Not tainted 2.6.31.4 #3 Unknow"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] RIP:
0010:[<ffffffff810fa032>]  [<ffffffff810fa032>] iput+0x1b/0x65"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] RSP:
0018:ffff88012a49dc60  EFLAGS: 00010246"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] RAX:
0000000000000000 RBX: ffff88013210bcc8 RCX: ffff88013210bc80"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] RDX:
ffff88013a5be800 RSI: ffffea00042f39c0 RDI: ffff88013210bcc8"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] RBP:
ffff88012a49dc70 R08: ffff88012a49dbe0 R09: 8000000000000000"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] R10:
ffff88012c1dbca8 R11: ffff88012a49dbc0 R12: ffff88013210bc80"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] R13:
ffff88012c1671c0 R14: ffff88012c1671e0 R15: ffff88012a490f20"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] FS:
 00007fb62c5086f0(0000) GS:ffff88002807c000(0000) knlGS:0000000000000000"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] CS:
 0010 DS: 0018 ES: 0018 CR0: 000000008005003b"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] CR2:
00007fb62c522e49 CR3: 0000000129caf000 CR4: 00000000000006e0"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000"
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.413387] DR3:
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387]
Process dlm_send (pid: 5478, threadinfo ffff88012a49c000, task ffff88012a490f20)"
"2009-11-13 16:41:12","Stor2","kernel","emerg","kernel:"," [775403.413387] Stack:"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
0000000000000000 0000000000000000 ffff88012a49dc90 ffffffff815dedc0"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387]
<0> 0000000000000000 ffff88012c1671b0 ffff88012a49ddd0 ffffffffa00d9f0f"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387]
<0> ffff88012a490f58 ffff8800280917e0 0000000000000001 ffffc900116de088"
"2009-11-13 16:41:12","Stor2","kernel","emerg","kernel:"," [775403.413387]
Call Trace:"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff815dedc0>] sock_release+0x5c/0x6c"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffffa00d9f0f>] tcp_connect_to_sock+0x1f4/0x247 [dlm]"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff810444d8>] ? __wake_up+0x43/0x50"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffffa00dae5e>] process_send_sockets+0x31/0x1a5 [dlm]"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff81654294>] ? tcp_sendpage+0x0/0x46e"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff81062540>] worker_thread+0x17d/0x222"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffffa00dae2d>] ? process_send_sockets+0x0/0x1a5 [dlm]"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff81066968>] ? autoremove_wake_function+0x0/0x38"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff810623c3>] ? worker_thread+0x0/0x222"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff810665ce>] kthread+0x8f/0x97"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff81011bea>] child_rip+0xa/0x20"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff8106653f>] ? kthread+0x0/0x97"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
[<ffffffff81011be0>] ? child_rip+0x0/0x20"
"2009-11-13 16:41:12","Stor2","kernel","emerg","kernel:"," [775403.413387]
Code: e8 28 96 00 00 eb df 48 83 c4 20 5b 41 5c c9 c3 55 48 89 e5 53 48 89 fb
48 83 ec 08 48 85 ff 

74 50 48 83 bf 10 02 00 00 40 75 04 <0f> 0b eb fe 48 8d 7f 48 48 c7 c6 10 b9
d5 81 e8 42 0b 24 00 85 "
"2009-11-13 16:41:12","Stor2","kernel","alert","kernel:"," [775403.413387] RIP
 [<ffffffff810fa032>] iput+0x1b/0x65"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.413387] 
RSP <ffff88012a49dc60>"
"2009-11-13 16:41:12","Stor2","kernel","warning","kernel:"," [775403.453850]
---[ end trace 1a43987c53646c14 ]---"
"2009-11-13 16:41:12","Stor1","local4","notice","corosync[7441]:","   [MAIN  ]
Completed service synchronization, ready to provide service."
"2009-11-13 16:41:12","Stor2","local4","notice","corosync[5211]:","   [MAIN  ]
Completed service synchronization, ready to provide service."
"2009-11-13 16:41:12","Node1","local4","notice","corosync[4710]:","   [MAIN  ]
Completed service synchronization, ready to provide service."
"2009-11-13 16:41:12","Stor2","local4","info","fenced[5286]:"," fencing
deferred to Node1"
"2009-11-13 16:41:12","Stor1","local4","info","fenced[7529]:"," fencing
deferred to Node1"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.685992]
GFS2: fsid=Cluster1:Hosting.0: jid=2: Trying to acquire journal lock..."
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.553450]
GFS2: fsid=Cluster1:Hosting.1: jid=2: Trying to acquire journal lock..."
"2009-11-13 16:41:12","Node1","local4","info","fenced[4831]:"," fencing node
Node2"
"2009-11-13 16:41:12","Stor1","kernel","info","kernel:"," [773501.689702]
GFS2: fsid=Cluster1:Services.2: jid=1: Trying to acquire journal lock..."
"2009-11-13 16:41:12","Stor2","kernel","info","kernel:"," [775403.557265]
GFS2: fsid=Cluster1:Services.3: jid=1: Trying to acquire journal lock..."
"2009-11-13 16:41:12","Node1","kernel","info","kernel:"," [775171.695293]
GFS2: fsid=Cluster1:Services.0: jid=1: Trying to acquire journal lock..."
"2009-11-13 16:41:12","Node1","kernel","info","kernel:"," [775171.697581]
GFS2: fsid=Cluster1:Mails.0: jid=1: Trying to acquire journal lock..."
"2009-11-13 16:41:26","Node1","local4","error","fenced[4831]:"," fence Node2
success"
"2009-11-13 16:41:32","Stor1","auth","info","sshd[7120]:"," Timeout, client
not responding."

I have just increased post_fail_dalay from 0 to 5 with the hope to avoid
future problems, but maybe i should also include configfs in the kernel
instead of a module to avoid the crash?




More information about the Linux-cluster mailing list