[Linux-cluster] rgmanager crash, deadlock?

Tue Nov 7 20:29:59 UTC 2006

Last night one of my five cluster nodes suffered a hardware failure (memory,
cpu?). The other nodes properly fenced the failed machine, but no matter
what clusvcadm command I ran, I could not get the other cluster members to
start, stop or disable the cluster resource group/service that had been
running on the failed node. (That resource group/service includes an ext3
file system, an IP address, and rsyncd and smbd init scripts.)

The "clusvcadm -d [service]" command would just hang for minutes and not
return. "clustat" initially reported the rg/service in an unknown state, then
stopped reporting rgmanager status and only showed cman status. The cluster
remained quorate the entire time. Resource groups/services on non-failed
nodes continued to run, but no matter what I tried I could not get rgmanager
status on any node.

I had to reset the entire cluster to get things back to normal. (This is a
heavily used operational system so I didn't have time to do further
debugging.) My logs don't show any rgmanager-related error messages, only
fencing status:

Nov  6 20:24:37 bamf02 kernel: CMAN: removing node bamf03 from the cluster : Missed too many heartbeats
Nov  6 20:24:38 bamf02 fenced[5913]: fencing deferred to bamf01
---
Nov  6 20:24:37 bamf01 kernel: CMAN: node bamf03 has been removed from the cluster : Missed too many heartbeats
Nov  6 20:24:38 bamf01 fenced[5756]: bamf03 not a cluster member after 0 sec post_fail_delay
Nov  6 20:24:38 bamf01 fenced[5756]: fencing node "bamf03"
Nov  6 20:24:46 bamf01 fenced[5756]: fence "bamf03" success
Nov  6 20:30:36 bamf01 sshd(pam_unix)[27244]: session opened for user root by root(uid=0)
Nov  6 20:36:29 bamf01 kernel: CMAN: node bamf03 rejoining
Nov  6 20:42:55 bamf01 shutdown: shutting down for system reboot
---
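
Since the relevant entries are spread over several nodes' logs, it may help to merge them into one timeline. The following is only a rough sketch, not a tool that ships with the cluster suite: the function names (timeline, is_cluster_event) are made up for illustration, the year is assumed to be 2006 because syslog timestamps do not carry one, and the filter assumes that only fenced messages and kernel "CMAN:" / "Bad page state" messages are of interest.

```python
import re
from datetime import datetime

# Classic syslog layout: "Nov  6 20:24:37 bamf02 kernel: message text".
# The tag may carry a "(pam_unix)" or "[pid]" suffix, which is skipped.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<tag>[^:\[(\s]+)[^:]*:\s*"
    r"(?P<msg>.*)$"
)


def is_cluster_event(tag, msg):
    """Assumed filter: keep fenced output and CMAN / bad-page kernel messages."""
    if tag == "fenced":
        return True
    return tag == "kernel" and msg.startswith(("CMAN:", "Bad page state"))


def timeline(lines, year=2006):
    """Return (time, host, tag, message) tuples for cluster events, oldest first.

    `year` is an assumption supplied by the caller, because syslog
    timestamps in this format omit it.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # wrapped continuation line or a non-syslog line
        tag, msg = match.group("tag"), match.group("msg")
        if not is_cluster_event(tag, msg):
            continue
        when = datetime.strptime(
            f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
        )
        events.append((when, match.group("host"), tag, msg))
    # sorted() is stable, so entries with equal timestamps keep input order.
    return sorted(events, key=lambda event: event[0])


if __name__ == "__main__":
    import sys

    # Usage: concatenate the nodes' log files and pipe them in on stdin.
    for when, host, tag, msg in timeline(sys.stdin):
        print(f"{when:%b %d %H:%M:%S} {host:8} {tag:8} {msg}")
```

Log lines that were wrapped by a mail client will not match the pattern and are skipped, so the script should be fed the original log files rather than text pasted from a message.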

I'm running RHEL4U4 (cman 1.0.11-0, cman-kernel-smp 2.6.9-45.5, dlm 1.0.1-1,
magma 1.0.6-0, rgmanager 1.9.53) on x86_64 hardware.
-------------- next part --------------
Nov  6 20:17:48 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov  6 20:17:51 bamf03 sshd(pam_unix)[10896]: session opened for user root by (uid=0)
Nov  6 20:18:18 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov  6 20:19:18 bamf03 last message repeated 2 times
Nov  6 20:20:48 bamf03 last message repeated 3 times
Nov  6 20:21:18 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov  6 20:21:34 bamf03 kernel: Bad page state at prep_new_page (in process 'smbd', page 00000101fe80fec0)
Nov  6 20:21:34 bamf03 kernel: flags:0x05001078 mapping:000001010c7f75e8 mapcount:0 count:2
Nov  6 20:21:34 bamf03 kernel: Backtrace:
Nov  6 20:21:34 bamf03 kernel:
Nov  6 20:21:34 bamf03 kernel: Call Trace:<ffffffff8015d383>{bad_page+112} <ffffffff8015dd41>{buffered_rmqueue+520}
Nov  6 20:21:34 bamf03 kernel:        <ffffffff802a721f>{sock_sendmsg+271} <ffffffff8015de7f>{__alloc_pages+211}
Nov  6 20:21:34 bamf03 kernel:        <ffffffff8015e145>{__get_free_pages+11} <ffffffff8018b3d7>{__pollwait+58}
Nov  6 20:21:34 bamf03 kernel:        <ffffffff802ad4af>{datagram_poll+39} <ffffffff802ad488>{datagram_poll+0}
Nov  6 20:21:34 bamf03 kernel:        <ffffffff802ad488>{datagram_poll+0} <ffffffff8018b6e8>{do_select+656}
Nov  6 20:21:34 bamf03 kernel:        <ffffffff8018b39d>{__pollwait+0} <ffffffff8018bb82>{sys_select+820}
Nov  6 20:21:34 bamf03 kernel:        <ffffffff801932d8>{dnotify_parent+34} <ffffffff8011026a>{system_call+126}
Nov  6 20:21:34 bamf03 kernel:
Nov  6 20:21:34 bamf03 kernel: Trying to fix it up, but a reboot is needed
Nov  6 20:21:48 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov  6 20:22:08 bamf03 kernel: Bad page state at prep_new_page (in process 'ip.sh', page 00000101fe80c730)
Nov  6 20:22:08 bamf03 kernel: flags:0x0500102c mapping:0000010079d9a3e0 mapcount:0 count:2
Nov  6 20:22:08 bamf03 kernel: Backtrace:
Nov  6 20:22:08 bamf03 kernel:
Nov  6 20:22:08 bamf03 kernel: Call Trace:<ffffffff8015d383>{bad_page+112} <ffffffff8015dd41>{buffered_rmqueue+520}
Nov  6 20:22:08 bamf03 kernel:        <ffffffff8015de7f>{__alloc_pages+211} <ffffffff801696e6>{do_no_page+651}
Nov  6 20:22:08 bamf03 kernel:        <ffffffff8015c5cb>{__generic_file_aio_read+385} <ffffffff80169ca7>{handle_mm_fault+373}
Nov  6 20:22:08 bamf03 kernel:        <ffffffff8015c7af>{generic_file_aio_read+48} <ffffffff801793e8>{do_sync_read+173}
Nov  6 20:22:08 bamf03 kernel:        <ffffffff8018f01d>{dput+56} <ffffffff80123e9a>{do_page_fault+518}
Nov  6 20:22:08 bamf03 kernel:        <ffffffff80135756>{autoremove_wake_function+0} <ffffffff801932d8>{dnotify_parent+34}
Nov  6 20:22:08 bamf03 kernel:        <ffffffff8017950c>{vfs_read+248} <ffffffff80110d91>{error_exit+0}
Nov  6 20:22:08 bamf03 kernel:
Nov  6 20:22:08 bamf03 kernel: Trying to fix it up, but a reboot is needed
Nov  6 20:22:16 bamf03 kernel: Bad page state at prep_new_page (in process 'smbd', page 00000101fe816ec0)
Nov  6 20:22:16 bamf03 kernel: flags:0x05001028 mapping:000001018b7eea30 mapcount:0 count:2
Nov  6 20:22:16 bamf03 kernel: Backtrace:
Nov  6 20:22:16 bamf03 kernel:
Nov  6 20:22:16 bamf03 kernel: Call Trace:<ffffffff8015d383>{bad_page+112} <ffffffff8015dd41>{buffered_rmqueue+520}
Nov  6 20:22:16 bamf03 kernel:        <ffffffff802a721f>{sock_sendmsg+271} <ffffffff8015de7f>{__alloc_pages+211}
Nov  6 20:22:16 bamf03 kernel:        <ffffffff8015e145>{__get_free_pages+11} <ffffffff8018b3d7>{__pollwait+58}
Nov  6 20:22:16 bamf03 kernel:        <ffffffff802cff03>{tcp_poll+44} <ffffffff8018b6e8>{do_select+656}
Nov  6 20:22:16 bamf03 kernel:        <ffffffff8018b39d>{__pollwait+0} <ffffffff8018bb82>{sys_select+820}
Nov  6 20:22:16 bamf03 kernel:        <ffffffff801932d8>{dnotify_parent+34} <ffffffff8011026a>{system_call+126}
Nov  6 20:22:16 bamf03 kernel:
Nov  6 20:22:16 bamf03 kernel: Trying to fix it up, but a reboot is needed
Nov  6 20:22:18 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar status
Nov  6 20:22:38 bamf03 clurgmgrd[4170]: <notice> Stopping service cougar-compout
Nov  6 20:22:38 bamf03 clurgmgrd: [4170]: <info> Executing /etc/init.d/rsyncd-cougar stop
Nov  6 20:22:38 bamf03 clurgmgrd: [4170]: <info> Removing IPv4 address 192.168.10.22 from bond0
Nov  6 20:22:41 bamf03 clurgmgrd: [4170]: <info> Stopping Samba instance "cougar"
Nov  6 20:22:41 bamf03 nmbd[30156]: [2006/11/06 20:22:41, 0] nmbd/nmbd.c:terminate(56)
Nov  6 20:22:41 bamf03 nmbd[30156]:   Got SIGTERM: going down...
Nov  6 20:22:41 bamf03 nmbd[30156]: [2006/11/06 20:22:41, 0] libsmb/nmblib.c:send_udp(790)
Nov  6 20:22:41 bamf03 nmbd[30156]:   Packet send failed to 192.168.255.255(138) ERRNO=Invalid argument
Nov  6 20:23:10 bamf03 sshd(pam_unix)[13090]: session opened for user root by root(uid=0)
Nov  6 20:24:16 bamf03 sshd(pam_unix)[13146]: session opened for user root by root(uid=0)
Nov  6 20:24:36 bamf03 kernel: CMAN: removing node bamf01 from the cluster : Missed too many heartbeats
Nov  6 20:24:38 bamf03 kernel: clustat[13184] trap stack segment rip:33512b1c13 rsp:7fbffff840 error:0
Nov  6 21:36:04 bamf03 syslogd 1.4.1: restart.