[Linux-cluster] rgmanager crash, deadlock?

Thu Nov 9 16:12:17 UTC 2006

On Tue, 2006-11-07 at 12:29 -0800, aberoham at gmail.com wrote:
> 
> Last night one of my five cluster nodes suffered a hardware failure
> (memory, cpu?). The other nodes properly fenced the failed machine,
> but no matter what clusvcadm command I ran, I could not get the other
> cluster members to start, stop or disable the cluster resource
> group/service that had been running on the failed node. (the resource
> group/service that was running on the failed node includes an EXT3 fs,
> an IP address, an rsyncd and an smbd init script) 
> 
> The "clusvcadm -d [service]" command would just hang for minutes and
> not return. "clustat" initially reported the rg/service in an unknown
> state, then stopped reporting rgmanager status and only showed cman
> status. The cluster remained quorate the entire time. Resource
> groups/services on non-failed nodes continued to run, but no matter
> what I tried I could not get rgmanager status on any node. 
> 	
> I had to reset the entire cluster to get things back to normal. (This
> is a heavily used operational system so I didn't have time to do
> further debugging.) My logs don't show any rgmanager-related error
> messages, only fencing status: 
> 
> Nov  6 20:24:37 bamf02 kernel: CMAN: removing node bamf03 from the
> cluster : Missed too many heartbeats
> Nov  6 20:24:38 bamf02 fenced[5913]: fencing deferred to bamf01
> ---
> Nov  6 20:24:37 bamf01 kernel: CMAN: node bamf03 has been removed from
> the cluster : Missed too many heartbeats 
> Nov  6 20:24:38 bamf01 fenced[5756]: bamf03 not a cluster member after
> 0 sec post_fail_delay
> Nov  6 20:24:38 bamf01 fenced[5756]: fencing node "bamf03"
> Nov  6 20:24:46 bamf01 fenced[5756]: fence "bamf03" success 
> Nov  6 20:30:36 bamf01 sshd(pam_unix)[27244]: session opened for user
> root by root(uid=0)
> Nov  6 20:36:29 bamf01 kernel: CMAN: node bamf03 rejoining
> Nov  6 20:42:55 bamf01 shutdown: shutting down for system reboot 
> ---
> 
> I'm running RHEL4U4 (cman 1.0.11-0, cman-kernel-smp 2.6.9-45.5, dlm
> 1.0.1-1, magma 1.0.6-0 rgmanager 1.9.53) on x86_64 hardware.

What did "cman_tool status" show?

Did rgmanager crash (i.e., did "service rgmanager status" report it as dead)?

Was there anything in dmesg indicating a DLM error?
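
If it happens again, something along these lines, run on a surviving node
before resetting anything, would capture the state asked about above. This
is only a sketch: run_if_present is a made-up helper for illustration, not
part of any cluster package, and the exact output of each tool depends on
the installed version.

```shell
#!/bin/sh
# Sketch: gather cman, rgmanager and DLM state in one pass.
# run_if_present is a hypothetical helper: it runs a command only if it
# exists on this machine, so the script does not abort on a missing tool.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@" 2>&1
    else
        echo "== $1 not found, skipped =="
    fi
    return 0
}

run_if_present cman_tool status          # quorum and membership summary
run_if_present cman_tool nodes           # per-node membership state
run_if_present service rgmanager status  # is the rgmanager daemon still alive?

# Kernel messages that mention the DLM, if any.
dmesg 2>/dev/null | grep -i dlm || echo "no dlm messages in dmesg"
```

Redirecting the output to a file (for example "sh collect.sh > /tmp/cluster-state.txt",
where collect.sh is whatever name the script was saved under) makes it easy
to attach to a follow-up.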

-- Lon