[Linux-cluster] Strange crash with cman (stable branch)

Olivier Crête ocrete at max-t.com
Tue Jan 16 01:05:15 UTC 2007


Hi,

I had 9 nodes running kernel 2.6.17.11 with a snapshot of the cman STABLE
tree (with in-kernel cman). No dlm, fenced or gfs. We have our own app
and do the fencing ourselves. After 3 nodes died (for unrelated
reasons), all of the cman nodes disconnected, even though the service
using cman was still running. On every node, dmesg showed messages
like the following:

CMAN: node ia-009 has been removed from the cluster : Missed too many heartbeats
CMAN: node ia-008 has been removed from the cluster : Missed too many heartbeats
CMAN: bad generation number 17 in HELLO message from 4, expected 16
CMAN: removing node ia-007 from the cluster : No response to messages
CMAN: node ia-006 has been removed from the cluster : No response to messages
CMAN: removing node ia-002 from the cluster : No response to messages
CMAN: removing node ia-004 from the cluster : No response to messages
CMAN: removing node ia-005 from the cluster : No response to messages
CMAN: removing node ia-003 from the cluster : No response to messages
CMAN: quorum lost, blocking activity
CMAN: node ia-001 has been removed from the cluster : No response to messages
CMAN: killed by NODEDOWN message
CMAN: we are leaving the cluster. No response to messages
SM: 03000003 sm_stop: SG still joined


Nodes ia-00[789] are the ones that crashed, and these messages appear
on the 6 others.
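
For anyone wanting to compare the dmesg output across the surviving
nodes, here is a minimal Python sketch that groups the removed nodes by
the reason cman logged. The regular expression is inferred only from the
two message forms quoted above, not from the cman source, and the helper
name summarize_removals is hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the two message forms quoted in this post
# (an assumption, not taken from the cman source):
#   "CMAN: node X has been removed from the cluster : REASON"
#   "CMAN: removing node X from the cluster : REASON"
REMOVAL_RE = re.compile(
    r"CMAN: (?:node (?P<a>\S+) has been removed|removing node (?P<b>\S+)) "
    r"from the cluster : (?P<reason>.+)"
)


def summarize_removals(lines):
    """Group removed node names by logged reason (hypothetical helper)."""
    by_reason = defaultdict(list)
    for line in lines:
        m = REMOVAL_RE.search(line)
        if m:
            # Exactly one of the two named groups matches per line.
            node = m.group("a") or m.group("b")
            by_reason[m.group("reason").strip()].append(node)
    return dict(by_reason)
```

Feeding it the saved dmesg output of each node, for example
summarize_removals(open("dmesg-ia-001.txt")) with a file name of your
choosing, makes it easier to see which nodes were dropped for missed
heartbeats and which for unanswered messages.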

-- 
Olivier Crête
ocrete at max-t.com
Maximum Throughput Inc.




