[Linux-cluster] cluster lost quorum after 11 hours
Patrick Caulfield
pcaulfie at redhat.com
Mon Feb 14 15:12:32 UTC 2005
On Fri, Feb 11, 2005 at 04:47:38PM -0800, Daniel McNeil wrote:
> I was running my test on a 3 node cluster and it died
> after 11 hours. cl030 lost quorum with the other 2 nodes
> kicked out of the cluster. cl031 also hit a bunch of asserts
> like
> lock_dlm: Assertion failed on line 352 of file
> /Views/redhat-cluster/cluster/gfs-kernel/src/dlm/lock.c
> lock_dlm: assertion: "!error"
> lock_dlm: time = 291694516
> stripefs: error=-22 num=2,19
> I assume this is caused by the cluster shutting down.
>
>
> /var/log/messages showed:
>
> cl030:
> Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl032a from the cluster : No response to messages
> Feb 11 02:44:33 cl030 kernel: CMAN: removing node cl031a from the cluster : No response to messages
> Feb 11 02:44:33 cl030 kernel: CMAN: quorum lost, blocking activity
> Feb 11 14:40:33 cl030 sshd(pam_unix)[27323]: session opened for user root by (uid=0)
You should only see nodes dying from "No response to messages" during a state
transition of some sort (e.g. a node leaving or joining, or possibly a GFS
mount/dismount), in which case the DLM has to do recovery. I recently checked in
a couple of changes that stop DLM recovery from taking over the machine when
there are several thousand locks to recover; that might help.
During a normal "steady" state, a node should not die from
"No response to messages" because the only messages that are being sent are
HELLO heartbeat messages and they are not acked.
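
To make the distinction concrete, here is a toy Python model of the reasoning above. It is only a sketch and not how CMAN is implemented: the `Message` class, the `unresponsive_nodes` function and the "TRANSITION" message kind are hypothetical names invented for illustration. It models only the ack path: a node can be flagged for not responding only if it was sent a message that expects an acknowledgement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    kind: str        # illustrative label, e.g. "HELLO" or "TRANSITION"
    dest: str        # destination node name
    needs_ack: bool  # whether the sender waits for an acknowledgement


def unresponsive_nodes(sent, acked):
    """Return the nodes that failed to ack a message that required one.

    sent  -- iterable of Message objects that were transmitted
    acked -- set of (kind, dest) pairs for which an ack arrived
    """
    missing = set()
    for msg in sent:
        # Unacked message types (such as heartbeats) can never be "missing".
        if msg.needs_ack and (msg.kind, msg.dest) not in acked:
            missing.add(msg.dest)
    return sorted(missing)


# Steady state: only unacked heartbeats are in flight.
steady = [
    Message("HELLO", "cl031a", needs_ack=False),
    Message("HELLO", "cl032a", needs_ack=False),
]
print(unresponsive_nodes(steady, acked=set()))  # []

# State transition: acked messages are sent, and a silent node stands out.
transition = steady + [
    Message("TRANSITION", "cl031a", needs_ack=True),
    Message("TRANSITION", "cl032a", needs_ack=True),
]
print(unresponsive_nodes(transition, acked={("TRANSITION", "cl031a")}))  # ['cl032a']
```

In the steady-state case nothing is waiting on an ack, so no node can be reported. Once acked messages are in flight, a node that stays silent shows up in the result.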
--
patrick