[Linux-cluster] Interesting cluster case after a node hardware failure

Cosimo Streppone cosimo at streppone.it
Tue May 9 09:09:34 UTC 2006


I'm going through an interesting case that I don't fully understand.
I found some log messages that I had never seen before, and they are
quite worrying. I report them here for quick reference, then include
a full log extract with comments on what I think happened.
Please correct me where I'm wrong.

This is on Red Hat Enterprise Linux ES 4 Update 3 with RHCS 4,
with hand-modified service init scripts to make them work with Cluster Suite 4.

   node2 kernel: CMAN: too many transition restarts - will die
   node2 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
   node2 kernel: WARNING: dlm_emergency_shutdown
   node2 kernel: WARNING: dlm_emergency_shutdown
   node2 kernel: SM: 00000001 sm_stop: SG still joined
   node2 kernel: SM: 01000003 sm_stop: SG still joined
   node2 kernel: SM: 03000002 sm_stop: SG still joined
   node2 clurgmgrd[2820]: <warning> #67: Shutting down uncleanly
   node2 ccsd[2473]: Cluster manager shutdown.  Attemping to reconnect...
   node2 ccsd[2473]: Cluster is not quorate.  Refusing connection.
   node2 ccsd[2473]: Error while processing connect: Connection refused
   node2 ccsd[2473]: Invalid descriptor specified (-111).
   node2 ccsd[2473]: Someone may be attempting something evil.
   node2 ccsd[2473]: Error while processing get: Invalid request descriptor
   node2 ccsd[2473]: Invalid descriptor specified (-111).
   node2 ccsd[2473]: Someone may be attempting something evil.
   node2 ccsd[2473]: Error while processing get: Invalid request descriptor
   node2 ccsd[2473]: Invalid descriptor specified (-21).
   node2 ccsd[2473]: Someone may be attempting something evil.
   node2 ccsd[2473]: Error while processing disconnect: Invalid request descriptor
   node2 clurgmgrd: [2820]: <info> Executing /etc/rc.d/init.d/xinetd stop

The cluster is composed of two nodes (node1 and node2): two HP DL360 machines
with iLO devices configured for fencing, but not physically connected for now.
There is only one service, with several shared resources attached
(a filesystem, init scripts, and one IP address). A rough sketch of the
relevant cluster.conf follows below.
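
For completeness, the configuration is structured roughly like this, assuming
the usual two_node setup for a two-member CMAN cluster; the iLO addresses and
credentials below are placeholders, not my real values, and the resource
definitions are trimmed:

   <?xml version="1.0"?>
   <cluster name="mycluster" config_version="1">
      <!-- two_node/expected_votes let a two-node cluster keep quorum with one member -->
      <cman two_node="1" expected_votes="1"/>
      <clusternodes>
         <clusternode name="node1" votes="1">
            <fence>
               <method name="1">
                  <device name="ilo-node1"/>
               </method>
            </fence>
         </clusternode>
         <clusternode name="node2" votes="1">
            <fence>
               <method name="1">
                  <device name="ilo-node2"/>
               </method>
            </fence>
         </clusternode>
      </clusternodes>
      <fencedevices>
         <!-- placeholder iLO addresses and credentials -->
         <fencedevice name="ilo-node1" agent="fence_ilo" hostname="10.0.0.1" login="admin" passwd="secret"/>
         <fencedevice name="ilo-node2" agent="fence_ilo" hostname="10.0.0.2" login="admin" passwd="secret"/>
      </fencedevices>
      <rm>
         <!-- single service with fs, ip and script resources, omitted for brevity -->
      </rm>
   </cluster>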

As I said, below is an extract of the "messages" log that shows the series
of events which led to the malfunctioning clustered service.

Can anyone please shed some light on this?
Thank you for any suggestions.


The trace begins as node1 dies of a hardware failure.

---------8<--------------------8<---------------

   May  8 15:30:20 node2 kernel: CMAN: removing node node1 from the cluster : Missed too many heartbeats
   May  8 15:30:20 node2 fenced[2540]: node1 not a cluster member after 0 sec post_fail_delay
   May  8 15:30:20 node2 fenced[2540]: fencing node "node1"
   May  8 15:30:23 node2 fenced[2540]: agent "fence_ilo" reports: connect: No route to host at 
/opt/perl/lib/site_perl/5.8.6/linux/Net/SSL
   May  8 15:30:23 node2 fenced[2540]: fence "node1" failed

Fencing could never work here because the iLO interface of node1
was down (and *not* connected, FWIW).
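
For what it's worth, the agent can also be run by hand from node2 to confirm
this; something along these lines should reproduce the same "No route to host"
as long as the iLO port stays unplugged (the address and credentials here are
placeholders, and option names may differ between fence_ilo versions, so check
its usage output first):

   # manual test of the iLO fence agent from node2
   fence_ilo -a 10.0.0.1 -l admin -p secret -o status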

   [...]

   May  8 15:31:55 node2 kernel: CMAN: node node1 rejoining
   May  8 15:31:59 node2 clurgmgrd[2820]: <info> Magma Event: Membership Change
   May  8 15:31:59 node2 clurgmgrd[2820]: <info> State change: node1 DOWN
   May  8 15:31:59 node2 clurgmgrd: [2820]: <info> Executing /etc/rc.d/init.d/xinetd status

The cluster recognizes that node1 is down. Ok.
I don't understand the "rejoining" message though; node1 was down.

   [...]

   May  8 15:36:40 node2 kernel: CMAN: too many transition restarts - will die
   May  8 15:36:40 node2 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
   May  8 15:36:40 node2 kernel: WARNING: dlm_emergency_shutdown
   May  8 15:36:40 node2 kernel: WARNING: dlm_emergency_shutdown
   May  8 15:36:40 node2 kernel: SM: 00000001 sm_stop: SG still joined
   May  8 15:36:40 node2 kernel: SM: 01000003 sm_stop: SG still joined
   May  8 15:36:40 node2 kernel: SM: 03000002 sm_stop: SG still joined
   May  8 15:36:40 node2 clurgmgrd[2820]: <warning> #67: Shutting down uncleanly
   May  8 15:36:40 node2 ccsd[2473]: Cluster manager shutdown.  Attemping to reconnect...
   May  8 15:36:40 node2 ccsd[2473]: Cluster is not quorate.  Refusing connection.
   May  8 15:36:40 node2 ccsd[2473]: Error while processing connect: Connection refused
   May  8 15:36:40 node2 ccsd[2473]: Invalid descriptor specified (-111).
   May  8 15:36:40 node2 ccsd[2473]: Someone may be attempting something evil.
   May  8 15:36:40 node2 ccsd[2473]: Error while processing get: Invalid request descriptor
   May  8 15:36:40 node2 ccsd[2473]: Invalid descriptor specified (-111).
   May  8 15:36:40 node2 ccsd[2473]: Someone may be attempting something evil.
   May  8 15:36:40 node2 ccsd[2473]: Error while processing get: Invalid request descriptor
   May  8 15:36:40 node2 ccsd[2473]: Invalid descriptor specified (-21).
   May  8 15:36:40 node2 ccsd[2473]: Someone may be attempting something evil.
   May  8 15:36:40 node2 ccsd[2473]: Error while processing disconnect: Invalid request descriptor
   May  8 15:36:40 node2 clurgmgrd: [2820]: <info> Executing /etc/rc.d/init.d/xinetd stop

From this point on, all services are shut down by the cluster resource manager daemon.
But what could have happened to trigger a `dlm_emergency_shutdown'?

   May  8 15:36:40 node2 ccsd[2473]: Cluster is not quorate.  Refusing connection.
   May  8 15:36:40 node2 ccsd[2473]: Error while processing connect: Connection refused
   May  8 15:36:40 node2 ccsd[2473]: Invalid descriptor specified (-111).
   May  8 15:36:40 node2 ccsd[2473]: Someone may be attempting something evil.
   May  8 15:36:40 node2 ccsd[2473]: Error while processing get: Invalid request descriptor
   May  8 15:36:40 node2 ccsd[2473]: Invalid descriptor specified (-111).
   May  8 15:36:40 node2 ccsd[2473]: Someone may be attempting something evil.
   May  8 15:36:40 node2 ccsd[2473]: Error while processing get: Invalid request descriptor
   May  8 15:36:40 node2 ccsd[2473]: Invalid descriptor specified (-21).
   May  8 15:36:40 node2 ccsd[2473]: Someone may be attempting something evil.
   May  8 15:36:40 node2 ccsd[2473]: Error while processing disconnect: Invalid request descriptor

   [...]

All services were shut down, the shared IP address was released,
and the SAN volume was unmounted.

   May  8 15:37:26 node2 ccsd[2473]: Unable to connect to cluster infrastructure after 60 seconds.
   ...
   May  8 15:41:26 node2 ccsd[2473]: Unable to connect to cluster infrastructure after 300 seconds.
   ...

The morning after, node2 was rebooted. The shutdown was not clean, but
rebooting restored the cluster to a consistent state.
Node1 is still not accessible due to the complete hardware failure.
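
If it helps, membership, quorum and service state on node2 after the reboot
can be checked with something like the following (I'm omitting the output here):

   # cluster membership and quorum as seen by CMAN
   cman_tool status
   cman_tool nodes

   # clustered service state as seen by rgmanager
   clustat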

   May  9 09:08:05 node2 fenced: Stopping fence domain:
   May  9 09:08:05 node2 fenced: shutdown failed
   May  9 09:08:05 node2 fenced: ESC[60G
   May  9 09:08:05 node2 fenced:
   May  9 09:08:05 node2 rc: Stopping fenced:  failed
   May  9 09:08:05 node2 lock_gulmd: Stopping lock_gulmd:
   May  9 09:08:05 node2 lock_gulmd: shutdown succeeded
   May  9 09:08:05 node2 lock_gulmd: ESC[60G
   May  9 09:08:05 node2 lock_gulmd:
   May  9 09:08:05 node2 rc: Stopping lock_gulmd:  succeeded
   May  9 09:08:05 node2 cman: Stopping cman:
   May  9 09:08:08 node2 cman: failed to stop cman failed
   May  9 09:08:08 node2 cman: ESC[60G
   May  9 09:08:08 node2 cman:
   May  9 09:08:08 node2 rc: Stopping cman:  failed
   May  9 09:08:08 node2 ccsd: Stopping ccsd:
   May  9 09:08:08 node2 ccsd[2473]: Stopping ccsd, SIGTERM received.
   May  9 09:08:09 node2 ccsd: shutdown succeeded
   May  9 09:08:09 node2 ccsd: ESC[60G
   May  9 09:08:09 node2 ccsd:
   May  9 09:08:09 node2 rc: Stopping ccsd:  succeeded
   May  9 09:08:09 node2 irqbalance: irqbalance shutdown succeeded
   May  9 09:08:09 node2 multipathd: mpath0: stop event checker thread
   May  9 09:08:09 node2 multipathd: multipathd shutdown succeeded
   May  9 09:08:09 node2 kernel: Kernel logging (proc) stopped.
   May  9 09:08:09 node2 kernel: Kernel log daemon terminating.
   May  9 09:08:10 node2 syslog: klogd shutdown succeeded
   May  9 09:08:10 node2 exiting on signal 15

---------8<--------------------8<---------------

End of trace.

?

-- 
Cosimo



