[Linux-cluster] Fencing when missed too many heartbeats

rhurst at bidmc.harvard.edu
Mon Mar 17 13:01:23 UTC 2008


We already have a premium subscription ticket open on this, but I wanted
to put the question to this development list as well, in the hope that
its software engineers can make this scenario clearer to users:

1)  When one node detects 'missed too many heartbeats', what
decision-making process goes into effect towards the final outcome of
fencing the node?

2)  If a few nodes are down for maintenance, and they left the cluster
with "remove" for adjustment of 'quorum' count, but not 'expected'
count, how might this affect question #1?

It would be even better if the responses could use our RHEL AS 4.5
11-node cluster as an example:

$ cman_tool nodes
Node  Votes Exp Sts  Name
   1    1   19   M   db2
   2    5   19   M   net1
   3    5   19   M   net2
   4    1   19   M   db4
   5    1   19   M   db1
   6    1   19   M   db5
   7    1   19   X   app3
   8    1   19   X   app2
   9    1   19   M   app6
  10    1   19   M   db3
  11    1   19   X   net3

LVS network tier: net1 (5-votes), net2 (5-votes), net3 (remove)
Application tier: app2 (remove), app3 (remove), app6
Database tier: db1, db2, db3, db4, db5

Expected: 19, Quorum: 9, Total votes: 16
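
To make the vote arithmetic above easier to check, here is a rough sketch
that parses the `cman_tool nodes` table from this post and recomputes the
totals. The quorum rule used here (a simple majority, floor(votes / 2) + 1)
is my assumption, not something taken from the CMAN source; it happens to
reproduce the reported quorum of 9 from 16 member votes. The helper names
are made up for illustration.

```python
import re

# Output of `cman_tool nodes`, copied from this post.
CMAN_NODES = """\
Node  Votes Exp Sts  Name
   1    1   19   M   db2
   2    5   19   M   net1
   3    5   19   M   net2
   4    1   19   M   db4
   5    1   19   M   db1
   6    1   19   M   db5
   7    1   19   X   app3
   8    1   19   X   app2
   9    1   19   M   app6
  10    1   19   M   db3
  11    1   19   X   net3
"""


def parse_nodes(text):
    """Return (name, votes, status) tuples for each data row of the table."""
    rows = []
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(\d+)\s+(\d+)\s+(\S)\s+(\S+)\s*$", line)
        if m:  # the header row does not start with a number and is skipped
            rows.append((m.group(5), int(m.group(2)), m.group(4)))
    return rows


def simple_majority(votes):
    # ASSUMPTION: quorum is a plain majority of the votes counted.
    return votes // 2 + 1


nodes = parse_nodes(CMAN_NODES)
configured_votes = sum(v for _, v, _ in nodes)            # all 11 nodes
member_votes = sum(v for _, v, s in nodes if s == "M")    # status M only
quorum = simple_majority(member_votes)
net1_votes = next(v for name, v, _ in nodes if name == "net1")
votes_without_net1 = member_votes - net1_votes

print("configured votes:", configured_votes)
print("member votes:", member_votes)
print("simple-majority quorum:", quorum)
print("votes left without net1:", votes_without_net1)
print("still quorate without net1:", votes_without_net1 >= quorum)
```

Under that assumption, the remaining members would still hold more votes
than the quorum after losing net1 and its 5 votes.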

FYI: the nodes net3, app2, app3 left this cluster with "remove" to do
some isolated testing of the RHEL AS 4.6 update, but only net3 was left
powered on.  It was in this state for over a week.

The syslog messages from each member show that net1 went 'dark':

Mar 15 16:20:28 net2 kernel: CMAN: node net1 has been removed from the
cluster : Missed too many heartbeats
Mar 15 16:20:29 net2 fenced[19273]: fencing deferred to db2
Mar 15 16:23:05 net2 clurgmgrd[20012]: <info> Magma Event: Membership
Change 
Mar 15 16:23:05 net2 clurgmgrd[20012]: <info> State change: net1 DOWN 

Mar 15 12:29:16 app6 kernel: CMAN: node net1 has been removed from the
cluster : Missed too many heartbeats
Mar 15 12:29:17 app6 fenced[19015]: fencing deferred to db2
Mar 15 12:31:53 app6 clurgmgrd[21831]: <info> Magma Event: Membership
Change 
Mar 15 12:31:53 app6 clurgmgrd[21831]: <info> State change: net1 DOWN 

Mar 15 16:29:19 db1 kernel: CMAN: node net1 has been removed from the
cluster : Missed too many heartbeats
Mar 15 16:29:20 db1 fenced[19297]: fencing deferred to db2
Mar 15 16:31:56 db1 clurgmgrd[21436]: <info> Magma Event: Membership
Change 
Mar 15 16:31:56 db1 clurgmgrd[21436]: <info> State change: net1 DOWN 

Mar 15 16:29:19 db2 kernel: CMAN: removing node net1 from the cluster :
Missed too many heartbeats
Mar 15 16:29:20 db2 fenced[14778]: net1 not a cluster member after 0 sec
post_fail_delay
Mar 15 16:29:20 db2 fenced[14778]: fencing node "net1"
Mar 15 16:31:48 db2 ccsd[14677]: Attempt to close an unopened CCS
descriptor (151704870). 
Mar 15 16:31:48 db2 ccsd[14677]: Error while processing disconnect:
Invalid request descriptor 
Mar 15 16:31:48 db2 fenced[14778]: fence "net1" success

Mar 15 16:29:19 db3 kernel: CMAN: node net1 has been removed from the
cluster : Missed too many heartbeats
Mar 15 16:29:20 db3 fenced[19097]: fencing deferred to db2
Mar 15 16:31:56 db3 clurgmgrd[21315]: <info> Magma Event: Membership
Change 
Mar 15 16:31:56 db3 clurgmgrd[21315]: <info> State change: net1 DOWN 

Mar 15 16:29:19 db4 kernel: CMAN: node net1 has been removed from the
cluster : Missed too many heartbeats
Mar 15 16:29:20 db4 fenced[19126]: fencing deferred to db2
Mar 15 16:31:56 db4 clurgmgrd[21182]: <info> Magma Event: Membership
Change 
Mar 15 16:31:56 db4 clurgmgrd[21182]: <info> State change: net1 DOWN 

Mar 15 16:29:19 db5 kernel: CMAN: node net1 has been removed from the
cluster : Missed too many heartbeats
Mar 15 16:29:20 db5 fenced[14508]: fencing deferred to db2
Mar 15 16:31:56 db5 clurgmgrd[17187]: <info> Magma Event: Membership
Change 
Mar 15 16:31:56 db5 clurgmgrd[17187]: <info> State change: net1 DOWN

It may be of no consequence, but note that there was clock drift on
net2, because of a failed NTP server, and also on app6, because its
clock was not recalibrated after it had been down for a few weeks for a
motherboard swap and memory upgrade.
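
As a quick sanity check on the clock drift, the sketch below takes the
timestamps from the syslog excerpts above and computes, for each node, the
interval between the "removed from the cluster" message and the "State
change: net1 DOWN" message, plus each node's apparent offset from db1. The
offsets are only meaningful under my assumption that every member noticed
the loss of net1 at roughly the same real instant; the variable names are
made up for illustration.

```python
from datetime import datetime

# (node, time of "removed from the cluster", time of "State change: net1 DOWN")
# Timestamps copied from the Mar 15 syslog lines in this post.
EVENTS = [
    ("net2", "16:20:28", "16:23:05"),
    ("app6", "12:29:16", "12:31:53"),
    ("db1",  "16:29:19", "16:31:56"),
    ("db3",  "16:29:19", "16:31:56"),
    ("db4",  "16:29:19", "16:31:56"),
    ("db5",  "16:29:19", "16:31:56"),
]


def seconds_between(start, end):
    """Signed number of seconds from start to end (same day, HH:MM:SS)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Interval each node logged between removal and the rgmanager DOWN event.
intervals = {node: seconds_between(removed, down)
             for node, removed, down in EVENTS}

# Apparent clock offset of each node relative to db1's removal timestamp.
# ASSUMPTION: all nodes detected the missed heartbeats at the same instant.
reference = "16:29:19"
offsets = {node: seconds_between(reference, removed)
           for node, removed, _ in EVENTS}

# On db2: time from 'fencing node "net1"' to 'fence "net1" success'.
fence_seconds = seconds_between("16:29:20", "16:31:48")

for node in intervals:
    print(node, "interval:", intervals[node], "s, offset vs db1:",
          offsets[node], "s")
print("db2 fence duration:", fence_seconds, "s")
```

If the intervals agree across nodes while only the absolute timestamps
differ, that would be consistent with clock offsets on net2 and app6 rather
than with those nodes detecting the failure at a different time.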


Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ∙ Fax: 617-754-8730 ∙ Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.

