[Linux-cluster] reasons for sporadic token loss?

Heiko Nardmann heiko.nardmann at itechnical.de
Tue Jul 31 13:57:33 UTC 2012


Hi all!

I am experiencing sporadic problems with my cluster setup. Maybe someone 
has an idea? But first some facts:

Type: RHEL 6.1 two-node cluster (corosync 1.2.3-36) on two Dell R610s, 
each with a quad-port NIC

NICs:
- interfaces em1/em2 are bonded as bond1 using mode 5 (balance-tlb); 
these interfaces are cross-connected (intended for the cluster 
housekeeping communication) - no network element in between
- interfaces em3/em4 are bonded as bond0 using mode 1 (active-backup); 
these interfaces are connected to two switches
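
For completeness, this is roughly how I verify that the interconnect 
bond is healthy on each node (just a sketch; bond1 and the slave names 
em1/em2 are the ones from the setup above):

    cat /proc/net/bonding/bond1
    # expect the bonding mode reported as transmit load balancing and,
    # for each of em1/em2, "MII Status: up" with a stable failure count
    ethtool em1 | grep -i 'link detected'    # same check for em2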

Cluster configuration:

<?xml version="1.0"?>
<cluster config_version="51" name="my-cluster">
     <cman expected_votes="1" two_node="1"/>
     <clusternodes>
         <clusternode name="df1-clusterlink" nodeid="1">
             <fence>
                 <method name="VBoxManage-DF-1">
                     <device name="VBoxManage-DF-1" />
                 </method>
             </fence>
             <unfence>
             </unfence>
         </clusternode>
         <clusternode name="df2-clusterlink" nodeid="2">
             <fence>
                 <method name="VBoxManage-DF-2">
                     <device name="VBoxManage-DF-2" />
                  </method>
              </fence>
             <unfence>
             </unfence>
         </clusternode>
     </clusternodes>
     <fencedevices>
         <fencedevice name="VBoxManage-DF-1" agent="fence_vbox" 
vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64 
DF-System Server 1" />
         <fencedevice name="VBoxManage-DF-2" agent="fence_vbox" 
vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64 
DF-System Server 2" />
     </fencedevices>
     <rm>
         <resources>
             <ip address="10.200.104.15/27" monitor_link="on" 
sleeptime="10"/>
             <script file="/usr/share/cluster/app.sh" name="myapp"/>
         </resources>
         <failoverdomains>
             <failoverdomain name="fod-myapp" nofailback="0" ordered="1" 
restricted="0">
                 <failoverdomainnode name="df1-clusterlink" priority="1"/>
                 <failoverdomainnode name="df2-clusterlink" priority="2"/>
             </failoverdomain>
         </failoverdomains>
         <service domain="fod-myapp" exclusive="1" max_restarts="3" 
name="rg-myapp" recovery="restart" restart_expire_time="1">
             <script ref="myapp"/>
             <ip ref="10.200.104.15/27"/>
         </service>
     </rm>
     <logging debug="on"/>
     <gfs_controld enable_plock="0" plock_rate_limit="0"/>
     <dlm enable_plock="0" plock_ownership="1" plock_rate_limit="0"/>
</cluster>


--------------------------------------------------------------------------------

Problem:
Sometimes the second node "detects" that the token has been lost 
(corosync.log):

[no TOTEM messages before that]
Jul 28 13:00:10 corosync [TOTEM ] The token was lost in the OPERATIONAL 
state.
Jul 28 13:00:10 corosync [TOTEM ] A processor failed, forming new 
configuration.
Jul 28 13:00:10 corosync [TOTEM ] Receive multicast socket recv buffer 
size (262142 bytes).
Jul 28 13:00:10 corosync [TOTEM ] Transmit multicast socket send buffer 
size (262142 bytes).

This happens, let's say, once a week and leads to fencing of the first 
node. From 'corosync-objctl -a' it looks as if this might be due to a 
consensus timeout (an excerpt from the command's output follows); I 
have marked the lines that I consider important so far:

totem.transport=udp
totem.version=2
totem.nodeid=2
totem.vsftype=none
totem.token=10000
totem.join=60
totem.fail_recv_const=2500
totem.consensus=2000
totem.rrp_mode=none
totem.secauth=1
totem.key=my-cluster
totem.interface.ringnumber=0
totem.interface.bindnetaddr=172.16.42.2
totem.interface.mcastaddr=239.192.187.168
totem.interface.mcastport=5405
runtime.totem.pg.mrp.srp.orf_token_tx=3
runtime.totem.pg.mrp.srp.orf_token_rx=1103226
runtime.totem.pg.mrp.srp.memb_merge_detect_tx=395
runtime.totem.pg.mrp.srp.memb_merge_detect_rx=1098359
runtime.totem.pg.mrp.srp.memb_join_tx=38
runtime.totem.pg.mrp.srp.memb_join_rx=50
runtime.totem.pg.mrp.srp.mcast_tx=218
runtime.totem.pg.mrp.srp.mcast_retx=0
runtime.totem.pg.mrp.srp.mcast_rx=541
runtime.totem.pg.mrp.srp.memb_commit_token_tx=12
runtime.totem.pg.mrp.srp.memb_commit_token_rx=18
runtime.totem.pg.mrp.srp.token_hold_cancel_tx=49
runtime.totem.pg.mrp.srp.token_hold_cancel_rx=173
runtime.totem.pg.mrp.srp.operational_entered=6
runtime.totem.pg.mrp.srp.operational_token_lost=1
^^^
runtime.totem.pg.mrp.srp.gather_entered=7
runtime.totem.pg.mrp.srp.gather_token_lost=0
runtime.totem.pg.mrp.srp.commit_entered=6
runtime.totem.pg.mrp.srp.commit_token_lost=0
runtime.totem.pg.mrp.srp.recovery_entered=6
runtime.totem.pg.mrp.srp.recovery_token_lost=0
runtime.totem.pg.mrp.srp.consensus_timeouts=1
^^^
runtime.totem.pg.mrp.srp.mtt_rx_token=1727
runtime.totem.pg.mrp.srp.avg_token_workload=62244458
runtime.totem.pg.mrp.srp.avg_backlog_calc=0
runtime.totem.pg.mrp.srp.rx_msg_dropped=0
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(172.16.42.2)
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(172.16.42.1)
runtime.totem.pg.mrp.srp.members.1.join_count=3
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.blackbox.dump_flight_data=no
runtime.blackbox.dump_state=no
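
In case it helps, this is how I plan to keep an eye on those counters 
(and the token round-trip statistics) between incidents - simply a 
periodic grep over the full dump:

    corosync-objctl -a | egrep 'token_lost|consensus_timeouts|mtt_rx_token'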

Some questions at this point:
A) Why did the cluster lose the token? Due to a timeout? If so, the 
token timeout (10000) or the consensus timeout (2000)?
B) Why did the timeout elapse in the first place? Maybe that is 
connected with the answer to A ...?
C) Is it normal that token=10000 and consensus=2000, although the 
documentation says the defaults are token=1000 and 
consensus=1.2*token?
D) Since I suspect problems with the switches connecting the other 
interfaces (em3/em4, bonded to bond0), I wonder whether any cluster 
traffic goes that way instead of via bond1 (see the tcpdump sketch 
below)?

As stated above: the connection of em1/em2 (bond1) is a direct 
cross-connection without any network element in between.
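
To check D) I plan to capture on both bonds and confirm that the totem 
traffic (mcastaddr 239.192.187.168, port 5405 from the configuration 
dump above) only shows up on the cross-connected bond; the bond names 
are the ones I assume from my setup:

    tcpdump -n -i bond1 udp port 5405    # should show the totem traffic
    tcpdump -n -i bond0 udp port 5405    # should stay silent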

So far I want to add the following line to cluster.conf and see whether 
the situation improves:

     <totem token_retransmits_before_loss_const="10" 
fail_recv_const="100" consensus="12000"/>
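
In context this would look roughly as follows (only a sketch on my 
side: I assume the <totem> element goes directly under <cluster> and 
that bumping config_version, validating with ccs_config_validate and 
propagating via cman_tool version -r is the right way to roll it out):

     <cluster config_version="52" name="my-cluster">
         <totem token_retransmits_before_loss_const="10"
                fail_recv_const="100" consensus="12000"/>
         <cman expected_votes="1" two_node="1"/>
         ...
     </cluster>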

Any comments on that?

While googling for possible causes I have also seen that this kind of 
problem can occur if the clocks of the two nodes are not synchronized; 
in my case, however, ntpd on both nodes uses two stratum 2 NTP servers. 
I also cannot find anything unusual in the log files, e.g. a jump of 
multiple seconds, although I have to admit that ntpd does not run with 
debugging enabled so far.
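
For what it is worth, this is how I currently check the time 
synchronization state on both nodes (my assumption being that an 
offset in the low millisecond range rules out clock jumps as a cause):

    ntpq -pn    # per-peer offset/jitter, values in milliseconds
    ntpstat     # overall sync status and estimated maximum error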


Thanks in advance for any hint or comment!


Kind regards,

     Heiko



