[Linux-cluster] Is a delay necessary to restart cman?

Chrissie Caulfield ccaulfie at redhat.com
Wed May 6 12:01:21 UTC 2009


Miguel Sanchez wrote:
> Hi. I have a two-node CentOS 5.3 cluster. If I run "service cman
> restart" on one node, or stop and then start cman a few seconds apart,
> the other node doesn't recognize that the member has returned, and it
> shows its peer as offline forever.
> 
> For example:
> 
> * Before cman restart:
> 
> node1# cman_tool status
> Version: 6.1.0
> Config Version: 6
> Cluster Name: CSVirtualizacion
> Cluster Id: 42648
> Cluster Member: Yes
> Cluster Generation: 202600
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 1
> Total votes: 2
> Quorum: 1
> Active subsystems: 7
> Flags: 2node Dirty
> Ports Bound: 0
> Node name: patty
> Node ID: 1
> Multicast addresses: 224.0.0.133
> Node addresses: 138.100.8.70
> 
> * After cman stop on node2 (before the number of seconds set by the token parameter has elapsed):
> 
> node1# cman_tool status
> Version: 6.1.0
> Config Version: 6
> Cluster Name: CSVirtualizacion
> Cluster Id: 42648
> Cluster Member: Yes
> Cluster Generation: 202600
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 1
> Total votes: 1
> Quorum: 1
> Active subsystems: 7
> Flags: 2node Dirty
> Ports Bound: 0
> Node name: patty
> Node ID: 1
> Multicast addresses: 224.0.0.133
> Node addresses: 138.100.8.70
> Wed May  6 12:29:38 CEST 2009
> 
> * After cman stop on node2 (after the number of seconds set by the token parameter has elapsed):
> 
> node1# date; cman_tool status
> Version: 6.1.0
> Config Version: 6
> Cluster Name: CSVirtualizacion
> Cluster Id: 42648
> Cluster Member: Yes
> Cluster Generation: 202604
> Membership state: Cluster-Member
> Nodes: 1
> Expected votes: 1
> Total votes: 1
> Quorum: 1
> Active subsystems: 7
> Flags: 2node Dirty
> Ports Bound: 0
> Node name: patty
> Node ID: 1
> Multicast addresses: 224.0.0.133
> Node addresses: 138.100.8.70
> Wed May  6 12:29:47 CEST 2009
> 
> /var/log/messages:
> May  6 12:35:20 node2 openais[17262]: [TOTEM] The token was lost in the OPERATIONAL state.
> May  6 12:35:20 node2 openais[17262]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
> May  6 12:35:20 node2 openais[17262]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
> May  6 12:35:20 node2 openais[17262]: [TOTEM] entering GATHER state from 2.
> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering GATHER state from 0.
> May  6 12:35:25 node2 openais[17262]: [TOTEM] Creating commit token because I am the rep.
> May  6 12:35:25 node2 openais[17262]: [TOTEM] Saving state aru 26 high seq received 26
> May  6 12:35:25 node2 openais[17262]: [TOTEM] Storing new sequence id for ring 31780
> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering COMMIT state.
> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering RECOVERY state.
> May  6 12:35:25 node2 openais[17262]: [TOTEM] position [0] member 10.10.8.70:
> May  6 12:35:25 node2 openais[17262]: [TOTEM] previous ring seq 202620 rep 10.10.8.70
> May  6 12:35:25 node2 openais[17262]: [TOTEM] aru 26 high delivered 26 received flag 1
> May  6 12:35:25 node2 openais[17262]: [TOTEM] Did not need to originate any messages in recovery.
> May  6 12:35:25 node2 openais[17262]: [TOTEM] Sending initial ORF token
> May  6 12:35:25 node2 openais[17262]: [CLM  ] CLM CONFIGURATION CHANGE
> May  6 12:35:25 node2 openais[17262]: [CLM  ] New Configuration:
> May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.70)
> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Left:
> May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.71)
> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Joined:
> May  6 12:35:25 node2 openais[17262]: [CLM  ] CLM CONFIGURATION CHANGE
> May  6 12:35:25 node2 openais[17262]: [CLM  ] New Configuration:
> May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.70)
> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Left:
> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Joined:
> May  6 12:35:25 node2 openais[17262]: [SYNC ] This node is within the primary component and will provide service.
> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering OPERATIONAL state.
> May  6 12:35:25 node2 kernel: dlm: closing connection to node 2
> May  6 12:35:25 node2 openais[17262]: [CLM  ] got nodejoin message 10.10.8.70
> May  6 12:35:25 node2 openais[17262]: [CPG  ] got joinlist message from node 1
> 
> 
> If node2 runs cman start without waiting until the loss of the token
> in the OPERATIONAL state has been detected, node1 sees node2 as
> offline forever. Subsequent cman restarts don't change this state:
> node1# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>   1   M  202616   2009-05-06 12:34:43  node1
>   2   X  202628                        node2
> node2# cman_tool nodes
> Node  Sts   Inc   Joined               Name
>   1   M  202644   2009-05-06 12:51:04  node1
>   2   M  202640   2009-05-06 12:51:04  node2
> 
> 
> Is a delay between cman stop and cman start necessary to avoid this
> inconsistent state, or is this really a bug?


I suspect it's an instance of this known bug. Check that CentOS has the
appropriate patch available:

https://bugzilla.redhat.com/show_bug.cgi?id=485026
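
Until the patch is in place, the delay described in the question could be scripted roughly as below. This is only a sketch under stated assumptions, not something taken from the bug report: the function names node_status and delayed_cman_restart are hypothetical, and the 15-second default for TOKEN_WAIT_SECS is an assumed value that should be set above the token timeout your cluster actually uses.

```shell
#!/bin/sh
# Assumed value: seconds to wait between stop and start so the other node
# has time to notice the token loss. Set it above your token timeout.
TOKEN_WAIT_SECS="${TOKEN_WAIT_SECS:-15}"

# Hypothetical helper: read `cman_tool nodes` output on standard input and
# print the Sts column (M or X in the output quoted above) for one node name.
node_status() {
    awk -v name="$1" '$NF == name { print $2 }'
}

# Hypothetical helper: stop cman, wait past the token timeout, start it again.
delayed_cman_restart() {
    service cman stop || return 1
    sleep "$TOKEN_WAIT_SECS"
    service cman start
}
```

On the restarted node you would call delayed_cman_restart, then on the other node run something like `cman_tool nodes | node_status node2` and expect M rather than X. The delay only works around the symptom; the patch is still the real fix.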

Chrissie



