[Linux-cluster] Is a delay necessary to restart cman?

Miguel Sanchez ntadmin at fi.upm.es
Wed May 6 10:59:45 UTC 2009


 Hi. I have a two-node CentOS 5.3 cluster. If I run service cman restart 
on one node, or stop followed by start a few seconds later, the other 
node does not recognize that its peer has rejoined, and it shows the 
peer as offline forever.

For example:

* Before cman restart:

node1# cman_tool status
Version: 6.1.0
Config Version: 6
Cluster Name: CSVirtualizacion
Cluster Id: 42648
Cluster Member: Yes
Cluster Generation: 202600
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Active subsystems: 7
Flags: 2node Dirty
Ports Bound: 0
Node name: patty
Node ID: 1
Multicast addresses: 224.0.0.133
Node addresses: 138.100.8.70

* After cman stop on node2 (before the token timeout has elapsed):

node1# cman_tool status
Version: 6.1.0
Config Version: 6
Cluster Name: CSVirtualizacion
Cluster Id: 42648
Cluster Member: Yes
Cluster Generation: 202600
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 1
Quorum: 1
Active subsystems: 7
Flags: 2node Dirty
Ports Bound: 0
Node name: patty
Node ID: 1
Multicast addresses: 224.0.0.133
Node addresses: 138.100.8.70
Wed May  6 12:29:38 CEST 2009

* After cman stop on node2 (after the token timeout has elapsed):

node1# date; cman_tool status
Version: 6.1.0
Config Version: 6
Cluster Name: CSVirtualizacion
Cluster Id: 42648
Cluster Member: Yes
Cluster Generation: 202604
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1
Active subsystems: 7
Flags: 2node Dirty
Ports Bound: 0
Node name: patty
Node ID: 1
Multicast addresses: 224.0.0.133
Node addresses: 138.100.8.70
Wed May  6 12:29:47 CEST 2009

/var/log/messages:
May  6 12:35:20 node2 openais[17262]: [TOTEM] The token was lost in the OPERATIONAL state.
May  6 12:35:20 node2 openais[17262]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
May  6 12:35:20 node2 openais[17262]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May  6 12:35:20 node2 openais[17262]: [TOTEM] entering GATHER state from 2.
May  6 12:35:25 node2 openais[17262]: [TOTEM] entering GATHER state from 0.
May  6 12:35:25 node2 openais[17262]: [TOTEM] Creating commit token because I am the rep.
May  6 12:35:25 node2 openais[17262]: [TOTEM] Saving state aru 26 high seq received 26
May  6 12:35:25 node2 openais[17262]: [TOTEM] Storing new sequence id for ring 31780
May  6 12:35:25 node2 openais[17262]: [TOTEM] entering COMMIT state.
May  6 12:35:25 node2 openais[17262]: [TOTEM] entering RECOVERY state.
May  6 12:35:25 node2 openais[17262]: [TOTEM] position [0] member 10.10.8.70:
May  6 12:35:25 node2 openais[17262]: [TOTEM] previous ring seq 202620 rep 10.10.8.70
May  6 12:35:25 node2 openais[17262]: [TOTEM] aru 26 high delivered 26 received flag 1
May  6 12:35:25 node2 openais[17262]: [TOTEM] Did not need to originate any messages in recovery.
May  6 12:35:25 node2 openais[17262]: [TOTEM] Sending initial ORF token
May  6 12:35:25 node2 openais[17262]: [CLM  ] CLM CONFIGURATION CHANGE
May  6 12:35:25 node2 openais[17262]: [CLM  ] New Configuration:
May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.70)
May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Left:
May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.71)
May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Joined:
May  6 12:35:25 node2 openais[17262]: [CLM  ] CLM CONFIGURATION CHANGE
May  6 12:35:25 node2 openais[17262]: [CLM  ] New Configuration:
May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.70)
May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Left:
May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Joined:
May  6 12:35:25 node2 openais[17262]: [SYNC ] This node is within the primary component and will provide service.
May  6 12:35:25 node2 openais[17262]: [TOTEM] entering OPERATIONAL state.
May  6 12:35:25 node2 kernel: dlm: closing connection to node 2
May  6 12:35:25 node2 openais[17262]: [CLM  ] got nodejoin message 10.10.8.70
May  6 12:35:25 node2 openais[17262]: [CPG  ] got joinlist message from node 1


If node2 runs cman start before node1 has detected the loss of the 
token in the OPERATIONAL state, node1 sees node2 as offline forever. 
Subsequent cman restarts do not change this state:
node1# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  202616   2009-05-06 12:34:43  node1
   2   X  202628                        node2
node2# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  202644   2009-05-06 12:51:04  node1
   2   M  202640   2009-05-06 12:51:04  node2
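
The disagreement between the two views above can be checked mechanically. 
The following is only a minimal sketch: it assumes the output of 
cman_tool nodes from each node has been saved to a file, and the function 
names membership_view and views_agree are hypothetical helpers, not part 
of cman.

```shell
#!/bin/sh
# Print "node_id status" pairs from `cman_tool nodes` output read on stdin.
# Header lines are skipped because their first field is not numeric.
membership_view() {
    awk '$1 ~ /^[0-9]+$/ { print $1, $2 }'
}

# Succeed only if two saved outputs report the same status for every node.
# $1 and $2 are files holding `cman_tool nodes` output from each node.
views_agree() {
    [ "$(membership_view < "$1")" = "$(membership_view < "$2")" ]
}
```

For example, after saving the output on each node to node1.txt and 
node2.txt, running views_agree node1.txt node2.txt || echo "views differ" 
would report the mismatch shown above (node2 is X on node1 but M on node2).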


Is a delay necessary between cman stop and cman start to avoid this 
inconsistent state, or is this really a bug?
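
For reference, the delayed restart described above can be sketched as 
follows. This is only an illustration: TOKEN_MS must be set to the token 
value actually configured for the cluster (the 10000 default here is an 
assumption), MARGIN_S is an arbitrary safety margin, and the function 
names are hypothetical.

```shell
#!/bin/sh
# Sketch: restart cman with a pause longer than the totem token timeout.
# TOKEN_MS is an assumption; set it to the token value of your cluster.
TOKEN_MS="${TOKEN_MS:-10000}"
# Extra seconds on top of the token timeout (hypothetical safety margin).
MARGIN_S="${MARGIN_S:-5}"
# Commands are kept in variables so they can be replaced for a dry run.
STOP_CMD="${STOP_CMD:-service cman stop}"
START_CMD="${START_CMD:-service cman start}"
SLEEP_CMD="${SLEEP_CMD:-sleep}"

# Seconds to wait: token timeout (ms) rounded up to whole seconds, plus margin.
restart_delay_seconds() {
    echo $(( ($1 + 999) / 1000 + $2 ))
}

# Stop cman, wait long enough for the peer to notice, then start it again.
delayed_cman_restart() {
    delay=$(restart_delay_seconds "$TOKEN_MS" "$MARGIN_S")
    $STOP_CMD || return 1
    $SLEEP_CMD "$delay"
    $START_CMD
}
```

A dry run that only prints the steps would be: 
STOP_CMD="echo stop" START_CMD="echo start" SLEEP_CMD="echo sleep" 
followed by delayed_cman_restart.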

Regards.



