[Linux-cluster] Necessary a delay to restart cman?

Chrissie Caulfield ccaulfie at redhat.com
Wed May 6 13:05:41 UTC 2009


Adam Hough wrote:
> On Wed, May 6, 2009 at 7:01 AM, Chrissie Caulfield <ccaulfie at redhat.com> wrote:
>> Miguel Sanchez wrote:
>>> Hi. I have a CentOS 5.3 cluster with two nodes. If I run service
>>> cman restart on one node, or stop and then start it after a few
>>> seconds, the other node doesn't recognize that the node has rejoined,
>>> and its peer stays offline forever.
>>>
>>> For example:
>>>
>>> * Before cman restart:
>>>
>>> node1# cman_tool status
>>> Version: 6.1.0
>>> Config Version: 6
>>> Cluster Name: CSVirtualizacion
>>> Cluster Id: 42648
>>> Cluster Member: Yes
>>> Cluster Generation: 202600
>>> Membership state: Cluster-Member
>>> Nodes: 2
>>> Expected votes: 1
>>> Total votes: 2
>>> Quorum: 1
>>> Active subsystems: 7
>>> Flags: 2node Dirty
>>> Ports Bound: 0
>>> Node name: patty
>>> Node ID: 1
>>> Multicast addresses: 224.0.0.133
>>> Node addresses: 138.100.8.70
>>>
>>> * After cman stop on node2 (elapsed time shorter than the token parameter)
>>>
>>> node1# cman_tool status
>>> Version: 6.1.0
>>> Config Version: 6
>>> Cluster Name: CSVirtualizacion
>>> Cluster Id: 42648
>>> Cluster Member: Yes
>>> Cluster Generation: 202600
>>> Membership state: Cluster-Member
>>> Nodes: 2
>>> Expected votes: 1
>>> Total votes: 1
>>> Quorum: 1
>>> Active subsystems: 7
>>> Flags: 2node Dirty
>>> Ports Bound: 0
>>> Node name: patty
>>> Node ID: 1
>>> Multicast addresses: 224.0.0.133
>>> Node addresses: 138.100.8.70
>>> Wed May  6 12:29:38 CEST 2009
>>>
>>> * After cman stop on node2 (elapsed time longer than the token parameter)
>>>
>>> node1# date; cman_tool status
>>> Version: 6.1.0
>>> Config Version: 6
>>> Cluster Name: CSVirtualizacion
>>> Cluster Id: 42648
>>> Cluster Member: Yes
>>> Cluster Generation: 202604
>>> Membership state: Cluster-Member
>>> Nodes: 1
>>> Expected votes: 1
>>> Total votes: 1
>>> Quorum: 1
>>> Active subsystems: 7
>>> Flags: 2node Dirty
>>> Ports Bound: 0
>>> Node name: patty
>>> Node ID: 1
>>> Multicast addresses: 224.0.0.133
>>> Node addresses: 138.100.8.70
>>> Wed May  6 12:29:47 CEST 2009
>>>
>>> /var/log/messages:
>>> May  6 12:35:20 node2 openais[17262]: [TOTEM] The token was lost in the
>>> OPERATIONAL state.
>>> May  6 12:35:20 node2 openais[17262]: [TOTEM] Receive multicast socket
>>> recv buffer size (288000 bytes).
>>> May  6 12:35:20 node2 openais[17262]: [TOTEM] Transmit multicast socket
>>> send buffer size (262142 bytes).
>>> May  6 12:35:20 node2 openais[17262]: [TOTEM] entering GATHER state from 2.
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering GATHER state from 0.
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Creating commit token
>>> because I am the rep.
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Saving state aru 26 high
>>> seq received 26
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Storing new sequence id
>>> for ring 31780
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering COMMIT state.
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering RECOVERY state.
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] position [0] member
>>> 10.10.8.70:
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] previous ring seq 202620
>>> rep 10.10.8.70
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] aru 26 high delivered 26
>>> received flag 1
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Did not need to originate
>>> any messages in recovery.
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Sending initial ORF token
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ] CLM CONFIGURATION CHANGE
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ] New Configuration:
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.70)
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Left:
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.71)
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Joined:
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ] CLM CONFIGURATION CHANGE
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ] New Configuration:
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.70)
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Left:
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Joined:
>>> May  6 12:35:25 node2 openais[17262]: [SYNC ] This node is within the
>>> primary component and will provide service.
>>> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering OPERATIONAL state.
>>> May  6 12:35:25 node2 kernel: dlm: closing connection to node 2
>>> May  6 12:35:25 node2 openais[17262]: [CLM  ] got nodejoin message
>>> 10.10.8.70
>>> May  6 12:35:25 node2 openais[17262]: [CPG  ] got joinlist message from
>>> node 1
>>>
>>>
>>> If node2 runs cman start without waiting for node1 to detect the loss
>>> of the operational token, node1 sees node2 as offline forever.
>>> Subsequent cman restarts don't change this state:
>>> node1# cman_tool nodes
>>> Node  Sts   Inc   Joined               Name
>>>   1   M  202616   2009-05-06 12:34:43  node1
>>>   2   X  202628                        node2
>>> node2# cman_tool nodes
>>> Node  Sts   Inc   Joined               Name
>>>   1   M  202644   2009-05-06 12:51:04  node1
>>>   2   M  202640   2009-05-06 12:51:04  node2
>>>
>>>
>>> Is a delay between cman stop and start necessary to avoid this
>>> inconsistent state, or is this really a bug?
>>
>> I suspect it's an instance of this known bug. Check that CentOS has the
>> appropriate patch available:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=485026
>>
>> Chrissie
>>
> 
> 
> When restarting cman, I have always had to stop cman and then manually
> stop openais before trying to start cman again. If I do not follow
> these steps, the node never rejoins the cluster or might fence the
> other node.

That indicates some form of configuration error. You should never have
to do that. Make sure that openais is not enabled at boot time using

chkconfig openais off
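
To confirm the setting took effect, the output of chkconfig --list
openais can be checked for any runlevel still marked "on". The helper
below is only a minimal sketch: the function name enabled_runlevels is
hypothetical, and it assumes the usual one-line "name 0:off 1:off ..."
output format of chkconfig on CentOS 5.

```shell
# Print the runlevels in which a service is enabled, given one line of
# `chkconfig --list <service>` output on stdin, for example:
#   openais  0:off 1:off 2:on 3:on 4:on 5:on 6:off
# (enabled_runlevels is a hypothetical helper name, not part of chkconfig.)
enabled_runlevels() {
    # Split the line into one token per line, then keep the runlevel
    # number of every "N:on" token.
    tr -s ' \t' '\n' |
        awk -F: '$2 == "on" { printf "%s%s", sep, $1; sep = " " } END { print "" }'
}

# On a cluster node (assumes chkconfig is installed):
#   chkconfig --list openais | enabled_runlevels
# An empty result means openais is not started at boot in any runlevel.
```

If this prints any runlevel numbers, openais is still being started on
its own at boot in those runlevels.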

Also, I really don't recommend stopping and starting cman without a
reboot. Yes, you might get away with it a few times, but one day it won't
work and you'll be emailing here again ;-)
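
To spot the kind of mismatch shown in the cman_tool nodes output quoted
above, each node's view can be filtered for entries that are not
members. This is a rough sketch only: non_members is a hypothetical
helper name, and it assumes the column layout shown in the quoted
output (a header line, the status in the second column, the node name
last).

```shell
# List node names whose status is not "M" (member), reading
# `cman_tool nodes` output on stdin.
# (non_members is a hypothetical helper name, not part of cman_tool.)
non_members() {
    # Skip the header line; the node name is always the last field,
    # even when the "Joined" columns are empty.
    awk 'NR > 1 && NF > 0 && $2 != "M" { print $NF }'
}

# On each cluster node:
#   cman_tool nodes | non_members
```

Run on both nodes, differing results indicate that the nodes disagree
about membership, as node1 and node2 do in the original report.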

Chrissie



