[Linux-cluster] Strange behaviours in two-node cluster

Javier Vela jvdiago at gmail.com
Tue Jul 17 07:30:47 UTC 2012


Hi, I'm also seeing a lot of entries like this in the logs:

openais[4264]: [TOTEM] Retransmit List: 34 35 36 37 38 39 3a 3b 3c

I've searched the Internet and this seems to happen when there is some delay
between the nodes, but openais is supposed to recover gracefully. Can this
be a problem?
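
In case it is useful, this is roughly what I plan to capture the next time
the retransmits show up. It is only a sketch: eth1 as the heartbeat
interface is an assumption, substitute whichever interface totem is bound
to.

  cman_tool status | grep -i multicast   # the multicast address openais is using
  tcpdump -n -i eth1 ip multicast        # is totem traffic actually arriving?
  cat /proc/net/igmp                     # does the node still hold its IGMP membership?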

2012/7/16 Javier Vela <jvdiago at gmail.com>

> Hi,
>
> I set two_node=0 on purpose, because I use a quorum disk with one
> additional vote. If one node fails, I still have two votes and the cluster
> remains quorate, avoiding a split-brain situation. Is this approach
> wrong? In my tests, this aspect of the quorum worked well.
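>
> For reference, the relevant fragment of my configuration looks more or
> less like this (the interval, tko and label values below are illustrative,
> not my real ones):
>
>   <cman two_node="0" expected_votes="3"/>
>   <quorumd interval="1" tko="10" votes="1" label="qdisk"/>
>
> Each node contributes 1 vote and the quorum disk adds 1 more, so
> expected_votes is 3 and quorum is 2; losing a single node still leaves
> 2 votes. cman_tool status should confirm the arithmetic:
>
>   cman_tool status | grep -E 'Expected votes|Total votes|Quorum'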
>
> Fencing works very well. When something happens, fencing kills the
> faulty server without any problems.
>
> The first time I ran into problems I checked multicast traffic between the
> nodes with iperf and everything appeared to be OK. What I don't know is how
> the purge you mention works; I didn't even know any purging was going on.
> How can I check whether it is happening? Moreover, when I did that test
> only one cluster was running, whereas now there are 3 clusters running on
> the same virtual switch.
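>
> If a periodic purge is the problem, I suppose a long-running multicast
> iperf between the nodes would show it: the receiver should go silent each
> time the group membership is dropped. Something like this (the group
> address is just an example and 32 is an arbitrary TTL):
>
>   # on node1 (receiver)
>   iperf -s -u -B 239.192.99.1 -i 1
>   # on node2 (sender)
>   iperf -c 239.192.99.1 -u -T 32 -t 1200 -b 1M -i 1
>
> I will leave it running for about 20 minutes, since the openais failures
> seem to come every 10-15 minutes.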
>
>
> Software:
>
> Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> cman-2.0.115-85.el5
> rgmanager-2.0.52-21.el5
> openais-0.80.6-30.el5
>
>
>  Regards, Javi
>
>
> 2012/7/16 Digimer <lists at alteeve.ca>
>
>> Why did you set 'two_node="0" expected_votes="3"' on a two node cluster?
>> With this, losing a node will mean you lose quorum and all cluster
>> activity will stop. Please change this to 'two_node="1"
>> expected_votes="1"'.
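>>
>> That is, in cluster.conf the two-node special case would look like:
>>
>>   <cman two_node="1" expected_votes="1"/>
>>
>> (Ignore this if you are deliberately using a quorum disk for the third
>> vote; in that case two_node has to stay at 0.)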
>>
>> Did you confirm that your fencing actually works? Does 'fence_node
>> node1' and 'fence_node node2' actually kill the target?
>>
>> Are you running into multicast issues? If your switch (virtual or real)
>> purges multicast groups periodically, it will break the cluster.
>>
>> What version of the cluster software and what distro are you using?
>>
>> Digimer
>>
>>
>> On 07/16/2012 12:03 PM, Javier Vela wrote:
>> > Hi, two weeks ago I asked for some help building a two-node cluster with
>> > HA-LVM. After some e-mails, finally I got my cluster working. The
>> > problem now is that sometimes, and in some clusters (I have three
>> > clusters with the same configuration), I got very strange behaviours.
>> >
>> > #1 Openais detects some problem and shuts itself down. The network is OK;
>> > it is a virtual device in VMware, shared with the other clusters' heartbeat
>> > networks, and this only happens in one cluster. The error messages:
>> >
>> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
>> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state from 6.
>> > Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state from 0.
>> >
>> > Do you know what I can check in order to solve the problem? I don't know
>> > where I should start. What makes Openais fail to receive messages?
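>> >
>> > One thing I am considering, though I have not confirmed it is the fix,
>> > is raising the totem token timeout so that short delays on the shared
>> > virtual switch do not immediately break membership, e.g.:
>> >
>> >   <totem token="54000"/>
>> >
>> > bumping config_version and pushing it with ccs_tool update
>> > /etc/cluster/cluster.conf. Does that sound reasonable, or would it just
>> > hide the real problem?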
>> >
>> >
>> > #2 I'm getting a lot of rgmanager errors when rgmanager tries to change
>> > the service status, i.e. clusvcadm -d service. It always happens when
>> > both nodes are up; if I shut down one node, the command finishes
>> > successfully. Before executing the command I always check the status
>> > with clustat, and everything is OK:
>> >
>> > clurgmgrd[5667]: <err> #52: Failed changing RG status
>> >
>> > Again, what can I check in order to detect problems with
>> > rgmanager that clustat and cman_tool don't show?
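>> >
>> > If it helps to narrow it down, the sequence is roughly:
>> >
>> >   clustat                     # both nodes online, service started
>> >   clusvcadm -d service:test   # this is when the #52 error appears
>> >
>> > If it would produce more useful output, I can raise rgmanager's
>> > verbosity with something like <rm log_level="7"> in cluster.conf and
>> > try again.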
>> >
>> > #3 Sometimes, not always, a node that has been fenced cannot join the
>> > cluster after the reboot. With clustat I can see that there is quorum:
>> >
>> > clustat:
>> > [root at node2 ~]# clustat
>> > Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
>> > Member Status: Quorate
>> >
>> >  Member Name                                  ID   Status
>> >  ------ ----                                  ---- ------
>> >  node1-hb                                        1 Offline
>> >  node2-hb                                        2 Online, Local, rgmanager
>> >  /dev/disk/by-path/pci-0000:02:01.0-scsi-        0 Online, Quorum Disk
>> >
>> >  Service Name                   Owner (Last)                   State
>> >  ------- ----                   ----- ------                   -----
>> >  service:test                   node2-hb                       started
>> >
>> > The log show how node2 fenced node1:
>> >
>> > node2 messages
>> > Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0 sec post_fail_delay
>> > Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
>> > Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1 to be fenced
>> > Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
>> > Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced; continuing
>> >
>> > But the node that tries to rejoin the cluster says that there isn't
>> > quorum. In the end it comes up inquorate, without seeing node2 or the
>> > quorum disk.
>> >
>> > node1 messages
>> > Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect: Connection refused
>> > Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate.  Refusing connection.
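>> >
>> > Next time it happens I will capture, on the rebooted node (assuming
>> > qdiskd has its own init script here):
>> >
>> >   cman_tool status        # does it count any votes at all?
>> >   cman_tool nodes         # which members does it think exist?
>> >   mkqdisk -L              # is the quorum disk even visible after boot?
>> >   service qdiskd status   # did qdiskd actually start?
>> >
>> > in case that points at why it cannot reach quorum.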
>> >
>> > Do the three errors have something in common? What should I check? I've
>> > ruled out the cluster configuration, because the cluster is working and
>> > the errors don't appear on all the nodes. The most annoying one
>> > currently is #1: every 10-15 minutes Openais fails and the nodes
>> > get fenced. I attach the cluster.conf.
>> >
>> > Thanks in advance.
>> >
>> > Regards, Javi
>> >
>> >
>> >
>>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>>
>>
>>
>