[Linux-cluster] new cluster acting odd

Digimer <lists@alteeve.ca>
Mon Dec 1 17:14:22 UTC 2014

On 01/12/14 11:56 AM, Megan . wrote:
> Thank you for your replies.
> The cluster is intended to be 9 nodes, but I haven't finished building
> the remaining 2.  Our production cluster is expected to be similar in
> size.  What tuning should I be looking at?
> Here is a link to our config.  http://pastebin.com/LUHM8GQR  I had to
> remove IP addresses.

Can you simplify those fencedevice definitions? I wonder whether the 
timeouts you've set are part of the problem. Always start with the 
simplest possible configuration and only add options in response to 
actual issues discovered in testing.
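
As a rough sketch only (the device name, address and credentials here 
are placeholders rather than values from your actual config, and the 
agent name depends on your fence-agents build), a minimal iDRAC entry 
would look something like:

   <fencedevices>
     <fencedevice agent="fence_idrac" name="fence_map1"
                  ipaddr="10.20.30.1" login="fenceuser" passwd="secret"/>
   </fencedevices>

with the matching reference inside the node's <clusternode> block:

   <fence>
     <method name="1">
       <device name="fence_map1"/>
     </method>
   </fence>

No delay, timeout or retry attributes until testing proves you need 
them.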

> I tried the method of (echo c > /proc/sysrq-trigger) to crash a node,
> but the cluster kept seeing it as online and never fenced it, yet I
> could no longer ssh to the node.  I did this on a physical box and a
> VM with the same result.  I had to run fence_node against it to get
> it to reboot, but it came up split-brained (thinking it was the only
> one online).  Now that node has cman down and the rest of the cluster
> sees it as still online.

Then corosync failed to detect the fault. That is a sign, to me, of a 
fundamental network or configuration issue. Corosync should have shown 
messages about a node being lost and reconfiguring. If that didn't 
happen, then you're not even up to the point where fencing factors in.
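
An easy way to check is to crash a node again and watch for totem 
membership changes on a surviving node (where these land depends on 
your logging setup; these are the common defaults):

   grep -i totem /var/log/messages
   grep -i totem /var/log/cluster/corosync.log

You should see corosync report a processor failure and then form a new 
configuration without the dead node. If nothing shows up at all, 
corosync never noticed the node die.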

Did you configure corosync.conf? When it came up, did it think it was 
quorate or inquorate?
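
You can check that from the node itself; on a cman-based cluster, 
something like:

   cman_tool status | grep -i quorum
   cman_tool nodes

will show the vote counts and whether the node considers itself part 
of a quorate cluster.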

> I thought fencing was working because I'm able to run fence_node
> against a node and see the box reboot and come back online.  I did
> have to get the FC version of the fence agents because of an issue
> with the idrac agent not working properly.  We are running
> fence-agents-3.1.6-1.fc14.x86_64

That tells you that the configuration of the fence agents is working, 
but it doesn't test failure detection. You can use the 'fence_check' 
tool to see if the cluster can talk to everything, but in the end, the 
only useful test is to simulate an actual crash.
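
Roughly (fence_check does a status call against every configured 
device; run the sysrq on a disposable node, as it *will* crash it):

   fence_check

   # then, on the victim node:
   echo c > /proc/sysrq-trigger

Watch /var/log/cluster/fenced.log on the survivors; shortly after the 
crash you should see them declare the node dead and fence it.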

Wait; 'fc14'?! What OS are you using?

> fence_tool dump worked on one of my nodes, but it is just hanging on the rest.
> [root@map1-uat ~]# fence_tool dump
> 1417448610 logging mode 3 syslog f 160 p 6 logfile p 6
> /var/log/cluster/fenced.log
> 1417448610 fenced started
> 1417448610 connected to dbus :1.12
> 1417448610 cluster node 1 added seq 89048
> 1417448610 cluster node 2 added seq 89048
> 1417448610 cluster node 3 added seq 89048
> 1417448610 cluster node 4 added seq 89048
> 1417448610 cluster node 5 added seq 89048
> 1417448610 cluster node 6 added seq 89048
> 1417448610 cluster node 8 added seq 89048
> 1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
> 1417448611 logging mode 3 syslog f 160 p 6 logfile p 6
> /var/log/cluster/fenced.log
> 1417448611 logfile cur mode 100644
> 1417448611 cpg_join fenced:daemon ...
> 1417448621 daemon cpg_join error retrying
> 1417448631 daemon cpg_join error retrying
> 1417448641 daemon cpg_join error retrying
> 1417448651 daemon cpg_join error retrying
> 1417448661 daemon cpg_join error retrying
> 1417448671 daemon cpg_join error retrying
> 1417448681 daemon cpg_join error retrying
> 1417448691 daemon cpg_join error retrying
> .
> .
> .
> [root@map1-uat ~]# clustat
> Cluster Status for gibsuat @ Mon Dec  1 16:51:49 2014
> Member Status: Quorate
>   Member Name                                                     ID   Status
>   ------ ----                                                     ---- ------
>   archive1-uat.project.domain.com                                1 Online
>   admin1-uat.project.domain.com                                  2 Online
>   mgmt1-uat.project.domain.com                                   3 Online
>   map1-uat.project.domain.com                                    4 Online, Local
>   map2-uat.project.domain.com                                    5 Online
>   cache1-uat.project.domain.com                                 6 Online
>   data1-uat.project.domain.com                                   8 Online
> The /var/log/cluster/fenced.log on the nodes is logging "Dec 01
> 16:02:34 fenced cpg_join error retrying" every 10 seconds.
> Obviously having some major issues.  These are fresh boxes, with no
> services running right now other than the ones related to the cluster.

What OS/version?
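
Also, those cpg_join retries mean fenced cannot join its process group 
in corosync, so it's worth checking corosync's own view of the 
membership on an affected node. On the corosync 1.x / cman stack 
(adjust for your version), something like:

   corosync-objctl | grep -i member
   cman_tool nodes

If corosync's member list doesn't match what clustat claims, everything 
layered above it (fenced, dlm_controld, rgmanager) will misbehave.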

> I've also experimented with <cman transport="udpu"/> to disable
> multicast to see if that helped, but it doesn't seem to make a
> difference to node stability.

That's a very bad idea with clusters beyond 2~3 nodes. With udpu, every 
message has to be sent separately to each peer instead of multicast 
once, so the overhead will be far too great for a 7~9 node cluster.

> Is there a document or some sort of reference that I can give the
> network folks on how the switches should be configured?  I read stuff
> on boards about IGMP snooping, but I couldn't find anything from
> RedHat to hand them.

I have this:

There are comments in there about multicast, etc.
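
If you want to hand the network folks something concrete to test, 
'omping' was written for exactly this. Run it at the same time on two 
or more nodes, listing their real hostnames, e.g.:

   omping map1-uat.project.domain.com map2-uat.project.domain.com

If the unicast replies come back but the multicast ones show loss, the 
switches (typically IGMP snooping without a querier) are eating your 
multicast traffic.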

Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
