[Linux-cluster] new cluster acting odd
Megan .
nagemnna at gmail.com
Mon Dec 1 18:03:50 UTC 2014
We have eleven 10-20TB GFS2 mounts that I need to share across all of the
nodes. It's the only reason we went with the cluster solution; I don't know
how we could split it up into different smaller clusters.
On Mon, Dec 1, 2014 at 12:14 PM, Digimer <lists at alteeve.ca> wrote:
> On 01/12/14 11:56 AM, Megan . wrote:
>>
>> Thank you for your replies.
>>
>> The cluster is intended to be 9 nodes, but I haven't finished building
>> the remaining 2. Our production cluster is expected to be similar in
>> size. What tuning should I be looking at?
>>
>>
>> Here is a link to our config. http://pastebin.com/LUHM8GQR I had to
>> remove IP addresses.
>
>
> Can you simplify those fencedevice definitions? I would wonder if the set
> timeouts could be part of the problem. Always start with the simplest
> possible configurations and only add options in response to actual issues
> discovered in testing.
I can try to simplify. I added the longer timeouts because of what I saw
happening on the physical boxes: a box would be on its way down or up, the
fence command would fail, but the box actually did come back online. The
physical servers take 10-15 minutes to reboot and I wasn't sure how to
handle the timeout issues, so I made the timeouts a bit extreme for
testing. I'll try to make the config more vanilla for troubleshooting.
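For testing I'm thinking of paring each device down to something like this
(the agent name, address and credentials below are just placeholders, and
I'll double-check the agent against what's actually installed on the boxes):

  <!-- one fence device per node's iDRAC, no custom timeouts -->
  <fencedevices>
    <fencedevice agent="fence_ipmilan" name="idrac-map1" ipaddr="192.0.2.101"
                 login="fenceuser" passwd="fencepass" lanplus="1"/>
  </fencedevices>

  <clusternode name="map1-uat.project.domain.com" nodeid="4">
    <fence>
      <method name="1">
        <device name="idrac-map1"/>
      </method>
    </fence>
  </clusternode>

If that fences reliably on its own, I can add timeouts back one at a time
if a real problem shows up in testing.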
>> I tried the method of (echo c > /proc/sysrq-trigger) to crash a node,
>> but the cluster kept seeing it as online and never fenced it, even
>> though I could no longer ssh to the node. I did this on a physical box
>> and a VM with the same result. I had to run fence_node against it to
>> get it to reboot, but it came up split-brained (thinking it was the
>> only one online). Now that node has cman down, yet the rest of the
>> cluster still sees it as online.
>
>
> Then corosync failed to detect the fault. That is a sign, to me, of a
> fundamental network or configuration issue. Corosync should have shown
> messages about a node being lost and reconfiguring. If that didn't happen,
> then you're not even up to the point where fencing factors in.
>
> Did you configure corosync.conf? When it came up, did it think it was
> quorate or inquorate?
corosync.conf didn't work; it seems the Red Hat HA cluster stack doesn't
use that file (see http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf).
I tried it because we wanted to put the multicast traffic on a different
bond/VLAN, but we figured out the file isn't used when cman generates the
corosync configuration itself.
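If I understand the cman setup right, the multicast address can be set
directly in cluster.conf and the interface is picked by whichever IP the
clusternode names resolve to, so something like this might let us steer
the traffic onto the other bond/VLAN (the group address below is just an
example from the site-local range):

  <cman>
    <!-- example group address; the node names in <clusternodes> would
         need to resolve to addresses on the target bond/VLAN -->
    <multicast addr="239.192.0.1"/>
  </cman>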
>> I thought fencing was working because I'm able to run fence_node <node>
>> and see the box reboot and come back online. I did have to get the FC
>> version of the fence agents because of an issue with the iDRAC agent
>> not working properly. We are running fence-agents-3.1.6-1.fc14.x86_64
>
>
> That tells you that the configuration of the fence agents is working, but it
> doesn't test failure detection. You can use the 'fence_check' tool to see if
> the cluster can talk to everything, but in the end, the only useful test is
> to simulate an actual crash.
>
> Wait; 'fc14' ?! What OS are you using?
>
>
We are on CentOS 6.6. I went with the Fedora agents because of this
exact issue: http://forum.proxmox.com/threads/12311-Proxmox-HA-fencing-and-Dell-iDrac7
I read that it was fixed in the next version, which I could only find
packaged for Fedora.
>> fence_tool dump worked on one of my nodes, but it is just hanging on the
>> rest.
>>
>> [root at map1-uat ~]# fence_tool dump
>> 1417448610 logging mode 3 syslog f 160 p 6 logfile p 6
>> /var/log/cluster/fenced.log
>> 1417448610 fenced 3.0.12.1 started
>> 1417448610 connected to dbus :1.12
>> 1417448610 cluster node 1 added seq 89048
>> 1417448610 cluster node 2 added seq 89048
>> 1417448610 cluster node 3 added seq 89048
>> 1417448610 cluster node 4 added seq 89048
>> 1417448610 cluster node 5 added seq 89048
>> 1417448610 cluster node 6 added seq 89048
>> 1417448610 cluster node 8 added seq 89048
>> 1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
>> 1417448611 logging mode 3 syslog f 160 p 6 logfile p 6
>> /var/log/cluster/fenced.log
>> 1417448611 logfile cur mode 100644
>> 1417448611 cpg_join fenced:daemon ...
>> 1417448621 daemon cpg_join error retrying
>> 1417448631 daemon cpg_join error retrying
>> 1417448641 daemon cpg_join error retrying
>> 1417448651 daemon cpg_join error retrying
>> 1417448661 daemon cpg_join error retrying
>> 1417448671 daemon cpg_join error retrying
>> 1417448681 daemon cpg_join error retrying
>> 1417448691 daemon cpg_join error retrying
>> .
>> .
>> .
>>
>>
>> [root at map1-uat ~]# clustat
>> Cluster Status for gibsuat @ Mon Dec 1 16:51:49 2014
>> Member Status: Quorate
>>
>> Member Name                                        ID   Status
>> ------ ----                                        ---- ------
>> archive1-uat.project.domain.com                      1  Online
>> admin1-uat.project.domain.com                        2  Online
>> mgmt1-uat.project.domain.com                         3  Online
>> map1-uat.project.domain.com                          4  Online, Local
>> map2-uat.project.domain.com                          5  Online
>> cache1-uat.project.domain.com                        6  Online
>> data1-uat.project.domain.com                         8  Online
>>
>>
>> The /var/log/cluster/fenced.log on the nodes is saying "Dec 01
>> 16:02:34 fenced cpg_join error retrying" every 10th of a second.
>>
>> Obviously we're having some major issues. These are fresh boxes, with
>> no other services running right now other than the ones related to the
>> cluster.
>
>
> What OS/version?
>
>> I've also experimented with the <cman transport="udpu"/> to disable
>> multicast to see if that helped but it doesn't seem to make a
>> difference with the node stability.
>
>
> Very bad idea with >2~3 node clusters. The overhead will be far too great
> for a 7~9 node cluster.
>
>> Is there a document or some sort of reference that I can give the
>> network folks on how the switches should be configured? I read stuff
>> on boards about IGMP snooping, but I couldn't find anything from
>> Red Hat to hand them.
>
>
> I have this:
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Six_Network_Interfaces.2C_Seriously.3F
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Switches
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Security_Considerations
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network
>
> There are comments in there about multicast, etc.
>
Thank you for the links. I will review them with our network folks;
hopefully they will help us sort out some of our issues.
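I think omping is the tool usually suggested for checking multicast
between nodes, so assuming I have the usage right, we'd run something like
this on every node at the same time and compare the multicast loss to the
unicast loss:

  # run simultaneously on all nodes, listing every cluster node name
  # (short names here, assuming they resolve; otherwise the FQDNs)
  omping -c 60 archive1-uat admin1-uat mgmt1-uat map1-uat map2-uat cache1-uat data1-uat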
I will also use the fence_check tool to see if I can troubleshoot the fencing.
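If I'm reading it right, fence_check just gets run by itself from one node
while cman and fenced are up, and it does a status call against every
fence device defined for every node:

  # run from a single quorate node with the cluster services running
  fence_check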
Thank you very much for all of your suggestions.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster