[Linux-cluster] new cluster acting odd

Dan Riley dan131riley at gmail.com
Mon Dec 1 16:34:32 UTC 2014


> On Dec 1, 2014, at 10:57, Digimer <lists at alteeve.ca> wrote:
> 
> On 01/12/14 09:16 AM, Megan . wrote:
>> We decided to use this cluster solution in order to share GFS2 mounts
>> across servers.  We have a 7-node cluster that is newly set up, but
>> acting oddly.  It has 3 VMware guest hosts and 4 physical hosts (Dells
>> with iDRACs).  They are all running CentOS 6.6.  I have fencing
>> working (I'm able to run fence_node <node> and it fences
>> successfully).  I do not have the GFS2 mounts in the cluster yet.
> 
> Very glad you have fencing, that's a common early mistake.
> 
> 7-node cluster is actually pretty large and is around the upper-end before tuning starts to become fairly important.

Ha, I was unaware this was part of the folklore.  We have a couple of 9-node clusters; it did take some tuning to get them stable, and we are thinking about splitting one of them.  For our clusters we found that a uniform configuration helped a lot, so a mix of physical hosts and VMs in the same (largish) cluster would make me a little nervous--I don't know about anyone else's feelings.
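
For what it's worth, the knob that mattered most for us was the totem token timeout in cluster.conf; larger memberships seem to want a longer token before membership settles.  A sketch only--the value below is made up for illustration, not a recommendation:

    <!-- in /etc/cluster/cluster.conf; value is illustrative only -->
    <totem token="54000"/>

Raise it gradually and retest; setting it too large just delays failure detection.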

>> I can manually fence it, and it still comes online with the same
>> issue.  I end up having to take the whole cluster down, sometimes
>> forcing reboot on some nodes, then bringing it back up.  It takes a
>> good part of the day just to bring the whole cluster online again.
> 
> Something fence related is not working.

We used to see something like this, and it usually traced back to "shouldn't be possible" inconsistencies in the fence group membership.  Once the fence group gets blocked by a mismatched membership list, everything above it breaks.  Sometimes a "fence_tool ls -n" on all the cluster members will reveal an inconsistency ("fence_tool dump" also, but that's hard to interpret without digging into the group membership protocols).  If you can find an inconsistency, manually fencing the nodes in the minority might repair it.
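
Concretely, something like this is what we'd run--a sketch only, with made-up hostnames and assuming passwordless ssh to every member:

    # Compare the fence domain membership as seen from every node.
    # Any node whose list disagrees with the majority is suspect.
    for h in node1 node2 node3 node4 node5 node6 node7; do
        echo "== $h =="
        ssh "$h" fence_tool ls -n
    done

If one node reports a member list the others don't agree on, the nodes in the minority are the ones I'd try fencing first.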

At the time, I did quite a lot of staring at fence_tool dumps, but I never figured out how the fence group was getting into those "shouldn't be possible" inconsistencies.  This was also all late RHEL5 and early RHEL6, so it may not be applicable anymore.

> My recommendation would be to schedule a maintenance window and then stop everything except cman (no rgmanager, no gfs2, etc). Then methodically test crashing all nodes (I like 'echo c > /proc/sysrq-trigger') and verify they are fenced and then recover properly. It's worth disabling cman and rgmanager from starting at boot (period, but particularly for this test).
> 
> If you can reliably (and repeatedly) crash -> fence -> rejoin, then I'd start loading back services and re-trying. If the problem reappears only under load, then that's an indication of the problem, too.

I'd agree--start at the bottom of the stack and work your way up.
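
For the archive, here's roughly how we'd run that loop on CentOS 6--a sketch only, to be adapted; you'll want console access, since the victim crashes hard:

    # On every node: keep the stack from starting at boot, so a fenced
    # node comes back quiet and you rejoin it by hand.
    chkconfig cman off
    chkconfig rgmanager off

    # On the victim node: hard-crash the kernel via sysrq.
    echo c > /proc/sysrq-trigger

    # On a surviving node: confirm fenced handled the victim.
    grep fenced /var/log/messages | tail

    # On the rebooted victim: rejoin and check that membership agrees.
    service cman start
    cman_tool nodes

Repeat for each node in turn until crash -> fence -> rejoin is boring, then start layering services back on.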

-dan