[Linux-cluster] fencing for no reason that I can see

Tue Sep 11 02:08:37 UTC 2012

On Mon, Sep 10, 2012 at 8:27 PM, Terry <td3201 at gmail.com> wrote:
> Hello,
>
> I have seen this a few times where one node stops seeing the other
> node for some unknown reason and fences it.  Any idea how I can debug
> this?  Here's from the node doing the fencing:
>
>
> Sep 10 19:01:23 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
> failed, forming new configuration.
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [QUORUM] Members[1]: 1
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
> joined or left the membership and a new membership was formed.
> Sep 10 19:01:25 omadvnfs01a rgmanager[10692]: State change:
> omadvnfs01b.sec.jel.lc DOWN
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [CPG   ] chosen
> downlist: sender r(0) ip(10.198.1.110) ; members(old:2 left:1)
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [MAIN  ] Completed
> service synchronization, ready to provide service.
> Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc
>
>
> And here is from the fenced node:
>
> Sep 10 17:09:27 omadvnfs01b rpc.idmapd[6126]: nfsdcb:
> read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
> File)
> Sep 10 17:14:47 omadvnfs01b rpc.idmapd[6125]: nfsdcb:
> read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
> File)
> Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, log source =
> /proc/kmsg started.
> Sep 10 19:04:44 omadvnfs01b rsyslogd: [origin software="rsyslogd"
> swVersion="5.8.10" x-pid="2379" x-info="http://www.rsyslog.com"] start
>
>
> I did notice that they were about 40 seconds off in time.  I just
> fixed that but what else can I look for here.  Our monitoring started
> noticing things at 19:02:30 that the fenced node was off the grid
> which is a little after it was fenced.  What test is performed to see
> if the other node is up?  How many times does it try?
>
> Thanks!

I guess I should have read the docs more thoroughly.  Straight from
the RHEL 6 cluster guide:
Ensure that exotic bond modes and VLAN tagging are not in use on
interfaces that the cluster uses for inter-node communication.

I am using a three-interface 802.3ad link aggregate on the production
network.  I could either use an iSCSI interface or split one of the
three bond slave interfaces out and dedicate it to inter-node traffic.
I was also looking into a potential multicast issue, but I believe my
switches (Foundry FLS) support it fine, and I wouldn't expect a
multicast problem to be intermittent like this.  Does anyone have any
other thoughts?
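
To rule multicast in or out, here is a minimal sketch of a probe that
could be run between the two nodes: one side sends numbered UDP
packets to a multicast group, and the other records which sequence
numbers arrive and reports the gaps.  The group address, port, and
function names are made up for illustration; they are not taken from
my cluster configuration, and this does not reproduce what corosync
itself does on the wire.  It also leaves interface selection to the
kernel, so on a bonded host it only tests whichever path the kernel
picks.

```python
import socket
import struct
import time

# Hypothetical test group and port, chosen so they do not collide with
# whatever the cluster itself is using.
GROUP = "239.192.100.1"
PORT = 5999


def make_probe(seq):
    """Pack a sequence number into a 4-byte network-order probe."""
    return struct.pack("!I", seq)


def parse_probe(data):
    """Return the sequence number from a probe built by make_probe."""
    return struct.unpack("!I", data[:4])[0]


def find_gaps(seqs):
    """Return sequence numbers missing between the lowest and highest seen."""
    seen = set(seqs)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


def send_probes(count, interval=1.0):
    """Send `count` numbered probes to the group, one per `interval` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL 1 keeps the probes on the local network segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    for seq in range(count):
        sock.sendto(make_probe(seq), (GROUP, PORT))
        time.sleep(interval)
    sock.close()


def receive_probes(duration):
    """Join the group, listen for `duration` seconds, return the seqs seen."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # Join the group on the default interface (0.0.0.0 = let the kernel pick).
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(1.0)
    seqs = []
    deadline = time.time() + duration
    while time.time() < deadline:
        try:
            data, _addr = sock.recvfrom(64)
        except socket.timeout:
            continue
        seqs.append(parse_probe(data))
    sock.close()
    return seqs
```

The idea would be to start receive_probes(3600) on one node, run
send_probes(3600) on the other, and pass the result to find_gaps().
Gaps would point at the network path rather than at the cluster
software, although a clean run would not prove the bond is innocent.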