[Linux-cluster] fencing for no reason that I can see

Tue Sep 11 15:17:37 UTC 2012

Hi,

I had similar problems. The cause turned out to be that the firmware
for the Broadcom NICs in our Dell R610 was outdated or buggy.
So, depending on your hardware, please have the vendor check your
firmware/BIOS/... versions; that might help.
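
If you want to see which NIC firmware a node is currently running before
asking the vendor, something like the following might do. It is only a
rough sketch: it assumes `ethtool` is installed and that `ethtool -i`
prints its usual "key: value" lines (driver, firmware-version, ...). The
function names are made up for illustration.

```python
import subprocess


def parse_ethtool_info(text):
    """Parse the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values that contain colons
        # (such as a PCI bus address) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def nic_firmware(interface):
    """Return (driver, firmware-version) for one interface.

    The interface name is supplied by the caller; nothing is assumed
    about how the NICs are named on a given node.
    """
    out = subprocess.run(
        ["ethtool", "-i", interface],
        capture_output=True, text=True, check=True,
    ).stdout
    info = parse_ethtool_info(out)
    return info.get("driver"), info.get("firmware-version")
```

Calling `nic_firmware("eth0")` on each node (with `eth0` replaced by your
cluster interconnect interface) gives the values to compare against the
vendor's current release.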

Kind regards,

     Heiko

On 11.09.2012 03:27, Terry wrote:
> Hello,
>
> I have seen this a few times where one node stops seeing the other
> node for some unknown reason and fences it.  Any idea how I can debug
> this?  Here's from the node doing the fencing:
>
>
> Sep 10 19:01:23 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
> failed, forming new configuration.
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [QUORUM] Members[1]: 1
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [TOTEM ] A processor
> joined or left the membership and a new membership was formed.
> Sep 10 19:01:25 omadvnfs01a rgmanager[10692]: State change:
> omadvnfs01b.sec.jel.lc DOWN
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [CPG   ] chosen
> downlist: sender r(0) ip(10.198.1.110) ; members(old:2 left:1)
> Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [MAIN  ] Completed
> service synchronization, ready to provide service.
> Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc
>
>
> And here is from the fenced node:
>
> Sep 10 17:09:27 omadvnfs01b rpc.idmapd[6126]: nfsdcb:
> read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
> File)
> Sep 10 17:14:47 omadvnfs01b rpc.idmapd[6125]: nfsdcb:
> read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of
> File)
> Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, log source =
> /proc/kmsg started.
> Sep 10 19:04:44 omadvnfs01b rsyslogd: [origin software="rsyslogd"
> swVersion="5.8.10" x-pid="2379" x-info="http://www.rsyslog.com"] start
>
>
> I did notice that their clocks were about 40 seconds apart.  I just
> fixed that, but what else can I look for here?  Our monitoring noticed
> at 19:02:30 that the fenced node was off the grid, which is a little
> after it was fenced.  What test is performed to see if the other node
> is up?  How many times does it try?
>
> Thanks!
>
>
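
Regarding the clock difference mentioned above: when comparing log lines
from two nodes whose clocks disagree, it can help to shift one node's
timestamps onto the other's clock first. A minimal sketch, assuming the
syslog lines start with the usual "Mon DD HH:MM:SS" stamp in English and
that you know the offset. The post only says "about 40 seconds", so both
the size and the sign of the offset are yours to supply, and the function
names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Syslog stamps carry no year, so the caller passes it in.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def to_reference_clock(line, year, offset_seconds):
    """Shift a line's timestamp by a known clock offset.

    offset_seconds is an assumption supplied by the caller: positive if
    this node's clock was behind the reference node, negative if ahead.
    """
    return parse_syslog_time(line, year) + timedelta(seconds=offset_seconds)
```

With both nodes' events on one clock, the gap between the membership
change and the fence action can be read off directly.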