[Linux-cluster] Problem with machines fencing one another in 2 Node NFS cluster

Digimer linux at alteeve.com
Thu Feb 24 15:13:28 UTC 2011


On 02/24/2011 08:34 AM, Randy Brown wrote:
> Thanks for the response.  Sorry for the delay.  I had an issue that,
> unexpectedly, took me away from the office.  I am just getting back to
> this now.
> 
> Yes, the MAC addresses were all updated after the cloning.  According to
> my notes, here are the relevant sections of the log files from each
> cluster node at the time of a fence.
> 
> Feb 10 15:17:48 nfs2-cluster clurgmgrd[4280]:<notice>  Resource Group
> Manager Starting
> Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice>  Shutting down
> Cluster Service Manager...
> Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice>  Shutting down
> Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice>  Shutting down
> Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice>  Shutdown
> complete, exiting
> Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice>  Cluster Service
> Manager is stopped.
> Feb 10 15:18:23 nfs2-cluster ccsd[2989]: Stopping ccsd, SIGTERM received.
> Feb 10 15:18:23 nfs2-cluster NAMC
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading all
> openais components
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_confdb v0 (19/10)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_cpg v0 (18/8)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_cfg v0 (17/7)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_msg v0 (16/6)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_lck v0 (15/5)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_evt v0 (14/4)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_ckpt v0 (13/3)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_amf v0 (12/2)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_clm v0 (11/1)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_evs v0 (10/0)
> Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais
> component: openais_cman v0 (9/9)
> Feb 10 15:18:23 nfs2-cluster gfs_controld[3077]: cluster is down, exiting
> Feb 10 15:18:23 nfs2-cluster dlm_controld[3071]: cluster is down, exiting
> Feb 10 15:18:23 nfs2-cluster fenced[3065]: cluster is down, exiting
> Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 2
> Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 1
> 
> 
> Feb 10 15:17:34 nfs1-cluster ntpd[3765]: synchronized to LOCAL(0),
> stratum 10
> Feb 10 15:18:17 nfs1-cluster clurgmgrd[4323]:<notice>  Member 2 shutting
> down
> Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] The token was lost
> in the OPERATIONAL state.
> Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Receive multicast
> socket recv buffer size (320000 bytes).
> Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Transmit multicast
> socket send buffer size (262142 bytes).
> Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] entering GATHER
> state from 2.
> Feb 10 15:18:34 nfs1-cluster ntpd[3765]: synchronized to 132.236.56.250,
> stratum 2
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering GATHER
> state from 0.
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Creating commit
> token because I am the rep.
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Saving state aru 230
> high seq received 230
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Storing new sequence
> id for ring 1f80
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering COMMIT state.
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering RECOVERY
> state.
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] position [0] member
> 140.90.91.240:
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] previous ring seq
> 8060 rep 140.90.91.240
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] aru 230 high
> delivered 230 received flag 1
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Did not need to
> originate any messages in recovery.
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Sending initial ORF
> token
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] CLM CONFIGURATION
> CHANGE
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] New Configuration:
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ]     r(0)
> ip(140.90.91.240)
> Feb 10 15:18:35 nfs1-cluster kernel: dlm: closing connection to node 2
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] Members Left:
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ]     r(0)
> ip(140.90.91.242)
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] Members Joined:
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] CLM CONFIGURATION
> CHANGE
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] New Configuration:
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ]     r(0)
> ip(140.90.91.240)
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] Members Left:
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] Members Joined:
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [SYNC ] This node is within
> the primary component and will provide service.
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering OPERATIONAL
> state.
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] got nodejoin message
> 140.90.91.240
> Feb 10 15:18:35 nfs1-cluster openais[3046]: [CPG  ] got joinlist message
> from node 1
> 
> 
> I was seeing a number of these messages but they stopped after upgrading
> openais
> 
> nfs2-cluster openais[3012]: [TOTEM] Retransmit List: 1df3
> 
> Yes, these are on managed switches.  I will try to run the tcpdump
> ASAP.  Unfortunately, that means I have to wait for it to crash again to
> get what I need, and my users are already annoyed by the downtime we've
> had.  I know this isn't the best solution for our needs, but given the
> lack of funding, this seemed like a good idea at the time.
> 
> Thanks for the help!
> 
> Randy

The logs you posted seem incomplete. For example, there are no messages
from fenced at all. Can I assume that nfs2 gets fenced first, comes back
up and then fences nfs1? Was nfs2 the node that failed and was restored?
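
A quick way to confirm what actually happened is to pull the fence
daemon's own messages out of syslog on both nodes (this assumes the
default /var/log/messages target; adjust the path if your logs go
elsewhere):

  grep -E 'fenced|fencing node|fence .* success' /var/log/messages
  cman_tool nodes    # current membership as each node sees it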

In either case, it looks like nfs2 withdraws from the cluster by (trying
to?) shut down the cluster software. Normally, leaving the cluster this
way is an ordered withdrawal and will not trigger a fence, because the
departure is announced to the other members. If a fence happens anyway,
it's because the node simply vanished from the point of view of the
remaining node(s).
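
For reference, an ordered withdrawal on a RHEL5-era cluster normally
looks something like this (the service names assume the stock
cman/rgmanager init scripts; adjust to match your setup):

  # on the node that is leaving
  service rgmanager stop   # stop the managed services first
  service gfs stop         # unmount any GFS file systems
  service clvmd stop       # if clustered LVM is in use
  service cman stop        # announces the leave to the surviving node

If 'service cman stop' hangs, or the node drops off the network before
the leave is announced, the surviving node only sees the totem token
disappear and will fence.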

This (which is odd in its own right):

Feb 10 15:17:48 nfs2-cluster clurgmgrd[4280]:<notice>  Resource Group
Manager Starting
Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice>  Shutting down
Cluster Service Manager...
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice>  Shutting down

Shouldn't lead to this:

Feb 10 15:18:17 nfs1-cluster clurgmgrd[4323]:<notice>  Member 2 shutting
down
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] The token was lost
in the OPERATIONAL state.

The 'nfs1' node sees the member leaving, but still freaks out when the
totem tokens stop arriving. I am less convinced now that this is a
multicast issue (but it doesn't hurt to keep watching for it).
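
If you do get a chance to watch the multicast traffic, something like
this on both nodes should be enough (assuming eth0 is the cluster
interface and the default totem port of 5405; substitute the actual
mcastaddr/mcastport from your openais.conf):

  tcpdump -i eth0 -n udp port 5405
  # or capture to a file so the two nodes can be compared afterwards:
  tcpdump -i eth0 -n -w /tmp/totem-$(hostname -s).pcap udp port 5405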

Can you post the full logs from both nodes, from well before until well
after the fencing? Can you also post your full cluster.conf and
openais.conf files (obfuscating only the passwords and leaving
everything else intact)? It is probably easiest to post these on
http://pastebin.com to keep the thread readable.

Something is quite odd here... I'm almost thinking that the internal
node IDs aren't unique or something, but I am not entirely familiar with
the internals (either how IDs are created or how they are stored)... I'm
curious to sort this out now. :)
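
If you want to rule the node IDs out quickly, something like this on
each node should show whether they agree and are unique (ccs_tool and
cman_tool ship with the cluster suite; the exact output format varies a
bit by version):

  ccs_tool lsnode               # names, node IDs and votes from cluster.conf
  cman_tool nodes               # node IDs as the running cluster sees them
  cman_tool status | grep -i 'node id'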

-- 
Digimer
E-Mail: digimer at alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org



