[Linux-cluster] RHEL 4.7 fenced fails -- stuck join state: S-2,2,1

Robert Hurst rhurst at bidmc.harvard.edu
Mon Aug 24 16:32:39 UTC 2009


RHEL support pointed me to a document suggesting this may be an
implementation issue:

http://kbase.redhat.com/faq/docs/DOC-5935


        "DNS is not a reliable way to get name resolution for the
        cluster. All cluster nodes must be defined in /etc/hosts with
        the name that matches cluster.conf and uname -n."
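
For what it's worth, I read that as wanting something like the
following in every node's /etc/hosts (a sketch only, using the private
.blade addresses shown further down; I am assuming the short names are
what have to match cluster.conf):

    192.168.2.1     acropolis
    192.168.2.4     cerberus
    192.168.2.11    rycon
    192.168.2.12    solaria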




But we have used a local DNS service on all of our hosts for years and
leave /etc/hosts alone, with only the localhost entry in it.  Our servers
have multiple bonded NICs, so I put the "hosts" in their own private
domain / IP range, and each clusternode entry in cluster.conf is
added simply as: acropolis, cerberus, rycon, and solaria.  I let DNS
(and reverse DNS) resolve those names, i.e.,

/etc/resolv.conf
search blade  ccc.cluster  bidmc.harvard.edu  bidn.caregroup.org 
nameserver 127.0.0.1

$ host acropolis 
acropolis.blade has address 192.168.2.1 
$ host cerberus 
cerberus.blade has address 192.168.2.4 
$ host rycon 
rycon.blade has address 192.168.2.11 
$ host solaria 
solaria.blade has address 192.168.2.12 
 
root@acropolis [~]$ netstat -a | grep 6809 
udp    0    0 acropolis.blade:6809    *:* 
udp    0    0 192.168.255.255:6809    *:* 
 
[root@cerberus ~]# netstat -a | grep 6809 
udp    0    0 cerberus.blade:6809     *:* 
udp    0    0 192.168.255.255:6809    *:* 
 
[root@solaria ~]# netstat -a | grep 6809 
udp    0    0 solaria.blade:6809      *:* 
udp    0    0 192.168.255.255:6809    *:*


... even though each of those servers' $( uname -n ) has
.bidmc.harvard.edu (for the corporate LAN-facing NICs) appended to it.
Is this REALLY a cause for concern?  If so, could it introduce a
failure (if not at join time) during some later event?  Any feedback is
welcome!
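
In case it helps anyone else check the same thing, this is roughly how
I compare the three names on each node (plain shell, a sketch only --
adjust the grep pattern to your own cluster.conf layout):

    # what the kernel thinks this node is called
    uname -n

    # what cluster.conf calls this node
    grep -o 'clusternode name="[^"]*"' /etc/cluster/cluster.conf

    # what the resolver returns for the short name
    getent hosts acropolis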


On Tue, 2009-08-11 at 10:55 -0400, Robert Hurst wrote:

> Simple 4-node cluster; two nodes have had a GFS shared home directory
> mounted for over a month.  Today I wanted to mount /home on a 3rd
> node, so:
> 
> # service fenced start                [failed]
> 
> Weird.  Checking /var/log/messages shows:
> 
> Aug 11 10:19:06 cerberus kernel: Lock_Harness 2.6.9-80.9.el4_7.10
> (built Jan 22 2009 18:39:16) installed
> Aug 11 10:19:06 cerberus kernel: GFS 2.6.9-80.9.el4_7.10 (built Jan 22
> 2009 18:39:32) installed
> Aug 11 10:19:06 cerberus kernel: GFS: Trying to join cluster
> "lock_dlm", "ccc_cluster47:home"
> Aug 11 10:19:06 cerberus kernel: Lock_DLM (built Jan 22 2009 18:39:18)
> installed
> Aug 11 10:19:06 cerberus kernel: lock_dlm: fence domain not found;
> check fenced
> Aug 11 10:19:06 cerberus kernel: GFS: can't mount proto = lock_dlm,
> table = ccc_cluster47:home, hostdata = 
> 
> # cman_tool services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           0   2 join      S-2,2,1
> []
> 
> So, a fenced process is now hung:
> 
> root     28302  0.0  0.0  3668  192 ?        Ss   10:19   0:00 fenced -t 120 -w
> 
> Q: Any idea how to "recover" from this state, without rebooting?
> 
> The other two servers are unaffected by this (thankfully) and show
> normal operations:
> 
> $ cman_tool services
> 
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           2   2 run       -
> [1 12]
> 
> DLM Lock Space:  "home"                              5   5 run       -
> [1 12]
> 
> GFS Mount Group: "home"                              6   6 run       -
> [1 12]
> 
> 