[Linux-cluster] RHEL 4.7 fenced fails -- stuck join state: S-2,2,1
Robert Hurst
rhurst at bidmc.harvard.edu
Mon Aug 24 16:32:39 UTC 2009
RHEL support pointed me to a document suggesting this may be an
implementation issue:
http://kbase.redhat.com/faq/docs/DOC-5935
"DNS is not a reliable way to get name resolution for the
cluster. All cluster nodes must be defined in /etc/hosts with
the name that matches cluster.conf and uname -n."
But we have used a local DNS service on all of our hosts for years, and
we leave /etc/hosts alone with only the localhost entry in it. Our servers
have multiple bonded NICs, so I put the hosts in their own private
domain / IP range, and each clusternode entry in cluster.conf is
simply: acropolis, cerberus, rycon, and solaria. I let DNS
(and reverse DNS) resolve those names, i.e.,
/etc/resolv.conf
search blade ccc.cluster bidmc.harvard.edu bidn.caregroup.org
nameserver 127.0.0.1
$ host acropolis
acropolis.blade has address 192.168.2.1
$ host cerberus
cerberus.blade has address 192.168.2.4
$ host rycon
rycon.blade has address 192.168.2.11
$ host solaria
solaria.blade has address 192.168.2.12
[root at acropolis ~]# netstat -a | grep 6809
udp 0 0 acropolis.blade:6809 *:*
udp 0 0 192.168.255.255:6809 *:*
[root at cerberus ~]# netstat -a | grep 6809
udp 0 0 cerberus.blade:6809 *:*
udp 0 0 192.168.255.255:6809 *:*
[root at solaria ~]# netstat -a | grep 6809
udp 0 0 solaria.blade:6809 *:*
udp 0 0 192.168.255.255:6809 *:*
... even though each of those servers' $( uname -n ) values has
.bidmc.harvard.edu (for the corporate LAN-facing NICs) appended to it.
Is this REALLY a cause for concern? If so, could this introduce a
failure (if not at join) during some later event? Any feedback is
welcome!
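For what it's worth, the KB article's requirement boils down to: the name in
cluster.conf should match what uname -n reports (or at least its short form)
and what DNS returns. A minimal sketch of that comparison, using the node
names above (the short_name helper is mine, not part of any cluster tool):

```shell
#!/bin/sh
# Sketch: reduce a node's FQDN to its short name for comparison against the
# cluster.conf clusternode entries. short_name is a hypothetical helper.

short_name() {
    # strip the domain part: acropolis.bidmc.harvard.edu -> acropolis
    echo "$1" | cut -d. -f1
}

for node in acropolis cerberus rycon solaria; do
    # on the node itself one would compare against $(uname -n); here we just
    # show the reduction for the LAN-facing FQDN form
    echo "$node short form: $(short_name "$node.bidmc.harvard.edu")"
done
```

On a node where the short form of uname -n does not match the cluster.conf
entry, that would point at exactly the mismatch the KB article warns about.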
On Tue, 2009-08-11 at 10:55 -0400, Robert Hurst wrote:
> Simple 4-node cluster, 2-nodes have a GFS shared home directory
> mounted for over a month. Today, I wanted to mount /home on a 3rd
> node, so:
>
> # service fenced start [failed]
>
> Weird. Checking /var/log/messages shows:
>
> Aug 11 10:19:06 cerberus kernel: Lock_Harness 2.6.9-80.9.el4_7.10
> (built Jan 22 2009 18:39:16) installed
> Aug 11 10:19:06 cerberus kernel: GFS 2.6.9-80.9.el4_7.10 (built Jan 22
> 2009 18:39:32) installed
> Aug 11 10:19:06 cerberus kernel: GFS: Trying to join cluster
> "lock_dlm", "ccc_cluster47:home"
> Aug 11 10:19:06 cerberus kernel: Lock_DLM (built Jan 22 2009 18:39:18)
> installed
> Aug 11 10:19:06 cerberus kernel: lock_dlm: fence domain not found;
> check fenced
> Aug 11 10:19:06 cerberus kernel: GFS: can't mount proto = lock_dlm,
> table = ccc_cluster47:home, hostdata =
>
> # cman_tool services
> Service          Name        GID  LID  State  Code
> Fence Domain:    "default"     0    2  join   S-2,2,1
> []
>
> So, a fenced process is now hung:
>
> root     28302  0.0  0.0  3668  192 ?  Ss  10:19  0:00 fenced -t 120 -w
>
> Q: Any idea how to "recover" from this state, without rebooting?
>
> The other two servers are unaffected by this (thankfully) and show
> normal operations:
>
> $ cman_tool services
>
> Service          Name        GID  LID  State  Code
> Fence Domain:    "default"     2    2  run    -
> [1 12]
>
> DLM Lock Space:  "home"        5    5  run    -
> [1 12]
>
> GFS Mount Group: "home"        6    6  run    -
> [1 12]
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
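(Following up on the recovery question above: one way to at least detect the
stuck state programmatically is to parse `cman_tool services` output. The
stuck_join helper below is a sketch of mine, not a cluster tool; if the fence
domain shows state "join", killing the hung fenced and retrying `fence_tool
join` might be worth trying before a reboot, though I can't vouch for it.)

```shell
#!/bin/sh
# Sketch: flag a fence domain stuck in a join transition by scanning
# `cman_tool services` output on stdin. stuck_join is a hypothetical helper.

stuck_join() {
    # prints "stuck" if the Fence Domain line reports state "join", else "ok"
    if grep 'Fence Domain' | grep -q 'join'; then
        echo stuck
    else
        echo ok
    fi
}

# Output captured from the hung node above:
printf 'Fence Domain:   "default"   0   2   join   S-2,2,1\n' | stuck_join   # → stuck
```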