[Linux-cluster] RHEL/CentOS-6 HA NFS Configuration Question
ajb2 at mssl.ucl.ac.uk
Wed Sep 5 16:48:27 UTC 2012
On 05/09/12 15:59, Randy Zagar wrote:
> What I don't understand is what changed between RHEL-5 and RHEL-6 that
> has made HA NFS failover so difficult?
HA NFS failover has always been difficult for a number of reasons, most of
them related to how abysmal the Linux NFS implementation is.
> I have been running a 3-node CentOS-5 cluster serving home directories
> (via NFS) to 200+ users for several years now and have been able to fail
> over home directories without significant issues.
> With CentOS-5, most of the time I am able to avoid stale filehandle
You can avoid those almost completely if you modify
/usr/share/cluster/nfsclient.sh to wrap every exportfs call in flock.
The alternative is to replace /usr/sbin/exportfs with a flock wrapper,
which may well be cleaner and safer.
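A minimal sketch of such a wrapper, written in Python purely for illustration, is below. It assumes the real binary has been moved aside to /usr/sbin/exportfs.real and that /var/lock/exportfs.lock is used as the lock file; both paths, and the names run_locked and main, are hypothetical, and this is not the nfsclient.sh patch mentioned in this post. From a shell script, util-linux's flock(1) gives the same effect (`flock LOCKFILE command args`).

```python
import fcntl
import subprocess
import sys

LOCK_PATH = "/var/lock/exportfs.lock"      # hypothetical lock file location
REAL_EXPORTFS = "/usr/sbin/exportfs.real"  # hypothetical path of the moved-aside binary


def run_locked(argv, lock_path=LOCK_PATH):
    """Run argv while holding an exclusive flock(2) lock on lock_path.

    Concurrent callers block in fcntl.flock() until the current holder
    finishes, so at most one wrapped command runs at a time.
    """
    with open(lock_path, "a") as lock:
        fcntl.flock(lock.fileno(), fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return subprocess.run(argv).returncode
        finally:
            fcntl.flock(lock.fileno(), fcntl.LOCK_UN)


def main(argv=None):
    """Wrapper entry point: forward all arguments to the real exportfs."""
    args = sys.argv[1:] if argv is None else list(argv)
    return run_locked([REAL_EXPORTFS] + args)
```

Installed as /usr/sbin/exportfs, the script would end with the usual `if __name__ == "__main__": sys.exit(main())` guard. Because the lock is held for the whole command, every caller (nfsclient.sh or anything else) is serialized without editing the cluster scripts.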
The problem is that exportfs is not multi-instance aware. If you have
multiple NFS services and several copies of exportfs run at once, you end
up with a race condition that can clobber the export tables.
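To illustrate how concurrent exportfs runs can clobber the export tables, the sketch below models each run as a read-modify-write of a shared table file. The table format, the file paths, and the add_export/run_concurrently helpers are hypothetical stand-ins, not nfs-utils code; the locked path uses the same flock(2) locking a wrapper would.

```python
import fcntl
import threading
import time


def add_export(table_path, lock_path, entry, use_lock=True):
    """Append one entry to a shared export table by read-modify-write.

    Hypothetical stand-in for an exportfs run updating its table; this is
    not nfs-utils code.
    """
    with open(lock_path, "a") as lock:
        if use_lock:
            fcntl.flock(lock.fileno(), fcntl.LOCK_EX)
        with open(table_path) as f:
            entries = f.read().split()
        time.sleep(0.01)  # widen the gap between reading and writing the table
        entries.append(entry)
        with open(table_path, "w") as f:
            f.write("\n".join(entries) + "\n")
    # Leaving the outer with-block closes the lock file, which releases the flock.


def run_concurrently(table_path, lock_path, n, use_lock=True):
    """Start n simultaneous updaters against an empty table; return its final contents."""
    open(table_path, "w").close()  # start from an empty table
    threads = [
        threading.Thread(
            target=add_export,
            args=(table_path, lock_path, f"client{i}:/export", use_lock),
        )
        for i in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with open(table_path) as f:
        return f.read().split()
```

With use_lock=True every entry survives. With use_lock=False the result typically has far fewer than n entries, because threads that read the table before anyone has written it overwrite one another's updates; the exact count varies between runs.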
Quite simply: the Linux NFS code is as crufty as hell, and nothing short of
a proper clean-room rewrite will fix it. Even the NFSv4 code copies in
mountains of 20- to 30-year-old code that has never been properly vetted.
I can post a diff for the script if you'd like it. We notified Red Hat
about this issue (and the fix) years ago, but they still haven't gotten
around to including it in the official packages.
> So far, I've been unwilling to use CentOS-6 for HA NFS as I
> can't get failover to work properly.
RHEL6 NFS failover seems to work with NFSv3 for me, but I haven't tested
it under production loads yet. When we moved from EL4 to EL5, everything
broke once production loads were applied. (To avoid trouble, we paid a
Red Hat engineer several thousand dollars to perform the EL4-to-EL5
changeover overnight; everything crashed and burned the morning after.)