[Linux-cluster] NFS Failover

Mon Apr 26 17:58:16 UTC 2010

Hi,

NFS setup: 2 servers, stock Red Hat 5.4.

The following is on the SAN:
1) /var/lib/nfs     (so that I can preserve locks between the 2 servers)
2) /export/home (home area I export to)
3) /export/shared

Setup:
1) HA-LVM (so that only 1 NFS server can see the volume at one time)
2) /export/home
192.168.251.0/255.255.255.0(rw,async,no_root_squash,fsid=4000)
3) Shared IP
4) All NFS dynamic ports are locked down to static ones
5) rpc.statd is started with "-n <hostnameoffloatingip>"
6) RPCNFSDCOUNT=64
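
Since both servers must present the same ports and statd name for a failover to be transparent, one way to sanity-check the setup above is to diff the NFS settings of the two nodes. The following is only a sketch: it assumes the settings live in a shell-style KEY=VALUE file such as /etc/sysconfig/nfs, and the variable names in EXPECTED_KEYS (including STATD_HOSTNAME) are assumptions to be adjusted to whatever the local init scripts actually read.

```python
# Assumed variable names; adjust to what the local init scripts actually read.
EXPECTED_KEYS = (
    "RPCNFSDCOUNT", "MOUNTD_PORT", "STATD_PORT",
    "LOCKD_TCPPORT", "LOCKD_UDPPORT", "RQUOTAD_PORT", "STATD_HOSTNAME",
)


def parse_sysconfig(text):
    """Parse shell-style KEY=VALUE lines, ignoring blanks and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(settings, expected=EXPECTED_KEYS):
    """Return the expected keys that are unset or empty."""
    return [key for key in expected if not settings.get(key)]


def differences(node_a, node_b, keys=EXPECTED_KEYS):
    """Return {key: (value_on_a, value_on_b)} where the two nodes disagree."""
    return {
        key: (node_a.get(key), node_b.get(key))
        for key in keys
        if node_a.get(key) != node_b.get(key)
    }
```

Feeding it the contents of the file from each server (copied over by hand or via ssh) should give an empty result from differences() and from missing_keys() on both nodes if item 4 and item 5 are applied consistently.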

The Service setup (with the parent-child relationship):
- Floating IP
 |- LVM, FileSystem Mounts (to mount /var/lib/nfs, /export/home)
  |--- nfslock
    |----- nfs

It seems to be working: I have failed it over several hundred times.
The only issue is that after fail-over some clients can stop writing.

Clients mount with defaults,async,noatime,proto=udp. The defaults are
a hard mount and NFSv3.
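
To confirm which options a client actually ended up with (rather than what was requested), the effective options can be read back from /proc/mounts. A minimal sketch follows; the sample line in the usage note is hypothetical, and the exact option spellings the kernel prints may differ between kernel versions, so the parser makes no assumption about them.

```python
def nfs_mounts(mounts_text):
    """Return {mount_point: options} for NFS entries in /proc/mounts-style text.

    Options of the form key=value map to the value string; bare flags
    such as "hard" map to True.
    """
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for item in fields[3].split(","):
            key, sep, value = item.partition("=")
            options[key] = value if sep else True
        found[fields[1]] = options
    return found
```

For example, given a (made-up) line such as "192.168.251.10:/export/home /home nfs rw,noatime,hard,proto=udp 0 0", nfs_mounts() returns an entry for /home whose "proto" is "udp" and whose "hard" flag is True. On a client, pass it the result of open("/proc/mounts").read().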

My test has 4 NFS clients, each running 8 processes that write to
files while I perform the failover.
Sometimes clients stop writing, which is inconsistent with the
mounts being hard.
I've tried clients with Red Hat 5.4.x and 5.5 kernels with the same results.
Changing timeo and retrans does not help either.
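
For anyone who wants to try to reproduce this, the write load described above can be approximated per client with something like the sketch below. It is not the exact test used here: the record size, the per-write fsync, the stall threshold, and the function names are all assumptions chosen for illustration, and it assumes a Linux client with Python 3.

```python
import multiprocessing
import os
import queue
import sys
import time


def writer(path, duration_s, stall_s, results):
    """Append 4 KiB records to path, fsync each one, then report the outcome."""
    count = 0
    worst = 0.0
    deadline = time.time() + duration_s
    try:
        with open(path, "ab") as fh:
            while time.time() < deadline:
                start = time.time()
                fh.write(b"x" * 4096)
                fh.flush()
                os.fsync(fh.fileno())  # push the data to the server now
                elapsed = time.time() - start
                worst = max(worst, elapsed)
                count += 1
                if elapsed > stall_s:
                    sys.stderr.write("%s: write took %.1fs\n" % (path, elapsed))
        results.put((path, count, worst, None))
    except OSError as exc:
        # On a hard mount a write should block, not fail, so report any error.
        results.put((path, count, worst, str(exc)))


def run(directory, nproc=8, duration_s=60.0, stall_s=5.0, grace_s=30.0):
    """Start nproc writers and return (reports, exit_codes)."""
    ctx = multiprocessing.get_context("fork")  # assumes a Linux client
    results = ctx.Queue()
    procs = []
    for i in range(nproc):
        path = os.path.join(directory, "writer-%d.dat" % i)
        proc = ctx.Process(target=writer,
                           args=(path, duration_s, stall_s, results),
                           daemon=True)
        proc.start()
        procs.append(proc)
    give_up = time.time() + duration_s + grace_s
    for proc in procs:
        proc.join(max(0.0, give_up - time.time()))
    reports = []
    while len(reports) < nproc:
        try:
            reports.append(results.get(timeout=1.0))
        except queue.Empty:
            break  # at least one writer died or hung without reporting
    # exitcode is None for a writer that is still blocked, negative if killed
    return reports, [proc.exitcode for proc in procs]
```

Calling run() on a directory inside the NFS mount on each client while the service is failed over gives, per writer, the number of records written, the longest single write, and any error string. Fewer reports than writers, or an exit code that is not 0, would match the "stopped writing" symptom.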

I tried the TCP option and the clients panicked (bugzilla.redhat.com #585269)
during fail-over, hence the udp option.

I wonder if anyone else is seeing the same thing. The annoying part is that the
clients stop writing only some of the time, not every time.
The failover itself completed every time. After the fail-over, the clients can
still see the mounted space.

I noticed that when a client has issues, rpciod/6 shoots up to
100% CPU for several seconds. My processes that are writing files also go to
100%, then die without finishing writing the files.

It feels like a bug in the NFS client, but I'm not certain. I would like to
ask the community for a second opinion.

Thanks.