[Cluster-devel] Re: [NFS] [PATCH 0/3] NLM lock failover

Sat Aug 5 05:44:42 UTC 2006

On Fri, 2006-08-04 at 11:51 -0400, Trond Myklebust wrote:
> On Fri, 2006-08-04 at 10:56 -0400, Wendy Cheng wrote:
> > Anyway, better be conservative than sorry - I think we want to switch to
> > "fsid" approach to avoid messing with these networking issues, including
> > IPV6 modification. That is, we use fsid as the key to drop the lock and
> > set per-fsid NLM grace period. The ha-callout will have a 4th argument
> > (fsid) when invoked. 
> 
> What is the point of doing that? As far as the client is concerned, a
> server has either rebooted or it hasn't. It doesn't know about single
> filesystems rebooting.
> 

For active-active failover, the submitted patches allow:  

1: Drop the locks tied to one particular floating ip (on the old server).
2: Notify relevant clients that the floating ip has been rebooted.
3: Set a per-ip nlm grace period.
4: The (notified) nfs clients reclaim their locks on the new server.

While the above 4 steps are being executed, both servers stay up and
their other nfs services continue uninterrupted. (1) and (3) are
accomplished by Patch 3-1 and Patch 3-2. (4) is the nfs client's task,
which follows its existing logic without changes.

For (2), the basics are built upon the existing rpc.statd HA features,
specifically the -H option and the -n, -N and -P options. It does,
however, need Patch 3-3 to pass the correct floating ip address into the
rpc.statd user-mode daemon, as follows:

For systems not involved in HA failover, nothing has changed. All new
functions are optional add-on features. For cluster failover:

1. The rpc.statd is dispatched as "rpc.statd -H ha-callout"
2. Upon each monitor RPC call (SM_MON or SM_UNMON), rpc.statd
   receives the following from the kernel:
   2-a: event (mon or unmon)
   2-b: server interface
   2-c: client interface.
3. The rpc.statd does its normal chores by writing or deleting 
   the client interface to/from the default sm directory. Server
   interface is not used here.  
   (btw, this is the existing logic without changes).
4. Then, the rpc.statd invokes ha-callout with the following three
   arguments:
   4-a: event (add-client or del-client)
   4-b: server interface
   4-c: client interface
   The ha-callout (in our case, it will be part of RHCS cluster suite)
   builds multiple sm directories based on 4-b, say 
   /shared_storage/sm_x,  where x is server's ip interface.
5. Upon failover, the cluster suite invokes 
   "rpc.statd -n x -N -P /shared_storage/sm_x" to notify affected
   clients. The new short-lived rpc.statd sends the notification to
   relevant (nlm) clients and then exits. The old rpc.statd
   (from step 1) is not aware of the failover event.
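
As a rough illustration of step 4, a three-argument ha-callout could look
something like the sketch below. This is not the RHCS implementation. The
function name ha_callout and the SM_BASE variable are made up for the
example, and the argument order simply follows 4-a/4-b/4-c above (check
the rpc.statd man page for the order your version actually passes). It
also assumes that an empty file named after the client is enough to
record it; the exact layout that "rpc.statd -P" expects is not covered
here.

```shell
#!/bin/sh
# Hypothetical ha-callout sketch: keeps one client list per server interface.
# SM_BASE is an assumed variable; the default matches the /shared_storage
# example used in this mail.
SM_BASE="${SM_BASE:-/shared_storage}"

# Usage: ha_callout <event> <server-interface> <client-interface>
ha_callout() {
    event="$1"
    server="$2"
    client="$3"
    dir="$SM_BASE/sm_$server"

    case "$event" in
        add-client)
            # Record the client under the directory for this server interface.
            mkdir -p "$dir" && touch "$dir/$client"
            ;;
        del-client)
            # Forget the client; a missing file is not an error.
            rm -f "$dir/$client"
            ;;
        *)
            echo "ha-callout: unknown event '$event'" >&2
            return 1
            ;;
    esac
}
```

A real callout script would end with a line such as ha_callout "$@" so
that rpc.statd's arguments are passed straight through.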

Note that before patch 3-3, the kernel always sets 2-b to
system_utsname.nodename. As for rpc.statd, if the RESTRICTED_STATD flag is
on, it always sets 4-b to 127.0.0.1. Without RESTRICTED_STATD, it sets
4-b to whatever was passed by the kernel (via 2-b). What (kernel) patch
3-3 does is set 2-b to the floating ip so that rpc.statd can get the
correct ip and pass it into 4-b.

Greg said (I haven't figured out how) that without setting 4-b to
127.0.0.1 we "may" open a security hole. So the thinking here is: let's
not change anything, but add an fsid as a 4th argument for ha-callout:

   4-d: fsid.

where "fsid" can be viewed as a unique identifier for an NFS export
specified in the exports file (check "man exports"); e.g.

        /failover_dir   *(rw,fsid=1234)

With the added fsid info from the ha-callout program, the cluster suite
(or a human administrator) should be able to work out which (nlm) clients
have been affected by one particular failover.
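
To make the idea concrete, here is a minimal sketch of what a
four-argument callout could record, and how the cluster suite could then
look up the clients behind one fsid. Everything in it is hypothetical:
the function names fsid_callout and clients_for_fsid, the SM_BASE
variable, and the fsid_<n> directory layout are made up for illustration
and are not part of the submitted patches.

```shell
#!/bin/sh
# Hypothetical per-fsid index maintained by a four-argument ha-callout.
# SM_BASE and the fsid_<n> directory naming are assumptions for this sketch.
SM_BASE="${SM_BASE:-/shared_storage}"

# Usage: fsid_callout <event> <server-interface> <client-interface> <fsid>
fsid_callout() {
    event="$1"
    server="$2"
    client="$3"
    fsid="$4"
    dir="$SM_BASE/fsid_$fsid"

    case "$event" in
        add-client)
            # Remember which server interface the client used for this fsid.
            mkdir -p "$dir" && echo "$server" > "$dir/$client"
            ;;
        del-client)
            rm -f "$dir/$client"
            ;;
        *)
            return 1
            ;;
    esac
}

# Usage: clients_for_fsid <fsid>
# Prints one client per line; prints nothing if the fsid is unknown.
clients_for_fsid() {
    dir="$SM_BASE/fsid_$1"
    [ -d "$dir" ] || return 0
    ls "$dir"
}
```

For example, after a failover of fsid 1234, "clients_for_fsid 1234" would
list the clients that need to be told to reclaim their locks.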

From an implementation point of view, since the fsid, if specified, is
already part of the filehandle, which in turn is part of the nlm_file
structure, we should be able to replace the floating ip in the submitted
patches with the fsid and still accomplish the very same thing. In short,
the failover sequence with the new interface would look like:

taken-over server:
A-1. tear down floating ip, say 10.10.1.1
A-2. unexport subject filesystem
A-3. "echo 1234 > /proc/fs/nfsd/nlm_unlock"  //fsid=1234
A-4. umount filesystem.

take-over server:
B-1. mount the subject filesystem
B-2. "echo 1234 > /proc/fs/nfsd/nlm_set_ip_grace"
B-3. "rpc.statd -n 10.10.1.1 -N -P /shared_storage/sm_10.10.1.1"
B-4. bring up 10.10.1.1
B-5. re-export the filesystem

A-3 and B-2 could be issued multiple times if the floating ip is
associated with multiple fsid(s).
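
Put together, the two sequences could be scripted roughly as below. This
is only a sketch: the interface name eth0, the /24 prefix, the block
device /dev/sdb1, the DRY_RUN switch, and the function names are all
assumptions made for the example, and the two /proc files exist only with
the proposed patches applied. By default it just prints the commands; for
several fsids behind one floating ip, the A-3 and B-2 lines would be
repeated per fsid.

```shell
#!/bin/sh
# Sketch of steps A-1..A-4 and B-1..B-5 for a single fsid.
# All values below are illustrative assumptions, not taken from a real setup.
FLOAT_IP=10.10.1.1
PREFIX=24
DEV=eth0
EXPORT_DIR=/failover_dir
BLOCK_DEV=/dev/sdb1
FSID=1234
DRY_RUN="${DRY_RUN:-1}"   # 1: print commands only; 0: execute (needs root)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        sh -c "$*"
    fi
}

# Run on the taken-over (old) server.
release_service() {
    run "ip addr del $FLOAT_IP/$PREFIX dev $DEV" &&       # A-1
    run "exportfs -u '*:$EXPORT_DIR'" &&                  # A-2
    run "echo $FSID > /proc/fs/nfsd/nlm_unlock" &&        # A-3
    run "umount $EXPORT_DIR"                              # A-4
}

# Run on the take-over (new) server.
acquire_service() {
    run "mount $BLOCK_DEV $EXPORT_DIR" &&                                 # B-1
    run "echo $FSID > /proc/fs/nfsd/nlm_set_ip_grace" &&                  # B-2
    run "rpc.statd -n $FLOAT_IP -N -P /shared_storage/sm_$FLOAT_IP" &&    # B-3
    run "ip addr add $FLOAT_IP/$PREFIX dev $DEV" &&                       # B-4
    run "exportfs -o rw,fsid=$FSID '*:$EXPORT_DIR'"                       # B-5
}
```

The cluster suite would call release_service on the old server and then
acquire_service on the new one. Each step is chained with && so a failed
step stops the sequence instead of carrying on with a half-moved service.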

Does this make sense?

This fsid approach can also resolve Neil's concern (about an nlm client
using the wrong server interface to access a filesystem), which I'll
follow up on sometime next week.

-- Wendy