[Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
Wendy Cheng
wcheng at redhat.com
Tue Apr 24 03:30:56 UTC 2007
Neil Brown wrote:
>One thing that has been bothering me is that sometimes the
>"filesystem" (in the guise of an fsid) is used to talk to the kernel
>about failover issues (when flushing locks or restarting the grace
>period) and sometimes the local network address is used (when talking
>with statd).
>
>
This is a perception issue - it depends on how the design is described.
More on this later.
>I would rather use a single identifier. In my previous email I was
>leaning towards using the filesystem as the single identifier. Today
>I'm leaning the other way - to using the local network address.
>
>
I guess you're juggling too many things and have forgotten why we went down
this route. We started the discussion using the network interface (to drop
the locks) but found it wouldn't work well on local filesystems such as
ext3. There is really no control over which local (server-side) interface
NFS clients will use (though it shouldn't be hard to implement one). When
the fail-over server starts to remove the locks, it needs a way to find
*all* of the locks associated with the will-be-moved partition. This is
to allow umount to succeed. The server IP address alone can't guarantee
that. That was the reason we switched to fsid. Also remember this is NFS
v2/v3 - clients have no knowledge of server migration.
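To make the fsid argument concrete, here is a toy model in Python (not
kernel code). The Lock record, the addresses, and the fsid values are all
made up for illustration. The point is only that dropping by server address
can leave locks behind on the filesystem being moved, while dropping by
fsid cannot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lock:
    client: str      # NFS client holding the lock
    server_ip: str   # server-side address the client happened to use
    fsid: int        # exported filesystem the locked file lives on


# Hypothetical lock table; fsid 1234 is the partition about to move.
locks = [
    Lock("clientA", "10.0.0.1", 1234),  # came in via the floating address
    Lock("clientB", "10.0.0.2", 1234),  # same filesystem, other interface
    Lock("clientC", "10.0.0.1", 5678),  # floating address, other filesystem
]


def drop_by_ip(table, ip):
    """Return the locks that remain after dropping by server address."""
    return [lock for lock in table if lock.server_ip != ip]


def drop_by_fsid(table, fsid):
    """Return the locks that remain after dropping by filesystem id."""
    return [lock for lock in table if lock.fsid != fsid]


def blocks_umount(table, fsid):
    """True if any remaining lock still refers to the given filesystem."""
    return any(lock.fsid == fsid for lock in table)
```

In this table, dropping by 10.0.0.1 still leaves clientB's lock on fsid
1234, so umount would fail, and it also throws away clientC's lock on an
unrelated filesystem. Dropping by fsid 1234 removes exactly the locks that
stand in the way of umount.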
Now, let's go back to the first paragraph. An active-active failover can
be described as a 5-step process:
Step 1. Quiesce the floating network address.
Step 2. Move the exported filesystem directories from Server A to Server B.
Step 3. Re-enable the network interface.
Step 4. Inform clients about the changes via NSM (Network Status
Monitor) Protocol.
Step 5. Grace period.
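The five steps can be written down as an ordered plan. This is only an
illustrative Python sketch: failover_plan, the host names, and the action
strings are placeholders, and the sub-steps shown inside step 2 (drop the
locks, then umount, then mount on the other side) are an assumption based
on the requirement above that the locks must be gone for umount to succeed.

```python
def failover_plan(floating_ip, fsid, src, dst, grace_seconds):
    """Return the failover as an ordered list of (step, host, action)."""
    return [
        (1, src, f"quiesce floating address {floating_ip}"),
        (2, src, f"drop NLM locks for fsid {fsid}"),
        (2, src, f"unexport and umount fsid {fsid}"),
        (2, dst, f"mount and export fsid {fsid}"),
        (3, dst, f"enable floating address {floating_ip}"),
        (4, dst, f"notify clients of {floating_ip} via NSM"),
        (5, dst, f"grace period of {grace_seconds}s for fsid {fsid}"),
    ]


# Print the plan for made-up values.
for step, host, action in failover_plan(
    "10.0.0.1", 1234, "serverA", "serverB", 10
):
    print(f"step {step} [{host}]: {action}")
```

Steps 1, 3 and 4 are keyed on the network address, while steps 2 and 5 are
keyed on the filesystem, which is why both identifiers show up in the
interface.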
I was told last week that, independent of lockd, some cluster
filesystems do have their own implementation of a grace period. It is on
the wish list that this feature be taken into consideration. IMHO, the
overall process should be viewed as a collaboration between the filesystem,
the network interface, and the NFS protocol itself. Mixing filesystem and
network operations is unavoidable.
On the other hand, the currently proposed interface is extensible. Say,
prefix a non-numerical string such as "DEV" or "UUID" to ask for dropping
locks, as in:
shell> echo "DEV12390" > /proc/fs/nfsd/nlm_unlock
or allow individual grace period of 10 seconds as:
shell> echo "1234 at 10" > nlm_set_grace_for_fsid
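As a rough sketch of how such prefixed strings could be told apart, here is
a small Python model of the unlock request only. It is not the patch's
parser: parse_unlock_request and the (kind, value) result are hypothetical,
and the only rules taken from the text are "all digits means fsid" and "a
DEV or UUID prefix selects another kind of identifier".

```python
def parse_unlock_request(text):
    """Classify a string written to the unlock file as (kind, value)."""
    text = text.strip()  # echo appends a trailing newline
    if text.isdecimal():
        # Plain number: treat it as an fsid.
        return ("fsid", int(text))
    for prefix in ("UUID", "DEV"):
        # Non-numerical prefix followed by the identifier itself.
        if text.startswith(prefix) and len(text) > len(prefix):
            return (prefix.lower(), text[len(prefix):])
    raise ValueError(f"unrecognized unlock request: {text!r}")


print(parse_unlock_request("1234\n"))
print(parse_unlock_request("DEV12390\n"))
```

Existing numeric input keeps its meaning, so adding a prefix later does not
break callers that only ever write an fsid.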
With that said, parts of the following flow confuse me. Comments are
inlined below.
>It works like this:
>
> We have a module parameter for lockd something like
> "virtual_server".
> If that is set to 0, none of the following changes are effective.
> If it is set to 1:
>
>
OK with me.
> The destination address for any lockd request becomes part of the
> key to find the nsm_handle.
>
>
As explained above, the address alone can't guarantee that the associated
locks get cleaned up for one particular filesystem.
> The my_name field in SM_MON requests and SM_UNMON requests is set
> to a textual representation of that destination address.
>
>
That's what the current patch does.
> The reply to SM_MON (currently completely ignored by all versions
> of Linux) has an extra value which indicates how many more seconds
> of grace period there is to go. This can be stuffed into res_stat
> maybe.
> Places where we currently check 'nlmsvc_grace_period', get moved to
> *after* the nlmsvc_retrieve_args call, and the grace_period value
> is extracted from host->nsm.
>
>
OK with me, but I don't see the advantage.
> This is the full extent of the kernel changes.
>
> To remove old locks, we arrange for the callbacks registered with
> statd for the relevant clients to be called.
> To set the grace period, we make sure statd knows about it and it
> will return the relevant information to lockd.
> To notify clients of the need to reclaim locks, we simply use the
> information stored by statd, which contains the local network
> address.
>
>
I'm lost here... help ?
>The only aspect of this that gives me any cause for concern is
>overloading the return value for SM_MON. Possibly it might be cleaner
>to define an SM_MON2 with different args or whatever.
>As this interface is entirely local to the one machine, and as it can
>quite easily be kept back-compatible, I think the concept is fine.
>
>
Agreed!
>Statd would need to pass the my_name field to the ha callout rather
>than replacing it with "127.0.0.1", but other than that I don't think
>any changes are needed to statd (though I haven't thought through that
>fully yet).
>
>
That's what the current patch does.
>Comments?
>
I feel we're going in circles again... If there is any way I can shorten
this discussion, please do let me know.
-- Wendy