[Cluster-devel] [PATCH 0/4 Revised] NLM - lock failover
Wendy Cheng
wcheng at redhat.com
Tue Apr 17 19:30:21 UTC 2007
A few new thoughts from the latest round of review are really good and
worth doing.
However, since this particular NLM patch set is only part of the overall
scaffolding code to allow NFS V3 server failover before NFS V4 is
widely adopted and stabilized, I'm wondering whether we should drag
ourselves too far for something that will be replaced soon. Lon and I
have been discussing the possibility of proposing new design changes to
the existing state monitoring protocol itself - but I'm leaning toward
*not* doing client SM_NOTIFY eventually (by passing the lock states
directly from the failed server to the take-over server, if at all
possible). This would consolidate a few of the next work items, such as
NFSD V3 request reply cache entries (or at least non-idempotent
operation entries) or NFS V4 states, that need to get moved around
between the failover servers.
In general, NFS cluster failover has been error-prone and has timing
constraints (e.g. failover must finish within a sensible time interval).
Would it make more sense to have a workable solution with a restricted
application first? We can always merge various pieces together later as
we learn more from our users. For this reason, simple and plain
patches like this set would work best for now.
In any case, the following collects the review comments so far:
o 1-1 [from hch]
"Dropping locks should also support uuid or dev_t based exports."
A valid request. The easiest solution might be simply to take Neil's
idea of using the export path name. So this issue is folded into 1-3
(see below for details).
o 1-2 [from hch]
"It would be nice to have a more general push api for changes to
filesystem state, that works on a similar basis as getting information
from /etc/exports."
Could hch (or anyone) elaborate more on this? Should I interpret it as
implementing a configuration file that describes the failover options,
with a format similar to /etc/exports (including filesystem
identifiers, the length of the grace period, etc.), and a command (maybe
two - one on the failover server and one on the take-over server) to
kick off the failover based on the pre-defined configuration file?
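To make the question concrete, here is a purely hypothetical sketch of what
such a configuration file might look like - the file name, keywords, fsid
values, and addresses are all invented for illustration, not part of the
patch set:

```
# /etc/nfs-failover.conf (hypothetical) - one line per failover-capable
# export, loosely modeled on the /etc/exports layout
# <export-path>        <identifier>   <options>
/export/shared_disk1   fsid=1234      grace=90,takeover=10.0.0.2
/export/shared_disk2   fsid=5678      grace=90,takeover=10.0.0.3
```

A failover command on each server could then read this file to decide which
locks to drop and which grace period to restart.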
o 1-3 [from neilb]
"It would seem to make more sense to use the filesystem name (i.e. a
path) by writing a directory name to /proc/fs/nfsd/nlm_unlock and maybe
also to /proc/fs/nlm_restart_grace_for_fs, and have 'my_name' in the
SM_MON request be the path name of the export point rather than the
network address."
It was my mistake to mention in a previous post that we could use "fsid"
in the "my_name" field. As Lon pointed out, SM_MON requires the server
address so that we do not blindly notify clients, which could result in
unspecified behavior. On the other hand, the "path name" idea does
solve various problems if we want to support different types of existing
filesystem identifiers for failover purposes. Combined with the
configuration file mentioned in 1-2, this could be a nice long-term
solution. A few concerns (about using the path name alone):
* String comparison can be error-prone and slow.
* It loses the abstraction provided by the "fsid" approach, particularly
for cluster filesystem load balancing. With the "fsid" approach, we
can simply export the same directory using two different fsids
(associated with two different IP addresses) for various purposes on the
same node.
* We will have to repeatedly educate users that "dev_t" is not unique
across reboots or nodes; that a uuid is restricted to one single disk
partition; and that both of them require extra steps to obtain values
from somewhere else that are not easily readable by humans. My support
experience has taught me that by the time users really understand the
difference, they'll switch to fsid anyway.
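As an illustration of the abstraction point above, the fsid approach would
allow an /etc/exports fragment along these lines (the path, client ranges,
and fsid values are made up; the binding of each fsid to its own floating IP
would be handled by the cluster manager, not shown here):

```
# Hypothetical: one cluster-filesystem directory exported twice under
# different fsids, so each export can follow a different floating IP
/mnt/gfs   192.168.1.0/24(rw,fsid=1234)
/mnt/gfs   192.168.2.0/24(rw,fsid=5678)
```

With a bare path name as the identifier, the two exports above would be
indistinguishable.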
o 1-4 [from bfields]
"Unrelated bug fixes should be broken out from the feature patches."
Will do.
o 2-1 [from cluster coherent NFS conf. call]
"Hooks to allow a cluster filesystem to do its own "start" and "stop" of
the grace period."
This could be solved by using the configuration file described in 1-2.
o 3-1 [from okir]
"There's not enough room in the SM_MON request to accommodate additional
network addresses (e.g. IPv6)."
SM_MON is sent and received *within* the very same server. Does it
really matter whether we follow the protocol standard in this particular
RPC call? My guess is not. The current patch writes the server IP into
the "my_name" field as a variable-length character array. I see no
reason this can't be a larger character array (say 128 bytes, for IPv6)
to accommodate all the existing network addressing we know of.
o 3-2 [from okir]
"Should we think about replacing SM_MON with some new design altogether
(think netlink)?"
Yes. But before we spend the effort, I would suggest we focus on:
1. Offering a tentative workable NFS V3 solution for our users first.
2. Checking the requirements of the NFS V4 implementation so we don't
end up revising the new changes again when V4 failover arrives.
In short, my vote is to take this (NLM) patch set and let people try it
out while we switch gears to look into other NFS V3 failover issues
(nfsd in particular). Neil?
-- Wendy