[Cluster-devel] GFS2: Wait for journal id on mount if not specified on mount command line

Mon Jun 7 17:34:14 UTC 2010

On Mon, Jun 07, 2010 at 04:39:09PM +0100, Steven Whitehouse wrote:
> 
> This patch implements a wait for the journal id in the case that it has
> not been specified on the command line. This is to allow the future
> removal of the mount.gfs2 helper. The journal id would instead be
> directly communicated by gfs_controld to the file system. Here is a
> comparison of the two systems:
> 
> Current:
> 1. mount calls mount.gfs2
> 2. mount.gfs2 connects to gfs_controld to retrieve the journal id
> 3. mount.gfs2 adds the journal id to the mount command line and calls
> the mount system call
> 4. gfs_controld receives the status of the mount request via a uevent
> 
> Proposed:
> 1. mount calls the mount system call (no mount.gfs2 helper)
> 2. gfs_controld receives a uevent for a gfs2 fs which it doesn't know
> about already
> 3. gfs_controld assigns a journal id to it via sysfs
> 4. the mount system call then completes as normal (sending a uevent
> according to status)

Proposed is the way it originally worked.  I switched to using Current
back in 2005... unfortunately I don't remember all the specific reasons,
but I'm pretty sure it was the error/edge cases that were better handled
without sitting in the kernel early in the process.  (Especially when you
combine simultaneous mounting / mount failures / node failures / recovery.)

A couple of obvious questions from the start...
- What if gfs_controld isn't running?
- Won't processes start to access the fs and block during this intermediate
time between mount(2) and getting a journal id?  All of those processes
now need errors returned if gfs_controld returns an error instead of a
journal id.
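
To make those two questions concrete, here is a toy userspace model of the
wait (not kernel code, and not what the patch does). The class name, the
timeout and the error strings are all assumptions made for illustration.
The point is that the waiter needs a way out both when nobody ever answers
and when the answer is an error rather than a journal id.

```python
import threading


class JournalIdWait:
    """Toy stand-in for a mount that blocks until a journal id is supplied."""

    def __init__(self):
        self._ready = threading.Event()
        self._jid = None
        self._error = None

    def assign(self, jid):
        # Hypothetical success path: the daemon hands over a journal id.
        self._jid = jid
        self._ready.set()

    def fail(self, error):
        # Hypothetical failure path: the daemon reports an error instead.
        self._error = error
        self._ready.set()

    def wait(self, timeout):
        # Case 1: nobody answers (e.g. the daemon is not running).
        if not self._ready.wait(timeout):
            raise TimeoutError("no journal id supplied")
        # Case 2: an error came back; every waiter has to see it.
        if self._error is not None:
            raise OSError(self._error)
        return self._jid
```

Anything that blocks on the fs in that window would have to go through the
same two failure paths.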

Another way to compare them:

Current:
- get all the userspace/clustering-related/error-laden overhead sorted out
- then, at the very end, pull the kernel fs into the picture
- collect the result of mount(2) in userspace, which is almost always
  "success"

Proposed:
- pull the kernel fs into the picture
- transition to userspace to sort out all the clustering-related /
  error-laden overhead
- get back to the kernel with the result
- collect the result of mount(2) in userspace

The further you get before you encounter errors, the harder they are to
handle.  You want most errors to happen earlier, with fewer entities
involved, so backing out is easier to do.
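
As a rough illustration of that point, the following toy model (step names
are made up for the example, they are not real gfs2 or gfs_controld calls)
lists what has to be undone when the journal id assignment fails under each
ordering.

```python
def steps_to_undo(steps, fail_at):
    """Return the already-completed steps, newest first, that must be
    backed out when the step named `fail_at` fails."""
    done = []
    for name in steps:
        if name == fail_at:
            return list(reversed(done))
        done.append(name)
    return []  # nothing failed, nothing to undo


# Illustrative orderings only, loosely following the two lists above.
CURRENT = [
    "helper_contacts_gfs_controld",
    "journal_id_assigned",
    "mount_syscall",
]
PROPOSED = [
    "mount_syscall_entered",
    "uevent_to_gfs_controld",
    "journal_id_assigned",
    "mount_syscall_completes",
]

undo_current = steps_to_undo(CURRENT, "journal_id_assigned")
undo_proposed = steps_to_undo(PROPOSED, "journal_id_assigned")
```

In this model the Current ordering fails with only userspace state to clean
up, while the Proposed ordering fails with the kernel already in the middle
of the mount.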

IIRC, nfs recently moved to using a mount helper after *not* using one for
many years.  It would be interesting to ask them about their motivations.

> The advantage of the proposed system is that it is completely backward
> compatible with the current system both at the kernel and at the
> userland levels. The "first" parameter can also be set the same way,
> with the restriction that it must be set before the journal id is
> assigned.

That's not an "advantage" of new versus old, which is the missing bit of
information here.  I'm not against changing it per se, but it seems we'd
want some substantial advantage before going to all the effort of changing
such a delicate area that has worked quite well for the past 5 years.

There's room for real, major improvements in this whole area, but you're
barking up the wrong tree.  gfs_controld has always been far too complex.
But it's *not* a result of the current mount helper scheme.  It is a direct
result of gfs_controld being required to do jobs that gfs (in kernel)
should probably handle itself:  allocating journal ids, coordinating who
does journal recovery, coordinating first mounter recovery, sorting out
valid combinations of mount options from different nodes, keeping track of
recovered journals vs journals that haven't been recovered, coordinating
when all journals have been successfully recovered so that normal fs
access can be continued.

If you want to do something that's meaningful and beneficial in this area,
you need to look at moving *those* things from gfs_controld into gfs.
Ocfs2 is a good example here:  it handles almost all of that stuff in the
kernel, and leaves only what's really necessary for ocfs2_controld.

In fact, this could be a perfect area for gfs2/ocfs2 unification:  adopt a
single fs_controld, single mount/unmount scheme, single node failure/recovery
notification scheme, single journal id/allocation scheme.

Dave