[Linux-cluster] NFS on GFS architectural issues / problems

Tom Mornini tmornini at engineyard.com
Mon Aug 21 15:44:57 UTC 2006


I get an error page saying that no document with ID 99 exists.

I'm about to set up a cluster that uses NFS with GFS, so I'd love to
read that document.

On Aug 21, 2006, at 12:42 AM, Riaan van Niekerk wrote:

> hi Bob and others
>
> I found the following GFS1/GFS2 design document on the Red Hat 108
> Developer Portal, which details, amongst other things, some of the
> issues with NFS on GFS:
> https://rpeterso.108.redhat.com/servlets/ProjectDocumentView?documentID=99
>
> (I see it was sent to this list over a year ago, but I never found it
> while searching through the archives. It has a lot of good information
> in it.)
>
> It has a disclaimer: "Some of the comments are no longer applicable
> due to design changes."
>
> My question to you, or to anyone who is familiar with NFS on GFS or
> with GFS in general, is: which of the following are still valid issues
> for the current (6.1u4) version of GFS? If all or most of them still
> apply, I can use this as motivation for my customer to strongly
> consider moving off NFS on GFS. Removing NFS from our GFS cluster has
> been on the cards for quite a while, but it has not gained momentum
> due to a lack of information on the performance gains of such a move
> (very difficult to gauge) and on the architectural problems/
> limitations of NFS on GFS (for which the following extract is spot
> on).
>
> Note - can you consider adding a link to this document from your FAQ?
>
> +++++++++
>
> o  NFS Support
>
> A GFS filesystem can be exported through NFS to other nodes.  There
> are a number of issues with NFS on top of a cluster filesystem,
> though.
>
> 1) Filehandle misses
>
>    When an NFS request comes into the server, it's up to the
>    filesystem (and a few Linux helper routines) to map the NFS
>    filehandle to the correct inode.  Doing that is easy if the inode
>    is already in the node's cache.  The tricky part is when the
>    filesystem must read in the inode from the disk.  There is nothing
>    in the filehandle that anchors the inode into the filesystem (such
>    as a glock on a directory that contains an entry pointing to the
>    inode), so a lot more care has to be taken to make sure the block
>    really contains a valid inode.  (See the section on the proposed
>    new RG formats.)
>
>    It's also non-trivial to handle inode migration in GFS when an NFS
>    server is running.  There is no centralized data structure that
>    can map filehandles into inodes (such a structure would be a
>    scalability/performance bottleneck).  It's difficult to find a
>    representation of the inode that could be used to locate it
>    quickly even in the face of the inode changing blocks.
>
>    Another problem is that filehandle requests can come in at random
>    times for inodes that don't exist anymore or are in the process of
>    being recreated.  This can break optimizations based on ideas like
>    "since this node is in the process of creating this inode, it is
>    the only one that knows about its locks".  GFS has suffered from
>    these mis-optimizations in the past.  From what I've seen, I
>    believe OCFS2 currently has problems like this, too.
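>
>    To make the "anchorless filehandle" problem concrete, here is a
>    small userspace illustration (not GFS code) of handle-based inode
>    lookup, using Linux's name_to_handle_at(2) and
>    open_by_handle_at(2) syscalls.  Reopening by handle alone requires
>    CAP_DAC_READ_SEARCH, and the mount-fd handling is simplified:
>
>        /* Resolve a path to an opaque handle, then reopen the file
>         * purely by handle, with no pathname involved; this is the
>         * situation an NFS server is in when a filehandle arrives. */
>        #define _GNU_SOURCE
>        #include <fcntl.h>
>        #include <stdio.h>
>        #include <stdlib.h>
>        #include <unistd.h>
>
>        int main(int argc, char **argv)
>        {
>            if (argc != 2) {
>                fprintf(stderr, "usage: %s <path>\n", argv[0]);
>                return 1;
>            }
>
>            int mount_id;
>            struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
>            fh->handle_bytes = MAX_HANDLE_SZ;
>
>            if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0)) {
>                perror("name_to_handle_at");
>                return 1;
>            }
>
>            /* A real server maps mount_id to a mount point via
>             * /proc/self/mountinfo; "/" is a stand-in here. */
>            int mnt = open("/", O_RDONLY | O_DIRECTORY);
>            int fd = open_by_handle_at(mnt, fh, O_RDONLY);
>            if (fd < 0)
>                perror("open_by_handle_at");  /* ESTALE: inode gone */
>            else
>                close(fd);
>            free(fh);
>            return 0;
>        }
>
>    If the file is unlinked between the two calls, the reopen fails
>    with ESTALE, which is the userspace analogue of the "does this
>    block really contain a valid inode" check described above.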
>
> 2) Readdir
>
>    Linux has an interesting mechanism to handle readdir() requests.
>    The VFS (or NFSD) passes the filesystem a request containing not
>    only the directory and offset to be read, but a filldir function
>    to call for each entry found.  So, the filesystem doesn't directly
>    fill in a buffer of entries, but calls an arbitrary routine that
>    can either put the entries into a buffer or do some other type of
>    processing on them.  This is a powerful concept, but can be easily
>    misused.
>
>    I believe that NFSD's use of it is problematic at best.  The
>    filldir routine used by NFSD for the readdirplus NFS procedure
>    calls back into the filesystem to do a lookup and stat() on the
>    inode pointed to by the entry.  This call is painful because of
>    GFS' locking.  gfs_readdir() must call filldir with the directory
>    lock held so that it doesn't lose its place in the directory.  The
>    stat() that the filldir routine does causes the inode's lock to be
>    acquired.  Because concurrent inode locks must always be acquired
>    in ascending numerical order, and the filldir routine forces an
>    ordering that might be something other than that, there is a
>    deadlock potential.
>
>    GFS detects when NFSD calls its readdir and switches to a routine
>    that avoids calling the filldir routine with the lock held.  It's
>    not as efficient, but it avoids the deadlock.  It'd be nice if
>    there were a better way to do the detection, though.  (The code
>    currently looks at the program's name.)
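>
>    For readers unfamiliar with the pattern, here is a hedged
>    userspace sketch of the iterate-with-callback idea, built on
>    readdir(3) rather than the kernel VFS; the filldir_t and
>    iterate_dir_cb names are made up for illustration:
>
>        #include <dirent.h>
>        #include <stdio.h>
>
>        /* Callback type, in the spirit of the VFS filldir callback:
>         * return nonzero to stop the iteration. */
>        typedef int (*filldir_t)(void *ctx, const char *name, long ino);
>
>        static int iterate_dir_cb(const char *path,
>                                  filldir_t cb, void *ctx)
>        {
>            DIR *d = opendir(path);
>            if (!d)
>                return -1;
>            /* The iterator holds its directory state (here, the open
>             * DIR stream) across every callback.  If the callback
>             * turns around and takes other locks, as NFSD's
>             * readdirplus filldir does by stat()ing each entry, the
>             * lock ordering is dictated by directory order, which is
>             * where the deadlock potential above comes from. */
>            struct dirent *e;
>            while ((e = readdir(d)) != NULL) {
>                if (cb(ctx, e->d_name, (long)e->d_ino))
>                    break;
>            }
>            closedir(d);
>            return 0;
>        }
>
>        /* Example callback: print each entry instead of filling a
>         * buffer, showing that the routine is arbitrary. */
>        static int print_entry(void *ctx, const char *name, long ino)
>        {
>            (void)ctx;
>            printf("%8ld  %s\n", ino, name);
>            return 0;
>        }
>
>        int main(void)
>        {
>            return iterate_dir_cb(".", print_entry, NULL);
>        }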
>
> 3) FCNTL locking
>
>    There are a huge number of issues with acquiring and failing over
>    fcntl()-style locks when there are multiple GFS heads exporting
>    NFS.  GFS pretty much ignores them right now.  A good place to
>    start would be to change NFSD so it actually passes fcntl calls
>    down into the filesystem.
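>
>    For reference, this is the kind of fcntl()-style (POSIX advisory)
>    lock in question; a minimal, self-contained example, not GFS or
>    NFSD code:
>
>        #include <fcntl.h>
>        #include <stdio.h>
>        #include <unistd.h>
>
>        int main(void)
>        {
>            int fd = open("lockfile", O_RDWR | O_CREAT, 0644);
>            if (fd < 0) {
>                perror("open");
>                return 1;
>            }
>
>            struct flock fl = {
>                .l_type   = F_WRLCK,   /* exclusive write lock */
>                .l_whence = SEEK_SET,
>                .l_start  = 0,
>                .l_len    = 0,         /* 0 = the whole file   */
>            };
>
>            /* F_SETLKW blocks until the lock is granted. */
>            if (fcntl(fd, F_SETLKW, &fl) < 0) {
>                perror("fcntl");
>                return 1;
>            }
>
>            printf("pid %d holds the whole-file write lock\n",
>                   getpid());
>            sleep(10);  /* hold it so a second run can contend */
>
>            fl.l_type = F_UNLCK;
>            fcntl(fd, F_SETLK, &fl);
>            close(fd);
>            return 0;
>        }
>
>    Run two copies against the same file from two NFS clients mounted
>    off different GFS heads: if the servers never pass the lock down
>    into the cluster filesystem, both processes can hold the
>    "exclusive" lock at once.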
>
> 4) NFSv4
>
>    NFSv4 requires all sorts of changes to GFS in order for them to
>    work together, op locks being one I can remember at the moment.  I
>    think I've repressed my memories of the others.
>
> ++++++++