[Linux-cluster] GFS2 + NFS crash BUG: Unable to handle kernel NULL pointer dereference

Mon Jul 11 08:30:11 UTC 2011

On 08/07/11 22:09, J. Bruce Fields wrote:

> With default mount options, the linux NFS client (like most NFS clients)
> assumes that a file has at most one writer at a time.  (Applications that
> need to do write-sharing over NFS need to use file locking.)

The problem is that file locking on V3 isn't passed back down to the 
filesystem - hence the issues with NFS vs Samba (or local disk 
access(*)) on the same server.

(*) Local disk access includes anything running on other nodes in a 
GFS/GFS2 environment. This precludes exporting the same GFS(2) 
filesystem on multiple cluster nodes.

> The NFS protocol supports higher granularity timestamps.  The limitation
> is the exported filesystem.  If you're using something other than
> ext2/3, you're probably getting higher granularity.

GFS/GFS2 in this case...

>> can (and has)
>> result in writes made by non-nfs processes to cause NFS clients which have
>> that file opened read/write to see "stale filehandle" errors due to the
>> inode having changed when they weren't expecting it.
>
> Changing file data or attributes won't result in stale filehandle
> errors.  (Bug reports welcome if you've seen otherwise.)

I'll have to try and repeat the issue, but it's a race condition with a 
narrow window at the best of times.

> Stale
> filehandle errors should only happen when a client attempts to use a
> file which no longer exists on the server.  (E.g. if another client
> deletes a file while your client has it open.)

It's possible this has happened. I have no idea what user batch scripts 
are trying to do on the compute nodes, but in the case that was brought 
to my attention the file was edited on one node while another had it open.

>  (This can also happen if
> you rename a file across directories on a filesystem exported with the
> subtree_check option.  The subtree_check option is deprecated, for that
> reason.)

All our FSes are exported no_subtree_check and at the root of the FS.

>> We (should) all know NFS was a kludge. What's surprising is how much
>> kludge still remains in the current v2/3 code (which is surprisingly opaque
>> and incredibly crufty, much of it dates from the early 1990s or earlier)
>
> Details welcome.

The fact that exportfs isn't safe to run in parallel (leading to race 
conditions), for starters. We had to insert flock statements around every 
call to it in /usr/share/cluster/nfsclient.sh in order to get reliable 
service startups.
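
As a rough sketch of the kind of wrapper this amounts to (not the actual 
change made to nfsclient.sh; the lock file path and the function name 
run_locked are made up for illustration), each exportfs call can be routed 
through flock(1) from util-linux so that only one runs at a time:

```shell
#!/bin/bash
# Sketch only: serialise commands such as exportfs with flock(1).
# LOCKFILE and run_locked are hypothetical names, not part of nfsclient.sh.

LOCKFILE="${LOCKFILE:-/tmp/exportfs-sketch.lock}"

run_locked() {
    # Take an exclusive lock on $LOCKFILE (waiting up to 30 seconds),
    # run the given command while holding it, and return the command's
    # exit status. flock itself returns non-zero if the wait times out.
    flock -x -w 30 "$LOCKFILE" "$@"
}

# Example call (needs root; the client and path are placeholders):
# run_locked exportfs -o rw,no_subtree_check client.example.com:/mnt/gfs2
```

Concurrent callers queue on the lock file rather than racing each other. 
The 30 second timeout is an arbitrary choice for the sketch.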

There are a number of RH Bugzilla tickets revolving around NFS behaviour 
which would be worth looking at.

>> As I said earlier, V4 is supposed to play a lot nicer
>
> V4 has a number of improvements, but what I've described above applies
> across versions (modulo some technical details about timestamps vs.
> change attributes).

Thanks for the input.

NFS has been a major pain point in our organisation for years. If you 
have ideas for doing things better then I'm very interested.

Alan