NFS is going crazy, and taking me with it

Nigel Wade nmw at ion.le.ac.uk
Fri Jun 9 08:20:42 UTC 2006


Chris St. Pierre wrote:
> I have a RHEL 4 NFS server that shares out three volumes, all
> read-only.  One goes to another Linux box, and the other two go to a
> Solaris 9 machine.  One of the volumes mounted on the Solaris box is
> having bewildering problems.
> 
> Every night, two processes run on the server that cause these
> problems.  The first is:
> 
> /sbin/quotacheck -fguma
> 
> The second is AIDE, a Tripwire replacement.  When either of these
> processes runs, semi-random files semi-disappear from the client.  The
> files are always in the same directories, but different ones disappear
> on different days.  The symptoms are always the same: running 'ls'
> will show the files, but running 'ls -lAF' (or anything that requires
> running stat() on them) fails with "File not found."  Opening them
> also fails.  To solve this problem, I have to touch the file *on the
> client*; of course, it gives an error that it can't create the file in
> question, but after that, everything works.
> 
> The only common thread I can think of between quotacheck and AIDE is
> that both stat a very large number of files on the server.  That said,
> AIDE is not configured to check any of the volumes that are shared via
> NFS.  I also wrote a quick Perl script to recurse into a directory and
> stat all the files in it, but that doesn't break the NFS shares,
> either.
> 
> I initially thought the problems were related to the firewall on my
> server, so I turned it off.  (There is no firewall on the client.)
> Based on suggestions from a fellow S.A., I tried adding actimeo=0 and
> forcedirectio to the mount options on the client, but that didn't
> solve anything.  My users are getting very antsy, to say the least.
> Does anyone have any ideas?  (Aside from cosmic rays, I mean.)  Here's
> my /etc/exports on 'huxley', the server:
> 
> /webdirs/univ job.nebrwesleyan.edu(all_squash,anonuid=1080,anongid=1080,ro)
> /webdirs/students students.nebrwesleyan.edu(all_squash,anonuid=1080,anongid=1080,ro)
> /webdirs/faculty job.nebrwesleyan.edu(all_squash,anonuid=1080,anongid=1080,ro)
> 
> And on 'job', the client, the corresponding lines from /etc/vfstab:
> 
> huxley:/webdirs/univ    -       /www_misc       nfs     -       yes soft,bg,actimeo=0,forcedirectio
> huxley:/webdirs/faculty -       /web/people     nfs     -       yes soft,bg
> 
> It bears repeating that only one of the volumes (/webdirs/univ,
> mounted on /www_misc) is having problems; the other volume shared
> between the two servers is just fine.  Other NFS mounts on the client
> and shares from the server are similarly fine.  In fact, most of the
> NFS share in question is fine -- it's just two directories that
> consistently lose files whenever quotacheck or AIDE is run.
> 
> Any ideas?  I'm up against a brick wall on this one.  Thanks!
> 
> Chris St. Pierre
> Unix Systems Administrator
> Nebraska Wesleyan University
> 

This is just wild speculation on my part...

Could it be that the job you are running puts such a heavy load on the
server that NFS requests from the client time out, and that the client then
caches the failure, producing the "File not found" errors? I notice you
have actimeo=0 on that mount. Could this be the culprit? Does it mean cache
forever, or never cache? The man page isn't forthcoming on that.
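
To narrow down which files are affected after each nightly run, something
like the following could be run on the client. It is only a rough sketch in
Python, not anything from the original setup: the function name
find_unstatable is made up for illustration, and it assumes the symptom is
exactly as described, i.e. the directory listing still returns the name but
stat() on it fails with "file not found".

```python
import os


def find_unstatable(root):
    """Return paths under root that are listed in their directory
    but cannot be stat()ed (the 'ls works, ls -l fails' symptom)."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists each directory; every name it returns was
        # visible to readdir, so a failing lstat() is the mismatch
        # we are looking for.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)
            except FileNotFoundError:
                missing.append(path)
    return missing
```

Calling find_unstatable("/www_misc") before and after quotacheck and AIDE
run, and comparing the two lists, might show whether the failures line up
with the load on the server.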

-- 
Nigel Wade, System Administrator, Space Plasma Physics Group,
             University of Leicester, Leicester, LE1 7RH, UK
E-mail :    nmw at ion.le.ac.uk
Phone :     +44 (0)116 2523548, Fax : +44 (0)116 2523555



