NFS is going crazy, and taking me with it
Nigel Wade
nmw at ion.le.ac.uk
Fri Jun 9 08:20:42 UTC 2006
Chris St. Pierre wrote:
> I have a RHEL 4 NFS server that shares out three volumes, all
> read-only. One goes to another Linux box, and the other two go to a
> Solaris 9 machine. One of the volumes mounted on the Solaris box is
> having bewildering problems.
>
> Every night, two processes run on the server that cause these
> problems. The first is:
>
> /sbin/quotacheck -fguma
>
> The second is AIDE, a Tripwire replacement. When either of these
> processes runs, semi-random files semi-disappear from the client. The
> files are always in the same directories, but different ones disappear
> on different days. The symptoms are always the same: running 'ls'
> will show the files, but running 'ls -lAF' (or anything that requires
> running stat() on them) fails with "File not found." Opening them
> also fails. To solve this problem, I have to touch the file *on the
> client*; of course, it gives an error that it can't create the file in
> question, but after that, everything works.
>
> The only common thread I can think of between quotacheck and AIDE is
> that both stat a very large number of files on the server. That said,
> AIDE is not configured to check any of the volumes that are shared via
> NFS. I also wrote a quick Perl script to recurse into a directory and
> stat all the files in it, but that doesn't break the NFS shares,
> either.
>
> I initially thought the problems were related to the firewall on my
> server, so I turned it off. (There is no firewall on the client.)
> Based on suggestions from fellow S.A.s, I tried adding actimeo=0 and
> forcedirectio to the mount options on the client, but that didn't
> solve anything. My users are getting very antsy, to say the least.
> Does anyone have any ideas? (Aside from cosmic rays, I mean.) Here's
> my /etc/exports on 'huxley', the server:
>
> /webdirs/univ job.nebrwesleyan.edu(all_squash,anonuid=1080,anongid=1080,ro)
> /webdirs/students students.nebrwesleyan.edu(all_squash,anonuid=1080,anongid=1080,ro)
> /webdirs/faculty job.nebrwesleyan.edu(all_squash,anonuid=1080,anongid=1080,ro)
>
> And on 'job', the client, the corresponding lines from /etc/vfstab:
>
> huxley:/webdirs/univ - /www_misc nfs - yes soft,bg,actimeo=0,forcedirectio
> huxley:/webdirs/faculty - /web/people nfs - yes soft,bg
>
> It bears repeating that only one of the volumes (/webdirs/univ,
> mounted on /www_misc) is having problems; the other volume shared
> between the two servers is just fine. Other NFS mounts on the client
> and shares from the server are similarly fine. In fact, most of the
> NFS share in question is fine -- it's just two directories that
> consistently lose files whenever quotacheck or AIDE is run.
>
> Any ideas? I'm up against a brick wall on this one. Thanks!
>
> Chris St. Pierre
> Unix Systems Administrator
> Nebraska Wesleyan University
>
This is just wild speculation on my part...
Could it be that the job you are running is placing such a heavy load on the
server that NFS requests from the client are timing out, and that the failure is
then being cached on the client, causing the "File not found" errors? I notice
you have actimeo=0; could this be the culprit? Does that mean cache forever, or
never cache? The man page isn't forthcoming on that.
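
One way to check the symptom from the client side, rather than waiting for
users to report it, might be a small script that walks the mount and lists
every entry that shows up in a directory listing but cannot be stat'ed. This
is only a sketch: it assumes Python 3 is available on the client, the function
name find_ghost_entries is made up for illustration, and it treats only ENOENT
("File not found") as the symptom. The default path /www_misc is the mount
point from the vfstab above.

```python
import errno
import os
import sys


def find_ghost_entries(root):
    """Return paths that a directory listing reports but lstat() cannot find."""
    ghosts = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat() does not follow symlinks, so a dangling symlink
                # is not mistaken for a vanished file.
                os.lstat(path)
            except OSError as exc:
                # Only ENOENT matches the reported symptom; other errors
                # (permissions, I/O) are ignored in this sketch.
                if exc.errno == errno.ENOENT:
                    ghosts.append(path)
    return ghosts


if __name__ == "__main__":
    # Mount point to scan; defaults to the problem mount from the vfstab.
    root = sys.argv[1] if len(sys.argv) > 1 else "/www_misc"
    for path in find_ghost_entries(root):
        print(path)
```

Running it before and after the nightly quotacheck and AIDE jobs, and comparing
the two lists, should at least show which job triggers the problem and which
files are affected.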
--
Nigel Wade, System Administrator, Space Plasma Physics Group,
University of Leicester, Leicester, LE1 7RH, UK
E-mail : nmw at ion.le.ac.uk
Phone : +44 (0)116 2523548, Fax : +44 (0)116 2523555