[Linux-cachefs] fscache recursive hang -- similar to loopback NFS issues

Tue Jul 29 21:17:35 UTC 2014

On Tue, 29 Jul 2014 17:12:34 +0100 David Howells <dhowells at redhat.com> wrote:

> Milosz Tanski <milosz at adfin.com> wrote:
> 
> > That's the exact same fix I started testing on Saturday. I found that
> > there already is a wait_event_timeout (even without your recent changes). The
> > thing I'm not quite sure of is what timeout it should use.
> 
> That's probably something to make an external tuning knob for.
> 
> David

Ugh.  External tuning knobs should be avoided wherever possible, and should
always come with detailed instructions on how to tune them.  </rant>

In this case I think it very nearly doesn't matter *at all* what value is
used.

If you set it a bit too high, then on the very very rare occasion that it
would currently deadlock, you get a longer-than-necessary wait.  So just make
sure that it is short enough that by the time the sysadmin notices and starts
looking for the problem, it will be gone.

And if you set it a bit too low, then it will loop around to find another
page to deal with before that one is finished being written out, and so maybe
do a little bit more work than is needed (though it'll be needed eventually).
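
A toy model of that trade-off, as a rough sketch only.  This is not fscache
code: the function names and the assumption that a page write completes a
fixed io_ms after the wait starts are purely illustrative.

```c
#include <assert.h>

/*
 * Toy model, NOT fscache code.  All names are hypothetical.
 * Assumption: a page write completes io_ms after we start waiting, and
 * each wait gives up after timeout_ms so the caller can loop around and
 * look at another page.
 */

/* Timeout too low: how many times the wait expires before the write is done. */
static unsigned long early_wakeups(unsigned long io_ms, unsigned long timeout_ms)
{
    assert(timeout_ms > 0);
    if (io_ms == 0)
        return 0;
    /* Expiries happen at timeout_ms, 2*timeout_ms, ... strictly before io_ms. */
    return (io_ms - 1) / timeout_ms;
}

/* Timeout too high: how long a waiter stays blocked if the write never completes. */
static unsigned long stall_if_stuck_ms(unsigned long timeout_ms)
{
    return timeout_ms;
}
```

In this model a 20ms write with a 1000ms timeout costs no extra loops, while a
5ms timeout costs three; the stuck case simply costs one full timeout.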

So the perfect number is somewhere between the typical response time for
storage and the typical response time for the sysadmin.  Anywhere between
100ms and 10sec would do.  1 second is the geometric mean.
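
The arithmetic behind the 1 second figure, as a small sketch: the geometric
mean of 100ms and 10000ms is sqrt(100 * 10000) = 1000ms.  The helper names
below are made up for illustration, and an integer square root is used only
to keep the example free of libm.

```c
#include <assert.h>

/* Integer square root by Newton's iteration: largest r with r*r <= n. */
static unsigned long isqrt(unsigned long n)
{
    unsigned long r, next;

    if (n < 2)
        return n;
    r = n;
    next = (r + n / r) / 2;
    while (next < r) {          /* estimates decrease until they converge */
        r = next;
        next = (r + n / r) / 2;
    }
    return r;
}

/*
 * Geometric mean of two bounds in milliseconds, rounded down.
 * Hypothetical helper; assumes lo_ms * hi_ms does not overflow.
 */
static unsigned long geo_mean_ms(unsigned long lo_ms, unsigned long hi_ms)
{
    assert(lo_ms > 0 && hi_ms >= lo_ms);
    return isqrt(lo_ms * hi_ms);
}
```

Calling geo_mean_ms(100, 10000) gives 1000, i.e. one second.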

(Sorry I didn't reply earlier - I missed your email somehow.)

NeilBrown