[Linux-cachefs] locking/refcount problems in cachefiles.
NeilBrown
neilb at suse.de
Mon Feb 3 00:32:45 UTC 2014
On Wed, 29 Jan 2014 12:17:15 +0000 David Howells <dhowells at redhat.com> wrote:
> NeilBrown <neilb at suse.de> wrote:
>
> > Analysis of the crash dump suggests that fscache_object_destroy, and thus
> > __rb_erase_colour, is being called on an object that has already been
> > destroy and is no longer in the rb tree. The rbtree code gets upset and
> > crashes.
>
> Not unreasonably... But which rb_tree? There are two:
>
> (1) struct cachefiles_cache::active_nodes.
>
> This is governed by struct cachefiles_cache::active_lock.
>
> (2) fscache_object_list.
>
> This is governed by fscache_object_list_lock.
>
> Unless you have CONFIG_FSCACHE_OBJECT_LIST=y this isn't present and
> fscache_objlist_remove() does nothing - in which case all
> fscache_object_destroy() does is release the cookie.
Ahhh. Two rb_trees. Of course. That does invalidate my theory somewhat.
>
> Can you poke around in the registers, see if any of them point to tree (2)
> (which is a global variable).
The only rb_tree that I've looked at the the one for the object being
destroyed when the machine crashes.
I walk up to the root and that exactly matches fscache_object_list.rb_node.
It then walked down to enumerate the whole tree and the object being deleted
isn't in it. Nor is the parent of the object being deleted.
>
> > Thus you can get a race
> > ...
> > cachefiles_mark_object_active increments
> > ->usage (to 1) and drops the lock
>
> This is tree (1).
>
> > cachefiles_put_object calls
> > fscache_object_destroy which
> > unlinks from the rb tree.
>
> And this is tree (2).
>
> > cachefiles_objects live in an rbtree which does not imply a reference to
> > the object.
>
> Whilst that is true, they're not allowed to be in the rbtree unless they still
> have at least one reference outstanding.
>
> Apart from cachefiles_walk_to_object()'s "check_error" labelled part, objects
> are only rb_erase()'d in cachefiles_drop_object(). This is called from the
> fscache object state machine (fscache_drop_object) which holds a ref on the
> cachefiles object until fscache_object_work_func() releases it just prior to
> returning.
Thanks - I see that now. The code does look safe.
Which makes it all the more surprising that I have this crash-dump :-(
I also have a crash-dump of a crash in fscache_put_operation() because
FSCACHE_OP_DEAD is already set. This is in 3.0 - the code is quite different
in mainline though none of the changes look like they would fix any
whatever caused the bug.
So I have two bug reports which both seem to suggest that something was freed
twice. Yet I can't see anything that would allow that to happen.
Most frustrating.
If you have any ideas, I'd love to hear them :-)
Thanks,
NeilBrown
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 828 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cachefs/attachments/20140203/a5570595/attachment.sig>
More information about the Linux-cachefs
mailing list