[Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path

Bob Peterson rpeterso at redhat.com
Wed Mar 23 16:13:35 UTC 2016


Hi Ben,

----- Original Message -----
> On Mon, Mar 21, 2016 at 02:12:43PM -0400, Bob Peterson wrote:
> > Hi Ben,
> > 
> > ----- Original Message -----
> > > On Fri, Dec 18, 2015 at 01:02:30PM -0600, Bob Peterson wrote:
> > > > This patch changes the inode reclaiming code, delete_work_func, to use
> > > > the normal gfs2_inode_lookup, but with a special parameter, for_del,
> > > > which causes additional checks to be made.
> > > 
> > > I get what you're doing here, and I can't see anything really wrong with
> > > it, but I haven't gone through all the delete code as carefully as I
> > > could.  I do have two questions.
> > > 
> > > 1. I don't see any reason why, even if it's just for paranoia's sake, we
> > > don't avoid dereferencing the gfs2_inode in verify_unlinked before
> > > locking and checking it. It seems like a simple fix that avoids any
> > > chance that there is a problem due to not holding the directory glock
> > > when we're getting the inode.
> > 
> > I'm not sure I understand the question. Today, function verify_unlinked
> > dereferences the gfs2_inode, ip, but so does its only caller,
> > gfs2_inode_lookup. We already know ip is valid at that point, so more
> > references shouldn't matter. Unless you mean the inode's glock?
> 
> So before, delete_work_func called gfs2_lookup_by_inum, which locked the
> glock without dereferencing the gfs2_inode pointer, and then called
> gfs2_inode_lookup. Other code paths call gfs2_inode_lookup without the
> glock on the inode itself locked, but with the containing directory's
> inode glock held. But now, we are dereferencing the gfs2_inode pointer
> without either holding its glock or the glock of the directory that the
> file is in.

Yes, and the old lock order was the majority of the problem.
It seems to me that a lot of GFS2 code touches the inode without holding
the inode's glock. For example, take function gfs2_create_inode().
It locks the parent directory's glock first, but then it does:

   inode = gfs2_dir_search
      gfs2_inode_lookup
         iget_locked

So it's referencing the inode (of which ip is a part) without holding its glock.
If the inode isn't found by gfs2_dir_search, it continues on to do:

   inode = new_inode(sdp->sd_vfs);
      inode = new_inode_pseudo(sb);
         struct inode *inode = alloc_inode(sb);
         if (inode) {
            spin_lock(&inode->i_lock);
   ...
   error = gfs2_glock_get(sdp, ip->i_no_addr, &gfs2_inode_glops, CREATE, &ip->i_gl);
   ...
   error = gfs2_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, GL_SKIP, ghs + 1);

Which means it also accesses the inode before acquiring the inode's glock.
So AFAICT, the correct order is: grab a reference to the inode first, then its glock.

But gfs2_lookup_by_inum was doing this in the opposite order: it acquired
the inode's glock (looked up by block number) and only then grabbed the inode itself.
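
For reference, the old path looked roughly like this (a simplified sketch
from memory, not the exact code):

   /* gfs2_lookup_by_inum, simplified: glock first, inode second */
   error = gfs2_glock_nq_num(sdp, no_addr, &gfs2_inode_glops,
                             LM_ST_EXCLUSIVE, GL_SKIP, &i_gh);
   if (error)
           return ERR_PTR(error);
   error = gfs2_check_blk_type(sdp, no_addr, blktype);
   if (!error)
           inode = gfs2_inode_lookup(sb, DT_UNKNOWN, no_addr, 0);
   gfs2_glock_dq_uninit(&i_gh);

So the glock is acquired before we even have a struct inode to pin, which is
the reverse of the order everything else uses.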

> Assuming that everything has to hold the inode lock first now, once

Now, as it has always been.

> we've locked that, and locked the glock and verified the file in
> verify_inode, it should be safe to unlock the glock in verify_inode, and
> continue to dereference the gfs2_inode until we drop the inode lock.
> 
> >  
> > > 2. gfs2_inode_lookup calls iget_locked, which creates an inode if one
> > > doesn't already exist.  If there's any chance of racing and calling
> > > delete_work_func for an inode that has already been removed, won't we
> > > just end up creating a new one here?
> > > 
> > > -Ben
> > 
> > Yes, and that's exactly what we want. I found cases in which the
> > delete_work_func races with the evict() code. In those cases, we need
> > the inode to be recreated in order to go through another evict() cycle
> > which will end up deleting it, whereas the first evict didn't.
> > For example, the shrinker can arbitrarily call evict() and get rid of
> > the inode while its nlink is still positive, IOW, it's not deleted,
> > but after that point in time, the system gets a callback from another
> > node instructing it to delete the inode because it was unlinked
> > remotely from that other node. In that case, we still need to
> > transition it from unlinked to free, but we need a valid inode
> > in order to do the proper delete. So we recreate it and then it gets
> > evicted again, in the name of deleting it.
> 
> O.k. clearly I didn't read through all the delete code as carefully as I
> could have.

Well, the delete code is a maze, to be sure, but the proof is in the
pudding. Instrumentation proved to me that this was happening.

One of the many pitfalls I fell into with the delete code was that not only
did I need to recreate the inode after it was freed from memory in these
circumstances, but I also needed to do an inode refresh, which seemed
like a real waste of time, given that the inode was being deleted.
Ideally I'd like to eliminate that need and thereby make it more efficient.
These were both handled by the standard inode lookup path, so that's
why I ditched my original attempt to create a delete-specific function to
do the inode lookup, in favor of using the normal inode lookup path.
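
With the patch, delete_work_func ends up doing something along these lines
(a rough sketch; the for_del parameter and its exact position in the
gfs2_inode_lookup signature come from the patch description, so take the
details loosely):

   static void delete_work_func(struct work_struct *work)
   {
           struct gfs2_glock *gl = container_of(work, struct gfs2_glock, gl_delete);
           struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
           struct inode *inode;

           /* The normal lookup path (iget_locked) recreates and refreshes
            * the in-core inode if it was already evicted, so the delete can
            * go through a full evict() cycle. for_del tells the lookup to
            * do the extra unlinked/delete checks. */
           inode = gfs2_inode_lookup(sdp->sd_vfs, DT_UNKNOWN,
                                     gl->gl_name.ln_number, 0, 1 /* for_del */);
           if (!IS_ERR_OR_NULL(inode)) {
                   d_prune_aliases(inode);
                   iput(inode);
           }
           gfs2_glock_put(gl);
   }

That way the reference to the inode is taken first, and its glock is only
acquired later inside the lookup/evict machinery, in the same order as
everything else.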

And even now, I'm not sure what to do about the NFS lookup path, because
that lookup_by_inum still has the wrong lock order AFAICT.

Regards,

Bob Peterson
Red Hat File Systems



