[Cluster-devel] [PATCH 4/4] GFS2: Make deletes use regular inode_lookup with special code path

Bob Peterson rpeterso at redhat.com
Tue Mar 15 13:50:45 UTC 2016


----- Original Message -----
> On 24/02/16 20:42, Bob Peterson wrote:
> > ----- Original Message -----
> >> Hi,
> >>
> >> On 18/12/15 19:02, Bob Peterson wrote:
> >>> This patch changes the inode reclaiming code, delete_work_func, to use
> >>> the normal gfs2_inode_lookup, but with a special parameter, for_del,
> >>> which causes additional checks to be made.
> >> I assume that this was just intended to be a clean up rather than
> >> something that fixes a bug? At least I can't see any obvious changes to
> >> the operations, just the way in which they are carried out. I think it
> >> would be neater to turn the non_block parameter of gfs2_inode_lookup()
> >> into a flags argument, rather than adding an additional argument which
> >> is only a bool.
> >>
> >> I'm not sure that factoring out some of the checks from
> >> gfs2_lookup_by_inum() buys us that much though, since that function
> >> still has to remain in its current form for NFS anyway,
> >>
> >> Steve.
> > Hi Steve,
> >
> > Actually, no, it's not a cleanup at all. Without this patch, I can
> > recreate one of the problems I'm trying to solve here, which is:
> >
> > [nate_bob1] [fsck] Unlinked inode found at block 483992 (0x76298).
> > [nate_bob1] [fsck] Block 483993 (0x76299) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483993 (0x76299) left inconsistent
> > [nate_bob1] [fsck] Block 483994 (0x7629a) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483994 (0x7629a) left inconsistent
> > [nate_bob1] [fsck] Block 483995 (0x7629b) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483995 (0x7629b) left inconsistent
> > [nate_bob1] [fsck] Block 483996 (0x7629c) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483996 (0x7629c) left inconsistent
> > [nate_bob1] [fsck] Block 483997 (0x7629d) bitmap says 1 (Data) but FSCK saw
> > 0 (Free)
> > [nate_bob1] [fsck] Bitmap at block 483997 (0x7629d) left inconsistent
> > etc.
> >
> > Turns out if you use the regular inode lookup path, the problem goes away.
> >
> > Initially (several months ago) I had written a new inode lookup function
> > just for the transition from unlinked to free. I debugged it and got it
> > working fairly well, only to discover that my code did the exact same
> > things
> > in the exact same order as the original lookup. So I deleted my new
> > function
> > in favor of using the original, hence, this patch.
> >
> > My concern, of course, is that NFS may also be vulnerable to this problem,
> > but I didn't want to mess with the NFS path any more than I had to, so I
> > left the original inum lookup there for it to use. At some point we should
> > try to determine if NFS has an issue related to this. I suspect it does.
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
> I think we need to know the answer to that before we make this change.
> Also, what is the problem? I'm still not sure that I understand the
> actual issue here. Is it down to the ordering of operations, or
> different locking, or what exactly?
> 
> The fsck output seems to indicate that there is an unlinked inode in
> which some of the blocks are marked as free. If that occurs without
> there being an un-recovered journal then that is certainly a problem,
> but it would be with the transaction code, so I'm still not quite sure
> how this patch relates to it. It looks like it needs more investigation,
> 
> Steve.

Hi Steve (and all),

I've done a lot of analysis of this issue and run into multiple complications,
chief among them that these problems often manifest during unmount, after the
debugfs files have been destroyed. But I finally have some answers.

In almost all cases, what's happening is that delete_work_func sees
gl->gl_object and calls gfs2_ilookup, but since the vfs inode is gone by that
point, the lookup returns an error. So gfs2_lookup_by_inum never gets called,
and delete_work_func just exits without doing anything, leaving the inode in
an unlinked state.

I tried making it call gfs2_lookup_by_inum as a fallback when gfs2_ilookup
doesn't find the inode, but that causes a hang in unmount. That hang can still
happen in today's code, but it's less likely than the first scenario.

The hang is caused because gfs2_lookup_by_inum() grabs the inode glock first,
then the vfs inode lock. The normal lookup path, gfs2_inode_lookup, takes the
vfs inode lock first, then the iopen glock, before anything else. It doesn't
actually lock the inode glock; the caller usually does that after the lookup
returns.

So in the unmount hang, the delete func is holding the inode glock and waiting
for the vfs inode, which is stuck in the I_FREEING state. That may go back to
the patch I reverted which skips over I_FREEING inodes. But if that patch is in
place, the I_FREEING inode is skipped, and once again the inode is left in an
unlinked state.

The reason the regular lookup path has no problems is because it grabs the vfs inode
lock first, which waits for the I_FREEING inode before going on to lock any glocks.

What this means is that NFS very likely has a similar hang, since it also uses
lookup_by_inum. So we have to figure out what we're going to do about that.
Perhaps you could explain why we wrote lookup_by_inum to begin with?

I'm also greatly concerned because of something else I noticed while doing this
analysis: delete_work_func goes by the inode number in the glock. There's that
whole concept that we can't dereference ip because we don't have the proper
locks. But if the glock has since been reused for another mount point, which
also happens to have an inode at that location, it might end up deleting a file
from the wrong file system; delete_work_func gets its superblock pointer from
the glock. The saving grace, of course, is that the delete func verifies the
block is in an unlinked state, which means it can't go too far off the rails.
Still, it's distressing to me that we can have two delete_work_func processes,
both operating on the same file system, and even contending for the same glock
due to dlm sending callbacks. And I'm not sure what to do about it. And yes, I
have proof that this is actually happening. Check out this excerpt from the
glocks debugfs file:

G:  s:UN n:2/5a299 f:lqb t:EX d:EX/0 a:0 v:0 r:5 m:200
 H: s:EX f:W e:0 p:2750 [delete_workqueu] gfs2_glock_nq_num+0x5b/0xa0 [gfs2]
 H: s:EX f:W e:0 p:2749 [delete_workqueu] gfs2_glock_nq_num+0x5b/0xa0 [gfs2]

There are only two mount points, which means there are only two delete_work
queues, which means these two processes should not both be operating on the
same glock. But somehow they are.

Ordinarily I would hope that when the glock is queued to the delete work queue,
it would hold a reference that prevents the glock from being reclaimed and
reused until after dlm sends its callback. But somehow the reuse is happening,
and I need to understand exactly why.

Regards,

Bob Peterson
Red Hat File Systems
