[Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

Fri Dec 4 15:51:51 UTC 2015

On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> it's from the fenced process, and if so, queue the final put. That should
> mitigate the problem.

Bob, I'm perplexed by the focus on fencing; this issue is broader than
fencing as I mentioned in bz 1255872.  Over the years that I've reported
these issues, rarely if ever have they involved fencing.  Any userland
process, not just the fencing process, can allocate memory, fall into the
general shrinking path, get into gfs2 and dlm, and end up blocked for some
undefined time.  That can cause problems in any number of ways.

The specific problem you're focused on may be one of the easier ways of
demonstrating the problem -- making the original userland process one of
the cluster-related processes that gfs2/dlm depend on, combined with
recovery when those processes do an especially large amount of work that
gfs2/dlm require.  But problems could occur if any process is forced to
unwittingly do this dlm work, not just a cluster-related process, and it
would not need to involve recovery (or fencing, which is one small part of
it).

I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads
for shrinking its cache and doing the dlm work, and would not do anything
from the generic shrinking paths because of this issue.  I don't think
it's reasonable to expect random, unsuspecting processes on the system to
perform gfs2/dlm operations that are often remote, lengthy, indefinite, or
unpredictable.  I think gfs2 needs to do that kind of heavy lifting from
its own dedicated contexts, or from processes that are explicitly choosing
to use gfs2.
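
As a rough illustration of that split, here is a minimal userspace sketch
in Python.  It is not gfs2 or kernel code, and every name in it
(DeferredPutWorker, shrink_cache, slow_put) is hypothetical.  The caller
that lands in the shrinking path only queues work; a single dedicated
thread performs the potentially slow "final put".

```python
import queue
import threading


class DeferredPutWorker:
    """Hypothetical stand-in for a dedicated context that does the slow work."""

    def __init__(self):
        self._work = queue.Queue()
        self.ran_in = []  # names of the threads that actually ran the slow work
        self._thread = threading.Thread(
            target=self._run, name="dedicated-worker", daemon=True
        )
        self._thread.start()

    def _run(self):
        while True:
            item = self._work.get()
            if item is None:  # sentinel: shut the worker down
                self._work.task_done()
                break
            try:
                item()  # the slow operation runs here, in the worker thread
                self.ran_in.append(threading.current_thread().name)
            finally:
                self._work.task_done()

    def queue_put(self, slow_put):
        """Cheap for the caller: only enqueues, never runs the slow work."""
        self._work.put(slow_put)

    def drain(self):
        """Wait until everything queued so far has been processed."""
        self._work.join()

    def stop(self):
        self._work.put(None)
        self._thread.join()


def shrink_cache(worker, entries, slow_put):
    """Models the generic shrinking path entered by an arbitrary process.

    It hands each entry to the dedicated worker instead of calling
    slow_put() in the caller's own context.
    """
    for entry in entries:
        worker.queue_put(lambda e=entry: slow_put(e))
    return len(entries)
```

In this sketch the unsuspecting caller pays only the cost of an enqueue,
and the time spent in slow_put() is charged to the dedicated thread.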