[Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

Fri Feb 1 14:34:17 UTC 2019

Hi Ross,

----- Original Message -----
(snip)
> We haven't observed any problems that can be directly attributed to this
> without KASAN, although it is hard to tell what a stray write may do. We
> have hit sporadic asserts and filesystem corruption during testing.
> 
> When I added tracing, the time between freeing a glock and writing to it
> varied but could be up to hundreds of milliseconds so I would guess that
> this could easily happen without KASAN. It is relatively easy to
> reproduce in our test environment.
> 
> Do you have any suggestions for tracking down the root cause?

In the past, I've debugged problems with glock reference counting by
using kernel tracing and instrumentation. Unfortunately, the "glock_put"
trace point only shows you when the glock ref count goes to 0. It
doesn't show when or how the glock is first created, so you can't tell
whether a glock was created and destroyed multiple times. That's often
important to figuring these out; otherwise it's just a lot of chaos.

To get that information, I've added my own temporary kernel trace point
for when new glocks are created, and called it "glock_new." You probably
also want to modify the glock put functions, such as gfs2_glock_put and
gfs2_glock_queue_put, to call a trace point on every put, so you can see
those too, and have it save off the gl_lockref reference count in the
trace.

Then recreate the problem with the trace running. I attached a script I
often use for these purposes. The script contains several bogus trace
point references for various sets of temporary trace points I've added
and deleted over the years, like a generic "debug" trace point where I
can add generic messages of what's happening. So don't be surprised if
you get errors about trying to cat values into non-existent debugfs files.
Just ignore them. The script DOES contain a trigger for a "glock_new"
trace point for just this purpose. I can try to dig out whether I still
have that trace point (glock_new) and the generic debug trace point
lying around somewhere in my many git repositories, but it might take
longer than just writing them again from scratch. I know it pre-dates
the concept of a "queued_put" so things will need to be tweaked anyway.

The script has a bunch of declares at the top for which trace points to
monitor and collect. I modified it for glock_new and glock_put, but
you can play with it.

To run the script and collect the trace, just do this:

    ./gfs2trace.sh &
    (recreate the problem)
    rm /var/run/gfs2-tracepoints.pid

Removing that file triggers the trace script to stop tracing and save
the results to a file in /tmp/ named after the machine's name
(so we can keep them straight in clustered situations).
Then, of course, someone needs to analyze the resulting trace file and
figure out where the count is getting off. I hope this helps.
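
As a rough starting point for that analysis, here is a minimal Python
sketch that scans a saved trace and flags any put that arrives after a
glock's count has already reached 0 without an intervening glock_new.
It is not part of the attached script. The line layout it parses
("<timestamp>: <event>: ... glock <type>:<number> ... ref <count>") is an
assumption, as is the "ref" field itself, which only exists if you add
the reference count to your temporary trace points as described above.
The function name and the "ref is the count after the put" convention
are likewise hypothetical; adjust them to match what your trace points
really print.

```python
import re
import sys

# ASSUMED line layout (not the stock gfs2 trace format):
#   "<task> [cpu] <timestamp>: <event>: ... glock <type>:<number> ... ref <count>"
LINE_RE = re.compile(
    r"(?P<ts>\d+\.\d+): (?P<event>\w+): .*?glock (?P<gltype>\d+):(?P<glnum>\d+)"
    r".*?\bref (?P<ref>\d+)"
)


def find_suspect_events(lines, new_event="glock_new", put_event="glock_put"):
    """Return (timestamp, (type, number), reason) tuples for odd events.

    Assumes 'ref' on a put event is the count AFTER the put, so ref == 0
    marks the final put of that glock.
    """
    alive = {}      # glock -> True after glock_new, False after the final put
    suspects = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # unrelated trace point or header line
        glock = (int(m["gltype"]), int(m["glnum"]))
        ts, event, ref = float(m["ts"]), m["event"], int(m["ref"])
        # endswith() tolerates a "gfs2_" prefix on the event name
        if event.endswith(new_event):
            if alive.get(glock):
                suspects.append((ts, glock, "created while still alive"))
            alive[glock] = True
        elif event.endswith(put_event):
            if alive.get(glock) is False:
                suspects.append((ts, glock, "put after count reached 0"))
            elif ref == 0:
                alive[glock] = False
    return suspects


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 <this file> <saved trace file>
    with open(sys.argv[1]) as trace:
        for ts, glock, reason in find_suspect_events(trace):
            print(f"{ts:.6f} glock {glock[0]}:{glock[1]}: {reason}")
```

A glock whose creation was never traced (for example, one that existed
before tracing started) is treated as unknown rather than suspect, so
only a put that follows a traced final put is reported. Each reported
timestamp is a place to start reading the raw trace by hand.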

Regards,

Bob Peterson
Red Hat File Systems
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gfs2trace.sh
Type: application/x-shellscript
Size: 9477 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20190201/3c6a6bdf/attachment.bin>