[Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog

Wed Jul 14 08:53:26 UTC 2021

Hi,

On Tue, 2021-07-13 at 15:03 -0500, Bob Peterson wrote:
> On 7/13/21 1:41 PM, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> > > This patch adds a new watchdog whose sole purpose is to complain
> > > when
> > > gfs2_log_flush operations are taking too long.
> > > 
> > This one is a bit confusing. It says that it is to check if the log
> > flush is taking too long, but it appears to set a timeout based on
> > the
> > amount of dirty data that will be written back, so it isn't really
> > the
> > log flush, but the writeback and log flush that is being timed I
> > think?
> > 
> > It also looks like the timeout is entirely dependent upon the
> > number of
> > dirty pages too, and not on the log flush size. I wonder about the
> > performance impact of traversing the list of dirty pages too. If
> > that
> > can be avoided it should make the implementation rather more
> > efficient,
> > 
> > Steve.
> 
> Well, perhaps my patch description was misleading. The watchdog is
> meant
> to time how long function gfs2_log_flush() holds the
> sd_log_flush_lock rwsem
> in write mode. 

I think it needs looking at a bit more carefully. That lock is really
an implementation detail, and one that we expect will change in the not
too distant future as the log code improves.

As you say the description is confusing, and the log messages even more
so, since they give a page count that refers to the ordered writes and
not to the log writes at all. 

Also, we have tools already that should be able to diagnose this issue
(slow I/O) such as blktrace, although I know that is more tricky to
catch after the fact. So I think we need to look at this again to see
if there is a better solution.

> That includes writing the ordered writes as well as the
> metadata. The metadata portion is almost always outweighed 100:1 (or
> more)
> by the ordered writes. The length of time it will take to do the
> ordered 
> writes
> should be based on the number of dirty pages. I don't think running
> the
> ordered writes list will impact performance too badly, and that's one
> reason
> I chose to do it before we actually take the sd_log_flush_lock. It
> does, 
> however,
> hold the sd_ordered_lock lock during its count. Still, it's small 
> compared to
> counting the actual pages or something, and modern cpus can run lists
> very
> quickly.
> 
What limits do we have on the ordered write list length? I seem to
remember we had addressed that issue at some point in the past.
Generally though iterating over what might be quite a long list is not
a good plan from a performance perspective,

Steve.

> My initial version didn't count at all; it just used an arbitrary
> number of
> seconds any log flush _ought_ to take. However, Barry pointed out
> that older
> hardware can be slow when driven to extremes and I didn't want false
> positives.
> 
> I also thought about keeping an interactive count whenever pages are
> dirtied, or when inodes are added to the ordered writes list, but
> that 
> seemed
> like overkill. But it is a reasonable alternative.
> 
> The timeout value is also somewhat arbitrary, but I'm open to
> suggestions.
> 
> In my case, faulty hardware caused log flushes to take a very long
> time, 
> which
> caused many transactions and glocks to be blocked a long time and
> eventually
> hit the 120-second kernel watchdog, which gives the impression glocks
> are
> being held a very long time (which they are) for some unknown reason.
> 
> This can manifest on many (often non-faulty) nodes, since glocks can
> be 
> tied up
> indefinitely waiting for a process who has it locked EX but now must
> wait until it can acquire the transaction lock, which is blocked on
> the 
> log flush:
> My goal was to make hardware problems (like faulty HBAs and fibre
> switches)
> NOT seem like cascading gfs2 file system problems or slowdowns.
> 
> These messages will hopefully prompt operations people to investigate
> the
> cause of the slowdown.
> 
> I tested this patch with faulty hardware, and it yielded messages
> like:
> 
> [ 2127.027527] gfs2: fsid=bobsrhel8:test.0: log flush pid 256206 took
> > 
> 20 seconds to write 98 pages.
> [ 2348.979535] gfs2: fsid=bobsrhel8:test.0: log flush pid 256681 took
> > 
> 1 seconds to write 1 pages.
> [ 3643.571505] gfs2: fsid=bobsrhel8:test.0: log flush pid 262385 took
> > 
> 4 seconds to write 16 pages.
> 
> Regards,
> 
> Bob Peterson
> 
>