[Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog

Bob Peterson rpeterso at redhat.com
Tue Jul 13 20:03:02 UTC 2021


On 7/13/21 1:41 PM, Steven Whitehouse wrote:
> Hi,
>
> On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
>> This patch adds a new watchdog whose sole purpose is to complain when
>> gfs2_log_flush operations are taking too long.
>>
> This one is a bit confusing. It says that it is to check if the log
> flush is taking too long, but it appears to set a timeout based on the
> amount of dirty data that will be written back, so it isn't really the
> log flush, but the writeback and log flush that is being timed I think?
>
> It also looks like the timeout is entirely dependent upon the number of
> dirty pages too, and not on the log flush size. I wonder about the
> performance impact of traversing the list of dirty pages too. If that
> can be avoided it should make the implementation rather more efficient,
>
> Steve.

Well, perhaps my patch description was misleading. The watchdog is meant
to time how long function gfs2_log_flush() holds the sd_log_flush_lock rwsem
in write mode. That includes writing the ordered writes as well as the
metadata. The metadata portion is almost always outweighed 100:1 (or more)
by the ordered writes, so the length of time a flush will take should be
based on the number of dirty pages. I don't think walking the ordered writes
list will impact performance too badly, and that's one reason I chose to do
it before we actually take the sd_log_flush_lock. It does, however, hold the
sd_ordered_lock spinlock during its count. Still, that's cheap compared to
counting the actual pages or something similar, and modern CPUs can walk
lists very quickly.
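
Roughly, the counting I have in mind looks something like the sketch below.
This is only an illustration of the idea, not the code from the patch: it
reuses the existing sd_log_ordered / sd_ordered_lock / i_ordered fields, but
the helper name and the use of i_mapping->nrpages as the per-inode page
count are just my shorthand here:

static unsigned long gfs2_estimate_ordered_pages(struct gfs2_sbd *sdp)
{
        struct gfs2_inode *ip;
        unsigned long npages = 0;

        /* Walk the ordered-writes list (one entry per inode), summing the
         * pages currently in each inode's page cache. This happens before
         * we take the sd_log_flush_lock rwsem in write mode. */
        spin_lock(&sdp->sd_ordered_lock);
        list_for_each_entry(ip, &sdp->sd_log_ordered, i_ordered)
                npages += ip->i_inode.i_mapping->nrpages;
        spin_unlock(&sdp->sd_ordered_lock);

        return npages;
}

The point is that this is O(number of ordered inodes), not O(number of
dirty pages), so the time spent under sd_ordered_lock stays small.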

My initial version didn't count at all; it just used an arbitrary number of
seconds any log flush _ought_ to take. However, Barry pointed out that older
hardware can be slow when driven to extremes and I didn't want false
positives.

I also thought about keeping a running count, updated whenever pages are
dirtied or inodes are added to the ordered writes list, but that seemed
like overkill. Still, it is a reasonable alternative.

The timeout value is also somewhat arbitrary, but I'm open to suggestions.
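
For illustration, arming the watchdog could look something like the sketch
below. The sd_wd_* fields, the helper names, and the "five pages per second,
minimum one second" scaling are all placeholders I'm making up here to match
the sample messages further down, not necessarily what the patch does:

static void gfs2_log_flush_wd_expire(struct timer_list *t)
{
        struct gfs2_sbd *sdp = from_timer(sdp, t, sd_wd_timer);

        /* Only fires if the flush is still running when the deadline hits. */
        fs_warn(sdp, "log flush pid %d took > %lu seconds to write %lu pages.\n",
                sdp->sd_wd_pid, sdp->sd_wd_secs, sdp->sd_wd_pages);
}

static void gfs2_log_flush_wd_arm(struct gfs2_sbd *sdp, unsigned long npages)
{
        sdp->sd_wd_pages = npages;
        sdp->sd_wd_secs = max(1UL, DIV_ROUND_UP(npages, 5UL));
        sdp->sd_wd_pid = current->pid;
        mod_timer(&sdp->sd_wd_timer, jiffies + sdp->sd_wd_secs * HZ);
}

/* In gfs2_log_flush(), roughly (sd_wd_timer would be set up with
 * timer_setup() at mount time):
 *
 *      gfs2_log_flush_wd_arm(sdp, gfs2_estimate_ordered_pages(sdp));
 *      down_write(&sdp->sd_log_flush_lock);
 *      ... write back ordered data, then the journal ...
 *      up_write(&sdp->sd_log_flush_lock);
 *      del_timer_sync(&sdp->sd_wd_timer);
 */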

In my case, faulty hardware caused log flushes to take a very long time,
which caused many transactions and glocks to be blocked for a long time and
eventually trip the kernel's 120-second hung-task watchdog, which gives the
impression that glocks are being held a very long time (which they are) for
some unknown reason.

This can manifest on many (often non-faulty) nodes, since glocks can be
tied up indefinitely waiting for a process that holds the glock in EX but
now must wait until it can acquire the transaction lock, which is blocked
on the log flush. My goal was to make hardware problems (like faulty HBAs
and fibre switches) NOT seem like cascading gfs2 file system problems or
slowdowns.

These messages will hopefully prompt operations people to investigate the
cause of the slowdown.

I tested this patch with faulty hardware, and it yielded messages like:

[ 2127.027527] gfs2: fsid=bobsrhel8:test.0: log flush pid 256206 took > 20 seconds to write 98 pages.
[ 2348.979535] gfs2: fsid=bobsrhel8:test.0: log flush pid 256681 took > 1 seconds to write 1 pages.
[ 3643.571505] gfs2: fsid=bobsrhel8:test.0: log flush pid 262385 took > 4 seconds to write 16 pages.

Regards,

Bob Peterson




