[Cluster-devel] gfs2 hang in xfstests generic/361
Bob Peterson
rpeterso at redhat.com
Mon Jul 26 14:49:44 UTC 2021
On 7/26/21 9:00 AM, Christoph Hellwig wrote:
> I noticed this hang while testing the iomap_iter series on gfs2,
> but it also reproduces on 5.14-rc3. This is running locally with
> "-O -p lock_nolock":
>
> generic/361 files ... [ 1479.222703] run fstests generic/361 at 2021-07-26 13:57:10
(snip)
> [ 1491.744459] gfs2: fsid=loop0.0: fatal: I/O error
> [ 1491.744459] block = 17192
> [ 1491.744459] function = gfs2_ail1_empty_one, file = fs/gfs2/log.c, line = 323
> [ 1491.747491] gfs2: fsid=loop0.0: fatal: I/O error(s)
> [ 1491.748477] gfs2: fsid=loop0.0: about to withdraw this file system
> [ 1491.752284]
> [ 1491.752587] =============================
> [ 1491.753403] [ BUG: Invalid wait context ]
> [ 1491.754122] 5.14.0-rc2+ #47 Not tainted
> [ 1491.754860] -----------------------------
> [ 1491.755563] kworker/2:1H/1975 is trying to lock:
> [ 1491.756370] ffff8881048d0888 (&wq->mutex){+.+.}-{3:3}, at: flush_workqueue+0xc9/0x5f0
> [ 1491.757736] other info that might help us debug this:
> [ 1491.758622] context-{4:4}
> [ 1491.759087] 4 locks held by kworker/2:1H/1975:
> [ 1491.759863] #0: ffff888101717b38 ((wq_completion)glock_workqueue){+.+.}-{0:0}, at: p0
> #1: ffffc900040dfe78 ((work_completion)(&(&gl->gl_work)->work)){+.+.}-{00
> [ 1491.763528] #2: ffff88811ce6b000 (&sdp->sd_log_flush_lock){++++}-{3:3}, at: gfs2_log0
> [ 1491.765284] #3: ffff88811ce6ae28 (&sdp->sd_log_lock){+.+.}-{2:2}, at: gfs2_flush_rev0
> [ 1491.767064] stack backtrace:
> [ 1491.767629] CPU: 2 PID: 1975 Comm: kworker/2:1H Not tainted 5.14.0-rc2+ #47
> [ 1491.769000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/04
> [ 1491.770641] Workqueue: glock_workqueue glock_work_func
> [ 1491.771635] Call Trace:
> [ 1491.772101] dump_stack_lvl+0x45/0x59
> [ 1491.772777] __lock_acquire.cold+0x2a2/0x2be
> [ 1491.773529] ? save_trace+0x3e/0x380
> [ 1491.774160] lock_acquire+0xc9/0x2f0
> [ 1491.774815] ? flush_workqueue+0xc9/0x5f0
> [ 1491.775521] __mutex_lock+0x75/0x870
> [ 1491.776151] ? flush_workqueue+0xc9/0x5f0
> [ 1491.776856] ? flush_workqueue+0xc9/0x5f0
> [ 1491.777560] ? lock_release+0x13c/0x2e0
> [ 1491.778234] flush_workqueue+0xc9/0x5f0
> [ 1491.779012] gfs2_make_fs_ro+0x2b/0x2b0
> [ 1491.779687] gfs2_withdraw.cold+0x16f/0x4bd
> [ 1491.780424] ? gfs2_freeze_lock+0x24/0x60
> [ 1491.781129] gfs2_ail1_empty+0x305/0x310
> [ 1491.781821] gfs2_flush_revokes+0x29/0x40
> [ 1491.782526] revoke_lo_before_commit+0x12/0x1c0
> [ 1491.783324] gfs2_log_flush+0x337/0xb00
> [ 1491.784001] inode_go_sync+0x8e/0x200
> [ 1491.784663] do_xmote+0xd2/0x380
> [ 1491.785268] glock_work_func+0x57/0x130
> [ 1491.785944] process_one_work+0x237/0x560
Hi Christoph,
Thanks. I've run generic/361 many times on many recent branches and I've
never seen this before. For example, this is from last Friday:
generic/361 8s ... 13s
Still, I can see what's going on, and it's not a recent problem. It's
basically a problem with our withdraw sequence from February 2020
(patch 601ef0d52e96). I'll try to fix it as soon as I get a chance.
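In other words, reading the call trace bottom-up, the sequence is roughly
this (a sketch of the locking pattern lockdep flags, using names from the
trace above, not verbatim gfs2 code):

    /*
     * sd_log_lock is a spinlock ({2:2} in lockdep terms), so no
     * sleeping is allowed while it is held.
     */
    spin_lock(&sdp->sd_log_lock);      /* taken in gfs2_flush_revokes */
    gfs2_ail1_empty()                  /* hits the I/O error */
      -> gfs2_withdraw()
         -> gfs2_make_fs_ro()
            -> flush_workqueue()       /* acquires wq->mutex ({3:3}),
                                          which may sleep: invalid
                                          wait context */

So withdrawing from inside the AIL scan means calling a sleeping function
(flush_workqueue) while the log spinlock is still held, which is exactly
what the "Invalid wait context" splat is complaining about.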
Regards,
Bob Peterson