[Cluster-devel] gfs2 hang in xfstests generic/361

Bob Peterson rpeterso at redhat.com
Mon Jul 26 14:49:44 UTC 2021


On 7/26/21 9:00 AM, Christoph Hellwig wrote:
> I noticed this hang while testing the iomap_iter series on gfs2,
> but it also reproduces on 5.14-rc3.  This is running locally with
> "-O -p lock_nolock":
> 
> generic/361 files ... [ 1479.222703] run fstests generic/361 at 2021-07-26 13:57:10
(snip)
> [ 1491.744459] gfs2: fsid=loop0.0: fatal: I/O error
> [ 1491.744459]   block = 17192
> [ 1491.744459]   function = gfs2_ail1_empty_one, file = fs/gfs2/log.c, line = 323
> [ 1491.747491] gfs2: fsid=loop0.0: fatal: I/O error(s)
> [ 1491.748477] gfs2: fsid=loop0.0: about to withdraw this file system
> [ 1491.752284]
> [ 1491.752587] =============================
> [ 1491.753403] [ BUG: Invalid wait context ]
> [ 1491.754122] 5.14.0-rc2+ #47 Not tainted
> [ 1491.754860] -----------------------------
> [ 1491.755563] kworker/2:1H/1975 is trying to lock:
> [ 1491.756370] ffff8881048d0888 (&wq->mutex){+.+.}-{3:3}, at: flush_workqueue+0xc9/0x5f0
> [ 1491.757736] other info that might help us debug this:
> [ 1491.758622] context-{4:4}
> [ 1491.759087] 4 locks held by kworker/2:1H/1975:
> [ 1491.759863]  #0: ffff888101717b38 ((wq_completion)glock_workqueue){+.+.}-{0:0}, at: p0
> [ 1491.761623]  #1: ffffc900040dfe78 ((work_completion)(&(&gl->gl_work)->work)){+.+.}-{00
> [ 1491.763528]  #2: ffff88811ce6b000 (&sdp->sd_log_flush_lock){++++}-{3:3}, at: gfs2_log0
> [ 1491.765284]  #3: ffff88811ce6ae28 (&sdp->sd_log_lock){+.+.}-{2:2}, at: gfs2_flush_rev0
> [ 1491.767064] stack backtrace:
> [ 1491.767629] CPU: 2 PID: 1975 Comm: kworker/2:1H Not tainted 5.14.0-rc2+ #47
> [ 1491.769000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/04
> [ 1491.770641] Workqueue: glock_workqueue glock_work_func
> [ 1491.771635] Call Trace:
> [ 1491.772101]  dump_stack_lvl+0x45/0x59
> [ 1491.772777]  __lock_acquire.cold+0x2a2/0x2be
> [ 1491.773529]  ? save_trace+0x3e/0x380
> [ 1491.774160]  lock_acquire+0xc9/0x2f0
> [ 1491.774815]  ? flush_workqueue+0xc9/0x5f0
> [ 1491.775521]  __mutex_lock+0x75/0x870
> [ 1491.776151]  ? flush_workqueue+0xc9/0x5f0
> [ 1491.776856]  ? flush_workqueue+0xc9/0x5f0
> [ 1491.777560]  ? lock_release+0x13c/0x2e0
> [ 1491.778234]  flush_workqueue+0xc9/0x5f0
> [ 1491.779012]  gfs2_make_fs_ro+0x2b/0x2b0
> [ 1491.779687]  gfs2_withdraw.cold+0x16f/0x4bd
> [ 1491.780424]  ? gfs2_freeze_lock+0x24/0x60
> [ 1491.781129]  gfs2_ail1_empty+0x305/0x310
> [ 1491.781821]  gfs2_flush_revokes+0x29/0x40
> [ 1491.782526]  revoke_lo_before_commit+0x12/0x1c0
> [ 1491.783324]  gfs2_log_flush+0x337/0xb00
> [ 1491.784001]  inode_go_sync+0x8e/0x200
> [ 1491.784663]  do_xmote+0xd2/0x380
> [ 1491.785268]  glock_work_func+0x57/0x130
> [ 1491.785944]  process_one_work+0x237/0x560

Hi Christoph,

Thanks. I've run generic/361 many times on many recent branches and I've
never seen this before. For example, this is from last Friday:

generic/361 8s ...  13s

Still, I can see what's going on, and it's not a recent problem. It's
basically a problem with our withdraw sequence from February 2020
(patch 601ef0d52e96). I'll try to fix it as soon as I get a chance.
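
For reference, here's a stripped-down sketch of the pattern lockdep is
flagging above as an invalid wait context. This is not gfs2 code;
demo_lock and demo_wq are placeholders standing in for sd_log_lock
(held since gfs2_flush_revokes, lock #3 in the report) and the
workqueue that gfs2_make_fs_ro() ends up flushing after the I/O error
in gfs2_ail1_empty() escalates into a withdraw:

#include <linux/spinlock.h>
#include <linux/workqueue.h>

static DEFINE_SPINLOCK(demo_lock);        /* stand-in for sdp->sd_log_lock */
static struct workqueue_struct *demo_wq;  /* placeholder workqueue pointer */

static void demo_invalid_wait_context(void)
{
	spin_lock(&demo_lock);     /* spinlock held: non-sleeping {2:2} context */
	flush_workqueue(demo_wq);  /* acquires wq->mutex, a sleeping lock {3:3} */
	spin_unlock(&demo_lock);
}

That matches the backtrace: the sd_log_lock spinlock is still held when
flush_workqueue() in gfs2_make_fs_ro() tries to take wq->mutex.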

Regards,

Bob Peterson
