[dm-devel] [PATCH] Fix over-zealous flush_disk when changing device size.

Jeff Moyer jmoyer at redhat.com
Thu Mar 17 17:33:38 UTC 2011


NeilBrown <neilb at suse.de> writes:

> On Wed, 16 Mar 2011 16:30:22 -0400 Jeff Moyer <jmoyer at redhat.com> wrote:
>
>> NeilBrown <neilb at suse.de> writes:
>> 
>> >> Synchronous notification of errors.  If we don't try to write everything
>> >> back immediately after the size change, we don't see dirty pages in
>> >> zapped regions until the writeout/page cache management takes it into
>> >> its head to try to clean the pages.
>> >> 
>> >
>> > So if you just want synchronous errors, I think you want:
>> >     fsync_bdev()
>> >
>> > which calls sync_filesystem() if it can find a filesystem, else
>> > sync_blockdev();  (sync_filesystem itself calls sync_blockdev too).
>> 
>> ... which deadlocks md.  ;-)  writeback_inodes_sb_nr is waiting for the
>> flusher thread to write back the dirty data.  The flusher thread is
>> stuck in md_write_start, here:
>> 
>>         wait_event(mddev->sb_wait,
>>                    !test_bit(MD_CHANGE_PENDING, &mddev->flags));
>> 
>> This is after reverting your change, and replacing the flush_disk call
>> in check_disk_size_change with a call to fsync_bdev.  I'm not familiar
>> enough with md to really suggest a way forward.  Neil?
>
> That would be quite easy to avoid.
> Just call
>    md_write_start()
> before revalidate_disk, and
>    md_write_end()
> afterwards.
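
To be sure we are talking about the same thing, here is roughly how I
read that suggestion (just a sketch -- I am glossing over the exact call
site and over where the struct bio argument to md_write_start() would
come from):

        /* mark the array write-active before the size change ... */
        md_write_start(mddev, bio);
        /* ... run the revalidate (check_disk_size_change() and, with the
         * flush_disk -> fsync_bdev swap mentioned above, the writeback) ... */
        revalidate_disk(mddev->gendisk);
        /* ... then drop the write-active reference again */
        md_write_end(mddev);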

That does not avoid the problem, at least not as I've understood the
suggestion (sketched above).  You instead end up with the following hung
task warning:

INFO: task md127_raid5:2282 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md127_raid5     D ffff88011c72d0a0  5688  2282      2 0x00000080
 ffff880118997c20 0000000000000046 ffff880100000000 0000000000000246
 0000000000014d00 ffff88011c72cb10 ffff88011c72d0a0 ffff880118997fd8
 ffff88011c72d0a8 0000000000014d00 ffff880118996010 0000000000014d00
Call Trace:
 [<ffffffff8138bbbd>] md_write_start+0xad/0x1d0
 [<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0311558>] raid5_finish_reshape+0x98/0x1e0 [raid456]
 [<ffffffff8138a933>] reap_sync_thread+0x63/0x130
 [<ffffffff8138c8b6>] md_check_recovery+0x1f6/0x6f0
 [<ffffffffa03150ab>] raid5d+0x3b/0x610 [raid456]
 [<ffffffff810804c9>] ? prepare_to_wait+0x59/0x90
 [<ffffffff81387ee9>] md_thread+0x119/0x150
 [<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81387dd0>] ? md_thread+0x0/0x150
 [<ffffffff8107fb56>] kthread+0x96/0xa0
 [<ffffffff8100cc04>] kernel_thread_helper+0x4/0x10
 [<ffffffff8107fac0>] ? kthread+0x0/0xa0
 [<ffffffff8100cc00>] ? kernel_thread_helper+0x0/0x10

I'll leave this to you to work out when you have time.
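
For reference, the fsync_bdev() you pointed at earlier does roughly the
following (paraphrasing from memory, so worth double-checking against
the tree):

        int fsync_bdev(struct block_device *bdev)
        {
                struct super_block *sb = get_super(bdev);

                if (sb) {
                        /* a filesystem is mounted on the bdev: sync it
                         * (this syncs the block device as well) */
                        int res = sync_filesystem(sb);
                        drop_super(sb);
                        return res;
                }
                /* no filesystem found, just flush the block device */
                return sync_blockdev(bdev);
        }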

Cheers,
Jeff



