dead drive behavior has changed?
Carl D. Roth
roth at ursus.net
Wed May 20 16:37:44 UTC 2009
I've noticed that my Fedora systems have recently changed the way they
deal with dead or dying disks. It used to be the case that if a disk
went off-line for any reason, the processes attached to it would die due
to I/O errors. This was unfortunate, but otherwise didn't hobble the rest
of the system.
Now what is happening is that the processes stick around, and the kernel
(I am guessing the journalling system) is stuck waiting for the disk to
return. I get kernel messages of the form:
INFO: task rdiff-backup:19311 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rdiff-backup D de403c5c 0 19311 19239
dd583da8 00200086 000000a2 de403c5c 00000008 c087c67c c087fc00 c087fc00
c087fc00 c1894010 c1894284 c13ebc00 00000000 c13ebc00 c189403c 0000141a
dd583d98 c041fbc8 00000000 c1894284 bcba483b 00200246 dd583ddc dd583da8
Call Trace:
[<c041fbc8>] ? update_curr+0x8d/0xf0
[<c043f1a0>] ? prepare_to_wait+0x4d/0x54
[<df83a900>] start_this_handle+0x2cc/0x3dd [jbd2]
[<c0422b90>] ? dequeue_task_fair+0x3d/0x42
[<c043efa2>] ? autoremove_wake_function+0x0/0x33
[<df83ab7d>] jbd2_journal_start+0x8c/0xb9 [jbd2]
[<df895c2f>] ext4_journal_start_sb+0x40/0x42 [ext4]
[<df88b7cb>] ext4_da_writepages+0x107/0x2ee [ext4]
[<c047684b>] ? pagevec_lookup_tag+0x1c/0x25
[<c04755f5>] ? write_cache_pages+0xfc/0x2ad
[<c046f813>] ? find_get_pages_tag+0x2f/0xda
[<df88b6c4>] ? ext4_da_writepages+0x0/0x2ee [ext4]
[<c04757f0>] do_writepages+0x23/0x34
[<c04ab7e5>] __writeback_single_inode+0x16c/0x2b7
[<c04a37bd>] ? generic_drop_inode+0x67/0x188
[<c04abcab>] generic_sync_sb_inodes+0x202/0x31b
[<c04abe32>] sync_inodes_sb+0x6e/0x76
[<c04abe7b>] __sync_inodes+0x41/0x88
[<c04abecf>] sync_inodes+0xd/0x1e
[<c04ae547>] do_sync+0x14/0x5a
[<c04ae59a>] sys_sync+0xd/0x13
[<c0404c8a>] syscall_call+0x7/0xb
=======================
The processes never die, they cannot be killed, and they keep adding to
the load average of the system, which effectively amounts to a denial of
service.
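
As a rough illustration of how these stuck tasks can be spotted from user
space: the "D" in the trace above is the uninterruptible-sleep state, which
Linux also reports as the state field of /proc/<pid>/stat. The Python sketch
below lists processes currently in that state. The function names are my own
and purely illustrative, and the script only observes the tasks; it does not
unblock or kill anything.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left].strip())
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/sys
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Run periodically, something like this would at least show which processes
are piling up behind the dead disk before the load average climbs.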
Is there any way to "gracefully" (I know this is a relative term) have the
system disconnect from a dead disk?
Is there a way to have the kernel kill these hung processes?
Thanks!