dead drive behavior has changed?

Carl D. Roth roth at ursus.net
Wed May 20 16:37:44 UTC 2009


I've noticed that my Fedora systems have recently changed the way they 
deal with dead or dying disks.  It used to be that if a disk went 
off-line for any reason, the processes attached to it would die due to 
I/O errors.  That was unfortunate, but it didn't otherwise hobble the rest 
of the system.

Now the processes stick around, and the kernel (I am guessing the 
journalling system) is stuck waiting for the disk to return.  I get 
kernel messages of the form:

INFO: task rdiff-backup:19311 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rdiff-backup  D de403c5c     0 19311  19239
       dd583da8 00200086 000000a2 de403c5c 00000008 c087c67c c087fc00 c087fc00
       c087fc00 c1894010 c1894284 c13ebc00 00000000 c13ebc00 c189403c 0000141a
       dd583d98 c041fbc8 00000000 c1894284 bcba483b 00200246 dd583ddc dd583da8
Call Trace:
 [<c041fbc8>] ? update_curr+0x8d/0xf0
 [<c043f1a0>] ? prepare_to_wait+0x4d/0x54
 [<df83a900>] start_this_handle+0x2cc/0x3dd [jbd2]
 [<c0422b90>] ? dequeue_task_fair+0x3d/0x42
 [<c043efa2>] ? autoremove_wake_function+0x0/0x33
 [<df83ab7d>] jbd2_journal_start+0x8c/0xb9 [jbd2]
 [<df895c2f>] ext4_journal_start_sb+0x40/0x42 [ext4]
 [<df88b7cb>] ext4_da_writepages+0x107/0x2ee [ext4]
 [<c047684b>] ? pagevec_lookup_tag+0x1c/0x25
 [<c04755f5>] ? write_cache_pages+0xfc/0x2ad
 [<c046f813>] ? find_get_pages_tag+0x2f/0xda
 [<df88b6c4>] ? ext4_da_writepages+0x0/0x2ee [ext4]
 [<c04757f0>] do_writepages+0x23/0x34
 [<c04ab7e5>] __writeback_single_inode+0x16c/0x2b7
 [<c04a37bd>] ? generic_drop_inode+0x67/0x188
 [<c04abcab>] generic_sync_sb_inodes+0x202/0x31b
 [<c04abe32>] sync_inodes_sb+0x6e/0x76
 [<c04abe7b>] __sync_inodes+0x41/0x88
 [<c04abecf>] sync_inodes+0xd/0x1e
 [<c04ae547>] do_sync+0x14/0x5a
 [<c04ae59a>] sys_sync+0xd/0x13
 [<c0404c8a>] syscall_call+0x7/0xb
 =======================

The processes never die and cannot be killed, and they keep adding to the 
load average of the system, which amounts to a denial of service.
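
For anyone who wants to check for the same condition: the "D" after the 
task name in the message above is the uninterruptible-sleep state, and as 
far as I understand it, tasks in that state do not act on signals (SIGKILL 
included) until the I/O they are waiting on completes.  Below is a minimal 
sketch that lists such tasks by reading /proc/<pid>/stat.  The function 
names (parse_stat, blocked_tasks) are my own and purely illustrative; this 
only reports the stuck tasks, it does not free them.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' rather than on whitespace.
    left, _, rest = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = rest.split()[0]
    return int(pid_str), comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs available at this path
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Run as a script, it prints one "pid name" line per task in state D.  A task 
can pass through that state briefly during normal I/O, so only entries 
that persist across repeated runs point at a disk that has gone away.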

Is there any way to "gracefully" (I know this is a relative term) have the 
system disconnect from a dead disk?

Is there a way to have the kernel kill these hung processes?

Thanks!



