[rhelv6-list] Host hung, hung_task_timeout_secs mentioned

Tom Sightler ttsig at tuxyturvy.com
Wed Jun 15 15:53:31 UTC 2011


On Wed, 2011-06-15 at 11:13 -0400, Brian Long wrote:
> I ran into a server hang last night running 2.6.32-131.2.1.el6.x86_64.
> I just installed the latest updates (RHEL 6.1) yesterday morning and I
> experienced the hang during my Amanda backups.  I found a RHEL 5 bug
> which mentions similar problems but no fix:
> https://bugzilla.redhat.com/show_bug.cgi?id=605444
> 
> I had Opsware monitoring the host and it went offline completely for
> about 1 minute.  Has anyone else experienced this?  I'm running a LSI
> 8708EM2 RAID controller with battery-backed cache.
> 
> Jun 15 02:00:01 delenn xinetd[2082]: START: amanda pid=19385 from=x.x.x.x
> Jun 15 02:00:31 delenn xinetd[2082]: EXIT: amanda status=0 pid=19385
> duration=30(sec)
> Jun 15 02:09:45 delenn kernel: INFO: task jbd2/dm-1-8:609 blocked for
> more than 120 seconds.
> Jun 15 02:09:45 delenn kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 15 02:09:45 delenn kernel: jbd2/dm-1-8   D 0000000000000000     0
> 609      2 0x00000000
> Jun 15 02:09:45 delenn kernel: ffff8802629e1c10 0000000000000046
> ffff8802629e1bd8 ffff8802629e1bd4
> Jun 15 02:09:45 delenn kernel: ffff880263b1d340 ffff88026fc24300
> ffff8800282f5f80 0000000103e3d1cb
> Jun 15 02:09:45 delenn kernel: ffff880263b1d0b8 ffff8802629e1fd8
> 000000000000f598 ffff880263b1d0b8
> Jun 15 02:09:45 delenn kernel: Call Trace:
> Jun 15 02:09:45 delenn kernel: [<ffffffff811a3a90>] ? sync_buffer+0x0/0x50
> Jun 15 02:09:45 delenn kernel: [<ffffffff814db013>] io_schedule+0x73/0xc0
> Jun 15 02:09:45 delenn kernel: [<ffffffff811a3ad0>] sync_buffer+0x40/0x50
> Jun 15 02:09:45 delenn kernel: [<ffffffff814db87f>] __wait_on_bit+0x5f/0x90
> Jun 15 02:09:45 delenn kernel: [<ffffffff811a3a90>] ? sync_buffer+0x0/0x50
> Jun 15 02:09:45 delenn kernel: [<ffffffff814db928>]
> out_of_line_wait_on_bit+0x78/0x90
> Jun 15 02:09:45 delenn kernel: [<ffffffff8108e140>] ?
> wake_bit_function+0x0/0x50
> Jun 15 02:09:45 delenn kernel: [<ffffffff811a3a86>]
> __wait_on_buffer+0x26/0x30
> Jun 15 02:09:45 delenn kernel: [<ffffffffa00847d1>]
> jbd2_journal_commit_transaction+0x1121/0x1490 [jbd2]
> Jun 15 02:09:45 delenn kernel: [<ffffffff810096d0>] ? __switch_to+0xd0/0x320
> Jun 15 02:09:45 delenn kernel: [<ffffffff8107a11b>] ?
> try_to_del_timer_sync+0x7b/0xe0
> Jun 15 02:09:45 delenn kernel: [<ffffffffa0089948>]
> kjournald2+0xb8/0x220 [jbd2]
> Jun 15 02:09:45 delenn kernel: [<ffffffff8108e100>] ?
> autoremove_wake_function+0x0/0x40
> Jun 15 02:09:45 delenn kernel: [<ffffffffa0089890>] ?
> kjournald2+0x0/0x220 [jbd2]
> Jun 15 02:09:45 delenn kernel: [<ffffffff8108dd96>] kthread+0x96/0xa0
> Jun 15 02:09:45 delenn kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
> Jun 15 02:09:45 delenn kernel: [<ffffffff8108dd00>] ? kthread+0x0/0xa0
> Jun 15 02:09:45 delenn kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
> Jun 15 02:20:25 delenn auditd[1702]: Audit daemon rotating log files
> Jun 15 02:43:41 delenn xinetd[2082]: START: amanda pid=20084 from=x.x.x.x
> Jun 15 02:44:11 delenn xinetd[2082]: EXIT: amanda status=0 pid=20084
> duration=30(sec)

We battled a very similar issue on one of our older systems that was
recently upgraded.  Specifically, the system is a Dell 2950 that serves
as a central backup server.  This server runs NFS and Samba to receive
Oracle RMAN backups, runs BackupPC to back up a number of Linux systems,
and is a backup target for our VMware backup solution.

During heavy I/O (fairly common for a backup server) we would get
messages similar to what you're seeing.  Interestingly, we have another
older server that performs virtually the same function, but so far it
hasn't experienced this issue.
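
For what it's worth, the 120-second threshold in those "blocked for more
than 120 seconds" messages is just the hung-task watchdog, and it can be
read or adjusted through sysctl.  Changing it only affects the warning,
not whatever is actually stalling the I/O.  Something along these lines
(the 240 is only an example value):

    # current warning threshold, in seconds
    sysctl kernel.hung_task_timeout_secs

    # raise it temporarily; 0 disables the message entirely
    sysctl -w kernel.hung_task_timeout_secs=240

    # to make it persistent, add this line to /etc/sysctl.conf:
    #   kernel.hung_task_timeout_secs = 240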

In our case we could sometimes reproduce the issue by running parallel
iozone throughput benchmarks (example below).  We haven't seen the issue
in the last few weeks, but about 3 weeks ago we made changes to our
backups that cause the jobs to be spread out a little more, so the I/O
isn't stressed as much.
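
If you want to try reproducing it yourself, the kind of run we used was
roughly the following (thread count, file size, record size and paths
are only illustrative; the idea is just several parallel streams large
enough to blow past the page cache):

    # 4 parallel write/read streams, 4 GB per thread, 1 MB records
    iozone -i 0 -i 1 -t 4 -s 4g -r 1m \
        -F /data/ioz1 /data/ioz2 /data/ioz3 /data/ioz4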

We haven't had time to really dig into the root cause but I'd certainly
be interested in finding out.

One other thing we did: we had a few minor VM tweaks in sysctl.conf that
we had been using with RHEL 5 to improve performance, and we removed all
of those as well.  I have no idea if that really changed anything, but I
think probably not.  Still, I thought I'd mention it.
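
For reference, the kind of settings I mean are the usual dirty-page and
swappiness knobs; the values below are only an illustration of the sort
of thing we had carried over, not a recommendation:

    # /etc/sysctl.conf (illustrative values only)
    vm.dirty_ratio = 40
    vm.dirty_background_ratio = 10
    vm.swappiness = 10

    # apply without a reboot
    sysctl -p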

Please keep us informed as to what you find out.

Thanks,
Tom




