[libvirt] blkio cgroup
Vivek Goyal
vgoyal at redhat.com
Mon Feb 21 18:44:42 UTC 2011
On Mon, Feb 21, 2011 at 09:30:32AM +0100, Dominik Klein wrote:
> On 02/21/2011 09:19 AM, Dominik Klein wrote:
> >>> - Is it possible to capture 10-15 second blktrace on your underlying
> >>> physical device. That should give me some idea what's happening.
> >>
> >> Will do, read on.
> >
> > Just realized I missed this one ... Had better done it right away.
> >
> > So here goes.
> >
> > Setup as in first email. 8 Machines, 2 important, 6 not important ones
> > with a throttle of ~10M. group_isolation=1. Each vm dd'ing zeroes.
> >
> > blktrace -d /dev/sdb -w 30
> > === sdb ===
> > CPU 0: 4769 events, 224 KiB data
> > CPU 1: 28079 events, 1317 KiB data
> > CPU 2: 1179 events, 56 KiB data
> > CPU 3: 5529 events, 260 KiB data
> > CPU 4: 295 events, 14 KiB data
> > CPU 5: 649 events, 31 KiB data
> > CPU 6: 185 events, 9 KiB data
> > CPU 7: 180 events, 9 KiB data
> > CPU 8: 17 events, 1 KiB data
> > CPU 9: 12 events, 1 KiB data
> > CPU 10: 6 events, 1 KiB data
> > CPU 11: 55 events, 3 KiB data
> > CPU 12: 28005 events, 1313 KiB data
> > CPU 13: 1542 events, 73 KiB data
> > CPU 14: 4814 events, 226 KiB data
> > CPU 15: 389 events, 19 KiB data
> > CPU 16: 1545 events, 73 KiB data
> > CPU 17: 119 events, 6 KiB data
> > CPU 18: 3019 events, 142 KiB data
> > CPU 19: 62 events, 3 KiB data
> > CPU 20: 800 events, 38 KiB data
> > CPU 21: 17 events, 1 KiB data
> > CPU 22: 243 events, 12 KiB data
> > CPU 23: 1 events, 1 KiB data
> > Total: 81511 events (dropped 0), 3822 KiB data
> >
> > Very constant 296 blocked processes in vmstat during this run. But...
> > apparently no data is written at all (see "bo" column).
Hm..., this sounds bad. If you have put a limit of ~10MB/s, then seeing no
"bo" at all is bad. That would explain why your box stops responding and
you need to do a power reset.
- I am assuming that you have not put any throttling limits on the root group.
  Is your system root also on /dev/sdb, or on a separate disk altogether?
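Something like the following would check both points (untested sketch; the
cgroup mount point /cgroup/blkio is an assumption, adjust to wherever the
blkio controller is mounted on your system):

```shell
# Hypothetical blkio cgroup mount point -- adjust to your setup.
CG=/cgroup/blkio

# The root group's throttle files should be empty, i.e. contain no
# "major:minor bytes_per_sec" rules limiting the whole device.
for f in blkio.throttle.read_bps_device blkio.throttle.write_bps_device; do
    if [ -r "$CG/$f" ]; then
        echo "== $f =="
        cat "$CG/$f"
    fi
done

# Check which device the system root lives on.
df /
```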
- This sounds like a bug in the throttling logic. To narrow it down, can you
  try running "deadline" on the end device? If it still happens, the problem
  is more or less in the throttling layer.
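The scheduler can be switched at runtime through sysfs, no reboot needed
(sketch assuming the end device is sdb, as in your trace):

```shell
DEV=sdb                                  # device under test (assumption)
SCHED=/sys/block/$DEV/queue/scheduler

# The currently active scheduler is shown in brackets,
# e.g. "noop [cfq] deadline".
if [ -r "$SCHED" ]; then
    cat "$SCHED"
fi

# Switch to deadline at runtime (needs root).
if [ -w "$SCHED" ]; then
    echo deadline > "$SCHED"
    cat "$SCHED"
fi
```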
- We can also try taking the dm layer out of the picture: just create
  partitions on /dev/sdb, export those as virtio disks to the virtual
  machines, and see if it still happens.
- In one of your mails you mentioned that throttling READs and WRITEs works
  for you with 1 virtual machine. So it looks like 1 virtual machine does
  not hang, but once you launch 8 virtual machines it hangs. Can we try
  increasing the number of virtual machines gradually and confirm that it
  happens only once a certain number of virtual machines are launched?
- Can you also paste the rules you have put on the important and
  non-important groups? I suspect that one of the rules has gone horribly
  wrong, in the sense that it is very low and effectively no virtual
  machine is making any progress.
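For reference, a ~10MB/s write limit is expressed as a
"major:minor bytes_per_sec" line in the blkio.throttle.* files. A rough
sketch (the group name "notimportant" and the 8:16 major:minor for
/dev/sdb are assumptions -- verify with ls -l /dev/sdb):

```shell
# ~10 MB/s in bytes/sec, the unit the blkio.throttle.* files expect.
LIMIT=$((10 * 1024 * 1024))
echo "limit in bytes/sec: $LIMIT"

# A rule is "<major>:<minor> <bytes_per_sec>".
RULE="8:16 $LIMIT"
echo "$RULE"

# Writing it would look like (hypothetical group name):
#   echo "$RULE" > /cgroup/blkio/notimportant/blkio.throttle.write_bps_device
# Reading the file back shows all rules currently in effect for the group.
```

If a rule ended up orders of magnitude lower than intended (e.g. 10 instead
of 10485760 bytes/sec), a group would make essentially no progress.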
- How long does it take to reach this locked state where bo=0?
- You can also try piping the blktrace output through blkparse to standard
  output, and capture some of it by copy-pasting the last messages.
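That pipeline would look roughly like this (sketch; run as root on the
host, and bound the capture with -w so it terminates on its own):

```shell
DEV=/dev/sdb   # underlying physical device, as in the trace above

# Stream binary trace data to stdout ("-o -") and parse it live with
# blkparse ("-i -" reads stdin); "-w 10" bounds the capture to 10 seconds.
if [ -b "$DEV" ] && command -v blktrace >/dev/null 2>&1; then
    blktrace -d "$DEV" -w 10 -o - | blkparse -i -
fi
```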
In the meantime, I will try to launch more machines and see if I can
reproduce the issue.
Thanks
Vivek
> >
> > vmstat 2
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >  r   b swpd      free   buff  cache si so bi   bo    in    cs us sy id wa
> >  0 296    0 125254224 21432 142016  0  0 16  633   181   331  0  0 93  7
> >  0 296    0 125253728 21432 142016  0  0  0    0 17115 33794  0  0 25 75
> >  0 296    0 125254112 21432 142016  0  0  0    0 17084 33721  0  0 25 74
> >  1 296    0 125254352 21440 142012  0  0  0   18 17047 33736  0  0 25 75
> >  0 296    0 125304224 21440 131060  0  0  0    0 17630 33989  0  1 23 76
> >  1 296    0 125306496 21440 130260  0  0  0    0 16810 33401  0  0 20 80
> >  4 296    0 125307208 21440 129856  0  0  0    0 17169 33744  0  0 26 74
> >  0 296    0 125307496 21448 129508  0  0  0   14 17105 33650  0  0 36 64
> >  0 296    0 125307712 21452 129672  0  0  2 1340 17117 33674  0  0 22 78
> >  1 296    0 125307752 21452 129520  0  0  0    0 16875 33438  0  0 29 70
> >  1 296    0 125307776 21452 129520  0  0  0    0 16959 33560  0  0 21 79
> >  1 296    0 125307792 21460 129520  0  0  0   12 16700 33239  0  0 15 85
> >  1 296    0 125307808 21460 129520  0  0  0    0 16750 33274  0  0 25 74
> >  1 296    0 125307808 21460 129520  0  0  0    0 17020 33601  0  0 26 74
> >  1 296    0 125308272 21460 129520  0  0  0    0 17080 33616  0  0 20 80
> >  1 296    0 125308408 21460 129520  0  0  0    0 16428 32972  0  0 42 58
> >  1 296    0 125308016 21460 129524  0  0  0    0 17021 33624  0  0 22 77
>
> While we're on that ... It is impossible for me now to recover from this
> state without pulling the power plug.
>
> On the VMs console I see messages like
> INFO: task (kjournald|flush-254|dd|rs:main|...) blocked for more than
> 120 seconds.
If the VMs are completely blocked and not making any progress, this is expected.
>
> While the ssh sessions through which the dd was started seem intact
> (pressing enter gives a new line), it is impossible to cancel the dd
> command. Logging in on the VMs console also is impossible.
>
> Opening a new ssh session to the host also does not work. Killing the
> qemu-kvm processes from a session opened earlier leaves zomby processes.
> Moving the VMs back to the root cgroup makes no difference either.
>
> Regards
> Dominik
>
> --
> libvir-list mailing list
> libvir-list at redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list