[libvirt] blkio cgroup

Tue Feb 22 15:24:26 UTC 2011

On Tue, Feb 22, 2011 at 02:29:31PM +0100, Dominik Klein wrote:
> Hi
> 
> to be as verbose and clear as possible, I will write every command I use.

Thanks, Dominik, for the detailed explanation and tests.

> 
> This sets everything up for the test:
> 
> # setup start
> 
> importantvms="kernel1 kernel2"
> notimportantvms="kernel3 kernel4 kernel5 kernel6 kernel7 kernel8"
> 
> for vm in $importantvms $notimportantvms; do
> virsh domstate $vm|grep -q running || virsh start $vm;
> done
> 
> echo 1 > /sys/block/sdb/queue/iosched/group_isolation
> 
> mount -t cgroup -o blkio none /mnt
> cd /mnt
> mkdir important
> mkdir notimportant
> 
> for vm in $notimportantvms; do
> cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
> for task in *; do
> /bin/echo $task > /mnt/notimportant/tasks
> done
> done
> 
> for vm in $importantvms; do
> cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
> for task in *; do
> /bin/echo $task > /mnt/important/tasks
> done
> done
> 
> #ls -lH /dev/vdisks/kernel[3-8]
> #brw-rw---- 1 root root 254, 25 Feb 22 13:42 kernel3
> #brw-rw---- 1 root root 254, 26 Feb 22 13:42 kernel4
> #brw-rw---- 1 root root 254, 27 Feb 22 13:42 kernel5
> #brw-rw---- 1 root root 254, 28 Feb 22 13:42 kernel6
> #brw-rw---- 1 root root 254, 29 Feb 22 13:43 kernel7
> #brw-rw---- 1 root root 254, 30 Feb 22 13:43 kernel8
> 
> for i in $(seq 25 30); do
> /bin/echo 254:$i 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> done
> 
> # setup complete
> 
> > Hm..., this sounds bad. If you have put a limit of ~10Mb/s then no
> > "bo" is bad. That would explain why your box is not responding
> > and you need to do a power reset.
> > 
> > - I am assuming that you have not put any throttling limits on root group.
> >   Is your system root also on /dev/sdb or on a separate disk altogether?
> 
> No throttling on root. Correct.
> 
> system root is on sda
> vms are on sdb
> 
> > - This sounds like a bug in throttling logic. To narrow it down can you
> >   start running "deadline" on end device. If it still happens, it is more
> >   or less in throttling layer.
> 
> cat /sys/block/sdb/queue/scheduler
> noop deadline [cfq]
> echo deadline > /sys/block/sdb/queue/scheduler
> cat /sys/block/sdb/queue/scheduler
> noop [deadline] cfq
> 
> This changes things:
> 
> vmstat before test (nothing going on):
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  1  0      0 130231696  17968  69424    0    0     0     0 16573 32834  0  0 100  0
> 
> vmstat during test while all 8 vms are writing (2 unthrottled, 6 at 10M):
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  2  9      0 126257984  17968  69424    0    0     8 164462 20114 36751  5  4 52 39
> 
> vmstat during test when only throttled vms are still dd'ing
>  2 21      0 124518928  17976  69424    0    0     0 63876 17410 33670  1  1 59 39
> 
> Load is 12-ish during the test and the results are as expected:
> Throttled VMs write with approx 10M, unthrottled get like 80 each, which
> sums up to about the 200M max capacity of the device for direct io.

Ok, so deadline works just fine and there are no hangs. So it sounds more
like a CFQ issue.
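
When repeating these runs it is easy to lose track of which scheduler is
active on sdb. A minimal sketch for pulling the bracketed name out of the
sysfs line shown above; active_sched is a hypothetical helper name, not an
existing tool, and it only assumes the "noop deadline [cfq]" format from
your output:

```shell
# Print the active I/O scheduler from a line such as "noop deadline [cfq]".
# active_sched is a hypothetical helper name used only for illustration.
active_sched() {
    # Keep only the text between the square brackets.
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example on a live box (device name taken from this thread):
#   active_sched "$(cat /sys/block/sdb/queue/scheduler)"
```

Recording that value next to each test result would make the cfq and
deadline runs easier to compare.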

> 
> > - We can also try to remove dm layers and just create partitions on 
> >   /dev/sdb and export as virtio disks to virtual machines and take
> >   dm layer out of picture and see if it still happens.
> 
> Did that. Also happens when scheduler is cfq. Also does not happen when
> scheduler is deadline. So it does not seem to be a dm issue.

Ok, so dm is not the culprit then.

> 
> > - In one of the mails you mentioned that with 1 virtual machine throttling
> >   READs and WRITEs is working for you. So it looks like 1 virtual machine
> >   does not hang but once you launch 8 virtual machines it hangs. Can we
> >   try increasing the number of virtual machines gradually and confirm that
> >   it happens only if some certain number of virtual machines are launched?
> 
> Back on cfq here.
> 
> 1 throttled vm: works, load ~4
> 2 throttled vms: works, load ~6
> 3 throttled vms: works, load ~9
> 4 throttled vms: works, load ~12
> 6 throttled vms: works, load ~20
> 
> The number of blocked threads increases with the number of vms dd'ing.
> 
> At the beginning of each test, the blocked threads number goes really
> high (4 vms 160, 6 vms 220), but then drops significantly and stays low.
> 
> So it seems that when only throttled vms are running, the problem does
> not occur.
> 
> 1 throttled + 1 unthrottled vm: works, load ~5
> 2 throttled + 1 unthrottled vm: boom
> 

That's interesting. I will try to run 2 throttled VMs and 1 unthrottled VM
and see if I can hit the same situation.

> Constantly 144 blocked threads, bo=0, load increasing to 144. System
> needs power reset.
> 
> So, thinking about what I did in the initial setup, I re-tested without the
> 
> for vm in $importantvms; do
> cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
> for task in *; do
> /bin/echo $task > /mnt/important/tasks
> done
> done
> 
> since I don't do anything with that "important" cgroup (yet) anyway. It
> did not make a difference though.
> 
> > - Can you also paste me the rules you have put on important and non-important
> >   groups. Somehow I suspect that one of the rules has gone horribly wrong
> >   in the sense that it is very low and effectively no virtual machine
> >   is making any progress.
> 
> See setup commands in the beginning of this email.
> 
> > - How long does it take to reach this locked state where bo=0?
> 
> It goes there "instantly", right after the dd commands start.
> 
> > - You can also try to pipe blktrace output to blkparse, send it to
> >   standard output, and capture some output by copying and pasting the
> >   last messages.
> 
> I hope this is what you meant:
> 
> blktrace -d /dev/sdb -o - | blkparse -i -

Yes, that's what I meant. A few observations from this trace:

- After a while all the IO coming in seems to be sleeping on
  get_request(). "S" signifies that. That means some other threads have
  already consumed the available request descriptors and new threads 
  are being put to sleep and will be woken up when request descriptors are
  available. Not sure why that is happening here.

- CFQ was idling on a queue, the idle timer fired, and it scheduled another
  dispatch. After that no request is dispatched at all. Something
  went wrong for sure.

  8,16   1        0    55.806298452     0  m   N cfq idle timer fired
  8,16   1        0    55.806300924     0  m   N cfq3564 slice expired t=0
  8,16   1        0    55.806303380     0  m   N cfq3564 sl_used=2 disp=1 charge=2 iops=0 sect=8
  8,16   1        0    55.806303770     0  m   N cfq3564 del_from_rr
  8,16   1        0    55.806304555     0  m   N cfq schedule dispatch

- The following seems odd:

8,16   7      925    55.890489350  3768  Q  WS 3407925696 + 480 [qemu-kvm]
8,16   7      926    55.890494793  3768  S  WS 3407925696 + 480 [qemu-kvm]
8,16   1        1 1266874889.707163736     0  C   R 3390363000 + 48 [0]
8,16   1        0 1266874889.707181482     0  m   N cfq3273 complete rqnoidl 

The time in the fourth column jumps from 55.x to 1266874889.x. That sounds
like some kind of corruption.

- I don't see any throttling messages. They are prefixed by "throtl". So
  it seems all this IO is happening in the root group. I believe it belongs
  to the unthrottled VM. So to me it looks like the system got into bad shape
  even before the throttled VMs were started.

- So it sounds more and more like a CFQ issue which happens in conjunction
  with throttling. I will try to reproduce it.

- I need a little more info about how you captured the blktrace. So you
  started blktrace and then started dd in parallel in all three
  VMs, the system froze immediately, and these are the only logs we see
  on the console?

Thanks
Vivek