[libvirt] blkio cgroup

Fri Feb 18 16:31:37 UTC 2011

On Fri, Feb 18, 2011 at 03:42:45PM +0100, Dominik Klein wrote:
> Hi Vivek
> 
> I don't know whether you follow the libvirt list, I assume you don't. So
> I thought I'd forward you an E-Mail involving the blkio controller and a
> terrible situation arising from using it (maybe in a wrong way).
> 
> I'd truly appreciate it if you read it and commented on it. Maybe I did
> something wrong, but maybe also I found a bug in some way.

Hi Dominik, 

Thanks for forwarding me this mail. You are right, I was not on
libvir-list. I have just now subscribed.

A few questions inline.

> -------- Original Message --------
> Subject: Re: [libvirt] [PATCH 0/6 v3] Add blkio cgroup support
> Date: Fri, 18 Feb 2011 14:42:51 +0100
> From: Dominik Klein <dk at in-telegence.net>
> To: libvir-list at redhat.com
> 
> Hi
> 
> back with some testing results.
> 
> >> how about the start Guest with option "cache=none" to bypass pagecache?
> >> This should help i think.
> > 
> > I will read up on where to set that and give it a try. Thanks for the hint.
> 
> So here's what I did and found out:
> 
> The host system has two 12-core CPUs and 128 GB of RAM.
> 
> I have 8 test VMs named kernel1 to kernel8. Each VM has 4 VCPUs, 2 GB of
> RAM and one disk, which is an LV on the host. Cache mode is "none":

So you have only one root SATA disk and have set up a linear logical volume
on that? If not, can you give more info about the storage configuration?

- I am assuming you are using CFQ on your underlying physical disk.

- What kernel version are you testing with?

- cache=none mode is good: it should make all the IO O_DIRECT on the host,
  so it should show up as SYNC IO in CFQ without losing io context info.
  The only problem is the intermediate dm layer, and whether it changes the
  io context somehow. I am not sure at this point.

- Is it possible to capture a 10-15 second blktrace on your underlying
  physical device? That should give me some idea of what's happening.

- Can you also try setting /sys/block/<disk>/queue/iosched/group_isolation=1
  on your underlying physical device where CFQ is running, and see if it
  makes any difference?
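
The checks in the list above could be collected roughly as follows. This is
only a sketch: the helper names and the disk name "sda" are placeholders
made up for illustration, and the printed commands are meant to be reviewed
and then run as root on the physical device underneath the logical volumes.

```shell
#!/bin/sh
# Hypothetical helpers; nothing here touches the system until the printed
# commands are run by hand.

# Print the sysfs path of an IO scheduler tunable for a disk.
# $1 = disk name (e.g. sda), $2 = tunable name
cfq_tunable_path() {
    printf '/sys/block/%s/queue/iosched/%s\n' "$1" "$2"
}

# Extract the active scheduler (the bracketed entry) from the contents of
# /sys/block/<disk>/queue/scheduler, e.g. "noop deadline [cfq]".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Print the commands for the checks, for one disk.
print_cfq_checks() {
    disk=$1
    echo "uname -r"
    echo "cat /sys/block/$disk/queue/scheduler"
    echo "echo 1 > $(cfq_tunable_path "$disk" group_isolation)"
    # -w 15 stops the trace after 15 seconds; -o sets the output basename.
    echo "blktrace -d /dev/$disk -w 15 -o trace"
}
```

For example, `print_cfq_checks sda` lists the commands for a disk named sda;
`active_scheduler "$(cat /sys/block/sda/queue/scheduler)"` reports which
scheduler is actually in use.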

> 
> for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do virsh dumpxml $vm|grep cache; done
>       <driver name='qemu' type='raw' cache='none'/>
>       <driver name='qemu' type='raw' cache='none'/>
>       <driver name='qemu' type='raw' cache='none'/>
>       <driver name='qemu' type='raw' cache='none'/>
>       <driver name='qemu' type='raw' cache='none'/>
>       <driver name='qemu' type='raw' cache='none'/>
>       <driver name='qemu' type='raw' cache='none'/>
>       <driver name='qemu' type='raw' cache='none'/>
> 
> My goal is to give more I/O time to kernel1 and kernel2 than to the rest
> of the VMs.
> 
> mount -t cgroup -o blkio none /mnt
> cd /mnt
> mkdir important
> mkdir notimportant
> 
> echo 1000 > important/blkio.weight
> echo 100 > notimportant/blkio.weight
> for vm in kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do
> cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
> for task in *; do
> /bin/echo $task > /mnt/notimportant/tasks
> done
> done
> 
> for vm in kernel1 kernel2; do
> cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
> for task in *; do
> /bin/echo $task > /mnt/important/tasks
> done
> done
> 
> Then I used cssh to connect to all 8 VMs and execute
> dd if=/dev/zero of=testfile bs=1M count=1500
> in all VMs simultaneously.
> 
> Results are:
> kernel1: 47.5593 s, 33.1 MB/s
> kernel2: 60.1464 s, 26.2 MB/s
> kernel3: 74.204 s, 21.2 MB/s
> kernel4: 77.0759 s, 20.4 MB/s
> kernel5: 65.6309 s, 24.0 MB/s
> kernel6: 81.1402 s, 19.4 MB/s
> kernel7: 70.3881 s, 22.3 MB/s
> kernel8: 77.4475 s, 20.3 MB/s
> 
> Results vary a little bit from run to run, but the difference is nowhere
> near as large as weights of 1000 vs. 100 would suggest.
> 
> So I went and tried to throttle I/O of kernel3-8 to 10MB/s instead of
> weighting I/O. First I rebooted everything so that no old cgroup
> configuration was left in place, and then set up everything except the
> 100 and 1000 weight configuration.
> 
> quote from blkio.txt:
> ------------
> - blkio.throttle.write_bps_device
>         - Specifies upper limit on WRITE rate to the device. IO rate is
>           specified in bytes per second. Rules are per device. Following is
>           the format.
> 
>   echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.write_bps_device
> -------------
> 
> for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do ls -lH /dev/vdisks/$vm; done
> brw-rw---- 1 root root 254, 23 Feb 18 13:45 /dev/vdisks/kernel1
> brw-rw---- 1 root root 254, 24 Feb 18 13:45 /dev/vdisks/kernel2
> brw-rw---- 1 root root 254, 25 Feb 18 13:45 /dev/vdisks/kernel3
> brw-rw---- 1 root root 254, 26 Feb 18 13:45 /dev/vdisks/kernel4
> brw-rw---- 1 root root 254, 27 Feb 18 13:45 /dev/vdisks/kernel5
> brw-rw---- 1 root root 254, 28 Feb 18 13:45 /dev/vdisks/kernel6
> brw-rw---- 1 root root 254, 29 Feb 18 13:45 /dev/vdisks/kernel7
> brw-rw---- 1 root root 254, 30 Feb 18 13:45 /dev/vdisks/kernel8
> 
> /bin/echo 254:25 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> /bin/echo 254:26 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> /bin/echo 254:27 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> /bin/echo 254:28 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> /bin/echo 254:29 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> 
> Then I ran the previous test again. This resulted in an ever increasing
> load (last I checked was ~ 300) on the host system. (This is perfectly
> reproducible).
> 
> uptime
> Fri Feb 18 14:42:17 2011
> 14:42:17 up 12 min,  9 users,  load average: 286.51, 142.22, 56.71

Have you run top or something similar to figure out why the load average is
shooting up? I suspect that, because of the throttling limit, IO threads
have been blocked and qemu is forking more IO threads. Can you just run
top/ps and figure out what's happening?
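
On Linux the load average counts tasks in uninterruptible sleep (state D)
as well as runnable ones, so a growing pile of blocked IO threads could
explain the numbers. A minimal sketch for counting D-state threads per
command, assuming the procps version of ps; the helper name count_blocked
is made up for illustration:

```shell
#!/bin/sh
# Read "STATE COMMAND" lines on stdin and print, per command, how many
# threads are in uninterruptible sleep (state starts with D), busiest first.
# Only the first word of the command name is used.
count_blocked() {
    awk '$1 ~ /^D/ { n[$2]++ } END { for (c in n) print n[c], c }' | sort -rn
}

# Usage on the host (one line per thread, no header):
#   ps -eLo stat=,comm= | count_blocked
```

If qemu-kvm dominates the output and its count keeps growing during the dd
run, that would be consistent with the blocked IO thread theory.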

Again, is it some kind of linear volume group from which you have carved
out logical volumes for each virtual machine?

For throttling, can we do a simple test first? That is, run a single
virtual machine, put a throttling limit on its logical volume and try to do
READs. Once READs work, let's test WRITEs and check why the system load
goes up.
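
The single-VM READ test could be set up roughly like this. It is a sketch
under assumptions: GNU stat is available, the cgroup is mounted as in the
quoted mail, and the helper names, the example path and the 10000000
bytes/s limit are placeholders.

```shell
#!/bin/sh
# Turn the hex major/minor printed by `stat -c '%t %T'` into a rule in the
# "<major>:<minor> <rate_bytes_per_second>" format.
# $1 = hex major, $2 = hex minor, $3 = bytes per second
throttle_rule() {
    printf '%d:%d %d\n' "0x$1" "0x$2" "$3"
}

# Apply a READ limit for one block device to one cgroup (needs root).
# $1 = cgroup directory, $2 = block device path, $3 = bytes per second
limit_reads() {
    # -L follows symlinks such as /dev/vdisks/<vm> to the real dm device.
    hexdev=$(stat -L -c '%t %T' "$2") || return 1
    throttle_rule "${hexdev% *}" "${hexdev#* }" "$3" \
        > "$1/blkio.throttle.read_bps_device"
}

# Example, with only kernel3 running and its tasks in the cgroup:
#   limit_reads /mnt/notimportant /dev/vdisks/kernel3 10000000
# then read inside the guest, bypassing the guest page cache:
#   dd if=testfile of=/dev/null bs=1M iflag=direct
```

If the guest's dd settles at around 10 MB/s and the host load stays flat,
the same steps can be repeated with blkio.throttle.write_bps_device.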

Thanks
Vivek