[libvirt] [Qemu-devel] Long QEMU main loop pauses during migration (to file) under heavy load

Daniel P. Berrange berrange at redhat.com
Fri Nov 11 13:13:11 UTC 2011


On Fri, Nov 11, 2011 at 01:03:20PM +0000, Daniel P. Berrange wrote:
> Libvirt recently introduced a change to the way it does 'save to file'
> with QEMU. Historically QEMU has a 32MB/s I/O limit on migration by
> default. When saving to file, we didn't want any artificial limit,
> but rather to max out the underlying storage. So when doing save to
> file, we set a large bandwidth limit (INT64_MAX / (1024 * 1024)) so
> it is effectively unlimited.
> 
> After doing this, we discovered that the QEMU monitor was becoming
> entirely blocked. It did not even return from the 'migrate' command
> until migration was complete despite the 'detach' flag being set.
> This was a bug in libvirt, because we passed a plain file descriptor
> which does not support EAGAIN. Thank you POSIX.
> 
> Libvirt has another mode where it uses an I/O helper command so get
> O_DIRECT, and in this mode we pass a pipe() FD to QEMU. After ensuring
> that this pipe FD really does have O_NONBLOCK set, we still saw some
> odd behaviour.
> 
> I'm not sure whether what I describe can neccessarily be called a QEMU
> bug, but I wanted to raise it for discussion anyway....
> 
> The sequence of steps is
> 
>   - libvirt sets qemu migration bandwidth to "unlimited"
>   - libvirt opens a pipe() and sets O_NONBLOCK on the write end
>   - libvirt spawns  libvirt-iohelper giving it the target file
>     on disk, and the read end of the pipe
>   - libvirt does 'getfd migfile' monitor command to give QEMU
>     the write end of the pipe
>   - libvirt does 'migrate fd:migfile -d' to run migration
>   - In parallel
>        - QEMU is writing to the pipe (which is non-blocking)
>        - libvirt_helper reading the pipe & writing to disk with O_DIRECT


I should have mentioned that the way I'm testing this is with
libvirt 0.9.7, with both QEMU 0.14 and QEMU GIT master, using
a guest with 2 GB of RAM:


    $ virsh start l3
    Domain l3 started

    $ virsh dominfo l3
    Id:             17
    Name:           l3
    UUID:           c7a3edbd-edaf-9455-926a-d65c16db1803
    OS Type:        hvm
    State:          running
    CPU(s):         1
    CPU time:       1.1s
    Max memory:     2292000 kB
    Used memory:    2292736 kB
    Persistent:     yes
    Autostart:      disable
    Managed save:   no
    Security model: selinux
    Security DOI:   0
    Security label: system_u:system_r:unconfined_t:s0:c94,c700 (permissive)

To actually perform the save-to-file, I use the '--bypass-cache' flag
for libvirt, which ensures we pass a pipe to QEMU and run our I/O
helper for O_DIRECT, instead of directly giving QEMU a plain file

     $ virsh save --bypass-cache l3 l3.image 
     Domain l3 saved to l3.image

>  - Most of the qemu_savevm_state_iterate() calls complete in 10-20 ms
> 
>  - Reasonably often a qemu_savevm_state_iterate() call takes 300-400 ms
> 
>  - Fairly rarely a qemu_savevm_state_iterate() call takes 10-20 *seconds*

I use the attached systemtap script for determining these

eg run this before starting the migration to disk:

# stap qemu-mig.stp
Begin
  0.000 Start
  5.198 > Begin
  5.220 < Begin   0.022
  5.220 > Iterate
  5.224 < Iterate   0.004

  ...snip..

  6.299 > Iterate
  6.314 < Iterate   0.015
  6.314 > Iterate
  6.319 < Iterate   0.005
  6.409 > Iterate
  8.139 < Iterate   1.730         <<< very slow iteration
  8.152 > Iterate
 13.078 < Iterate   4.926         <<< very slow iteration
 13.963 > Iterate
 14.248 < Iterate   0.285
 14.441 > Iterate
 14.448 < Iterate   0.007

 ...snip...

 24.171 > Iterate
 24.178 < Iterate   0.007
 24.178 > Complete
 24.588 < Complete   0.410

 <Ctrl-C>

avg 79 = sum 8033 / count 101; min 3 max 4926
value |-------------------------------------------------- count
    0 |                                                    0
    1 |                                                    0
    2 |                                                    1
    4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              74
    8 |@@@@@@@@@                                          19
   16 |@                                                   3
   32 |                                                    0
   64 |                                                    0
  128 |                                                    0
  256 |@                                                   2
  512 |                                                    0
 1024 |                                                    1
 2048 |                                                    0
 4096 |                                                    1
 8192 |                                                    0
16384 |                                                    0

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
-------------- next part --------------


global then;
global deltas;

global start;

function print_ts(str) {
  now = gettimeofday_ns() / (1000*1000)

  delta = (now - start)

  printf("%3d.%03d %s\n",
         (delta / 1000), (delta % 1000), str);
}


probe begin {
   printf("Begin\n");
   then = 0;
  start = gettimeofday_ns() / (1000*1000);
  print_ts("Start");

}


probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_begin") {
   then = gettimeofday_ns() / (1000*1000);
   print_ts("> Begin");   
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_begin").return {
   now = gettimeofday_ns() / (1000*1000);
   if (then != 0) {
     delta = now - then;
     deltas <<< delta;
     print_ts(sprintf("< Begin %3d.%03d",
         (delta / 1000), (delta % 1000)));
   }
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_iterate") {
   then = gettimeofday_ns() / (1000*1000);
   print_ts("> Iterate");   
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_iterate").return {
   now = gettimeofday_ns() / (1000*1000);
   if (then != 0) {
     delta = now - then;
     deltas <<< delta;
     print_ts(sprintf("< Iterate %3d.%03d",
         (delta / 1000), (delta % 1000)));
   }
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_complete") {
   then = gettimeofday_ns() / (1000*1000);
   print_ts("> Complete");   
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_complete").return {
   now = gettimeofday_ns() / (1000*1000);
   if (then != 0) {
     delta = now - then;
     deltas <<< delta;
     print_ts(sprintf("< Complete %3d.%03d",
         (delta / 1000), (delta % 1000)));
   }
}


probe end {
   printf ("avg %d = sum %d / count %d; min %d max %d\n",
           @avg(deltas), @sum(deltas), @count(deltas), @min(deltas), @max(deltas));
   print (@hist_log(deltas));
}


More information about the libvir-list mailing list