[libvirt] Restoring from a largish 'virsh save' file

Daniel P. Berrange berrange at redhat.com
Thu Mar 26 13:14:33 UTC 2009


On Thu, Mar 26, 2009 at 05:13:00PM +0900, Matt McCowan wrote:
> On Mon, 23 Mar 2009 13:44:58 +0000
> "Daniel P. Berrange" <berrange at redhat.com> wrote:
> 
> > On Sun, Mar 22, 2009 at 07:28:36PM +0900, Matt McCowan wrote:
> > > Running into an issue where, if I/O is hampered by load for example,
> > > reading a largish state file (created by 'virsh save') is not allowed to
> > > complete.
> > > qemudStartVMDaemon in src/qemu_driver.c has a loop that waits 10 seconds
> > > for the VM to be brought up. An strace against libvirt when doing a
> > > 'virsh restore' against a largish state file shows the VM being sent a
> > > kill when it's still happily reading from the file.
> 
> My bad. It's not the timeout loop in qemudStartVMDaemon that's killing
> it. It's as you suggested - the code is crapping out in
> qemudReadMonitorOutput, seemingly when poll()ing the console's fd - it
> doesn't get any POLLIN in the 10 secs it waits. (Against latest CVS
> pull)

Hmm, this is the exact scenario I thought we had gotten fixed in upstream 
QEMU/KVM. 
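
To make the failure mode concrete, the wait Matt is hitting boils down
to the pattern below: poll() on the monitor fd with a fixed overall
deadline, and give up (killing the guest) if nothing readable turns up
in time. This is only a hand-written sketch of the idea - the function
name, timeout handling and return convention are invented for
illustration, it is not the actual qemudReadMonitorOutput code:

    #include <poll.h>
    #include <errno.h>

    /* Sketch only: wait up to timeout_ms for the QEMU monitor fd to
     * become readable.  If QEMU spends the whole window reading the
     * saved state and never produces any monitor output, this returns
     * 0 and the caller gives up and kills the guest - which is what
     * the strace is showing. */
    static int
    wait_for_monitor(int monfd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = monfd, .events = POLLIN };

        for (;;) {
            int ret = poll(&pfd, 1, timeout_ms);

            if (ret < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted - retry (the real code
                                 * would recompute the remaining time) */
                return -1;      /* genuine poll() failure */
            }
            if (ret == 0)
                return 0;       /* timed out - the VM gets killed */
            if (pfd.revents & POLLIN)
                return 1;       /* monitor has data - carry on */
            return -1;          /* POLLHUP/POLLERR - QEMU went away */
        }
    }

With a 10 second window and a host busy doing I/O, reading a
multi-gigabyte state file can easily take longer than that, so the
timeout fires before QEMU ever says anything on the monitor.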

> 
> > This is a little odd to me - we had previously fixed KVM migration
> > code so that during startup with -incoming, it would correctly
> > respond to monitor commands, explicitly to avoid libvirt timing
> > out in this way. I'm wondering what has broken since then, whether
> > it's libvirt's usage changing, or the KVM impl changing.
> 
> I'm running kvm-83 (QEMU 0.9.1) if that's of any help.
> The state files I have dragged in during testing were generally 4G+ and
> worked without problem. The ones I'm playing with in the production
> environment are <3G, but on a more heavily loaded system with lots of
> snapshotted LVs.
> 
> 'virsh restore' on the other VMs with <2G state files works just fine.
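
To spell out what "responding to monitor commands" meant in that
earlier fix: even while QEMU is still reading the incoming state,
poking the human monitor should get the "(qemu) " prompt back almost
immediately. Roughly like the sketch below - the helper name, buffer
handling and prompt matching are invented for the example, it is not
real libvirt code:

    #include <poll.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch only: what a responsive monitor looks like during an
     * "-incoming" startup.  Send a harmless command and expect the
     * "(qemu) " prompt back within a short window. */
    static int
    monitor_is_responsive(int monfd, int timeout_ms)
    {
        static const char cmd[] = "info status\n";
        struct pollfd pfd = { .fd = monfd, .events = POLLIN };
        char buf[1024];
        ssize_t got;

        if (write(monfd, cmd, sizeof(cmd) - 1) < 0)
            return 0;

        if (poll(&pfd, 1, timeout_ms) <= 0)
            return 0;                   /* nothing came back in time */

        got = read(monfd, buf, sizeof(buf) - 1);
        if (got <= 0)
            return 0;
        buf[got] = '\0';

        /* The human monitor terminates its reply with a prompt */
        return strstr(buf, "(qemu)") != NULL;
    }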

Clearly the monitor console is not responding while QEMU is reading in
the state file, and how long that takes is dependent on host OS load.

As a temporary workaround the only real option is to increase that
10 second timeout significantly when doing a restore/migrate operation.
In parallel with that we'll have to look at the KVM code again and
figure out why it's behaving this way.
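
To be concrete about bumping the timeout, I mean something along these
lines: pick a much larger deadline when the guest was started with
-incoming, since QEMU has to read the whole state file before the
monitor will answer. Again only a sketch - the constants and the
'incoming' flag are made up, this isn't a patch against qemu_driver.c:

    /* Sketch only: choose the monitor wait deadline based on how the
     * guest was started.  A restore/migrate has to read the whole
     * state file first, so it gets a much bigger window. */
    enum {
        VM_START_TIMEOUT_MS   = 10 * 1000,       /* plain startup     */
        VM_RESTORE_TIMEOUT_MS = 10 * 60 * 1000   /* -incoming restore */
    };

    static int
    monitor_timeout_for(int incoming_migration)
    {
        return incoming_migration ? VM_RESTORE_TIMEOUT_MS
                                  : VM_START_TIMEOUT_MS;
    }

The startup path would then hand that value down to whatever does the
poll() on the monitor fd.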

Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|



