[libvirt] [PATCH] Avoid a race when restoring a qemu domain.

Laine Stump laine at laine.org
Thu Apr 8 04:48:45 UTC 2010

On 04/07/2010 03:43 PM, Chris Lalancette wrote:
> Hm, this really doesn't seem like it's the way to fix this.

You are correct that it isn't what should be done in the long term. 
Short term, though, it definitely fixes bad behavior that I wouldn't 
want to see in an official release (on my hardware, restores will 
basically always fail unless the guest was paused prior to saving).

> We really
> should investigate what is going on in qemu, and see if it's a bug in
> qemu itself (in which case we should fix qemu), or if it's a bug in the
> way we communicate with qemu (in which case we should fix that).

I'm operating on information I learned in an IRC chat. Perhaps Dan 
Berrange can pipe up here to repeat / expand on what he said, but 
basically it sounds like the problem is that qemu will happily start the 
CPUs for us before the restore operation has begun, and there's no way 
for us to verify whether or not it has begun. For that, qemu will need 
to make 'info migrate' work on the incoming side, and that's not likely 
to happen very quickly. (Of course it will take even longer if I don't 
whine about it; I just haven't gotten there yet ;-)
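
To illustrate what libvirt could do if qemu ever did expose such a 
check, here is a minimal sketch in C. Everything in it is an assumption 
made for illustration: the restore_started_fn callback stands in for a 
status query that does not exist on the incoming side today, and 
wait_for_restore_start() and ready_after_countdown() are hypothetical 
names, not libvirt or qemu functions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical callback: returns true once the incoming restore has begun.
 * qemu offers no such query on the incoming side; this is a stand-in. */
typedef bool (*restore_started_fn)(void *opaque);

/* Sleep for ms milliseconds, resuming the sleep if a signal interrupts it. */
static int sleep_ms(long ms)
{
    struct timespec ts = {
        .tv_sec = ms / 1000,
        .tv_nsec = (ms % 1000) * 1000000L,
    };

    /* On EINTR, nanosleep() stores the remaining time back into ts. */
    while (nanosleep(&ts, &ts) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}

/* Poll started() every interval_ms until it reports true or roughly
 * timeout_ms of sleeping has accumulated.
 * Returns 0 once the restore has started, -1 on timeout or sleep failure. */
static int wait_for_restore_start(restore_started_fn started, void *opaque,
                                  long interval_ms, long timeout_ms)
{
    long waited = 0;

    while (!started(opaque)) {
        if (waited >= timeout_ms)
            return -1;
        if (sleep_ms(interval_ms) < 0)
            return -1;
        waited += interval_ms;
    }
    return 0;
}

/* Demonstration-only stand-in for the status query: reports "not started"
 * until the counter pointed to by opaque has been decremented to zero. */
static bool ready_after_countdown(void *opaque)
{
    int *remaining = opaque;

    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

A caller would start the CPUs only after wait_for_restore_start() 
returns 0, and treat -1 as a failed restore rather than resuming 
blindly. The point of polling over a fixed sleep is that the wait ends 
as soon as the condition holds and is bounded when it never does.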

>    A sleep is just hiding the problem

Yes, I dislike this solution. I'd love it if someone could tell me of an 
alternate way. If there is no other way to fix it entirely within 
libvirt, though, I don't think we should just report the problem to qemu 
and let users suffer until it gets fixed there. Especially if that fix 
requires a new interface in qemu that must then be supported by libvirt, 
the path to reliably working domain restores could be very long indeed. 
In the meantime we'd be left with delivered code that may fail in a 
rather bad way for someone. That is especially true of a managed save, 
where the image is deleted as soon as the domain is started: if it 
fails once, you've lost the image, so you can't even try again.

> (which means it can still pop up on
> machines slower, or more busy, than yours!).

I'm doubtful that slower VT-capable machines exist (although I haven't 
checked - possibly this same problem exists when doing software 
emulation too). I hadn't considered whether this would pop up on faster 
hardware that was also busier - a very good point.

(I did just do some more testing, and found that even 50 msec is enough 
to make things work. 10 msec isn't enough, though...)
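
For anyone who wants to see the shape of the workaround under 
discussion, here is a minimal sketch in C. It is not the code from the 
patch: resume_after_delay(), resume_cpus_fn, count_resume() and 
RESTORE_SETTLE_DELAY_MS are hypothetical names, and the 50 msec value is 
just the figure from the testing above, so it may well be too short on 
a busier machine.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <time.h>

/* Empirical value: 50 msec was enough in the testing described above and
 * 10 msec was not. It may be too short on other (or busier) machines. */
#define RESTORE_SETTLE_DELAY_MS 50L

/* Hypothetical hook that asks qemu to start the guest CPUs. */
typedef int (*resume_cpus_fn)(void *opaque);

/* Wait delay_ms milliseconds, then call resume().
 * Returns -1 if the sleep fails, otherwise whatever resume() returns. */
static int resume_after_delay(resume_cpus_fn resume, void *opaque,
                              long delay_ms)
{
    struct timespec ts = {
        .tv_sec = delay_ms / 1000,
        .tv_nsec = (delay_ms % 1000) * 1000000L,
    };

    /* On EINTR, nanosleep() stores the remaining time back into ts,
     * so the loop sleeps for the full delay despite signals. */
    while (nanosleep(&ts, &ts) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return resume(opaque);
}

/* Demonstration-only stand-in for the resume hook: counts its calls. */
static int count_resume(void *opaque)
{
    ++*(int *)opaque;
    return 0;
}
```

Nothing here confirms that the restore has actually begun before the 
CPUs start; the delay only makes the race much less likely to be lost, 
which is exactly the objection raised above.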
