[libvirt] domain restore race condition
Laine Stump
laine at laine.org
Mon Feb 22 18:25:34 UTC 2010
As noted in another message, the problem I was seeing is a race
condition in qemudDomainRestore(), not a result of my modifications to
qemudDomainSave(). Here's some discussion of that problem from IRC,
with a question at the bottom:
> <laine> Does anyone else see a failure of domain restore (immediately
> after domain save)? I'm very definitely seeing it on my machine with
> F12+updates testing and libvirt built from unpatched sources.
> <laine> It's very reproducible - with virsh I do "save domain
> filename", then "restore filename" and it pretty much always gives me
> a black screen. Then I force shutdown the guest (with virt-manager)
> and do "restore filename" again. Tada! It's restored and running!
> [...]
> <danpb> laine: possible race condition
> <danpb> laine: try putting a sleep(10) before the qemuMonitorStartCPUs
> in qemuDomainRestore()
Dan's suggestion *did* eliminate the failures.
> [...]
> <danpb> laine: this sounds like the issue with libvirt prematurely
> starting execution of the CPUs before QEMU has even started restoring
> (or something like that)
> <danpb> laine: search the archives for a mail from Charles Duffy on
> this subject some time ago
>
Here's the BZ filed by Charles Duffy:
https://bugzilla.redhat.com/show_bug.cgi?id=537938
It looks like he was dealing with a race condition earlier in the
restore: his solution was to wait for the migration process to terminate
somewhere inside qemudStartVMDaemon(), rather than after
qemudStartVMDaemon() has finished (which is what the code does now).
Since that wait has already completed by the time Dan's sleep(10) runs
in my test, I don't think Charles' patch would help in this case.
So is there something libvirt can wait on here to ensure the guest
starts properly? Or is this a problem in qemu? (I'm still running qemu
0.11; I'll also try upgrading to 0.12 to see whether the behavior
changes.)
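For testing, a fixed sleep(10) hides the race rather than removing it. A minimal sketch of the usual alternative, bounded polling on a readiness condition, is shown here. `ready_fn`, `wait_until_ready` and `countdown_ready` are hypothetical names, not libvirt or QEMU APIs, and which condition libvirt could actually poll is exactly the open question above.

```c
#define _POSIX_C_SOURCE 200809L /* for nanosleep() */
#include <stdbool.h>
#include <time.h>

/* Hypothetical readiness check: returns true once the restored guest
 * state is fully loaded. A stand-in for whatever condition turns out
 * to be observable; not a real libvirt or QEMU call. */
typedef bool (*ready_fn)(void *opaque);

/* Poll ready() every interval_ms until it reports true or timeout_ms
 * has elapsed. Returns 0 on success, -1 on timeout. Only then would
 * it be safe to start the CPUs. */
static int wait_until_ready(ready_fn ready, void *opaque,
                            long interval_ms, long timeout_ms)
{
    struct timespec delay = {
        .tv_sec = interval_ms / 1000,
        .tv_nsec = (interval_ms % 1000) * 1000000L,
    };
    long waited = 0;

    for (;;) {
        if (ready(opaque))
            return 0;
        if (waited >= timeout_ms)
            return -1;
        nanosleep(&delay, NULL);
        waited += interval_ms;
    }
}

/* Hypothetical example check: becomes ready after *remaining polls. */
static bool countdown_ready(void *opaque)
{
    int *remaining = opaque;
    if (*remaining > 0)
        (*remaining)--;
    return *remaining == 0;
}
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails loudly when it never does, instead of silently starting the CPUs too early.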