[libvirt] [RFC] Add a new feild to qemud_save_header to tell if the saved image is corrupt

Eric Blake eblake at redhat.com
Wed Aug 17 15:50:06 UTC 2011


On 08/17/2011 08:59 AM, Daniel P. Berrange wrote:
> On Wed, Aug 17, 2011 at 09:10:41PM +0800, Osier Yang wrote:
>> If one tries to restore a domain from a corrupt save image, we blindly
>> goes forward to restore from it, this can cause many different errors,
>> depending on how much the image is saved. E.g.
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=730750
>>
>> So I'm thinking if we can introduce a new feild to struct qemud_save_header,
>> such as "bool complete;", and set it true if succeeded to save the image,
>> false if not. So that could do some checking while trying to open the image
>> (qemuDomainSaveImageOpen), and quit early if "complete" is false, with
>> a sensiable error message.
>>
>> Thought?
>
> I assume you mean that when saving the guest, we'd do the following sequence
>
>    1. Write out basic header with complete=false
>    2. Write out XML doc
>    3. Run 'migrate' in QEMU to write save state
>    4. If success, update header with complete=true
>
> And then on restore we'd do
>
>    1. If complete == false, then quit with error
>    2. Run QEMU with -incoming to restore

Yes, I think we need that.

>
> At which point I wonder what's wrong with:
>
>    1. Write out basic header
>    2. Write out XML doc
>    3. Run 'migrate' in QEMU to write save state
>    4. If not success, unlink save fail
>
> The only case where there's any difference, is if libvirtd itself crashes
> between steps 1&  4.

Remember, that 'migrate' is a long-running async job command, and can be 
interrupted.  That is, 'service libvirtd restart' is a legal action to 
take during step 3, and it is not as severe as a libvirtd crash, and we 
have already recently added patches to remember async job status across 
libvirtd restarts with the intention of making it legal to restart 
libvirtd in the middle of an async job (whether the async job should 
still succeed, or should remove the save file, is a slightly different 
question; but removing the save file would require that we save in the 
XML the name of the file to remove if libvirtd is restarted).

Also, in the case of 'virsh save', the file is user-visible, and there's 
no telling if a user does 'virsh save' in one window, then 
asynchronously types 'cp file elsewhere; virsh restore elsewhere' in 
another window before the first one has finished.  Using the longer 
algorithm of marking 'complete=false' then revisiting it to write 
'complete=true' provides more protection against users accessing their 
file system asynchronously, where they can otherwise get a corrupt image 
even without a libvirtd crash or restart.

-- 
Eric Blake   eblake at redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org




More information about the libvir-list mailing list