[libvirt] [PATCH 3/3] introduce pull backup

John Ferlan jferlan at redhat.com
Tue Oct 4 21:55:35 UTC 2016



On 09/30/2016 10:38 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 30.09.2016 01:02, John Ferlan wrote:
>>
>>
>> [...]
>>
>>>>
>>>> Because it's also dependent upon an x-blockdev-del, it cannot be pushed
>>>> upstream to libvirt. I know qemu work continues w/r/t blockdev-add and
>>>> backups, but I don't follow it in detail (not enough time in the day!).
>>>
>>> Ok, at least the patch can be some kind of candidate to be pushed upstream
>>> as soon as blockdev-del drops the x- prefix.
>>>
>>
>> Not as a single patch though - I have a feeling you're going to have a
>> git branch committed to this for a while...  Consider my recent response
>> to your 1/3 patch w/r/t qemu's statement about the blockdev-add command...
> 
> I can split it into API/rpc/driver stubs/qemu implementation as usual.
> I just thought all parts except the qemu implementation are a kind of boilerplate
> and do not need much attention... Hmm, this could be an argument to split
> this into 2 parts - boilerplate/qemu implementation.
> 

See: http://libvirt.org/api_extension.html

I strongly suggest using gitk... Since you'll be making a change to
libvirt_public.syms and remote_protocol.x... Make a "test" change to
either one... Then use gitk to "view" the previous changes to that file.
You can right click on a previously added line to get a popup menu that
allows you to "Show origin of this line"... Selecting that, you'll
see the set of patches for that change...  You'll probably see 7-15
patches in the same general area from the same author.  I would look at
a few... Get a feel for how others have had similar changes accepted.

If nothing else, it gives you an idea of the files that need to change
and the order of the changes so that you don't drop huge one-patch bombs
on the list.

All that said - I know from the recent review of my qemu_monitor_json
series that pkrempa is already thinking about this area of the code...
Besides, as I hope I pointed out, don't do anything until qemu is ready
for changes to be made.

John
>>
>> My *guess* is when 'blockdev-del' is officially supported - that's when
>> libvirt can "assume" support for blockdev-add (since the x- was taken
>> off of it - something that we can document when that happens).
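To illustrate what I mean: once blockdev-del loses the x- prefix, libvirt
could gate the whole feature on a capability bit. QEMU_CAPS_BLOCKDEV_DEL
below is made up for the sketch:

    /* refuse to start a pull backup unless qemu advertises a stable
     * blockdev-del (hypothetical capability flag) */
    if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_BLOCKDEV_DEL)) {
        virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
                       _("this qemu does not support a stable blockdev-del"));
        goto cleanup;
    }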
>>
>> [...]
>>
>>
>>>>
>>>> BTW: Using QUIESCE would then rely on the guest agent being available
>>>> for the domain...  Just making a mental note, but yet something you have
>>>> a dependency upon.
>>>
>>> Well, it is just a flag, just as in the case of snapshots. As to coding
>>> conventions etc., a lot of stuff is rooted in the similar snapshot code and XML.
>>>
>>
>> Something that just needs to be remembered - oh and documented in
>> virsh.pod...
> 
> I'd delay the pod until we're ready to push upstream; anything interesting
> will be in a cover letter, just like right now, but in a less formal form))
> 
>>
>> [...]
>>
>>>>> diff --git a/include/libvirt/virterror.h b/include/libvirt/virterror.h
>>>>> index 2ec560e..c1f8c6c 100644
>>>>> --- a/include/libvirt/virterror.h
>>>>> +++ b/include/libvirt/virterror.h
>>>>> @@ -131,6 +131,7 @@ typedef enum {
>>>>>      VIR_FROM_XENXL = 64,        /* Error from Xen xl config code */
>>>>>  
>>>>>      VIR_FROM_PERF = 65,         /* Error from perf */
>>>>> +    VIR_FROM_DOMAIN_BACKUP = 66,/* Error from domain backup */
>>>>
>>>> Use a shorter name, suggest "DOMBKUP"
>>>>
>>>
>>> but there is VIR_FROM_DOMAIN_SNAPSHOT... and it is more pleasant
>>> for the eyes.
>>>
>>
>> OK - now that I know SNAPSHOT is your basis...  BTW: IIRC the snapshot
>> code didn't make it all in one patch...
>>
>>>>>  
>>>>>  # ifdef VIR_ENUM_SENTINELS
>>>>>      VIR_ERR_DOMAIN_LAST
>>>>> diff --git a/po/POTFILES.in b/po/POTFILES.in
>>>>> index 25dbc84..4cdeb2f 100644
>>>>> --- a/po/POTFILES.in
>>>>> +++ b/po/POTFILES.in
>>>>> @@ -18,6 +18,7 @@ src/bhyve/bhyve_driver.c
>>>>>  src/bhyve/bhyve_monitor.c
>>>>>  src/bhyve/bhyve_parse_command.c
>>>>>  src/bhyve/bhyve_process.c
>>>>> +src/conf/backup_conf.c
>>>>
>>>> why not domain_backup_conf.{c,h}
>>>
>>> similar to snapshot_conf.{c,h}
>>>
>>
>> It's just a file name in the long run; however, what if in the future
>> the storage driver added some sort of backup? Then we'd have
>> 'storage_backup_conf.c' to clearly describe that it was meant for
>> storage.  I think the domain_<function>_conf.c naming should be used...
>> And yes, the same argument can be made for snapshot_conf.c - perhaps it
>> should be renamed.
>>
>> [...]
>>
>>
>>>>> +VIR_ENUM_IMPL(virDomainBackupAddress, VIR_DOMAIN_BACKUP_ADDRESS_LAST,
>>>>> +              "ip",
>>>>
>>>> IPv4, IPv6, IPnotinventedyet
>>>>
>>>> why not "tcp", "tls", etc.? That would seemingly allow a bit of code
>>>> reuse for client transport. There's also virStorageNetHostTransport
>>>
>>> The name 'tcp' is more appropriate indeed )) I would not use virStorageNetHostTransport
>>> and associated structures and XML because they have different semantics.
>>> In the case of backups we have an address a client can connect to; in the case
>>> of network disks we have an address qemu, as a client, can connect to.
>>>
>>
>> Creating a specific name is fine - I was just pointing out the
>> StorageNet for names to consider.
>>
>>>>
>>>> FYI: also see qemuMonitorGraphicsAddressFamily
>>>>
>>>
>>> And this one has the right semantics. I remember I evaluated it as a reuse candidate.
>>> I thought it was awkward to have port in upper level element:
>>>
>>>     <graphics type='vnc' port='5904' sharePolicy='allow-exclusive'>
>>>       <listen type='address' address='1.2.3.4'/>
>>>     </graphics>
>>>
>>> if we could declare it outdated and recommend using the new style...
>>>
>>>     <graphics type='vnc' sharePolicy='allow-exclusive'>
>>>       <listen type='address' address='1.2.3.4' port='5904' />              (autoport, tlsPort etc goes here too...)
>>>     </graphics>
>>>
>>> Actually we already do this for the address attribute:
>>> "The address attribute is duplicated as listen attribute in graphics element for backward compatibility. If both are provided they must be equal."
>>>
>>> So if you are ok with this, I would definitely reuse this element and
>>> associated structures.
>>>
>>
>> Again - just pointed out different things to consider. My knowledge of
>> blockdev-backup is miniscule (I think I spelled it right).
>>
>> [...]
>>
>>>>
>>>> Why differ from existing transport definitions? Seems like there would
>>>> be code reuse opportunities.
>>>>
>>>> BTW: If you're going to have a network protocol, then shouldn't you also
>>>> have an authentication mechanism?
>>>
>>> AFAIK there isn't any yet for pull backups. The situation is similar to
>>> migration: it is not secure when moving state and disks. This is what
>>> Daniel is working on, AFAIK.
>>>
>>
>> Just something that may have to be considered - although who knows. I've
>> recently added LUKS support and now am being told about all those corner
>> cases that weren't considered and where things didn't work quite right -
>> so I'm just trying to consider various things as they come to mind. With
>> any luck the end result will be better.
>>
>> [...]
>>
>>
>>>>> +
>>>>> +    if ((n = virXPathNodeSet("./disks/*", ctxt, &nodes)) < 0)
>>>>> +        goto cleanup;
>>>>
>>>> So the purpose of disks is what?
>>>
>>> At least to specify where temporary backup files should be.
>>>
>>
>> There's nothing specific in the <disks> xml - you could just list 'n'
>> disks and use specific XML APIs to get a count of <disk> elements.
>> Thus the <disks> just seemed superfluous.
> 
> This is mimicking snapshots too... Looks like <disks> is redundant, yes,
> but for the sake of the 'least surprise' principle I would keep it.
> 
>>
>> [...]
>>
>>
>>>>> --- a/src/qemu/qemu_domain.h
>>>>> +++ b/src/qemu/qemu_domain.h
>>>>> @@ -284,6 +284,10 @@ struct _qemuDomainDiskPrivate {
>>>>>       * single disk */
>>>>>      bool blockjob;
>>>>>  
>>>>> +    bool backuping;
>>>>
>>>> Is this like hiccuping ;-)
>>>>
>>>> Maybe consider an enum of sorts for a state of a backup... none,
>>>> possible, in process, failed, ?? others... Haven't got that far yet.
>>>
>>> I'll check, but it looks like a bool is enough. There is 'migrating',
>>> by the way; I can name it 'backup_active' if you prefer.
>>>
>>
>> It just looked funny to read...
>>
>>>>
>>>> If it's possible to backup, then there are perhaps some sanity checks that
>>>> need to be made, or at least some synchronization with migration. It would
>>>> be horrible to start a migration and then start a backup. Just thinking off
>>>> the cuff without looking at the rest of the code.
>>>>
>>>>> +    bool backupdev;
>>>>
>>>> bool ?  cannot wait to see this!
>>>
>>> ok, backupdev_created
>>>
>>
>> Sounds like you need an enum that can go through states "not desired",
>> "desired", "selected", "in process", "succeeded", etc.
> 
> True, I had this feeling too to some extent. I'll look deeper into this
> opportunity in the next version.
> 
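To sketch the kind of thing I mean - purely illustrative; the names are
made up but follow the usual enum conventions:

    /* hypothetical per-disk backup state, replacing the separate
     * backuping/backupdev/backupFailed bools */
    typedef enum {
        QEMU_DOMAIN_DISK_BACKUP_NONE = 0,    /* not part of a backup */
        QEMU_DOMAIN_DISK_BACKUP_DESIRED,     /* requested in backup XML */
        QEMU_DOMAIN_DISK_BACKUP_DEV_CREATED, /* temporary target created */
        QEMU_DOMAIN_DISK_BACKUP_EXPORTED,    /* exported via NBD server */
        QEMU_DOMAIN_DISK_BACKUP_FAILED,      /* failed, needs cleanup */

        QEMU_DOMAIN_DISK_BACKUP_LAST
    } qemuDomainDiskBackupState;

A single state field like that would also make the cleanup logic easier
to reason about than three independent bools.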
>>
>>>>
>>>>> +    bool backupFailed;
>>>>> +
>>
>> And this is the one that is the kicker... Once you have failed, you'd
>> need to decide how to handle any others that were "in process" somehow
>> or had been processed successfully.  Error recovery could be a full-time
>> job - similar to error recovery for snapshot failures (internal,
>> external, make my head spin)
> 
> I took a simple approach. The stop operation finishes with an error in
> this case.
> 
> One doesn't need complex recovery with backups because they are
> unobtrusive for user-supplied disks. On any error you probably
> get some garbage in the backup targets, but the user disks are totally
> untouched.
> 
> 
>>
>> [...]
>>
>>>>> +        if (virAsprintf(&disk->src->path, "%s.backup", tmppath) < 0) {
>>>>
>>>> A name selection problem...   Seems to me some way to use a temporary
>>>> file name could be devised.  Consider using ".XXXXXX" and then mkostemp
>>>> (there are other examples).
>>>>
>>>> Once you've successfully backed things up the name then changes to
>>>> whatever user supplied name is provided.  And of course similar issues
>>>> exist there - do you overwrite existing files?
>>>
>>> It is a pull backup. The user doesn't really need this temporary file. It is
>>> kept only during the backup, as an inverse delta preserving the disk
>>> state at the moment the backup started. So there is no reason to rename.
>>>
>>> As to randomly generated names... Do we really need them? We can afford
>>> only one backup at a time, one backup delta per disk...
>>>
>>
>> I thought there was a whole naming thing with snapshots - not sure.
>> Certainly synergies can be made.
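For the record, the pattern I had in mind was roughly this - a sketch
only, not tied to the patch's actual helpers (origpath stands for the
original image path):

    /* Sketch: derive a unique temporary backup target from the disk
     * path; mkostemp() fills in the XXXXXX and opens with
     * O_CREAT|O_EXCL, so an existing file is never silently reused
     * or overwritten */
    char *tmppath = NULL;
    int fd = -1;

    if (virAsprintf(&tmppath, "%s.backup.XXXXXX", origpath) < 0)
        return -1;

    if ((fd = mkostemp(tmppath, O_CLOEXEC)) < 0) {
        virReportSystemError(errno, _("unable to create '%s'"), tmppath);
        VIR_FREE(tmppath);
        return -1;
    }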
>>
>>>>
>>>> You may also end up with a space problem. That is - does the size of
>>>> your backup exceed the size of the space to be written. Again, I didn't
>>>> check the following code to see if/whether that's true - it just came to
>>>> mind as another one of those areas we need to consider.
>>>
>>> You cannot check space at all. The size of the backup inverse delta heavily
>>> depends on guest activity. It can be 0, it can be the full disk size.
>>>
>>
>> Ok - again, something that, having done backups before, I'd have to
>> consider... It's more of a general comment...
>>
>>>>
>>>> We also have to consider the security aspects of creating a file and
>>>> using the security manager for "domain" files (the selinux, apparmor, etc.)
>>>> setup for labels and such.
>>>
>>> While this is true, I thought this could be done in later patches, which
>>> can address other issues like disk cleanup, handling daemon
>>> restart, etc.
>>>
>>
>> If history has taught us anything it's that future patches sometimes
>> aren't written... Leaving that task to someone else to pick up.
> 
> )) True. But hey, my goal is to stabilize the backup API and a
> significant part of its implementation. We can hold it back from pushing
> upstream as long as we wish and write the missing parts meanwhile.
> However, we deal with upstream, and it is really a not-that-stable
> branch; it is up to the user to use some feature or not if
> it is not complete in all aspects (reality: many start to
> try, then use)
> 
>>
>>>>
>>>>
>>>>> +            VIR_FREE(tmppath);
>>>>> +            return -1;
>>>>> +        }
>>>>> +
>>>>> +        VIR_FREE(tmppath);
>>>>> +
>>>>> +        /* verify that we didn't generate a duplicate name */
>>>>> +        for (j = 0; j < i; j++) {
>>>>> +            if (STREQ_NULLABLE(disk->src->path, def->disks[j].src->path)) {
>>>>
>>>> Wouldn't this only work if we used the same name from within the domain?
>>>>  What if another domain used the ".backup" file name?  IOW: This isn't
>>>> checking whether a file with the name you just attempted to create
>>>> already exists on disk right now.
>>>
>>> Well, I have a copy-paste argument ) IOW, snapshots use this already.
>>> It comes from b60af444 and is here to handle the special case when
>>> image paths differ only in suffix. Ok, to make really bulletproof protection
>>> we need to make functions like virStorageFileBackendFileCreate
>>> include O_EXCL on open.
>>>
>>
>> Again - if snapshots are your basis then we should consider more code
>> sharing options. Not sure how to accomplish that at this point, but
>> that's the suggestion at this point in time.
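Agreed on O_EXCL - something along these lines (illustrative only; the
real change would go through the storage file backend):

    /* Sketch: create the backup target with O_EXCL so that a
     * pre-existing file - possibly another domain's - fails the
     * operation instead of being silently clobbered */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
    if (fd < 0) {
        virReportSystemError(errno,
                             _("failed to create backup target '%s'"),
                             path);
        return -1;
    }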
>>
>> [...]
>>
>>>>> +    /* Double check requested disks.  */
>>>>> +    for (i = 0; i < def->ndisks; i++) {
>>>>
>>>> I cannot believe the number of times we traverse this array. For 1 or 2
>>>> disks - easy... For 100-200 disks, that could be brutal, especially on
>>>> error paths...
>>>
>>> At least we stay at O(n) complexity. Is '200 disks' for real?
>>>
>>
>> In a former company, 1000 was the test case.  People use guests for all
>> different kinds of things. I don't have that many disks, but there is a
>> numerical problem. Recently it was pointed out on the list that going
>> through ndisks in migration code should be done in one place, and let's
>> try to limit the number of times we need to run through the list.
>>
>> Works great in my test environment of 1 or 2 disks, can't understand why
>> it's failing for (name your favorite large company) with many more disks...
> 
> It is a testing issue, actually. Nothing prevents this code from working
> well with a large collection of disks from a theoretical POV, and it is
> impossible to predict how minimizing iterations will influence this
> aspect... But I'll give it a try; I'm just not sure what the tradeoffs
> are from a number-of-lines/program-structure-complexity POV
> 
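What I'm picturing is folding the separate per-disk passes into one - a
sketch, where the type name follows the patch's backup_conf naming and
the check* helpers stand in for the existing validation logic:

    size_t i;

    /* one validation pass instead of a loop per check */
    for (i = 0; i < def->ndisks; i++) {
        virDomainBackupDiskDefPtr disk = &def->disks[i];

        if (checkDiskSupported(disk) < 0 ||
            checkTargetPath(disk) < 0 ||
            checkNoDuplicateTarget(def, i) < 0)
            goto cleanup;
    }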
>>
>> [...]
>>
>>>>> +static int
>>>>> +qemuDomainBackupExportDisks(virQEMUDriverPtr driver,
>>>>> +                            virDomainObjPtr vm,
>>>>> +                            virDomainBackupDefPtr def)
>>>>> +{
>>>>> +    qemuDomainObjPrivatePtr priv = vm->privateData;
>>>>> +    char *dev = NULL;
>>>>> +    char *dev_backup = NULL;
>>>>> +    int ret = -1, rc;
>>>>> +    virJSONValuePtr actions = NULL;
>>>>> +    size_t i;
>>>>> +    virHashTablePtr block_info = NULL;
>>>>> +
>>>>> +    qemuDomainObjEnterMonitor(driver, vm);
>>>>> +    rc = qemuMonitorNBDServerStart(priv->mon, def->address.data.ip.host,
>>>>> +                                   def->address.data.ip.port);
>>>>
>>>> Well, it would have been really good to note somewhere earlier that
>>>> the NBD server was being used.
>>>
>>> You mean track NBD server usage? To give better errors or something?
>>>
>>
>> Like you have a reliance on the guest agent for quiesce, you have a
>> reliance on the NBD server to do something...
>>
>> I think this is where I went back and wrote that you need to consider
>> networked disks in your XML (the auth portion).
>>
>> I have a bz that's looking to add/allow TLS for NBD server starts...
>>
>>>>
>>>>> +    if (qemuDomainObjExitMonitor(driver, vm) < 0 || rc < 0)
>>>>> +        return -1;
>>>>> +
>>>>> +    if (!(actions = virJSONValueNewArray()))
>>>>> +        goto cleanup;
>>>>> +
>>>>> +    qemuDomainObjEnterMonitor(driver, vm);
>>>>> +    block_info = qemuMonitorGetBlockInfo(priv->mon);
>>>>> +    if (qemuDomainObjExitMonitor(driver, vm) < 0 || !block_info)
>>>>> +        goto cleanup;
>>>>> +
>>>>
>>>> Each of the following goes through the ndisks array and does
>>>> multiple/different things, but it's not clear what error recovery looks
>>>> like if any one fails - the cleanup path is, shall we say, sparse while
>>>> the setup is multiple steps.  I haven't compared it to the migration code...
>>>
>>> I tried to address failure in the code, so that flags are set after a
>>> specific action in the loop, like 'blockdev'. I doubt that splitting
>>> the loop will make the code more comprehensible. On cleanup you need to undo
>>> setup actions in reverse order starting from the last successful one, which
>>> means using the 'i' counter after the goto, and this is more complicated
>>> than setting per-disk flags, in my opinion.
>>>
>>
>> At this point the brain is already going to mush and the eyes are tired,
>> but you get the point - there's something "not quite right" with the number
>> of times going through the loop.
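For completeness, the shape I keep coming back to - a generic sketch,
with setup_disk()/teardown_disk() as placeholders for the real work:

    size_t i;

    /* forward pass: per-disk setup, remembering how far we got */
    for (i = 0; i < def->ndisks; i++) {
        if (setup_disk(&def->disks[i]) < 0)
            goto error;
    }
    return 0;

 error:
    /* unwind only the disks already set up, in reverse order */
    while (i-- > 0)
        teardown_disk(&def->disks[i]);
    return -1;

Whether that beats per-disk flags is debatable, but it keeps setup and
teardown visibly paired.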
>>
>> [...]
>>
>>>>
>>>> and realistically my time has run out.  I really didn't look too much in
>>>> depth at the qemu_driver code. My head was spinning from the number of
>>>> times going through the ndisks and the inlined code...
>>>>
>>>>
>>>> John
>>>
>>> Thank you for review!
>>>
>>> Nikolay
>>>
>>
>> Thank you for starting to think about this. It's a tricky and complex
>> issue based upon code that isn't quite baked in QEMU yet. I think given
>> more cycles in my day and perhaps smaller piles of patches eventually
>> we'll hammer out some working code.  I'm sure I'll learn a lot about
>> blockdev during the next few months too...
>>
> 
> Sounds great!
> 
> Nikolay
> 



