[libvirt] [PATCHv2 13/14] Exit early after domain crash in monitor on snapshots

Wed Jan 14 14:27:52 UTC 2015

On 01/13/2015 03:43 PM, John Ferlan wrote:
> <no commit message>
> 
> On 01/07/2015 10:42 AM, Ján Tomko wrote:
>> ---
>>  src/qemu/qemu_domain.c |  2 +-
>>  src/qemu/qemu_driver.c | 18 ++++++------------
>>  2 files changed, 7 insertions(+), 13 deletions(-)
>>
>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>> index 82d0c91..3d56eb8 100644
>> --- a/src/qemu/qemu_domain.c
>> +++ b/src/qemu/qemu_domain.c
>> @@ -2313,7 +2313,7 @@ qemuDomainSnapshotDiscard(virQEMUDriverPtr driver,
>>              qemuDomainObjEnterMonitor(driver, vm);
>>              /* we continue on even in the face of error */
>>              qemuMonitorDeleteSnapshot(priv->mon, snap->def->name);
>> -            qemuDomainObjExitMonitor(driver, vm);
>> +            ignore_value(qemuDomainObjExitMonitor(driver, vm));
>>          }
>>      }
>>  
>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>> index 55d6fb3..a3d8983 100644
>> --- a/src/qemu/qemu_driver.c
>> +++ b/src/qemu/qemu_driver.c
>> @@ -12896,7 +12896,8 @@ qemuDomainSnapshotCreateActiveInternal(virConnectPtr conn,
>>      }
>>  
>>      ret = qemuMonitorCreateSnapshot(priv->mon, snap->def->name);
>> -    qemuDomainObjExitMonitor(driver, vm);
>> +    if (qemuDomainObjExitMonitor(driver, vm) < 0)
>> +        ret = -1;
>>      if (ret < 0)
>>          goto cleanup;
>>  
>> @@ -13417,12 +13418,8 @@ qemuDomainSnapshotCreateSingleDiskActive(virQEMUDriverPtr driver,
>>      ret = qemuMonitorDiskSnapshot(priv->mon, actions, device, source,
>>                                    formatStr, reuse);
>>      if (!actions) {
>> -        qemuDomainObjExitMonitor(driver, vm);
>> -        if (!virDomainObjIsActive(vm)) {
>> -            virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>> -                           _("domain crashed while taking the snapshot"));
>> +        if (qemuDomainObjExitMonitor(driver, vm) < 0)
>>              ret = -1;
>> -        }
>>      }
> 
> In this case we do fall through to Audit - what the "strange" part of
> the Audit is though is we've changed the value of 'ret' if the
> ExitMonitor fails. Could a 'successful' DiskSnapshot ret value be
> changed into a failure because the guest died for some reason. Could
> anything else jump in (I've totally lost track now :-))
> 
> If that's expected, fine, but I thought it worthy to point out. Although
> I do "suspect" it could be reasonable to expect that the crash was
> because of taking the snapshot, but who knows for sure.
> 
> ACK in general...
> 
> Now that I'm done with the various callers. It's been hard to keep track
> in my own mind of the consistency between each routine and what I may
> have caught along the way. The following 4 do spring to mind though
> 
> 1. Audit
>    I think the Audits should occur. Whether they "all" happen right
> after the qemuMonitor* call while we still have the lock or whether
> they're done afterwards doesn't matter to me.
> 

As of now, the audits are broken because they possibly access freed data.
While the vm->def that will be replaced on domain crash still contains
the basic data needed for audit, the device pointers into vm->def will be stale.

The aim of this series is to remove the crash possibility by not using those
pointers.

While we are in monitor, the domain object is unlocked, so even accessing
vm->def should not be safe. After exiting the monitor, we'd need a copy of the
device that was not a part of vm->def, which we have for attach and
things like setvcpus, where the auditted 'device' is just an int, but not for
detach.

Regardless of this series, the audit logging is inconsistent and should be
IMHO addressed separately.

> 2. EndJob
>    I started noticing some cases where EndJob was called and other times
> where it was missed. Of course I didn't go back to earlier patches to
> check all cases, but that might be something you want to do to ensure
> we're not leaving somewhere without the EndJob
> 
> 3. Possibility of ret = qemuMonitor* == 0 and ExitMonitor == -1
>    While I agree it's highly likely that ret == -1 prior to ExitMonitor
> calls where sometimes ret = -1 is done and other times it's not. Still,
> if it is possible, then it needs to be accounted for. It doesn't seem
> so, but to have some call sequences do things one way while others do it
> differently makes it difficult to keep track.
> 

Ouch, that's probably a result of me spending less time per line on the last 6
patches. I'll take another look before pushing/resending. (So far I've got
patches 1, 2, 6-8 queued as I believe there were no audit issues there)

> 4. Consistency
>    I realize the inconsistencies existed before these patches, but since
> you're tackling all the callers in this series of patches - it may be
> nice to make things consistent. That way the next person who copies what
> some command does will then hopefully create consistent code. Another
> option is to make a macro that handles the ExitMonitor and goto's...
> 

I'm not sure I understand what should be consistent and can't imagine a
universal and readable macro for that.

For the audit, if qemuMonitor* succeeds but ExitMonitor fails, it would be
nice to log 'true', but that would be nicer in a separate series adressing all
the audit calls.

As for the ignore_value vs. goto cleanup, I try to use goto when possible,
because ingore value is harder to read to me.

Jan

> 
> John
> 
>>  
>>      virDomainAuditDisk(vm, disk->src, snap->src, "snapshot", ret >= 0);
>> @@ -13560,12 +13557,8 @@ qemuDomainSnapshotCreateDiskActive(virQEMUDriverPtr driver,
>>          if (ret == 0) {
>>              if (qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob) == 0) {
>>                  ret = qemuMonitorTransaction(priv->mon, actions);
>> -                qemuDomainObjExitMonitor(driver, vm);
>> -                if (!virDomainObjIsActive(vm)) {
>> -                    virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>> -                                   _("domain crashed while taking the snapshot"));
>> +                if (qemuDomainObjExitMonitor(driver, vm) < 0)
>>                      ret = -1;
>> -                }
>>              } else {
>>                  /* failed to enter monitor, clean stuff up and quit */
>>                  ret = -1;
>> @@ -14656,7 +14649,8 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot,
>>              }
>>              qemuDomainObjEnterMonitor(driver, vm);
>>              rc = qemuMonitorLoadSnapshot(priv->mon, snap->def->name);
>> -            qemuDomainObjExitMonitor(driver, vm);
>> +            if (qemuDomainObjExitMonitor(driver, vm) < 0)
>> +                goto endjob;
>>              if (rc < 0) {
>>                  /* XXX resume domain if it was running before the
>>                   * failed loadvm attempt? */
>>
> 
> --
> libvir-list mailing list
> libvir-list at redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20150114/21845e87/attachment-0001.sig>