Permission to disk set wrong when restoring from memory snapshot?

Liran Rotenberg lrotenbe at redhat.com
Tue Jun 9 13:41:36 UTC 2020


On Tue, Jun 9, 2020 at 4:14 PM Peter Krempa <pkrempa at redhat.com> wrote:

> On Tue, Jun 09, 2020 at 16:00:02 +0300, Liran Rotenberg wrote:
> > On Tue, Jun 9, 2020 at 3:46 PM Peter Krempa <pkrempa at redhat.com> wrote:
> >
> > > On Tue, Jun 09, 2020 at 15:38:53 +0300, Liran Rotenberg wrote:
> > > > Hi all,
> > > > Passing on Bug 1840609
> > > > <https://bugzilla.redhat.com/show_bug.cgi?id=1840609>
> > > > - Wake up from hibernation failed:internal error: unable to execute
> > > > QEMU command 'cont': Failed to get "write" lock.
> > > >
> > > > In oVirt/RHV there is a specific flow that prevents the VM from
> > > > starting on the first host.
> > > > The result is:
> > > > 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock
> > > > Is another process using the image
> > > > [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]?
> > > > 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from
> > > > pid 177876 (<unknown process>)
> > >
> > > This error comes from qemu's internal file locking. It usually means
> > > that there is another qemu or qemu-img which has the given image open.
> > >
> > > Is there anything which would access the image at, or right around,
> > > that specific time?
> > >
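(Side note, in case it helps anyone reproducing this: since the message comes
from qemu's own byte-range (OFD) locks on the image, a quick way to check
whether some other process still holds the image is a small probe like the
sketch below. This is only a diagnostic I put together for illustration, not
part of the RHV flow; as far as I understand, qemu's OFD locks conflict with
ordinary POSIX record locks from other processes. It assumes Linux, Python 3,
and that the image path from the error above is reachable and writable on the
host.)

# lockprobe.py - rough diagnostic sketch only (assumes Linux, Python 3,
# and that the image is writable from this host).
# qemu takes OFD byte-range locks on the image; those conflict with plain
# POSIX record locks from other processes, so if this non-blocking request
# fails, something else still holds the image.
import fcntl
import sys

path = sys.argv[1]  # e.g. the .../7f553f44-db08-480e-8c86-cbdeccedfafe path above

with open(path, "rb+") as img:  # write access is needed for an exclusive lock
    try:
        fcntl.lockf(img, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as err:
        print("conflicting lock on %s: %s" % (path, err))
    else:
        fcntl.lockf(img, fcntl.LOCK_UN)
        print("no conflicting lock on %s" % path)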
> > I don't think so. The volumes are created and added to the volume chain
> > in the VM metadata domxml (two snapshots created). Then the user restores
> > the latest snapshot and deletes the snapshots (while the VM is down) - they
> > are removed. The VM is then set with the volume and started, restoring the
> > memory. The mount point (in /rhev/data-center) points to the same disk and
> > volume. On the first run I see a new path, /rhev/data-center/<uuid>..., and
> > I can't tell why or where it comes from. It is set with 'rw', while the
> > normal destination on the shared NFS is only 'r'.
>
> That's definitely something not related to the locking itself.
>
> Please attach the XML documents used to start the VM on both hosts
> (working and non-working). I can't tell if there's a difference without
> seeing them.
>
> The difference might very well be in the XMLs.
>
I added a snippet from the engine log. The first two domxmls are from the
first attempt on 'host_mixed_1', resulting in:
"unable to execute QEMU command 'cont': Failed to get "write" lock."
From that point on, it's the second attempt on 'host_mixed_2' (2020-06-09
15:13:00,308+03 in log time).
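Since the suspicion is a difference in the XMLs, once I have both dumps I will
diff the <disk> elements with something like the sketch below (just an
illustration; the two file names are placeholders for the working and
non-working domxml, and it only compares the source, readonly, shareable and
driver settings):

# diskdiff.py - rough sketch: compare the <disk> elements of two libvirt
# domain XMLs (the file names passed on the command line are placeholders
# for the working and non-working domxml dumps).
import sys
import xml.etree.ElementTree as ET

def disks(path):
    result = {}
    for disk in ET.parse(path).getroot().findall("./devices/disk"):
        src = disk.find("source")
        key = "?" if src is None else (src.get("file") or src.get("dev") or "?")
        drv = disk.find("driver")
        result[key] = {
            "readonly": disk.find("readonly") is not None,
            "shareable": disk.find("shareable") is not None,
            "driver": dict(drv.attrib) if drv is not None else {},
        }
    return result

working, broken = disks(sys.argv[1]), disks(sys.argv[2])
for key in sorted(set(working) | set(broken)):
    if working.get(key) != broken.get(key):
        print(key)
        print("  working:    ", working.get(key))
        print("  non-working:", broken.get(key))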

>
> > > It might be a race condition from something trying to modify the image
> > > perhaps combined with propagation of locks via NFS.
> > >
> > I can't see any race from the RHV point of view. The strangest thing is
> > why it works on the second host.
> > In RHV after the VM is up we remove the memory disk, but that doesn't
> > happen in this case, since the VM wasn't up.
>
> I'd speculate that it's timing. There's more time for the locks to
> propagate or an offending process to terminate. The error message
> definitely does not look like a permission issue, and
> qemuSetupImagePathCgroup:75 is in the end a no-op on non-device
> paths (makes sense only for LVM logical volumes, partitions, etc.).
>
>
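On the 'rw' vs 'r' observation above: to rule out a plain filesystem
permission difference on our side, I can also compare what the two paths
actually resolve to and their ownership/mode on each host. A minimal sketch
(both paths passed on the command line are placeholders, e.g. the
/rhev/data-center/<uuid> path and the regular NFS-backed one):

# pathcheck.py - minimal sketch: show what two image paths resolve to and
# their ownership/mode, to see whether they really point at the same file.
# (Both command-line paths are placeholders.)
import os
import stat
import sys

for path in sys.argv[1:]:
    st = os.stat(path)
    print(path)
    print("  resolves to:", os.path.realpath(path))
    print("  uid:gid     %d:%d" % (st.st_uid, st.st_gid))
    print("  mode        %s" % stat.filemode(st.st_mode))
    print("  writable    %s" % os.access(path, os.W_OK))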
-------------- next part --------------
A non-text attachment was scrubbed...
Name: engine.log
Type: text/x-log
Size: 70263 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libvirt-users/attachments/20200609/1202d701/attachment.bin>

