<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jun 9, 2020 at 4:14 PM Peter Krempa &lt;<a href="mailto:pkrempa@redhat.com">pkrempa@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Tue, Jun 09, 2020 at 16:00:02 +0300, Liran Rotenberg wrote:<br> > On Tue, Jun 9, 2020 at 3:46 PM Peter Krempa &lt;<a href="mailto:pkrempa@redhat.com" target="_blank">pkrempa@redhat.com</a>&gt; wrote:<br> > <br> > > On Tue, Jun 09, 2020 at 15:38:53 +0300, Liran Rotenberg wrote:<br> > > > Hi all,<br> > > > Passing on Bug 1840609 &lt;<br> > > <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1840609" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1840609</a>&gt;<br> > > > - Wake up from hibernation failed:internal error: unable to execute QEMU<br> > > > command 'cont': Failed to get "write" lock.<br> > > ><br> > > > In Ovirt/RHV there is a specific flow that prevents the VM from starting<br> > > on<br> > > > the first host.<br> > > > The result is:<br> > > > 2020-06-09T12:12:58.111610Z qemu-kvm: Failed to get "write" lock<br> > > > Is another process using the image<br> > > ><br> > > [/rhev/data-center/3b67fb92-906b-11ea-bb36-482ae35a5f83/4fd23357-6047-46c9-aa81-ba6a12a9e8bd/images/0191384a-3e0a-472f-a889-d95622cb6916/7f553f44-db08-480e-8c86-cbdeccedfafe]?<br> > > > 2020-06-09T12:12:58.668140Z qemu-kvm: terminating on signal 15 from pid<br> > > > 177876 (&lt;unknown process&gt;)<br> > ><br> > > This error comes from qemu's internal file locking. It usually means<br> > > that there is another qemu or qemu-img which has the given image open.<br> > ><br> > > Is there anything which would access the image at that specific time or<br> > > slightly around?<br> > ><br> > I don't think so. The volumes are created and added to the volume chain on<br> > the VM metadata domxml(two snapshots created). 
Then the user restores the<br> > latest snapshot and deletes them (while the VM is down) - they are removed.<br> > The VM is set with the volume and going up, restoring the memory. The<br> > mounting place (in /rhev/data-center) points to the same disk and volume.<br> > On the first run I see the new place /rhev/data-center/&lt;uuid&gt;... that I<br> > can't tell why or where it comes from. It is set with 'rw', while the<br> > normal destination to the shared NFS is only with 'r'.<br> <br> That's definitely something not related to the locking itself.<br> <br> Please attach the XML document used to start the VM in both places<br> (working and non-working). I can't tell if there's a difference without<br> seeing those.<br> <br> The difference might very well be in the XMLs.<br></blockquote><div>I added a snippet from the engine log. The first two domxmls are from the first attempt on 'host_mixed_1', which results in:</div><div>"unable to execute QEMU command 'cont': Failed to get "write" lock."</div><div>From that point on, the log shows the second attempt on 'host_mixed_2' (2020-06-09 15:13:00,308+03 in log time). </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <br> > > It might be a race condition from something trying to modify the image<br> > > perhaps combined with propagation of locks via NFS.<br> > ><br> > I can't see any race from the RHV point of view. The strangest thing is why<br> > it works on the second host?<br> > In RHV after the VM is up we remove the memory disk, but that doesn't<br> > happen in this case, since the VM wasn't up.<br> <br> I'd speculate that it's timing. There's more time for the locks to<br> propagate or an offending process to terminate. 
The error message<br> definitely does not look like a permission issue, and<br> qemuSetupImagePathCgroup:75 is in the end a no-op on non-device<br> paths (makes sense only for LVM logical volumes, partitions etc).<br> <br> </blockquote></div></div>