<div dir="ltr"><div>Thanks Peter for your feedback. Interestingly the version of virsh is newer than 1.2.18 and thus should contain the fix:</div><div><br></div><div>$ virsh --version</div><div>1.3.1</div><div><br></div><div><div>$ uname -a</div><div>Linux agsserver 4.4.0-91-generic #114-Ubuntu SMP Tue Aug 8 11:56:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux</div></div><div><br></div><div><div>$ lsb_release -a</div><div>No LSB modules are available.</div><div>Distributor ID:<span style="white-space:pre"> </span>Ubuntu</div><div>Description:<span style="white-space:pre"> </span>Ubuntu 16.04.3 LTS</div><div>Release:<span style="white-space:pre"> </span>16.04</div><div>Codename:<span style="white-space:pre"> </span>xenial</div></div><div><br></div><div>But we're still having the issue. Is there anything else that you can think about? Feel free to query me for more information. I'm willing to help wherever I can because this bugs us quite regularly. We could probably improve our daily backup cronjob to retry blockcommit after a blockjob abort, but it feels so hacky that I would do that only as the last resort.</div></div><div class="gmail_extra"><br><div class="gmail_quote">2017-08-14 17:05 GMT+02:00 Peter Krempa <span dir="ltr"><<a href="mailto:pkrempa@redhat.com" target="_blank">pkrempa@redhat.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Mon, Aug 14, 2017 at 08:42:24 +0200, Dominik Psenner wrote:<br> > Hi,<br> <span class=""><br> Hi,<br> <br> ><br> > a small update on this. We have migrated the virtualized host to use the<br> > virtio drivers and now the drive performance is improved so that we can see<br> > a constant transfer rate. 
Before it used to be the same rate but regularly<br> > dropped to a few bytes/sec for a few seconds and then was fast again.<br> ><br> > However we still observe that the following fails regularly:<br> ><br> > $ virsh snapshot-create-as --domain domain --name backup --no-metadata<br> > --atomic --disk-only --diskspec hda,snapshot=external<br> > $ virsh blockcommit domain hda --active --pivot<br> > error: failed to pivot job for disk hda<br> > error: block copy still active: disk 'hda' not ready for pivot yet<br> > Could not merge changes for disk hda of domain. VM may be in invalid state.<br> <br> </span>Since this thread was renamed, please re-state the version of libvirt<br> you are using. I don't really want to dig through the old thread.<br> <span class=""><br> > Then running the following in the morning succeeds and successfully pivots<br> > the snapshot into the base image while the VM is live:<br> ><br> > $ virsh blockjob domain hda --abort<br> > $ virsh blockcommit domain hda --active --pivot<br> > Successfully pivoted<br> ><br> > We run the backup process once every day and it failed on the following<br> > days:<br> ><br> > 2017-07-07<br> > 2017-07-20<br> > 2017-07-27<br> > 2017-08-12<br> > 2017-08-14<br> ><br> > Looking at this it roughly happens once a week and the guest from then on<br> > writes into the snapshot backlog. That snapshot backlog file grows about<br> > 8 GB every day and thus the issue always needs immediate attention.<br> ><br> > Any ideas what could cause this issue? Is this a bug (race condition) of<br> > `virsh blockcommit` that sometimes fails because it is invoked at the wrong<br> > time?<br> <br> </span>So the 'virsh blockcommit domain hda --active --pivot' operation<br> consists of 3 parts:<br> <br> 1) virsh blockcommit domain hda --active<br> 2) waiting until the block job finishes<br> 3) virsh blockjob --pivot domain hda<br> <br> The problem is that sometimes 2) finishes too soon and then operation 3<br> fails. 
This should not happen any more, since there's code in virsh [1]<br> which waits for the completion event from libvirtd, which is fired only<br> when the job is actually ready to be pivoted.<br> <br> This code has a lot of fallback options in case libvirtd is old or<br> so.<br> <br> At any rate, manual pivoting later should help. Also probably updating<br> to a more recent version.<br> <br> In case you are using a fairly recent version, it's possible that there<br> are still bugs though.<br> <br> Peter<br> <br> [1]:<br> <br> commit 7408403560f7d054da75acaab855a9<wbr>5c51a92e2b<br> Author: Peter Krempa <<a href="mailto:pkrempa@redhat.com">pkrempa@redhat.com</a>><br> Date: Mon Jul 13 17:04:49 2015 +0200<br> <br> virsh: Refactor block job waiting in cmdBlockCommit<br> <br> Reuse the vshBlockJobWait infrastructure to refactor cmdBlockCommit to<br> use the common code. This additionally fixes a bug when working with<br> new qemus, where when doing an active commit with --pivot the pivoting<br> would fail, since qemu reaches 100% completion but the job doesn't<br> switch to synchronized phase right away.<br> <br> $ git describe --contains 7408403560f7d054da75acaab855a9<wbr>5c51a92e2b<br> v1.2.18-rc1~33<br> <br> </blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Dominik Psenner<br></div></div> </div>