[Libguestfs] [libvirt] Quantifying libvirt errors in launching the libguestfs appliance

Wed Jan 13 23:48:08 UTC 2016

On 01/13/2016 06:45 PM, Cole Robinson wrote:
> On 01/13/2016 05:18 AM, Richard W.M. Jones wrote:
>> As people may know, we frequently encounter errors caused by libvirt
>> when running the libguestfs appliance.
>>
>> I wanted to find out exactly how frequently these happen and classify
>> the errors, so I ran the 'virt-df' tool overnight 1700 times.  This
>> tool runs several parallel qemu:///session libvirt connections, each
>> creating a short-lived appliance guest.
>>
>> Note that I have added Cole's patch to fix https://bugzilla.redhat.com/1271183
>> "XML-RPC error : Cannot write data: Transport endpoint is not connected"
>>
>> Results:
>>
>> The test failed 538 times (32% of the time), which is pretty dismal.
>> To be fair, virt-df is aggressive about how it launches parallel
>> libvirt connections.  Most other virt-* tools use only a single
>> libvirt connection and are consequently more reliable.
>>
>> Of the failures, 518 (96%) were of the form:
>>
>>   process exited while connecting to monitor: qemu: could not load kernel '/home/rjones/d/libguestfs/tmp/.guestfs-1000/appliance.d/kernel': Permission denied
>>
>> which is https://bugzilla.redhat.com/921135 or maybe
>> https://bugzilla.redhat.com/1269975.  It's not clear to me if these
>> bugs have different causes, but if they do then potentially we're
>> seeing a mix of both since my test has no way to distinguish them.
>>
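
For reference, the percentages quoted above can be rechecked from the three counts given in the results. This is only a quick arithmetic sketch. The variable names are mine, and the last two figures (share of all runs, number of other failures) are derived from the quoted counts rather than stated in the original mail.

```python
# Counts taken from the results quoted above.
TOTAL_RUNS = 1700
FAILED_RUNS = 538
KERNEL_PERMISSION_DENIED = 518


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


failure_rate = percent(FAILED_RUNS, TOTAL_RUNS)
share_of_failures = percent(KERNEL_PERMISSION_DENIED, FAILED_RUNS)
share_of_all_runs = percent(KERNEL_PERMISSION_DENIED, TOTAL_RUNS)
other_failures = FAILED_RUNS - KERNEL_PERMISSION_DENIED

print(f"failed runs:                {failure_rate}% of all runs")
print(f"'could not load kernel':    {share_of_failures}% of failures")
print(f"'could not load kernel':    {share_of_all_runs}% of all runs")
print(f"failures with other causes: {other_failures}")
```

So roughly three runs in ten hit the kernel "Permission denied" error, and only a small remainder of the failures come from anything else.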
> 
> I just experimented with this, and I think it's the issue I suggested at:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1269975#c4
> 
> I created two VMs, kernel1 and kernel2, just booting off a kernel in
> $HOME/session-kernel/vmlinuz. Then I added this patch:
> 
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index f083f3f..5d9f0fa 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -4901,6 +4901,13 @@ qemuProcessLaunch(virConnectPtr conn,
>                                        incoming ? incoming->path : NULL) < 0)
>          goto cleanup;
> 
> +    if (STREQ(vm->def->name, "kernel1")) {
> +        for (int z = 0; z < 30; z++) {
> +            printf("kernel1: sleeping %d of 30\n", z + 1);
> +            sleep(1);
> +        }
> +    }
> +
>      /* Security manager labeled all devices, therefore
>       * if any operation from now on fails, we need to ask the caller to
>       * restore labels.
> 
> 
> The sleep lands right after the SELinux labels are set on VM startup. The race
> is then easy to reproduce with:
> 
> virsh start kernel1 (sleeps)
> virsh start kernel2 && virsh destroy kernel2
> 
> The shared vmlinuz is reset to user_home_t after kernel2 is shut down, so
> kernel1 fails to start once the patch's 30-second sleep finishes.
> 
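
To watch the label flip while running the two virsh commands above, something like the following could be left running in another terminal. It is a rough sketch, not part of the original experiment. It assumes a Linux host where the SELinux context is readable through the security.selinux extended attribute, and the helper names (selinux_label, watch_label) are made up for illustration.

```python
import os
import time


def selinux_label(path):
    """Return the SELinux context stored on path, or None if it can't be read."""
    try:
        raw = os.getxattr(path, "security.selinux")
    except (AttributeError, OSError):
        # AttributeError: platform without os.getxattr.
        # OSError: missing file, no label, or xattrs unsupported.
        return None
    return raw.rstrip(b"\0").decode("utf-8", "replace")


def watch_label(path, polls, interval=0.5):
    """Poll path and return a list of (poll_index, label) for each change seen."""
    changes = []
    previous = object()  # sentinel that never equals a real label
    for i in range(polls):
        current = selinux_label(path)
        if current != previous:
            changes.append((i, current))
            previous = current
        time.sleep(interval)
    return changes


if __name__ == "__main__":
    # Path from the experiment described above.
    kernel = os.path.expanduser("~/session-kernel/vmlinuz")
    for index, label in watch_label(kernel, polls=120):
        print(f"poll {index}: {label}")
```

If the theory is right, the output should show the label changing when the VMs start and then dropping back to a user_home_t context as soon as kernel2 is destroyed, while kernel1 is still sleeping.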
> When we detect similar issues with <disk> devices, such as when the media
> already has the expected label, we encode 'relabel=no' in the disk XML, which
> tells libvirt not to run restorecon on the disk's path when the VM is shut
> down. However, the kernel/initrd XML doesn't support this attribute, so that
> won't work there. Adding that support could be one fix.
> 
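
For anyone unfamiliar with the disk case, here is a small sketch that builds that kind of <disk> element. It is illustrative only: the image path, target device and driver settings are hypothetical, and it assumes the per-device override is expressed as a <seclabel relabel='no'/> child of <source>, as I understand the libvirt domain XML format. There is no equivalent for <kernel>/<initrd>, which is the gap described above.

```python
import xml.etree.ElementTree as ET


def disk_without_relabel(image_path, target_dev="vda"):
    """Build a <disk> element asking libvirt not to relabel the image."""
    disk = ET.Element("disk", {"type": "file", "device": "disk"})
    ET.SubElement(disk, "driver", {"name": "qemu", "type": "raw"})
    source = ET.SubElement(disk, "source", {"file": image_path})
    # Per-device security label override: leave the existing label alone.
    ET.SubElement(source, "seclabel", {"relabel": "no"})
    ET.SubElement(disk, "target", {"dev": target_dev, "bus": "virtio"})
    return ET.tostring(disk, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical image path, for illustration only.
    print(disk_without_relabel("/var/tmp/example.img"))
```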
> But I think there are longer-term plans for this type of issue using ACLs,
> virtlockd, or something similar. Michal had patches, but I don't know the
> specifics.
> 
> Unfortunately, even hardlinks share SELinux labels, so I don't think there's
> any workaround on the libguestfs side short of using a separate copy of the
> appliance kernel for each VM.
> 
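
The hardlink point follows from where the label lives: it is stored as an extended attribute on the inode, and a hardlink is just a second name for the same inode. A minimal sketch showing the difference between a hardlink and a real copy, using a throwaway temp directory and a stand-in file rather than the real appliance kernel (the function name is made up):

```python
import os
import shutil
import tempfile


def compare_link_and_copy():
    """Report whether a hardlink and a copy share the original file's inode."""
    with tempfile.TemporaryDirectory() as workdir:
        kernel = os.path.join(workdir, "kernel")
        with open(kernel, "wb") as f:
            f.write(b"stand-in for the appliance kernel\n")

        linked = os.path.join(workdir, "kernel.link")
        copied = os.path.join(workdir, "kernel.copy")
        os.link(kernel, linked)          # second name, same inode
        shutil.copyfile(kernel, copied)  # new file, new inode

        inode = os.stat(kernel).st_ino
        return {
            "hardlink_shares_inode": os.stat(linked).st_ino == inode,
            "copy_shares_inode": os.stat(copied).st_ino == inode,
        }


if __name__ == "__main__":
    print(compare_link_and_copy())
```

Only the copy gets its own inode, and therefore its own label, which is why a per-VM copy of the kernel would sidestep the race while a per-VM hardlink would not.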

Whoops, I should have checked my libvirt mail first. You guys already came to
this conclusion elsewhere in the thread :)

- Cole