[libvirt] [PATCH v1 09/21] qemu: Spawn qemu under mount namespace

Mon Dec 5 14:14:49 UTC 2016

On 05.12.2016 14:26, Daniel P. Berrange wrote:
> On Thu, Nov 24, 2016 at 03:47:58PM +0100, Michal Privoznik wrote:
>> Prime time. When it comes to spawning qemu process and
>> relabelling all the devices it's going to touch, there's inherent
>> race with other applications in the system (e.g. udev). Instead
>> of trying convincing udev to not touch libvirt managed devices,
>> we can create a separate mount namespace for the qemu, and mount
>> our own /dev there. Of course this puts more work onto us as we
>> have to maintain /dev files on each domain start and device
>> hot(un-)plug. On the other hand, this enhances security also.
>>
>> >From technical POV, on domain startup process the parent
>> (libvirtd) creates:
>>
>>   /var/lib/libvirt/qemu/$domain.dev
>>   /var/lib/libvirt/qemu/$domain.devpts
>>
>> The child (which is going to be qemu eventually) calls unshare()
>> to create new mount namespace. From now on anything that child
>> does is invisible to the parent. Child then mounts tmpfs on
>> $domain.dev (so that it still sees original /dev from the host)
>> and creates some devices (as explained in one of the previous
>> patches). The devices have to be created exactly as they are in
>> the host (including perms, seclabels, ACLs, ...). After that it
>> moves $domain.dev mount to /dev.
>>
>> What's the $domain.devpts mount there for then you ask? QEMU can
>> create PTYs for some chardevs. And historically we exposed the
>> host ends in our domain XML allowing users to connect to them.
>> Therefore we must preserve devpts mount to be shared with the
>> host's one.
>>
>> To make this patch as small as possible, creating of devices
>> configured for domain in question is implemented in next patches.
> 
> IIUC, this means that QEMU startup will be broken for any guests
> that need host devices for disk, usb/pci passthrough, etc ?

Yep.

> 
> It is nice to keep the work incremental though.  I wonder if there
> is any value in doing everything *except* the MS_MOVE of
> /var/lib/libvirt/qemu/$domain.dev -> /dev/. ie only turn it on once
> all the code is in place ?

Ah, okay.

> 
> 
>> Signed-off-by: Michal Privoznik <mprivozn at redhat.com>
>> ---
>>  src/qemu/qemu_domain.c  | 289 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  src/qemu/qemu_domain.h  |  10 ++
>>  src/qemu/qemu_process.c |  13 +++
>>  3 files changed, 312 insertions(+)
>>
>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>> index 137d4d5..d6a1c29 100644
>> --- a/src/qemu/qemu_domain.c
>> +++ b/src/qemu/qemu_domain.c
>> @@ -55,6 +55,9 @@
>>  
>>  #include <sys/time.h>
>>  #include <fcntl.h>
>> +#if defined(HAVE_SYS_MOUNT_H)
>> +# include <sys/mount.h>
>> +#endif
>>  
>>  #include <libxml/xpathInternals.h>
>>  
>> @@ -1627,6 +1630,9 @@ qemuDomainObjPrivateXMLFormat(virBufferPtr buf,
>>                            virDomainChrTypeToString(priv->monConfig->type));
>>      }
>>  
>> +    if (priv->containerized)
>> +        virBufferAddLit(buf, "<containerized/>\n");
> 
> I wonder if it is worth explicitly listing the namespaces we enabled ?
> eg
> 
>   <namespaces>
>      <mount/>
>   </namespaces>
> 
> so we're prepared to deal with us adding use of more private namespaces
> in future ?

Right. I was thinking about this but was unable to come with a reason
for other namespaces. But this doesn't hurt and looks like to be more
future-proof, so why not, right?

> 
>> +
>>      qemuDomainObjPrivateXMLFormatVcpus(buf, vm->def);
>>  
>>      if (priv->qemuCaps) {
>> @@ -1809,6 +1815,8 @@ qemuDomainObjPrivateXMLParse(xmlXPathContextPtr ctxt,
>>          goto error;
>>      }
>>  
>> +    priv->containerized = virXPathBoolean("count(./containerized) > 0", ctxt) > 0;
>> +
>>      if ((n = virXPathNodeSet("./vcpus/vcpu", ctxt, &nodes)) < 0)
>>          goto error;
>>  
>> @@ -6653,3 +6661,284 @@ qemuDomainSupportsVideoVga(virDomainVideoDefPtr video,
>>  
>>      return true;
>>  }
>> +
>> +
>> +static int
>> +qemuDomainCreateDevice(const char *device,
>> +                       const char *path,
>> +                       bool allow_noent)
>> +{
>> +    char *devicePath = NULL;
>> +    struct stat sb;
>> +    int ret = -1;
>> +
>> +    if (!STRPREFIX(device, "/dev")) {
>> +        virReportError(VIR_ERR_INTERNAL_ERROR,
>> +                       _("invalid device: %s"),
>> +                       device);
>> +        goto cleanup;
>> +    }
>> +
>> +    if (virAsprintf(&devicePath, "%s/%s",
>> +                    path, device + 4) < 0)
>> +        goto cleanup;
>> +
>> +    if (stat(device, &sb) < 0) {
>> +        if (errno == ENOENT && allow_noent) {
>> +            /* Ignore non-existent device. */
>> +            ret = 0;
>> +            goto cleanup;
>> +        }
>> +
>> +        virReportSystemError(errno, _("Unable to stat %s"), device);
>> +        goto cleanup;
>> +    }
>> +
>> +    if (virFileMakeParentPath(devicePath) < 0) {
>> +        virReportSystemError(errno,
>> +                             _("Unable to create %s"),
>> +                             devicePath);
>> +        goto cleanup;
>> +    }
>> +
>> +    if (mknod(devicePath, sb.st_mode, sb.st_rdev) < 0) {
>> +        virReportSystemError(errno,
>> +                             _("Failed to make device %s"),
>> +                             devicePath);
>> +        goto cleanup;
>> +    }
>> +
>> +    if (chown(devicePath, sb.st_uid, sb.st_gid) < 0) {
>> +        virReportSystemError(errno,
>> +                             _("Failed to chown device %s"),
>> +                             devicePath);
>> +        goto cleanup;
>> +    }
>> +
>> +    if (virFileCopyACLs(device, devicePath) < 0 &&
>> +        errno != ENOTSUP) {
>> +        virReportSystemError(errno,
>> +                             _("Failed to copy ACLs on device %s"),
>> +                             devicePath);
>> +        goto cleanup;
>> +    }
> 
> Per my comments on earlier patches, I don't think we need to have
> the chown or ACL copy. Just have the dev owned by root:root and
> let our security drivers deal with the rest.

Meanwhile, we discussed this in other thread and found out we really ned
copy as-is.

> 
>> +
>> +    ret = 0;
>> + cleanup:
>> +    VIR_FREE(devicePath);
>> +    return ret;
>> +}
>> +
>> +
>> +
>> +static int
>> +qemuDomainPopulateDevices(virQEMUDriverPtr driver,
>> +                          virDomainObjPtr vm ATTRIBUTE_UNUSED,
>> +                          const char *path)
>> +{
>> +    virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
>> +    const char *const *devices = (const char *const *) cfg->cgroupDeviceACL;
>> +    size_t i;
>> +    int ret = -1;
>> +
>> +    if (!devices)
>> +        devices = defaultDeviceACL;
>> +
>> +    for (i = 0; devices[i]; i++) {
>> +        if (qemuDomainCreateDevice(devices[i], path, true) < 0)
>> +            goto cleanup;
>> +    }
>> +
>> +    ret = 0;
>> + cleanup:
>> +    virObjectUnref(cfg);
>> +    return ret;
>> +}
>> +
>> +
>> +static int
>> +qemuDomainSetupDev(virQEMUDriverPtr driver,
>> +                   virDomainObjPtr vm,
>> +                   const char *path)
>> +{
>> +    char *mount_options = NULL;
>> +    char *opts = NULL;
>> +    int ret = -1;
>> +
>> +    VIR_DEBUG("Setting up /dev/ for domain %s", vm->def->name);
>> +
>> +    mount_options = virSecurityManagerGetMountOptions(driver->securityManager,
>> +                                                      vm->def);
>> +
>> +    if (!mount_options &&
>> +        VIR_STRDUP(mount_options, "") < 0)
>> +        goto cleanup;
>> +
>> +    /*
>> +     * tmpfs is limited to 64kb, since we only have device nodes in there
>> +     * and don't want to DOS the entire OS RAM usage
>> +     */
>> +    if (virAsprintf(&opts,
>> +                    "mode=755,size=65536%s", mount_options) < 0)
>> +        goto cleanup;
>> +
>> +    if (virFileSetupDev(path, opts) < 0)
>> +        goto cleanup;
>> +
>> +    if (qemuDomainPopulateDevices(driver, vm, path) < 0)
>> +        goto cleanup;
>> +
>> +    ret = 0;
>> + cleanup:
>> +    VIR_FREE(opts);
>> +    VIR_FREE(mount_options);
>> +    return ret;
>> +}
>> +
>> +
>> +int
>> +qemuDomainBuildNamespace(virQEMUDriverPtr driver,
>> +                         virDomainObjPtr vm)
>> +{
>> +    virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
>> +    qemuDomainObjPrivatePtr priv = vm->privateData;
>> +    const unsigned long mount_flags = MS_MOVE;
>> +    char *devPath = NULL;
>> +    char *devptsPath = NULL;
>> +    int ret = -1;
>> +
>> +    if (!priv->containerized) {
>> +        ret = 0;
>> +        goto cleanup;
>> +    }
>> +
>> +    if (virAsprintf(&devPath, "%s/%s.dev",
>> +                    cfg->stateDir, vm->def->name) < 0 ||
>> +        virAsprintf(&devptsPath, "%s/%s.devpts",
>> +                    cfg->stateDir, vm->def->name) < 0)
>> +        goto cleanup;
>> +
>> +    if (qemuDomainSetupDev(driver, vm, devPath) < 0)
>> +        goto cleanup;
>> +
>> +    /* Save /dev/pts mount point because /dev/pts/NNN is exposed in our live
>> +     * XML and other applications are supposed to be able to use it. */
>> +    if (mount("/dev/pts", devptsPath, NULL, mount_flags, NULL) < 0) {
>> +        virReportSystemError(errno, "%s",
>> +                             _("Unable to move /dev/pts mount"));
>> +        goto cleanup;
>> +    }
>> +
>> +    if (mount(devPath, "/dev", NULL, mount_flags, NULL) < 0) {
>> +        virReportSystemError(errno,
>> +                             _("Failed to mount %s on /dev"),
>> +                             devPath);
>> +        goto cleanup;
>> +    }
>> +
>> +    if (virFileMakePath("/dev/pts") < 0) {
>> +        virReportSystemError(errno, "%s",
>> +                             _("Cannot create /dev/pts"));
>> +        goto cleanup;
>> +    }
>> +
>> +    if (mount(devptsPath, "/dev/pts", NULL, mount_flags, NULL) < 0) {
>> +        virReportSystemError(errno,
>> +                             _("Failed to mount %s on /dev/pts"),
>> +                             devptsPath);
>> +        goto cleanup;
>> +    }
>> +
>> +    if (virFileBindMountDevice("/dev/pts/ptmx", "/dev/ptmx") < 0)
>> +        goto cleanup;
>> +
>> +    VIR_DEBUG("blaaah: %d", system("find /dev/ -ls > /tmp/blaaah"));
>> +    VIR_DEBUG("blaaah: %d", system("echo >> /tmp/blaaah"));
>> +    VIR_DEBUG("blaaah: %d", system("mount >> /tmp/blaaah"));
> 
> Heh, :-)
> 

D'oh! I remembered I needed to undo some debug lines somewhere...

> 
>> +    ret = 0;
>> + cleanup:
>> +    virObjectUnref(cfg);
>> +    VIR_FREE(devPath);
>> +    return ret;
>> +}
>> +
>> +
>> +int
>> +qemuDomainCreateNamespace(virQEMUDriverPtr driver,
>> +                          virDomainObjPtr vm)
>> +{
>> +    virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
>> +    qemuDomainObjPrivatePtr priv = vm->privateData;
>> +    int ret = -1;
>> +    char *path = NULL;
>> +
>> +#if !defined(__linux__)
>> +    /* Namespaces are Linux specific. On other platforms just
>> +     * carry on with the old behaviour. */
>> +    return 0;
>> +#endif
> 
> This feels quite likely to create compiler warnings on non-linux
> about unreachable code. It feels like we should just stub out
> the entire method instead.

Ah, good point.

> 
>> +
>> +    if (!virQEMUDriverIsPrivileged(driver)) {
>> +        ret = 0;
>> +        goto cleanup;
>> +    }
>> +
>> +    if (virAsprintf(&path, "%s/%s.dev",
>> +                    cfg->stateDir, vm->def->name) < 0)
>> +        goto cleanup;
>> +
>> +    if (virFileMakePath(path) < 0) {
>> +        virReportSystemError(errno,
>> +                             _("Failed to create %s"),
>> +                             path);
>> +        goto cleanup;
>> +    }
>> +
>> +    VIR_FREE(path);
>> +    if (virAsprintf(&path, "%s/%s.devpts",
>> +                    cfg->stateDir, vm->def->name) < 0)
>> +        goto cleanup;
>> +
>> +    if (virFileMakePath(path) < 0) {
>> +        virReportSystemError(errno,
>> +                             _("Failed to create %s"),
>> +                             path);
>> +        goto cleanup;
>> +    }
>> +
>> +    priv->containerized = true;
>> +    ret = 0;
>> + cleanup:
>> +    VIR_FREE(path);
>> +    virObjectUnref(cfg);
>> +    return ret;
>> +}
>> +
>> +
>> +void
>> +qemuDomainDeleteNamespace(virQEMUDriverPtr driver,
>> +                          virDomainObjPtr vm)
>> +{
>> +    virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
>> +    qemuDomainObjPrivatePtr priv = vm->privateData;
>> +    char *path;
>> +
>> +    if (!priv->containerized)
>> +        return;
>> +
>> +    if (virAsprintf(&path, "%s/%s.dev",
>> +                    cfg->stateDir, vm->def->name) < 0)
>> +        goto cleanup;
>> +
>> +    virFileDeleteTree(path);
>> +
>> +    VIR_FREE(path);
>> +    if (virAsprintf(&path, "%s/%s.devpts",
>> +                    cfg->stateDir, vm->def->name) < 0)
>> +        goto cleanup;
>> +
>> +    virFileDeleteTree(path);
>> + cleanup:
>> +    virObjectUnref(cfg);
>> +    VIR_FREE(path);
>> +}
> 
> This is running in the host namespace and custom /dev tmpfs
> was only ever mounted in the QEMU private namespace, so
> should be invisible to the host. As such, IIUC, these
> directories should both be 100% empty. IOW, it seems that
> we don't need virFileDeleteTree - plain rmdir should be
> sufficient and safer against accidents.

Again, very good point. I'll fix it in v2.

Michal