[PATCH] qemu_namespace: Umount the original /dev before replacing it with tmpfs

Michal Prívozník mprivozn at redhat.com
Wed Jan 18 07:42:40 UTC 2023


On 1/18/23 01:58, Jim Fehlig wrote:
> On 12/16/22 08:04, Michal Privoznik wrote:
>> Our code relies on mount events propagating into the namespace we
>> create for a domain. However, there's one caveat. In v8.8.0-rc1~8
>> I've tried to make us detect differences in mount tables between
>> the namespace in which libvirtd runs and the domain namespace.
>> This is crucial for any mounts that happen after the domain was
>> started (for instance new hugetlbfs can be mounted on say
>> /dev/hugepages1G).
>>
>> Therefore, we take a look into /proc/$(pgrep qemu)/mounts to see
>> what filesystems are mounted under /dev. Now, since we don't
>> umount the original /dev, just mount a tmpfs over it, we get all
>> the events (e.g. aforementioned hugetlbfs mount on
>> /dev/hugepages1G), but we are not really able to access it
>> because of the tmpfs that's placed on top. This then confuses our
>> algorithm for detecting which filesystems are mounted (the
>> algorithm is implemented in qemuDomainGetPreservedMounts()).
>>
>> To break the link between host's and guest's /dev we just need to
>> umount() the original /dev in the namespace. Just before our
>> artificially created tmpfs is moved into its place.
>>
>> Fixes: 46b03819ae8d833b11c2aaccb2c2a0361727f51b
>> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2151869#c6
>> Signed-off-by: Michal Privoznik <mprivozn at redhat.com>
>> ---
>>   src/qemu/qemu_namespace.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c
>> index 90c0b90024..a6b9af1307 100644
>> --- a/src/qemu/qemu_namespace.c
>> +++ b/src/qemu/qemu_namespace.c
>> @@ -775,6 +775,11 @@ qemuDomainUnshareNamespace(virQEMUDriverConfig *cfg,
>>               goto cleanup;
>>       }
>>   +    if (umount("/dev") < 0) {
>> +        virReportSystemError(errno, "%s", _("failed to umount devfs
>> on /dev"));
>> +        return -1;
>> +    }
> 
> Hi Michal,
> 
> While doing some downstream testing of 9.0.0 with apparmor confinement
> of libvirtd enabled, I encountered the above error when attempting to
> start a VM.

Ah, I keep forgetting about apparmor, sorry.

> Additionally from the audit subsystem
> 
> kernel: audit: type=1400 audit(1674002774.461:13): apparmor="DENIED"
> operation="umount" profile="libvirtd" name="/dev/" pid=4778
> comm="rpc-libvirtd"
> 
> Hmm, libvirtd needs to umount /dev, is that right? I added
> 
>    umount /dev/,
> 
> to the usr.sbin.libvirtd profile, reloaded it, and was then able to
> start the VM. That seems like a big hammer. I was expecting the umount
> in /run/libvirt/qemu/*.dev, but the profile already contains a rule for
> that path.

Yeah, we really need to umount /dev and not /run/libvirt/.../*.dev. The
reason is twofold:

1) security - we used to mount a tmpfs over an existing /dev. Therefore,
the original /dev with all devices was still available. If a malicious
QEMU did just 'umount /dev' it would unmount the top most /dev (i.e. our
tmpfs) exposing the original /dev. I agree that this is more theoretical
threat than anything, because QEMU wouldn't probably have the capability
to play with mount table.

2) as the commit message explains, it's needed because of
qemuDomainGetPreservedMounts(). Long story short, when a QEMU is running
and we are deciding whether to make a new node under /dev we look at the
mount table of the QEMU process. This is because it's only devtmpfs we
are replacing. All other filesystems mounted under /dev are preserved
(e.g. /dev/pts, /dev/mqueue, /dev/shm, ...). Since these are preserved
(i.e. they are NOT a copy of their image in the parent namespace, they
ARE the parent namespace mounts), we don't need to mknod() anything.
For instance, when hotplugging a serial console with /dev/pts/123
backend, we must not create it in the QEMU's namespace as it already
exists. But we need to mknod() say /dev/sda1. Now, up until this it
worked just fine. Problem is, when a new filesystem is mounted after
QEMU was started.

What I've seen (and what the associated bug is about), is when another
hugetlbfs is mounted after a guest was started. For instance:

1) virsh start guest
2) mount -t hugetlbfs hugetlbfs /dev/hugepages1G -o pagesize=1G
3) systemctl restart libvirtd
4) hotplug a <memory/> into the guest that is backed by 1GB hugepages

At the last step, we want to take a look into the mount table of running
QEMU and see whether /dev/hugepages1G is mounted. Well, if the original
devtmpfs is kept mounted, then the mount event is propagated into QEMU's
mount table (even though it's inaccessible). Maybe the following shell
commands explain it better:

Domain 'fedora' started

# nsenter -a -t $(pgrep qemu) mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
hugetlb on /sys/fs/cgroup/hugetlb type cgroup
(rw,nosuid,nodev,noexec,relatime,hugetlb)


# mount -t hugetlbfs hugetlbfs /dev/hugepages1G -o pagesize=1G

# nsenter -a -t $(pgrep qemu) mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
hugetlb on /sys/fs/cgroup/hugetlb type cgroup
(rw,nosuid,nodev,noexec,relatime,hugetlb)
hugetlbfs on /dev/hugepages1G type hugetlbfs (rw,relatime,pagesize=1024M)

# nsenter -a -t $(pgrep qemu) ls -l /dev/hugepages1G
ls: cannot access '/dev/hugepages1G': No such file or directory

The reason why this new mount point exists in the QEMU's mount table is
the original devtmpfs. Therefore, to break this 'link', we need to
umount the original devtmpfs after which things work as expected:

# virsh start fedora
Domain 'fedora' started

# nsenter -a -t $(pgrep qemu) mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
hugetlb on /sys/fs/cgroup/hugetlb type cgroup
(rw,nosuid,nodev,noexec,relatime,hugetlb)

# mount -t hugetlbfs hugetlbfs /dev/hugepages1G -o pagesize=1G

# nsenter -a -t $(pgrep qemu) mount | grep huge
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
hugetlb on /sys/fs/cgroup/hugetlb type cgroup
(rw,nosuid,nodev,noexec,relatime,hugetlb)

I hope this explains the reason why we need to unmount the original /dev.

Michal



More information about the libvir-list mailing list