<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Thu, Jul 5, 2018 at 4:20 PM Jason Baron &lt;<a href="mailto:jbaron@akamai.com">jbaron@akamai.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br> <br> Opening tap devices, such as macvtap, that are created in containers is<br> problematic because the interface for opening tap devices is via<br> /dev/tapNN and devtmpfs is not typically mounted inside a container, as<br> it's not namespace aware. It is possible to do a mknod() in the<br> container once the tap devices are created. However, since the tap<br> devices are created dynamically, it's not possible to allow access to<br> certain major/minor numbers a priori, since we don't know what these are<br> going to be. In addition, it's desirable to not allow the mknod capability in<br> containers. This behavior, I think, is somewhat inconsistent with the<br> tuntap driver, where one can create tuntap devices inside a container by<br> first opening /dev/net/tun and then using them by supplying the tuntap<br> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the<br> network namespace, one is limited to opening network devices that belong<br> to one's current network namespace.<br> <br> Here are some options for this issue that I wanted to get feedback<br> on, and I'm wondering if anybody else has run into this.<br> <br> 1)<br> <br> Don't create the tap device, such as macvtap, in the container. Instead,<br> create the tap device outside of the container and then move it into the<br> desired container network namespace. In addition, do a mknod() for the<br> corresponding /dev/tapNN device from outside the container before doing<br> chroot().<br> <br> This solution still doesn't allow tap devices to be created inside the<br> container. 
Thus, in the case of kubevirt, which runs libvirtd inside of<br> a container, it would mean changing libvirtd to open existing tap<br> devices (as opposed to the current behavior of creating new ones). This<br> would not require any kernel changes but, as mentioned, seems<br> inconsistent with the tuntap interface.<br></blockquote><div><br></div><div>For KubeVirt, regardless of how exactly the device ends up in the container, I would want to pursue an approach where all network preparations that require privileges happen in a privileged process *outside* of the container, as CNI solutions do. They run outside, have privileges, and then either create devices in the right network/mount namespace or move them there. The final goal for KubeVirt is that our pod with the qemu process is completely unprivileged and all privileged setup happens from outside.</div><div><br></div><div>As a consequence, and depending on which route Dan pursues with the restructured libvirt, I would assume that either a privileged libvirtd part outside of the containers creates the devices by entering the right namespaces, or libvirt in the container can consume pre-created tun/tap devices, as qemu does.</div><div><br></div><div>Best Regards,</div><div>Roman</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> 2)<br> <br> Add a new kernel interface for tap devices similar to how /dev/net/tun<br> currently works. It might be nice to use TUNSETIFF for tap devices, but<br> because tap devices have different fops they can't be easily switched<br> after open(). 
So the suggestion is a new ioctl (TUNGETFDBYNAME?), where<br> the tap device name is supplied and a new fd (distinct from the fd<br> returned by the open of /dev/net/tun) is returned in an output field of<br> the new ioctl parameter.<br> <br> It may not make sense to have this new ioctl call for /dev/net/tun, since<br> it's really about opening a tap device, so it may make sense to introduce<br> it as part of a new device, such as /dev/net/tap. This new ioctl could<br> be used for macvtap and ipvtap (or any tap device). I think it might<br> also improve performance for tuntap devices themselves if they are<br> opened this way, since currently all tun operations such as read() and<br> write() take a reference count on the underlying tuntap device, because it<br> can be changed via TUNSETIFF. I tested this interface out, so I can<br> provide the kernel changes if that's helpful for clarification.<br> <br> Thanks,<br> <br> -Jason<br> </blockquote></div></div>
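<div>For reference, the existing tuntap flow mentioned in the quoted mail (open /dev/net/tun, then name the device via TUNSETIFF) can be sketched roughly as below. This is only an illustration, not code from the thread: the helper names build_ifreq and open_tap are hypothetical, the ioctl number and flag values are copied from linux/if_tun.h, and the 40-byte struct ifreq size assumes 64-bit Linux. It covers only the current TUNSETIFF path, not the proposed new ioctl.</div>
<pre>
```python
import fcntl
import os
import struct

# Values from linux/if_tun.h and linux/if.h.
TUNSETIFF = 0x400454CA   # _IOW('T', 202, int)
IFF_TAP = 0x0002         # layer-2 (Ethernet) device
IFF_NO_PI = 0x1000       # no packet-info header on read()/write()
IFNAMSIZ = 16            # interface name buffer, including the NUL


def build_ifreq(name, flags=IFF_TAP | IFF_NO_PI):
    """Pack a struct ifreq carrying the interface name and flags.

    Assumption: sizeof(struct ifreq) is 40, as on 64-bit Linux
    (16 bytes of name, 2 bytes of flags, 22 bytes of union padding).
    """
    raw = name.encode("ascii")
    if len(raw) >= IFNAMSIZ:
        raise ValueError("interface name too long")
    return struct.pack("16sH22x", raw, flags)


def open_tap(name):
    """Hypothetical helper: attach to (or create) a tuntap device by name.

    TUNSETIFF resolves the name in the caller's network namespace, which
    is the namespace check referred to in the quoted mail.
    """
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name))
    except OSError:
        os.close(fd)
        raise
    return fd
```
</pre>
<div>Calling open_tap("tap0") needs /dev/net/tun to be present in the container and CAP_NET_ADMIN in the owning user namespace when the device does not exist yet; frames are then read from and written to the returned fd.</div>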