
[libvirt-users] [libvirt] LXC, user namespaces and systemd


My colleagues at Samsung and I are trying to run systemd in a Linux container. I saw that others are experimenting with this topic, so I would like to present the results of my work and tests; perhaps they will be helpful to others.

As a starting point I used the guide written by Daniel: https://www.berrange.com/posts/2013/08/12/running-a-full-fedora-os-inside-a-libvirt-lxc-guest/
After many attempts, I managed to run systemd. Let's move on to the specifics.

1. Host configuration, Fedora 20

- kernel 3.14 with NAMESPACES, UTS_NS, IPC_NS, USER_NS, PID_NS, NET_NS enabled in the kernel config
I used kernel-3.14.0-0.rc2.git0.1.fc21.i686.rpm, downloaded from https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide

- libvirtd (libvirt) 1.2.2
I used a libvirt build from git sources. It is important that the source contains commit 6fb42d7cdc57da453691d043d6b9bf23e2bae15e, the patch from Richard Weinberger "Ensure systemd cgroup ownership is delegated to container with userns".
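
The kernel options listed above can be checked against a kernel config file before going further. The following is only a sketch: the function name missing_ns_options is made up for illustration, and the usual Fedora location /boot/config-$(uname -r) for the config file is an assumption, not something taken from my setup notes.

```shell
# Print each required namespace option that is NOT built into the kernel
# described by the given config file. No output means all are enabled.
# Hypothetical usage: missing_ns_options "/boot/config-$(uname -r)"
missing_ns_options() {
    config_file=$1
    for opt in NAMESPACES UTS_NS IPC_NS USER_NS PID_NS NET_NS; do
        # Built-in options appear as CONFIG_<NAME>=y in the config file.
        if ! grep -q "^CONFIG_${opt}=y\$" "$config_file"; then
            echo "$opt"
        fi
    done
}
```

If the function prints nothing, all six options are present in that config file.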

2. Container configuration

- set up the Fedora environment
# yum -y --releasever=20 --nogpg --installroot=/var/lib/libvirt/filesystems/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal openssh-server procps-ng
# echo "pts/0" >> /var/lib/libvirt/filesystems/mycontainer/etc/securetty
# chroot /var/lib/libvirt/filesystems/mycontainer /bin/passwd root

- In the final solution I want to map root inside the container to a normal user on the host. So let's create such a user (on the host):
# useradd foo -u 666
# id foo
uid=666(foo) gid=1001(foo) groups=1001(foo)
# chown -R foo:foo /var/lib/libvirt/filesystems/mycontainer

- enable the user namespace (user mapping setup); here is my full libvirt config file
# cat /etc/libvirt/lxc/container.xml

<domain type='lxc'>
  <name>mycontainer</name>
  <memory unit='KiB'>819200</memory>
  <currentMemory unit='KiB'>819200</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <os>
    <type arch='i686'>exe</type>
    <!-- init path is an assumption; the value is missing from this listing -->
    <init>/sbin/init</init>
  </os>
  <clock offset='utc'/>
  <idmap>
    <uid start='0' target='666' count='1000'/>
    <gid start='0' target='1001' count='1000'/>
  </idmap>
  <devices>
    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/var/lib/libvirt/filesystems/mycontainer'/>
      <target dir='/'/>
    </filesystem>
    <interface type='network'>
      <mac address='00:16:3e:34:a2:dd'/>
      <source network='default'/>
    </interface>
    <console type='pty'>
      <target type='lxc' port='0'/>
    </console>
  </devices>
</domain>
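
To make the uid and gid mapping in this config concrete: each entry maps the container ids start .. start+count-1 onto the host ids target .. target+count-1. Below is a small sketch of that arithmetic; the helper name map_to_host is hypothetical and is not part of libvirt.

```shell
# Translate a container uid/gid to the host id for one idmap entry.
# Arguments: <container id> <start> <target> <count>
# Prints the host id, or returns 1 if the id lies outside the mapped range.
map_to_host() {
    id=$1 start=$2 target=$3 count=$4
    if [ "$id" -lt "$start" ] || [ "$id" -ge $((start + count)) ]; then
        return 1
    fi
    echo $((target + id - start))
}
```

With the values from my config, "map_to_host 0 0 666 1000" prints 666 (container root is user foo on the host) and "map_to_host 999 0 666 1000" prints 1665. Container id 1000 and above are not mapped at all.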

3. Start container

# virsh --connect lxc:/// define /etc/libvirt/lxc/container.xml
# virsh --connect lxc:/// start mycontainer --console

If all login attempts are rejected, boot the host machine with audit=0:

# vi /etc/default/grub
GRUB_CMDLINE_LINUX=" [...] audit=0 [...]"
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot
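
After the reboot it may be worth confirming that the parameter really reached the kernel. This is a sketch only; the function name audit_disabled is made up, and reading /proc/cmdline is simply the usual way to see the running kernel's command line.

```shell
# Succeed if the given kernel command line contains the audit=0 parameter.
# Hypothetical usage on the host: audit_disabled "$(cat /proc/cmdline)"
audit_disabled() {
    # Unquoted $1 is intentional: split the command line into parameters.
    for param in $1; do
        if [ "$param" = "audit=0" ]; then
            return 0
        fi
    done
    return 1
}
```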

4. Problems and solutions

"Cannot add dependency job for unit display-manager.service, ignoring: Unit display-manager.service failed to load: No such file or directory."

Delete or comment out the line "Wants=display-manager.service" in default.target:
# cat /usr/lib/systemd/system/default.target
Description=Graphical Interface
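
Commenting out the line can also be scripted, for example from the host against the container root. The helper below is a sketch under assumptions: the name comment_out_directive is hypothetical, and the directive is used as a sed pattern, so it is meant for simple lines like the one above.

```shell
# Comment out every line in a unit file that exactly matches the directive.
# Usage: comment_out_directive <unit file> <directive line>
comment_out_directive() {
    unit_file=$1
    directive=$2
    # Write to a temporary file first so a failed sed cannot truncate the unit.
    tmp_file=$(mktemp) || return 1
    # "cat >" rewrites the file in place, so a symlinked unit stays a symlink.
    sed "s|^${directive}\$|#&|" "$unit_file" > "$tmp_file" && cat "$tmp_file" > "$unit_file"
    status=$?
    rm -f "$tmp_file"
    return $status
}
```

For this case the call would be: comment_out_directive /var/lib/libvirt/filesystems/mycontainer/usr/lib/systemd/system/default.target "Wants=display-manager.service". The same helper should work for the "OOMScoreAdjust=-900" line in dbus.service described further down.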


[FAILED] Failed to mount Huge Pages File System.
See 'systemctl status dev-hugepages.mount' for details.
[FAILED] Failed to mount Configuration File System.
See 'systemctl status sys-kernel-config.mount' for details.
[FAILED] Failed to mount Debug File System.
See 'systemctl status sys-kernel-debug.mount' for details.
[FAILED] Failed to mount FUSE Control File System.
See 'systemctl status sys-fs-fuse-connections.mount' for details.

Daniel explained the cause: "When a syscall requires CAP_SYS_ADMIN, for example, the kernel will either use capable(CAP_SYS_ADMIN) which only succeeds in the host, or ns_capable(CAP_SYS_ADMIN) which is allowed to succeed in the container. Different filesystems have differing restrictions, but at this time the vast majority of filesystems require that capable(CAP_SYS_ADMIN) succeed and thus you can only mount them in the host."

Based on that, and on the discussion about "allow some kernel filesystems to be mounted in a user namespace" from:

I decided to disable mounting these filesystems:

# systemctl mask dev-hugepages.mount
ln -s '/dev/null' '/etc/systemd/system/dev-hugepages.mount'
# systemctl mask sys-kernel-config.mount
ln -s '/dev/null' '/etc/systemd/system/sys-kernel-config.mount'
# systemctl mask sys-kernel-debug.mount
ln -s '/dev/null' '/etc/systemd/system/sys-kernel-debug.mount'
# systemctl mask sys-fs-fuse-connections.mount
ln -s '/dev/null' '/etc/systemd/system/sys-fs-fuse-connections.mount'
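
As the output shows, masking a unit just creates a link to /dev/null under /etc/systemd/system. So the same result can be prepared from the host, before the container is started, by creating those links in the container root. A sketch, with the hypothetical helper name mask_units_in_root:

```shell
# Mask systemd units inside a container root by linking them to /dev/null,
# which is the same link that "systemctl mask" reported creating above.
# Usage: mask_units_in_root <container root> <unit>...
mask_units_in_root() {
    root=$1
    shift
    mkdir -p "$root/etc/systemd/system" || return 1
    for unit in "$@"; do
        ln -sf /dev/null "$root/etc/systemd/system/$unit" || return 1
    done
}
```

For my container this would be: mask_units_in_root /var/lib/libvirt/filesystems/mycontainer dev-hugepages.mount sys-kernel-config.mount sys-kernel-debug.mount sys-fs-fuse-connections.mount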

[FAILED] Failed to start D-Bus System Message Bus.
See 'systemctl status dbus.service' for details.

Feb 26 09:26:12 localhost.localdomain systemd[1]: Starting D-Bus System Message Bus...
Feb 26 09:26:12 localhost.localdomain systemd[20]: Failed at step OOM_ADJUST spawning /bin/dbus-daemon: Permission denied

# echo -900 > /proc/20/oom_score_adj
/proc/20/oom_score_adj: Permission denied

# ls -l /proc/20/oom_score_adj
-rw-r--r--. 1 65534 65534 0 Feb 26 10:28 /proc/20/oom_score_adj

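According to the kernel documentation, the local root user in a user namespace (in the guest) cannot set the OOM score adjustment to an arbitrary value. Setting it requires full root privileges in addition to CAP_SYS_RESOURCE. Note also that the file above is owned by 65534, the overflow uid, which is what the container sees when the real owner is not mapped into its user namespace.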

To disable the OOM adjustment, delete or comment out the line "OOMScoreAdjust=-900" in dbus.service:
# cat /usr/lib/systemd/system/dbus.service
Description=D-Bus System Message Bus

ExecStart=/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
ExecReload=/bin/dbus-send --print-reply --system --type=method_call --dest=org.freedesktop.DBus / org.freedesktop.DBus.ReloadConfig

5. Final systemd start
# virsh --connect lxc:/// start mycontainer --console

systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'lxc-libvirt'.

Welcome to Fedora 20 (Heisenbug)!

Failed to install release agent, ignoring: No such file or directory
[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Reached target Paths.
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Listening on Journal Socket.
         Mounting POSIX Message Queue File System...
         Starting Journal Service...
[  OK  ] Started Journal Service.
         Starting Create static device nodes in /dev...
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting OpenSSH server daemon...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Login Service...
[  OK  ] Started OpenSSH server daemon.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.14.0-0.rc2.git0.1.fc21.i686 on an i686 (console)

localhost login: root
Last login: Wed Feb 26 09:26:21 on pts/0

- verifying which namespaces are in use
inside the container:
# ls -l /proc/self/ns/
 ipc -> ipc:[4026532341]
 mnt -> mnt:[4026532338]
 net -> net:[4026532344]
 pid -> pid:[4026532342]
 user -> user:[4026532337]
 uts -> uts:[4026532339]

outside the container (on the host):
$ ls -l /proc/self/ns/
 ipc -> ipc:[4026531839]
 mnt -> mnt:[4026531840]
 net -> net:[4026531956]
 pid -> pid:[4026531836]
 user -> user:[4026531837]
 uts -> uts:[4026531838]
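
The two listings differ in every entry, which is the point: the container runs in its own set of namespaces, including the user namespace. The comparison can be scripted as well. This is a sketch with made-up helper names (ns_id, ns_differs) that only work on link targets in the format shown above.

```shell
# Extract the numeric id from a namespace link target such as "user:[4026532337]".
ns_id() {
    id=${1#*\[}        # drop everything up to and including "["
    echo "${id%]}"     # drop the trailing "]"
}

# Succeed if two namespace link targets refer to different namespaces.
ns_differs() {
    [ "$(ns_id "$1")" != "$(ns_id "$2")" ]
}
```

On the host, the link targets can be read with readlink, for example from /proc/self/ns/user for the host shell and from /proc/<pid>/ns/user for a process that belongs to the container, and then passed to ns_differs.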

I know that no one likes to read long emails, but most of this one is config and logs. I will be grateful for comments and suggestions.

Dariusz Michaluk
Samsung R&D Institute Poland
Samsung Electronics
d michaluk samsung com
