[libvirt] [PATCH v2 03/17] vircgroup: introduce virCgroupV2DevicesAttachProg

Pavel Hrdina phrdina at redhat.com
Thu Jan 24 16:21:16 UTC 2019


On Tue, Jan 15, 2019 at 01:07:55PM +0100, Pavel Hrdina wrote:
> On Tue, Jan 15, 2019 at 10:14:33AM +0000, Daniel P. Berrangé wrote:
> > On Tue, Jan 15, 2019 at 09:58:47AM +0100, Pavel Hrdina wrote:
> > > On Mon, Jan 14, 2019 at 04:43:34PM +0000, Daniel P. Berrangé wrote:
> > > > On Mon, Jan 14, 2019 at 04:47:37PM +0100, Pavel Hrdina wrote:
> > > > > This function loads the BPF prog with prepared map into kernel and
> > > > > attaches it into guest cgroup.  It can be also used to replace existing
> > > > > program in the cgroup if we need to resize BPF map to store more rules
> > > > > for devices. The old program will be closed and removed from kernel.
> > > > > 
> > > > > There are two possible ways how to create BPF program:
> > > > > 
> > > > >     - One way is to write simple C-like code which can by compiled into
> > > > >       BPF object file which can be loaded into kernel using elfutils.
> > > > > 
> > > > >     - The second way is to define macros which looks like assembler
> > > > >       instructions and can be used directly to create BPF program that
> > > > >       can be directly loaded into kernel.
> > > > > 
> > > > > Since the program is not too complex we can use the second option.
> > > > 
> > > > I'm really not liking this to be honest. Even for this "simple" program,
> > > > I struggle to understand what the code below is doing. It is basically
> > > > assembly language, which is inevitably hard to follow even for simple
> > > > things.
> > > > 
> > > > I'd like to see us use the BPF C compiler to build the loaded
> > > > program unless there's a compelling reason why it won't work
> > > 
> > > I personally don't like it as well, but the issue is that for the
> > > compilation we would have to use clang/llvm, AFAIK there is no support
> > > to compile BPF in GCC.  Another issue is the loading part, you need to
> > > read the object file, extract correct sections which is not trivial and
> > > in order to have different size of maps we would have to modify the
> > > loaded data from the compiled object file.
> > 
> > I don't see a problem with using clang for BPF.  Reading the DPDK lists,
> > I see you can do something like this:
> > 
> >   $ clang -I iproute2/include/ -O2 -emit-llvm -c tap_bpf_program.c -o - \
> >         | llc -march=bpf -filetype=obj -o tap_bpf_program.o
> >   $ objdump -j l3_l4 -s tap_bpf_program.o > tap_bpf_program.bin
> > 
> > where the tap_bpf_program.bin should be the raw instructions to be
> > loaded.
> 
> There is no technical problem with using clang for BPF, the only issue
> that I have is that we will require clang to compile libvirt from
> source code.
> 
> > Supposedly they also have a script that turns the .bin file into a
> > C byte array which they then compile into their binary, so they
> > can just load the uint8[] directly, instead of from a separate file
> 
> This is not an issue, there is a sample code in kernel that uses
> elfutils to load the bpf.obj file into C byte array, just different
> method how the get the list of instructions.
> 
> > The matter of dynamically sized device maps is more tricky. Assuming
> > the device map size is basically a list of (major,minor) numbers,
> > that's 2 bytes per entry. We could easily do a fixed size 50 elements
> > giving 100 bytes per guest if that makes it easier. I guess the hard
> > thing is actually getting the dynamic data into the code.
> 
> 50 elements per guest will not be good enough.  Yes, in most cases it's
> perfectly OK, but there are some users having hundreds of disks inside
> a guest and we would break their usage and since we don't have any
> written limitation of how many disk/devices can be inside a guest we
> have to be able to store unlimited number of devices into the map until
> we run out of memory.
> 
> > > I can try to prepare alternative patches for that but it might me even
> > > more complex to understand than short assembly program.
> > 
> > Maybe there's a compromise/hybrid that would work simpler ? eg is
> > it possible to structure things so all the checking code is
> > statically compiled, and then we dynamically generate the instructions
> > to define the list of devices and append that to the main static
> > code. ?

This compromise would not work.  The dynamic data are stored in the BPF
map and the program uses only the map FD which is hardcoded into the
program itself.  The map FD is placed into correct register before we
call bpf_map_lookup function.  The only dynamic part of the program is
the map FD which is used on 4 places in the program.

I was also looking into the possibility of using elfutils as kernel is
using in one of the samples and that feels more complex than this simple
assembler like program.  We would have to open the binary file generated
by clang, load map data and instruction data, update the map size in the
loaded map data and do syscall in order to create the map, get the map
FD returned by syscall, inject that FD into the loaded instructions and
do syscall to create the program.  This IMHO sounds more complex and
than the "simple" assembler like program.

As a compromise we could have the C code as a comment in our source
files right before the actual assembler like version with instructions
how to use clang and llvm-objdump to get the assembler instructions.
That's what I've done when I was implementing the program.  The only
optimization was to add two macros and change some jump logic but we
can remove that optimization and make it 1:1 to what we get from
llvm-objdump.

Pavel

> This is not an issue, the list of devices is stored into the BPF map
> from user-space (libvirt) and the BPF map is used by the BPF program
> in the kernel.  The limits are that you cannot reallocate the BPF map
> (this is solved by creating new program with bigger map) and you cannot
> change the BPF map FD while the program is loaded into kernel.
> 
> Pavel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20190124/db474969/attachment-0001.sig>


More information about the libvir-list mailing list