[libvirt PATCH v2 00/16] Use nbdkit for http/ftp/ssh network drives in libvirt

Jonathon Jongsma jjongsma at redhat.com
Thu Sep 22 21:17:08 UTC 2022


On 9/21/22 1:44 PM, Jonathon Jongsma wrote:
> On 9/19/22 8:48 AM, Peter Krempa wrote:
>> On Wed, Aug 31, 2022 at 13:40:45 -0500, Jonathon Jongsma wrote:
>>> After a bit of a lengthy delay, this is the second version of this patch
>>> series. See https://bugzilla.redhat.com/show_bug.cgi?id=2016527 for more
>>> information about the goal, but the summary is that RHEL does not want to
>>> ship the qemu storage plugins for curl and ssh. Handling them outside of
>>> the qemu process provides several advantages such as reduced attack
>>> surface and stability.
>>
>> IMO it's also worth noting that it increases the complexity of the setup
>> and potentially also resource usage.
>>
>>> A quick summary of the code:
>>
>> [...]
>>
>>> Open questions
>>>   - selinux: I need some help from people more familiar with selinux to
>>>     figure out what is needed here. When selinux is enforcing, I get a
>>>     failure to launch nbdkit to serve the disks. I suspect we need a new
>>>     context and policy for /usr/sbin/nbdkit that allows it to transition
>>>     to the appropriate selinux context. The current context (on fedora)
>>>     is "system_u:object_r:bin_t:s0". When I (temporarily) change the
>>>     context to something like qemu_exec_t, I am able to start nbdkit
>>>     and the domain launches.
>>
>> Is the problem with starting the 'nbdkit' process itself or with the
>> socket?
> 
> The problem seems to be with the nbdkit process itself. Because I use 
> qemuSecurityCommandRun() to launch nbdkit, it sets the security label 
> for the nbdkit process to the security label for the domain and then 
> attempts to launch it. Presumably there is no rule allowing binaries 
> with the bin_t context to transition to the label for the domain. But as 
> I said, my selinux knowledge is quite shallow, so I may need some help 
> figuring out the proper way to do this.

Just for completeness, here's some additional information from 
SETroubleshoot when trying to start a domain using nbdkit with my 
patches. In this case I'm just executing the monolithic libvirtd daemon 
from my working directory for testing. The domain fails to start:

SELinux is preventing rpc-libvirtd from entrypoint access on the file 
/usr/sbin/nbdkit.

Additional Information:
Source Context                unconfined_u:unconfined_r:svirt_t:s0:c129,c164
Target Context                system_u:object_r:bin_t:s0
Target Objects                /usr/sbin/nbdkit [ file ]
Source                        rpc-libvirtd
Source Path                   rpc-libvirtd
Port                          <Unknown>

Raw Audit Messages
type=AVC msg=audit(1663876079.221:7519): avc:  denied  { entrypoint } 
for  pid=1750906 comm="rpc-libvirtd" path="/usr/sbin/nbdkit" dev="dm-1" 
ino=3160232 scontext=unconfined_u:unconfined_r:svirt_t:s0:c129,c164 
tcontext=system_u:object_r:bin_t:s0 tclass=file permissive=0


Then if I temporarily do something like:

# chcon -t qemu_exec_t /usr/sbin/nbdkit

nbdkit (and the domain) starts fine.
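(As an aside, chcon is a one-off test that a filesystem relabel will 
undo. To keep experimenting, one could instead persist the label, e.g.:

# semanage fcontext -a -t qemu_exec_t /usr/sbin/nbdkit
# restorecon -v /usr/sbin/nbdkit

or generate a candidate policy module straight from the AVC denial 
above with audit2allow:

# grep nbdkit /var/log/audit/audit.log | audit2allow -M nbdkit_local
# semodule -i nbdkit_local.pp

Both of these are only local workarounds for testing, not proposed 
fixes; the real fix is presumably a proper policy and file context for 
nbdkit.)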



>> At least in case of the socket we must make sure that no other process
>> can access it, especially once you pass authentication to nbdkit, to
>> avoid any kind of backdoor to authenticated storage.
> 
> nbdkit does have a --selinux-label argument which will set the label of 
> the socket.
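(For illustration, an invocation of the sort these patches would 
generate might look like the following; the socket path, label and URL 
here are made up:

# nbdkit --unix /var/lib/libvirt/qemu/nbdkit-disk0.sock \
      --selinux-label system_u:object_r:svirt_image_t:s0:c129,c164 \
      curl url=https://example.com/disk.img

--selinux-label sets the SELinux label on the listening socket, so only 
the correspondingly confined qemu process should be able to connect to 
it.)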
> 
>>
>> Few more open questions:
>>   - What if 'nbdkit' crashes
>>      With an integrated block layer, the whole VM crashes with it. Now
>>      that we have separated access to disks (this is also an issue for
>>      use of the qemu-storage-daemon), if any of the helper processes
>>      crashes we get into a new situation.
>>
>>      I think we'll need to think about:
>>          - adding an event for any of the helper processes failing
>>          - adding a lifecycle action for it (e.g. pause qemu if nbdkit
>>            dies)
>>          - think about possible recovery of the situation
> 
> Indeed. A reusable mechanism that could also cover qemu-storage-daemon 
> seems worthwhile. I'll look into it.
> 
>>
>>   - resource pinning
>>     For now all the resources are integral to the qemu process so
>>     emulator and iothread pinning can be used to steer which cpus the
>>     disk should use. With us adding new possibly cpu intensive processes
>>     we'll probably need to consider how to handle them more generally and
>>     manage their resources.
> 
> Good point. I hadn't really thought about this.
> 
>>
>>   - Integration with qemu storage daemon used for a VM
>>
>>     With the attempt to rewrite QSD in other languages it will possibly
>>     make sense to run it instead of the native qemu block layer (e.g.
>>     take advantage of memory safe languages). So we should also think
>>     about how these two will be able to coexist.
>>
>> The last point is more of a future-work thing to consider, but the
>> first two points should be considered for the final release of this
>> feature, specifically because in your current design nbdkit replaces
>> the in-qemu driver even when that driver is compiled into qemu, and
>> thus also affects existing users. Alternatively it would have to be
>> opt-in.
> 
> As far as I can tell, there is currently no good way to determine (via 
> capabilities or otherwise) whether e.g. the curl driver is compiled into 
> a particular qemu binary. If I've missed something, please let me know. 
> My recollection of previous discussions was that I should try to use 
> nbdkit if available and then fall back to trying the built-in qemu 
> driver approach. But we could change this strategy if there's a good way 
> to do it. When you talk about opt-in, do you mean per-domain, e.g. in 
> the xml? Or some other mechanism?
> 
>>
>>> Known shortcomings
>>>   - creating disks (in ssh) still isn't supported. I wanted to send out
>>>     the patch series anyway since it's been delayed too long already.
>>
>> That shouldn't be a problem; there are plenty of protocols where we
>> don't support creating the storage. Creating storage is needed only for
>> snapshots, so we can simply refuse to do it in the first place.
>>
> 


