[libvirt] [PATCH v3 04/13] security_manager: Rework metadata locking

Wed Oct 17 12:05:38 UTC 2018

On 10/17/2018 01:46 PM, Daniel P. Berrangé wrote:
> On Wed, Oct 17, 2018 at 12:52:50PM +0200, Michal Privoznik wrote:
>> On 10/17/2018 11:45 AM, Daniel P. Berrangé wrote:
>>> On Wed, Oct 17, 2018 at 11:06:46AM +0200, Michal Privoznik wrote:
>>>> Trying to use virlockd to lock metadata turns out to be too big
>>>> gun. Since we will always spawn a separate process for relabeling
>>>> we are safe to use thread unsafe POSIX locks and take out
>>>> virtlockd completely out of the picture.
>>>
>>> Note safety of POSIX locks is not merely about other threads.
>>>
>>> If *any* code we invoke from security_dac or security_selinux
>>> were to open() & close() the file we're labelling we'll
>>> loose the locks.
>>>
>>> This is a little concerning since we call out to a few functions
>>> in src/util that might not be aware they should *never* open)(
>>> and close() the path they're given. It looks like we lucky at
>>> the moment, but this has future gun -> foot -> ouch potential.
>>>
>>> Likewise we also call out to libselinux APIs. I've looked at
>>> the code for the APIs we call and again, right now, we look
>>> to be safe, but there's some risk there.
>>>
>>> I'm not sure whether this is big enough risk to invalidate this
>>> forking approach or not though.  This kind of problem was the
>>> reason why virtlockd was created to hold locks in a completely
>>> separate process from the code with the critical section.
>>
>> But unless we check for hard links virtlockd will suffer from the POSIX
>> locks too. For instance, assume that B is a hardlink to A. Then, if a
>> process opens A, locks it an the other opens B and closes it the lock is
>> released. Even though A and B look distinct.
> 
> Yes, that is correct - there's an implicit assumption that all VMs
> are using the same path if there's multiple hard links. Foruntately
> this scenario is fairly rare - more common is symlinks which can be
> canonicalized.
> 
>> The problem with virlockd is that our current virlockspace impl is not
>> aware of different offsets. We can try to make it so, but that would
>> complicate the code a lot. Just imagine that there's a domain already
>> running and it holds disk content lock over file C. If an user wants to
>> start a domain (which too has disk C), libvirtd will try to lock C again
>> (even though on different offset) but it will fail to do so because C is
>> already locked.
> 
> Yep, we would need intelligence around offsets to avoid opening it
> twice for each offset.
> 
>> Another problem with virtlockd that I faced was that with current
>> implementation the connection to virlockd is opened in metadataLock(),
>> where the connection FD is dup()-ed to prevent connection closing. The
>> connection is then closed in metadataUnlock() where the dup()-ed FD is
>> closed. This works pretty much good, except for fork(). If a fork()
>> occurs between metadataLock() and metadataUnlock() the duplicated FD is
>> leaked into the child and we can't be certain when will the child close
>> it. The worst case scenario is that the child will close the FD when we
>> are holding some resources, which results in virtlockd killing libvirtd
>> (= registered owner of the locks).
> 
> I don't think that's the case actually. If you have an FD that is a
> UNIX/TCP socket that gets dup(), the remote end of the socket won't
> see HUP until all duplicates of the socket are closed. IOW, both the
> parent & child process would need to close their respective dups.

It is the case. I've seen it out in the wild (when starting several
domains at once):

https://www.redhat.com/archives/libvir-list/2018-September/msg01138.html

> 
>> Alternatively, we can switch to OFD locks and have the feature work only
>> on Linux. If anybody from BSD users will want the feature they will have
>> to push their kernel developers to implement some sensible file locking.
>>
>> Also, one of the questions is what we want metadata locking to be. So
>> far I am aiming on having exclusive access to the file that we are
>> chown()-ing (or setfcon()-ing) so that I can put a code around it that
>> manipulates XATTR where the original owner would be stored. This way the
>> metadata lock is held only for a fraction of a second.
>> I guess we don't want the metadata lock be held through whole run of a
>> domain, i.e. obtained at domain start and released at domain shutdown. I
>> don't see much benefit in that.
> 
> Yes, we must only lock it for the short time that we're reading or writing
> the metadata. We must not hold the long long term, because other threads
> or processes need access to the metadata while the VM is running, for
> example to deal with validating shared disk labels.
> 
> 
> On further reflection, I think we'll be safe enough with what you're
> proposing. For it to go wrong, we would have to have a bug in the code
> which accidentally open+close the socket, and be on an old kernel that
> lacks OFD locks, and two processes must be in the critical section at
> the same time in that short < 1 second window.

Cool. Thanks.

Let me just say that I find it disturbing that POSIX doesn't specify
file locks that are usable in a multithreaded app. Sure, we can have
some abstraction layer over lockf() + open() + close() but it can hardly
be something usable. Or something that every app should do.

Michal