[libvirt] [Qemu-devel] [RFC 0/5] block: File descriptor passing using -open-hook-fd

Anthony Liguori aliguori at us.ibm.com
Mon Jul 9 20:46:55 UTC 2012


On 07/09/2012 03:29 PM, Eric Blake wrote:
> On 07/09/2012 02:00 PM, Anthony Liguori wrote:
>
>>> with the fd:name approach, the sequence is:
>>>
>>> libvirt calls getfd:name1 over normal monitor
>>> qemu responds
>>> libvirt calls getfd:name2 over normal monitor
>>> qemu responds
>>> libvirt calls transaction around blockdev-snapshot-sync over normal
>>> monitor, using fd:name1 and fd:name2
>>> qemu responds
>
> This general layout is true whether we rewrite all commands to
> understand fd:nnn (proposal 1) or whether we add new magic parsing
> (/dev/fd/nnn of proposal 3, or even /dev/fdset/nnn of proposal 5), all
> as called out in these messages:
>
> https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg00227.html
> https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01098.html
>
>>>
>>> but with -open-hook-fd, the approach would be:
>>>
>>> libvirt calls transaction
>>> qemu calls open(file1) over hook
>>> libvirt responds
>>> qemu calls open(file2) over hook
>>> libvirt responds
>>> qemu responds to the original transaction
>
> whereas this approach is quite different in semantics, but may indeed be
> easier for qemu to implement, at the expense of some more complexity on
> the part of libvirt.
>
> At the high level, I think both approaches have one thing in common: by
> refactoring all qemu code to go through qemu_open(), we can then
> implement our desired complexity (whether fd:nn, /dev/fd/nnn,
> /dev/fdset/nnn, or some other magic name parsing; or whether it is an
> rpc call over a second socket in parallel to the monitor socket) in just
> one location.  Likewise, both approaches have to deal with libvirtd
> restarts (magic name parsing by changing an 'inuse' flag when the
> monitor detects EOF; rpc passing by failing a qemu_open() when the rpc
> socket detects EOF).

Ack.
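
To make that concrete, here's a minimal sketch of the single choke point you
get once everything funnels through qemu_open().  This is not the actual
patch series; fdset_add()/fdset_dup_fd() and the static table are illustrative
stand-ins for whatever bookkeeping the monitor ends up doing:

#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Illustrative stand-in for fds libvirt already passed over the monitor
 * (e.g. via a getfd/add-fd style command using SCM_RIGHTS). */
static int fdset_table[16];
static int fdset_count;

static int fdset_add(int fd)                /* returns the fdset id */
{
    if (fdset_count >= 16) {
        return -1;
    }
    fdset_table[fdset_count] = fd;
    return fdset_count++;
}

static int fdset_dup_fd(long id)
{
    if (id < 0 || id >= fdset_count) {
        errno = ENOENT;
        return -1;
    }
    return dup(fdset_table[id]);            /* caller owns the duplicate */
}

/* Because every open in QEMU goes through this one function, any
 * magic-name scheme (fd:nnn, /dev/fd/nnn, /dev/fdset/nnn, or an RPC
 * over a hook socket) only has to be implemented in one place. */
static int qemu_open_sketch(const char *name, int flags, mode_t mode)
{
    if (strncmp(name, "/dev/fdset/", 11) == 0) {
        return fdset_dup_fd(strtol(name + 11, NULL, 10));
    }
    return open(name, flags, mode);
}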

>
>>>
>>> The 'transaction' operation is thus blocked by the time it takes to do
>>> two intermediate opens over a second channel, which kind of defeats the
>>> purpose of making the transaction take effect with minimal guest
>>> downtime.
>>
>> How are you defining "guest down time"?
>>
>> It's important to note that code running in QEMU does not equate to
>> guest-visible downtime unless QEMU does an explicit vm_stop(), which is
>> not happening here.
>>
>> Instead, a VCPU may become blocked *if* it attempts to acquire qemu_mutex
>> while QEMU is holding it.
>>
>> If your concern is qemu_mutex being held while waiting for libvirt, it
>> would be fairly easy to implement a qemu_open_async() that allowed
>> dropping back to the main loop and then calling a callback when the
>> open completes.
>>
>> It would be pretty trivial to convert qmp_transaction to use such a
>> command.
>
> In other words, remembering that transactions are divided into phases:
>
> phase 1 - prepare: obtain all needed fds (whether by pre-opening them
> via 'pass-fd' or other new 'getfd' relative, or whether by RPC calls);
> no guest downtime, and with cleanup that avoids any leaks on any failures
> phase 2 - commit: flush all devices and actually make the changes in
> qemu state to use the fds obtained in phase 1
>
> and where the guest downtime (if any) is more likely due to flushing
> changes in phase 2

Not quite.  A synchronous flush can cause lock contention.  We need to separate
out the problem of lock contention from guest downtime.

Also, there's no obvious need to move the flushes before opens.  The main issue 
is that we use qemu_mutex to effectively create a write queue.

You can imagine a simple write queueing mechanism that would obviate the need
for this, such that we could flush, queue upcoming writes, and drop qemu_mutex
to sleep while waiting for libvirt to send us our fds.
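
Purely as a sketch of that idea (the structures and names below are invented
for the example, not QEMU code): while we're parked waiting for an fd, guest
writes go onto a queue instead of piling up behind qemu_mutex, and they get
handed back for submission once the fd arrives.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct pending_write {
    uint64_t offset;
    void *data;
    size_t len;
    struct pending_write *next;
};

struct write_queue {
    pthread_mutex_t lock;
    struct pending_write *head, **tail;
    bool blocked;                   /* true while awaiting an fd */
};

static void wq_init(struct write_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = NULL;
    q->tail = &q->head;
    q->blocked = false;
}

/* I/O path: queue the write if we're blocked, else tell the caller to
 * submit it directly. */
static bool wq_maybe_queue(struct write_queue *q, uint64_t offset,
                           const void *data, size_t len)
{
    pthread_mutex_lock(&q->lock);
    if (!q->blocked) {
        pthread_mutex_unlock(&q->lock);
        return false;               /* not blocked: submit normally */
    }
    struct pending_write *w = malloc(sizeof(*w));
    w->offset = offset;
    w->data = malloc(len);
    memcpy(w->data, data, len);
    w->len = len;
    w->next = NULL;
    *q->tail = w;
    q->tail = &w->next;
    pthread_mutex_unlock(&q->lock);
    return true;                    /* queued; completed later */
}

/* Once the fd shows up: take the whole list back for replay. */
static struct pending_write *wq_unblock(struct write_queue *q)
{
    pthread_mutex_lock(&q->lock);
    struct pending_write *list = q->head;
    q->head = NULL;
    q->tail = &q->head;
    q->blocked = false;
    pthread_mutex_unlock(&q->lock);
    return list;                    /* caller submits and frees entries */
}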

>> But this is all speculative.  There's no reason to believe that an RPC
>> would have a noticeable guest-visible latency unless you assume there's
>> lock contention.  I would strongly suspect that the bdrv_flush() is going
>> to be a much greater source of lock contention than the RPC would be.
>> An RPC is only bound by scheduler latency, whereas synchronous disk I/O
>> is bound by a spinning platter.
>>
>>> And libvirt code becomes a lot trickier: it has to deal with the fact
>>> that two channels are in use, and that the channel that issued the
>>> 'transaction' command must block while the other channel, which handles
>>> the hooks, remains responsive.
>>
>> All libvirt needs to do is listen on a socket and delegate access
>> according to a white list.  Whatever is providing fds needs no
>> knowledge of anything other than what the guest is allowed to access,
>> which shouldn't depend on the currently executing command.
>
> That's not quite accurate.  What the guest is allowed to access should
> indeed change depending on the executing command.  That is, if I start a
> guest with:

I should have spoken more clearly.  libvirt may change the white list
dynamically for various reasons.  But there shouldn't be a direct dependency
between whatever is serving up fds and whatever is changing the white list.

Basically, you just need a shared hash table for each guest.  It should be quite 
simple.
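
A minimal sketch of that arrangement, using GLib for the table (the function
names are illustrative, not libvirt's actual internals): one side mutates the
per-guest white list, and the hook-serving side consults it before opening the
file and shipping the fd back with SCM_RIGHTS.

#include <fcntl.h>
#include <glib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* One white list per guest, shared between the thread that updates
 * policy and the thread that serves open requests; neither needs to
 * know what monitor command is currently executing. */
static GHashTable *whitelist;       /* path -> presence */

static void whitelist_init(void)
{
    whitelist = g_hash_table_new_full(g_str_hash, g_str_equal,
                                      g_free, NULL);
}

static void whitelist_allow(const char *path)
{
    g_hash_table_insert(whitelist, g_strdup(path), NULL);
}

/* Serve one open request from QEMU: check policy, open, pass the fd
 * back over the hook socket as ancillary data. */
static int serve_open_request(int sock, const char *path, int flags)
{
    if (!g_hash_table_contains(whitelist, path)) {
        return -1;                  /* not whitelisted: refuse */
    }
    int fd = open(path, flags);
    if (fd < 0) {
        return -1;
    }

    char ack = 'F';
    struct iovec iov = { .iov_base = &ack, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    int ret = sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    close(fd);                      /* our copy; the receiver owns one now */
    return ret;
}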

> Maybe the only reason that I'm still leaning towards a 'pass-fd'
> solution instead of a hook fd solution is that libvirt would have less
> code to write to get it working.  But it was originally Dan's complaint
> that an rpc solution has too much risk for deadlock or leaks;

The reason I came back to this is that after reading through the threads, I 
started thinking about how to solve the leak problem.

You need clear ownership.  Having QEMU "own" a file descriptor because it "asks"
for an fd makes QEMU the clear owner.  OTOH, having libvirt give QEMU an
fd with a floating reference that QEMU may or may not pin ends up being
extremely complex in practice.  I'm not sure you can really solve the reliable
closing problem either.  If you did have a "kill all floating references"
command, that could introduce other problems (what about multiple clients?).

> if we are
> confident that we can avoid deadlock,

I don't think deadlocks are possible FWIW.

> and that the idea of passing in
> fds in response to an rpc involves less bookkeeping and speculation on
> libvirt's part about which monitor commands will require opening an fd,
> then maybe it really is a better technical solution, even if it costs
> more libvirt code to implement.

I think the important part is that it allows libvirt to avoid having
intimate knowledge of how QEMU commands work.  If we decide we need to change
the flags/perms on a file descriptor down the road, it's a lot easier to cope
with that than it is to cope with changing the order in which we open files.

Plus, once you implement this in libvirt, you don't have to worry about it for
future block commands.  With fdsets, libvirt would need to figure out the
magic incantation of setfd commands for every future block command.

Plus, treating /dev/fdset as something other than a valid file path is just
asking for trouble down the road...

Regards,

Anthony Liguori