[PATCH 00/10] resolve hangs/crashes on libvirtd shutdown

Nikolay Shirokovskiy nshirokovskiy at virtuozzo.com
Wed Jul 15 05:51:03 UTC 2020



On 14.07.2020 17:53, Daniel Henrique Barboza wrote:
> As far as code goes:
> 
> 
> Reviewed-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
> 
> 
> About the design I have a question about the timeout. Patch 5/10 is setting a
> 15 second timeout. How did you reach this value? Reading the bug, specially
> this comment from Daniel:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1828207#c6
> 
> He mentions "give it 5 seconds of running before shutting it down".

I guess 5 seconds is time for libvirtd to finish startup. This time has
different nature than time for libvirtd to finish it's work on shutdown
so it can be different.

> 
> 5 seconds before shutdown is something that most users can be slightly annoyed
> but in the end don't mind that much, but 15 seconds is something that will
> cause bugs to be opened because "Libvirt is taking too long to shutdown".
> Besides, it's a fair assumption that a transaction that takes more than
> 5 or so seconds to finish is already compromised* - might as well shutdown
> the daemon and deal with the errors.

15 seconds was mentioned by Daniel in [1] when he first proposed the approach
so I used this value without any extra thought. However I missed that in
the last John's series [2] the default for waiting time is 0s. May be this
is the current decision on waiting time. Let's wait for others to join
the review.

[1] https://www.redhat.com/archives/libvir-list/2018-January/msg00942.html
[2] https://www.redhat.com/archives/libvir-list/2018-July/msg00382.html

Nikolay

> 
> 
> 
> * assuming user discretion to avoid shutting down the daemon in the middle
> of a long duration transaction, of course.
> 
> 
> Thanks,
> 
> 
> DHB
> 
> 
> On 7/14/20 6:32 AM, Nikolay Shirokovskiy wrote:
>> This series follows [1] but address the issue slightly differently.
>> Instead of polling for RPC thread pool termination it waits for
>> thread pool drain in distinct thread and then signal the main loop
>> to exit.
>>
>> The series introduces new driver's methods stateShutdown/stateShutdownWait
>> to finish all driver's background threads. The methods however are only
>> implemented for qemu driver and only partially. There are other drivers
>> that have background threads and I don't check every one of them in
>> terms of how they manage their background threads.
>>
>> For example node driver creates 2 threads. One of them is supposed to live
>> a for a short amount of time and thus not tracked. This thread can cause issues
>> on shutdown. The second thread is tracked and finished synchronously on driver
>> cleanup. So this thread can not cause crashes but can cause hangs theoretically
>> speaking so we may want to move the thread finishing to stateShutdownWait
>> method so that possible hang will be handled by shutdown timeout.
>>
>> The qemu driver also has untracked threads and they can cause crashes on
>> shutdown. For example reconnect threads or reboot thread. These need to be
>> tracked.
>>
>> I'm going to address these issues in qemu and other drivers once the overall
>> approach will be approved.
>>
>> I added 2 new driver's methods so that thread finishing will be done in
>> parallel. If we have only one method then the shutdown is done one by one
>> effectively.
>>
>> I've added clean shutdown timeout in event loop as suggested by Daniel in [2].
>> Now I think why we can't just go with systemd unit management? Systemd will
>> eventually send SIGKILL and we can tune the timeout using TimeoutStopUSec
>> parameter. This way we even don't need to introduce new driver's methods.
>> Driver's background threads can be finished in stateCleanup method. AFAIU as
>> drivers are cleaned up in reverse order it is safe in the sense that already
>> cleaned up driver can not be used by background threads of not yet cleaned up
>> driver. Of course this way the cleanup is not done in parallel. Well to
>> turn it into parallel we can introduce just stateShutdown which we don't need
>> to call in netdaemon code and thus don't introduce undesired dependency of
>> netdaemon on drivers concept.
>>
>> [1] Resolve libvirtd hang on termination with connected long running client
>>      https://www.redhat.com/archives/libvir-list/2018-July/msg00382.html
>> [2] Races / crashes in shutdown of libvirtd daemon
>>      https://www.redhat.com/archives/libvir-list/2020-April/msg01328.html
>>
>> Nikolay Shirokovskiy (10):
>>    libvirt: add stateShutdown/stateShutdownWait to drivers
>>    util: always initialize priority condition
>>    util: add stop/drain functions to thread pool
>>    rpc: don't unref service ref on socket behalf twice
>>    rpc: finish all threads before exiting main loop
>>    vireventthread: add virEventThreadClose
>>    qemu: exit thread synchronously in qemuDomainObjStopWorker
>>    qemu: implement driver's shutdown/shutdown wait methods
>>    rpc: cleanup virNetDaemonClose method
>>    util: remove unused virThreadPoolNew macro
>>
>>   scripts/check-drivername.py   |  2 +
>>   src/driver-state.h            |  8 ++++
>>   src/libvirt.c                 | 42 +++++++++++++++++++
>>   src/libvirt_internal.h        |  2 +
>>   src/libvirt_private.syms      |  3 ++
>>   src/libvirt_remote.syms       |  1 -
>>   src/qemu/qemu_domain.c        |  1 +
>>   src/qemu/qemu_driver.c        | 32 +++++++++++++++
>>   src/remote/remote_daemon.c    |  3 --
>>   src/rpc/virnetdaemon.c        | 95 ++++++++++++++++++++++++++++++++++++-------
>>   src/rpc/virnetdaemon.h        |  2 -
>>   src/rpc/virnetserver.c        |  8 ++++
>>   src/rpc/virnetserver.h        |  1 +
>>   src/rpc/virnetserverservice.c |  1 -
>>   src/util/vireventthread.c     |  9 ++++
>>   src/util/vireventthread.h     |  1 +
>>   src/util/virthreadpool.c      | 65 ++++++++++++++++++++---------
>>   src/util/virthreadpool.h      |  6 +--
>>   18 files changed, 238 insertions(+), 44 deletions(-)
>>





More information about the libvir-list mailing list