[PATCH 00/10] resolve hangs/crashes on libvirtd shutdown
Nikolay Shirokovskiy
nshirokovskiy at virtuozzo.com
Wed Jul 15 05:51:03 UTC 2020
On 14.07.2020 17:53, Daniel Henrique Barboza wrote:
> As far as code goes:
>
>
> Reviewed-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
>
>
> About the design I have a question about the timeout. Patch 5/10 is setting a
> 15 second timeout. How did you reach this value? Reading the bug, specially
> this comment from Daniel:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1828207#c6
>
> He mentions "give it 5 seconds of running before shutting it down".
I guess 5 seconds is time for libvirtd to finish startup. This time has
different nature than time for libvirtd to finish it's work on shutdown
so it can be different.
>
> 5 seconds before shutdown is something that most users can be slightly annoyed
> but in the end don't mind that much, but 15 seconds is something that will
> cause bugs to be opened because "Libvirt is taking too long to shutdown".
> Besides, it's a fair assumption that a transaction that takes more than
> 5 or so seconds to finish is already compromised* - might as well shutdown
> the daemon and deal with the errors.
15 seconds was mentioned by Daniel in [1] when he first proposed the approach
so I used this value without any extra thought. However I missed that in
the last John's series [2] the default for waiting time is 0s. May be this
is the current decision on waiting time. Let's wait for others to join
the review.
[1] https://www.redhat.com/archives/libvir-list/2018-January/msg00942.html
[2] https://www.redhat.com/archives/libvir-list/2018-July/msg00382.html
Nikolay
>
>
>
> * assuming user discretion to avoid shutting down the daemon in the middle
> of a long duration transaction, of course.
>
>
> Thanks,
>
>
> DHB
>
>
> On 7/14/20 6:32 AM, Nikolay Shirokovskiy wrote:
>> This series follows [1] but address the issue slightly differently.
>> Instead of polling for RPC thread pool termination it waits for
>> thread pool drain in distinct thread and then signal the main loop
>> to exit.
>>
>> The series introduces new driver's methods stateShutdown/stateShutdownWait
>> to finish all driver's background threads. The methods however are only
>> implemented for qemu driver and only partially. There are other drivers
>> that have background threads and I don't check every one of them in
>> terms of how they manage their background threads.
>>
>> For example node driver creates 2 threads. One of them is supposed to live
>> a for a short amount of time and thus not tracked. This thread can cause issues
>> on shutdown. The second thread is tracked and finished synchronously on driver
>> cleanup. So this thread can not cause crashes but can cause hangs theoretically
>> speaking so we may want to move the thread finishing to stateShutdownWait
>> method so that possible hang will be handled by shutdown timeout.
>>
>> The qemu driver also has untracked threads and they can cause crashes on
>> shutdown. For example reconnect threads or reboot thread. These need to be
>> tracked.
>>
>> I'm going to address these issues in qemu and other drivers once the overall
>> approach will be approved.
>>
>> I added 2 new driver's methods so that thread finishing will be done in
>> parallel. If we have only one method then the shutdown is done one by one
>> effectively.
>>
>> I've added clean shutdown timeout in event loop as suggested by Daniel in [2].
>> Now I think why we can't just go with systemd unit management? Systemd will
>> eventually send SIGKILL and we can tune the timeout using TimeoutStopUSec
>> parameter. This way we even don't need to introduce new driver's methods.
>> Driver's background threads can be finished in stateCleanup method. AFAIU as
>> drivers are cleaned up in reverse order it is safe in the sense that already
>> cleaned up driver can not be used by background threads of not yet cleaned up
>> driver. Of course this way the cleanup is not done in parallel. Well to
>> turn it into parallel we can introduce just stateShutdown which we don't need
>> to call in netdaemon code and thus don't introduce undesired dependency of
>> netdaemon on drivers concept.
>>
>> [1] Resolve libvirtd hang on termination with connected long running client
>> https://www.redhat.com/archives/libvir-list/2018-July/msg00382.html
>> [2] Races / crashes in shutdown of libvirtd daemon
>> https://www.redhat.com/archives/libvir-list/2020-April/msg01328.html
>>
>> Nikolay Shirokovskiy (10):
>> libvirt: add stateShutdown/stateShutdownWait to drivers
>> util: always initialize priority condition
>> util: add stop/drain functions to thread pool
>> rpc: don't unref service ref on socket behalf twice
>> rpc: finish all threads before exiting main loop
>> vireventthread: add virEventThreadClose
>> qemu: exit thread synchronously in qemuDomainObjStopWorker
>> qemu: implement driver's shutdown/shutdown wait methods
>> rpc: cleanup virNetDaemonClose method
>> util: remove unused virThreadPoolNew macro
>>
>> scripts/check-drivername.py | 2 +
>> src/driver-state.h | 8 ++++
>> src/libvirt.c | 42 +++++++++++++++++++
>> src/libvirt_internal.h | 2 +
>> src/libvirt_private.syms | 3 ++
>> src/libvirt_remote.syms | 1 -
>> src/qemu/qemu_domain.c | 1 +
>> src/qemu/qemu_driver.c | 32 +++++++++++++++
>> src/remote/remote_daemon.c | 3 --
>> src/rpc/virnetdaemon.c | 95 ++++++++++++++++++++++++++++++++++++-------
>> src/rpc/virnetdaemon.h | 2 -
>> src/rpc/virnetserver.c | 8 ++++
>> src/rpc/virnetserver.h | 1 +
>> src/rpc/virnetserverservice.c | 1 -
>> src/util/vireventthread.c | 9 ++++
>> src/util/vireventthread.h | 1 +
>> src/util/virthreadpool.c | 65 ++++++++++++++++++++---------
>> src/util/virthreadpool.h | 6 +--
>> 18 files changed, 238 insertions(+), 44 deletions(-)
>>
More information about the libvir-list
mailing list