<div dir="ltr">Hi all,<div>On my host, I have been seeing keepalive responses slow down intermittently when issuing bulk power-offs.</div><div>With some tips from Danpb on the channel, I traced this with systemtap and found that the main event loop would not run for about 6-9 seconds at a time. This stalls keepalives and kills client connections.</div><div><br></div><div>I traced the stall to the fact that <span style="font-variant-ligatures:no-common-ligatures;background-color:rgb(230,230,0);font-family:menlo;font-size:14px">qemuProcessHandleEvent()</span> needs the VM lock and is called from the main loop, so the main loop blocks for as long as an RPC holds that lock. I had hook scripts that slightly lengthened the time the power-off RPC took to complete, and the subsequent keepalive delays were noticeable.</div><div><br></div><div>I agree that the easiest solution is to release the VM lock before the hook scripts run.</div><div>However, I was wondering why we contend on the per-VM lock directly from the main loop at all. Could we do this instead: have the main loop "park" events on a separate event queue, and have a dedicated thread pool in the qemu driver pick up these raw events and then try to grab the per-VM lock for that VM?</div><div>That way, we can be sure that the main event loop is _never_ delayed, no matter how long an RPC drags on.</div><div><br></div><div>If this sounds reasonable, I will be happy to post the driver rewrite patches to that end.</div><div><br></div><div>Regards,</div><div>Prerna</div> </div>
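<div dir="ltr"><div>To make the proposal concrete, here is a rough sketch of the "park, then dispatch" idea. It is written in Python for brevity rather than in libvirt's C, and every name in it (EventDispatcher, park_event, vm_lock) is hypothetical and not taken from the qemu driver. It only illustrates that the producer side never touches the per-VM lock, while the workers are the ones that block on it.</div></div>

```python
import queue
import threading


class EventDispatcher:
    """Hypothetical stand-in for the proposed event queue plus worker pool."""

    def __init__(self, num_workers=2):
        self.events = queue.Queue()          # raw events parked by the main loop
        self.vm_locks = {}                   # one lock per VM name
        self.vm_locks_guard = threading.Lock()
        self.handled = []                    # record of processed events
        self.handled_guard = threading.Lock()
        self.workers = [
            threading.Thread(target=self._worker, daemon=True)
            for _ in range(num_workers)
        ]
        for worker in self.workers:
            worker.start()

    def vm_lock(self, vm_name):
        # Create the per-VM lock on first use.
        with self.vm_locks_guard:
            return self.vm_locks.setdefault(vm_name, threading.Lock())

    def park_event(self, vm_name, event):
        # Called from the main loop: only enqueues, never takes a VM lock.
        self.events.put((vm_name, event))

    def _worker(self):
        while True:
            item = self.events.get()
            if item is None:                 # shutdown sentinel
                self.events.task_done()
                return
            vm_name, event = item
            # Only the worker blocks here if an RPC is holding the VM lock.
            with self.vm_lock(vm_name):
                with self.handled_guard:
                    self.handled.append((vm_name, event))
            self.events.task_done()

    def shutdown(self):
        # One sentinel per worker, then wait for all of them to exit.
        for _ in self.workers:
            self.events.put(None)
        for worker in self.workers:
            worker.join()
```

<div dir="ltr"><div>One caveat with this sketch: with more than one worker, two events for the same VM can be picked up by different threads and handled out of order, so a real implementation would probably need per-VM ordering (for example, a queue per VM) on top of this.</div></div>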