[libvirt] [PATCH] libvirtd: fix potential deadlock when starting vm

Thu Oct 18 16:04:55 UTC 2018

[...]

>>
>> The one thing I have forgotten or perhaps I should say struck me -
>> should this code use virCondWaitUntil or would it just not matter?
>>
>> Let's say 10 devices were added and on the 10th one we had this issue,
>> set dataReady to false, but then didn't get another udev device ready
>> event for "a while" leaving that 10th one in limbo until such time a new
>> device was available (e.g. udevEventHandleCallback is called by udev).
> 
> Can you be more specific on how we could utilize virCondWaitUntil? Because the
> way I see it, it wouldn't help us in either of the cases because the problem
> isn't the thread waiting to be woken up to process an event rather it's the
> recvmsg syscall in libudev which would cause the issue on old CentOS.
> Looking at your example, if you put virCondWaitUntil in the !device conditional
> path, how would that prevent us to hang after processing the last device if we
> never got a chance to even hit the !device code path, since we kept getting
> legitimate events until the last device where we got several events for a
> single piece of data? It very well might be the jetlag speaking for me, but I
> just don't see it in your example.
> 
> Erik
> 

I guess it was the "getting of legitimate events" that I couldn't recall
the exact workings. My "theory" was we got a group of 10 devices, failed
on the last one, and didn't get another call for let's say hours. Then
we have this device in limbo just because udev told us too soon an we
acted, but the scheduler had other ideas vis-a-vis filling in the
@device.  I'm fine with things as they are still.

More details if you care to read... First, I think the theory of
thinking about using Until processing works better than the actual
practice of coding it - Until would requiring using virTimeMillisNow
which can fail so we need to account for that...

My general thought was without writing a lick of code was:

    bool timedWait = false;
    int rc;

...

    while (!priv->dataReady && !priv->threadQuit) {

    errno = 0;
    if (!timedWait)
        rc = virCondWait(...)
    else
        rc = virCondWaitUntil(...);

    if (rc < 0 && errno != ETIMEDOUT) {
        virReportSystemError ...
    }

    timedWait = false;

...

    if (!device) {
...
            virObjectLock(priv);
            priv->dataReady = false;
            virObjectUnlock(priv);
            timedWait = true;
...

IOW: Only use the timedWait when/if !device caused the "continue;". I
also note now that the virCondWait error check doesn't check < 0, just
that the return is not 0... Maybe that was copied from virFDStreamThread
since other callers check < 0.

Once I started writing code I realized the virTimeMillisNow issue, so
then things would become:

            errno = 0;
            if (timedWait) {
                /* If we fail this, then just use virCondWait */
                if (virTimeMillisNow(&now) < 0) {
                    timedWait = false;
                } else {
                    when = now + 1000ull;
                    rc = virCondWait(&priv->threadCond,
                                     &priv->parent.lock,
                                     when);
                }
            }

            if (!timedWait)
                rc = virCondWait(&priv->threadCond, &priv->parent.lock);

            if (rc < 0 && errno != ETIMEDOUT) {

Much uglier and scarier than I first thought.

John