[libvirt] [PATCH v3 3/6] udev: Convert udevEventHandleThread to an actual thread routine

Thu Aug 31 16:23:09 UTC 2017

On 08/29/2017 08:49 AM, Erik Skultety wrote:
> On Mon, Aug 28, 2017 at 12:26:08PM -0400, John Ferlan wrote:
>>
>>
>> On 08/24/2017 07:23 AM, Erik Skultety wrote:
>>> Adjust udevEventHandleThread to be a proper thread routine running in an
>>> infinite loop handling devices. Also introduce udevEventThreadData
>>> private structure.
>>> Every time there's and incoming event from udev, udevEventHandleCallback
>>> only increments the number of events queuing on the monitor and signals
>>> the worker thread to handle them.
>>>
>>> Signed-off-by: Erik Skultety <eskultet at redhat.com>
>>> ---
>>>  src/node_device/node_device_udev.c | 132 ++++++++++++++++++++++++++++++-------
>>>  1 file changed, 108 insertions(+), 24 deletions(-)
>>>

Let me take my head out of the VxHS sandbox ;-) and peek into this other
quagmire!

>>
>> Once we get here I'm not sure I understand the udev interaction. I guess
>> it's not "crystal clear" that the socket relationship is a queue of
>> device events. I also haven't studied the udev code nor have I been
>> working through this as much as you have lately - so perhaps something
>> you've uncovered will help educate me in this manner. Still here's my
>> thoughts and a small sliver of investigative data...
>>
>> So far there's been a 1-to-1 interaction between libvirt and udev event.
>> With this change there could be an n-to-1 interaction - as in receive @n
>> devices.
>>
>> Up to this point the interaction has been:
>>
>> 1. udev event
>> 2. validate fd/udev_monitor
>> 3. call udev_monitor_receive_device to get @device
>> 4. process @device
>> 5. unref @device
>> 6. return to udev
>>
>> After this point
>>
>> 1. udev event
>> 2. validate fd/udev_monitor
>> 3. increase event count
>> 4. signal condition to thread to wakeup
>> 5. return to udev
>>
>> asynchronously in a loop
>>
>> 1. wakeup on condition and for each 'event' refcnt
>> 2. decrease event refcnt
>> 3. validate fd/udev_monitor
>> 4. call udev_monitor_receive_device to get @device
>> 5. process @device
>> 6. unref @device
>>
>> If there's more than one event, does udev_monitor_receive_device
>> guarantee the order? Are we sure udev buffers/queues in this manner?
> 
> This should be fairly deterministic as it's not that much of a udev problem as
> it is a kernel problem (since we're talking about a netlink socket which is
> internally represented by a 'struct socket' structure in the kernel which in
> fact implements socket buffers, more specifically - very important - queues).
> So, if kernel didn't guarantee the order of packets/events on the socket, you'd
> experience the same problem with calling udev_monitor_receive_device, which
> basically only does recvmsg(), directly from udevEventHandleCallback. IOW if
> that was the case, udev itself could get easily confused by the kernel events
> about a device being deleted that had not been previously added etc. So, yes,
> the order has to be guaranteed and if it is not, then it's a kernel bug.
> 

Ah... OK great - I hadn't dug into that code and figured you'd been
looking more in depth. I think as part of the commit message for this
(or whatever this evolves into), the a description of how the udev
processing works could help. At the very least something that says,
kernel udev keeps messages in a queue on the socket to allow pulling off
the messages by udev_monitor_receive_device. If too many messages
stockpile, ENOSPC is generated, while if we try to pull something off
but nothing exists, then EAGAIN is generated. Since the socket has a
remote end (libvirt) that "should" be notified when data arrives (or
something like that).

>> That is since we're not grabbing @device before we return to udev when
>> the event is signaled, is there any guarantee from udev that *the same*
>> device will still exist when we do get around to making our call? My
>> concern being how "long" is the data guaranteed to be there.
> 
> As I said, unless the socket buffer size limit has been reached, in which case
> ENOSPC was returned by libudev - I had a BZ on this:
> https://bugzilla.redhat.com/show_bug.cgi?id=1450960
> 

Ah yes, fix the race to no space problem by adding more space...

>>
>> As I read the subsequent patch - I'm thinking perhaps we really want to
>> keep that 1-to-1 relationship. Could the reason that you're getting
>> those messages be because we're now calling udev out of the order it
>> expected? Do we know why we're getting a NULL return? I found:
> 
> Yeah, I was getting EAGAIN and it wasn't until that point that I realized that
> the scheduler came into play, IOW the main event loop thread polls the file
> descriptor list, invoking the corresponding handlers, which means that
> udevEventHandleCallback just increased the overall number of events waiting to
> be processed by the handler thread, except that since no one actually extracted
> the data out of the socket and the thread hadn't been scheduled in the meantime
> it's still the same data that the main event loop thread finds waiting on the
> socket the next time it polls the @fd, therefore incrementing the event counter
> multiple times for the same event. Naturally, when the handler thread finally
> removes the event from the monitor, not knowing the counter had incremented
> for this event multiple times, it will fail to remove any more events from the
> monitor, as none had arrived in the meantime, therefore the EAGAIN.
> 

Seems to me then 'nevents' counting is pointless other than to indicate
that you got "some" event that needs to be processed.

When the thread wakes up or is scheduled, then it processes all it can
find (locked of course) and then clears the flag indicating the presence
of events. Of course, that blocks the Callback function from setting the
flag that there's an event.

Thus still having a somewhat similar problem as you're trying to fix but
also introducing the problem that the thread wakes up and locks private
data, event is signaled again indicating the same event waiting, but it
cannot get the private data lock, meanwhile the thread processes all
events, releases the lock. This allows the other thread to gain the
lock, set the new event bit, but since we've already processed it - when
the thread wakes up it finds no event and generates the bogus message.

IIUC, then patch4 resolves this by disabling the main event loop from
sending events until we've processed them. Thus should patch4 be merged
with this one?

In any case, that's some nice detective work - I can only imagine this
was *really* frustrating to dig into! Some day it'll be "fixed" and we
won't trip into the problem, but until then, I now understand things a
lot better...

>>
>> https://sourcecodebrowser.com/udev/141/libudev-monitor_8c.html
>>
>> which seems to indicate a number of reasons to return NULL... Maybe udev
>> is processing another device and while doing so it blocks anyone calling
>> udev_monitor_receive_device. Conversely while processing an event that
>> same block wouldn't be present because the expectation is that the
>> called function will call udev_monitor_receive_device.
> 
> If this was the case, sooner or later we'd hit the problem anyway.
> 
>>
>> I think perhaps the processing thread is fine to have; however, instead
>> of processing the event it's processing an @device that was managed in
>> udevEventHandleCallback and put onto some "worker queue".  At least that
> 
> This would work nicely as well, I just went with a simpler solution where I
> didn't have to create another device queue and only offloaded the whole
> processing onto the handler thread, so it's just a matter taste :).

Yep - of course perhaps that could be a "next" step for some future
fortunate sole that falls into some other issue... At least with a few
more tidbits of history someone may have a chance at figure it out.

Still even with our own device queue - we're still at the mercy of the
speed and processing logic of the udev code, so it may not help in any
way other than providing a few more cycles for files to magically appear
and be properly filled while making sure libvirt's udev event processing
logic isn't a bottle neck. All we do is get an event, get the device,
add the device to our own queue, and return. Then it's up to that queue
mgmt logic to appropriately message issue somewhere.

I'm not requiring our own device queue, but just typing my thoughts.

John

> 
> Erik
> 
>> way we know we have a refcnt'd udev_device and we can process in an
>> expected order.
>>
>> John
>>
>>> diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c
>>> index fe943c78b..444e5be4d 100644
>>> --- a/src/node_device/node_device_udev.c
>>> +++ b/src/node_device/node_device_udev.c
>>> @@ -59,6 +59,54 @@ struct _udevPrivate {
>>>      bool privileged;
>>>  };
>>>
>>> +typedef struct _udevEventThreadData udevEventThreadData;
>>> +typedef udevEventThreadData *udevEventThreadDataPtr;
>>> +
>>> +struct _udevEventThreadData {
>>> +    virMutex lock;
>>> +    virCond threadCond;
>>> +
>>> +    bool threadQuit;
>>> +    int monitor_fd;
>>> +    unsigned long long nevents; /* number of udev events queuing on monitor */
>>> +};
>>> +
>>> +
>>> +static udevEventThreadDataPtr
>>> +udevEventThreadDataNew(int fd)
>>> +{
>>> +    udevEventThreadDataPtr ret = NULL;
>>> +
>>> +    if (VIR_ALLOC_QUIET(ret) < 0)
>>> +        return NULL;
>>> +
>>> +    if (virMutexInit(&ret->lock) < 0)
>>> +        goto cleanup;
>>> +
>>> +    if (virCondInit(&ret->threadCond) < 0)
>>> +        goto cleanup_mutex;
>>> +
>>> +    ret->monitor_fd = fd;
>>> +
>>> +    return ret;
>>> +
>>> + cleanup_mutex:
>>> +    virMutexDestroy(&ret->lock);
>>> +
>>> + cleanup:
>>> +    VIR_FREE(ret);
>>> +    return NULL;
>>> +}
>>> +
>>> +
>>> +static void
>>> +udevEventThreadDataFree(udevEventThreadDataPtr data)
>>> +{
>>> +    virMutexDestroy(&data->lock);
>>> +    virCondDestroy(&data->threadCond);
>>> +    VIR_FREE(data);
>>> +}
>>> +
>>>
>>>  static bool
>>>  udevHasDeviceProperty(struct udev_device *dev,
>>> @@ -1647,35 +1695,53 @@ udevCheckMonitorFD(struct udev_monitor *udev_monitor, int fd)
>>>
>>>
>>>  static void
>>> -udevEventHandleThread(void *opaque ATTRIBUTE_UNUSED)
>>> +udevEventHandleThread(void *opaque)
>>>  {
>>> +    udevEventThreadDataPtr privateData = opaque;
>>>      struct udev_device *device = NULL;
>>>      struct udev_monitor *udev_monitor = NULL;
>>> -    int fd = (intptr_t) opaque;
>>>
>>> -    nodeDeviceLock();
>>> -    udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
>>> +    /* continue rather than break from the loop on non-fatal errors */
>>> +    while (1) {
>>> +        virMutexLock(&privateData->lock);
>>> +        while (privateData->nevents == 0 && !privateData->threadQuit) {
>>> +            if (virCondWait(&privateData->threadCond, &privateData->lock)) {
>>> +                virReportSystemError(errno, "%s",
>>> +                                     _("handler failed to wait on condition"));
>>> +                goto cleanup;
>>> +            }
>>> +        }
>>>
>>> -    if (!udevCheckMonitorFD(udev_monitor, fd))
>>> -        goto unlock;
>>> +        privateData->nevents--;
>>> +        virMutexUnlock(&privateData->lock);
>>>
>>> -    device = udev_monitor_receive_device(udev_monitor);
>>> -    if (device == NULL) {
>>> -        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>>> -                       _("udev_monitor_receive_device returned NULL"));
>>> -        goto unlock;
>>> +        nodeDeviceLock();
>>> +        udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
>>> +
>>> +        if (!udevCheckMonitorFD(udev_monitor, privateData->monitor_fd)) {
>>> +            nodeDeviceUnlock();
>>> +            goto cleanup;
>>> +        }
>>> +
>>> +        device = udev_monitor_receive_device(udev_monitor);
>>> +        nodeDeviceUnlock();
>>> +
>>> +        if (!device) {
>>> +            virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>>> +                           _("udev_monitor_receive_device returned NULL"));
>>> +            goto next;
>>> +        }
>>> +
>>> +        udevHandleOneDevice(device);
>>> +
>>> +    next:
>>> +        udev_device_unref(device);
>>>      }
>>>
>>> -    nodeDeviceUnlock();
>>> -    udevHandleOneDevice(device);
>>> -
>>>   cleanup:
>>>      udev_device_unref(device);
>>> +    udevEventThreadDataFree(privateData);
>>>      return;
>>> -
>>> - unlock:
>>> -    nodeDeviceUnlock();
>>> -    goto cleanup;
>>>  }
>>>
>>>
>>> @@ -1683,20 +1749,28 @@ static void
>>>  udevEventHandleCallback(int watch ATTRIBUTE_UNUSED,
>>>                          int fd,
>>>                          int events ATTRIBUTE_UNUSED,
>>> -                        void *data ATTRIBUTE_UNUSED)
>>> +                        void *opaque)
>>>  {
>>>      struct udev_monitor *udev_monitor = NULL;
>>> +    udevEventThreadDataPtr threadData = opaque;
>>>
>>>      nodeDeviceLock();
>>>      udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
>>> -
>>>      if (!udevCheckMonitorFD(udev_monitor, fd)) {
>>> +        virMutexLock(&threadData->lock);
>>> +        threadData->threadQuit = true;
>>> +        virCondSignal(&threadData->threadCond);
>>> +        virMutexUnlock(&threadData->lock);
>>> +
>>>          nodeDeviceUnlock();
>>>          return;
>>>      }
>>>      nodeDeviceUnlock();
>>>
>>> -    udevEventHandleThread((void *)(intptr_t) fd);
>>> +    virMutexLock(&threadData->lock);
>>> +    threadData->nevents++;
>>> +    virCondSignal(&threadData->threadCond);
>>> +    virMutexUnlock(&threadData->lock);
>>>  }
>>>
>>>
>>> @@ -1823,6 +1897,9 @@ nodeStateInitialize(bool privileged,
>>>  {
>>>      udevPrivate *priv = NULL;
>>>      struct udev *udev = NULL;
>>> +    int monitor_fd = -1;
>>> +    virThread th;
>>> +    udevEventThreadDataPtr threadData = NULL;
>>>
>>>      if (VIR_ALLOC(priv) < 0)
>>>          return -1;
>>> @@ -1883,6 +1960,14 @@ nodeStateInitialize(bool privileged,
>>>                                               128 * 1024 * 1024);
>>>  #endif
>>>
>>> +    monitor_fd = udev_monitor_get_fd(priv->udev_monitor);
>>> +    if (!(threadData = udevEventThreadDataNew(monitor_fd)) ||
>>> +        virThreadCreate(&th, false, udevEventHandleThread, threadData) < 0) {
>>> +        virReportSystemError(errno, "%s",
>>> +                             _("failed to create udev handling thread"));
>>> +        goto cleanup;
>>> +    }
>>> +
>>>      /* We register the monitor with the event callback so we are
>>>       * notified by udev of device changes before we enumerate existing
>>>       * devices because libvirt will simply recreate the device if we
>>> @@ -1891,9 +1976,8 @@ nodeStateInitialize(bool privileged,
>>>       * enumeration.  The alternative is to register the callback after
>>>       * we enumerate, in which case we will fail to create any devices
>>>       * that appear while the enumeration is taking place.  */
>>> -    priv->watch = virEventAddHandle(udev_monitor_get_fd(priv->udev_monitor),
>>> -                                    VIR_EVENT_HANDLE_READABLE,
>>> -                                    udevEventHandleCallback, NULL, NULL);
>>> +    priv->watch = virEventAddHandle(monitor_fd, VIR_EVENT_HANDLE_READABLE,
>>> +                                    udevEventHandleCallback, threadData, NULL);
>>>      if (priv->watch == -1)
>>>          goto unlock;
>>>
>>>