[libvirt] [PATCH v4 5/7] nodedev: Disable/re-enable polling on the udev fd

Mon Oct 2 12:51:30 UTC 2017

On 10/02/2017 03:19 AM, Erik Skultety wrote:
> On Fri, Sep 29, 2017 at 09:46:48AM -0400, John Ferlan wrote:
>>
>>
>> On 09/28/2017 06:00 AM, Erik Skultety wrote:
>>> [...]
>>>
>>>>>
>>>>>          nodeDeviceLock();
>>>>> +        priv = driver->privateData;
>>>>>          udev_monitor = DRV_STATE_UDEV_MONITOR(driver);
>>>>>
>>>>>          if (!udevEventCheckMonitorFD(udev_monitor, privateData->monitor_fd)) {
>>>>> @@ -1725,6 +1727,9 @@ udevEventHandleThread(void *opaque)
>>>>>          device = udev_monitor_receive_device(udev_monitor);
>>>>>          nodeDeviceUnlock();
>>>>>
>>>>> +        /* Re-enable polling for new events on the @udev_monitor */
>>>>> +        virEventUpdateHandle(priv->watch, VIR_EVENT_HANDLE_READABLE);
>>>>> +
>>>>
>>>> I think this should only be done when privateData->nevents == 0?  If we
>>>> have multiple events to read, then calling virEventPollUpdateHandle,
>>>> (eventually) for every pass through the loop seems like a bit of
>>>> overkill especially if udevEventHandleCallback turns right around and
>>>> disables it again.
>>>>
>>>> Also fortunately there isn't more than one udev thread sending the
>>>> events since you access the priv->watch without the driver lock...
>>>>
>>>> Conversely perhaps we only disable if events > 1... Then again, how does
>>>> one get to 2 events queued if we're disabling as soon as we increment
>>>
>>> Very good point, technically events would still get queued, we just wouldn't
>>> check and yes, we would process 1 event at a time. Not optimal, but if you look
>>> at the original code and compare it with this one performance-wise (and I hope
>>> I haven't missed anything that would render everything I write next a complete
>>> rubbish), the time complexity hasn't changed, the space complexity hasn't
>>> changed, what changed is code complexity which makes the code a bit slower due
>>> to the excessive locking and toggling the FD polling back and forth. So
>>> essentially you've got the same thing as you had before..but asynchronous.
>>> However, yes, the usage of @nevents is completely useless now (haven't realized
>>> that immediately, thanks) and a simple signalling should suffice.
>>
>> Having it "slower" is necessarily bad ;-)  That gives some of the other
>> slower buggers a chance to fill in the details we need. Throwing the
>> control back to udev quicker could aid in that too.
> 
> I'm sorry, but I don't follow, how do you hand control back over to something
> you can't control? udev is not paused in any way during the phase libvirt is
> processing the events, so it keeps pushing new events to the socket queue,
> until it can't push any more.
> 

Hmmm... If the problem we're chasing is that some other consumer isn't
processing "fast enough" for our needs (for whatever reason), then by
perhaps making the libvirt implementation a bit slower because now we
have to "wait" for the thread to be signaled that something's there,
then perhaps, just maybe, with any good fortune the thing that was
causing the problem before would have "just enough time" to complete
such that we'll never have to enter the timing loop in subsequent
patches. Hopefully that helps explain.

>>
>>>
>>> So how could we make it faster though? I thought more about the idea you shared
>>> in one of the previous reviews, letting the thread actually pull all the data
>>> from the monitor, to which I IIRC replied something in the sense that the event
>>> counting mechanism wouldn't allow that and it would break. Okay, let's drop the
>>> event counting. What if we now let both the udev handler thread and the event
>>> loop "poll" the file descriptor, IOW let the event loop polling the monitor fd,
>>> thus invoking udevHandleCallback which would in turn keep signalling the handler
>>> thread that there are some data. The difference now in the handler thread would
>>> be that it wouldn't blindly trust the callback about the data, because of the
>>> scheduling issue, it would keep poking the monitor itself until it gets either
>>> EAGAIN or EWOULDBLOCK from recvmsg() called in libudev,  which would then be the
>>> signal to start waiting on the condition. The next an event appears, a signal
>>> from udevHandleCallback would finally have a meaning and wouldn't be ignored.
>>
>> Is making it faster really a goal?  It's preferable that it works
> 
> "I would think it would be overkill to disable/enable for just 1 event." I was
> trying to come up with something faster based on that statement.
> 

True and understood... But can the thread rely on two factors - "A" we
were told there was an event and "B" we process until we get the NULL
(w/ EAGAIN) from the udev_monitor_receive_device.

If so, then having the thread process until empty is perhaps the way out
of the disable/enable for just 1 event and keeping track of the count of
events. If not then I think I agree with the sentiment later in this
response of one-by-one version.

If we go with process until NULL type logic, would the thread need some
sort of wakeup timeout to ensure we don't miss something because of some
scheduling thing? I don't think so, but had to ask anyway.

>> consistently I think. The various errno possibilities and the desire to
>> avoid the "udev_monitor_receive_device returned NULL" message processing
>> because we had too many "cooks in the kitchen" trying to determine
>> whether a new device was really available or was this just another
>> notification for something we're already processing.
>>
>> Also, not that I expect the udev code to change, but if a new errno is
>> added then we may have to keep up... Always a fear especially if we're
> 
> Not necessarily, that would mean that either libudev replaces recvmsg with
> something else or that recvmsg starts returning more/different errnos to signal
> that there are no data to be pulled - since there are already 2 such errnos,
> it's highly unlikely IMHO. Unless libudev changes substantially, in terms of
> error signaling, I think we can safely ignore the other errnos as the actual
> outcome for libvirt is that libudev failed because of some socket error, but
> since that can either come from kernel or from libudev itself, the best thing
> we can do is shrug our shoulders and say "libudev failed for some reason, but
> we can't tell why".
> 

OK - fair enough... Just exhausting all options/excuses!

>> using the errno to dictate our algorithm.
>>
>>>
>>> This way, it's actually the handler thread who's 'the boss', as most of the
>>> signals from udevHandleCallback would be lost/NOPs.
>>>
>>> Would you agree to such an approach?
>>
>> At some point we get too fancy and think too hard about a problem
>> letting it consume us.  I think when I presented my idea - I wasn't
>> really understanding the details of how we consume the udev data. Now
>> that I have a bit more of a clue - perhaps the original idea wasn't so
>> good. If we receive multiple notifications that a device is ready to be
>> processed even though we could be processing it - how are we to truly
>> know whether we missed one or we really got it and udev was just
>> reminding us again.
> 
> udev doesn't remind us of anything, it's our event loop that does - if we don't
> pre-process the events somehow, we can't tell them apart and even if we did, it
> would be utterly unreliable since you'd have to include some kind of nonce and
> a hash to be really sure that it's a duplicate event and not a device that was
> un-plugged and re-plugged again in a very short period of time. So, this issue
> is not a matter of our communication with udev, it's about how good can we
> manage our discover-pull-process event algorithm.
> 

OK - that's just me and my shortcut interpretation of how we're notified.

>>
>> I'm not against looking at a different mechanism - the question then
>> becomes from your perspective is it worth it?
> 
> I think the ultimate question is how fast do we want to deliver a fix. Of
> course I'm open to explore more approaches, but we can do that even with a
> "preliminary" fix included (even though, honestly, once this is merged, I doubt
> anyone will find time to do any kind of experiments unless there's another
> breakage to fix in this code). So, based on what I just wrote, I'm inclined to
> say no, it really wouldn't be of much worth.

Can we win the race to fix the root-cause? I think we've explored our
alternatives thoroughly and am fine with libvirt making an adjustment
since we cannot rely on when something else is fixed. There are others
that have a different opinion. Still while it's MDEV this time, the last
time it was VHBA, and the next time it will be something different -
guaranteed. If we can provide a way to avoid our customers from having
to deal with "intermittent" and "unexplained" failures because we're
waiting for someone else to fix the problem, then I say we fix it and
move on. Especially when we can understand the problem. If eventually
they fix their problem, then our solution gathers cobwebs, but so what.

> 
> Even though I came with another proposal, I personally still like the previous
> version with one-by-one event version, since despite (perhaps) being a bit
> slower, it's a more transparent and consistent (minus the event counter)
> solution than the one I proposed the last time.
> 
> Erik
> 

I don't have that version paged into short term memory ;-)... If you
have a thread that processes events until udev_monitor_receive_device
returns NULL, then does that do the trick? Regardless of event counting?

John