[libvirt] udevadm settle can take too long

Fri Apr 27 04:05:23 UTC 2012

[ CC to Cole ]

> Osier Yang wrote:
>> On 2012-04-24 03:47, Guido Günther wrote:
>>> Hi,
>>> On Sun, Apr 22, 2012 at 02:41:54PM -0400, Jim Paris wrote:
>>>> Hi,
>>>>
>>>> http://bugs.debian.org/663931 is a bug I'm hitting, where virt-manager
>>>> times out on the initial connection to libvirt.
>>>
>>> I reassigned the bug back to libvirt. I still wonder what triggers this
>>> though for some users but not for others?
>>> Cheers,
>>>   -- Guido
>>>
>>>>
>>>> The basic problem is that, while checking storage volumes,
>>>> virt-manager causes libvirt to call "udevadm settle".  There's an
>>>> interaction where libvirt's earlier use of network namespaces (to probe
>>>> LXC features) had caused some uevents to be sent that get filtered out
>>>> before they reach udev.  This confuses "udevadm settle" a bit, and so
>>>> it sits there waiting for a 2-3 minute built-in timeout before returning.
>>>> Eventually libvirtd prints:
>>>>    2012-04-22 18:22:18.678+0000: 30503: warning : virKeepAliveTimer:182 : No response from client 0x7feec4003630 after 5 keepalive messages in 30 seconds
>>>> and virt-manager prints:
>>>>    2012-04-22 18:22:18.931+0000: 30647: warning : virKeepAliveSend:128 : Failed to send keepalive response to client 0x25004e0
>>>> and the connection gets dropped.
>>>>
>>>> One workaround could be to specify a shorter timeout when doing the
>>>> settle.  The patch appended below allows virt-manager to work,
>>>> although the connection still has to wait for the 10 second timeout
>>>> before it succeeds.  I don't know what a better solution would be,
>>>> though.  It seems the udevadm behavior might not be considered a bug
>>>> from the udev/kernel point of view:
>>>>    https://lkml.org/lkml/2012/4/22/60
>>>>
>>>> I'm using Linux 3.2.14 with libvirt 0.9.11.  You can trigger the
>>>> udevadm issue using a program I posted at the Debian bug report link
>>>> above.
>>>>
>>>> -jim
>>>>
>>>> From 17e5b9ebab76acb0d711e8bc308023372fbc4180 Mon Sep 17 00:00:00 2001
>>>> From: Jim Paris<jim at jtan.com>
>>>> Date: Sun, 22 Apr 2012 14:35:47 -0400
>>>> Subject: [PATCH] shorten udevadmin settle timeout
>>>>
>>>> Otherwise, udevadmin settle can take so long that connections from
>>>> e.g. virt-manager will get closed.
>>>> ---
>>>>   src/util/util.c |    4 ++--
>>>>   1 files changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/src/util/util.c b/src/util/util.c
>>>> index 6e041d6..dfe458e 100644
>>>> --- a/src/util/util.c
>>>> +++ b/src/util/util.c
>>>> @@ -2593,9 +2593,9 @@ virFileFindMountPoint(const char *type ATTRIBUTE_UNUSED)
>>>>   void virFileWaitForDevices(void)
>>>>   {
>>>>   # ifdef UDEVADM
>>>> -    const char *const settleprog[] = { UDEVADM, "settle", NULL };
>>>> +    const char *const settleprog[] = { UDEVADM, "settle", "--timeout", "10", NULL };
>>
>> Though I don't have a good idea to fix it either, I guess this
>> change could cause "lvremove" to fail again for the udev race.
>>
>> See BZs:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=702260
>> https://bugzilla.redhat.com/show_bug.cgi?id=570359
>
> It seems that those bugs were caused by something like
>
> 1. open(lv, O_RDWR)
> 2. close(lv)
> 3. system("lvremove ...")
>
> where udev would fire off a command between 2 and 3 that caused 3 to
> fail.  Adding "udevadm settle" as step 2.5 is a good way to wait for
> that command to finish, but:
>
> - it doesn't necessarily fix the issue; something could easily re-open
>    the device between 2.5 and 3 and cause the same failure.

Right.

>
> - the race condition sounds like it was a short window, and sometimes
>    the original sequence would still work even without the settle.
>    That would suggest to me that a timeout of 10s is still plenty long.
>
> A few thoughts:
>
> - For lvremove: can we try a short timeout (3 seconds), then if the
>    lvremove still fails, try again with the default udevadm timeout
>    (120 seconds)?
>
> - Even in that case, we need to fix libvirtd to not kill the
>    connection after 30 seconds when it's libvirtd's fault that the
>    connection is blocked for so long anyway.

Perhaps we need a configurable timeout property for the client
connection, rather than hardcoding it to 30s.

>
> - When connecting with virt-manager, is the udevadm settle really
>    necessary?  We're not calling lvremove.

virt-manager's hang is most likely caused by the pool refresh, which
uses "udevadm settle" to wait for new devices to show up. So
it is not related to "lvremove".
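
For the lvremove path itself, the two-stage idea quoted above (settle
with a short timeout first, fall back to the default timeout only if
the removal fails) could be sketched roughly as below. This is only an
illustration of the control flow, not libvirt code: the callback types,
remove_with_escalating_settle() and the fake_* stand-ins are all
hypothetical names, and the 3s/120s values are simply the ones
suggested in the quoted mail.

```c
#include <stddef.h>

enum {
    SETTLE_SHORT_SECS = 3,     /* short first-try timeout suggested above */
    SETTLE_DEFAULT_SECS = 120  /* default udevadm timeout quoted above */
};

/* Hypothetical callbacks: one waits for udev, one runs the removal. */
typedef void (*settle_fn)(int timeout_secs, void *opaque);
typedef int (*remove_fn)(void *opaque);   /* returns 0 on success */

/*
 * Settle briefly and attempt the removal; only if that fails, settle
 * again with the long default timeout and try once more.
 * Returns 0 on success, otherwise the result of the last attempt.
 */
static int
remove_with_escalating_settle(settle_fn settle, remove_fn do_remove,
                              void *opaque)
{
    int rc;

    settle(SETTLE_SHORT_SECS, opaque);
    rc = do_remove(opaque);
    if (rc == 0)
        return 0;

    settle(SETTLE_DEFAULT_SECS, opaque);
    return do_remove(opaque);
}

/* Stand-ins for "udevadm settle" and "lvremove", so the control flow
 * can be exercised without touching any real device. */
struct fake_state {
    int failures_left;  /* removal attempts that should fail first */
    int settle_calls;   /* number of settle invocations seen */
    int last_timeout;   /* timeout passed to the latest settle */
};

static void
fake_settle(int timeout_secs, void *opaque)
{
    struct fake_state *st = opaque;
    st->settle_calls++;
    st->last_timeout = timeout_secs;
}

static int
fake_remove(void *opaque)
{
    struct fake_state *st = opaque;
    if (st->failures_left > 0) {
        st->failures_left--;
        return -1;
    }
    return 0;
}
```

In the common case this costs at most the short timeout; the long wait
is only paid when the first removal attempt has already failed.
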

Besides logical storage, the "disk", "scsi", and "mpath" storage
types use "udevadm settle" too, as does the node device driver.

Generally the pool refresh is invoked when libvirtd starts, and
of course it can also be invoked explicitly. :-) I.e.
virt-manager can't hang if it doesn't intend to refresh the
pool. And thus I guess the situation will be much worse if pools
of "disk", "logical", "scsi", and "mpath" all exist together.

I'm wondering whether virt-manager tries to refresh the pools when
it starts, or only when the user explicitly requests to "check
storage" (e.g. by clicking some button). It should be improved if
it's the first case IMHO (letting the user get the connection first,
and refreshing the pool only when necessary, would be better).

I'd agree that introducing a timeout argument for "udevadm
settle" would be better, but hardcoding a timeout in
"virFileWaitForDevices" is not good: as we can see, it's used in
many places, and what the proper timeout is for each of them is
an open question.
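
One way to avoid a single hardcoded value would be to let each caller
pass its own timeout. A minimal sketch, assuming a hypothetical helper
build_settle_argv() (not an existing libvirt function) that assembles
the command line; the first argument stands in for the UDEVADM path
used in the patch, and a non-positive timeout means "use udevadm's own
default":

```c
#include <stdio.h>

#define SETTLE_ARGV_LEN 4

/*
 * Fill argv with { udevadm, "settle", "--timeout=N", NULL }.
 * If timeout_secs <= 0 the option is left out, so udevadm falls back
 * to its built-in default. buf provides storage for the option string.
 * Returns 0 on success, -1 if buf is too small.
 */
static int
build_settle_argv(const char *udevadm, int timeout_secs,
                  char *buf, size_t buflen,
                  const char *argv[SETTLE_ARGV_LEN])
{
    size_t i = 0;

    argv[i++] = udevadm;
    argv[i++] = "settle";
    if (timeout_secs > 0) {
        int n = snprintf(buf, buflen, "--timeout=%d", timeout_secs);
        if (n < 0 || (size_t)n >= buflen)
            return -1;
        argv[i++] = buf;
    }
    argv[i] = NULL;
    return 0;
}
```

Each user of the wait (logical, disk, scsi, mpath pools, node device
driver) could then choose a value that suits it, which still leaves
open what those values should be.
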

On the other hand, even if small timeouts are introduced, it's
still possible to hang for a long time while "checking storage"
(refreshing all the pools together). I don't have much of an idea
how to get rid of that entirely. But how about refreshing only the
pool selected by the user? The maximum waiting time in that case
would be 2 min, rather than ($num_of_pools * 2) min.
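
To make the arithmetic concrete, here is a tiny sketch of the
worst-case wait. It assumes one "udevadm settle" call per refreshed
pool and the 120 second default quoted earlier in the thread; the
function name is made up for illustration.

```c
/* Default "udevadm settle" timeout in seconds, as quoted in the thread. */
#define DEFAULT_SETTLE_TIMEOUT_SECS 120u

/*
 * Upper bound on the time spent waiting for udev, assuming every
 * refreshed pool triggers exactly one settle that runs into its timeout.
 */
static unsigned int
worst_case_refresh_wait(unsigned int pools_refreshed,
                        unsigned int settle_timeout_secs)
{
    return pools_refreshed * settle_timeout_secs;
}
```

With four pools refreshed together the bound is 480 seconds; refreshing
only the selected pool keeps it at 120 seconds.
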

@cole, any thoughts?

Regards,
Osier