[libvirt] [PATCH v1] libvirtd: Increase NL buffer size for lots of interface

Laine Stump laine at laine.org
Wed Dec 16 18:33:22 UTC 2015


On 12/16/2015 10:24 AM, Michal Privoznik wrote:
> On 10.12.2015 07:34, Leno Hou wrote:
>> 1. When switching CPUs to offline/online in a system more than 128 cpus
>> 2. When using virsh to destroy domain in a system with more interface
>>
>> All of above happens nl_recv returned with error: No buffer space available.
>> This patch set socket buffer size to 128K and turn on message peeking for nl_recv,
>> as this would solve this problem totally and permanetly.
>> LTC-Bugzilla: #133359 #125768

Apparently "LTC-Bugzilla" refers to a bugzilla instance for the Linux 
Technology Center at IBM, but I don't see how to get to it. References 
to bug reports are always nice in a patch, because it makes it much 
easier to spelunk the discussion leading up to the fix (now and later), 
but they can't be included unless they point to a publicly accessible 
report. Is LTC-Bugzilla (and in particular, these two reports) publicly 
accessible? If so, a clickable link would be better.

>>
>> Signed-off-by: Leno Hou <houqy at linux.vnet.ibm.com>
>> Cc: Wenyi Gao <wenyi at linux.vnet.ibm.com>
>> ---
>>   src/util/virnetlink.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c
>> index 679b48e..c8c9fe0 100644
>> --- a/src/util/virnetlink.c
>> +++ b/src/util/virnetlink.c
>> @@ -696,6 +696,14 @@ virNetlinkEventServiceStart(unsigned int protocol, unsigned int groups)
>>           goto error_server;
>>       }
>>   
>> +    if (nl_socket_set_buffer_size(srv->netlinknh, 131702, 0) < 0) {

The above function doesn't exist in libnl 1.1 (still used in 
RHEL6/CentOS6, for example), so that would cause a build failure on some 
systems. In libnl 1.1 the function is called nl_set_buffer_size().

Also, how did you arrive at 128k for the default buffer size? What kind 
of sizes are you seeing?


>> +        virReportSystemError(errno,
>> +                "%s",_("cannot set netlink socket buffer size to 128k"));
>> +        goto error_server;
>> +    }
>> +
>> +    nl_socket_enable_msg_peek(srv->netlinknh);
>> +

According to a link I followed from another message on this topic last 
week, libnl's message peeking can't be guaranteed to always work, 
because netlink doesn't always return the proper buffer size (depending 
on version).


>>       if ((srv->eventwatch = virEventAddHandle(fd,
>>                                                VIR_EVENT_HANDLE_READABLE,
>>                                                virNetlinkEventCallback,
>>
> I believe this patch appears over and over again. Usually, the problem
> was in libnl library we use and this was just a workaround. Can you test
> with the latest libnl version (probably even GIT HEAD) and see if that
> helps?

I had the same memory. So I just looked back through the history of bug 
reports about this issue, and found the following:

* libnl-1.1 and libnl-3 both originally set the default message 
buffersize to 4096 bytes, with MSG_PEEK turned off.

* when this problem came up in RHEL6, it was unfortunately reported as a 
private BZ (a pet peeve of mine), and the result of the discussion about 
it was that libnl-1.1 (the version used in RHEL6) was patched *upstream* 
to set the default message buffersize to 16384 bytes (getpagesize() * 
4), which would solve the problem for even very large numbers of VFs. 
That was in 2013 and there have been no further reports against RHEL6.

* Although I had assumed the problem was solved, it again came up in 
RHEL7 (which uses libnl-3 - a slightly different API, and maintained in 
a separate git repo), this time in a public BZ:

   https://bugzilla.redhat.com/show_bug.cgi?id=1040626

I asked if perhaps the change that had been made upstream in libnl-1.1 
hadn't also been made to libnl-3 (this is what I assumed during the 
previous incident). It hadn't. So the same change was made for libnl-3, 
both upstream and as a backport to RHEL7, and everyone was happy.

I have very little detailed memory of that time (the above was all 
recalled by looking at archives of discussions) but what had stuck in my 
mind was "This problem has been fixed in libnl, so libvirt should NOT 
put in "workarounds" for broken versions of libnl."

But if you are using a version of libnl3 with this patch (which was in 
libnl-3.2.22 upstream, and is in the libnl-3.2.21 that's in RHEL7.0+), :

https://github.com/tgraf/libnl/commit/807fddc4cd9ecb12ba64e1b7fa26d86b6c2f19b0

then the change to quadruple the buffer size in libnl was insufficient, 
(and also, when I looked back at the discussion now, I see that the 
libnl maintainer had said "The permanent fix would be for libvirt to 
enable message peeking", so I suppose it's time to "bite the bullet" and 
enable netlink message peeking in libvirt (but, since there are 
apparently versions of netlink that don't properly inform libnl when a 
re-read is necessary, we also need to increase the default buffer size).

However, your patch is only fixing the problem in one place. There are 
several places that we allocate netlink sockets, and they should all get 
the same fix, implying that there should be a common function called by 
all three. Fortunately, we already have a macro called "virNetlinkAlloc" 
that is #defined differently depending on the libnl version - this macro 
can simply be made into a static function that is defined differently 
depending on libnl version. It can call nl_handle_alloc or 
nl_socket_alloc depending on libnl version, then call 
nl_socket_set_buffer_size/nl_set_buffer_size depending on version, and 
finally call nl_socket_enable_msg_peek.

A bit of due diligence about the default buffer size is in order though 
- just to make sure that we don't open a ton of netlink sockets at the 
same time, each with an unnecessarily huge buffer.




More information about the libvir-list mailing list