[libvirt] [PATCH] nwfilter: increase pcap buffer size to be compatible with TPACKET_V3

Thu Apr 26 16:20:01 UTC 2018

On Thu, Apr 26, 2018 at 5:55 PM, Laine Stump <laine at laine.org> wrote:

> On 04/26/2018 04:38 AM, Daniel P. Berrangé wrote:
>
> On Thu, Apr 26, 2018 at 08:09:47AM +0200, Christian Ehrhardt wrote:
>
> On Wed, Apr 25, 2018 at 11:25 PM, Laine Stump <laine at laine.org> <laine at laine.org> wrote:
>
>
> When an nwfilter rule sets the parameter CTRL_IP_LEARNING to "dhcp",
> this turns on the "dhcpsnoop" thread, which uses libpcap to monitor
> traffic on the domain's tap device and extract the IP address from the
> DHCP response.
>
> If libpcap on the host is built with TPACKET_V3 defined, the dhcpsnoop
> code's initialization of the libpcap socket fails with the following
> error:
>
>   virNWFilterSnoopDHCPOpen:1134 : internal error: pcap_setfilter: can't
> remove kernel filter: Bad file descriptor
>
> It turns out that this was because libpcap with TPACKET_V3 defined
> requires a larger buffer size than libvirt was setting (we were
> setting it to 128k). Changing the buffer size to 256k eliminates the
> error, and the dhcpsnoop thread once again works properly.
>
> Thanks to Christian Ehrhardt <paelzer at gmail.com> <paelzer at gmail.com> for discovering that
> buffer size was the problem.
>
> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1547237
> Signed-off-by: Laine Stump <laine at laine.org> <laine at laine.org>
> ---
>  src/nwfilter/nwfilter_dhcpsnoop.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/nwfilter/nwfilter_dhcpsnoop.c b/src/nwfilter/nwfilter_
> dhcpsnoop.c
> index 6069e70460..62eb617515 100644
> --- a/src/nwfilter/nwfilter_dhcpsnoop.c
> +++ b/src/nwfilter/nwfilter_dhcpsnoop.c
> @@ -259,7 +259,7 @@ struct _virNWFilterDHCPDecodeJob {
>   * libpcap 1.5 requires a 128kb buffer
>   * 128 kb is bigger than (DHCP_PKT_BURST * PCAP_PBUFSIZE / 2)
>   */
>
>
> I just started building with the change for a few tests on this - no
> results yet.
>
> But we are all puzzled/unsure enough on the size that I'd already ask to
> modify the comment above to explain the new size.
>
> Maybe we should explain:
> - why 128 isn't enough
> - why you chose "only" 256
> - why the default size might be too big
> - your size considerations for many guest scenarios
>
>
> Okay, so I looked into this in more detail. I don't know that I'm totally
> at the bottom of it, but this is what I found in libpcap:
>
> * The function create_ring() calls setsockopt(fd, SOL_PACKET,
> PACKET_RX_RING, &req ...) to setup a ring buffer for receiving packets.
>
> * one of the attributes of req is tp_frame_size, and another is
> tp_frame_nr. tp_frame_nr is set to the user-supplied (or default) buffer
> size (for us it was 128k) / tp_frame_nr.
>
> * if TPACKETV3 isn't used, tp_frame_size is set to (roughly, there is some
> aligning done, and a bit additional added for one ethernet header) the
> user-supplied snaplen (in our case 576). So tp_frame_nr is going to be
> something > 200.
>
> * if TPACKET3 *is* used, tp_frame_size is set to a predefined constant
> MAXIMUM_SNAPLEN, which is 262144. This means that tp_frame_nr will be
> 131072/262144, which is 0.
>
> So the result will be that setsockopt() is called requesting a ring buffer
> that can read 0 frames. I didn't trace through the call to see exactly what
> happens, but it obviously can't be good. (Right after that, tp_frame_nr is
> also used in a call to malloc a buffer to hold pointers to frame headers,
> and that of course would also fail if we ever got that far. I'm assuming we
> don't.
>
> Why are the frames in the ring buffer so large for TPACKETV3? The code
> says this:
>
>    /* The "frames" for this are actually buffers that
>     * contain multiple variable-sized frames.
>     *
>     * We pick a "frame" size of 128K to leave enough
>     * room for at least one reasonably-sized packet
>     * in the "frame".
>     */
>
> This all explains why changing the buffer size to 256k works for us - that
> gets tp_frame_nr to 1 (just barely!) so there is no multiplying by 0. And
> because multiple packets are received into this single "frame", we still
> operate correctly.
>
> However, this all seems very arbitrary - there's an API for setting the
> size of a buffer, but it's not defined anywhere in the API what are the
> units of measure (that's not an exactly accurate description, but I'm too
> tired to think of the right way to say it). pcap-int.h itself even admits
> this in the comment above its definition of MAXIMUM_SNAPLEN:
>
>    /*
>     * Maximum snapshot length.
>     *
>     * Somewhat arbitrary, but chosen to be:
>     *
>     * [blah blah blah details details...]
>     */
>     #define MAXIMUM_SNAPLEN       262144
>
>
> Also, note that the comment where MAXIMUM_SNAPLEN is used in pcap-linux.c
> says that it is 128k, but the actual definition sets it at 256k. :-/
>
> This gives me 0 confidence that any value we choose would continue to be
> viable in the future.
>
> That will help the next one stumbling over this code.
>
> FWIW, having just checked libpcap git history, the current 2 MB default
> size has been there since 2008 !   I'm guessing we don't use that as we
> don't want to consume lots of RAM when many guests are running.
>
>
>
> Yeah, I'm guessing this is why Stefan chose to set the buffer size.
>
> In the case of static IP (the learnIPAddress codepath) we open the pcap
> socket, get a single packet that has an IP address, and then close the pcap
> socket, so we don't normally have more than a few around at any time, and
> the size of the buffer is a non-issue so it's left at the default.
>
> For the case of dhcp snooping, though, the handle remains open for the
> entire lifetime of the domain (I guess in case the address is released and
> re-requested and a different address is handed out the next time). So if
> we're using the default 2MB buffer, there would be 2MB for each client. On
> one hand, for a system with 100 clients, that's 200MB of memory. On the
> other hand, any system that's miniscule in relation to the total amount of
> memory (it might reduce the capacity of a very large host by one guest at
> most).
>
> (NB: I checked this out by putting VIR_WARNs in the source at strategic
> places, and found that we are actually keeping open *TWO* pcap sockets for
> each domain that has CTRL_IP_LEARNING=dhcp. We should probably look into
> that...)
>
> So if everyone is okay with that extra memory usage, I'd suggest Christian
> could send his original patch (deleting the pcap_set_buffer_size()) to the
> list (but including the comments from my patch about the exact error
> message, Resolves: blah.bugzilla.blah, etc).
>

I can do so if "everyone is okay with that extra memory usage" applies.
So waiting for opinions on that for now.

> (Or if you don't have time, I can send it and credit you in the commit
> log).
>
> > I recently see more and more Resolves: instead of "Fixes:" did we
> change the recommended format for some tools and I missed it?
>
> Who is "we"? :-) The overwhelming majority of these tags in libvirt commit
> logs have always been Resolves:
>
> laine at vhost2 ~/devel/libvirt (master*)>git log | grep "Resolves: h" | wc
>     508    1017   33340
> laine at vhost2 ~/devel/libvirt (master*)>git log | grep "Fixes: h" | wc
>       9      18     654
>
> I do notice that the few Fixes: are for bugs on launchpad (with one
> exception), and the Resolves: are all bugzilla.redhat.com and a few
> opensuse.org/suse.com
>

-- 
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20180426/568e232b/attachment-0001.htm>