[libvirt] Entering freeze for libvirt-3.5.0

Thu Jul 6 10:18:55 UTC 2017

On Mon, Jul 03, 2017 at 05:20:13PM +0400, Roman Bogorodskiy wrote:
>   Ján Tomko wrote:
> 
> > [cc: Guido]
> > 
> > On Sat, Jul 01, 2017 at 02:18:58PM +0400, Roman Bogorodskiy wrote:
> > >  Andrea Bolognani wrote:
> > >> virnetsockettest also fails pretty often for me, certainly
> > >> more than your figure; even if that wasn't the case, 1/5
> > >> failure rate is way too high for a CI job.
> > >
> > >I played a little more with virnetsockettest to get real stats and
> > >figured the following:
> > >
> > > 1. On my desktop (i5) and laptop (i3), I didn't get any failures in 50
> > > 'check' runs
> > > 2. On a VM that I use to run test builds in Jenkins, out of 50 runs it
> > > fails from 1 to 6 times; I did this test a couple of times and either I
> > > was lucky or failure rate is higher when my Jenkins perform regular
> > > builds.
> > >
> > >Anyway, I'll try to find a way to debug what's going on with
> > >virnetsockettest.
> > >
> > 
> > IIRC Debian disabled this test years ago.
> > 
> > Guido, have you ever discovered the cause?
> > 
> > Jan
> 
> I made some experiments on the weekend, and here are my results:
> 
> On a box where test fails from time to time, it fails at this point:
> 
>     virObjectUnref(csock);
> 
>     for (i = 0; i < nlsock; i++) {
>         if (virNetSocketAccept(lsock[i], &ssock) != -1 && ssock) {
>             char c = 'a';
>             if (virNetSocketWrite(ssock, &c, 1) != -1 &&
>                 virNetSocketRead(ssock, &c, 1) != -1) {
>                 VIR_DEBUG("Unexpected client socket present"); <--- HERE
>                 goto cleanup;
>             }   
>         }   
>         virObjectUnref(ssock);
>         ssock = NULL;
>     }  
> 
> On a box where this test never fails, it reaches this block, but:
> 
>  * virNetSocketWrite(ssock, &c, 1) != -1
>  * virNetSocketRead(ssock, &c, 1) == -1
> 
> It's enough to make the test pass. On a failing box both Write() and Read()
> return != -1 when the test fails.

We discussed this on IRC and what is happening here is a race condition.

The test suite is connecting a client to the server and then immediately
closing the client connection. When the server tries to accept the
client, usually it'll get -1 because the client has already gone away.
Socket termination is a multi-stage process at the network layer, and
so there is a non-zero chance that Accept will succeed. The test suite
is assuming that if the accept did succeeed, then we'll get I/O error
on read & write, but this is also not actually guaranteed. It is
possible that write may suceed buffering output, and read may simply
see '0' for EOF. When this happens the test suite will fail. Roman
debugging a failing run & confirmed this is exactly what's happening.

IOW, this test suite is just plain broken. It needs rewriting to do
a more sensible real world test. ie spawn a thread to act as the
server, and have the server just read from the client & echo it
back to the client. The main thread acts as the client and tests
this echo'ing. This is race free and is an real-world example of
usage.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|