[Libguestfs] [PATCH nbdkit 5/5] New filter: evil

Wed May 17 13:36:31 UTC 2023

On Wed, May 17, 2023 at 03:28:11PM +0200, Laszlo Ersek wrote:
> On 5/16/23 14:12, Richard W.M. Jones wrote:
> 
> > +/* This is the heart of the algorithm, the function which corrupts
> > + * the buffer after reading it from the plugin.
> > + *
> > + * The observation is that if we have a block of (eg) size 10^6 bits
> > + * and our probability of finding a corrupt bit is (eg) 1/10^4, then
> > + * we expect approximately 100 bits in the block to be corrupted.
> 
> This is a binomial distribution (in the example, n=10^6, p=10^(-4)):
> 
> https://en.wikipedia.org/wiki/Binomial_distribution
> 
> The expected value is E = n*p = 10^2, so the comment looks correct in
> that regard.
> 
> There is a different interpretation for "we expect" as well. Namely, the
> "mode" -- the most probable outcome, the number of corrupted bits we're
> most probably going to see. That's a different concept
> <https://en.wikipedia.org/wiki/Mode_(statistics)>. However, per the WP
> article, for the binomial distribution, it is essentially the same.
> 
> Namely, the WP article says, for this distribution, the mode M (an
> integer) is given uniquely by
> 
>   (n + 1) * p - 1 <= M < (n + 1) * p                [1]
> 
> if ((n + 1) * p) is *not* an integer, and the M1 and M2 (equally
> probable, integer) modes
> 
>   M1 = (n + 1) * p                                  [2]
>   M2 = (n + 1) * p - 1                              [3]
> 
> if ((n + 1) * p) *is* an integer.
> 
> Now,if we distribute the multiplications over the addends (break the
> parens open), we get:
> 
>   n * p + p - 1 <= M < n * p + p                [1]
>   M1 = n * p + p                                [2]
>   M2 = n * p + p - 1                            [3]
> 
> But for the binomial distribution, E = n * p, so we can substitute (also
> rewriting +(p-1) as -(1-p) in [1]):
> 
>   E - (1 - p) <= M < E + p                [1]
>   M1 = E + p                              [2]
>   M2 = E + p - 1                          [3]
> 
> Using the values from the code comment:
> 
>   n = 1,000,000
>   p = 0.000,1
>   E = n * p = 100
> 
>   (n + 1) * p = 100.000,1 --> not an integer, so [1] applies:
> 
>   E - (1 - p) <= M < E + p                [1]
>   99.000,1    <= M < 100.000,1
> 
> Therefore the mode M is 100 -- it equals the expected value E, for the
> particular "n" and "p" values!
> 
> In general, they need not be equal. (For example, the mode(s) are always
> integers, but E need not be:if we change n to 1,000,001 then E is
> 100.0001, which is clearly impossible to interpret as a number of
> corrupted bits. It's effectively an average, but never directly produced
> by the distribution as a value.)
> 
> Either way, my point is that M, M1 and M2 (from [1], [2], [3]) are very
> close to E in the general case too, so the "we expect" language applies
> even when we interpret it as "mode", not as "expected value".
> 
> > + *
> > + * For stuck bits we want the corrupted bits to be the same on each
> > + * access, either relative to the backing disk (STUCK_BITS) or to the
> > + * request (STUCK_WIRES).
> > + *
> > + * Instead of creating an expensive bitmap ahead of time covering the
> > + * whole disk, we can use the random number generator with a fixed
> > + * seed derived from the offset of the start of the block.  We can
> > + * then choose a random number uniformly in the range [0..2*(1/P)] (in
> > + * the example [0..2*10^4]) as the distance to the next corrupt bit.
> > + * We jump forwards, corrupt that bit, and repeat until we reach
> > + * the end of the block.
> 
> Two points:
> 
> (1) I figure the idea is that the "average distance" will be 1/p bits --
> the expected value being (0 + 2*(1/p))/2 == 1/p -- so we expect this
> "average distance" to fit n/(1/p) = n*p times in the block size.
> 
> This looks sane, but I'm unequipped to make any arguments about
> "expected value".
> 
> According to wikipedia, this is the Irwin-Hall distribution
> <https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution>: "the
> sum of a number of independent random variables, each having a uniform
> distribution". The individual random variables need to follow continuous
> (not discrete) uniform distributions though, so I don't know if
> Irwin-Hall "applies" here at all.
> 
> Wikipedia says that, as "n" increases -- that is, as we sample our
> underlying uniform U(0, 1) distribution more and more times, and add up
> the results --, the distribution of the sum (= the Irwin-Hall
> distribution) approximates a Normal distribution, with both the epxected
> value and the mode being (s/2), where "s" is the number of samples
> (variables) summed.
> 
> I don't know what happens if we scale the underlying U(0, 1)
> distribution's domain by 2/p, so that it becomes U(0, 2/p). I guess we'd
> *hope* that the expected value of the sum scales similarly, to:
> 
>   E = s/2 * 2/p = s/p
> 
> Because in that case, we coul wish for the expected value (and mode) of
> the sum of the jumps to match our block size, that is:
> 
>   E = s/p = n
> 
> or put differently,
> 
>   s = n*p
> 
> Meaning that we'd expect having to sum as many "samples", i.e. having to
> take as many "jumps", in the average case, as n*p is (= 100 jumps with
> our numbers).
> 
> But this is entirely hand-waving; I don't know if this holds up at all!

Yes I don't think the bit picking method I chose gets the right
distribution exactly.  However it has the advantage of being
(a) reproducible without needing to store any state except for the
small random number generator state, and (b) easy enough to implement.

> (2) I think the interval as written is off-by-one. The left hand side
> (0) should indeed be inclusive, but the RHS 2*(1/p) should be
> *excluded*. We're supposed to have 20,000 integers in the set, from 0 to
> 19,999 inclusive.

This is actually what we use - I'll update the comment.

> > +# This will give an approximate count of the number of set bits.
> > +
> > +zbytes="$( od -A n -w1 -v -t x1 < $f | sort | uniq -c |
> > +           $SED -n -E -e 's/([0-9]+)[[:space:]]+00[[:space:]]*$/\1/p' )"
> > +nzbits=$(( 10000000 - zbytes )); # looks wrong but actually correct ...
> 
> Some explanation on this would be nice.

In v2 I killed this completely and use some Python code instead.
(That didn't fix the bias I was observing in this test, and nor did
adding lots of debugging, so I still don't understand that.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top