[Libguestfs] [PATCH nbdkit 1/5] Add new public nbdkit_parse_probability function

Wed May 17 14:08:06 UTC 2023

On 5/16/23 21:35, Eric Blake wrote:
> On Tue, May 16, 2023 at 07:20:16PM +0200, Laszlo Ersek wrote:
>> On 5/16/23 14:12, Richard W.M. Jones wrote:

>>> +    else if (d2 == 0)              /* N/0 is bad */
>>> +      goto bad_parse;
>>> +    else
>>> +      d /= d2;
>>
>> I don't want to be a spoilsport, but this division makes me extra
>> nervous [1], on top of parsing floating point *at all* [2].
>>
>> [2] Eric has just posted a 19-part series, v2, for QEMU, for fixing the
>> numerous issues in QEMU's floating point parser. Personal opinion: QEMU
>> doesn't even have a *valid reason* for parsing floating point! All QEMU
>> deals with are *integer quantities* such as disk images consisting of
>> whole numbers of bytes! But again, that's just a personal opinion.
> 
> No offense taken.

To clarify, I didn't believe it could in any way offend you, technically
speaking; I thought my opinion could perhaps be considered offensive by
the author of the original QEMU feature, i.e. by the developer who had
introduced floating point parsing originally to QEMU (I don't know who
that developer was).

You're "only" fixing bugs in an existing feature (and improving its test
coverage). I'm challenging the existence of the entire feature. I do
agree that *IF* the feature exists, it should be reasonably bug-free.

> And while you are correct that my v2 series for
> qemu clocked in at 19 patches, it doesn't necessarily translate to 19
> bugs fixed (although it was definitely more than one), nor were they
> all in floating point (one of the bugs was in qemu_strtoui() which is
> integer only).  In short, my biggest time sink there was fixing
> technical debt on needing a LOT more unit tests to make sure I wasn't
> inadvertently breaking some corner case (and with strtod() or
> scanf("%g"), there are a LOT of those).
> 
> Still, being able to write 1.5G rather than 1610612736 explains why
> qemu goes to the bother of attempting to allow certain floating point
> strings.

This is actually the core mistake, in my opinion. Preferring "one and
half gigabytes" rather than "one billion, six hundred ten million, six
hundred twelve thousand, seven hundred thirty six bytes", is understandable.

What is *not* understandable, knowing the pitfalls of floating point
(!), why the former form was equated with "parsing floating point".

Note: I wrote "one and half gigabytes", not "one point five tenths
gigabytes". Huge difference! In everyday speech, we use super simple
fractions, with both numerator and denominator being very small natural
numbers. And especially with multipliers such as K, M, G, T, which are
powers of 2 (10, 20, 30, 40, respectively), I would argue that no
denominators other than 2, 4 and maybe (!) 8 matter! (Who has ever said,
"give me two thirds of a gigabyte"? Or "five sevenths"?)

I totally don't see a user entering, as disk size, "1.234567G".

I don't even see them entering "1.875G" ("one point eight hundred
seventy five thousandths gigabytes"). At best I can see them say "two
gigs minus an eighth gig", or "one plus seven eighths gig", but even for
that, I need to squint.

We don't actually *think* about sizes in numbers that have many digits
after the decimal point. What we do is, we think in simple fractions,
and have just learned how those simple fractions *look* when written as
decimal numbers. So adding a general floating point parser was overkill;
it must have looked attractive only because the complications of
floating point were under-estimated.

So QEMU should have added the following formats instead:

- a[s]
- b/c[s]
- a+b/c[s]

where "a" is your usual integer, [s] is an (optional) suffix like K or
G, c may be 2, 4 or 8 (nothing else), and b may be [0 .. c-1]. If the
suffix is omitted, it is taken as "byte", and then the second and third
forms are not permitted (no fractions of bytes).

Of course, hindsight is 20/20. I had learned to stay away from floating
point very early on; topics in university like bounding errors and
non-associativity of floating point addition made me look for articles
on the web, and those had permanently convinced me that floating point
is *unusable* unless you are able and willing to do complete numerical
analysis in your C source code.

In practice, C programmers don't even remember the integer promotions
and the usual arithmetic conversions. :/

[...]

>> (Note the round parens in the footnote: it means the boundaries are
>> *excluded*. So, a value like -0.5 will be truncated to zero, and a value
>> like UINT64_MAX + 0.5 will be truncated to UINT64_MAX, but (-1) itself
>> and (UINT64_MAX + 1) itself are not permitted already.)
> 
> I challenge you to find a system where 'double' can hold the value
> UINT64_MAX + 0.5 with sufficient precision (IEEE double only requires
> 51 bits of precision).

This was just an example (independent of actual fp precision) to
illustrate the meaning of the round parens in the footnote. "long
double" would work equally well for the example.

> In fact, some of the back-story of my work on
> qemu's floating point parser was an earlier bug report, raised by
> nbdkit's unit testing, when we discovered that qemu was rounding large
> values (because of its trip through 'double') rather than matching an
> integral parse (found when qemu was unable to produce large sparse
> images as large as nbdkit because of the lost precision).
> Semi-related, just last week, I also fixed a bug in glib where it was
> assuming that the round trip (uint64_t)(long double)(uint64_t)value
> would not lose precision, which happens to be the case on x86_64 but
> is not true on other hardware;

I remember :)

> but here we're using sscanf with 'double' not 'long double'.

Again: yes. This was only a tangent. I meant to quote the fp->integer
conversion rule, and then I thought the related footnote deserved
additional (but independent) reasoning. By that point I didn't feel tied
to "double" any longer, really.

[...]

>> If Eric is OK with this patch set, I won't try to block it, it just
>> makes me feel very uncomfortable. It's more than just scanf() lacking
>> proper internal error handling; the real complications come when we
>> perform operations on the floating point values.
> 
> Our pre-existing error filter is already doing percentage calculations
> via floating point, so on the grounds that this patch is consolidating
> common code patterns to avoid reimplementation, we're good.

Agreed.

> But as you
> say, representing things as a ratio between two integers may be more
> maintainable than floating point where accuracy matters, or this may
> be a case where we are explicit that use of the floating point
> fractions as a percentage is intentionally liable to rounding issues.

I think that, for using floating point, we should:

- error-check any conversion from string to binary (object) representation,

- bound the parsed values as early as possible, with an eye towards
later use,

- in each subsequent operation, show that the result will be in range,
and (hopefully) will not see catastrophic loss of precision.

I don't insist on this; it's just what I could explicitly ACK.

> Still, avoiding undefined behavior when the division can overflow
> (whether that be the overflow to floating point infinity or even the
> overflow beyond UINT64_MAX), does make it worth questioning if
> floating point is our best representation, or at a minimum if we want
> to be stricter in what we parse before passing a substring on to
> strtod() so that we have more control over the values (part of my 19
> patches to qemu was explicitly truncating any string at 'e' or 'E' so
> that strtod() could not do an exponent parse that would inadverently
> hit overflow to INFINITY).

*shudder*. :)

I didn't read your patches in detail BTW. I went through the subjects in
order, and looked at some details. I could tell it required an immense
amount of thinking and work. Dealing with (unsigned!) integers is less
work, IMO.

Thanks!
Laszlo