[Libguestfs] [libnbd PATCH v3 06/22] states: Break deadlock if server goofs on extended replies

Fri Jul 21 14:04:13 UTC 2023

On 7/21/23 14:50, Eric Blake wrote:
> On Thu, Jun 01, 2023 at 03:42:30PM +0200, Laszlo Ersek wrote:
>> On 5/25/23 15:00, Eric Blake wrote:
>>> One of the benefits of extended replies is that we can do a
>>> fixed-length read for the entire header of every server reply, which
>>> is fewer syscalls than the split-read approach required by structured
>>> replies.
>>
>> (Totally tangential comment: recvmsg() could scatter the incoming
>> traffic into non-contiguous fields of a non-packed structure. But I
>> don't know if TLS has anything similar. The current "linear receive"
>> approach is probably the least demanding of the underlying socket
>> abstractions. At the same time it requires us to use packed structs,
>> and/or multiple syscalls.)
>>
>>> But one of the drawbacks of doing a large read is that if
>>> the server is non-compliant (not a problem for normal servers, but
>>> something I hit rather more than I'd like to admit while developing
>>> extended header support in servers),
>>
>> Haha, so this is where it's coming from! :) I can totally relate.
>>
>>> nbd_pwrite() and friends will
>>> deadlock if the server replies with the wrong header.  Add in some
>>> code to catch that failure mode and move the state machine to DEAD
>>> sooner, to make it easier to diagnose the fault in the server.
>>>
>>> Unlike in the case of an unexpected simply reply from a structured
>>
>> (1) s/simply/simple/
> 
> Yep.
> 
>>
>>> server (where we never over-read, and can therefore
>>
>> (2) At this point my English parser gets thrown off:
>>
>>> commit b31e7bac
>>> can merely fail the command with EPROTO and successfully move on to
>>> the next server reply),
> 
> Fixing the parenthetical as follows:
> 
> (where, since commit b31e7bac, we can merely fail the command with
> EPROTO and successfully move on to the next server reply, because we
> didn't read too many bytes yet)
> 
>>
>> resync here:
>>
>>> in this case we really do have to move to
>>> DEAD: in addition to having already read the 16 or 20 bytes that the
>>> server sent in its (short) reply for this command, we may have already
>>> read the initial bytes of the server's next reply, but we have no way
>>> to push those extra bytes back onto our read stream for parsing on our
>>> next pass through the state machine.
>>>
>>> Signed-off-by: Eric Blake <eblake at redhat.com>
>>> ---
>>>  generator/states-reply.c | 23 ++++++++++++++++++++++-
>>>  1 file changed, 22 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/generator/states-reply.c b/generator/states-reply.c
>>> index 4e9f2dde..d4710d91 100644
>>> --- a/generator/states-reply.c
>>> +++ b/generator/states-reply.c
>>> @@ -109,7 +109,28 @@  REPLY.START:
>>>   REPLY.RECV_REPLY:
>>>    switch (recv_into_rbuf (h)) {
>>>    case -1: SET_NEXT_STATE (%.DEAD); return 0;
>>> -  case 1: SET_NEXT_STATE (%.READY); return 0;
>>> +  case 1: SET_NEXT_STATE (%.READY);
>>> +    /* Special case: if we have a short read, but got at least far
>>> +     * enough to decode the magic number, we can check if the server
>>> +     * is matching our expectations. This lets us avoid deadlocking if
>>> +     * a buggy server sends only 16 bytes of a simple reply, and is
>>> +     * waiting for our next command, while we are blocked waiting for
>>> +     * the server to send 32 bytes of an extended reply.
>>> +     */
>>
>> (4) Slight inconsistency between commit message and code comment: the
>> former mentions "20 bytes", but the latter doesn't mention "structured
>> reply".
> 
> Did I miss (3)?  But yes, I can improve this comment.

My mistake, probably due to over-editing! :)

Thanks
Laszlo

> 
>>
>>> +    if (h->extended_headers &&
>>> +        (char *)h->rbuf >= (char *)&h->sbuf.reply.hdr.extended.flags) {
>>
>> (.) I wonder if (address-of-magic + size-of magic) might be more
>> readable / greppable. Just in case.
>>
>> Feel free to ignore.
> 
> Actually, I agree that it is nicer (although a bit longer).
> 
>>
>>> +      uint32_t magic = be32toh (h->sbuf.reply.hdr.extended.magic);
>>> +      if (magic != NBD_EXTENDED_REPLY_MAGIC) {
>>> +        SET_NEXT_STATE (%.DEAD); /* We've probably lost synchronization. */
>>> +        set_error (0, "invalid or unexpected reply magic 0x%" PRIx32, magic);
>>> +#if 0 /* uncomment to see desynchronized data */
>>> +        nbd_internal_hexdump (&h->sbuf.reply.hdr.extended.flags,
>>> +                              sizeof (h->sbuf.reply.hdr.extended.flags),
>>> +                              stderr);
>>> +#endif
>>> +      }
>>> +    }
>>> +    return 0;
>>
>> (5) This could be factored out to a helper function, to share the
>> "invalid:" label logic with the previous patch.
>>
>> (6) Commencing a dump from "flags" onward looks questionable. At this
>> point, the "flags" field need not to be complete, or even started -- all
>> we know is that the "magic" field *before" "flags" is complete.
>>
>> I think we should dump "h->sbuf.reply.hdr.simple", like in patch#5.
> 
> Yep, I noticed that while addressing your point (4), just before
> reading your point (6).  Bug dumping the full hdr.simple is also too
> much; really, given the preconditions above, all we can dump is
> hdr.magic.  At which point, the hexdump is not providing us any
> further information than what the set_error() call already output.
> I'm just deleting the #if 0 segment, which in turn eliminates the need
> to address point (5), as the two invalid: labels are no longer
> identical.
> 
>>
>> (.) Side comment (so no bullet number assigned): because we can take
>> multiple iterations on RECV_REPLY, we may end up checking the "magic"
>> field multiple times. Not very costly, just something to be aware of.
> 
> Indeed.  In the common case, short reads are rare, and even when they
> do happen, it is even more rare to hit it more than once per packet.
> I wonder if nc or similar has a way to force a server's response to be
> flushed after every byte, to see the performance impact of maximum
> network overhead.
> 
> But your observation means you are correctly aware of a larger design
> aspect of the state machine: all code handling short reads (case 1 of
> recv_into_rbuf) must be re-startable (no messing with h->rbuf, for
> example).  Stateful changes can only occur when the read is complete
> (case 0) or irretrievably failed (case -1).
> 
>>
>> (7) Assume that we have a short read first, but then complete
>> REPLY.RECV_REPLY successfully, and move to
>> REPLY.CHECK_SIMPLE_OR_STRUCTURED_REPLY.
> ...
>>
>> *HOWEVER*! :) If we do *not* see some short-but-together-long-enough
>> reads first, but see a full read at once, then our "pre-check" in
>> RECV_REPLY does not get activated at all! And then all the conditions in
>> CHECK_SIMPLE_OR_STRUCTURED_REPLY remain necessary.
>>
>> So this is my long-winded way to ask that:
>>
>>>    case 0: SET_NEXT_STATE (%CHECK_SIMPLE_OR_STRUCTURED_REPLY);
>>>    }
>>>    return 0;
>>
>> you please consider adding a comment here, saying that, while the short
>> read special case *does* short-circuit a number of checks under
>> CHECK_SIMPLE_OR_STRUCTURED_REPLY, all the checks in
>> CHECK_SIMPLE_OR_STRUCTURED_REPLY remain required if under RECV_REPLY we
>> get a full read immediately.
> 
> Good idea.  Added for v4.
>