[Libguestfs] nbdkit & qemu 2.12: qemu-img: Protocol error: simple reply when structured reply chunk was expected

Sat Mar 23 17:48:39 UTC 2019

Adding some cc's; my reply truncates, so if you want to catch up to the
full original mail, see
https://www.redhat.com/archives/libguestfs/2019-March/msg00124.html

On 3/23/19 12:00 PM, Richard W.M. Jones wrote:
> On Sat, Mar 23, 2019 at 09:52:41AM -0500, Eric Blake wrote:
>> On 3/23/19 7:58 AM, Richard W.M. Jones wrote:
>>> NB During this I discovered another (but minor) bug in qemu.  If you
>>> feed qemu a long series of block status replies then it eventually
>>> closes the connection.  However it does not print an error message.
>>>
>>
>> Which version(s) of qemu? Was REQ_ONE in force or not? Was it a long
>> series of status replies that could have been coalesced, or were they
>> alternating status? Did the replies extend beyond the client's original
>> request?
>>
>> I'd like to reproduce the failure, to determine if qemu needs more patches.
> 
> Here's how to reproduce it.
...
Thanks, that helps.

> with the small patch attached to this email on top.

> diff --git a/server/protocol.c b/server/protocol.c
> index ac6eb2c..d406dc8 100644
> --- a/server/protocol.c
> +++ b/server/protocol.c
> @@ -503,6 +503,9 @@ send_structured_reply_block_status (struct connection *conn,
>    if (blocks == NULL)
>      return connection_set_status (conn, -1);
>  
> +  nr_blocks = 32;
> +  blocks[0].length = htobe32 (4096);
> +
>    reply.magic = htobe32 (NBD_STRUCTURED_REPLY_MAGIC);
>    reply.handle = handle;
>    reply.flags = htobe16 (NBD_REPLY_FLAG_DONE);
> @@ -526,7 +529,7 @@ send_structured_reply_block_status (struct connection *conn,
>  
>    /* Send each block descriptor. */
>    for (i = 0; i < nr_blocks; ++i) {
> -    r = conn->send (conn, &blocks[i], sizeof blocks[i]);
> +    r = conn->send (conn, &blocks[0], sizeof blocks[0]);

So you hack the server into sending adjacent replies that were not
coalesced, and sending a reply that covers 128k of the image regardless
of how much length the client requested.  There are several ways in
which that can violate protocol (but hey - testing qemu resiliency is
WORTH temporary protocol violations):
 - If the client requested REQ_ONE, you sent too many extents (sadly,
qemu still uses REQ_ONE always, even though I'd like to lift that
restriction in the future; the rest of my bullets are applicable when
REQ_ONE is not in use)
 - If the client requested an offset that is less than 128k from
end-of-image, you sent a reply that exceeds the advertised image size
 - If the client requested a length < 127k, you sent too many extents

But there are other situations where your reply is valid:
 - If the client requested a length in [127k, 128k), your reply gave
more information than the client asked for, but the protocol allows that
because your extra information lies in the final extent when REQ_ONE is
not in use
 - If the client requested a length > 128k, your reply ended early, but
the protocol allows that

>   ./nbdkit -fv memory size=64M \
>       --run '/path/to/qemu-img convert $nbd /var/tmp/out'
> 
> You should see:
> 
> nbdkit: memory.0: debug: pread count=512 offset=0
> nbdkit: memory.1: debug: extents count=67108864 offset=0 req_one=1
> nbdkit: memory.3: debug: client sent NBD_CMD_DISC, closing connection
> nbdkit: memory.3: debug: exiting worker thread memory.3
> 

So the immediate NBD_CMD_DISC is reasonable, implying that the client
flagged one of the ways you could have violated protocol (but whether
qemu should just hang up, or should just ignore that particular reply
but still keep the connection alive, is a QoI question for qemu).

> and qemu-img exits with code 1, but there is no message printed.
> (You can use --run '... || echo FAIL' if you don't believe that
> qemu-img is exiting with a failure.)
> 
> While preparing this email I noticed that v3.1.0 printed an
> error message:
> 
>   qemu-img: Protocol error: invalid payload for NBD_REPLY_TYPE_BLOCK_STATUS

Which says that qemu used to be noisy at diagnosing the protocol error,
and now it is not.  Ideally, qemu as a client shouldn't encounter a
server that is violating protocol, but when it does, it makes sense for
qemu to call attention to the buggy server.  That sounds like a
regression in qemu worth fixing, so I'll whip up a patch.

> 
> So I have bisected this to the following commit:
> 
> d8b4bad846f08ff0f167b46dc156a5310b750484 is the first bad commit
> commit d8b4bad846f08ff0f167b46dc156a5310b750484
> Author: Vladimir Sementsov-Ogievskiy <vsementsov at virtuozzo.com>
> Date:   Fri Nov 2 18:11:52 2018 +0300
> 
>     block/nbd-client: use traces instead of noisy error_report_err

That's also helpful, as it pinpoints where we regressed at flagging the
protocol error.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20190323/498576dc/attachment.sig>