[libvirt] [PATCH] Add some docs about the RPC protocol and APIs

Daniel P. Berrange berrange at redhat.com
Thu Aug 11 15:12:47 UTC 2011


From: "Daniel P. Berrange" <berrange at redhat.com>

* remote.html.in: Remove obsolete notes about internals of the
  RPC protocol
* internals/rpc.html.in: Extensive docs on RPC protocol/API
* sitemap.html.in: Add new page
---
 docs/internals/rpc.html.in |  876 ++++++++++++++++++++++++++++++++++++++++++++
 docs/remote.html.in        |   45 ---
 docs/sitemap.html.in       |    4 +
 3 files changed, 880 insertions(+), 45 deletions(-)
 create mode 100644 docs/internals/rpc.html.in

diff --git a/docs/internals/rpc.html.in b/docs/internals/rpc.html.in
new file mode 100644
index 0000000..761832a
--- /dev/null
+++ b/docs/internals/rpc.html.in
@@ -0,0 +1,876 @@
+<html>
+  <body>
+    <h1>libvirt RPC infrastructure</h1>
+
+    <ul id="toc"></ul>
+
+    <p>
+      libvirt includes a basic protocol and code to implement
+      an extensible, secure client/server RPC service. This was
+      originally designed for communication between the libvirt
+      client library and the libvirtd daemon. It is also
+      used for communication to the virtlockd daemon and (soon)
+      for the libvirt_lxc controller process. This document
+      provides an overview of the protocol and structure / operation
+      of the internal RPC library APIs.
+    </p>
+
+
+    <h2><a name="protocol">RPC protocol</a></h2>
+
+    <p>
+      libvirt uses a simple, variable length, packet based RPC protocol.
+      All structured data within packets is encoded using the
+      <a href="http://en.wikipedia.org/wiki/External_Data_Representation">XDR standard</a>
+      as currently defined by <a href="https://tools.ietf.org/html/rfc4506">RFC 4506</a>.
+      On any connection running the RPC protocol, there can be multiple
+      programs active, each supporting one or more versions. A program
+      defines a set of procedures that it supports. The procedures can
+      support call+reply method invocation, asynchronous events,
+      and generic data streams. Method invocations can be overlapped,
+      so waiting for a reply to one will not block the receipt of the
+      reply to another outstanding method. The protocol was loosely
+      inspired by the design of SunRPC. The definition of the RPC
+      protocol is in the file <code>src/rpc/virnetprotocol.x</code>
+      in the libvirt source tree.
+    </p>
+
+    <h3><a name="protocolframing">Packet framing</a></h3>
+
+    <p>
+      On the wire, there is no explicit packet framing marker. Instead
+      each packet is preceded by an unsigned 32-bit integer giving
+      the total length of the packet in bytes. This length includes
+      the 4 bytes of the length word itself. Conceptually the framing
+      looks like this:
+    </p>
+
+<pre>
+|~~~   Packet 1   ~~~|~~~   Packet 2   ~~~|~~~  Packet 3    ~~~|~~~
+
++-------+------------+-------+------------+-------+------------+...
+| n=U32 | (n-4) * U8 | n=U32 | (n-4) * U8 | n=U32 | (n-4) * U8 |
++-------+------------+-------+------------+-------+------------+...
+
+|~ Len ~|~   Data   ~|~ Len ~|~   Data   ~|~ Len ~|~   Data   ~|~
+
+</pre>
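+    <p>
+      As a rough illustration of this framing rule, the following Python
+      sketch splits a byte buffer into packets. It is not taken from the
+      libvirt sources: the helper names <code>frame</code> and
+      <code>split_packets</code> are hypothetical, and it assumes the
+      length word is an XDR unsigned integer (4 bytes, big-endian).
+    </p>

```python
import struct

LEN_WORD = struct.Struct(">I")  # assumed: XDR unsigned int, 4 bytes, big-endian


def frame(data: bytes) -> bytes:
    """Prefix data with a length word that also counts its own 4 bytes."""
    return LEN_WORD.pack(len(data) + LEN_WORD.size) + data


def split_packets(buf: bytes):
    """Return (list of packet data, unconsumed trailing bytes)."""
    packets = []
    offset = 0
    while len(buf) - offset >= LEN_WORD.size:
        (n,) = LEN_WORD.unpack_from(buf, offset)
        if n < LEN_WORD.size:
            raise ValueError("length word is smaller than the length word itself")
        if len(buf) - offset < n:
            break  # incomplete packet: wait for more bytes
        packets.append(buf[offset + LEN_WORD.size:offset + n])
        offset += n
    return packets, buf[offset:]


# Two complete packets followed by the first 6 bytes of a third one.
stream = frame(b"abc") + frame(b"") + frame(b"hello")[:6]
packets, rest = split_packets(stream)
```

+    <p>
+      Here <code>packets</code> holds the data of the two complete
+      packets, while <code>rest</code> keeps the partial third packet
+      until more bytes arrive.
+    </p>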
+
+    <h3><a name="protocoldata">Packet data</a></h3>
+
+    <p>
+      The data in each packet is split into two parts, a short
+      fixed length header, followed by a variable length payload.
+      So a packet from the illustration above is more correctly
+      shown as
+    </p>
+
+<pre>
+
++-------+-------------+---------------....---+
+| n=U32 | 6*U32       | (n-(7*4))*U8         |
++-------+-------------+---------------....---+
+
+|~ Len ~|~  Header   ~|~  Payload     ....  ~|
+</pre>
+
+
+    <h3><a name="protocolheader">Packet header</a></h3>
+    <p>
+      The header contains 6 fields, encoded as signed/unsigned 32-bit
+      integers.
+    </p>
+
+    <pre>
++---------------+
+| program=U32   |
++---------------+
+| version=U32   |
++---------------+
+| procedure=S32 |
++---------------+
+| type=S32      |
++---------------+
+| serial=U32    |
++---------------+
+| status=S32    |
++---------------+
+    </pre>
+
+    <dl>
+      <dt><code>program</code></dt>
+      <dd>
+        This is an arbitrarily chosen number that will uniquely
+        identify the "service" running over the stream.
+      </dd>
+      <dt><code>version</code></dt>
+      <dd>
+        This is the version number of the program, by convention
+        starting from '1'. When an incompatible change is made
+        to a program, the version number is incremented. Ideally
+        both versions will then be supported on the wire in
+        parallel for backwards compatibility.
+      </dd>
+      <dt><code>procedure</code></dt>
+      <dd>
+        This is an arbitrarily chosen number that will uniquely
+        identify the method call, or event associated with the
+        packet. By convention, procedure numbers start from 1
+        and are assigned monotonically thereafter.
+      </dd>
+      <dt><code>type</code></dt>
+      <dd>
+        <p>
+        This can be one of the following enumeration values
+        </p>
+        <ol start="0">
+          <li>call: invocation of a method call</li>
+          <li>reply: completion of a method call</li>
+          <li>event: an asynchronous event</li>
+          <li>stream: control info or data from a stream</li>
+        </ol>
+      </dd>
+      <dt><code>serial</code></dt>
+      <dd>
+        This is a number that starts from 1 and increases
+        each time a method call packet is sent. A reply or
+        stream packet will have a serial number matching the
+        original method call packet serial. Events always
+        have the serial number set to 0.
+      </dd>
+      <dt><code>status</code></dt>
+      <dd>
+        <p>
+        This can be one of the following enumeration values
+        </p>
+        <ol start="0">
+          <li>ok: a normal packet. This is always set for method calls or events.
+            For replies it indicates successful completion of the method. For
+            streams it indicates confirmation of the end of file on the stream.</li>
+          <li>error: for replies this indicates that the method call failed
+            and error information is being returned. For streams this indicates
+            that not all data was sent and the stream has aborted</li>
+          <li>continue: for streams this indicates that further data packets
+            will be following</li>
+        </ol>
+      </dd>
+    </dl>
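+    <p>
+      The header layout can be sketched with Python's
+      <code>struct</code> module. This is only an illustration under
+      stated assumptions, not libvirt code: the names
+      <code>encode_packet</code> and <code>decode_packet</code> are
+      hypothetical, and the numeric values used for <code>type</code>
+      and <code>status</code> are assumed to count from 0 in the order
+      listed above, which is what the wire examples in this document
+      suggest.
+    </p>

```python
import struct
from collections import namedtuple

# program=U32, version=U32, procedure=S32, type=S32, serial=U32, status=S32
HEADER = struct.Struct(">IIiiIi")
Header = namedtuple("Header", "program version procedure type serial status")

# Assumed numeric values, counting from 0 in the documented order.
TYPE_CALL, TYPE_REPLY, TYPE_EVENT, TYPE_STREAM = range(4)
STATUS_OK, STATUS_ERROR, STATUS_CONTINUE = range(3)


def encode_packet(header, payload=b""):
    """Build length word + header + payload; the length counts all three."""
    body = HEADER.pack(*header) + payload
    return struct.pack(">I", 4 + len(body)) + body


def decode_packet(packet):
    """Split one complete packet into (Header, payload bytes)."""
    (n,) = struct.unpack_from(">I", packet, 0)
    if n != len(packet) or n < 4 + HEADER.size:
        raise ValueError("length word does not match the packet")
    header = Header(*HEADER.unpack_from(packet, 4))
    return header, packet[4 + HEADER.size:]


# A call to program 8, version 1, procedure 3 with 10 bytes of arguments.
call = encode_packet(Header(8, 1, 3, TYPE_CALL, 1, STATUS_OK), b"\x00" * 10)
header, payload = decode_packet(call)
```

+    <p>
+      The resulting <code>call</code> packet is 4 + 24 + 10 bytes long.
+    </p>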
+
+    <h3><a name="protocolpayload">Packet payload</a></h3>
+
+    <p>
+      The payload of a packet will vary depending on the <code>type</code>
+      and <code>status</code> fields from the header.
+    </p>
+
+    <ul>
+      <li>type=call: the in parameters for the method call, XDR encoded</li>
+      <li>type=reply+status=ok: the return value and/or out parameters for the method call, XDR encoded</li>
+      <li>type=reply+status=error: the error information for the method, a virErrorPtr XDR encoded</li>
+      <li>type=event: the parameters for the event, XDR encoded</li>
+      <li>type=stream+status=ok: no payload</li>
+      <li>type=stream+status=error: the error information for the method, a virErrorPtr XDR encoded</li>
+      <li>type=stream+status=continue: the raw bytes of data for the stream. No XDR encoding</li>
+    </ul>
+
+    <p>
+      For the exact payload information for each procedure, consult the XDR protocol
+      definition for the program+version in question
+    </p>
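+    <p>
+      The mapping from <code>type</code> and <code>status</code> to the
+      kind of payload can be written as a small lookup table. The sketch
+      below is illustrative only; the function name
+      <code>payload_kind</code> and the returned labels are made up for
+      this example.
+    </p>

```python
# Keys are (type, status) pairs; values describe how the payload is encoded.
PAYLOAD_KINDS = {
    ("call", "ok"): "xdr-arguments",
    ("reply", "ok"): "xdr-return-values",
    ("reply", "error"): "xdr-error",
    ("event", "ok"): "xdr-event-parameters",
    ("stream", "ok"): "empty",
    ("stream", "error"): "xdr-error",
    ("stream", "continue"): "raw-bytes",
}


def payload_kind(msg_type, status):
    """Return the payload encoding for a header, or reject the combination."""
    try:
        return PAYLOAD_KINDS[(msg_type, status)]
    except KeyError:
        raise ValueError(f"invalid combination: type={msg_type} status={status}") from None


kind = payload_kind("stream", "continue")
```

+    <p>
+      Any combination missing from the table, such as a call with
+      status=continue, is rejected rather than guessed at.
+    </p>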
+
+    <h3><a name="wireexamples">Wire examples</a></h3>
+
+    <p>
+      The following diagrams illustrate some example packet exchanges
+      between a client and server
+    </p>
+
+    <h4><a name="wireexamplescall">Method call</a></h4>
+
+    <p>
+      A single method call and successful
+      reply, for a program=8, version=1, procedure=3, with 10 bytes worth
+      of input args, and 4 bytes worth of return values. The overall input
+      packet length is 4 + 24 + 10 == 38, and output packet length 32
+    </p>
+
+    <pre>
+          +--+-----------------------+-----------+
+   C -->  |38| 8 | 1 | 3 | 0 | 1 | 0 | .o.oOo.o. |  --> S  (call)
+          +--+-----------------------+-----------+
+
+          +--+-----------------------+--------+
+   C <--  |32| 8 | 1 | 3 | 1 | 1 | 0 | .o.oOo |  <-- S  (reply)
+          +--+-----------------------+--------+
+    </pre>
+
+    <h4><a name="wireexamplescallerr">Method call with error</a></h4>
+
+    <p>
+      An unsuccessful method call will instead return an error object
+    </p>
+
+    <pre>
+          +--+-----------------------+-----------+
+   C -->  |38| 8 | 1 | 3 | 0 | 1 | 0 | .o.oOo.o. |  --> S   (call)
+          +--+-----------------------+-----------+
+
+          +--+-----------------------+--------------------------+
+   C <--  |48| 8 | 1 | 3 | 1 | 1 | 1 | .o.oOo.o.oOo.o.oOo.o.oOo |  <-- S  (error)
+          +--+-----------------------+--------------------------+
+    </pre>
+
+    <h4><a name="wireexamplescallup">Method call with upload stream</a></h4>
+
+    <p>
+      A method call which also involves uploading some data over
+      a stream will result in
+    </p>
+
+    <pre>
+          +--+-----------------------+-----------+
+   C -->  |38| 8 | 1 | 3 | 0 | 1 | 0 | .o.oOo.o. |  --> S  (call)
+          +--+-----------------------+-----------+
+
+          +--+-----------------------+--------+
+   C <--  |32| 8 | 1 | 3 | 1 | 1 | 0 | .o.oOo |  <-- S  (reply)
+          +--+-----------------------+--------+
+
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          ...
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+
+   C -->  |28| 8 | 1 | 3 | 3 | 1 | 0 | --> S  (stream finish)
+          +--+-----------------------+
+          +--+-----------------------+
+   C <--  |28| 8 | 1 | 3 | 3 | 1 | 0 | <-- S  (stream finish)
+          +--+-----------------------+
+    </pre>
+
+    <h4><a name="wireexamplescallbi">Method call bidirectional stream</a></h4>
+
+    <p>
+      A method call which also involves a bi-directional stream will
+      result in
+    </p>
+
+    <pre>
+          +--+-----------------------+-----------+
+   C -->  |38| 8 | 1 | 3 | 0 | 1 | 0 | .o.oOo.o. |  --> S  (call)
+          +--+-----------------------+-----------+
+
+          +--+-----------------------+--------+
+   C <--  |32| 8 | 1 | 3 | 1 | 1 | 0 | .o.oOo |  <-- S  (reply)
+          +--+-----------------------+--------+
+
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C <--  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  <-- S  (stream data down)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C <--  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  <-- S  (stream data down)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C <--  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  <-- S  (stream data down)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C <--  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  <-- S  (stream data down)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          ..
+          +--+-----------------------+-------------....-------+
+   C -->  |38| 8 | 1 | 3 | 3 | 1 | 2 | .o.oOo.o.oOo....o.oOo. |  --> S  (stream data up)
+          +--+-----------------------+-------------....-------+
+          +--+-----------------------+
+   C -->  |28| 8 | 1 | 3 | 3 | 1 | 0 | --> S  (stream finish)
+          +--+-----------------------+
+          +--+-----------------------+
+   C <--  |28| 8 | 1 | 3 | 3 | 1 | 0 | <-- S  (stream finish)
+          +--+-----------------------+
+    </pre>
+
+
+    <h4><a name="wireexamplescallmany">Method calls overlapping</a></h4>
+    <pre>
+          +--+-----------------------+-----------+
+   C -->  |38| 8 | 1 | 3 | 0 | 1 | 0 | .o.oOo.o. |  --> S  (call 1)
+          +--+-----------------------+-----------+
+          +--+-----------------------+-----------+
+   C -->  |38| 8 | 1 | 3 | 0 | 2 | 0 | .o.oOo.o. |  --> S  (call 2)
+          +--+-----------------------+-----------+
+          +--+-----------------------+--------+
+   C <--  |32| 8 | 1 | 3 | 1 | 2 | 0 | .o.oOo |  <-- S  (reply 2)
+          +--+-----------------------+--------+
+          +--+-----------------------+-----------+
+   C -->  |38| 8 | 1 | 3 | 0 | 3 | 0 | .o.oOo.o. |  --> S  (call 3)
+          +--+-----------------------+-----------+
+          +--+-----------------------+--------+
+   C <--  |32| 8 | 1 | 3 | 1 | 3 | 0 | .o.oOo |  <-- S  (reply 3)
+          +--+-----------------------+--------+
+          +--+-----------------------+-----------+
+   C -->  |38| 8 | 1 | 3 | 0 | 4 | 0 | .o.oOo.o. |  --> S  (call 4)
+          +--+-----------------------+-----------+
+          +--+-----------------------+--------+
+   C <--  |32| 8 | 1 | 3 | 1 | 1 | 0 | .o.oOo |  <-- S  (reply 1)
+          +--+-----------------------+--------+
+          +--+-----------------------+--------+
+   C <--  |32| 8 | 1 | 3 | 1 | 4 | 0 | .o.oOo |  <-- S  (reply 4)
+          +--+-----------------------+--------+
+    </pre>
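+    <p>
+      Overlapping calls work because each reply carries the serial of
+      the call it answers. A minimal sketch of that bookkeeping follows;
+      the class <code>PendingCalls</code> is hypothetical and only
+      models the serial matching, not any I/O.
+    </p>

```python
class PendingCalls:
    """Track outstanding calls by serial so replies can arrive in any order."""

    def __init__(self):
        self._next_serial = 1   # serials start from 1
        self._waiting = {}      # serial -> label of the outstanding call

    def send(self, label):
        serial = self._next_serial
        self._next_serial += 1
        self._waiting[serial] = label
        return serial

    def receive(self, serial):
        # An unknown serial raises KeyError: no such call is outstanding.
        return self._waiting.pop(serial)


pending = PendingCalls()
completed = []

# Same ordering as the diagram above.
s1 = pending.send("call 1")
s2 = pending.send("call 2")
completed.append(pending.receive(s2))
s3 = pending.send("call 3")
completed.append(pending.receive(s3))
s4 = pending.send("call 4")
completed.append(pending.receive(s1))
completed.append(pending.receive(s4))
```

+    <p>
+      The calls complete in the order 2, 3, 1, 4, matching the replies
+      in the diagram.
+    </p>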
+
+
+    <h2><a name="security">RPC security</a></h2>
+
+    <p>
+      There are various things to consider to ensure an implementation
+      of the RPC protocol can be satisfactorily secured
+    </p>
+
+    <h3><a name="securitytls">Authentication/encryption</a></h3>
+
+    <p>
+      The basic RPC protocol does not define or require any specific
+      authentication/encryption capabilities. A generic solution to
+      providing encryption for the protocol is to run the protocol
+      over a TLS encrypted data stream. x509 certificate checks can
+      be done to form a crude authentication mechanism. It is also
+      possible for an RPC program to negotiate an encryption /
+      authentication capability, such as SASL, which may then also
+      provide per-packet data encryption. Finally the protocol data
+      stream can of course be tunnelled over transports such as SSH.
+    </p>
+
+    <h3><a name="securitylimits">Data limits</a></h3>
+
+    <p>
+      Although the protocol itself defines many arbitrary sized data values in the
+      payloads, to avoid denial of service attacks there are a number of size limit
+      checks prior to encoding or decoding data. There is a limit on the maximum
+      size of a single RPC message, a limit on the maximum string length, and limits
+      on any other parameter which uses a variable length array. These limits can
+      be raised, subject to agreement between client/server, without otherwise
+      breaking compatibility of the RPC data on the wire.
+    </p>
+
+    <h3><a name="securityvalidate">Data validation</a></h3>
+
+    <p>
+      It is important that all data be fully validated before performing
+      any actions based on the data. When reading an RPC packet, the
+      first four bytes must be read and the max packet size limit validated,
+      before any attempt is made to read the variable length packet data.
+      After a complete packet has been read, the header must be decoded
+      and all 6 fields fully validated, before attempting to dispatch
+      the payload. Once dispatched, the payload can be decoded and passed
+      onto the appropriate API for execution. The RPC code must not take
+      any action based on the payload, since it has no way to validate
+      the semantics of the payload data. It must delegate this to the
+      execution API (e.g. the corresponding libvirt public API).
+    </p>
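+    <p>
+      The order of checks described above can be sketched as follows,
+      from the point of view of a server. This is an illustration under
+      assumptions, not the libvirt implementation: the size limit, the
+      table of known programs, the function names and the numeric
+      <code>type</code>/<code>status</code> values (counted from 0) are
+      all hypothetical.
+    </p>

```python
import struct

MAX_PACKET = 256 * 1024            # hypothetical limit, not libvirt's real value
HEADER = struct.Struct(">IIiiIi")  # the 6 header fields, 24 bytes
PROCEDURES = {(8, 1): {1, 2, 3}}   # hypothetical (program, version) -> procedures
TYPE_CALL, TYPE_STREAM = 0, 3      # assumed numeric values
STATUS_OK, STATUS_CONTINUE = 0, 2  # assumed numeric values


def remaining_bytes(length_word):
    """Check the length word BEFORE reading any variable length data."""
    (n,) = struct.unpack(">I", length_word)
    if n < 4 + HEADER.size:
        raise ValueError("packet too short to hold a header")
    if n > MAX_PACKET:
        raise ValueError("packet exceeds the size limit")
    return n - 4  # bytes still to be read for this packet


def validate_header(raw):
    """Check all 6 header fields before the payload is dispatched."""
    program, version, procedure, mtype, serial, status = HEADER.unpack(raw)
    known = PROCEDURES.get((program, version))
    if known is None:
        raise ValueError("unknown program/version")
    if procedure not in known:
        raise ValueError("unknown procedure")
    if mtype not in (TYPE_CALL, TYPE_STREAM):
        raise ValueError("a server only accepts calls and stream packets")
    if not STATUS_OK <= status <= STATUS_CONTINUE:
        raise ValueError("unknown status")
    if mtype == TYPE_CALL and status != STATUS_OK:
        raise ValueError("calls must carry status ok")
    if serial == 0:
        raise ValueError("serial 0 is reserved for events")
    return program, version, procedure, mtype, serial, status


to_read = remaining_bytes(struct.pack(">I", 38))
fields = validate_header(HEADER.pack(8, 1, 3, TYPE_CALL, 1, STATUS_OK))
```

+    <p>
+      Neither function looks at the payload: its semantics are left to
+      the API that eventually executes the call.
+    </p>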
+
+    <h2><a name="internals">RPC internal APIs</a></h2>
+
+    <p>
+      The generic internal RPC library code lives in the <code>src/rpc/</code>
+      directory of the libvirt source tree. Unless otherwise noted, the
+      objects are all threadsafe.
+    </p>
+
+    <h3><a name="apioverview">Overview of RPC objects</a></h3>
+
+    <p>
+      The following is a high level overview of the role of each
+      of the main RPC objects
+    </p>
+
+    <dl>
+      <dt><code>virNetSASLContextPtr</code> (virnetsaslcontext.h)</dt>
+      <dd>The virNetSASLContext APIs maintain SASL state for a network
+        service (server or client). This is primarily used on the server
+        to provide a whitelist of allowed SASL usernames for clients.
+      </dd>
+
+      <dt><code>virNetSASLSessionPtr</code> (virnetsaslcontext.h)</dt>
+      <dd>The virNetSASLSession APIs maintain SASL state for a single
+        network connection (socket). This is used to perform the multi-step
+        SASL handshake and perform encryption/decryption of data once
+        authenticated, via integration with virNetSocket.
+      </dd>
+
+      <dt><code>virNetTLSContextPtr</code> (virnettlscontext.h)</dt>
+      <dd>The virNetTLSContext APIs maintain TLS state for a network
+        service (server or client). This is primarily used on the server
+        to provide a whitelist of allowed x509 distinguished names, as
+        well as diffie-hellman keys. It can also do validation of
+        x509 certificates prior to initiating a connection, in order
+        to improve detection of configuration errors.
+      </dd>
+
+      <dt><code>virNetTLSSessionPtr</code> (virnettlscontext.h)</dt>
+      <dd>The virNetTLSSession APIs maintain TLS state for a single
+        network connection (socket). This is used to perform the multi-step
+        TLS handshake and perform encryption/decryption of data once
+        authenticated, via integration with virNetSocket.
+      </dd>
+
+      <dt><code>virNetSocketPtr</code> (virnetsocket.h)</dt>
+      <dd>The virNetSocket APIs provide a higher level wrapper around
+        the raw BSD sockets and getaddrinfo APIs. They allow for creation
+        of both server and client sockets. Data transports supported are
+        TCP, UNIX, SSH tunnel or external command tunnel. Internally the
+        TCP socket impl uses the getaddrinfo APIs to ensure correct
+        protocol independent behaviour, thus supporting both IPv4 and IPv6.
+        The socket APIs can be associated with a virNetSASLSessionPtr or
+        virNetTLSSessionPtr object to allow seamless encryption/decryption
+        of all writes and reads. For UNIX sockets it is possible to obtain
+        the remote client user ID and process ID. Integration with the
+        libvirt event loop also allows use of callbacks for notification
+        of various I/O conditions.
+      </dd>
+
+      <dt><code>virNetMessagePtr</code> (virnetmessage.h)</dt>
+      <dd>The virNetMessage APIs provide a wrapper around the libxdr
+        API calls, to facilitate processing and creation of RPC
+        packets. There are convenience APIs for encoding/decoding the
+        packet headers, encoding/decoding the payload using an XDR
+        filter, encoding/decoding a raw payload (for streams), and
+        encoding a virErrorPtr object. There is also a means to
+        add to/serve from a linked-list queue of messages.</dd>
+
+      <dt><code>virNetClientPtr</code> (virnetclient.h)</dt>
+      <dd>The virNetClient APIs provide a way to connect to a
+        remote server and run one or more RPC protocols over
+        the connection. Connections can be made over TCP, UNIX
+        sockets, SSH tunnels, or external command tunnels. There
+        is support for both TLS and SASL session encryption.
+        The client also supports management of multiple data streams
+        over each connection. Each client object can be used from
+        multiple threads concurrently, with method calls/replies
+        being interleaved on the wire as required.
+      </dd>
+
+      <dt><code>virNetClientProgramPtr</code> (virnetclientprogram.h)</dt>
+      <dd>The virNetClientProgram APIs are used to register a
+        program+version with the connection. This then enables
+        invocation of method calls, receipt of asynchronous
+        events and use of data streams, within that program+version.
+        When created a set of callbacks must be supplied to take
+        care of dispatching any incoming asynchronous events.
+      </dd>
+
+      <dt><code>virNetClientStreamPtr</code> (virnetclientstream.h)</dt>
+      <dd>The virNetClientStream APIs are used to control transmission and
+        receipt of data over a stream active on a client. Streams provide
+        a low latency, unlimited length, bi-directional raw data exchange
+        mechanism layered over the RPC connection.
+      </dd>
+
+      <dt><code>virNetServerPtr</code> (virnetserver.h)</dt>
+      <dd>The virNetServer APIs are used to manage a network server. A
+        server exposes one or more programs, over one or more services.
+        It manages multiple client connections invoking multiple RPC
+        calls in parallel, with dispatch across multiple worker threads.
+      </dd>
+
+      <dt><code>virNetServerMDNSPtr</code> (virnetservermdns.h)</dt>
+      <dd>The virNetServerMDNS APIs are used to advertise a server
+        across the local network, enabling clients to automatically
+        detect the existence of remote services. This is done by
+        interfacing with the Avahi mDNS advertisement service.
+      </dd>
+
+      <dt><code>virNetServerClientPtr</code> (virnetserverclient.h)</dt>
+      <dd>The virNetServerClient APIs are used to manage I/O related
+        to a single client network connection. It handles initial
+        validation and routing of incoming RPC packets, and transmission
+        of outgoing packets.
+      </dd>
+
+      <dt><code>virNetServerProgramPtr</code> (virnetserverprogram.h)</dt>
+      <dd>The virNetServerProgram APIs are used to provide the implementation
+        of a single program/version set. Primarily this includes a set of
+        callbacks used to actually invoke the APIs corresponding to
+        program procedure numbers. It is responsible for all the serialization
+        of payloads to/from XDR.</dd>
+
+      <dt><code>virNetServerServicePtr</code> (virnetserverservice.h)</dt>
+      <dd>The virNetServerService APIs are used to connect the server to
+        one or more network protocols. A single service may involve multiple
+        sockets (i.e. both IPv4 and IPv6). A service also has an associated
+        authentication policy for incoming clients.
+      </dd>
+    </dl>
+
+    <h3><a name="apiclientdispatch">Client RPC dispatch</a></h3>
+
+    <p>
+      The client RPC code must allow for multiple overlapping RPC method
+      calls to be invoked, transmission &amp; receipt of data for multiple
+      streams and receipt of asynchronous events. Understandably this
+      involves coordination of multiple threads.
+    </p>
+
+    <p>
+      The core requirement in the client dispatch code is that only
+      one thread is allowed to be performing I/O on the socket at
+      any time. This thread is said to be "holding the buck". When
+      any other thread comes along and needs to do I/O it must place
+      its packets on a queue and delegate processing of them to the
+      thread that has the buck. This thread will send out the method
+      call, and if it sees a reply will pass it back to the waiting
+      thread. If the other thread's reply hasn't arrived, by the time
+      the main thread has got its own reply, then it will transfer
+      responsibility for I/O to the thread that has been waiting the
+      longest. It is said to be "passing the buck" for I/O.
+    </p>
+
+    <p>
+      When no thread is performing any RPC method call, or sending
+      stream data there is still a need to monitor the socket for
+      incoming I/O related to asynchronous events, or stream data
+      receipt. For this task, a watch is registered with the event
+      loop which triggers whenever the socket is readable. This
+      watch is automatically disabled whenever any other thread
+      grabs the buck, and re-enabled when the buck is released.
+    </p>
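+    <p>
+      The "buck" idea can be modelled with a condition variable. The
+      sketch below is a simplification and not the libvirt code: the
+      class <code>BuckClient</code> is hypothetical, a
+      <code>queue.Queue</code> stands in for the socket, sending is
+      omitted, and when the buck is released any waiting thread may
+      grab it rather than the one that has waited the longest.
+    </p>

```python
import queue
import threading


class BuckClient:
    def __init__(self, wire):
        self._wire = wire                  # stand-in socket: queue of (serial, value)
        self._cond = threading.Condition()
        self._buck_held = False            # True while some thread does the I/O
        self._replies = {}                 # replies read on behalf of other threads

    def wait_reply(self, serial):
        with self._cond:
            while True:
                if serial in self._replies:
                    return self._replies.pop(serial)  # someone else read it for us
                if not self._buck_held:
                    self._buck_held = True            # grab the buck
                    break
                self._cond.wait()
        try:
            # Only the buck holder reads from the wire.
            while True:
                got_serial, value = self._wire.get()
                if got_serial == serial:
                    return value
                with self._cond:
                    self._replies[got_serial] = value
                    self._cond.notify_all()           # wake the thread it belongs to
        finally:
            with self._cond:
                self._buck_held = False               # release the buck
                self._cond.notify_all()


wire = queue.Queue()
wire.put((2, "reply2"))   # replies arrive out of order
wire.put((1, "reply1"))

client = BuckClient(wire)
results = {}


def api_call(serial):
    results[serial] = client.wait_reply(serial)


threads = [threading.Thread(target=api_call, args=(s,)) for s in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

+    <p>
+      Whichever thread ends up holding the buck, each caller receives
+      the reply matching its own serial.
+    </p>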
+
+    <h4><a name="apiclientdispatchex1">Example with buck passing</a></h4>
+
+    <p>
+      In the first example, a second thread issues an API call
+      while the first thread holds the buck. The reply to the
+      first call arrives first, so the buck is passed to the
+      second thread.
+    </p>
+
+    <pre>
+        Thread-1
+           |
+           V
+       Call API1()
+           |
+           V
+       Grab Buck
+           |           Thread-2
+           V              |
+       Send method1       V
+           |          Call API2()
+           V              |
+        Wait I/O          V
+           |<--------Queue method2
+           V              |
+       Send method2       V
+           |          Wait for buck
+           V              |
+        Wait I/O          |
+           |              |
+           V              |
+       Recv reply1        |
+           |              |
+           V              |
+       Pass the buck----->|
+           |              V
+           V           Wait I/O
+       Return API1()      |
+                          V
+                      Recv reply2
+                          |
+                          V
+                     Release the buck
+                          |
+                          V
+                      Return API2()
+    </pre>
+
+    <h4><a name="apiclientdispatchex2">Example without buck passing</a></h4>
+
+    <p>
+      In this second example, a second thread issues an API call
+      which is sent and replied to, before the first thread's
+      API call has completed. The first thread thus notifies
+      the second that its reply is ready, and there is no need
+      to pass the buck.
+    </p>
+
+    <pre>
+        Thread-1
+           |
+           V
+       Call API1()
+           |
+           V
+       Grab Buck
+           |           Thread-2
+           V              |
+       Send method1       V
+           |          Call API2()
+           V              |
+        Wait I/O          V
+           |<--------Queue method2
+           V              |
+       Send method2       V
+           |          Wait for buck
+           V              |
+        Wait I/O          |
+           |              |
+           V              |
+       Recv reply2        |
+           |              |
+           V              |
+      Notify reply2------>|
+           |              V
+           V          Return API2()
+        Wait I/O
+           |
+           V
+       Recv reply1
+           |
+           V
+     Release the buck
+           |
+           V
+       Return API1()
+    </pre>
+
+    <h4><a name="apiclientdispatchex3">Example with async events</a></h4>
+
+    <p>
+      In this example, only one thread is present and it has to
+      deal with some async events arriving. The events are actually
+      dispatched to the application from the event loop thread
+    </p>
+
+    <pre>
+        Thread-1
+           |
+           V
+       Call API1()
+           |
+           V
+       Grab Buck
+           |
+           V
+       Send method1
+           |
+           V
+        Wait I/O
+           |          Event thread
+           V              ...
+       Recv event1         |
+           |               V
+           V          Wait for timer/fd
+       Queue event1        |
+           |               V
+           V           Timer fires
+        Wait I/O           |
+           |               V
+           V           Emit event1
+       Recv reply1         |
+           |               V
+           V          Wait for timer/fd
+       Return API1()       |
+                          ...
+    </pre>
+
+    <h3><a name="apiserverdispatch">Server RPC dispatch</a></h3>
+
+    <p>
+      The RPC server code must support receipt of incoming RPC requests from
+      multiple client connections, and parallel processing of all RPC
+      requests, even many from a single client. This goal is achieved through
+      a combination of event driven I/O, and multiple processing threads.
+    </p>
+
+    <p>
+      The main libvirt event loop thread is responsible for performing all
+      socket I/O. It will read incoming packets from clients and will
+      transmit outgoing packets to clients. It will handle the I/O to/from
+      streams associated with client API calls. When doing client I/O it
+      will also pass the data through any applicable encryption layer
+      (through use of the virNetSocket / virNetTLSSession and virNetSASLSession
+      integration). What is paramount is that the event loop thread never
+      do any task that can take a non-trivial amount of time.
+    </p>
+
+    <p>
+      When reading packets, the event loop will first read the 4 byte length
+      word. This is validated to make sure it does not exceed the maximum
+      permissible packet size, and the client is set to allow receipt of the
+      rest of the packet data. Once a complete packet has been received, the
+      next step is to decode the RPC header. The header is validated to
+      ensure the request is sensible, i.e. the server should not receive a
+      method reply from a client. If the client has not yet authenticated,
+      a security check is also applied to make sure the procedure is on the
+      whitelist of those allowed prior to auth. If the packet is a method
+      call, it will be placed on a global processing queue. The event loop
+      thread is now done with the packet for the time being.
+    </p>
+
+    <p>
+      The server has a pool of worker threads, which wait for method call
+      packets to be queued. One of them will grab the new method call off
+      the queue for processing. The first step is to decode the payload of
+      the packet to extract the method call arguments. The worker does not
+      attempt to do any semantic validation of the arguments, except to make
+      sure the size of any variable length fields is below defined limits.
+    </p>
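+
+    <p>
+      The hand-off between the event loop and the worker pool can be
+      pictured as a mutex protected queue with a condition variable. The
+      sketch below uses POSIX threads and hypothetical names throughout;
+      it is a simplified model under those assumptions, not the libvirt
+      thread pool code. Jobs are plain integers here, standing in for
+      queued method call packets.
+    </p>

```c
#include <pthread.h>
#include <stddef.h>

#define EX_QUEUE_CAP 16

/* A fixed-size FIFO of job identifiers, shared by all workers. */
struct ex_queue {
    int jobs[EX_QUEUE_CAP];
    size_t head, count;
    int closed;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
};

void ex_queue_init(struct ex_queue *q)
{
    q->head = q->count = 0;
    q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

/* Called by the event loop thread; never blocks waiting for space.
 * Returns -1 if the queue is full or has been closed. */
int ex_queue_push(struct ex_queue *q, int job)
{
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (!q->closed && q->count < EX_QUEUE_CAP) {
        q->jobs[(q->head + q->count) % EX_QUEUE_CAP] = job;
        q->count++;
        rc = 0;
        pthread_cond_signal(&q->nonempty);
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Called by a worker; blocks until a job arrives.
 * Returns -1 once the queue is closed and fully drained. */
int ex_queue_pop(struct ex_queue *q, int *job)
{
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->nonempty, &q->lock);
    if (q->count > 0) {
        *job = q->jobs[q->head];
        q->head = (q->head + 1) % EX_QUEUE_CAP;
        q->count--;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Wake every waiting worker so it can exit after draining the queue. */
void ex_queue_close(struct ex_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

struct ex_worker_ctx {
    struct ex_queue *q;
    int handled;
};

/* Worker body, usable as a pthread_create() start routine. */
void *ex_worker(void *arg)
{
    struct ex_worker_ctx *ctx = arg;
    int job;
    while (ex_queue_pop(ctx->q, &job) == 0)
        ctx->handled++; /* decoding and dispatch would happen here */
    return NULL;
}
```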
+
+    <p>
+      The worker now invokes the libvirt API call that corresponds to the
+      procedure number in the packet header. The worker is thus kept busy
+      until the API call completes. The implementation of the API call
+      is responsible for doing semantic validation of parameters and any
+      MAC security checks on the objects affected.
+    </p>
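+
+    <p>
+      One common way to map a procedure number onto an API call is a table
+      of function pointers indexed by that number. The sketch below
+      assumes such a table purely for illustration; the handler signature,
+      the procedure numbers and the two example handlers are all
+      hypothetical and do not correspond to real libvirt procedures. Note
+      how the range check on the argument lives in the handler, mirroring
+      the rule that semantic validation belongs to the API implementation
+      rather than to the dispatcher.
+    </p>

```c
#include <stddef.h>

/* Hypothetical handler signature: reads one decoded argument,
 * fills in one result, and returns 0 on success or -1 on error. */
typedef int (*ex_handler)(const int *arg, int *ret);

int ex_api_ping(const int *arg, int *ret)
{
    (void)arg; /* this call takes no meaningful argument */
    *ret = 0;
    return 0;
}

int ex_api_set_count(const int *arg, int *ret)
{
    /* Semantic validation is the handler's job, not the dispatcher's. */
    if (*arg < 1 || *arg > 64)
        return -1;
    *ret = *arg;
    return 0;
}

/* Procedure number 0 is deliberately left unused in this sketch. */
static const ex_handler ex_table[] = {
    NULL,
    ex_api_ping,
    ex_api_set_count,
};

/* Look up the handler for a procedure number and invoke it. */
int ex_dispatch(unsigned int proc, const int *arg, int *ret)
{
    size_t n = sizeof(ex_table) / sizeof(ex_table[0]);
    if (proc >= n || ex_table[proc] == NULL)
        return -1;
    return ex_table[proc](arg, ret);
}
```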
+
+    <p>
+      Once the API call has completed, the worker thread will take the
+      return value and output parameters, or the error object, and encode
+      them into a reply packet. Again it does not attempt to do any
+      semantic validation of output data, aside from variable length
+      field limit checks. The worker thread puts the reply packet onto
+      the transmission queue for the client. The worker is now finished
+      and goes back to wait for another incoming method call.
+    </p>
+
+    <p>
+      The main event loop is back in charge and when the client socket
+      becomes writable, it will start sending the method reply packet
+      back to the client.
+    </p>
+
+    <p>
+      At any time the libvirt connection object can emit asynchronous
+      events. These are handled by callbacks in the main event thread.
+      The callback will simply encode the event parameters into a new
+      data packet and place the packet on the client transmission
+      queue.
+    </p>
+
+    <p>
+      Incoming and outgoing stream packets are also directly handled
+      by the main event thread. When an incoming stream packet is
+      received, instead of placing it in the global dispatch queue
+      for the worker threads, it is sidetracked into a per-stream
+      processing queue. When the stream becomes writable, queued
+      incoming stream packets will be processed, passing their data
+      payload onto the stream. Conversely when the stream becomes
+      readable, chunks of data will be read from it, encoded into
+      new outgoing packets, and placed on the client's transmit
+      queue.
+    </p>
+
+    <h4><a name="apiserverdispatchex1">Example with overlapping methods</a></h4>
+
+    <p>
+      This example illustrates processing of two incoming methods with
+      overlapping execution.
+    </p>
+
+    <pre>
+   Event thread    Worker 1       Worker 2
+       |               |              |
+       V               V              V
+    Wait I/O       Wait Job       Wait Job
+       |               |              |
+       V               |              |
+   Recv method1        |              |
+       |               |              |
+       V               |              |
+   Queue method1       V              |
+       |          Serve method1       |
+       V               |              |
+    Wait I/O           V              |
+       |           Call API1()        |
+       V               |              |
+   Recv method2        |              |
+       |               |              |
+       V               |              |
+   Queue method2       |              V
+       |               |         Serve method2
+       V               V              |
+    Wait I/O      Return API1()       V
+       |               |          Call API2()
+       |               V              |
+       V         Queue reply1         |
+   Send reply1         |              |
+       |               V              V
+       V           Wait Job       Return API2()
+    Wait I/O           |              |
+       |              ...             V
+       V                          Queue reply2
+   Send reply2                        |
+       |                              V
+       V                          Wait Job
+    Wait I/O                          |
+       |                             ...
+      ...
+    </pre>
+
+    <h4><a name="apiserverdispatchex2">Example with stream data</a></h4>
+
+    <p>
+      This example illustrates processing of stream data.
+    </p>
+
+    <pre>
+   Event thread
+       |
+       V
+    Wait I/O
+       |
+       V
+   Recv stream1
+       |
+       V
+   Queue stream1
+       |
+       V
+    Wait I/O
+       |
+       V
+   Recv stream2
+       |
+       V
+   Queue stream2
+       |
+       V
+    Wait I/O
+       |
+       V
+   Write stream1
+       |
+       V
+   Write stream2
+       |
+       V
+    Wait I/O
+       |
+      ...
+    </pre>
+
+  </body>
+</html>
diff --git a/docs/remote.html.in b/docs/remote.html.in
index b554950..6a8e830 100644
--- a/docs/remote.html.in
+++ b/docs/remote.html.in
@@ -53,9 +53,6 @@ machines through authenticated and encrypted connections.
       <li>
         <a href="#Remote_limitations">Limitations</a>
       </li>
-      <li>
-        <a href="#Remote_implementation_notes">Implementation notes</a>
-      </li>
     </ul>
     <h3>
       <a name="Remote_basic_usage">Basic usage</a>
@@ -880,47 +877,5 @@ just read-write/read-only as at present.
     <p>
 Please come and discuss these issues and more on <a href="https://www.redhat.com/mailman/listinfo/libvir-list" title="libvir-list mailing list">the mailing list</a>.
 </p>
-    <h3>
-      <a name="Remote_implementation_notes">Implementation notes</a>
-    </h3>
-    <p>
-The current implementation uses <a href="http://en.wikipedia.org/wiki/External_Data_Representation" title="External Data Representation">XDR</a>-encoded packets with a
-simple remote procedure call implementation which also supports
-asynchronous messaging and asynchronous and out-of-order replies,
-although these latter features are not used at the moment.
-</p>
-    <p>
-The implementation should be considered <b>strictly internal</b> to
-libvirt and <b>subject to change at any time without notice</b>.  If
-you wish to talk to libvirtd, link to libvirt.  If there is a problem
-that means you think you need to use the protocol directly, please
-first discuss this on <a href="https://www.redhat.com/mailman/listinfo/libvir-list" title="libvir-list mailing list">the mailing list</a>.
-</p>
-    <p>
-The messaging protocol is described in
-<code>qemud/remote_protocol.x</code>.
-</p>
-    <p>
-Authentication and encryption (for TLS) is done using <a href="http://www.gnu.org/software/gnutls/" title="GnuTLS project page">GnuTLS</a> and the RPC protocol is unaware of this layer.
-</p>
-    <p>
-Protocol messages are sent using a simple 32 bit length word (encoded
-XDR int) followed by the message header (XDR
-<code>remote_message_header</code>) followed by the message body.  The
-length count includes the length word itself, and is measured in
-bytes.  Maximum message size is <code>REMOTE_MESSAGE_MAX</code> and to
-avoid denial of services attacks on the XDR decoders strings are
-individually limited to <code>REMOTE_STRING_MAX</code> bytes.  In the
-TLS case, messages may be split over TLS records, but a TLS record
-cannot contain parts of more than one message.  In the common RPC case
-a single <code>REMOTE_CALL</code> message is sent from client to
-server, and the server then replies synchronously with a single
-<code>REMOTE_REPLY</code> message, but other forms of messaging are
-also possible.
-</p>
-    <p>
-The protocol contains support for multiple program types and protocol
-versioning, modelled after SunRPC.
-</p>
   </body>
 </html>
diff --git a/docs/sitemap.html.in b/docs/sitemap.html.in
index 897ee94..f50a8d2 100644
--- a/docs/sitemap.html.in
+++ b/docs/sitemap.html.in
@@ -289,6 +289,10 @@
                 <span>Spawning commands from libvirt driver code</span>
               </li>
               <li>
+                <a href="internals/rpc.html">RPC protocol &amp; APIs</a>
+                <span>RPC protocol information and API / dispatch guide</span>
+              </li>
+              <li>
                 <a href="internals/locking.html">Lock managers</a>
                 <span>Use lock managers to protect disk content</span>
               </li>
-- 
1.7.6



