[Libguestfs] [PATCH libnbd 2/2] api: Implement local command with systemd socket activation.

Thu Sep 26 20:56:27 UTC 2019

On 9/26/19 11:40 AM, Richard W.M. Jones wrote:
> This adds new APIs for running a local NBD server and connecting to it
> using systemd socket activation (instead of stdin/stdout).
> 
> This includes interop tests against nbdkit and qemu-nbd which I
> believe are the only NBD servers supporting socket activation.  (If we
> find others then we can add more interop tests in future.)
> 
> The upstream spec for systemd socket activation is here:
> http://0pointer.de/blog/projects/socket-activation.html
> ---

> +++ b/generator/states-connect.c
> @@ -37,6 +37,9 @@
>   #include <sys/socket.h>
>   #include <sys/un.h>
>   
> +/* This is baked into the systemd socket activation API. */
> +#define FIRST_SOCKET_ACTIVATION_FD 3
> +
>   /* Disable Nagle's algorithm on the socket, but don't fail. */
>   static void
>   disable_nagle (int sock)
> @@ -292,4 +295,110 @@ disable_nagle (int sock)
>     SET_NEXT_STATE (%^MAGIC.START);
>     return 0;
>   
> + CONNECT_SA.START:
> +  /* This is in some ways similar to nbd_aio_connect_command above,
> +   * but we must pass a listening socket to the child process and
> +   * we must set LISTEN_PID and LISTEN_FDS environment variables.
> +   */
> +  size_t len;
> +  int s;
> +  struct sockaddr_un addr;
> +  pid_t pid;
> +  int flags;
> +  char listen_pid[16];
> +
> +  assert (!h->sock);
> +  assert (h->argv);
> +  assert (h->argv[0]);
> +
> +  /* Use /tmp instead of TMPDIR because we must ensure the path is
> +   * short enough to store in the sockaddr_un.  On some platforms this
> +   * may cause problems so we may need to revisit it.  XXX
> +   */

Is the use of socketpair() any better than creating a socket under /tmp?

> +  h->sa_tmpdir = strdup ("/tmp/libnbdXXXXXX");
> +  h->sa_sockpath = strdup ("/tmp/libnbdXXXXXX/sock");
> +  if (h->sa_tmpdir == NULL || h->sa_sockpath == NULL) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "strdup");
> +    return 0;
> +  }
> +
> +  if (mkdtemp (h->sa_tmpdir) == NULL) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "mkdtemp");
> +    return 0;
> +  }
> +  len = strlen (h->sa_tmpdir);
> +  memcpy (h->sa_sockpath, h->sa_tmpdir, len);
> +
> +  s = socket (AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0);
> +  if (s == -1) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "socket");
> +    return 0;
> +  }

If we fail here or later, do/should we try to clean up the 
/tmp/libnbdXXX directory created earlier?

/me reads ahead - nbd_close tries to address it

Still, if we fail at this point, h->sa_sockpath is set but not yet 
created [1]

> +
> +  addr.sun_family = AF_UNIX;
> +  memcpy (addr.sun_path, h->sa_sockpath, strlen (h->sa_sockpath) + 1);
> +  if (bind (s, (struct sockaddr *) &addr, sizeof addr) == -1) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "bind: %s", h->sa_sockpath);
> +    return 0;
> +  }
> +
> +  if (listen (s, 1) == -1) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "listen");
> +    return 0;
> +  }
> +
> +  pid = fork ();
> +  if (pid == -1) {
> +    SET_NEXT_STATE (%.DEAD);
> +    set_error (errno, "fork");
> +    close (s);
> +    return 0;
> +  }
> +  if (pid == 0) {         /* child - run command */
> +    if (s != FIRST_SOCKET_ACTIVATION_FD) {
> +      dup2 (s, FIRST_SOCKET_ACTIVATION_FD);

No check for errors?  But that's probably okay in this instance (we know 
that we won't fail with EBADF, for example).

> +      close (s);
> +    }
> +    else {
> +      /* We must unset CLOEXEC on the fd.  (dup2 above does this
> +       * implicitly because CLOEXEC is set on the fd, not on the
> +       * socket).
> +       */
> +      flags = fcntl (s, F_GETFD, 0);
> +      if (flags == -1) {
> +        perror ("fcntl: F_GETFD");

perror is not async-signal-safe.  Calling it in a child of a 
potentially-multi-threaded parent is therefore prone to deadlock, if 
perror() triggers a request to grab any resource that was left locked by 
a different thread holding the lock but stranded by the fork.

> +        _exit (EXIT_FAILURE);
> +      }
> +      if (fcntl (s, F_SETFD, flags & ~FD_CLOEXEC) == -1) {
> +        perror ("fcntl: F_SETFD");
> +        _exit (EXIT_FAILURE);
> +      }

About all we can _safely_ do is let our _exit() status inform the parent 
process that something failed, and let the parent process print the 
error message.  But even passing errno as the _exit() value is not 
portable (GNU Hurd errno values are intentionally larger than what fit 
in 8-bit returns that the parent would see).  It's also possible to set 
up a cloexec pipe (in addition to everything else) so that the child can 
write() error details into the pipe, but that's a lot of effort compared 
to just blindly stating that the child failed but declaring full details 
of why as lost.

> +    }
> +
> +    snprintf (listen_pid, sizeof listen_pid, "%ld", (long) getpid ());
> +    setenv ("LISTEN_PID", listen_pid, 1);
> +    setenv ("LISTEN_FDS", "1", 1);

snprintf() and setenv() are also not async-signal-safe.  Which is a 
bummer, since you really don't know the child pid until in the child. 
You could open-code the translation of a decimal number into a buffer 
without snprintf, but you also have to figure out how to safely put it 
into the environ seen by the child.  How does systemd do it?

We _could_ create a helper executable, where we fork() into the child 
process, exec() the helper to overwrite the child pid with something 
that is no longer constrained by the limits of async-safety, so that the 
helper app can then use snprintf/setenv at will before re-execing again 
with the real target executable.  The next question is how worried are 
we about actually being hit by an non-async-signal-safe hang.  What you 
have works 99.9% of the time, but if we are trying to be a library that 
is usable by any application, we have to decide if that is good enough.

> +
> +    /* Restore SIGPIPE back to SIG_DFL. */
> +    signal (SIGPIPE, SIG_DFL);
> +
> +    execvp (h->argv[0], h->argv);
> +    perror (h->argv[0]);

and another unsafe perror.

> +    _exit (EXIT_FAILURE);

Is EXIT_FAILURE the best here, or should we consider exiting with status 
126/127 to match what other programs (sh, env, nice, nohup, ...) do when 
execvp() fails?

> +  }
> +
> +  /* Parent. */
> +  close (s);
> +  h->pid = pid;
> +
> +  h->connaddrlen = sizeof addr;
> +  memcpy (&h->connaddr, &addr, h->connaddrlen);
> +  SET_NEXT_STATE (%^CONNECT.START);
> +  return 0;
> +
>   } /* END STATE MACHINE */

> +++ b/interop/socket-activation.c

> +#define SIZE 1048576
> +
> +int
> +main (int argc, char *argv[])
> +{
> +  struct nbd_handle *nbd;
> +  char tmpfile[] = "/tmp/nbdXXXXXX";
> +  int fd;
> +  int64_t actual_size;
> +  char buf[512];
> +  int r = -1;
> +
> +  /* Create a large sparse temporary file. */
> +  fd = mkstemp (tmpfile);
> +  if (fd == -1 ||
> +      ftruncate (fd, SIZE) == -1 ||
> +      close (fd) == -1) {
> +    perror (tmpfile);
> +    goto out;
> +  }

Is it worth populating anything in the temp file...

> +
> +  nbd = nbd_create ();
> +  if (nbd == NULL) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }
> +
> +  char *args[] = { SERVER, SERVER_PARAMS, NULL };
> +  if (nbd_connect_socket_activation (nbd, args) == -1) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }
> +
> +  actual_size = nbd_get_size (nbd);
> +  if (actual_size == -1) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }
> +  if (actual_size != SIZE) {
> +    fprintf (stderr, "%s: actual size %" PRIi64 " <> expected size %d",
> +             argv[0], actual_size, SIZE);
> +    goto out;
> +  }
> +
> +  if (nbd_pread (nbd, buf, sizeof buf, 0, 0) == -1) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }

...and checking that we read it back out?  But even a successful read 
without checking contents is pretty good evidence that the socket 
activation worked.

> +
> +  if (nbd_shutdown (nbd, 0) == -1) {
> +    fprintf (stderr, "%s\n", nbd_get_error ());
> +    goto out;
> +  }
> +
> +  nbd_close (nbd);
> +
> +  r = 0;
> + out:
> +  unlink (tmpfile);
> +
> +  exit (r == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
> +}
> diff --git a/lib/connect.c b/lib/connect.c
> index f98bcdb..c1cbef7 100644
> --- a/lib/connect.c

> +++ b/lib/handle.c
> @@ -129,6 +129,16 @@ nbd_close (struct nbd_handle *h)
>     free_cmd_list (h->cmds_in_flight);
>     free_cmd_list (h->cmds_done);
>     nbd_internal_free_string_list (h->argv);
> +  if (h->sa_sockpath) {
> +    if (h->pid > 0)
> +      kill (h->pid, SIGTERM);

Are we sure that SIGTERM Is always going to be sufficient?  Or do we 
need a tiered approach where we try SIGTERM, but followup with SIGKILL 
if too much time elapses?  I guess it is the same situation for 
nbd_connect_command, does that mean we should have a common helper 
function for doing appropriate cleanup?

Do we document that nbd_close() may block?

> +    unlink (h->sa_sockpath);

[1] we can end up calling unlink() on a path that does not exist.  But 
since we aren't checking error status here, we are effectively ignoring 
the ENOENT in that case.

> +    free (h->sa_sockpath);
> +  }
> +  if (h->sa_tmpdir) {
> +    rmdir (h->sa_tmpdir);
> +    free (h->sa_tmpdir);
> +  }
>     free (h->unixsocket);
>     free (h->hostname);
>     free (h->port);
> diff --git a/lib/internal.h b/lib/internal.h
> index bdb0e83..bf8b9aa 100644
> --- a/lib/internal.h
> +++ b/lib/internal.h
> @@ -188,6 +188,12 @@ struct nbd_handle {
>     char **argv;
>     pid_t pid;
>   
> +  /* When using systemd socket activation, this directory and socket
> +   * must be deleted, and the pid above must be killed.
> +   */
> +  char *sa_tmpdir;
> +  char *sa_sockpath;
> +
>     /* When connecting to Unix domain socket. */
>     char *unixsocket;
>   
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org