[Libguestfs] [PATCH nbdkit v2] New tmpdisk plugin.

Tue Mar 17 11:16:49 UTC 2020

On 3/17/20 3:53 AM, Richard W.M. Jones wrote:
> This can be used for creating temporary disks to thin clients, as a
> kind of "remote tmpfs".
> 
> See also:
> https://www.redhat.com/archives/libguestfs/2020-March/msg00134.html
> ---

> +++ b/plugins/tmpdisk/nbdkit-tmpdisk-plugin.pod
> @@ -0,0 +1,157 @@
> +=head1 NAME
> +
> +nbdkit-tmpdisk-plugin - create a fresh temporary filesystem for each client
> +
> +=head1 SYNOPSIS
> +
> + nbdkit tmpdisk [size=]SIZE
> +                [type=ext4|xfs|vfat|...] [label=LABEL]
> +                [command=COMMAND]
> +
> +=head1 DESCRIPTION
> +
> +This L<nbdkit(1)> plugin is used for creating temporary filesystems
> +for thin clients.  Each time a client connects it will see a fresh,
> +empty filesystem for its exclusive use.  B<The filesystem is deleted>
> +when the client disconnects.
> +
> +When a new client connects, a blank, sparse file of the required size
> +is created in C<$TMPDIR> (or F</var/tmp>).  L<mkfs(8)> is then run on
> +the file to create the empty filesystem, and this filesystem is served
> +to the client.  Unlike L<nbdkit-linuxdisk-plugin(1)> each client of
> +this plugin sees a different disk.

As you said earlier, we may someday want a mode where the cow filter or 
similar can provide per-client transience to any plugin, but this plugin 
does indeed seem useful on its own.

> +=head2 Security considerations
> +
> +Because each client gets a new disk, the amount of disk space
> +required on the server can be as much as
> +S<<< I<number of clients> × I<size parameter> >>>.  It is therefore
> +best to limit the number of clients using L<nbdkit-limit-filter(1)> or
> +take steps to limit where clients can connect from using
> +L<nbdkit-ip-filter(1)>, firewalls, or TLS client certificates.

Good advice.

> +
> +=head1 PARAMETERS
> +
> +=over 4
> +
> +=item B<command='>COMMANDB<'>
> +
> +Instead of running L<mkfs(8)> to create the initial filesystem, run
> +C<COMMAND> (which usually must be quoted to protect it from the
> +shell).  The following shell variables may be used in C<COMMAND>:

As we run COMMAND through /bin/sh, the nbdkit user can inject ANY action 
they want, but I don't see that as inherently risky as nbdkit is not 
running with more privileges than the user already has.

> +++ b/plugins/tmpdisk/tmpdisk.c
> @@ -0,0 +1,427 @@

> +static const char *command =
> +  "labelopt='-L'\n"
> +  "case \"$type\" in\n"
> +  "    ext?)\n"
> +  "        extra='-F' ;;\n"
> +  "    *fat|msdos)\n"
> +  "        extra='-I' ;;\n"
> +  "    ntfs)\n"
> +  "        extra='-Q -F'\n"
> +  "        labelopt='-n' ;;\n"
> +  "    xfs)\n"
> +  "        extra='-f' ;;\n"
> +  "esac\n"
> +  "if [ \"x$label\" = \"x\" ]; then\n"
> +  "    mkfs -t \"$type\" $extra $disk >/dev/null\n"
> +  "else\n"
> +  "    mkfs -t \"$type\" $extra $labelopt \"$label\" $disk >/dev/null\n"
> +  "fi\n";
> +static const char *label = NULL;
> +static const char *type = "ext4";

At any rate, it looks like you are mostly safely quoting the default 
command for invoking mkfs.  But you do have a bug: you are using $disk 
unquoted, which is okay when TMPDIR is undefined but not always, 
because...[1]

> +
> +/* Absolutely unsafe!  Although this is simply returning the default
> + * value, provide this callback to make it clear.
> + */
> +static int
> +tmpdisk_can_multi_conn (void *handle)
> +{
> +  return 0;

The comment threw me for just a second, until I realized you meant that 
'Advertising multi-conn is absolutely unsafe!' and not 'the 
implementation of this callback is absolutely unsafe'.

> +}
> +
> +static int
> +tmpdisk_can_trim (void *handle)
> +{
> +#ifdef FALLOC_FL_PUNCH_HOLE
> +  return 1;
> +#else
> +  return 0;
> +#endif
> +}
> +
> +/* Pretend we have native FUA support, but actually because all disks
> + * are temporary we will deliberately ignore flush/FUA operations.
> + */
> +static int
> +tmpdisk_can_fua (void *handle)
> +{
> +  return NBDKIT_FUA_NATIVE;
> +}

Agreed.

> +
> +static int64_t
> +tmpdisk_get_size (void *handle)
> +{
> +  return size;
> +}
> +
> +/* This creates and runs the full "mkfs" (or whatever) command. */
> +static int
> +run_command (const char *disk)
> +{
> +  FILE *fp;
> +  CLEANUP_FREE char *cmd = NULL;
> +  size_t len = 0;
> +  int r;
> +
> +  fp = open_memstream (&cmd, &len);
> +  if (fp == NULL) {
> +    nbdkit_error ("open_memstream: %m");
> +    return -1;
> +  }
> +
> +  /* Set the shell variables. */
> +  fprintf (fp, "disk=");
> +  shell_quote (disk, fp);

[1]...here you are correctly quoting the definition of disk, and...[2]

> +  putc ('\n', fp);
> +  if (label) {
> +    fprintf (fp, "label=");
> +    shell_quote (label, fp);
> +    putc ('\n', fp);
> +  }
> +  fprintf (fp, "size=%" PRIi64 "\n", size);
> +  fprintf (fp, "type=");
> +  shell_quote (type, fp);
> +  putc ('\n', fp);
> +
> +  putc ('\n', fp);
> +  fprintf (fp, "%s", command);
> +
> +  if (fclose (fp) == EOF) {
> +    nbdkit_error ("memstream failed");
> +    return -1;
> +  }
> +
> +  r = system (cmd);

Hmm. We can be invoking multiple system() calls across multiple threads 
if clients connect in parallel.  That means we MUST be sure we don't 
leak unintended fds...[3]

Hmm - what happens under 'nbdkit -s tmpdisk $size'?  I suspect the 
script passed to system() should be changed to start with 'exec 
</dev/null >/dev/null' to ensure that mkfs or the user command cannot 
corrupt the NBD client by writing to stdout or confuse nbdkit itself by 
accidentally consuming from stdin (leaving stderr unchanged should be 
fine, though).  Or is that something that 'nbdkit -s' should do 
automatically regardless of the plugin (that is, dup() off the original 
stdin/stdout used to talk to the client, and then changing stdin/stdout 
to /dev/null prior to .load()ing the plugin)?  And if we change nbdkit 
itself for -s, should we also make sure that there is no cross-process 
interaction between the fds used by --run and those handed to the 
captive child nbdkit process?

> +  if (r == -1) {
> +    nbdkit_error ("failed to execute command: %m");
> +    return -1;
> +  }
> +  if (WIFEXITED (r) && WEXITSTATUS (r) != 0) {
> +    nbdkit_error ("command exited with code %d", WEXITSTATUS (r));
> +    return -1;
> +  }
> +  else if (WIFSIGNALED (r)) {
> +    nbdkit_error ("command killed by signal %d", WTERMSIG (r));
> +    return -1;
> +  }
> +  else if (WIFSTOPPED (r)) {
> +    nbdkit_error ("command stopped by signal %d", WSTOPSIG (r));
> +    return -1;
> +  }

Generally, WIFSTOPPED() should be unreachable for the result of 
system().  I don't mind if you leave it for ease of copy/paste, but it 
doesn't hurt to drop it.

> +
> +  return 0;
> +}
> +
> +static void *
> +tmpdisk_open (int readonly)
> +{
> +  struct handle *h;
> +  CLEANUP_FREE char *disk = NULL;
> +  const char *tmpdir;
> +
> +  tmpdir = getenv ("TMPDIR");
> +  if (!tmpdir)
> +    tmpdir = "/var/tmp";

Rather than calling getenv() for every client, should we pre-populate a 
file-scope variable once during .get_ready?  But since we have to 
malloc() per client, that micro-optimization is probably in the noise.

> +
> +  h = malloc (sizeof *h);
> +  if (h == NULL) {
> +    nbdkit_error ("malloc: %m");
> +    goto error;
> +  }
> +  h->fd = -1;
> +  h->can_punch_hole = true;
> +
> +  /* Create the new disk image for this connection. */
> +  if (asprintf (&disk, "%s/tmpdiskXXXXXX", tmpdir) == -1) {

[2]...here, a user-supplied TMPDIR may require shell-quoting at the 
points above.

> +    nbdkit_error ("asprintf: %m");
> +    goto error;
> +  }
> +
> +  h->fd = mkstemp (disk);

[3]...ouch - we are not using mkostemp() to atomically set FD_CLOEXEC, 
nor even mkstemp/fcntl() to set it after the fact due to Haiku still 
lacking mkostemp().  Note that in filters/cow/cow.c, we don't worry 
about a mutex when falling back due to missing mkostemp(), but that was 
done in .load, not .open.  Here, Haiku would definitely need a mutex - 
but we would also need to make sure that such a mutex only guards the 
mkstemp/fcntl pair and not the system() call (there is no reason a 
second client must wait for the mkfs of the first client to finish).  Or 
is mkfs specific enough to Linux that no one would ever need to build 
this plugin on Haiku?

At any rate, without FD_CLOEXEC, and allowing parallel system() calls, 
we have to worry about the consequence of an fd leak: I think that mkfs 
is generally well-behaved at ignoring unknown fds, but if it is 
long-running, this could tie up the disk space used by a fast-exiting 
first client for longer than necessary while waiting for mkfs to finish 
for the second client.  And we have no guarantee that a user-specified 
command is as well-behaved as mkfs about not interfering with unknown 
leaked fds.

> +  if (h->fd == -1) {
> +    nbdkit_error ("mkstemp: %m");
> +    goto error;
> +  }
> +
> +  /* Truncate the disk to a sparse file of the right size. */
> +  if (ftruncate (h->fd, size) == -1) {
> +    nbdkit_error ("ftruncate: %s: %m", disk);
> +    goto error;
> +  }
> +
> +  /* Now run the mkfs command. */
> +  if (run_command (disk) == -1)
> +    goto error;
> +
> +  /* We don't need the disk to appear in the filesystem since we hold
> +   * a file descriptor and access it through that, so unlink the disk.
> +   * This also ensures it is always cleaned up.
> +   */
> +  unlink (disk);

I like this aspect.  It does mean that 'du' won't find the space hogs 
under $TMPDIR (lsof or similar will be needed), but I don't see that as 
a show-stopper.

> +/* Write data to the file. */
> +static int
> +tmpdisk_pwrite (void *handle, const void *buf,
> +                uint32_t count, uint64_t offset,
> +                uint32_t flags)
> +{
> +  struct handle *h = handle;
> +
> +  while (count > 0) {
> +    ssize_t r = pwrite (h->fd, buf, count, offset);
> +    if (r == -1) {
> +      nbdkit_error ("pwrite: %m");
> +      return -1;
> +    }
> +    buf += r;
> +    count -= r;
> +    offset += r;
> +  }
> +
> +  /* Deliberately ignore FUA if present in flags. */

Makes sense.

> +
> +  return 0;
> +}
> +
> +/* This plugin deliberately provides a null flush operation, because
> + * all of the disks created are temporary.
> + */
> +static int
> +tmpdisk_flush (void *handle, uint32_t flags)
> +{
> +  return 0;

As does this.

> +/* Punch a hole in the file. */
> +static int
> +tmpdisk_trim (void *handle, uint32_t count, uint64_t offset, uint32_t flags)
> +{
> +#ifdef FALLOC_FL_PUNCH_HOLE
> +  struct handle *h = handle;
> +  int r;
> +
> +  if (h->can_punch_hole) {
> +    r = do_fallocate (h->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                      offset, count);

Hmm - when this succeeds, we guarantee that we read back as zeroes.  Is 
it worth also implementing .zero for this plugin, to take advantage of 
fallocate when flags permits zeroing by trimming, rather than having to 
always rely on nbdkit's emulation through .pwrite?

> +
> +  .open              = tmpdisk_open,
> +  .close             = tmpdisk_close,
> +  .pread             = tmpdisk_pread,
> +  .pwrite            = tmpdisk_pwrite,
> +  .flush             = tmpdisk_flush,
> +  .trim              = tmpdisk_trim,
> +
> +  .errno_is_preserved = 1,
> +};
> +

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org