[Virtio-fs] [PATCH v3 1/1] vhost-user-fs: add migration type property

Mon Feb 27 10:19:53 UTC 2023

On 24/02/2023 06:14, Anton Kuchin wrote:
> On 23/02/2023 23:24, Stefan Hajnoczi wrote:
>> On Thu, Feb 23, 2023 at 02:36:33AM -0500, Michael S. Tsirkin wrote:
>>> On Wed, Feb 22, 2023 at 03:21:42PM -0500, Michael S. Tsirkin wrote:
>>>> On Wed, Feb 22, 2023 at 08:25:19PM +0200, Anton Kuchin wrote:
>>>>> On 22/02/2023 19:12, Michael S. Tsirkin wrote:
>>>>>> On Wed, Feb 22, 2023 at 07:05:47PM +0200, Anton Kuchin wrote:
>>>>>>> On 22/02/2023 18:51, Michael S. Tsirkin wrote:
>>>>>>>> On Wed, Feb 22, 2023 at 06:49:10PM +0200, Anton Kuchin wrote:
>>>>>>>>> On 22/02/2023 17:14, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>>>> On 22.02.23 17:25, Anton Kuchin wrote:
>>>>>>>>>>>>>> +static int vhost_user_fs_pre_save(void *opaque)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +    VHostUserFS *fs = opaque;
>>>>>>>>>>>>>> +    g_autofree char *path = 
>>>>>>>>>>>>>> object_get_canonical_path(OBJECT(fs));
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +    switch (fs->migration_type) {
>>>>>>>>>>>>>> +    case VHOST_USER_MIGRATION_TYPE_NONE:
>>>>>>>>>>>>>> +        error_report("Migration is blocked by device 
>>>>>>>>>>>>>> %s", path);
>>>>>>>>>>>>>> +        break;
>>>>>>>>>>>>>> +    case VHOST_USER_MIGRATION_TYPE_EXTERNAL:
>>>>>>>>>>>>>> +        return 0;
>>>>>>>>>>>>>> +    default:
>>>>>>>>>>>>>> +        error_report("Migration type '%s' is not
>>>>>>>>>>>>>> supported by device %s",
>>>>>>>>>>>>>> + VhostUserMigrationType_str(fs->migration_type), path);
>>>>>>>>>>>>>> +        break;
>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +    return -1;
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>> Should we also add this as .pre_load, to force user select
>>>>>>>>>>>>> correct migration_type on target too?
>>>>>>>>>>>> In fact, I would claim we only want pre_load.
>>>>>>>>>>>> When qemu is started on destination we know where it's 
>>>>>>>>>>>> migrated
>>>>>>>>>>>> from so this flag can be set.
>>>>>>>>>>>> When qemu is started on source we generally do not yet know so
>>>>>>>>>>>> we don't know whether it's safe to set this flag.
>>>>>>>>>> But destination is a "source" for next migration, so there 
>>>>>>>>>> shouldn't be
>>>>>>>>>> real difference.
>>>>>>>>>> The new property has ".realized_set_allowed = true", so, as I 
>>>>>>>>>> understand
>>>>>>>>>> it may be changed at any time, so that's not a problem.
>>>>>>>>> Yes, exactly. So destination's property sets not how it will 
>>>>>>>>> handle this
>>>>>>>>> incoming
>>>>>>>>> migration but the future outgoing one.
>>>>>>>> How do you know where you are going to migrate though?
>>>>>>>> I think you don't.
>>>>>>>> Setting it on source is better since we know where we
>>>>>>>> are migrating from.
>>>>>>> Yes, I don't know where I'm going to migrate to. This is why 
>>>>>>> property
>>>>>>> affects only how source saves state on outgoing migration.
>>>>>> Um. I don't get the logic.
>>>>> For this feature to work we need orchestrator to manage the 
>>>>> migration. And
>>>>> we
>>>>> generally assume that it is responsibility of orchestrator to ensure
>>>>> matching
>>>>> properties on source and destination.
>>>>> As orchestrator manages both sides of migration it can set option 
>>>>> (and we
>>>>> can
>>>>> check it) on either source or destination. Now it's not important 
>>>>> which side
>>>>> we
>>>>> select, because now the option is essentially binary allow/deny 
>>>>> (but IMHO it
>>>>> is much better to refuse source to migrate than find later that 
>>>>> state can't
>>>>> be
>>>>> loaded by destination, in case of file migration this becomes 
>>>>> especially
>>>>> painful).
>>>>>
>>>>> But there are plans to add internal migration option (extract FUSE 
>>>>> state
>>>>> from
>>>>> backend and transfer it in QEMU migration stream), and that's where
>>>>> setting/checking
>>>>> on source becomes important because it will rely on this property 
>>>>> to decide
>>>>> if
>>>>> extra state form backend needs to be put in the migration stream 
>>>>> subsection.
>>>>
>>>> If we do internal migration that will be a different property
>>>> which has to match on source *and* destination.
>>>>
>>>>
>>>>> If you are concerned about orchestrator breaking assumption of 
>>>>> matching
>>>>> properties
>>>>> on source and destination this is not really supported AFAIK but I 
>>>>> don't
>>>>> think we
>>>>> need to punish it for this, maybe it has its reasons: I can 
>>>>> imagine scenario
>>>>> where orchestrator could want to migrate from source with
>>>>> 'migration=external'
>>>>> to destination with 'migration=none' to ensure that destination 
>>>>> can't be
>>>>> migrated further.
>>>> No. I am concerned about a simple practical matter:
>>>> - I decide to restart qemu on the same host - so I need to enable
>>>>    migration
>>>> - Later I decide to migrate qemu to another host - this should be
>>>>    blocked
>>>>
>>>>
>>>> Property on source does not satisfy both at the same time.
>>>> Property on destination does.
>>>
>>> Stefan what's your take on this? Should we move this from
>>> save to load hook?
>> This property can be changed on the source at runtime via qom-set, so
>> you don't need to predict the future. The device can be started from an
>> incoming migration with "external" and then set to "stateful" migration
>> to migrate to another host later on.
>>
>> Anton, can you share the virtiofsd patches so we have a full picture of
>> how "external" migration works? I'd like to understand the workflow and
>> also how it can be extended when "stateful" migration is added.
>
> Unfortunately internal implementation is relying heavily on our 
> infrastructure,
> and rust virtiofsd still lacks dirty tracking so it is not ready yet. 
> But I did
> have a PoC for deprecated now C virtiofsd that I didn't bother to prepare
> for upstreaming because C version was declared unsupported. It 
> essentially adds
> reconnect and this was the only thing required from virtiofsd to support
> migration via file.
>
> If this helps I'll try to find patches or recreate then and will be 
> happy to share.


Found my stash with PoC. It has some TODOs and needs more work to become 
submission
quality, but I hope this can be helpful to show the idea.

I checked it with file migration on the same host to update qemu binary 
like this:
  1. Run tools/virtiofsd/virtiofsd with "--reconnect" flag
  2. Run src qemu VM with vhost-user-fs device
  3. Mount fs in guest and run fio with verification on test file in 
shared directory
  4. Command via QMP to migrate VM to file
  5. Run dst qemu VM with vhost-user-fs device and "-incoming defer" flag
  6. Command via QMP to load migration stream from file and continue VM
  7. Check that fio keeps running in guest with no errors

[PATCH] tools/virtiofsd: reconnect PoC

Keep daemon listening after disconnect so session can be continued
after VM is restored.

Signed-off-by: Anton Kuchin <antonkuchin at yandex-team.ru>
---
  tools/virtiofsd/fuse_lowlevel.h       |  1 +
  tools/virtiofsd/fuse_virtio.c         | 59 +++++++++++++++++----------
  tools/virtiofsd/helper.c              |  1 +
  tools/virtiofsd/passthrough_ll.c      | 49 ++++++++++++++++------
  tools/virtiofsd/passthrough_seccomp.c | 12 +++++-
  tools/virtiofsd/passthrough_seccomp.h |  2 +-
  6 files changed, 87 insertions(+), 37 deletions(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.h 
b/tools/virtiofsd/fuse_lowlevel.h
index b889dae4de..d5f3bf05ba 100644
--- a/tools/virtiofsd/fuse_lowlevel.h
+++ b/tools/virtiofsd/fuse_lowlevel.h
@@ -1795,6 +1795,7 @@ struct fuse_cmdline_opts {
      int log_level;
      unsigned int max_idle_threads;
      unsigned long rlimit_nofile;
+    int reconnect;
  };

  /**
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 9368e292e4..64782cd78d 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -837,6 +837,7 @@ static const VuDevIface fv_iface = {
  int virtio_loop(struct fuse_session *se)
  {
      fuse_log(FUSE_LOG_INFO, "%s: Entry\n", __func__);
+    int ret = 0;

      while (!fuse_session_exited(se)) {
          struct pollfd pf[1];
@@ -858,7 +859,13 @@ int virtio_loop(struct fuse_session *se)
              break;
          }
          assert(poll_res == 1);
-        if (pf[0].revents & (POLLERR | POLLHUP | POLLNVAL)) {
+        if (pf[0].revents & POLLHUP) {
+            fuse_log(FUSE_LOG_ERR, "%s: Client disconnected: %x\n", 
__func__,
+                     pf[0].revents);
+            ret = -1;
+            break;
+        }
+        if (pf[0].revents & (POLLERR | POLLNVAL)) {
              fuse_log(FUSE_LOG_ERR, "%s: Unexpected poll revents %x\n", 
__func__,
                       pf[0].revents);
              break;
@@ -885,7 +892,7 @@ int virtio_loop(struct fuse_session *se)
      stop_all_queues(se->virtio_dev);
      fuse_log(FUSE_LOG_INFO, "%s: Exit\n", __func__);

-    return 0;
+    return ret;
  }

  static void strreplace(char *s, char old, char new)
@@ -1015,25 +1022,31 @@ int virtio_session_mount(struct fuse_session *se)
  {
      int ret;

-    /*
-     * Test that unshare(CLONE_FS) works. fv_queue_worker() will need 
it. It's
-     * an unprivileged system call but some Docker/Moby versions are 
known to
-     * reject it via seccomp when CAP_SYS_ADMIN is not given.
-     *
-     * Note that the program is single-threaded here so this syscall has no
-     * visible effect and is safe to make.
-     */
-    ret = unshare(CLONE_FS);
-    if (ret == -1 && errno == EPERM) {
-        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
-                "running in a container please check that the container "
-                "runtime seccomp policy allows unshare.\n");
-        return -1;
-    }

-    ret = fv_create_listen_socket(se);
-    if (ret < 0) {
-        return ret;
+    if (se->vu_listen_fd == -1) {
+        fuse_log(FUSE_LOG_INFO, "listenfd is closed. set it up\n");
+        /*
+         * Test that unshare(CLONE_FS) works. fv_queue_worker() will 
need it. It's
+         * an unprivileged system call but some Docker/Moby versions 
are known to
+         * reject it via seccomp when CAP_SYS_ADMIN is not given.
+         *
+         * Note that the program is single-threaded here so this 
syscall has no
+         * visible effect and is safe to make.
+         */
+        ret = unshare(CLONE_FS);
+        if (ret == -1 && errno == EPERM) {
+            fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with 
EPERM. If "
+                    "running in a container please check that the 
container "
+                    "runtime seccomp policy allows unshare.\n");
+            return -1;
+        }
+
+        ret = fv_create_listen_socket(se);
+        if (ret < 0) {
+            return ret;
+        }
+    } else {
+        fuse_log(FUSE_LOG_INFO, "listenfd is open. reconnecting\n");
      }

      se->fd = -1;
@@ -1046,8 +1059,8 @@ int virtio_session_mount(struct fuse_session *se)
          close(se->vu_listen_fd);
          return -1;
      }
-    close(se->vu_listen_fd);
-    se->vu_listen_fd = -1;
+
+    /* TODO: close vu_listen_fd here if not reconnect */
      fuse_log(FUSE_LOG_INFO, "%s: Received vhost-user socket connection\n",
               __func__);

@@ -1068,6 +1081,8 @@ int virtio_session_mount(struct fuse_session *se)

  void virtio_session_close(struct fuse_session *se)
  {
+    /* TODO: vu_listen_fd can be closed in session_mount */
+    close(se->vu_listen_fd);
      close(se->vu_socketfd);

      if (!se->virtio_dev) {
diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index f5f66f292c..236a58f109 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -53,6 +53,7 @@ static const struct fuse_opt fuse_helper_opts[] = {
      FUSE_HELPER_OPT_VALUE("log_level=info", log_level, FUSE_LOG_INFO),
      FUSE_HELPER_OPT_VALUE("log_level=warn", log_level, FUSE_LOG_WARNING),
      FUSE_HELPER_OPT_VALUE("log_level=err", log_level, FUSE_LOG_ERR),
+    FUSE_HELPER_OPT_VALUE("--reconnect", reconnect, 0),
      FUSE_OPT_END
  };

diff --git a/tools/virtiofsd/passthrough_ll.c 
b/tools/virtiofsd/passthrough_ll.c
index 40ea2ed27f..487d320c70 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3819,7 +3819,7 @@ static void setup_wait_parent_capabilities(void)
  /*
   * Move to a new mount, net, and pid namespaces to isolate this process.
   */
-static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
+static void setup_namespaces(struct lo_data *lo, struct fuse_session 
*se, bool enable_reconnect)
  {
      pid_t child;

@@ -3829,12 +3829,21 @@ static void setup_namespaces(struct lo_data *lo, 
struct fuse_session *se)
       * is also needed so that we can remount /proc for the new pid
       * namespace.
       *
-     * Our UNIX domain sockets have been created.  Now we can move to
-     * an empty network namespace to prevent TCP/IP and other network
-     * activity in case this process is compromised.
+     * Our UNIX domain sockets have been created. If we don't need 
reconnect
+     * now we can move to an empty network namespace to prevent TCP/IP and
+     * other network activity in case this process is compromised.
+     *
+     * TODO: Need to double-check if this is necessary. Looks like unix 
sockets
+     * can be shared across network namespaces. Maybe it is better if 
socket can
+     * be created in new namespace (but before we forbid listen with 
seccomp).
       */
-    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
-        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): 
%m\n");
+    int unshare_flags = CLONE_NEWPID | CLONE_NEWNS;
+    if (!enable_reconnect) {
+        unshare_flags |= CLONE_NEWNET;
+    }
+    if (unshare(unshare_flags) != 0) {
+        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS%s): 
%m\n",
+                 enable_reconnect ? " | CLONE_NEWNET" : "");
          exit(1);
      }

@@ -4146,16 +4155,16 @@ static void setup_chroot(struct lo_data *lo)
   * source directory.  This reduces the impact of arbitrary code 
execution bugs.
   */
  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
-                          bool enable_syslog)
+                          bool enable_syslog, bool enable_reconnect)
  {
      if (lo->sandbox == SANDBOX_NAMESPACE) {
-        setup_namespaces(lo, se);
+        setup_namespaces(lo, se, enable_reconnect);
          setup_mounts(lo->source);
      } else {
          setup_chroot(lo);
      }

-    setup_seccomp(enable_syslog);
+    setup_seccomp(enable_syslog, enable_reconnect);
      setup_capabilities(g_strdup(lo->modcaps));
  }

@@ -4500,13 +4509,27 @@ int main(int argc, char *argv[])
      /* Must be before sandbox since it wants /proc */
      setup_capng();

-    setup_sandbox(&lo, se, opts.syslog);
+    setup_sandbox(&lo, se, opts.syslog, opts.reconnect);

      setup_root(&lo, &lo.root);
-    /* Block until ctrl+c or fusermount -u */
-    ret = virtio_loop(se);
+    do {
+        /* TODO: this loop should be like {mount-loop-unmount} but setup of
+         * listen descriptor happens in mount and will fail after 
sandboxing.
+         * Need to extract setup to virtio_session_new where we can create
+         * socket before sandboxing and in session_mount only accept 
client.*/
+        /* Block until ctrl+c, fusermount -u or client disconnect */
+        ret = virtio_loop(se);
+        fuse_log(FUSE_LOG_ERR, "ret is %d, %s\n", ret,
+                 ret ? "reconnecting" : "terminating");
+        fuse_session_unmount(se);
+        if (ret) {
+            if (fuse_session_mount(se) != 0) {
+                fuse_log(FUSE_LOG_ERR, "failed to remount\n");
+                goto err_out3;
+            }
+        }
+    } while (opts.reconnect && !ret);

-    fuse_session_unmount(se);
      cleanup_capng();
  err_out3:
      fuse_remove_signal_handlers(se);
diff --git a/tools/virtiofsd/passthrough_seccomp.c 
b/tools/virtiofsd/passthrough_seccomp.c
index 0033dab493..8dd773a062 100644
--- a/tools/virtiofsd/passthrough_seccomp.c
+++ b/tools/virtiofsd/passthrough_seccomp.c
@@ -129,6 +129,11 @@ static const int syscall_allowlist_syslog[] = {
      SCMP_SYS(sendto),
  };

+/* Allow accept to handle client reconnect */
+static const int syscall_allowlist_reconnect[] = {
+    SCMP_SYS(accept),
+};
+
  static void add_allowlist(scmp_filter_ctx ctx, const int syscalls[], 
size_t len)
  {
      size_t i;
@@ -142,7 +147,7 @@ static void add_allowlist(scmp_filter_ctx ctx, const 
int syscalls[], size_t len)
      }
  }

-void setup_seccomp(bool enable_syslog)
+void setup_seccomp(bool enable_syslog, bool enable_reconnect)
  {
      scmp_filter_ctx ctx;

@@ -166,6 +171,11 @@ void setup_seccomp(bool enable_syslog)
                        G_N_ELEMENTS(syscall_allowlist_syslog));
      }

+    if (enable_reconnect) {
+        add_allowlist(ctx, syscall_allowlist_reconnect,
+ G_N_ELEMENTS(syscall_allowlist_reconnect));
+    }
+
      /* libvhost-user calls this for post-copy migration, we don't need 
it */
      if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(ENOSYS),
                           SCMP_SYS(userfaultfd), 0) != 0) {
diff --git a/tools/virtiofsd/passthrough_seccomp.h 
b/tools/virtiofsd/passthrough_seccomp.h
index 12674fc050..bec47b114c 100644
--- a/tools/virtiofsd/passthrough_seccomp.h
+++ b/tools/virtiofsd/passthrough_seccomp.h
@@ -9,6 +9,6 @@
  #ifndef VIRTIOFSD_PASSTHROUGH_SECCOMP_H
  #define VIRTIOFSD_PASSTHROUGH_SECCOMP_H

-void setup_seccomp(bool enable_syslog);
+void setup_seccomp(bool enable_syslog, bool enable_reconnect);

  #endif /* VIRTIOFSD_PASSTHROUGH_SECCOMP_H */
-- 
2.37.2


>
>>
>>>>
>>>>>>
>>>>>>>>>>> This property selects if VM can migrate and if it can what 
>>>>>>>>>>> should
>>>>>>>>>>> qemu put
>>>>>>>>>>> to the migration stream. So we select on source what type of
>>>>>>>>>>> migration is
>>>>>>>>>>> allowed for this VM, destination can't check anything at 
>>>>>>>>>>> load time.
>>>>>>>>>> OK, so the new field "migration" regulates only outgoing 
>>>>>>>>>> migration and
>>>>>>>>>> do nothing for incoming. On incoming migration the migration 
>>>>>>>>>> stream
>>>>>>>>>> itself defines the type of device migration.
>>>>>>>>>> Worth mentioning in doc?
>>>>>>>>> Good point. I don't think this deserves a respin but if I have 
>>>>>>>>> to send v4
>>>>>>>>> I'll include
>>>>>>>>> clarification in it.