<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Feb 4, 2021 at 8:09 PM Dr. David Alan Gilbert <<a href="mailto:dgilbert@redhat.com">dgilbert@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">* Jiachen Zhang (<a href="mailto:zhangjiachen.jaycee@bytedance.com" target="_blank">zhangjiachen.jaycee@bytedance.com</a>) wrote:<br>
> We add two options for virtiofsd crash reconnection: reconnect |<br>
> no_reconnect(default) and<br>
> <br>
> User of virtiofsd should set "-o reconnect" for crash reconnection. Note<br>
> that, when "-o reconnect" is set, some options will not be supported and we<br>
> need to disable them:<br>
> <br>
> - mount namespace: When mount namespace is enabled, reconnected<br>
> virtiofsd will failed to link/rename for EXDEV. The reason is, when<br>
> virtiofsd is sandboxed by mount namespace, attempts to link/rename<br>
> the fds opened before crashing (also recovered from QEMU) will be<br>
> considered as across mount operations between mounts, which is not<br>
> allowed by host kernel.<br>
> <br>
> So we introduce another option "-o mount_ns|no_mount_ns" (mount_ns<br>
> by default, and takes no effect when sandbox=chroot is specified).<br>
> The option "-o no_mount_ns" should be used with "-o reconnect".<br>
<br>
Why not just use -o sandbox=chroot?<br>
<br></blockquote><div><br></div><div>Yes, thanks. I think this is a good idea, and maybe better than add the new options </div><div>(mount_ns | no_mount_ns). Actually, "-o sandbox=" has not been upstream when we</div><div>add the new options.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> - remote locking: As it is hard to determine wether a file is locked or<br>
> not when handling inflight locking requests, we should specify "-o<br>
> no_posix_lock" (default) and "-o no_flock" (default).<br>
> <br>
> - extended attributes: When handling inflight removexattr requests after<br>
> reconnecting, it is hard to determine wether a attr is already removed<br>
> or not. Therefore, we should also use "-o noxattr" (default) with "-o<br>
> reconnect".<br>
<br>
Can you explain a bit more about why removexattr is hard to restart?<br>
<br></blockquote><div><br></div><div>Consider the following removexattr handling procedure:</div><div><br></div><div> lo_removexattr() {</div><div> (1) // ......</div><div> (2). fremovexattr / removexattr</div><div> (3) // ......</div><div> (4). <span class="gmail-n">fuse_reply_err</span><span class="gmail-p">(</span><span class="gmail-n">req</span><span class="gmail-p">,</span> <span class="gmail-n">saverr</span><span class="gmail-p">);</span></div><div><span class="gmail-p"> }</span></div><div><span class="gmail-p"><br></span></div><div><span class="gmail-p">If virtiofsd successfully executed (2) but crashes at (3). When new virtiofsd is</span></div><div><span class="gmail-p">reconnected, as (4) is not executed before crashing, this fuse request will be</span></div><div>re-executed from (1), and (2) will be executed for the second time. As the xattr</div><div>is already removed by the first execution of (2), this time <span class="gmail-p">an </span>ENODATA will be</div><div>returned by (2) and mistakenly replied to fuse by (4).</div><div><br></div><div>Jiachen</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Dave<br>
> <br>
> Signed-off-by: Jiachen Zhang <<a href="mailto:zhangjiachen.jaycee@bytedance.com" target="_blank">zhangjiachen.jaycee@bytedance.com</a>><br>
> Signed-off-by: Xie Yongji <<a href="mailto:xieyongji@bytedance.com" target="_blank">xieyongji@bytedance.com</a>><br>
> ---<br>
> tools/virtiofsd/helper.c | 9 +++<br>
> tools/virtiofsd/passthrough_ll.c | 112 ++++++++++++++++++++++---------<br>
> 2 files changed, 88 insertions(+), 33 deletions(-)<br>
> <br>
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c<br>
> index 75ac48dec2..e0d6525b1a 100644<br>
> --- a/tools/virtiofsd/helper.c<br>
> +++ b/tools/virtiofsd/helper.c<br>
> @@ -174,6 +174,10 @@ void fuse_cmdline_help(void)<br>
> " - chroot: chroot(2) into shared\n"<br>
> " directory (use in containers)\n"<br>
> " default: namespace\n"<br>
> + " -o mount_ns|no_mount_ns enable/disable mount namespace\n"<br>
> + " default: mount_ns\n"<br>
> + " note: if chroot sandbox mode is used,\n"<br>
> + " mount_ns will not take effect.\n"<br>
> " -o timeout=<number> I/O timeout (seconds)\n"<br>
> " default: depends on cache= option.\n"<br>
> " -o writeback|no_writeback enable/disable writeback cache\n"<br>
> @@ -191,6 +195,11 @@ void fuse_cmdline_help(void)<br>
> " to virtiofsd from guest applications.\n"<br>
> " default: no_allow_direct_io\n"<br>
> " -o announce_submounts Announce sub-mount points to the guest\n"<br>
> + " -o reconnect|no_reconnect enable/disable crash reconnection\n"<br>
> + " to enable crash reconnection, the options\n"<br>
> + " no_mount_ns, no_flock, no_posix_lock, and\n"<br>
> + " no_xattr should also be set.\n"<br>
> + " default: no_reconnect\n"<br>
> );<br>
> }<br>
> <br>
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c<br>
> index 815b001076..73a50bd7a3 100644<br>
> --- a/tools/virtiofsd/passthrough_ll.c<br>
> +++ b/tools/virtiofsd/passthrough_ll.c<br>
> @@ -170,6 +170,8 @@ struct lo_data {<br>
> pthread_mutex_t mutex;<br>
> int sandbox;<br>
> int debug;<br>
> + int mount_ns;<br>
> + int reconnect;<br>
> int writeback;<br>
> int flock;<br>
> int posix_lock;<br>
> @@ -204,8 +206,12 @@ static const struct fuse_opt lo_opts[] = {<br>
> { "sandbox=chroot",<br>
> offsetof(struct lo_data, sandbox),<br>
> SANDBOX_CHROOT },<br>
> + { "reconnect", offsetof(struct lo_data, reconnect), 1 },<br>
> + { "no_reconnect", offsetof(struct lo_data, reconnect), 0 },<br>
> { "writeback", offsetof(struct lo_data, writeback), 1 },<br>
> { "no_writeback", offsetof(struct lo_data, writeback), 0 },<br>
> + { "mount_ns", offsetof(struct lo_data, mount_ns), 1 },<br>
> + { "no_mount_ns", offsetof(struct lo_data, mount_ns), 0 },<br>
> { "source=%s", offsetof(struct lo_data, source), 0 },<br>
> { "flock", offsetof(struct lo_data, flock), 1 },<br>
> { "no_flock", offsetof(struct lo_data, flock), 0 },<br>
> @@ -3047,8 +3053,14 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)<br>
> * an empty network namespace to prevent TCP/IP and other network<br>
> * activity in case this process is compromised.<br>
> */<br>
> - if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {<br>
> - fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");<br>
> + int unshare_flag;<br>
> + if (lo->mount_ns) {<br>
> + unshare_flag = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;<br>
> + } else {<br>
> + unshare_flag = CLONE_NEWPID | CLONE_NEWNET;<br>
> + }<br>
> + if (unshare(unshare_flag) != 0) {<br>
> + fuse_log(FUSE_LOG_ERR, "unshare(): %m\n");<br>
> exit(1);<br>
> }<br>
> <br>
> @@ -3083,38 +3095,47 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)<br>
> /* Send us SIGTERM when the parent thread terminates, see prctl(2) */<br>
> prctl(PR_SET_PDEATHSIG, SIGTERM);<br>
> <br>
> - /*<br>
> - * If the mounts have shared propagation then we want to opt out so our<br>
> - * mount changes don't affect the parent mount namespace.<br>
> - */<br>
> - if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {<br>
> - fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");<br>
> - exit(1);<br>
> - }<br>
> + if (lo->mount_ns) {<br>
> + /*<br>
> + * If the mounts have shared propagation then we want to opt out so our<br>
> + * mount changes don't affect the parent mount namespace.<br>
> + */<br>
> + if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {<br>
> + fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");<br>
> + exit(1);<br>
> + }<br>
> <br>
> - /* The child must remount /proc to use the new pid namespace */<br>
> - if (mount("proc", "/proc", "proc",<br>
> - MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {<br>
> - fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");<br>
> - exit(1);<br>
> - }<br>
> + /* The child must remount /proc to use the new pid namespace */<br>
> + if (mount("proc", "/proc", "proc",<br>
> + MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {<br>
> + fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");<br>
> + exit(1);<br>
> + }<br>
> <br>
> - /*<br>
> - * We only need /proc/self/fd. Prevent ".." from accessing parent<br>
> - * directories of /proc/self/fd by bind-mounting it over /proc. Since / was<br>
> - * previously remounted with MS_REC | MS_SLAVE this mount change only<br>
> - * affects our process.<br>
> - */<br>
> - if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {<br>
> - fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");<br>
> - exit(1);<br>
> - }<br>
> + /*<br>
> + * We only need /proc/self/fd. Prevent ".." from accessing parent<br>
> + * directories of /proc/self/fd by bind-mounting it over /proc. Since<br>
> + * / was previously remounted with MS_REC | MS_SLAVE this mount change<br>
> + * only affects our process.<br>
> + */<br>
> + if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {<br>
> + fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");<br>
> + exit(1);<br>
> + }<br>
> <br>
> - /* Get the /proc (actually /proc/self/fd, see above) file descriptor */<br>
> - lo->proc_self_fd = open("/proc", O_PATH);<br>
> - if (lo->proc_self_fd == -1) {<br>
> - fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");<br>
> - exit(1);<br>
> + /* Get the /proc (actually /proc/self/fd, see above) file descriptor */<br>
> + lo->proc_self_fd = open("/proc", O_PATH);<br>
> + if (lo->proc_self_fd == -1) {<br>
> + fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");<br>
> + exit(1);<br>
> + }<br>
> + } else {<br>
> + /* Now we can get our /proc/self/fd directory file descriptor */<br>
> + lo->proc_self_fd = open("/proc/self/fd", O_PATH);<br>
> + if (lo->proc_self_fd == -1) {<br>
> + fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");<br>
> + exit(1);<br>
> + }<br>
> }<br>
> }<br>
> <br>
> @@ -3347,7 +3368,9 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,<br>
> {<br>
> if (lo->sandbox == SANDBOX_NAMESPACE) {<br>
> setup_namespaces(lo, se);<br>
> - setup_mounts(lo->source);<br>
> + if (lo->mount_ns) {<br>
> + setup_mounts(lo->source);<br>
> + }<br>
> } else {<br>
> setup_chroot(lo);<br>
> }<br>
> @@ -3438,7 +3461,11 @@ static void setup_root(struct lo_data *lo, struct lo_inode *root)<br>
> struct stat stat;<br>
> uint64_t mnt_id;<br>
> <br>
> - fd = open("/", O_PATH);<br>
> + if (lo->mount_ns) {<br>
> + fd = open("/", O_PATH);<br>
> + } else {<br>
> + fd = open(lo->source, O_PATH);<br>
> + }<br>
> if (fd == -1) {<br>
> fuse_log(FUSE_LOG_ERR, "open(%s, O_PATH): %m\n", lo->source);<br>
> exit(1);<br>
> @@ -3515,6 +3542,9 @@ int main(int argc, char *argv[])<br>
> lo.allow_direct_io = 0,<br>
> lo.proc_self_fd = -1;<br>
> <br>
> + lo.reconnect = 0;<br>
> + lo.mount_ns = 1;<br>
> +<br>
> /* Don't mask creation mode, kernel already did that */<br>
> umask(0);<br>
> <br>
> @@ -3577,6 +3607,22 @@ int main(int argc, char *argv[])<br>
> goto err_out1;<br>
> }<br>
> <br>
> + /* For sandbox mode "chroot", set the mount_ns option to 0. */<br>
> + if (lo.sandbox == SANDBOX_CHROOT) {<br>
> + lo.mount_ns = 0;<br>
> + }<br>
> +<br>
> + if (lo.reconnect) {<br>
> + if ((lo.sandbox == SANDBOX_NAMESPACE && lo.mount_ns) || lo.flock<br>
> + || lo.posix_lock || lo.xattr) {<br>
> + printf("Mount namespace, remote lock, and extended attributes"<br>
> + " are not supported by reconnection (-o reconnect). Please "<br>
> + "specify -o no_mount_ns, -o no_flock (default), -o"<br>
> + "no_posix_lock (default), and -o no_xattr (default).\n");<br>
> + exit(1);<br>
> + }<br>
> + }<br>
> +<br>
> if (opts.log_level != 0) {<br>
> current_log_level = opts.log_level;<br>
> } else {<br>
> -- <br>
> 2.20.1<br>
> <br>
-- <br>
Dr. David Alan Gilbert / <a href="mailto:dgilbert@redhat.com" target="_blank">dgilbert@redhat.com</a> / Manchester, UK<br>
<br>
</blockquote></div></div>