[Virtio-fs] Ways to uniquely and persistently identify nodes

Wed Jan 15 12:58:06 UTC 2020

On 15.01.20 11:12, Dr. David Alan Gilbert wrote:
> * Max Reitz (mreitz at redhat.com) wrote:
>> Hi,
>>
>> As discussed in today’s meeting, there is a problem with uniquely and
>> persistently identifying nodes in the guest.
>>
>> Actually, there are multiple problems:
>>
>> (1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
>> just the st_ino from the host file, so if you have mounted multiple
>> different filesystems in the exported directory tree, you may get
>> collisions.
>>
>> (2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
>> basically) is not persistent.  It is just an index into a vector that
>> contains all open inodes, and whenever virtiofsd is restarted, the
>> vector is renewed.  That means that whenever this happens, all
>> fuse_ino_t values the guest holds will become invalid.  (And when
>> virtiofsd starts handing out new fuse_ino_t values, those will probably
>> not point to the same inodes as before.)
>>
>> (3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
>> by handing out the fuse_ino_t value as the handle.  This is not only a
>> problem as long as fuse_ino_t is not persistent (see (2)), but also in
>> general, because the fuse_ino_t value is only valid (per FUSE protocol)
>> as long as the inode is referenced by the guest.
>>
>>
>> The first question that I think needs to be asked is whether we care
>> about each of this points at all.
>>
>> (1) Maybe it just doesn’t matter whether the st_ino values are unique.
>>
>> (2) Maybe we don’t care about virtiofsd being restarted while the guest
>> is running or only paused.  (“Restarting” includes migration of the
>> daemon to a different host.)
> 
> I prefer to split that answer because I care about migration but not
> about restarting the daemon in place.
> 
>> (3) I suppose we do care about this.
>>
>>
>> Assuming we do care about the points, here are some ways I have
>> considered of addressing them:
>>
>> (1)
>>
>> (a)
>>
>> If we could make the 64-bit fuse_ino_t unique and persistent (see (2)),
>> we could use that for st_ino (also a 64-bit field).
>>
>> (This is the case if we keep the current schema for fuse_ino_t, be it
>> because we don’t care about (2) or because we want (2a).)
>>
>> (b)
>>
>> Otherwise, we probably want to continue passing through st_ino and then
>> ensure that stat’s st_dev is unique for each submount in the exported
>> tree.  We can achieve that by extending the FUSE protocol for virtiofsd
>> to announce submounts and then the FUSE kernel driver to automount them.
> 
> This feels nice to me, since the st_dev map should be small.
> 
>>  (This means that these submounts in the guest are then technically
>> unrelated filesystems.  It also means that the FUSE driver would need to
>> automount them with the “virtiofs” fs type, which is kind of weird, and
>> restricts this solution to virtiofs.)
> 
> Well it has to automount them witht he same type as it's mounted with;
> so it's not virtiofs specific.

Right.  But we need some way to tell the filesystem the context for the
submount.  I’m not sure whether that’s possible in an fs-agnostic way,
because the only information we can reasonably pass goes to
vfs_submount()’s @name parameter.  I think?  Or maybe we can make @data
point to a FUSE-common structure that every FUSE fs with submounts would
then need to be able to parse?  But I’m not sure how that would even
translate to userspace.

>> (2)
>>
>> (a)
>>
>> We can keep the current way if we just store the in-memory mapping while
>> virtiofsd is suspended (and migrate it it if we want to migrate the
>> virtiofsd instance).  The table may grow to be very large, though, and
>> it contains for example file descriptors that we would need to migrate,
>> too (perhaps as file handles?).
> 
> The 'when suspended' worries me - if it's important data to persist then
> we probably need to be more careful with it; i.e. keep it sync'd to
> disk.  If it's not important long term then do we need to keep it that
> long? 
> Be wary of migating a large, rapidly changing table.
> 
>> (b)
>>
>> We could extend the fuse_ino_t type to an arbitrary size basically, to
>> be negotiated between FUSE server and client.  This would require
>> extensive modification of the FUSE protocol and kernel driver (and would
>> ask for respective modification of libfuse, too), though.  Such a larger
>> value could then capture both a submount ID and a unique identifier for
>> inodes on the respective host filesystems, such as st_ino.  This would
>> ensure that every virtiofsd instance would generate the same fuse_ino_t
>> values for the same nodes on the same exported tree.
>>
>> However, note that this doesn’t auto-populate the fuse_ino_t mappings:
>> When after restarting virtiofsd the server wants to access an existing
>> inode, it can’t, because there is no good way to translate even larger
>> fuse_ino_t values to a file descriptor.  (We could do that if the
>> fuse_ino_t value encapsulated a handle.  (As in open_by_handle_at().)
>> The problem is that we can’t trust the guest to keep a handle, so we
>> must ensure that the handle returned points to a file the guest is
>> allowed to access.  Doing that cryptographically (e.g. with a MAC) is
>> probably out of the question, because that would make fuse_ino_t really
>> big.  Another idea would be to set a flag on the host FS for files that
>> the guest has a handle to.  But this flag would need to be
>> guest-specific...  So we’d probably again end up with a large database
>> just as in (2a).  (It doesn’t need to be a flag on the FS, it could also
>> be a database, I suppose.))
> 
> I'm no crypto person, but I don't know how to show that's safe.
> Some inode numbers are well-known (e.g. on xfs / always seems to be 128
> for me).  I'm just worrying that makes it easier for the guest to figure
> out the crypto.

Well, with MACs the adversary sees the data and the MAC together anyway,
so they’re designed in such a way that you can’t infer the key from that
information.

And of course they’re designed such that without a key, you can’t
generate a valid MAC for given data.

IOW, that’s the precise reason we’d use a MAC: So the guest can’t just
return a well-known handle to get access to files it shouldn’t be able
to access.

>> We could also re-enumerate the exported tree after reopening (perhaps
>> lazily for each exported filesystem) and thus recreate the mapping.  But
>> this would take as much time as a “find” over the whole exported tree.
>>
>> (c)
>>
>> We could complement the fuse_ino_t value by a context ID, that in our
>> case would probably be derived from the submount ID (e.g. the relative
>> mount point).  This would only require minor modification of the FUSE
>> protocol: Detecting mount points in lookups; a new command to retrieve a
>> mount point’s context ID; and some way to set this context ID.
>>
>> We could set the context ID either explicitly with a new command; or as
>> part of every FUSE request (a field in the request header); or just as
>> part of virtio-fs (be it with one virtqueue per context (which may or
>> may not be feasible), or just by prefixing every FUSE request on the
>> line with a context ID).
> 
> One-virtqueue per context doesn't seem feasible to me; we don't know
> how many submounts there will be and there could be lots of them.

OK, good, so I don’t have to worry about it. :-)

Max

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/virtio-fs/attachments/20200115/b1b34f05/attachment.sig>