[Virtio-fs] Ways to uniquely and persistently identify nodes

Max Reitz mreitz at redhat.com
Tue Jan 14 16:49:07 UTC 2020


On 14.01.20 17:13, Miklos Szeredi wrote:
> On Mon, Jan 13, 2020 at 6:47 PM Max Reitz <mreitz at redhat.com> wrote:
>>
>> Hi,
>>
>> As discussed in today’s meeting, there is a problem with uniquely and
>> persistently identifying nodes in the guest.
>>
>> Actually, there are multiple problems:
>>
>> (1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
>> is just the st_ino from the host file, so if you have mounted multiple
>> different filesystems in the exported directory tree, you may get
>> collisions.
>>
>> (2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
>> basically) is not persistent.  It is just an index into a vector that
>> contains all open inodes, and whenever virtiofsd is restarted, the
>> vector is renewed.  That means that whenever this happens, all
>> fuse_ino_t values the guest holds will become invalid.  (And when
>> virtiofsd starts handing out new fuse_ino_t values, those will probably
>> not point to the same inodes as before.)
>>
>> (3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
>> by handing out the fuse_ino_t value as the handle.  This is not only a
>> problem as long as fuse_ino_t is not persistent (see (2)), but also in
>> general, because the fuse_ino_t value is only valid (per FUSE protocol)
>> as long as the inode is referenced by the guest.
>>
>> The first question that I think needs to be asked is whether we care
>> about each of these points at all.
>>
>> (1) Maybe it just doesn’t matter whether the st_ino values are unique.
> 
> It does matter, otherwise we can't claim it's a POSIX filesystem, and
> applications can't rely on the guarantees made by the standard.
> st_ino is used to find hard links by backup utilities and st_ino +
> d_ino values are used by tree traversal (such as find(1)).
> 
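To illustrate the hard-link point: a backup tool typically treats two
paths as the same file when their (st_dev, st_ino) pairs match.  Below is
a minimal Python sketch of that check.  find_hard_links is a hypothetical
helper name, not taken from any real backup utility.

```python
import os
from collections import defaultdict


def find_hard_links(root):
    """Group non-directory entries under root by (st_dev, st_ino)."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # do not follow symlinks
            groups[(st.st_dev, st.st_ino)].append(path)
    # Keep only identities that were seen under more than one name.
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

If two unrelated files report the same st_ino under a single st_dev, a
check like this would wrongly report them as links to one file.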
>>
>> (2) Maybe we don’t care about virtiofsd being restarted while the guest
>> is running or only paused.  (“Restarting” includes migration of the
>> daemon to a different host.)
>>
>> (3) I suppose we do care about this.
>>
>>
>> Assuming we do care about the points, here are some ways I have
>> considered of addressing them:
>>
>> (1)
>>
>> (a)
>>
>> If we could make the 64-bit fuse_ino_t unique and persistent (see (2)),
>> we could use that for st_ino (also a 64-bit field).
> 
> My gut feeling is that we'd want to avoid boundless maps stored in
> memory or any kind of storage.

Well, if we don’t need persistent fuse_ino_t values (i.e., we don’t care
about (2)), then we could just use what passthrough_ll currently does.
It does sound a bit like you don’t like even the current mapping,
though. (?)

> In fact we want to do the opposite: avoid having to keep a map of
> objects currently in guest cache.  We want to flush the server maps
> independently of the guest cache when the maps grow too large or there
> are too many open file descriptors (a big issue with running virtiofsd
> unprivileged)...
> 
>>
>> (b)
>>
>> Otherwise, we probably want to continue passing through st_ino and then
>> ensure that stat’s st_dev is unique for each submount in the exported
>> tree.  We can achieve that by extending the FUSE protocol for virtiofsd
>> to announce submounts and then the FUSE kernel driver to automount them.
>>  (This means that these submounts in the guest are then technically
>> unrelated filesystems.  It also means that the FUSE driver would need to
>> automount them with the “virtiofs” fs type, which is kind of weird, and
>> restricts this solution to virtiofs.)
>>
>>
>> (2)
>>
>> (a)
>>
>> We can keep the current way if we just store the in-memory mapping while
>> virtiofsd is suspended (and migrate it if we want to migrate the
>> virtiofsd instance).  The table may grow to be very large, though, and
>> it contains for example file descriptors that we would need to migrate,
>> too (perhaps as file handles?).
>>
>> (b)
>>
>> We could extend the fuse_ino_t type to an arbitrary size basically, to
>> be negotiated between FUSE server and client.  This would require
>> extensive modification of the FUSE protocol and kernel driver (and would
>> ask for respective modification of libfuse, too), though.  Such a larger
>> value could then capture both a submount ID and a unique identifier for
>> inodes on the respective host filesystems, such as st_ino.  This would
>> ensure that every virtiofsd instance would generate the same fuse_ino_t
>> values for the same nodes on the same exported tree.
>>
>> However, note that this doesn’t auto-populate the fuse_ino_t mappings:
>> When after restarting virtiofsd the server wants to access an existing
>> inode, it can’t, because there is no good way to translate even larger
>> fuse_ino_t values to a file descriptor.  (We could do that if the
>> fuse_ino_t value encapsulated a handle.  (As in open_by_handle_at().)
> 
> ...and this is the proposal that would solve that one as well.  If we
> could encapsulate the file handle into all messages, then we wouldn't
> have to worry about refcounting objects and keeping files open.  The
> server could just flush its caches independently of the guest.
> 
>> The problem is that we can’t trust the guest to keep a handle, so we
>> must ensure that the handle returned points to a file the guest is
>> allowed to access.  Doing that cryptographically (e.g. with a MAC) is
>> probably out of the question, because that would make fuse_ino_t really
>> big.
> 
> I'm not a crypto expert, but why would that need to be big?  AES has
> 16 byte block size, so the handle would need to be padded to be a
> multiple of 16 bytes, but that doesn't sound excessive at all.

A hash-based MAC (HMAC) is a keyed hash, so it is as long as the hash is.
That is, 32 bytes for SHA-256 or 64 for SHA-512.  I naively expected
that increasing the file handle size from 8 to 32 + handle (which is
probably 12 to 16 bytes in length, so ~48 B) or even 64 + handle (~80 B)
would not be so nice.
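
For a rough sense of the sizes involved, here is a minimal Python sketch
that tags an opaque handle with HMAC-SHA-256 and prepends the tag.  The
16-byte handle, the key, and the helper names seal/unseal are illustrative
assumptions, not part of FUSE or virtiofsd.

```python
import hashlib
import hmac

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def seal(key, handle):
    """Return tag || handle, where tag authenticates the handle."""
    tag = hmac.new(key, handle, hashlib.sha256).digest()
    return tag + handle


def unseal(key, token):
    """Return the handle if the tag verifies, otherwise None."""
    tag, handle = token[:TAG_LEN], token[TAG_LEN:]
    expected = hmac.new(key, handle, hashlib.sha256).digest()
    # Constant-time comparison to avoid leaking how many bytes matched.
    return handle if hmac.compare_digest(tag, expected) else None
```

With a 16-byte handle the sealed value is 32 + 16 = 48 bytes, and a value
whose tag does not verify is rejected.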

> What worries me more is the variable nature of the field size, but I
> suppose there's no good way to get around that.

What worries me most is how to pass that object around to all FUSE
functions, and that they all need a new interface.

I just had a very fuzzy (and maybe stupid) idea: Maybe we could keep an
internal vector of currently active handles and then when variable-size
handles are enabled, fuse_ino_t would just act as an index into that vector?

I suppose we could then use the full-size handles in all messages and
just hand out temporary indices to existing functions (just so we don’t
have to change their interface).  Server and client have their own
vectors, because when they communicate, only the full handles have meaning.
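
A minimal sketch of the index idea, in Python for brevity: full-size
handles are what would go over the wire, while a local table hands out
small integers that existing interfaces could keep treating like
fuse_ino_t.  HandleTable and its methods are hypothetical names, not
libfuse APIs.

```python
class HandleTable:
    """Maps full-size handles (bytes) to small local indices and back."""

    def __init__(self):
        self._handles = []    # index -> handle
        self._index_of = {}   # handle -> index

    def intern(self, handle):
        """Return the local index for handle, allocating one if needed."""
        idx = self._index_of.get(handle)
        if idx is None:
            idx = len(self._handles)
            self._handles.append(handle)
            self._index_of[handle] = idx
        return idx

    def handle_for(self, idx):
        """Translate a local index back to the full-size handle."""
        return self._handles[idx]
```

Two independent tables may assign different indices to the same handle,
which is fine as long as only the handle itself is ever exchanged.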

Or we could implement the table on top of the current system by sharing
it between the client and server.  Whenever the server creates a
fuse_ino_t value, it then also creates a full-size handle, and returns
both the handle and its corresponding fuse_ino_t value to the client.
The client can use the fuse_ino_t normally most of the time, but with a
catch: The server would be able to invalidate it.  Then the client needs
to obtain a new fuse_ino_t value for the existing handle.
(Invalidating and reacquiring a fuse_ino_t value would be new FUSE
operations.)
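
The invalidate/reacquire cycle could look roughly like the Python sketch
below.  It assumes that the side owning the table may drop an index at any
time, and that the other side keeps the full-size handle so it can ask for
a fresh index.  IndexTable, StaleIndex, acquire, invalidate and resolve
are hypothetical names, not existing FUSE operations.

```python
import itertools


class StaleIndex(Exception):
    """Raised when an index is unknown or has been invalidated."""


class IndexTable:
    def __init__(self):
        # Indices are never reused, so a stale one cannot alias another handle.
        self._next = itertools.count(1)
        self._by_index = {}

    def acquire(self, handle):
        """Hand out a new short-lived index for a full-size handle."""
        idx = next(self._next)
        self._by_index[idx] = handle
        return idx

    def invalidate(self, idx):
        """Drop an index, e.g. when the table grows too large."""
        self._by_index.pop(idx, None)

    def resolve(self, idx):
        """Translate an index to its handle, or raise StaleIndex."""
        try:
            return self._by_index[idx]
        except KeyError:
            raise StaleIndex(idx) from None
```

When resolve() raises StaleIndex, the holder of the stale index would
present the full-size handle again via acquire() and retry with the new
index.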

I just got this idea, so maybe it’s silly.  But I suppose it could solve
our problems without being too invasive?

[...]

>> Side note:
>>
>> As for migrating a virtiofsd instance: Note that everything above that
>> depends on host file handles or host ino_t values will make it
>> impossible to migrate to a different filesystem.  But maybe doing that
>> would lead to all kinds of other problems anyway.
> 
> If the filesystem exported to guests is backed by a common server (e.g.
> NFS) then that should solve the ino_t and file handle persistence
> across migration.

Yes, but that wouldn’t be a different filesystem then.

Thanks a lot for taking the time to read and respond!

Max
