[Cluster-devel] [GFS2 Patch] pass formal ino in do_filldir_main

Steven Whitehouse swhiteho at redhat.com
Mon Feb 26 17:16:08 UTC 2007


Hi,

On Mon, 2007-02-26 at 10:11 -0600, Jonathan E Brassow wrote:
> Please note, I haven't been following current GFS2 development...
> 
> On Feb 26, 2007, at 9:45 AM, Steven Whitehouse wrote:
> 
> > Hi,
> >
> > On Fri, 2007-02-23 at 22:56 -0500, Wendy Cheng wrote:
> >> Steve,
> >>
> >> I gave this issue some more thoughts - would like to suggest we take
> >> this patch (at least for now) since it aligns with current code base.
> >>
> >> The no_formal_ino is apparently intented to get returned back to user
> >> space due to its unique-ness (and we have to trust pick_formal_ino()
> >> does its job right). With normal readdir system call, after the inode
> >> number is sent to user space, there is no route (I've checked) for it 
> >> to
> >> come back to kernel. So the only user that would use these filldir ino
> >> inside the kernel is NFSD. NFSD calls gfs2_ilookup() that requires
> >> no_formal_ino.
> >>
> > no_formal_ino is not ever guaranteed to be unique which is why I want 
> > to
> > keep it away from userspace. I don't really want to have to change that
> > in order to fix NFS.
>
> Why is it not unique?  (from Ken's doc on GFS2:  "If the filesystem 
> allocated 1 billion inodes per second, it would take over 500 years for 
> roll-over.")
> 
Granted it will take a long time, but it does rather look a bit like
another "640k will be enough for anyone" type prediction. The main
problem here being that when it happens there will be no way to tell
that it has happened.

> > As you say there is no route for the inode number returned via readdir
> > to come back to the kernel, but it should match the inode number
> > returned by stat and I don't see that we should change either of them
> > just to get NFS to work when userspace is perfectly ok as it is.
> 
> There were alot of things done to make NFS work properly with GFS - 
> many of them through great pain, because GFS is a clustered FS.
> 
> >> If you look further... The current lookup code actually uses
> >> no_formal_ino, not no_addr. The two main "gate" routines that controls
> >> ino-to-inode conversion are:
> >>
> >> * gfs2_ilookup() (used by NFS route)
> >> * gfs2_inode_lookup() (used by VFS that calls gfs2_lookup()).
> >>
> > Neither of them need to use no_formal_ino at all. The lookup is using
> > no_addr as there is no other index which makes sense. I know that
> > no_formal_ino is used as part of the hash, but it shouldn't be really
> > and its trivial to change that. You can't swap to just using
> > no_formal_ino on its own as there is no mapping to find its on-disk
> > location from no_formal_ino.
> >
> >> Both use no_formal_ino - gfs2_inode_lookup() logic hides this behind 
> >> the
> >> little wrapper "gfs2_iget()". Since current VFS side's lookup has been
> >> working fine, this no_formal_ino idea apparently is working. So let's
> >> make NFSD side work the *same* way. I'm convinced this patch does a
> >> right thing.
> >>
> >> I don't dispute using generation number may not be a bad idea and may
> >> perform better. However, if we really look into the details, it is not
> >> easy for current code structure. Since we have something already
> >> working, it is not wise to mess this code layout at this moment.
> >>
> >> Make sense ?
> >>
> >> -- Wendy
> >>
> > I'm afraid that I'm still not convinced on that. It seems that ext3 
> > uses
> > the method that I'm proposing in that it looks up the inode by inode
> > number and then only afterwards checks the generation number. It also
> > has the advantage of separating the NFS interface from the rest of the
> > filesystem. That seems to me to be a lot of the problem that we are
> > looking at here in that the VFS's lookup function wants to look up by
> > no_addr and (currently) NFS is asking for something slightly different,
> >
> 
> How do you plan on handling inode migration when using NFS?  I believe 
> that was one of the central reasons for having no_formal_inode.  There 
> may be other reasons as well, as I am not familiar with the code.
> 
>   brassow
> 
The problem with no_formal_ino is that it doesn't allow inode migration
since there is no index in which its possible to lookup no_formal_ino
and find out the on-disk location of the inode. If we had such an index
then I'd be suggesting using no_formal_ino and not no_addr as our inode
number.

There are other issues as well. The glock number is determined by
no_addr (thats not something I've changed, btw) and in order to migrate
an inode it really needs to be kept under the same lock, otherwise all
kinds of odd things might happen.

So it seemed easier to use the no_addr as the inode's unique number and
try to keep everything consistent for now. This should then allow the
addition of some kind of indirection layer later on, to allow migration.

In the mean time no_formal_ino will still exist, its just that it won't
be used as the lookup key, so theres no on-disk format change to
consider here, all the same fields will continue to have the same
values, so there should be no problem to use it again if required at a
later date,

Steve.





More information about the Cluster-devel mailing list