[Linux-cluster] Freeze with cluster-2.03.11
Kadlecsik Jozsef
kadlec at mail.kfki.hu
Fri Apr 3 15:12:11 UTC 2009
On Fri, 3 Apr 2009, Wendy Cheng wrote:
> Kadlecsik Jozsef wrote:
> > On Thu, 2 Apr 2009, Wendy Cheng wrote:
> > > > > Kadlecsik Jozsef wrote:
> > > > >
> > > > > > - commit 82d176ba485f2ef049fd303b9e41868667cebbdb
> > > > > > gfs_drop_inode as .drop_inode replacing .put_inode.
> > > > > > .put_inode was called without holding a lock, but .drop_inode
> > > > > > is called under inode_lock held. Might it be a problem
> > > > > >
> > > Based on code reading ...
> > > 1. iput() gets inode_lock (a spin lock)
> > > 2. iput() calls iput_final()
> > > 3. iput_final() calls filesystem drop_inode(), followed by
> > > generic_drop_inode()
> > > 4. generic_drop_inode() unlock inode_lock after doing all sorts of fun
> > > things
> > > with the inode
> > >
> > > So look to me that generic_drop_inode() statement within gfs_drop_inode()
> > > should be removed. Otherwise you would get double unlock and double list
> > > free.
> >
> > I think those function calls are right: iput_final calls either the
> > filesystem drop_inode function (in this case gfs_drop_inode) or
> > generic_drop_inode. There's no double call of generic_drop_inode. However
> > gfs_sync_page_i (and in turn filemap_fdatawrite and filemap_fdatawait) is
> > now called under inode_lock held and that was not so in previous versions.
> > But I'm just speculating.
>
> It *is* called twice unless my eyes deceive me
>
> static inline void iput_final(struct inode *inode)
> {
> const struct super_operations *op = inode->i_sb->s_op;
> void (*drop)(struct inode *) = generic_drop_inode;
>
> if (op && op->drop_inode)
> drop = op->drop_inode; /* gfs call generic_drop_inode() */
> drop(inode); /* second call into generic_drop_inode() again. */
> }
No, the line 'drop = op->drop_inode;' is just an assignment (there's no
function argument): if there's a filesystem-specific drop_inode
function, that is assigned to 'drop', overwriting thus the originally
initialized 'generic_drop_inode' value of the 'drop' variable as function
pointer. And the last line is only the function call, with the proper
argument.
> > I won't get a chance to start a test before Monday, sorry.
>
> I'll be traveling next week as well. However, a few cautious words here:
>
> Even this "fix" eventually solves your hang, running GFS on newer
> kernels with production system simply is *not* a good idea.
That might be so, but this is a catch: at least we must test GFS2 before
migrating from GFS1 to GFS2. That requires a recent kernel, with working
GFS1 and GFS2 support. A migration of an in-production system cannot be
started lightheartedly, and transforming GFS1 to GFS2 wont' be easy: that
needs downtime, fresh backups created after bringing down the systems,
backup verification, converting/creating the GFS2 volumes, restoring the
data. And crossing fingers that it must not be undone if something goes
wrong. It's not the same as just replacing an older kernel and an older
package.
Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
More information about the Linux-cluster
mailing list