[Linux-cluster] mount is hanging

micah nerren mnerren at paracel.com
Tue Oct 5 18:11:46 UTC 2004


Hiya,

On Fri, 2004-10-01 at 08:24, Adam Manthei wrote:
> On Thu, Sep 30, 2004 at 04:01:44PM -0700, micah nerren wrote:
> > Hi,
> > 
> > I have a SAN with 4 file systems on it, each GFS. These are mounted
> > across various servers running GFS, 3 of which are lock_gulm servers.
> > This is on RHEL WS 3 with GFS-6.0.0-7.1 on x86_64.
> 
> How many nodes?

In total, 4 servers mount the 4 file systems; 3 of them are lock_gulm
servers.

> > One of the file systems simply will not mount now. The other 3 mount and
> > unmount fine. They are all part of the same cca. I have my master lock
> > server running in heavy debug mode but none of the output from
> > lock_gulmd tells me anything about this one bad pool. How can I figure
> > out what is going on, any good debug or troubleshooting steps I should
> > do? I think if I just reboot everything it will settle down, but we
> > can't do that just yet, as the master lock server happens to be on a
> > production box right now.
> 
> 1) Are you certain that you have uniquely named all four filesystems?  You can
>    use gfs_tool to verify that there are no duplicate names.

Yes, they all have unique names; there are no duplicates.
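
For the archives, the sort of check I mean is along these lines (the
pool path below is just a placeholder for the real device):

    gfs_tool sb /dev/pool/pool_gfs01 table

Each of the four filesystems reports its own distinct cluster:fsname
lock table.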

> 2) Is there an expired node that is not fenced holding a lock on that 
>    filesystem?  gulm_tool will help there.

No expired node. gulm_tool tells me everything is perfectly fine, hence
the ability of all the nodes to mount the other 3 file systems. I have
tried manually fencing and unfencing two of the systems, to no avail.
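
For anyone hitting this in the archives later, the node-state check was
along the lines of the following, with the master lock server's
hostname substituted in; it lists each node and whether gulm considers
it logged in or expired:

    gulm_tool nodelist lockmaster.example.com

None of the nodes showed up as expired.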

> 3) Did you ever have all 4 filesystems mounted at the same time on the same
>    node?  i.e.  did it "all of a sudden" stop working or was it always 
>    failing?

Yes, it's been running fine for several weeks. It "suddenly" freaked out.
It is possible the customer did something I am unaware of, but I don't
know what they could have done to cause this.

> > Also, is there a way to migrate a master lock server to a slave lock
> > server? In other words, can I force the master to become a slave and a
> > slave to become the new master?
> 
> Restarting lock_gulmd on the master will cause one of the slaves to pick up
> as master and the master to come back up as a slave.  Note that this only
> works when you have a dedicated gulm server.  If you have an embedded master
> server (a gulm server also mounting GFS) bad things will happen when the
> server restarts.

Ugh, that's what I really need to avoid. I do not have dedicated gulm
servers; the master is on a machine that is also mounting the file
systems and is in heavy production use.
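
As a side note, which box currently holds the master role can be
confirmed with something like the following (hostname is a placeholder,
and if I remember right the role shows up in the stats output as
Master, Slave or Client):

    gulm_tool getstats lockserver1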

I am quite certain from past experience that just rebooting all 4
servers will fix this up, but I can't do that.

What I am going to try right now is blowing away the one pool that is
acting up, rebuilding it and seeing if that works. Luckily this one pool
is non-critical and is backed up, so I can just nuke it.
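
Roughly speaking, the rebuild I have in mind looks like the following
on one node; the pool name, cluster name and journal count are
placeholders for whatever is really in the pool config and the cca:

    pool_tool -c pool_gfs04.cfg       # re-create the pool from its config file
    pool_assemble -a pool_gfs04       # activate the pool on this node
    gfs_mkfs -p lock_gulm -t alpha_cluster:pool_gfs04 -j 4 /dev/pool/pool_gfs04

Then remount it everywhere as usual.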

Thanks,

Micah



