[Linux-cluster] Cluster node without access to all resources-trouble

Lon Hohberger lhh at redhat.com
Thu Jun 28 18:39:44 UTC 2007


On Thu, Jun 28, 2007 at 07:54:05PM +0300, Janne Peltonen wrote:
> On Thu, Jun 28, 2007 at 12:29:04PM -0400, Robert Gil wrote:
> > I can't really help you there. In EL4, each of the services is separate,
> > so a node can be part of the cluster but doesn't need to share
> > resources such as a shared SAN disk. If you have a service set up so
> > that it requires that resource, then the node should be fenced.

RHEL5 is the same FWIW, or should be.
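
A restricted failover domain is the usual way to say "this node is a
cluster member, but never runs the SAN-dependent services".  A minimal
sketch (node and service names below are just placeholders; adapt to
your cluster.conf):

    <rm>
      <failoverdomains>
        <!-- restricted="1": services in this domain only ever run on
             the nodes listed here, i.e. the ones with SAN access -->
        <failoverdomain name="san-nodes" restricted="1" ordered="0">
          <failoverdomainnode name="node1" priority="1"/>
          <failoverdomainnode name="node2" priority="1"/>
          <failoverdomainnode name="node3" priority="1"/>
          <failoverdomainnode name="node4" priority="1"/>
        </failoverdomain>
      </failoverdomains>
      <!-- bind each SAN-dependent service to that domain -->
      <service name="some-san-service" domain="san-nodes" autostart="1">
        ...
      </service>
    </rm>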

> *when clurgmgrd starts, it wants to know the status of all the
> services, and to make sure, it stops all services locally
> (unmounts the filesystems, runs the scripts with "stop") - and asks the
> already-running cluster members for their idea of the status

Right.
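
(If it helps with debugging, you can run that same stop phase by hand
with rg_test to see exactly what the init-time stop does on a given
node.  Sketch only - "im" is just an example service name from this
thread, and rg_test acts outside rgmanager's own locking, so be careful
on nodes where the service may really be running:)

    # show the resource tree rgmanager builds from cluster.conf
    rg_test test /etc/cluster/cluster.conf

    # run the "stop" phase for one service, as clurgmgrd does at startup
    rg_test test /etc/cluster/cluster.conf stop service im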

> *when the clurgmgrd on the fifth node starts, it tries to stop the
> SAN-requiring services locally - and cannot match the /dev/<vg>/<lv>
> paths with real device nodes, so it ends up with inconsistent
> information about their status

This should not cause a problem.
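
(To see what that node can actually resolve, plain LVM commands on the
fifth node will show whether the volume group and the /dev/<vg>/<lv>
paths exist there at all, e.g.:)

    # on the node without SAN access
    vgs
    lvs -o vg_name,lv_name,lv_attr
    ls -l /dev/<vg>/<lv>     # the path the service's fs resource refers to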


> *if all the nodes with SAN access are restarted (while the fifth node is
> up), the nodes with SAN access first stop the services locally - and
> then, apparently, ask the fifth node about the service status. Result:
> a line like the following, for each service:
> 
> --cut--
> Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: <err> #34: Cannot get status for service service:im  
> --cut--

What do you mean here? (Sorry, being daft.)

Restart all nodes = "just rgmanager on all nodes", or "reboot all
nodes"?


> (what is weird, though, is that the fifth node knows the status of
> this particular service perfectly well, since it's running the service
> (service:im doesn't need SAN access) - perhaps there is some other
> reason not to believe the fifth node at this point. I can't imagine
> what it'd be, though.)

cman_tool services from each node could help here.
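
Something like this would collect it in one go (hostnames are just
placeholders):

    # group/service state as seen from every member
    for n in node1 node2 node3 node4 node5; do
        echo "=== $n ==="
        ssh $n 'cman_tool services; clustat'
    done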


> *after that, the nodes with SAN access do nothing about any services
> until after the fifth node has left the cluster and has been fenced. 

If you're rebooting the other 4 nodes, it sounds like the 5th is holding
some sort of lock which it shouldn't be holding across quorum
transitions (which would be a bug).

If this is the case, could you:

* install rgmanager-debuginfo
* get me a backtrace:

    gdb clurgmgrd `pidof clurgmgrd`
    thr a a bt
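
("thr a a bt" is gdb shorthand for "thread apply all bt".)  If it's
easier, roughly the same thing can be captured non-interactively:

    # one-shot capture of all thread backtraces to a file
    echo 'thread apply all bt' > /tmp/bt.cmd
    gdb -batch -x /tmp/bt.cmd clurgmgrd `pidof clurgmgrd` > clurgmgrd-bt.txt 2>&1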

-- Lon
-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



