[Linux-cluster] Cluster node without access to all resources-trouble

Janne Peltonen janne.peltonen at helsinki.fi
Thu Jun 28 16:54:05 UTC 2007


On Thu, Jun 28, 2007 at 12:29:04PM -0400, Robert Gil wrote:
> I can't really help you there. In EL4, each of the services is separate.
> So a node can be part of the cluster but doesn't need to share
> resources such as a shared SAN disk. If you have a service set up so
> that it requires that resource, then the node should be fenced. 

Yep.

The situation seems to be this (someone who really knows about the inner
workings of the resource group manager, please correct me):

*when clurgmgrd starts, it wants to know the status of all the
services, and to make sure, it stops all services locally (unmounts
the filesystems, runs the scripts with "stop") - and asks the
already-running cluster members for their idea of the status

*when the clurgmgrd on the fifth node starts, it tries to stop the
SAN-requiring services locally - and cannot match the /dev/<vg>/<lv>
paths with real devices, so it ends up with incoherent information
about their status

*if all the nodes with SAN access are restarted (while the fifth node is
up), the nodes with SAN access first stop the services locally - and
then, apparently, ask the fifth node about the service status. Result:
a line like the following, for each service:

--cut--
Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: <err> #34: Cannot get status for service service:im  
--cut--

(what is weird, though, is that the fifth node knows the status of this
particular service perfectly well, since it's running the service
(service:im doesn't need SAN access) - perhaps there is some other
reason not to believe the fifth node at this point. I can't imagine
what it'd be, though.)

*after that, the nodes with SAN access do nothing about any services
until the fifth node has left the cluster and has been fenced. So,
apparently, the other nodes conclude that the fifth node is 'bad' and
could be interfering with their SAN-access-requiring services. When the
fifth node has been fenced, the other nodes start the services, and the
fifth node can rejoin the cluster and start the services that should be
running there...
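
(For completeness: the way I've restricted the services to the four
SAN nodes is a restricted failover domain in cluster.conf. A minimal
sketch - node, domain and service names below are made up, adjust to
your own configuration:

--cut--
<rm>
  <failoverdomains>
    <!-- restricted="1": services bound to this domain may only run on
         the listed nodes; the fifth node is deliberately left out -->
    <failoverdomain name="san-nodes" restricted="1" ordered="0">
      <failoverdomainnode name="blade1" priority="1"/>
      <failoverdomainnode name="blade2" priority="1"/>
      <failoverdomainnode name="blade3" priority="1"/>
      <failoverdomainnode name="blade4" priority="1"/>
    </failoverdomain>
  </failoverdomains>
  <!-- a SAN-requiring service references the domain -->
  <service name="some-san-service" domain="san-nodes" autostart="1">
    <!-- fs/ip/script resources omitted -->
  </service>
</rm>
--cut--

Note that restricted="1" apparently only controls where a service may
run - it doesn't seem to stop rgmanager from asking every cluster
member for its view of the service status, which looks like exactly
what bites here.)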

> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Janne Peltonen
> > Sent: Thursday, June 28, 2007 11:46 AM
> > To: linux-cluster at redhat.com
> > Subject: [Linux-cluster] Cluster node without access to all resources 
> > -trouble
> > 
> > Hi.
> > 
> > I'm running a five node cluster. Four of the nodes run services that 
> > need access to a SAN, but the fifth doesn't. (The fifth node belongs 
> > to the cluster to avoid a cluster with an even number of nodes.
> > Additionally, the fifth node is a stand-alone rack server, while the 
> > four other nodes are blade servers, two of them in two different 
> > blade racks - this way, even if either of the blade racks goes down, 
> > I won't lose the cluster.) This seems to create all sorts of trouble. For 
> > example, if I try to manipulate clvm'd filesystems on the other four 
> > nodes, they refuse to commit changes if the fifth node is up. And even
> > if I've restricted the SAN-access-needing services to run only on the 
> > four nodes that have the access, the cluster system tries to shut the 
> > services down on the fifth node as well (when quorum is lost, for 
> > example)
> > - and complains about being unable to stop them and, on the nodes that
> > should run the services, refuses to restart them until I've removed 
> > the fifth node from the cluster and fenced it. (Or, rather, I've 
> > removed the fifth node from the cluster and one of the other nodes has
> > successfully fenced it.)
> > 
> > So.
> > 
> > Is it really necessary that all the members in a cluster have access 
> > to all the resources that any of the members have, even if the 
> > services in the cluster are partitioned to run in only a part of the 
> > cluster? Or is there a way to tell the cluster that it shouldn't care 
> > about the fifth member's opinion on certain services; that is, that it 
> > doesn't need to check whether the services are running on it, because 
> > they never do? Or should I just make sure that the fifth member always 
> > comes up last (that is, won't be running while the others are coming 
> > up)? Or should I accept that I'm going to do more harm than good by 
> > letting the fifth node belong to the cluster, and just run it outside 
> > the cluster?
> > 
> > Sorry if this was incoherent. I'm a bit tired; this system should be 
> > in production in two weeks, and unexpected problems (that didn't come 
> > up during testing) keep coming up... Any suggestions would be greatly 
> > appreciated.
> > 
> > 
> > --Janne
> > --
> > Janne Peltonen <janne.peltonen at helsinki.fi>
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> --
> Janne Peltonen <janne.peltonen at helsinki.fi>
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



