[Linux-cluster] Can a 2-node Cluster boot-up with only one active node?

Celso K. Webber celso at webbertek.com.br
Thu Oct 4 16:28:12 UTC 2007


On Thu, 04 Oct 2007 10:35:13 -0400, Lon Hohberger wrote:

 > > What is the correct behaviour? Shouldn't my Cluster come up because I
 > > have two votes active? In this case each node counts one vote in the
 > > cluster, and the quorum counts another one.

 > cman_tool status / cman_tool nodes output would be helpful

 > Also, which version of cman do you have?

 > -- Lon

Hi Lon,

Here is some relevant information from the Cluster:

** What is happening:
  If I boot node1 with node2 powered off, the boot stalls for 5 minutes
during the start of ccsd; after that the node regains quorum and qdiskd
starts successfully, but fenced keeps trying to start for 2 minutes and then
gives up with a "failed" message.

** Relevant log messages collected after boot:
Oct  4 11:51:13 hercules01 kernel: CMAN: Waiting to join or form a Linux-cluster
Oct  4 11:51:13 hercules01 ccsd[9144]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4
Oct  4 11:51:13 hercules01 ccsd[9144]: Initial status:: Inquorate
Oct  4 11:51:45 hercules01 kernel: CMAN: forming a new cluster
Oct  4 11:56:45 hercules01 cman: Timed-out waiting for cluster failed
       ^^^^^^^^
       5 minutes later
Oct  4 11:56:45 hercules01 lock_gulmd: no <gulm> section detected in /etc/cluster/cluster.conf succeeded
Oct  4 11:56:45 hercules01 qdiskd: Starting the Quorum Disk Daemon: succeeded
Oct  4 11:57:02 hercules01 kernel: CMAN: quorum regained, resuming activity
Oct  4 11:57:02 hercules01 ccsd[9144]: Cluster is quorate.  Allowing connections.
Oct  4 11:58:45 hercules01 fenced: startup failed
       ^^^^^^^^
       exactly 2 minutes after the qdiskd message above; I noticed that
fenced is started in the init scripts with "fence_tool -t 120 join -w"
(see the note after the log excerpt)
Oct  4 11:59:38 hercules01 rgmanager: clurgmgrd startup failed
       ^^^^^^^^
       after the other services boot up OK, rgmanager fails to start,
probably because fenced failed to start
Oct  4 11:56:45 hercules01 qdiskd[9292]: <info> Quorum Daemon Initializing
Oct  4 11:56:55 hercules01 qdiskd[9292]: <info> Initial score 1/1
Oct  4 11:56:55 hercules01 qdiskd[9292]: <info> Initialization complete
Oct  4 11:56:55 hercules01 qdiskd[9292]: <notice> Score sufficient for master operation (1/1; required=1); upgrading
Oct  4 11:57:01 hercules01 qdiskd[9292]: <info> Assuming master role
Oct  4 11:59:08 hercules01 clurgmgrd[10548]: <notice> Resource Group Manager Starting
Oct  4 11:59:08 hercules01 clurgmgrd[10548]: <info> Loading Service Data
Oct  4 11:59:08 hercules01 clurgmgrd[10548]: <info> Initializing Services
... <messages of stopping the services and making sure filesystems are
unmounted>
Oct  4 11:59:28 hercules01 clurgmgrd[10548]: <info> Services Initialized
--- no more cluster messages after this point ---
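
** Where the 2-minute fenced limit comes from (see the note in the log above):
  I found it by grepping the fenced init script for the fence_tool call;
this is just how I checked it, using the init script path as it exists on
our RHEL4 systems:

# grep -n fence_tool /etc/init.d/fenced

  On our nodes this shows the "fence_tool -t 120 join -w" invocation, which
matches the 2-minute gap between the qdiskd and fenced messages in the log
above.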

** Daemons status:
# service fenced status
fenced (pid 9304) is running...
# service rgmanager status
clurgmgrd (pid 10548 10547) is running...

** Clustat:
< delay of about 10 seconds >
Timed out waiting for a response from Resource Group Manager
Member Status: Quorate

Resource Group Manager not running; no service information available.

  Member Name                              Status
  ------ ----                              ------
  node1                                    Online, Local
  node2                                    Offline

** cman_tool nodes
Node  Votes Exp Sts  Name
   0    1    0   M   /dev/emcpowere1
   1    1    3   M   node1

** cman_tool status
Protocol version: 5.0.1
Config version: 12
Cluster name: clu_prosperdb
Cluster ID: 570
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 3
Total_votes: 2
Quorum: 2
Active subsystems: 2
Node name: node1
Node ID: 1
Node addresses: 192.168.50.1
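
** Quorum-related part of cluster.conf (for reference):
  I haven't pasted the whole cluster.conf here, but as far as votes and the
quorum disk are concerned it boils down to something like the following
(fence devices and the qdiskd heuristic/interval/tko details are trimmed;
the values shown match the cman_tool and qdiskd output above):

<cluster name="clu_prosperdb" config_version="12">
  <cman expected_votes="3"/>
  <quorumd device="/dev/emcpowere1" votes="1" min_score="1">
    <!-- a single heuristic with score 1, details trimmed -->
  </quorumd>
  <clusternodes>
    <clusternode name="node1" votes="1"/>
    <clusternode name="node2" votes="1"/>
  </clusternodes>
</cluster>

  So, as I understand the math, quorum = expected_votes/2 + 1 = 3/2 + 1 = 2
(integer division), and node1 (1 vote) plus the quorum disk (1 vote) give
Total_votes = 2, which is why CMAN reports "quorum regained" even with
node2 down.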

** Kernel version (uname -r): RHEL 4 U4 with the latest kernel approved by
EMC. The EMC eLab qualification was done for RHEL 4 U4, not RHEL 4.5, so we
can't upgrade the kernel unless we move everything to RHEL 4.5:
2.6.9-42.0.10.ELsmp

** Installed cluster package versions (same on both nodes):
ccs-1.0.10-0.x86_64.rpm
cman-1.0.17-0.x86_64.rpm
cman-kernel-smp-2.6.9-45.15.x86_64.rpm
dlm-1.0.3-1.x86_64.rpm
dlm-kernel-smp-2.6.9-44.9.x86_64.rpm
fence-1.32.45-1.0.2.x86_64.rpm
gulm-1.0.10-0.x86_64.rpm
iddev-2.0.0-4.x86_64.rpm
magma-1.0.7-1.x86_64.rpm
magma-plugins-1.0.12-0.x86_64.rpm
perl-Net-Telnet-3.03-3.noarch.rpm
rgmanager-1.9.68-1.x86_64.rpm
system-config-cluster-1.0.45-1.0.noarch.rpm

** What happens if I boot up the other node (node2):
- ccsd comes up after just a few seconds on node2
- all other cluster daemons start successfully
- fenced and rgmanager on node1 both start
- the logs show node1 starting services when node2 came up:
Oct  4 12:51:44 hercules01 clurgmgrd[10548]: <info> Logged in SG "usrm::manager"
Oct  4 12:51:44 hercules01 clurgmgrd[10548]: <info> Magma Event: Membership Change
Oct  4 12:51:44 hercules01 clurgmgrd[10548]: <info> State change: Local UP
... <messages about services starting and filesystems being mounted>
Oct  4 12:52:24 hercules01 clurgmgrd[10548]: <info> Magma Event: Membership Change
Oct  4 12:52:24 hercules01 clurgmgrd[10548]: <info> State change: node2 UP

The only packages not up to date are the kernel-related ones, which I
believe are the correct ones for my kernel version.

Please tell me if you see any mistake in this setup. The problem is that
the customer cannot boot the systems if one node happens to be dead. If
both nodes are up and one goes down, everything works correctly; but as
things stand, if the remaining node then reboots, the services cannot come
up.

Thank you very much.

Regards,

-- Celso

-- 
This message was checked by the antivirus system and
 is believed to be free of danger.



