[Linux-cluster] How to resurrect cluster? (clvmd startup timed out)
Székelyi Szabolcs
szekelyi at niif.hu
Thu Apr 21 15:30:58 UTC 2011
Hi all,
I have a very simple two-node cluster, but every time I restart a node, the
cluster falls apart and clvmd doesn't start. I get this error message:
$ sudo /etc/init.d/clvm restart
Deactivating VG ::.
Stopping Cluster LVM Daemon: clvm.
Starting Cluster LVM Daemon: clvmclvmd startup timed out
And life stops here: I don't get the prompt back, and even SIGINT doesn't work,
though I can put the script into the background.
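While it hangs, two sanity checks come to mind (a sketch; exact option syntax
varies by version):

$ grep locking_type /etc/lvm/lvm.conf   # expect locking_type = 3 (clustered locking via clvmd)
$ sudo clvmd -d1                        # run clvmd by hand with debug output; check 'clvmd -h' for the -d syntax on your build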
All this after a fresh restart of cman:
$ sudo /etc/init.d/cman restart
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Starting cluster:
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Starting gfs_controld... [ OK ]
Unfencing self... [ OK ]
Joining fence domain... [ OK ]
Everything looks okay here; however, the return status of the init script is 1.
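The non-zero status is easy to see with a plain shell check:

$ sudo /etc/init.d/cman restart; echo "exit status: $?"
exit status: 1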
Do you have any idea what the problem could be?
Last lines of syslog:
Apr 21 17:15:17 iscsigw2 corosync[1828]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 21 17:15:17 iscsigw2 corosync[1828]: [CMAN ] quorum regained, resuming activity
Apr 21 17:15:17 iscsigw2 corosync[1828]: [QUORUM] This node is within the primary component and will provide service.
Apr 21 17:15:17 iscsigw2 corosync[1828]: [QUORUM] Members[2]: 1 2
Apr 21 17:15:17 iscsigw2 corosync[1828]: [QUORUM] Members[2]: 1 2
Apr 21 17:15:17 iscsigw2 corosync[1828]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 21 17:15:19 iscsigw2 fenced[1880]: fenced 3.0.12 started
Apr 21 17:15:19 iscsigw2 dlm_controld[1905]: dlm_controld 3.0.12 started
Apr 21 17:15:20 iscsigw2 gfs_controld[1950]: gfs_controld 3.0.12 started
Apr 21 17:15:35 iscsigw2 kernel: [ 52.774694] dlm: Using TCP for communications
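Fencing claims to have started cleanly; fence_tool, shipped with fenced,
should confirm the fence domain state (expect "wait state none" and both
members listed, as in the cman_tool services output below):

$ sudo fence_tool ls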
Additional info, gathered while the clvm init script is backgrounded:
$ sudo cman_tool status
Version: 6.2.0
Config Version: 6
Cluster Name: iscsigw
Cluster Id: 13649
Cluster Member: Yes
Cluster Generation: 288
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: iscsigw2
Node ID: 2
Multicast addresses: 239.192.53.134
Node addresses: 10.0.0.2
$ sudo cman_tool services
fence domain
member count 2
victim count 0
victim now 0
master nodeid 1
wait state none
members 1 2
dlm lockspaces
name clvmd
id 0x4104eefa
flags 0x00000015 need_plock,kern_stop,join
change member 0 joined 0 remove 0 failed 0 seq 0,0
members
new change member 2 joined 1 remove 0 failed 0 seq 1,1
new status wait_messages 1 wait_condition 0
new members 1 2
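The flags line above (need_plock,kern_stop,join, with wait_messages 1)
suggests the clvmd lockspace is stuck mid-join, waiting for messages from the
other node. Assuming dlm_tool from the same 3.0.12 suite is installed, this
might show more:

$ sudo dlm_tool ls             # lockspace state as the kernel sees it
$ sudo dlm_tool dump | tail    # dlm_controld's debug buffer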
The only way I found to get out of this situation is to reboot a node. The
shutdown process stops when it tries to deactivate the VGs; from there on,
only a hard reset helps.
How could I stabilize this cluster so that I can reboot a node without
worrying about whether the cluster suite will start up correctly?
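One detail that might matter: cman_tool status above shows "Expected votes: 2"
and an empty Flags: line, i.e. two_node mode is not enabled, so losing either
node loses quorum. The usual stanza for a two-node cluster in
/etc/cluster/cluster.conf would be something like (a sketch, not my actual
file):

<cman two_node="1" expected_votes="1"/>

Could the missing two_node flag explain why a single reboot takes the whole
cluster down?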
Thanks,
--
cc