[Linux-cluster] How to resurrect cluster? (clvmd startup timed out)
Székelyi Szabolcs
szekelyi at niif.hu
Thu Apr 21 15:30:58 UTC 2011
Hi all,
I have a very simple two-node cluster, but every time I restart a node, the
cluster falls apart and clvmd doesn't start. I get this error message:
$ sudo /etc/init.d/clvm restart
Deactivating VG ::.
Stopping Cluster LVM Daemon: clvm.
Starting Cluster LVM Daemon: clvmclvmd startup timed out
And life stops here: I don't get the prompt back, and even SIGINT doesn't work,
though I can put the script into the background.
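While it hangs, two sanity checks come to mind (a sketch; exact option syntax
varies by version):

$ grep locking_type /etc/lvm/lvm.conf   # expect locking_type = 3 (clustered locking via clvmd)
$ sudo clvmd -d1                        # run clvmd by hand with debug output; check 'clvmd -h' for the -d syntax on your build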
All this after a fresh restart of cman:
$ sudo /etc/init.d/cman restart
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Starting cluster:
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Starting gfs_controld... [ OK ]
Unfencing self... [ OK ]
Joining fence domain... [ OK ]
Everything looks okay here; however, the return status of the init script is 1.
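The non-zero status is easy to see with a plain shell check:

$ sudo /etc/init.d/cman restart; echo "exit status: $?"
exit status: 1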
Do you have any idea what the problem could be?
Last lines of syslog:
Apr 21 17:15:17 iscsigw2 corosync[1828]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 21 17:15:17 iscsigw2 corosync[1828]: [CMAN ] quorum regained, resuming activity
Apr 21 17:15:17 iscsigw2 corosync[1828]: [QUORUM] This node is within the primary component and will provide service.
Apr 21 17:15:17 iscsigw2 corosync[1828]: [QUORUM] Members[2]: 1 2
Apr 21 17:15:17 iscsigw2 corosync[1828]: [QUORUM] Members[2]: 1 2
Apr 21 17:15:17 iscsigw2 corosync[1828]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 21 17:15:19 iscsigw2 fenced[1880]: fenced 3.0.12 started
Apr 21 17:15:19 iscsigw2 dlm_controld[1905]: dlm_controld 3.0.12 started
Apr 21 17:15:20 iscsigw2 gfs_controld[1950]: gfs_controld 3.0.12 started
Apr 21 17:15:35 iscsigw2 kernel: [ 52.774694] dlm: Using TCP for communications
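Fencing claims to have started cleanly; fence_tool, shipped with fenced,
should confirm the fence domain state (expect "wait state none" and both
members listed, as in the cman_tool services output below):

$ sudo fence_tool ls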
Additional info, gathered while the clvm init script is backgrounded:
$ sudo cman_tool status
Version: 6.2.0
Config Version: 6
Cluster Name: iscsigw
Cluster Id: 13649
Cluster Member: Yes
Cluster Generation: 288
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: iscsigw2
Node ID: 2
Multicast addresses: 239.192.53.134
Node addresses: 10.0.0.2
$ sudo cman_tool services
fence domain
member count 2
victim count 0
victim now 0
master nodeid 1
wait state none
members 1 2
dlm lockspaces
name clvmd
id 0x4104eefa
flags 0x00000015 need_plock,kern_stop,join
change member 0 joined 0 remove 0 failed 0 seq 0,0
members
new change member 2 joined 1 remove 0 failed 0 seq 1,1
new status wait_messages 1 wait_condition 0
new members 1 2
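The flags line above (need_plock,kern_stop,join, with wait_messages 1)
suggests the clvmd lockspace is stuck mid-join, waiting for messages from the
other node. Assuming dlm_tool from the same 3.0.12 suite is installed, this
might show more:

$ sudo dlm_tool ls             # lockspace state as the kernel sees it
$ sudo dlm_tool dump | tail    # dlm_controld's debug buffer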
The only way I found to get out of this situation is to reboot a node. The
shutdown process stops when it tries to deactivate the VGs; from there on,
only a hard reset helps.
How could I stabilize this cluster so that I can reboot a node without
worrying about whether the cluster suite will start up correctly?
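One detail that might matter: cman_tool status above shows "Expected votes: 2"
and an empty Flags: line, i.e. two_node mode is not enabled, so losing either
node loses quorum. The usual stanza for a two-node cluster in
/etc/cluster/cluster.conf would be something like (a sketch, not my actual
file):

<cman two_node="1" expected_votes="1"/>

Could the missing two_node flag explain why a single reboot takes the whole
cluster down?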
Thanks,
--
cc