[Linux-cluster] the cluster don't restart (clvmd)

Mon Sep 1 15:14:15 UTC 2008

Gian Paolo Buono wrote:
> Hi,
> I have a two-node cluster configuration. This is my cluster.conf:
> 
> ####################cluster.conf####################
> <?xml version="1.0"?>
> <cluster alias="yoda-cl" config_version="3" name="yoda-cl">
>         <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="yoda2.cs.tin.it" nodeid="1" votes="1">
>                         <fence/>
>                 </clusternode>
>                 <clusternode name="yoda1.cs.tin.it" nodeid="2" votes="1">
>                         <fence/>
>                 </clusternode>
>         </clusternodes>
>         <cman expected_votes="1" two_node="1"/>
>         <rm>
>                 <failoverdomains/>
>                 <resources/>
>         </rm>
>         <fencedevices/>
> </cluster>
> ####################cluster.conf####################
> 
> I tried to restart the cluster without rebooting because the clustat
> command on node 2 doesn't work, but there is a problem with the fence
> device. These are the messages:
> 
> [root@yoda1 cluster]# /etc/init.d/cman start
> Starting cluster:
>    Enabling workaround for Xend bridged networking... done
>    Loading modules... done
>    Mounting configfs... done
>    Starting ccsd... done
>    Starting cman... done
>    Starting daemons... done
>    Starting fencing... failed
> 
> The log follows:
> Sep  1 11:27:47 yoda1 groupd[8162]: found uncontrolled kernel object
> clvmd in /sys/kernel/dlm
> Sep  1 11:27:47 yoda1 groupd[8162]: local node must be reset to clear 1
> uncontrolled instances of gfs and/or dlm
> Sep  1 11:27:47 yoda1 openais[8154]: [CMAN ] cman killed by node 2
> because we were killed by cman_tool or other application
> Sep  1 11:27:47 yoda1 fence_node[8163]: Fence of "yoda1.cs.tin.it"
> was unsuccessful
> Sep  1 11:27:47 yoda1 fenced[8169]: cman_init error (nil) 111
> Sep  1 11:27:47 yoda1 gfs_controld[8181]: cman_init error 111
> Sep  1 11:27:57 yoda1 dlm_controld[8175]: group_init error (nil) 111
>

It's all failing to start because the cluster software wasn't shut down
properly in the first place. ALL the daemons must be shut down and all
GFS filesystems unmounted. Only then can you restart the cluster software.
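
As a rough sketch of what a clean shutdown on one node might look like,
assuming the stock init scripts are installed under /etc/init.d (the
rgmanager and clvmd script names and the helper functions run and
stop_cluster are assumptions, not something taken from the poster's
system). It defaults to a dry run that only prints the commands:

```shell
#!/bin/sh
# Sketch of a clean shutdown order for one cluster node.
# Assumption: init scripts named rgmanager, clvmd and cman exist.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

stop_cluster() {
    run /etc/init.d/rgmanager stop || return 1   # resource manager first, if used
    run umount -a -t gfs           || return 1   # unmount every GFS filesystem
    run /etc/init.d/clvmd stop     || return 1   # lets clvmd release its lockspace
    run /etc/init.d/cman stop      || return 1   # finally leave the cluster
}

stop_cluster
```

Run it with DRY_RUN=0 only once the printed order matches what is
actually installed on the node.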

Looking at the messages I would guess that either clvmd was killed with
-9 (there is a stray clvmd lockspace in existence) or the cluster was
shut down with "cman_tool leave force". Or maybe the daemons were killed
by hand.
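
To check for a stray lockspace before trying to start cman again, one
option is to look at the directory groupd complains about. A minimal
sketch, assuming each lockspace shows up as a subdirectory of
/sys/kernel/dlm (list_lockspaces and DLM_DIR are names made up for this
example):

```shell
#!/bin/sh
# List DLM lockspaces the kernel still holds.
# DLM_DIR can be overridden, e.g. to try the function on a test directory.
DLM_DIR=${DLM_DIR:-/sys/kernel/dlm}

list_lockspaces() {
    dir=$1
    [ -d "$dir" ] || return 0      # directory absent: nothing to list
    for entry in "$dir"/*; do
        if [ -d "$entry" ]; then
            basename "$entry"      # e.g. "clvmd" for the stray lockspace
        fi
    done
}

list_lockspaces "$DLM_DIR"
```

If this prints clvmd while no clvmd process is running, the node is in
the state groupd describes in the log.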

In that event it's often easier to reboot ...

Chrissie