[Linux-cluster] cLVM unusable on quorated cluster

Digimer lists at alteeve.ca
Fri Oct 3 14:38:14 UTC 2014


On 03/10/14 10:35 AM, Daniel Dehennin wrote:
> Hello,
>
> I'm trying to set up pacemaker+corosync on Debian Wheezy to access a SAN
> for an OpenNebula cluster.
>
> As I'm new to the cluster world, I have a hard time figuring out why things
> sometimes go really wrong, and where I should look for answers.
>
> My OpenNebula frontend, which runs in a VM, fails to start its resources,
> and my syslog is full of:
>
> #+begin_src
> ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> #+end_src
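>
> For reference, a rough way to check whether the whole stack is up on that
> node would be something like this (only a sketch; the daemon names are what
> I assume the Wheezy packages use):
>
> #+begin_src
> # are corosync and the controld daemons actually running?
> pgrep -l 'corosync|dlm_controld|ocfs2_controld|clvmd'
> # is the corosync ring healthy on this node?
> corosync-cfgtool -s
> # dlm_controld's view of its lockspaces and its recent debug output
> dlm_tool ls
> dlm_tool dump | tail -n 50
> #+end_src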
>
> When this happens, the other nodes have problems too:
>
> #+begin_src
> root at nebula3:~# LANG=C vgscan
>    cluster request failed: Host is down
>    Unable to obtain global lock.
> #+end_src
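>
> When that happens I also double-check that LVM is really configured for
> cluster locking and that clvmd is still alive, roughly like this (a sketch,
> assuming the default Wheezy paths):
>
> #+begin_src
> # clustered locking means locking_type = 3 in lvm.conf
> grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf
> # is clvmd running at all?
> pgrep -l clvmd
> # ask every running clvmd to re-read its device cache and configuration
> clvmd -R
> #+end_src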
>
> But things look fine in “crm_mon”:
>
> #+begin_src
> root at nebula3:~# crm_mon -1
> ============
> Last updated: Fri Oct  3 16:25:43 2014
> Last change: Fri Oct  3 14:51:59 2014 via cibadmin on nebula1
> Stack: openais
> Current DC: nebula3 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 5 Nodes configured, 5 expected votes
> 32 Resources configured.
> ============
>
> Node quorum: standby
> Online: [ nebula3 nebula2 nebula1 ]
> OFFLINE: [ one ]
>
>   Stonith-nebula3-IPMILAN    (stonith:external/ipmi):    Started nebula2
>   Stonith-nebula2-IPMILAN    (stonith:external/ipmi):    Started nebula3
>   Stonith-nebula1-IPMILAN    (stonith:external/ipmi):    Started nebula2
>   Clone Set: ONE-Storage-Clone [ONE-Storage]
>       Started: [ nebula1 nebula3 nebula2 ]
>       Stopped: [ ONE-Storage:3 ONE-Storage:4 ]
>   Quorum-Node    (ocf::heartbeat:VirtualDomain): Started nebula3
>   Stonith-Quorum-Node   (stonith:external/libvirt):   Started nebula3
> #+end_src
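>
> Since crm_mon claims quorum while clvmd says a host is down, it may also be
> worth comparing pacemaker's membership with corosync's own view, roughly
> like this (a sketch; the objctl output layout is from corosync 1.x, so it
> may differ):
>
> #+begin_src
> # nodes as pacemaker sees them
> crm_node -l
> # ring status and members as corosync sees them
> corosync-cfgtool -s
> corosync-objctl | grep -i member
> #+end_src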
>
> I don't know how to interpret the dlm_tool output:
>
> #+begin_src
> root at nebula3:~# dlm_tool ls -n
> dlm lockspaces
> name          CCB10CE8D4FF489B9A2ECB288DACF2D7
> id            0x09250e49
> flags         0x00000008 fs_reg
> change        member 3 joined 1 remove 0 failed 0 seq 2,2
> members       1189587136 1206364352 1223141568
> all nodes
> nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
> nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
> nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
>
> name          clvmd
> id            0x4104eefa
> flags         0x00000000
> change        member 3 joined 0 remove 1 failed 0 seq 4,4
> members       1189587136 1206364352 1223141568
> all nodes
> nodeid 1172809920 member 0 failed 0 start 0 seq_add 3 seq_rem 4 check none
> nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
> nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
> nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
> #+end_src
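>
> To make the member lists above readable: assuming the nodeids are just the
> ring0 IPv4 addresses in host byte order (the corosync default when no
> explicit nodeid is configured), they can be decoded with something like:
>
> #+begin_src
> # sketch: turn dlm/corosync nodeids back into dotted-quad IPv4 addresses
> for nodeid in 1172809920 1189587136 1206364352 1223141568; do
>     printf '%s -> %d.%d.%d.%d\n' "$nodeid" \
>         $((  nodeid        & 255 )) $(( (nodeid >>  8) & 255 )) \
>         $(( (nodeid >> 16) & 255 )) $(( (nodeid >> 24) & 255 ))
> done
> #+end_src
>
> If that assumption holds, each nodeid should map back to one of the nodes'
> ring0 addresses.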
>
> Is there any documentation on troubleshooting DLM/cLVM?
>
> Regards.

Can you paste your full pacemaker config and the logs from the other 
nodes starting just before the lost node went away?
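
Something along these lines should capture what I'm after (just a sketch; 
adjust the timestamp and output paths for your setup):

#+begin_src
# full pacemaker configuration
cibadmin -Q > cib.xml            # raw CIB XML
crm configure show > cib.crm     # crm shell syntax, if crmsh is installed
# logs and config from the cluster nodes around the time of the failure
crm_report -f "2014-10-03 14:00:00" /tmp/cluster-report
#+end_src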

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?



