[Linux-cluster] problems with clvmd

Terry td3201 at gmail.com
Wed Apr 20 17:22:49 UTC 2011


On Tue, Apr 19, 2011 at 8:32 AM, Terry <td3201 at gmail.com> wrote:
> On Tue, Apr 19, 2011 at 4:59 AM, Christine Caulfield
> <ccaulfie at redhat.com> wrote:
>> On 18/04/11 15:49, Terry wrote:
>>>
>>> On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield
>>> <ccaulfie at redhat.com>  wrote:
>>>>
>>>> On 18/04/11 15:11, Terry wrote:
>>>>>
>>>>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield
>>>>> <ccaulfie at redhat.com>    wrote:
>>>>>>
>>>>>> On 18/04/11 14:38, Terry wrote:
>>>>>>>
>>>>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield
>>>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>>>
>>>>>>>> On 17/04/11 21:52, Terry wrote:
>>>>>>>>>
>>>>>>>>> As a result of a strange situation where our licensing for storage
>>>>>>>>> dropped off, I need to join a CentOS 5.6 node to a now single-node
>>>>>>>>> cluster.  I got it joined to the cluster, but I am having issues with
>>>>>>>>> clvmd: any LVM operation on both boxes hangs, for example vgscan.
>>>>>>>>> I have increased debugging and I don't see any logs.  The VGs aren't
>>>>>>>>> being populated in /dev/mapper.  This WAS working right after I
>>>>>>>>> joined it to the cluster and now it's not, for some unknown reason.
>>>>>>>>> Not sure where to take this at this point.  I did find one weird
>>>>>>>>> startup message that I'm not sure what it means yet:
>>>>>>>>> [root at omadvnfs01a ~]# dmesg | grep dlm
>>>>>>>>> dlm: no local IP address has been set
>>>>>>>>> dlm: cannot start dlm lowcomms -107
>>>>>>>>> dlm: Using TCP for communications
>>>>>>>>> dlm: connecting to 2
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> That message usually means that dlm_controld has failed to start.  Try
>>>>>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D
>>>>>>>> switch and read the output, which might give some clues as to why it's
>>>>>>>> not working.
>>>>>>>>
>>>>>>>> Chrissie
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Chrissie,
>>>>>>>
>>>>>>> I thought of that, but I see dlm_controld running on both nodes.  See
>>>>>>> right below.
>>>>>>>
>>>>>>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm
>>>>>>>>> root      5476  0.0  0.0  24736   760 ?        Ss   15:34   0:00
>>>>>>>>> /sbin/dlm_controld
>>>>>>>>> root      5502  0.0  0.0      0     0 ?        S<          15:34
>>>>>>>>> 0:00
>>>>>>
>>>>>>
>>>>>> Well, that's encouraging in a way!  But it's evidently not started fully,
>>>>>> or the DLM itself would be working.  So I still recommend starting it
>>>>>> with -D to see how far it gets.
>>>>>>
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>> --
>>>>>>
>>>>>
>>>>> I think our posts crossed.  Here's my latest:
>>>>>
>>>>> OK, I started all the CMAN elements manually as you suggested, in the
>>>>> same order as the init script.  Here's the only error that I see.  I
>>>>> can post the other debug messages if you think they'd be useful, but
>>>>> this is the only one that stuck out to me.
>>>>>
>>>>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D
>>>>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>>>>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>>>>> 1303134840 set_ccs_options 480
>>>>> 1303134840 cman: node 2 added
>>>>> 1303134840 set_configfs_node 2 10.198.1.111 local 0
>>>>> 1303134840 cman: node 3 added
>>>>> 1303134840 set_configfs_node 3 10.198.1.110 local 1
>>>>>
>>>>
>>>> Can I see the whole set, please?  It looks like dlm_controld might be
>>>> stalled registering with groupd.
>>>>
>>>> Chrissie
>>>>
>>>> --
>>>
>>> Here you go.  Thank you very much for the help.  The output from each
>>> daemon I started is below.
>>>
>>> [root at omadvnfs01a log]# /sbin/ccsd -n
>>> Starting ccsd 2.0.115:
>>>  Built: Mar  6 2011 00:47:03
>>>  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
>>>   No Daemon:: SET
>>>
>>> cluster.conf (cluster name = omadvnfs01, version = 71) found.
>>> Remote copy of cluster.conf is from quorate node.
>>>  Local version # : 71
>>>  Remote version #: 71
>>> Remote copy of cluster.conf is from quorate node.
>>>  Local version # : 71
>>>  Remote version #: 71
>>> Remote copy of cluster.conf is from quorate node.
>>>  Local version # : 71
>>>  Remote version #: 71
>>> Remote copy of cluster.conf is from quorate node.
>>>  Local version # : 71
>>>  Remote version #: 71
>>> Initial status:: Quorate
>>>
>>> [root at omadvnfs01a ~]# /sbin/fenced -D
>>> 1303134822 cman: node 2 added
>>> 1303134822 cman: node 3 added
>>> 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc
>>> 1303134822 listen 4 member 5 groupd 7
>>> 1303134861 client 3: join default
>>> 1303134861 delay post_join 3s post_fail 0s
>>> 1303134861 added 2 nodes from ccs
>>> 1303134861 setid default 65537
>>> 1303134861 start default 1 members 2 3
>>> 1303134861 do_recovery stop 0 start 1 finish 0
>>> 1303134861 finish default 1
>>>
>>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D
>>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>>> 1303134840 set_ccs_options 480
>>> 1303134840 cman: node 2 added
>>> 1303134840 set_configfs_node 2 10.198.1.111 local 0
>>> 1303134840 cman: node 3 added
>>> 1303134840 set_configfs_node 3 10.198.1.110 local 1
>>>
>>>
>>> [root at omadvnfs01a ~]# /sbin/groupd -D
>>> 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1
>>> 1303134809 setup_cpg groupd_handle 6b8b456700000000
>>> 1303134809 groupd confchg total 2 left 0 joined 1
>>> 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1
>>> 1303134822 client connection 3
>>> 1303134822 got client 3 setup
>>> 1303134822 setup fence 0
>>> 1303134840 client connection 4
>>> 1303134840 got client 4 setup
>>> 1303134840 setup dlm 1
>>> 1303134853 client connection 5
>>> 1303134853 got client 5 setup
>>> 1303134853 setup gfs 2
>>> 1303134861 got client 3 join
>>> 1303134861 0:default got join
>>> 1303134861 0:default is cpg client 6 name 0_default handle
>>> 6633487300000001
>>> 1303134861 0:default cpg_join ok
>>> 1303134861 0:default waiting for first cpg event
>>> 1303134861 client connection 7
>>> 1303134861 0:default waiting for first cpg event
>>> 1303134861 got client 7 get_group
>>> 1303134861 0:default waiting for first cpg event
>>> 1303134861 0:default waiting for first cpg event
>>> 1303134861 0:default confchg left 0 joined 1 total 2
>>> 1303134861 0:default process_node_join 3
>>> 1303134861 0:default cpg add node 2 total 1
>>> 1303134861 0:default cpg add node 3 total 2
>>> 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1
>>> 1303134861 0:default queue join event for nodeid 3
>>> 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN
>>> 1303134861 0:default app node init: add 3 total 1
>>> 1303134861 0:default app node init: add 2 total 2
>>> 1303134861 0:default waiting for 1 more stopped messages before
>>> JOIN_ALL_STOPPED
>>>
>>
>> That looks like a service error.  Is fencing started and working?  Check
>> the output of cman_tool services or group_tool.
>>
>> Chrissie
>>
>> --
>
> Another thing I noticed is that the output of clustat looks good on the
> CentOS node, but the CentOS node appears offline to the RHEL node.
> Here's that clustat, as well as group_tool and cman_tool output from both
> nodes:
>
> centos:
> [root at omadvnfs01a ~]# clustat
> Cluster Status for omadvnfs01 @ Mon Apr 18 18:25:58 2011
> Member Status: Quorate
>
>  Member Name                                                     ID   Status
>  ------ ----                                                     ---- ------
>  omadvnfs01b.sec.jel.lc                                              2 Online, rgmanager
>  omadvnfs01a.sec.jel.lc                                              3 Online, Local, rgmanager
> ...
> [root at omadvnfs01a ~]# group_tool -v ls
> type             level name       id       state node id local_done
> fence            0     default    00010001 none
> [2 3]
> dlm              1     clvmd      00040002 none
> [2 3]
> dlm              1     rgmanager  00030002 none
> [2 3]
> [root at omadvnfs01a ~]# cman_tool status
> Version: 6.2.0
> Config Version: 72
> Cluster Name: omadvnfs01
> Cluster Id: 44973
> Cluster Member: Yes
> Cluster Generation: 1976
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 1
> Total votes: 2
> Quorum: 1
> Active subsystems: 9
> Flags: 2node Dirty
> Ports Bound: 0 11 177
> Node name: omadvnfs01a.sec.jel.lc
> Node ID: 3
> Multicast addresses: 239.192.175.93
> Node addresses: 10.198.1.110
>
>
> rhel:
> [root at omadvnfs01b ~]# clustat
> Cluster Status for omadvnfs01 @ Tue Apr 19 08:29:07 2011
> Member Status: Quorate
>
>  Member Name                                                     ID   Status
>  ------ ----                                                     ---- ------
>  omadvnfs01b.sec.jel.lc                                              2 Online, Local, rgmanager
>  omadvnfs01a.sec.jel.lc                                              3 Offline, rgmanager
> ...
> [root at omadvnfs01b ~]# group_tool -v ls
> type             level name        id       state node id local_done
> fence            0     default     00010001 none
> [2 3]
> dlm              1     gfs_data00  00020002 none
> [2]
> dlm              1     rgmanager   00030002 none
> [2 3]
> dlm              1     clvmd       00040002 none
> [2 3]
> gfs              2     gfs_data00  00010002 none
> [2]
> [root at omadvnfs01b ~]# cman_tool status
> Version: 6.2.0
> Config Version: 72
> Cluster Name: omadvnfs01
> Cluster Id: 44973
> Cluster Member: Yes
> Cluster Generation: 1976
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 1
> Total votes: 2
> Quorum: 1
> Active subsystems: 9
> Flags: 2node Dirty
> Ports Bound: 0 11 177
> Node name: omadvnfs01b.sec.jel.lc
> Node ID: 2
> Multicast addresses: 239.192.175.93
> Node addresses: 10.198.1.111
>
>
> Thanks!
>

I took the risk and rebooted the RHEL node.  It came back up without
issue.  I was then able to join the CentOS node to the cluster and do
pvscan/vgscan.  It picked up the cluster volumes.  Clearly clvmd/dlm
was in a broken state on the RHEL node, causing all of my issues.
Unfortunately we'll never know exactly what the deal was with it.
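
For the record, here is roughly what I would check next time before
resorting to a reboot.  This is only a sketch based on the commands
already used in this thread (RHEL/CentOS 5 era cluster tools); exact
flags and output will vary by version:

# do both nodes agree on membership, fencing and the clvmd lockspace?
cman_tool services
group_tool ls
clustat

# did dlm_controld actually populate configfs?  These are the paths it
# complained about in the -D output above.
ls /sys/kernel/config/dlm/cluster/comms
ls /sys/kernel/config/dlm/cluster/spaces

# run clvmd with debug logging instead of from the init script
# (clvmd has a -d debug switch, but its exact syntax differs between
# lvm2-cluster versions)
clvmd -d

# once the lockspace looks sane, rescan and reactivate the clustered VGs
pvscan
vgscan
vgchange -ay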



