[Linux-cluster] problems with clvmd

Terry td3201 at gmail.com
Tue Apr 19 13:32:05 UTC 2011


On Tue, Apr 19, 2011 at 4:59 AM, Christine Caulfield
<ccaulfie at redhat.com> wrote:
> On 18/04/11 15:49, Terry wrote:
>>
>> On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield
>> <ccaulfie at redhat.com>  wrote:
>>>
>>> On 18/04/11 15:11, Terry wrote:
>>>>
>>>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com>    wrote:
>>>>>
>>>>> On 18/04/11 14:38, Terry wrote:
>>>>>>
>>>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield
>>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>>
>>>>>>> On 17/04/11 21:52, Terry wrote:
>>>>>>>>
>>>>>>>> As a result of a strange situation where our licensing for storage
>>>>>>>> dropped off, I need to join a CentOS 5.6 node to a now single-node
>>>>>>>> cluster.  I got it joined to the cluster, but I am having issues with
>>>>>>>> clvmd.  Any LVM operation on either box hangs; vgscan, for example.
>>>>>>>> I have increased debugging, but I don't see any logs.  The VGs aren't
>>>>>>>> being populated in /dev/mapper.  This WAS working right after I
>>>>>>>> joined it to the cluster and now it's not, for some unknown reason.
>>>>>>>> Not sure where to take this at this point.  I did find one odd
>>>>>>>> startup message that I don't understand yet:
>>>>>>>> [root@omadvnfs01a ~]# dmesg | grep dlm
>>>>>>>> dlm: no local IP address has been set
>>>>>>>> dlm: cannot start dlm lowcomms -107
>>>>>>>> dlm: Using TCP for communications
>>>>>>>> dlm: connecting to 2
>>>>>>>>
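For context, clvmd on this stack only works when LVM is set to
cluster-wide locking, so a quick sanity check alongside the dlm errors
above would look roughly like this (a sketch assuming the usual
RHEL5/CentOS5 paths, not output taken from the thread):

    grep locking_type /etc/lvm/lvm.conf   # clustered locking needs locking_type = 3
    /sbin/clvmd -d                        # debug logging; flag name per this clvmd
                                          # version (check clvmd -h)
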
>>>>>>>
>>>>>>>
>>>>>>> That message usually means that dlm_controld has failed to start. Try
>>>>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D
>>>>>>> switch and read the output, which might give some clues as to why it's
>>>>>>> not working.
>>>>>>>
>>>>>>> Chrissie
>>>>>>>
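For reference, a minimal sketch of that manual start-up on a
RHEL5/CentOS5 cluster suite, in roughly the order the cman init script
uses (adjust the daemon list to the installed version; -D keeps each
daemon in the foreground printing debug output, so give each its own
terminal):

    /sbin/ccsd -n
    /sbin/cman_tool join
    /sbin/groupd -D
    /sbin/fenced -D
    /sbin/dlm_controld -D
    /sbin/gfs_controld -D
    /sbin/fence_tool join    # join the fence domain once fenced is up
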
>>>>>>
>>>>>>
>>>>>> Hi Chrissie,
>>>>>>
>>>>>> I thought of that but I see dlm started on both nodes.  See right
>>>>>> below.
>>>>>>
>>>>>>>> [root@omadvnfs01a ~]# ps xauwwww | grep dlm
>>>>>>>> root      5476  0.0  0.0  24736   760 ?        Ss   15:34   0:00
>>>>>>>> /sbin/dlm_controld
>>>>>>>> root      5502  0.0  0.0      0     0 ?        S<          15:34
>>>>>>>> 0:00
>>>>>
>>>>>
>>>>> Well, that's encouraging in a way! But it's evidently not started fully
>>>>> or the DLM itself would be working. So I still recommend starting it
>>>>> with -D to see how far it gets.
>>>>>
>>>>>
>>>>> Chrissie
>>>>>
>>>>>
>>>>
>>>> I think our posts crossed.  Here's my latest:
>>>>
>>>> Ok, I started all the CMAN elements manually as you suggested, in the
>>>> same order as the init script.  Here's the only error that I see; I can
>>>> post the other debug messages if you think they'd be useful, but this
>>>> is the only one that stuck out to me.
>>>>
>>>> [root@omadvnfs01a ~]# /sbin/dlm_controld -D
>>>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>>>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>>>> 1303134840 set_ccs_options 480
>>>> 1303134840 cman: node 2 added
>>>> 1303134840 set_configfs_node 2 10.198.1.111 local 0
>>>> 1303134840 cman: node 3 added
>>>> 1303134840 set_configfs_node 3 10.198.1.110 local 1
>>>>
>>>
>>> Can I see the whole set, please? It looks like dlm_controld might be
>>> stalled registering with groupd.
>>>
>>> Chrissie
>>>
>>> --
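If dlm_controld really is stuck registering with groupd, groupd's own
view can help too; on this version the running state can be dumped with
something like the following (an assumption to verify against the
group_tool man page, not a command run in this thread):

    group_tool dump          # groupd's internal debug buffer
    group_tool dump fence    # fenced's debug buffer
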
>>
>> Here you go.  Thank you very much for the help.  The output from each
>> daemon I started is below.
>>
>> [root@omadvnfs01a log]# /sbin/ccsd -n
>> Starting ccsd 2.0.115:
>>  Built: Mar  6 2011 00:47:03
>>  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
>>   No Daemon:: SET
>>
>> cluster.conf (cluster name = omadvnfs01, version = 71) found.
>> Remote copy of cluster.conf is from quorate node.
>>  Local version # : 71
>>  Remote version #: 71
>> Remote copy of cluster.conf is from quorate node.
>>  Local version # : 71
>>  Remote version #: 71
>> Remote copy of cluster.conf is from quorate node.
>>  Local version # : 71
>>  Remote version #: 71
>> Remote copy of cluster.conf is from quorate node.
>>  Local version # : 71
>>  Remote version #: 71
>> Initial status:: Quorate
>>
>> [root@omadvnfs01a ~]# /sbin/fenced -D
>> 1303134822 cman: node 2 added
>> 1303134822 cman: node 3 added
>> 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc
>> 1303134822 listen 4 member 5 groupd 7
>> 1303134861 client 3: join default
>> 1303134861 delay post_join 3s post_fail 0s
>> 1303134861 added 2 nodes from ccs
>> 1303134861 setid default 65537
>> 1303134861 start default 1 members 2 3
>> 1303134861 do_recovery stop 0 start 1 finish 0
>> 1303134861 finish default 1
>>
>> [root@omadvnfs01a ~]# /sbin/dlm_controld -D
>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>> 1303134840 set_ccs_options 480
>> 1303134840 cman: node 2 added
>> 1303134840 set_configfs_node 2 10.198.1.111 local 0
>> 1303134840 cman: node 3 added
>> 1303134840 set_configfs_node 3 10.198.1.110 local 1
>>
>>
>> [root@omadvnfs01a ~]# /sbin/groupd -D
>> 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1
>> 1303134809 setup_cpg groupd_handle 6b8b456700000000
>> 1303134809 groupd confchg total 2 left 0 joined 1
>> 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1
>> 1303134822 client connection 3
>> 1303134822 got client 3 setup
>> 1303134822 setup fence 0
>> 1303134840 client connection 4
>> 1303134840 got client 4 setup
>> 1303134840 setup dlm 1
>> 1303134853 client connection 5
>> 1303134853 got client 5 setup
>> 1303134853 setup gfs 2
>> 1303134861 got client 3 join
>> 1303134861 0:default got join
>> 1303134861 0:default is cpg client 6 name 0_default handle
>> 6633487300000001
>> 1303134861 0:default cpg_join ok
>> 1303134861 0:default waiting for first cpg event
>> 1303134861 client connection 7
>> 1303134861 0:default waiting for first cpg event
>> 1303134861 got client 7 get_group
>> 1303134861 0:default waiting for first cpg event
>> 1303134861 0:default waiting for first cpg event
>> 1303134861 0:default confchg left 0 joined 1 total 2
>> 1303134861 0:default process_node_join 3
>> 1303134861 0:default cpg add node 2 total 1
>> 1303134861 0:default cpg add node 3 total 2
>> 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1
>> 1303134861 0:default queue join event for nodeid 3
>> 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN
>> 1303134861 0:default app node init: add 3 total 1
>> 1303134861 0:default app node init: add 2 total 2
>> 1303134861 0:default waiting for 1 more stopped messages before
>> JOIN_ALL_STOPPED
>>
>
> That looks like a service error. Is fencing started and working? Check the
> output of cman_tool services or group_tool.
>
> Chrissie
>
> --
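For reference, those two checks on this cluster suite version are simply:

    cman_tool services    # service groups (fence, dlm, gfs) as cman sees them
    group_tool -v ls      # the same groups with their state, as shown below

Both should list the fence domain "default" with both node ids and a
state of "none" once fencing has settled.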

Another thing I noticed: the clustat output looks good on the CentOS
node, but the CentOS node appears offline to the RHEL node.  Here's
that clustat output, plus group_tool and cman_tool output, from both
nodes:

CentOS node:
[root@omadvnfs01a ~]# clustat
Cluster Status for omadvnfs01 @ Mon Apr 18 18:25:58 2011
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 omadvnfs01b.sec.jel.lc                                              2 Online, rgmanager
 omadvnfs01a.sec.jel.lc                                              3 Online, Local, rgmanager
...
[root@omadvnfs01a ~]# group_tool -v ls
type             level name       id       state node id local_done
fence            0     default    00010001 none
[2 3]
dlm              1     clvmd      00040002 none
[2 3]
dlm              1     rgmanager  00030002 none
[2 3]
[root@omadvnfs01a ~]# cman_tool status
Version: 6.2.0
Config Version: 72
Cluster Name: omadvnfs01
Cluster Id: 44973
Cluster Member: Yes
Cluster Generation: 1976
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Active subsystems: 9
Flags: 2node Dirty
Ports Bound: 0 11 177
Node name: omadvnfs01a.sec.jel.lc
Node ID: 3
Multicast addresses: 239.192.175.93
Node addresses: 10.198.1.110


RHEL node:
[root@omadvnfs01b ~]# clustat
Cluster Status for omadvnfs01 @ Tue Apr 19 08:29:07 2011
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 omadvnfs01b.sec.jel.lc                                              2 Online, Local, rgmanager
 omadvnfs01a.sec.jel.lc                                              3 Offline, rgmanager
...
[root@omadvnfs01b ~]# group_tool -v ls
type             level name        id       state node id local_done
fence            0     default     00010001 none
[2 3]
dlm              1     gfs_data00  00020002 none
[2]
dlm              1     rgmanager   00030002 none
[2 3]
dlm              1     clvmd       00040002 none
[2 3]
gfs              2     gfs_data00  00010002 none
[2]
[root@omadvnfs01b ~]# cman_tool status
Version: 6.2.0
Config Version: 72
Cluster Name: omadvnfs01
Cluster Id: 44973
Cluster Member: Yes
Cluster Generation: 1976
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Active subsystems: 9
Flags: 2node Dirty
Ports Bound: 0 11 177
Node name: omadvnfs01b.sec.jel.lc
Node ID: 2
Multicast addresses: 239.192.175.93
Node addresses: 10.198.1.111
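
One more view worth comparing on both nodes, since cman_tool status
reports the same membership and generation (1976) everywhere while
clustat does not (a sketch, not something captured above):

    cman_tool nodes    # raw cman membership, node by node
    clustat            # rgmanager's view, for comparison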


Thanks!



