[Linux-cluster] problems with clvmd

Terry td3201 at gmail.com
Mon Apr 18 19:17:22 UTC 2011


On Mon, Apr 18, 2011 at 9:49 AM, Terry <td3201 at gmail.com> wrote:
> On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield
> <ccaulfie at redhat.com> wrote:
>> On 18/04/11 15:11, Terry wrote:
>>>
>>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield
>>> <ccaulfie at redhat.com>  wrote:
>>>>
>>>> On 18/04/11 14:38, Terry wrote:
>>>>>
>>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield
>>>>> <ccaulfie at redhat.com>    wrote:
>>>>>>
>>>>>> On 17/04/11 21:52, Terry wrote:
>>>>>>>
>>>>>>> As a result of a strange situation where our licensing for storage
>>>>>>> dropped off, I need to join a CentOS 5.6 node to a now single-node
>>>>>>> cluster.  I got it joined to the cluster, but I am having issues
>>>>>>> with clvmd: any LVM operation (vgscan, for example) hangs on both
>>>>>>> boxes.  I have increased debugging but I don't see any logs.  The
>>>>>>> VGs aren't being populated in /dev/mapper.  This WAS working right
>>>>>>> after I joined the node to the cluster, and now it isn't for some
>>>>>>> unknown reason.  I'm not sure where to take this at this point.  I
>>>>>>> did find one odd startup log entry that I'm not sure what to make
>>>>>>> of yet:
>>>>>>> [root at omadvnfs01a ~]# dmesg | grep dlm
>>>>>>> dlm: no local IP address has been set
>>>>>>> dlm: cannot start dlm lowcomms -107
>>>>>>> dlm: Using TCP for communications
>>>>>>> dlm: connecting to 2
>>>>>>>
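When clustered LVM hangs like this, a few general checks (not specific to
this thread) can narrow things down: confirm the cluster is quorate, confirm
the groupd-managed services are up, and confirm clvmd is actually using
cluster locking.

    cman_tool status                       (quorum, votes, cluster name)
    cman_tool services                     (fence, dlm and gfs groups and their state)
    grep locking_type /etc/lvm/lvm.conf    (clvmd needs locking_type = 3)
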
>>>>>>
>>>>>>
>>>>>> That message usually means that dlm_controld has failed to start.
>>>>>> Try starting the cman daemons (groupd, dlm_controld) manually with
>>>>>> the -D switch and read the output, which might give some clues as
>>>>>> to why it's not working.
>>>>>>
>>>>>> Chrissie
>>>>>>
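For reference, starting the cman daemons by hand means running them in
roughly the order the init script does; judging from the timestamps later in
this thread, that order is ccsd, cman_tool join, then groupd, fenced,
dlm_controld and gfs_controld, with fence_tool join at the end.  A sketch,
with each daemon left running in its own terminal so the -D output stays
visible:

    /sbin/ccsd -n             (stay in the foreground instead of daemonizing)
    /sbin/cman_tool join      (join the cluster)
    /sbin/groupd -D
    /sbin/fenced -D
    /sbin/dlm_controld -D
    /sbin/gfs_controld -D
    /sbin/fence_tool join     (join the default fence domain)

clvmd is then started separately (service clvmd start) once the
infrastructure above is healthy.
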
>>>>>
>>>>>
>>>>> Hi Chrissie,
>>>>>
>>>>> I thought of that, but I see dlm_controld started on both nodes.  See right below.
>>>>>
>>>>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm
>>>>>>> root      5476  0.0  0.0  24736   760 ?        Ss   15:34   0:00
>>>>>>> /sbin/dlm_controld
>>>>>>> root      5502  0.0  0.0      0     0 ?        S<        15:34   0:00
>>>>
>>>>
>>>> Well, that's encouraging in a way!  But it's evidently not started
>>>> fully or the DLM itself would be working.  So I still recommend
>>>> starting it with -D to see how far it gets.
>>>>
>>>>
>>>> Chrissie
>>>>
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>
>>> I think our posts crossed.  Here's my latest:
>>>
>>> OK, I started all the cman daemons manually as you suggested, in the
>>> same order as the init script.  Here's the only error that I see.  I
>>> can post the other debug messages if you think they'd be useful, but
>>> this is the only one that stuck out to me.
>>>
>>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D
>>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>>> 1303134840 set_ccs_options 480
>>> 1303134840 cman: node 2 added
>>> 1303134840 set_configfs_node 2 10.198.1.111 local 0
>>> 1303134840 cman: node 3 added
>>> 1303134840 set_configfs_node 3 10.198.1.110 local 1
>>>
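The two opendir failures are ENOENT (errno 2) on the dlm configfs tree, and
they are probably harmless here: dlm_controld tries to clear out stale
entries at startup, and those directories may simply not exist yet.  The
later set_configfs_node lines suggest configfs itself is usable.  If in
doubt, a couple of generic checks (not taken from the thread) confirm it:

    grep configfs /proc/mounts               (configfs should be mounted on /sys/kernel/config)
    mount -t configfs none /sys/kernel/config (only needed if it turns out not to be mounted)
    ls /sys/kernel/config/dlm/cluster         (comms/ and spaces/ appear once dlm_controld has configured the nodes)
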
>>
>> Can I see the whole set, please?  It looks like dlm_controld might be
>> stalled registering with groupd.
>>
>> Chrissie
>>
>> --
>
> Here you go.  Thank you very much for the help.  The output from each
> daemon that I started is below.
>
> [root at omadvnfs01a log]# /sbin/ccsd -n
> Starting ccsd 2.0.115:
>  Built: Mar  6 2011 00:47:03
>  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
>  No Daemon:: SET
>
> cluster.conf (cluster name = omadvnfs01, version = 71) found.
> Remote copy of cluster.conf is from quorate node.
>  Local version # : 71
>  Remote version #: 71
> Remote copy of cluster.conf is from quorate node.
>  Local version # : 71
>  Remote version #: 71
> Remote copy of cluster.conf is from quorate node.
>  Local version # : 71
>  Remote version #: 71
> Remote copy of cluster.conf is from quorate node.
>  Local version # : 71
>  Remote version #: 71
> Initial status:: Quorate
>
> [root at omadvnfs01a ~]# /sbin/fenced -D
> 1303134822 cman: node 2 added
> 1303134822 cman: node 3 added
> 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc
> 1303134822 listen 4 member 5 groupd 7
> 1303134861 client 3: join default
> 1303134861 delay post_join 3s post_fail 0s
> 1303134861 added 2 nodes from ccs
> 1303134861 setid default 65537
> 1303134861 start default 1 members 2 3
> 1303134861 do_recovery stop 0 start 1 finish 0
> 1303134861 finish default 1
>
> [root at omadvnfs01a ~]# /sbin/dlm_controld -D
> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
> 1303134840 set_ccs_options 480
> 1303134840 cman: node 2 added
> 1303134840 set_configfs_node 2 10.198.1.111 local 0
> 1303134840 cman: node 3 added
> 1303134840 set_configfs_node 3 10.198.1.110 local 1
>
>
> [root at omadvnfs01a ~]# /sbin/groupd -D
> 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1
> 1303134809 setup_cpg groupd_handle 6b8b456700000000
> 1303134809 groupd confchg total 2 left 0 joined 1
> 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1
> 1303134822 client connection 3
> 1303134822 got client 3 setup
> 1303134822 setup fence 0
> 1303134840 client connection 4
> 1303134840 got client 4 setup
> 1303134840 setup dlm 1
> 1303134853 client connection 5
> 1303134853 got client 5 setup
> 1303134853 setup gfs 2
> 1303134861 got client 3 join
> 1303134861 0:default got join
> 1303134861 0:default is cpg client 6 name 0_default handle 6633487300000001
> 1303134861 0:default cpg_join ok
> 1303134861 0:default waiting for first cpg event
> 1303134861 client connection 7
> 1303134861 0:default waiting for first cpg event
> 1303134861 got client 7 get_group
> 1303134861 0:default waiting for first cpg event
> 1303134861 0:default waiting for first cpg event
> 1303134861 0:default confchg left 0 joined 1 total 2
> 1303134861 0:default process_node_join 3
> 1303134861 0:default cpg add node 2 total 1
> 1303134861 0:default cpg add node 3 total 2
> 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1
> 1303134861 0:default queue join event for nodeid 3
> 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN
> 1303134861 0:default app node init: add 3 total 1
> 1303134861 0:default app node init: add 2 total 2
> 1303134861 0:default waiting for 1 more stopped messages before JOIN_ALL_STOPPED
>  3
> 1303134861 0:default mark node 2 stopped
> 1303134861 0:default set global_id 10001 from 2
> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STOPPED
> 1303134861 0:default action for app: setid default 65537
> 1303134861 0:default action for app: start default 1 2 2 2 3
> 1303134861 client connection 7
> 1303134861 got client 7 get_group
> 1303134861 0:default mark node 2 started
> 1303134861 client connection 7
> 1303134861 got client 7 get_group
> 1303134861 got client 3 start_done
> 1303134861 0:default send started
> 1303134861 0:default mark node 3 started
> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STARTED
> 1303134861 0:default action for app: finish default 1
> 1303134862 client connection 7
> 1303134862 got client 7 get_group
>
>
> [root at omadvnfs01a ~]# /sbin/gfs_controld -D
> 1303134853 config_no_withdraw 0
> 1303134853 config_no_plock 0
> 1303134853 config_plock_rate_limit 100
> 1303134853 config_plock_ownership 0
> 1303134853 config_drop_resources_time 10000
> 1303134853 config_drop_resources_count 10
> 1303134853 config_drop_resources_age 10000
> 1303134853 protocol 1.0.0
> 1303134853 listen 3
> 1303134853 cpg 6
> 1303134853 groupd 7
> 1303134853 uevent 8
> 1303134853 plocks 10
> 1303134853 plock need_fsid_translation 1
> 1303134853 plock cpg message size: 336 bytes
> 1303134853 setup done
>
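The groupd output above shows all three daemons registering with it (setup
fence 0, dlm 1, gfs 2) and the default fence domain completing its join, so
the group infrastructure on this node looks reasonable.  As a general
suggestion rather than something from the reply, comparing the two nodes'
views can show whether anything is stuck in a transitional state:

    cman_tool services      (group type, name, id and state)
    group_tool ls           (roughly the same listing, straight from groupd)
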

Another gap I just found: I forgot to specify a fencing method for the
new CentOS node.  I've put that in, and now the RHEL node wants to
fence it, so I'm letting it do that and then I'll see where I end up.
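
For reference, a per-node fencing method in cluster.conf looks roughly like
the sketch below.  The agent, device name, address and credentials are
placeholders (the thread doesn't say what fence hardware is in use), so only
the structure is meant to carry over; the node name and nodeid are taken
from the fenced output above.

    <clusternode name="omadvnfs01a.sec.jel.lc" nodeid="3" votes="1">
            <fence>
                    <method name="1">
                            <device name="fence-node-a"/>
                    </method>
            </fence>
    </clusternode>
    ...
    <fencedevices>
            <fencedevice agent="fence_ipmilan" name="fence-node-a"
                         ipaddr="x.x.x.x" login="admin" passwd="secret"/>
    </fencedevices>

After editing, the config_version attribute at the top of the file needs to
be bumped and the new file propagated to the other node, typically with
ccs_tool update /etc/cluster/cluster.conf.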



