[Linux-cluster] fence in xen
Joel Heenan
joelh at planetjoel.com
Mon Oct 4 23:27:51 UTC 2010
This:
"""
[root at clu5 ~]# group_tool
type level name id state
fence 0 default 00010001 JOIN_STOP_WAIT
[1 2 2]
dlm 1 clvmd 00020001 JOIN_STOP_WAIT
[1 2 2]
dlm 1 rgmanager 00010002 none
[1 2]
""
To my understanding this means that fence and dlm for clvmd both see two
copies of node 2. You'll have to check how this is happening: did cman start
twice? Did you manually stop it and start it?
Try disabling your firewall and getting both nodes up in a stable state; the
states should all be "none". Once that is done, look at fencing.
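A quick way to check this is to parse group_tool's output and flag any group
that is not yet in the "none" state. A minimal sketch (the column layout and
group names are taken from the output quoted above; untested against a real
cluster):

```python
def unstable_groups(group_tool_output):
    """Return (name, state) pairs for every fence/dlm group whose
    state is not "none", i.e. not yet stable."""
    bad = []
    for line in group_tool_output.splitlines():
        parts = line.split()
        # data rows look like: type level name id state
        if len(parts) >= 5 and parts[0] in ("fence", "dlm"):
            name, state = parts[2], parts[4]
            if state != "none":
                bad.append((name, state))
    return bad

sample = """\
type level name id state
fence 0 default 00010001 JOIN_STOP_WAIT
[1 2 2]
dlm 1 clvmd 00020001 JOIN_STOP_WAIT
[1 2 2]
dlm 1 rgmanager 00010002 none
[1 2]
"""
print(unstable_groups(sample))
# -> [('default', 'JOIN_STOP_WAIT'), ('clvmd', 'JOIN_STOP_WAIT')]
```

Feed it `group_tool`'s output from each node; an empty list means the node is
stable and you can move on to testing fencing.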
Joel
On Fri, Oct 1, 2010 at 11:42 PM, Rakovec Jost <Jost.Rakovec at snt.si> wrote:
> Hi Joel,
>
>
> On Fri, 2010-10-01 at 15:09 +1000, Joel Heenan wrote:
> > Are you saying that if you manually destroy the guest, then start it
> > up it works?
>
> No. I have to destroy both nodes.
>
> >
> > I don't think your problem is with fencing; I think it's that the two
> > guests are not joining correctly. It seems like the fencing part is
> > working.
> >
> > Do the logs in /var/log/messages show that one node successfully fenced
> > the other? What is the output of group_tool on both nodes after they
> > have come up? This should help you debug it.
> >
> yes
> Oct 1 11:04:39 clu5 fenced[1541]: fence "clu6.snt.si" success
>
>
>
>
>
> node1
>
> [root at clu5 ~]# group_tool
> type level name id state
> fence 0 default 00010001 JOIN_STOP_WAIT
> [1 2 2]
> dlm 1 clvmd 00020001 JOIN_STOP_WAIT
> [1 2 2]
> dlm 1 rgmanager 00010002 none
> [1 2]
> [root at clu5 ~]#
> [root at clu5 ~]#
> [root at clu5 ~]# group_tool dump fence
> 1285924843 our_nodeid 1 our_name clu5.snt.si
> 1285924843 listen 4 member 5 groupd 7
> 1285924846 client 3: join default
> 1285924846 delay post_join 3s post_fail 0s
> 1285924846 added 2 nodes from ccs
> 1285924846 setid default 65537
> 1285924846 start default 1 members 1
> 1285924846 do_recovery stop 0 start 1 finish 0
> 1285924846 finish default 1
> 1285924846 stop default
> 1285924846 start default 2 members 2 1
> 1285924846 do_recovery stop 1 start 2 finish 1
> 1285924846 finish default 2
> 1285924936 stop default
> 1285924985 client 3: dump
> 1285925065 client 3: dump
> 1285925281 client 3: dump
> [root at clu5 ~]#
>
>
>
> node2
>
> [root at clu6 ~]# group_tool
> type level name id state
> fence 0 default 00000000 JOIN_STOP_WAIT
> [1 2]
> dlm 1 clvmd 00000000 JOIN_STOP_WAIT
> [1 2]
> [root at clu6 ~]#
> [root at clu6 ~]#
> [root at clu6 ~]# group_tool dump fence
> 1285924935 our_nodeid 2 our_name clu6.snt.si
> 1285924935 listen 4 member 5 groupd 7
> 1285924936 client 3: join default
> 1285924936 delay post_join 3s post_fail 0s
> 1285924936 added 2 nodes from ccs
> 1285925291 client 3: dump
> [root at clu6 ~]#
>
>
> thx
>
> br jost
>
>
> ________________________________________
> From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com]
> On Behalf Of Joel Heenan [joelh at planetjoel.com]
> Sent: Friday, October 01, 2010 7:09 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] fence in xen
>
> Are you saying that if you manually destroy the guest, then start it up it
> works?
>
>
>
> I don't think your problem is with fencing; I think it's that the two guests
> are not joining correctly. It seems like the fencing part is working.
>
> Do the logs in /var/log/messages show that one node successfully fenced the
> other? What is the output of group_tool on both nodes after they have come
> up? This should help you debug it.
>
> I don't think it's relevant, but this item from the FAQ may help:
>
> http://sources.redhat.com/cluster/wiki/FAQ/Fencing#fence_stuck
>
> Joel
>
> On Wed, Sep 22, 2010 at 7:08 PM, Rakovec Jost <Jost.Rakovec at snt.si> wrote:
> Hi
>
> Does anybody have any idea? Please help!
>
>
> Now I can fence the node, but after booting it can't connect to the cluster.
>
> on dom0
>
> fence_xvmd -LX -I xenbr0 -U xen:/// -fdddddddddddddd
>
>
> ipv4_connect: Connecting to client
> ipv4_connect: Success; fd = 12
> Rebooting domain oelcl21...
> [REBOOT] Calling virDomainDestroy(0x99cede0)
> libvir: Xen error : Domain not found: xenUnifiedDomainLookupByName
> [[ XML Domain Info ]]
> <domain type='xen' id='41'>
> <name>oelcl21</name>
> <uuid>07e31b27-1ff1-4754-4f58-221e8d2057d6</uuid>
> <memory>1048576</memory>
> <currentMemory>1048576</currentMemory>
> <vcpu>2</vcpu>
> <bootloader>/usr/bin/pygrub</bootloader>
> <os>
> <type>linux</type>
> </os>
> <clock offset='utc'/>
> <on_poweroff>destroy</on_poweroff>
> <on_reboot>restart</on_reboot>
> <on_crash>restart</on_crash>
> <devices>
> <disk type='block' device='disk'>
> <driver name='phy'/>
> <source dev='/dev/vg_datastore/oelcl21'/>
> <target dev='xvda' bus='xen'/>
> </disk>
> <disk type='block' device='disk'>
> <driver name='phy'/>
> <source dev='/dev/vg_datastore/skupni1'/>
> <target dev='xvdb' bus='xen'/>
> <shareable/>
> </disk>
> <interface type='bridge'>
> <mac address='00:16:3e:7c:60:aa'/>
> <source bridge='xenbr0'/>
> <script path='/etc/xen/scripts/vif-bridge'/>
> <target dev='vif41.0'/>
> </interface>
> <console type='pty' tty='/dev/pts/2'>
> <source path='/dev/pts/2'/>
> <target port='0'/>
> </console>
> </devices>
> </domain>
>
> [[ XML END ]]
> Calling virDomainCreateLinux()..
>
>
> on domU -node1
>
> fence_xvm -H oelcl21 -ddd
>
> clustat on node1:
>
> [root at oelcl11 ~]# clustat
> Cluster Status for cluster2 @ Wed Sep 22 11:04:49 2010
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> oelcl11 1 Online, Local,
> rgmanager
> oelcl21 2 Online, rgmanager
>
> Service Name Owner (Last)
> State
> ------- ---- ----- ------
> -----
> service:web oelcl11
> started
> [root at oelcl11 ~]#
>
>
> but node2 waits for 300s and can't connect.
>
> Starting daemons... done
> Starting fencing... Sep 22 10:41:06 oelcl21 kernel: eth0: no IPv6 routers
> present
> done
> [ OK ]
>
> [root at oelcl21 ~]# clustat
> Cluster Status for cluster2 @ Wed Sep 22 11:04:19 2010
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> oelcl11 1 Online
> oelcl21 2 Online, Local
>
> [root at oelcl21 ~]#
>
>
>
> br
> jost
>
>
>
>
> ________________________________________
> From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com]
> On Behalf Of Rakovec Jost [Jost.Rakovec at snt.si]
> Sent: Monday, September 13, 2010 9:31 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] fence in xen
>
> Hi
>
>
> Q: must fence_xvmd also run in domU?
> Because I noticed that if I run this on the host while fence_xvmd is running:
>
> [root at oelcl1 ~]# fence_xvm -H oelcl2 -ddd -o null
> Debugging threshold is now 3
> -- args @ 0x7fffe3f71fb0 --
> args->addr = 225.0.0.12
> args->domain = oelcl2
> args->key_file = /etc/cluster/fence_xvm.key
> args->op = 0
> args->hash = 2
> args->auth = 2
> args->port = 1229
> args->ifindex = 0
> args->family = 2
> args->timeout = 30
> args->retr_time = 20
> args->flags = 0
> args->debug = 3
> -- end args --
> Reading in key file /etc/cluster/fence_xvm.key into 0x7fffe3f70f60 (4096
> max size)
> Actual key length = 4096 bytes
> Sending to 225.0.0.12 via 127.0.0.1
> Sending to 225.0.0.12 via 10.9.131.80
> Sending to 225.0.0.12 via 10.9.131.83
> Sending to 225.0.0.12 via 192.168.122.1
> Waiting for connection from XVM host daemon.
> Issuing TCP challenge
> Responding to TCP challenge
> TCP Exchange + Authentication done...
> Waiting for return value from XVM host
> Remote: Operation was successful
>
>
> but if I try a real fence (reboot), then I get:
>
> [root at oelcl1 ~]# fence_xvm -H oelc2
> Remote: Operation was successful
> [root at oelcl1 ~]#
>
> but host2 does not reboot.
>
>
> If fence_xvmd is not running on the host, then I get a timeout.
>
>
>
> [root at oelcl1 sysconfig]# fence_xvm -H oelcl2 -ddd -o null
> Debugging threshold is now 3
> -- args @ 0x7fff1a6b5580 --
> args->addr = 225.0.0.12
> args->domain = oelcl2
> args->key_file = /etc/cluster/fence_xvm.key
> args->op = 0
> args->hash = 2
> args->auth = 2
> args->port = 1229
> args->ifindex = 0
> args->family = 2
> args->timeout = 30
> args->retr_time = 20
> args->flags = 0
> args->debug = 3
> -- end args --
> Reading in key file /etc/cluster/fence_xvm.key into 0x7fff1a6b4530 (4096
> max size)
> Actual key length = 4096 bytes
> Sending to 225.0.0.12 via 127.0.0.1
> Sending to 225.0.0.12 via 10.9.131.80
> Waiting for connection from XVM host daemon.
> Sending to 225.0.0.12 via 127.0.0.1
> Sending to 225.0.0.12 via 10.9.131.80
> Waiting for connection from XVM host daemon.
>
>
>
> Q: how can I try if multicast is ok?
>
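One rough way to test whether multicast works at all is a loopback
send/receive on the group and port fence_xvm uses (225.0.0.12:1229, per the
args dumps in this thread). This is only a sketch: it exercises the local IP
stack, not the dom0/domU path or the firewall, and returns False on any
socket error rather than raising:

```python
import socket
import struct

GROUP, PORT = "225.0.0.12", 1229  # defaults shown in the fence_xvm args dumps

def multicast_loopback_check(group=GROUP, port=PORT,
                             payload=b"mcast-probe", timeout=2.0):
    """Send one datagram to the multicast group with loopback enabled and
    try to receive it back; True means the local stack delivers multicast."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        rx.bind(("", port))
        # join the group on all interfaces
        mreq = struct.pack("4sl", socket.inet_aton(group), socket.INADDR_ANY)
        rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        rx.settimeout(timeout)
        tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
        tx.sendto(payload, (group, port))
        data, _ = rx.recvfrom(1024)
        return data == payload
    except (socket.timeout, OSError):
        return False
    finally:
        rx.close()
        tx.close()
```

If this passes locally but the fence still times out, the next suspects are
the firewall and the dom0/domU multicast routing (e.g. running a receiver on
one side and a sender on the other, or a tool like omping).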
> Q: on which network interface must fence_xvmd run on dom0? I notice that on
> the domU there is:
>
> virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
> inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
> inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:40 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 b) TX bytes:7212 (7.0 KiB)
>
>
> so virbr0 is present there as well.
>
> and on the dom0 host:
>
> [root at vm5 ~]# fence_xvmd -fdd -I xenbr0
> -- args @ 0xbfd26234 --
> args->addr = 225.0.0.12
> args->domain = (null)
> args->key_file = /etc/cluster/fence_xvm.key
> args->op = 2
> args->hash = 2
> args->auth = 2
> args->port = 1229
> args->ifindex = 7
> args->family = 2
> args->timeout = 30
> args->retr_time = 20
> args->flags = 1
> args->debug = 2
> -- end args --
> Opened ckpt vm_states
> My Node ID = 1
> Domain UUID Owner State
> ------ ---- ----- -----
> Domain-0 00000000-0000-0000-0000-000000000000 00001 00001
> oelcl1 2a53022c-5836-68f0-4514-02a5a0b07e81 00001 00002
> oelcl2 dd268dd4-f012-e0f7-7c77-aa8a58e1e6ab 00001 00002
> oelcman 09c783bd-9107-0916-ebbf-bd27bcc8babe 00001 00002
> Storing oelcl1
> Storing oelcl2
>
>
>
> [root at vm5 ~]# fence_xvmd -fdd -I virbr0
> -- args @ 0xbfd26234 --
> args->addr = 225.0.0.12
> args->domain = (null)
> args->key_file = /etc/cluster/fence_xvm.key
> args->op = 2
> args->hash = 2
> args->auth = 2
> args->port = 1229
> args->ifindex = 7
> args->family = 2
> args->timeout = 30
> args->retr_time = 20
> args->flags = 1
> args->debug = 2
> -- end args --
> Opened ckpt vm_states
> My Node ID = 1
> Domain UUID Owner State
> ------ ---- ----- -----
> Domain-0 00000000-0000-0000-0000-000000000000 00001 00001
> oelcl1 2a53022c-5836-68f0-4514-02a5a0b07e81 00001 00002
> oelcl2 dd268dd4-f012-e0f7-7c77-aa8a58e1e6ab 00001 00002
> oelcman 09c783bd-9107-0916-ebbf-bd27bcc8babe 00001 00002
> Storing oelcl1
> Storing oelcl2
>
>
> No matter which interface I use, the fence is not done.
>
>
> thx
>
> br jost
>
>
>
>
>
>
>
>
>
> _____________________________________
> From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com]
> On Behalf Of Rakovec Jost [Jost.Rakovec at snt.si]
> Sent: Saturday, September 11, 2010 6:36 PM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] fence in xen
>
> Hi list!
>
>
> I have a question about fence_xvm.
>
> Situation is:
>
> one physical server with Xen --> dom0 with 2 domUs. The cluster works fine
> between the domUs (reboot, relocate, ...).
>
> I'm using Red Hat 5.5.
>
> The problem is fencing from dom0 with "fence_xvm -H oelcl2": the domU is
> destroyed, but when it is booted back up it can't join the cluster. The domU
> takes a very long time to boot --> FENCED_START_TIMEOUT=300
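(FENCED_START_TIMEOUT is the fence-domain join timeout read by the RHEL 5
cman init script from its sysconfig file. Lowering it only shortens the
boot-time wait when a node fails to join; it does not fix the underlying join
problem. A sketch, assuming the stock file location:)

```
# /etc/sysconfig/cman (RHEL 5) - sourced by /etc/init.d/cman
# Seconds to wait for fenced to join the fence domain at startup.
FENCED_START_TIMEOUT=60
```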
>
>
> On the console, after node2 is up, I get:
>
> node2:
>
> INFO: task clurgmgrd:2127 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> clurgmgrd D 0000000000000010 0 2127 2126
> (NOTLB)
> ffff88006f08dda8 0000000000000286 ffff88007cc0b810 0000000000000000
> 0000000000000003 ffff880072009860 ffff880072f6b0c0 00000000000455ec
> ffff880072009a48 ffffffff802649d7
> Call Trace:
> [<ffffffff802649d7>] _read_lock_irq+0x9/0x19
> [<ffffffff8021420e>] filemap_nopage+0x193/0x360
> [<ffffffff80263a7e>] __mutex_lock_slowpath+0x60/0x9b
> [<ffffffff80263ac8>] .text.lock.mutex+0xf/0x14
> [<ffffffff88424b64>] :dlm:dlm_new_lockspace+0x2c/0x860
> [<ffffffff80222b08>] __up_read+0x19/0x7f
> [<ffffffff802d0abb>] __kmalloc+0x8f/0x9f
> [<ffffffff8842b6fa>] :dlm:device_write+0x438/0x5e5
> [<ffffffff80217377>] vfs_write+0xce/0x174
> [<ffffffff80217bc4>] sys_write+0x45/0x6e
> [<ffffffff802602f9>] tracesys+0xab/0xb6
>
>
> during boot on node2:
>
> Starting clvmd: dlm: Using TCP for communications
> clvmd startup timed out
> [FAILED]
>
>
>
> node2:
>
> [root at oelcl2 init.d]# clustat
> Cluster Status for cluster1 @ Sat Sep 11 18:11:21 2010
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> oelcl1 1 Online
> oelcl2 2 Online, Local
>
> [root at oelcl2 init.d]#
>
>
> on first node:
>
> [root at oelcl1 ~]# clustat
> Cluster Status for cluster1 @ Sat Sep 11 18:12:07 2010
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> oelcl1 1 Online, Local,
> rgmanager
> oelcl2 2 Online,
> rgmanager
>
> Service Name Owner (Last)
> State
> ------- ---- ----- ------
> -----
> service:webby oelcl1
> started
> [root at oelcl1 ~]#
>
>
> and then I have to destroy both domUs and create them again to get
> node2 working.
>
> I have used the how-tos at https://access.redhat.com/kb/docs/DOC-5937 and
> http://sources.redhat.com/cluster/wiki/VMClusterCookbook
>
>
> cluster config on dom0
>
>
> <?xml version="1.0"?>
> <cluster alias="vmcluster" config_version="1" name="vmcluster">
> <clusternodes>
> <clusternode name="vm5" nodeid="1" votes="1"/>
> </clusternodes>
> <cman/>
> <fencedevices/>
> <rm/>
> <fence_xvmd/>
> </cluster>
>
>
>
> cluster config on domU
>
>
> <?xml version="1.0"?>
> <cluster alias="cluster1" config_version="49" name="cluster1">
> <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="4"/>
> <clusternodes>
> <clusternode name="oelcl1.name.comi" nodeid="1" votes="1">
> <fence>
> <method name="1">
> <device domain="oelcl1"
> name="xenfence1"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="oelcl2.name.com"
> nodeid="2" votes="1">
> <fence>
> <method name="1">
> <device domain="oelcl2"
> name="xenfence1"/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <cman expected_votes="1" two_node="1"/>
> <fencedevices>
> <fencedevice agent="fence_xvm" name="xenfence1"/>
> </fencedevices>
> <rm>
> <failoverdomains>
> <failoverdomain name="prefer_node1" nofailback="0"
> ordered="1" restricted="1">
> <failoverdomainnode name="oelcl1.name.com" priority="1"/>
> <failoverdomainnode name="oelcl2.name.com" priority="2"/>
> </failoverdomain>
> </failoverdomains>
> <resources>
> <ip address="xx.xx.xx.xx" monitor_link="1"/>
> <fs device="/dev/xvdb1" force_fsck="0"
> force_unmount="0" fsid="8669" fstype="ext3" mountpoint="/var/www/html"
> name="docroot" self_fence="0"/>
> <script file="/etc/init.d/httpd" name="apache_s"/>
> </resources>
> <service autostart="1" domain="prefer_node1" exclusive="0"
> name="webby" recovery="relocate">
> <ip ref="xx.xx.xx.xx"/>
> <fs ref="docroot"/>
> <script ref="apache_s"/>
> </service>
> </rm>
> </cluster>
>
>
>
>
> fence processes on dom0
>
> [root at vm5 cluster]# ps -ef |grep fenc
> root 18690 1 0 17:40 ? 00:00:00 /sbin/fenced
> root 18720 1 0 17:40 ? 00:00:00 /sbin/fence_xvmd -I xenbr0
> root 22633 14524 0 18:21 pts/3 00:00:00 grep fenc
> [root at vm5 cluster]#
>
>
> and on domU
>
> [root at oelcl1 ~]# ps -ef|grep fen
> root 1523 1 0 17:41 ? 00:00:00 /sbin/fenced
> root 13695 2902 0 18:22 pts/0 00:00:00 grep fen
> [root at oelcl1 ~]#
>
>
>
> Does somebody have any idea why fencing doesn't work?
>
> thx
>
> br
>
> jost
>
>
>
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>