[Linux-cluster] fence in xen
Joel Heenan
joelh at planetjoel.com
Mon Oct 4 23:27:51 UTC 2010
This:
"""
[root at clu5 ~]# group_tool
type level name id state
fence 0 default 00010001 JOIN_STOP_WAIT
[1 2 2]
dlm 1 clvmd 00020001 JOIN_STOP_WAIT
[1 2 2]
dlm 1 rgmanager 00010002 none
[1 2]
""
To my understanding this means that fence and dlm for clvmd both see two
copies of node 2. You'll have to check how this is happening: did cman start
twice? Did you manually stop it and start it?
Try disabling your firewall and getting both nodes up in a stable state; the
states should all be "none". Once that is done, look at fencing.
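A quick way to check this is to parse group_tool's output and flag any group
that is not yet in the "none" state. A minimal sketch (the column layout and
group names are taken from the output quoted above; untested against a real
cluster):

```python
def unstable_groups(group_tool_output):
    """Return (name, state) pairs for every fence/dlm group whose
    state is not "none", i.e. not yet stable."""
    bad = []
    for line in group_tool_output.splitlines():
        parts = line.split()
        # data rows look like: type level name id state
        if len(parts) >= 5 and parts[0] in ("fence", "dlm"):
            name, state = parts[2], parts[4]
            if state != "none":
                bad.append((name, state))
    return bad

sample = """\
type level name id state
fence 0 default 00010001 JOIN_STOP_WAIT
[1 2 2]
dlm 1 clvmd 00020001 JOIN_STOP_WAIT
[1 2 2]
dlm 1 rgmanager 00010002 none
[1 2]
"""
print(unstable_groups(sample))
# -> [('default', 'JOIN_STOP_WAIT'), ('clvmd', 'JOIN_STOP_WAIT')]
```

Feed it `group_tool`'s output from each node; an empty list means the node is
stable and you can move on to testing fencing.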
Joel
On Fri, Oct 1, 2010 at 11:42 PM, Rakovec Jost <Jost.Rakovec at snt.si> wrote:
> Hi Joel,
>
>
> On Fri, 2010-10-01 at 15:09 +1000, Joel Heenan wrote:
> > Are you saying that if you manually destroy the guest, then start it
> > up it works?
>
> No. I have to destroy both nodes.
>
> >
> > I don't think your problem is with fencing; I think it's that the two
> > guests are not joining correctly. It seems like the fencing part is
> > working.
> >
> > Do the logs in /var/log/messages show that one node successfully fenced
> > the other? What is the output of group_tool on both nodes after they
> > have come up? This should help you debug it.
> >
> yes
> Oct 1 11:04:39 clu5 fenced[1541]: fence "clu6.snt.si" success
>
>
>
>
>
> node1
>
> [root at clu5 ~]# group_tool
> type level name id state
> fence 0 default 00010001 JOIN_STOP_WAIT
> [1 2 2]
> dlm 1 clvmd 00020001 JOIN_STOP_WAIT
> [1 2 2]
> dlm 1 rgmanager 00010002 none
> [1 2]
> [root at clu5 ~]#
> [root at clu5 ~]#
> [root at clu5 ~]# group_tool dump fence
> 1285924843 our_nodeid 1 our_name clu5.snt.si
> 1285924843 listen 4 member 5 groupd 7
> 1285924846 client 3: join default
> 1285924846 delay post_join 3s post_fail 0s
> 1285924846 added 2 nodes from ccs
> 1285924846 setid default 65537
> 1285924846 start default 1 members 1
> 1285924846 do_recovery stop 0 start 1 finish 0
> 1285924846 finish default 1
> 1285924846 stop default
> 1285924846 start default 2 members 2 1
> 1285924846 do_recovery stop 1 start 2 finish 1
> 1285924846 finish default 2
> 1285924936 stop default
> 1285924985 client 3: dump
> 1285925065 client 3: dump
> 1285925281 client 3: dump
> [root at clu5 ~]#
>
>
>
> node2
>
> [root at clu6 ~]# group_tool
> type level name id state
> fence 0 default 00000000 JOIN_STOP_WAIT
> [1 2]
> dlm 1 clvmd 00000000 JOIN_STOP_WAIT
> [1 2]
> [root at clu6 ~]#
> [root at clu6 ~]#
> [root at clu6 ~]# group_tool dump fence
> 1285924935 our_nodeid 2 our_name clu6.snt.si
> 1285924935 listen 4 member 5 groupd 7
> 1285924936 client 3: join default
> 1285924936 delay post_join 3s post_fail 0s
> 1285924936 added 2 nodes from ccs
> 1285925291 client 3: dump
> [root at clu6 ~]#
>
>
> thx
>
> br jost
>
>
> ________________________________________
> From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com]
> On Behalf Of Joel Heenan [joelh at planetjoel.com]
> Sent: Friday, October 01, 2010 7:09 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] fence in xen
>
> Are you saying that if you manually destroy the guest, then start it up it
> works?
>
>
>
> I don't think your problem is with fencing; I think it's that the two guests
> are not joining correctly. It seems like the fencing part is working.
>
> Do the logs in /var/log/messages show that one node successfully fenced the
> other? What is the output of group_tool on both nodes after they have come
> up? This should help you debug it.
>
> I don't think it's relevant, but this item from the FAQ may help:
>
> http://sources.redhat.com/cluster/wiki/FAQ/Fencing#fence_stuck
>
> Joel
>
> On Wed, Sep 22, 2010 at 7:08 PM, Rakovec Jost <Jost.Rakovec at snt.si> wrote:
> Hi
>
> Does anybody have any idea? Please help!
>
>
> Now I can fence the node, but after booting it can't connect to the cluster.
>
> on dom0
>
> fence_xvmd -LX -I xenbr0 -U xen:/// -fdddddddddddddd
>
>
> ipv4_connect: Connecting to client
> ipv4_connect: Success; fd = 12
> Rebooting domain oelcl21...
> [REBOOT] Calling virDomainDestroy(0x99cede0)
> libvir: Xen error : Domain not found: xenUnifiedDomainLookupByName
> [[ XML Domain Info ]]
> <domain type='xen' id='41'>
> <name>oelcl21</name>
> <uuid>07e31b27-1ff1-4754-4f58-221e8d2057d6</uuid>
> <memory>1048576</memory>
> <currentMemory>1048576</currentMemory>
> <vcpu>2</vcpu>
> <bootloader>/usr/bin/pygrub</bootloader>
> <os>
> <type>linux</type>
> </os>
> <clock offset='utc'/>
> <on_poweroff>destroy</on_poweroff>
> <on_reboot>restart</on_reboot>
> <on_crash>restart</on_crash>
> <devices>
> <disk type='block' device='disk'>
> <driver name='phy'/>
> <source dev='/dev/vg_datastore/oelcl21'/>
> <target dev='xvda' bus='xen'/>
> </disk>
> <disk type='block' device='disk'>
> <driver name='phy'/>
> <source dev='/dev/vg_datastore/skupni1'/>
> <target dev='xvdb' bus='xen'/>
> <shareable/>
> </disk>
> <interface type='bridge'>
> <mac address='00:16:3e:7c:60:aa'/>
> <source bridge='xenbr0'/>
> <script path='/etc/xen/scripts/vif-bridge'/>
> <target dev='vif41.0'/>
> </interface>
> <console type='pty' tty='/dev/pts/2'>
> <source path='/dev/pts/2'/>
> <target port='0'/>
> </console>
> </devices>
> </domain>
>
> [[ XML END ]]
> Calling virDomainCreateLinux()..
>
>
> on domU -node1
>
> fence_xvm -H oelcl21 -ddd
>
> clustat on node1:
>
> [root at oelcl11 ~]# clustat
> Cluster Status for cluster2 @ Wed Sep 22 11:04:49 2010
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> oelcl11 1 Online, Local,
> rgmanager
> oelcl21 2 Online, rgmanager
>
> Service Name Owner (Last)
> State
> ------- ---- ----- ------
> -----
> service:web oelcl11
> started
> [root at oelcl11 ~]#
>
>
> but node2 waits for 300s and can't connect.
>
> Starting daemons... done
> Starting fencing... Sep 22 10:41:06 oelcl21 kernel: eth0: no IPv6 routers
> present
> done
> [ OK ]
>
> [root at oelcl21 ~]# clustat
> Cluster Status for cluster2 @ Wed Sep 22 11:04:19 2010
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> oelcl11 1 Online
> oelcl21 2 Online, Local
>
> [root at oelcl21 ~]#
>
>
>
> br
> jost
>
>
>
>
> ________________________________________
> From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com]
> On Behalf Of Rakovec Jost [Jost.Rakovec at snt.si]
> Sent: Monday, September 13, 2010 9:31 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] fence in xen
>
> Hi
>
>
> Q: must fence_xvmd also run in domU?
> Because I noticed that if I run this on the host while fence_xvmd is running:
>
> [root at oelcl1 ~]# fence_xvm -H oelcl2 -ddd -o null
> Debugging threshold is now 3
> -- args @ 0x7fffe3f71fb0 --
> args->addr = 225.0.0.12
> args->domain = oelcl2
> args->key_file = /etc/cluster/fence_xvm.key
> args->op = 0
> args->hash = 2
> args->auth = 2
> args->port = 1229
> args->ifindex = 0
> args->family = 2
> args->timeout = 30
> args->retr_time = 20
> args->flags = 0
> args->debug = 3
> -- end args --
> Reading in key file /etc/cluster/fence_xvm.key into 0x7fffe3f70f60 (4096
> max size)
> Actual key length = 4096 bytes
> Sending to 225.0.0.12 via 127.0.0.1
> Sending to 225.0.0.12 via 10.9.131.80
> Sending to 225.0.0.12 via 10.9.131.83
> Sending to 225.0.0.12 via 192.168.122.1
> Waiting for connection from XVM host daemon.
> Issuing TCP challenge
> Responding to TCP challenge
> TCP Exchange + Authentication done...
> Waiting for return value from XVM host
> Remote: Operation was successful
>
>
> but if I try a real fence (reboot), then I get:
>
> [root at oelcl1 ~]# fence_xvm -H oelc2
> Remote: Operation was successful
> [root at oelcl1 ~]#
>
> but host2 does not reboot.
>
>
> If fence_xvmd is not running on the host, then I get a timeout.
>
>
>
> [root at oelcl1 sysconfig]# fence_xvm -H oelcl2 -ddd -o null
> Debugging threshold is now 3
> -- args @ 0x7fff1a6b5580 --
> args->addr = 225.0.0.12
> args->domain = oelcl2
> args->key_file = /etc/cluster/fence_xvm.key
> args->op = 0
> args->hash = 2
> args->auth = 2
> args->port = 1229
> args->ifindex = 0
> args->family = 2
> args->timeout = 30
> args->retr_time = 20
> args->flags = 0
> args->debug = 3
> -- end args --
> Reading in key file /etc/cluster/fence_xvm.key into 0x7fff1a6b4530 (4096
> max size)
> Actual key length = 4096 bytes
> Sending to 225.0.0.12 via 127.0.0.1
> Sending to 225.0.0.12 via 10.9.131.80
> Waiting for connection from XVM host daemon.
> Sending to 225.0.0.12 via 127.0.0.1
> Sending to 225.0.0.12 via 10.9.131.80
> Waiting for connection from XVM host daemon.
>
>
>
> Q: how can I try if multicast is ok?
>
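One rough way to test whether multicast works at all is a loopback
send/receive on the group and port fence_xvm uses (225.0.0.12:1229, per the
args dumps in this thread). This is only a sketch: it exercises the local IP
stack, not the dom0/domU path or the firewall, and returns False on any
socket error rather than raising:

```python
import socket
import struct

GROUP, PORT = "225.0.0.12", 1229  # defaults shown in the fence_xvm args dumps

def multicast_loopback_check(group=GROUP, port=PORT,
                             payload=b"mcast-probe", timeout=2.0):
    """Send one datagram to the multicast group with loopback enabled and
    try to receive it back; True means the local stack delivers multicast."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        rx.bind(("", port))
        # join the group on all interfaces
        mreq = struct.pack("4sl", socket.inet_aton(group), socket.INADDR_ANY)
        rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        rx.settimeout(timeout)
        tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
        tx.sendto(payload, (group, port))
        data, _ = rx.recvfrom(1024)
        return data == payload
    except (socket.timeout, OSError):
        return False
    finally:
        rx.close()
        tx.close()
```

If this passes locally but the fence still times out, the next suspects are
the firewall and the dom0/domU multicast routing (e.g. running a receiver on
one side and a sender on the other, or a tool like omping).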
> Q: on which network interface must fence_xvmd run on dom0? I notice that on
> the domU there is:
>
> virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
> inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
> inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:40 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 b) TX bytes:7212 (7.0 KiB)
>
>
> so virbr0 is present there as well.
>
> and on the dom0 host:
>
> [root at vm5 ~]# fence_xvmd -fdd -I xenbr0
> -- args @ 0xbfd26234 --
> args->addr = 225.0.0.12
> args->domain = (null)
> args->key_file = /etc/cluster/fence_xvm.key
> args->op = 2
> args->hash = 2
> args->auth = 2
> args->port = 1229
> args->ifindex = 7
> args->family = 2
> args->timeout = 30
> args->retr_time = 20
> args->flags = 1
> args->debug = 2
> -- end args --
> Opened ckpt vm_states
> My Node ID = 1
> Domain UUID Owner State
> ------ ---- ----- -----
> Domain-0 00000000-0000-0000-0000-000000000000 00001 00001
> oelcl1 2a53022c-5836-68f0-4514-02a5a0b07e81 00001 00002
> oelcl2 dd268dd4-f012-e0f7-7c77-aa8a58e1e6ab 00001 00002
> oelcman 09c783bd-9107-0916-ebbf-bd27bcc8babe 00001 00002
> Storing oelcl1
> Storing oelcl2
>
>
>
> [root at vm5 ~]# fence_xvmd -fdd -I virbr0
> -- args @ 0xbfd26234 --
> args->addr = 225.0.0.12
> args->domain = (null)
> args->key_file = /etc/cluster/fence_xvm.key
> args->op = 2
> args->hash = 2
> args->auth = 2
> args->port = 1229
> args->ifindex = 7
> args->family = 2
> args->timeout = 30
> args->retr_time = 20
> args->flags = 1
> args->debug = 2
> -- end args --
> Opened ckpt vm_states
> My Node ID = 1
> Domain UUID Owner State
> ------ ---- ----- -----
> Domain-0 00000000-0000-0000-0000-000000000000 00001 00001
> oelcl1 2a53022c-5836-68f0-4514-02a5a0b07e81 00001 00002
> oelcl2 dd268dd4-f012-e0f7-7c77-aa8a58e1e6ab 00001 00002
> oelcman 09c783bd-9107-0916-ebbf-bd27bcc8babe 00001 00002
> Storing oelcl1
> Storing oelcl2
>
>
> No matter which interface I use, the fence is not done.
>
>
> thx
>
> br jost
>
>
>
>
>
>
>
>
>
> _____________________________________
> From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com]
> On Behalf Of Rakovec Jost [Jost.Rakovec at snt.si]
> Sent: Saturday, September 11, 2010 6:36 PM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] fence in xen
>
> Hi list!
>
>
> I have a question about fence_xvm.
>
> Situation is:
>
> one physical server with Xen --> dom0 with 2 domUs. The cluster works fine
> between the domUs (reboot, relocate, ...).
>
> I'm using Red Hat 5.5.
>
> The problem is fencing from dom0 with "fence_xvm -H oelcl2": the domU is
> destroyed, but when it is booted back up it can't join the cluster. The domU
> takes a very long time to boot --> FENCED_START_TIMEOUT=300
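(FENCED_START_TIMEOUT is the fence-domain join timeout read by the RHEL 5
cman init script from its sysconfig file. Lowering it only shortens the
boot-time wait when a node fails to join; it does not fix the underlying join
problem. A sketch, assuming the stock file location:)

```
# /etc/sysconfig/cman (RHEL 5) - sourced by /etc/init.d/cman
# Seconds to wait for fenced to join the fence domain at startup.
FENCED_START_TIMEOUT=60
```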
>
>
> On the console, after node2 is up, I get:
>
> node2:
>
> INFO: task clurgmgrd:2127 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> clurgmgrd D 0000000000000010 0 2127 2126
> (NOTLB)
> ffff88006f08dda8 0000000000000286 ffff88007cc0b810 0000000000000000
> 0000000000000003 ffff880072009860 ffff880072f6b0c0 00000000000455ec
> ffff880072009a48 ffffffff802649d7
> Call Trace:
> [<ffffffff802649d7>] _read_lock_irq+0x9/0x19
> [<ffffffff8021420e>] filemap_nopage+0x193/0x360
> [<ffffffff80263a7e>] __mutex_lock_slowpath+0x60/0x9b
> [<ffffffff80263ac8>] .text.lock.mutex+0xf/0x14
> [<ffffffff88424b64>] :dlm:dlm_new_lockspace+0x2c/0x860
> [<ffffffff80222b08>] __up_read+0x19/0x7f
> [<ffffffff802d0abb>] __kmalloc+0x8f/0x9f
> [<ffffffff8842b6fa>] :dlm:device_write+0x438/0x5e5
> [<ffffffff80217377>] vfs_write+0xce/0x174
> [<ffffffff80217bc4>] sys_write+0x45/0x6e
> [<ffffffff802602f9>] tracesys+0xab/0xb6
>
>
> during boot on node2:
>
> Starting clvmd: dlm: Using TCP for communications
> clvmd startup timed out
> [FAILED]
>
>
>
> node2:
>
> [root at oelcl2 init.d]# clustat
> Cluster Status for cluster1 @ Sat Sep 11 18:11:21 2010
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> oelcl1 1 Online
> oelcl2 2 Online, Local
>
> [root at oelcl2 init.d]#
>
>
> on first node:
>
> [root at oelcl1 ~]# clustat
> Cluster Status for cluster1 @ Sat Sep 11 18:12:07 2010
> Member Status: Quorate
>
> Member Name ID Status
> ------ ---- ---- ------
> oelcl1 1 Online, Local,
> rgmanager
> oelcl2 2 Online,
> rgmanager
>
> Service Name Owner (Last)
> State
> ------- ---- ----- ------
> -----
> service:webby oelcl1
> started
> [root at oelcl1 ~]#
>
>
> and then I have to destroy both domUs and create them again to get
> node2 working.
>
> I have used the how-tos at https://access.redhat.com/kb/docs/DOC-5937 and
> http://sources.redhat.com/cluster/wiki/VMClusterCookbook
>
>
> cluster config on dom0
>
>
> <?xml version="1.0"?>
> <cluster alias="vmcluster" config_version="1" name="vmcluster">
> <clusternodes>
> <clusternode name="vm5" nodeid="1" votes="1"/>
> </clusternodes>
> <cman/>
> <fencedevices/>
> <rm/>
> <fence_xvmd/>
> </cluster>
>
>
>
> cluster config on domU
>
>
> <?xml version="1.0"?>
> <cluster alias="cluster1" config_version="49" name="cluster1">
> <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="4"/>
> <clusternodes>
> <clusternode name="oelcl1.name.comi" nodeid="1" votes="1">
> <fence>
> <method name="1">
> <device domain="oelcl1"
> name="xenfence1"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="oelcl2.name.com"
> nodeid="2" votes="1">
> <fence>
> <method name="1">
> <device domain="oelcl2"
> name="xenfence1"/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <cman expected_votes="1" two_node="1"/>
> <fencedevices>
> <fencedevice agent="fence_xvm" name="xenfence1"/>
> </fencedevices>
> <rm>
> <failoverdomains>
> <failoverdomain name="prefer_node1" nofailback="0"
> ordered="1" restricted="1">
> <failoverdomainnode name="oelcl1.name.com" priority="1"/>
> <failoverdomainnode name="oelcl2.name.com" priority="2"/>
> </failoverdomain>
> </failoverdomains>
> <resources>
> <ip address="xx.xx.xx.xx" monitor_link="1"/>
> <fs device="/dev/xvdb1" force_fsck="0"
> force_unmount="0" fsid="8669" fstype="ext3" mountpoint="/var/www/html"
> name="docroot" self_fence="0"/>
> <script file="/etc/init.d/httpd" name="apache_s"/>
> </resources>
> <service autostart="1" domain="prefer_node1" exclusive="0"
> name="webby" recovery="relocate">
> <ip ref="xx.xx.xx.xx"/>
> <fs ref="docroot"/>
> <script ref="apache_s"/>
> </service>
> </rm>
> </cluster>
>
>
>
>
> fence processes on dom0
>
> [root at vm5 cluster]# ps -ef |grep fenc
> root 18690 1 0 17:40 ? 00:00:00 /sbin/fenced
> root 18720 1 0 17:40 ? 00:00:00 /sbin/fence_xvmd -I xenbr0
> root 22633 14524 0 18:21 pts/3 00:00:00 grep fenc
> [root at vm5 cluster]#
>
>
> and on domU
>
> [root at oelcl1 ~]# ps -ef|grep fen
> root 1523 1 0 17:41 ? 00:00:00 /sbin/fenced
> root 13695 2902 0 18:22 pts/0 00:00:00 grep fen
> [root at oelcl1 ~]#
>
>
>
> Does somebody have any idea why fencing doesn't work?
>
> thx
>
> br
>
> jost
>
>
>
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>