[Linux-cluster] Xen network config -> Fence problem - More info

Madison Kelly linux at alteeve.com
Sat Oct 31 05:01:23 UTC 2009


   After sending this, I went back to debugging the problem. The 
machines had stopped fencing and the DRBD link was down.

   So first I stopped and then started 'xend', which got the Xen-type 
networking up. I left the machines alone for about ten minutes to see if 
they would fence one another; they didn't.
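
   (To be clear, "stopped and started" was nothing fancier than the 
stock init script:

   service xend stop
   service xend start

and that is what brought the Xen-type networking back up.)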

   So then I set about fixing DRBD. I got the array re-syncing and 
thought I might have gotten things working, but about 15 to 30 seconds 
after DRBD came back online, one node fenced the other again. It may 
have been a coincidence, but the last command I ran before the fence was 
'pvdisplay', to check the LVM PVs. That command never returned and may 
have been the trigger; I am not sure.
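
   (Roughly, getting the resync going was the usual drbdadm routine; 
'r0' below is only a placeholder for whatever the resource is actually 
named:

   cat /proc/drbd        # check connection and role state
   drbdadm connect r0    # on both nodes, to re-establish the link

and, if one side had come up StandAlone after a split-brain, 
'drbdadm -- --discard-my-data connect r0' on the node to be 
overwritten.)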

   So it looks like they fence each other until DRBD breaks. Once the 
array is fixed and/or pvdisplay is called, the fence loop starts again.
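
   For anyone following along, the membership flap can be watched from 
the surviving node while the loop runs, e.g.:

   cman_tool nodes      # cman's view of cluster membership
   clustat              # overall status, if rgmanager is installed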

Madi

Madison Kelly wrote:
> Hi all,
> 
>   I've got CentOS 5.3 installed on two nodes (simple two node cluster). 
> On this, I've got a DRBD partition running cluster-aware LVM. I use this 
> to host VMs under Xen.
> 
>   I've got a problem where I am trying to use eth0 as a back channel for 
> the VMs on either node via a firewall VM. The network setup on each node 
> is:
> 
> eth0: back channel and IPMI, connected only to an internal network.
> eth1: dedicated DRBD link.
> eth2: Internet-facing interface.
> 
>   I want to get eth0 and eth2 under Xen's networking, but the default 
> config leaves eth0 alone. Specifically, the convirt-xen-multibridge 
> script is set to:
> 
> "$dir/network-bridge" "$@" vifnum=0 netdev=peth0 bridge=xenbr0
> 
>   When I change this to:
> 
> "$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0
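> 
>   (For completeness: as far as I can tell, convirt-xen-multibridge 
> just runs network-bridge once per interface, so there is a second 
> entry for eth2 along these lines; I'd have to double-check the exact 
> vifnum and bridge name:
> 
> "$dir/network-bridge" "$@" vifnum=1 netdev=eth2 bridge=xenbr1
> 
>   I only touched the eth0 line.)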
> 
>   One of the nodes will soon fence the other, and when it comes back up 
> it fences the first. Eventually one node stays up and constantly fences 
> the other.
> 
>   The node that gets fenced (vsh02) prints this repeatedly to the log 
> just before it goes down:
> 
> Oct 31 00:27:21 vsh02 openais[3133]: [TOTEM] FAILED TO RECEIVE
> Oct 31 00:27:21 vsh02 openais[3133]: [TOTEM] entering GATHER state from 6.
> 
>   And the node that stays up prints this:
> 
> Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] The token was lost in the 
> OPERATIONAL state.
> Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] Receive multicast socket 
> recv buffer size (288000 bytes).
> Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] Transmit multicast socket 
> send buffer size (262142 bytes).
> Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] entering GATHER state from 2.
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering GATHER state from 0.
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Creating commit token 
> because I am the rep.
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Saving state aru 2c high 
> seq received 2c
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Storing new sequence id for 
> ring 108
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering COMMIT state.
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering RECOVERY state.
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] position [0] member 
> 10.255.135.3:
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] previous ring seq 260 rep 
> 10.255.135.2
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] aru 2c high delivered 2c 
> received flag 1
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Did not need to originate 
> any messages in recovery.
> Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Sending initial ORF token
> Oct 31 00:35:51 vsh03 openais[3237]: [CLM  ] CLM CONFIGURATION CHANGE
> Oct 31 00:35:51 vsh03 openais[3237]: [CLM  ] New Configuration:
> Oct 31 00:35:51 vsh03 kernel: dlm: closing connection to node 1
> Oct 31 00:35:51 vsh03 fenced[3256]: vsh02.domain.com not a cluster 
> member after 0 sec post_fail_delay
> Oct 31 00:35:51 vsh03 openais[3237]: [CLM  ]     r(0) ip(10.255.135.3)
> Oct 31 00:35:51 vsh03 fenced[3256]: fencing node "vsh02.domain.com"
> 
>   If I leave it long enough, the failed node (vsh02 in this case) 
> stops getting fenced, but the Xen networking doesn't come up. 
> Specifically, no vifX.Y, xenbrX, or other devices get created.
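> 
>   (I'm checking with the usual tools, e.g.:
> 
> brctl show
> ip link show
> 
> and neither lists any xenbrX or vifX.Y interfaces.)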
> 
>   Any idea what might be going on? I really need to get eth0 virtualized 
> so that I can get routing to work.
> 
> Thanks!
> 
> Madi



