[Linux-cluster] [cluster-linux] rejoining cluster after being fenced

MARY, Mathieu Mathieu.MARY at neufcegetel.fr
Mon Mar 17 14:12:30 UTC 2008


Hello,

I currently run a two-node RHEL 5.1 cluster with openais 0.80.3-13 and cman 2.0.80-1.

 

Both nodes are hosted on VMware ESX 3.0.2 servers. Fencing works fine, but here is my issue:

 

Whenever I simulate the failure of a node (shutting down eth0 or a hard reboot), the node is fenced, but it can never rejoin the cluster afterwards.
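
For reference, the simulated failure is roughly one of the following two actions on the node that should get fenced (standard RHEL 5 commands, nothing cluster-specific):

    # on the node whose failure I simulate
    ifdown eth0            # drop the cluster interface (or: ifconfig eth0 down)
    # or: hard reset / power off the VM from the VI client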

 

Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering COMMIT state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering RECOVERY state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] position [0] member 10.148.46.50:
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] previous ring seq 7692 rep 10.148.46.50
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] aru c high delivered c received flag 1
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] position [1] member 10.148.46.51:
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] previous ring seq 7688 rep 10.148.46.51
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] aru b high delivered b received flag 1
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] Did not need to originate any messages in recovery.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] Sending initial ORF token
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] New Configuration:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ]      r(0) ip(10.148.46.50)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] Members Left:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] Members Joined:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] New Configuration:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ]      r(0) ip(10.148.46.50)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ]      r(0) ip(10.148.46.51)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] Members Left:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] Members Joined:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ]      r(0) ip(10.148.46.51)
Mar 17 14:24:32 VMClutest01 openais[1941]: [SYNC ] This node is within the primary component and will provide service.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering OPERATIONAL state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [MAIN ] Killing node VMClutest02 because it has rejoined the cluster with existing state
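
In case it helps, these are the standard cman 2.0 / rgmanager tools that can be run on node 1 to inspect membership while the above happens (output omitted from this mail):

    cman_tool status    # quorum, expected votes, two_node flag
    cman_tool nodes     # membership as cman sees it
    group_tool ls       # state of the fence domain and lock groups
    clustat             # rgmanager's view of members and services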

 

 

Is there anything to do after a failure of one node to make it rejoin the cluster in a "clean" state?

If I try to cleanly restart node 2 with "shutdown -r now", it hangs on stopping cluster services.

If I hard reboot node 2, it can never rejoin the cluster and the log is the same as above.
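
For what it's worth, the manual stop/start sequence I would expect to give a clean rejoin once node 2 is back up is roughly the following (assuming the stock RHEL 5.1 init scripts; cman and rgmanager are the only cluster services here, and the shutdown hang presumably happens somewhere in the stop steps):

    # on node 2, once it is back up
    service rgmanager stop    # stop the resource manager first
    service cman stop         # leave the cluster; this also stops fenced and openais
    service cman start        # rejoin with fresh state
    service rgmanager start   # restart the resource manager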

 

 

My cluster.conf:

 

<?xml version="1.0"?>
<cluster alias="TestClu01" config_version="9" name="TestClu01">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
    <clusternodes>
        <clusternode name="VMClutest01" nodeid="1" votes="1">
            <fence>
                <method name="FENCESX">
                    <device name="ESX01"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="VMClutest02" nodeid="2" votes="1">
            <fence>
                <method name="FENCESX">
                    <device name="ESX02"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
        <fencedevice name="ESX01" agent="fence_vi3" ipaddr="10.148.45.206" port="VMClutest01" login="" passwd=" "/>
        <fencedevice name="ESX02" agent="fence_vi3" ipaddr="10.148.45.206" port="VMClutest02" login="" passwd=" "/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="AppCluster" ordered="0" restricted="0">
                <failoverdomainnode name="VMClutest01" priority="1"/>
                <failoverdomainnode name="VMClutest02" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="10.148.46.55" monitor_link="1"/>
        </resources>
        <service autostart="1" domain="AppCluster" exclusive="0" name="AppServer" recovery="restart">
            <ip ref="10.148.46.55"/>
        </service>
    </rm>
    <totem consensus="4800" join="1000" token="5000" token_retransmits_before_loss_const="20"/>
</cluster>
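
For completeness, the way I understand a fence agent can be exercised outside of fenced is to feed it the same name=value pairs on stdin that fenced would pass (the keys below simply mirror the fencedevice attributes above; I am not sure of the exact action keyword fence_vi3 expects, so treat this purely as a sketch):

    # hypothetical manual test of the ESX fence device for node 1
    printf 'ipaddr=10.148.45.206\nlogin=...\npasswd=...\nport=VMClutest01\noption=reboot\n' | fence_vi3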

 

 

 

Any ideas?

 

Mathieu

 

 

 
