[Linux-cluster] [cluster-linux] rejoining cluster after being fenced
MARY, Mathieu
Mathieu.MARY at neufcegetel.fr
Mon Mar 17 14:12:30 UTC 2008
Hello,
I am currently running a two-node RH5.1 cluster with openais 0.80.3-13 and
cman 2.0.80-1.
Both nodes are hosted on VMware ESX 3.0.2 servers. Fencing works fine, but
here is my issue:
whenever I simulate the failure of a node (shutting down eth0 or doing a
hard reboot), the node is fenced but can never rejoin the cluster.
Log from VMClutest01:
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering COMMIT state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering RECOVERY state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] position [0] member 10.148.46.50:
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] previous ring seq 7692 rep 10.148.46.50
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] aru c high delivered c received flag 1
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] position [1] member 10.148.46.51:
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] previous ring seq 7688 rep 10.148.46.51
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] aru b high delivered b received flag 1
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] Did not need to originate any messages in recovery.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] Sending initial ORF token
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] CLM CONFIGURATION CHANGE
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] New Configuration:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] r(0) ip(10.148.46.50)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] Members Left:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] Members Joined:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] CLM CONFIGURATION CHANGE
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] New Configuration:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] r(0) ip(10.148.46.50)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] r(0) ip(10.148.46.51)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] Members Left:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] Members Joined:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM ] r(0) ip(10.148.46.51)
Mar 17 14:24:32 VMClutest01 openais[1941]: [SYNC ] This node is within the primary component and will provide service.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering OPERATIONAL state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [MAIN ] Killing node VMClutest02 because it has rejoined the cluster with existing state
Is there anything to do after a failure on one node to make it rejoin
the cluster in a "clean" state?
If I try to cleanly restart node 2 with "shutdown -r now", it hangs on
stopping cluster services.
If I hard reboot node 2, it can never rejoin the cluster, and the log is
the same as above.
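To check whether every rejoin attempt ends the same way, the log can be scanned for the kill message quoted above. This is only a minimal sketch: the function name find_rejoin_kills is hypothetical, and the pattern assumes the message text and the 15-character syslog timestamp look exactly as in the excerpt from VMClutest01.

```python
import re

# Pattern for the openais [MAIN ] message seen in the log excerpt.
# \s+ is used between words so that a mail-wrapped copy of the line,
# once rejoined with spaces, still matches.
KILL_RE = re.compile(
    r"\[MAIN\s*\]\s+Killing\s+node\s+(?P<node>\S+)\s+because\s+it\s+has\s+"
    r"rejoined\s+the\s+cluster\s+with\s+existing\s+state"
)


def find_rejoin_kills(lines):
    """Return a list of (timestamp, killed_node) tuples, one per kill message."""
    hits = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            # Assumption: classic syslog timestamp, e.g. "Mar 17 14:24:32",
            # occupies the first 15 characters of the line.
            hits.append((line[:15], match.group("node")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_kills.py < /var/log/messages
    for stamp, node in find_rejoin_kills(sys.stdin):
        print(stamp, node)
```

Run against the messages file of each node, this lists when and which node was killed for rejoining with existing state, which makes it easier to line the events up with the fencing and reboot times.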
My cluster.conf:
<?xml version="1.0"?>
<cluster alias="TestClu01" config_version="9" name="TestClu01">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
  <clusternodes>
    <clusternode name="VMClutest01" nodeid="1" votes="1">
      <fence>
        <method name="FENCESX">
          <device name="ESX01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="VMClutest02" nodeid="2" votes="1">
      <fence>
        <method name="FENCESX">
          <device name="ESX02"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice name="ESX01" agent="fence_vi3" ipaddr="10.148.45.206"
                 port="VMClutest01" login="" passwd=" "/>
    <fencedevice name="ESX02" agent="fence_vi3" ipaddr="10.148.45.206"
                 port="VMClutest02" login="" passwd=" "/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AppCluster" ordered="0" restricted="0">
        <failoverdomainnode name="VMClutest01" priority="1"/>
        <failoverdomainnode name="VMClutest02" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.148.46.55" monitor_link="1"/>
    </resources>
    <service autostart="1" domain="AppCluster" exclusive="0"
             name="AppServer" recovery="restart">
      <ip ref="10.148.46.55"/>
    </service>
  </rm>
  <totem consensus="4800" join="1000" token="5000"
         token_retransmits_before_loss_const="20"/>
</cluster>
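For anyone comparing this file against their own setup, the settings most relevant to fencing and rejoining can be pulled out with a few lines of Python. This is a rough sketch using only the standard library; the function name summarize_cluster_conf is hypothetical, and it assumes the element and attribute names are exactly those used in the cluster.conf above.

```python
import xml.etree.ElementTree as ET


def summarize_cluster_conf(xml_text):
    """Extract quorum, fence daemon, and per-node fence device settings."""
    root = ET.fromstring(xml_text)
    cman = root.find("cman")
    fence_daemon = root.find("fence_daemon")

    # Map each cluster node name to the fence devices listed under it.
    node_devices = {}
    for node in root.findall("clusternodes/clusternode"):
        devices = [d.get("name") for d in node.findall("fence/method/device")]
        node_devices[node.get("name")] = devices

    return {
        "cluster": root.get("name"),
        "two_node": cman.get("two_node") if cman is not None else None,
        "expected_votes": cman.get("expected_votes") if cman is not None else None,
        "clean_start": fence_daemon.get("clean_start") if fence_daemon is not None else None,
        "post_join_delay": fence_daemon.get("post_join_delay") if fence_daemon is not None else None,
        "node_devices": node_devices,
    }


if __name__ == "__main__":
    # Assumption: the usual location of the file on a cman-based cluster.
    with open("/etc/cluster/cluster.conf") as handle:
        for key, value in summarize_cluster_conf(handle.read()).items():
            print(key, "=", value)
```

Running it on both nodes and comparing the output is a quick way to confirm that they are using the same configuration.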
Any ideas?
Mathieu