[Linux-cluster] What are the recommended settings when using a multipathed device for my cluster's quorum disk?
Ricardo Masashi Maeda
ricardo.maeda at webbertek.com.br
Mon Jun 7 17:41:53 UTC 2010
Hi, everybody,
We configured our qdisk/cman/multipath timeout settings based on the following KB article: http://kbase.redhat.com/faq/docs/DOC-2882
The cluster runs RHCS 5.4 with PowerPath 5.3.1 (1).
I used the following values, as shown in cluster.conf (2):
PowerPath failover = X = 45 seconds
qdisk failover = X * 1.3 = 58.5 seconds (tko = 59, with interval = 1 s)
cman failover = X * 2.7 = 121.5 seconds (token = 122000 ms)
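For anyone checking the arithmetic, here is a minimal sketch of how the values above follow from X. The function name `derive_timeouts` is made up for illustration, and rounding up to the next whole second is an assumption that happens to match the 59 and 122000 used in cluster.conf; the multipliers 1.3 and 2.7 are the ones listed above.

```python
import math

def derive_timeouts(multipath_failover_s: int) -> dict:
    """Derive qdisk and cman timeouts from the multipath failover time X."""
    # Use 13/10 and 27/10 on an integer X so the .5 results stay exact.
    qdisk_s = multipath_failover_s * 13 / 10
    cman_s = multipath_failover_s * 27 / 10
    return {
        "qdisk_failover_s": qdisk_s,
        # Assumes quorumd interval="1", so tko equals the timeout in seconds.
        "tko": math.ceil(qdisk_s),
        "cman_failover_s": cman_s,
        # totem token is expressed in milliseconds.
        "token_ms": math.ceil(cman_s) * 1000,
    }

timeouts = derive_timeouts(45)
print(timeouts)
```

With a different PowerPath failover time, only the argument changes; the rounding choice stays an assumption.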
However, in a simple test where we removed the heartbeat interface, it took about 6 minutes to fence one of the nodes (3).
We'd like to know whether this behavior is expected.
I'd really appreciate any help with this!
Thanks!
(1) [root@mercurio dell]# rpm -qi EMCpower.LINUX
Name : EMCpower.LINUX Relocations: /
Version : 5.3.1.00.00 Vendor: EMC, Inc.
Release : 111 Build Date: Thu 13 Aug 2009 04:01:31 PM BRT
Install Date: Wed 02 Jun 2010 03:01:44 PM BRT Build Host: lsca2111.lss.emc.com
Group : System Environment/Kernel Source RPM: EMCpower.LINUX-5.3.1.00.00-111.src.rpm
Size : 22070425 License: Copyright (c) 2002-2009, EMC Corporation. All Rights Reserved.
Signature : (none)
Summary : EMC PowerPath
Description :
Multi-path software providing fail-over and load-sharing for SCSI disks.
(2) Source: /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="clu-informix" config_version="17" name="clu-informix">
  <fence_daemon clean_start="0" post_fail_delay="30" post_join_delay="5"/>
  <clusternodes>
    <clusternode name="clu-urano" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fence_urano"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="clu-gemini" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fence_gemini"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman quorum_dev_poll="50000" expected_votes="3"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="gemini-ipmi" login="cluster" name="fence_gemini" passwd="clusteraguia" method="cycle"/>
    <fencedevice agent="fence_ipmilan" ipaddr="urano-ipmi" login="cluster" name="fence_urano" passwd="clusteraguia" method="cycle"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="srvkrm" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="clu-urano" priority="1"/>
        <failoverdomainnode name="clu-gemini" priority="1"/>
      </failoverdomain>
      <failoverdomain name="srvvdsa" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="clu-urano" priority="1"/>
        <failoverdomainnode name="clu-gemini" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    ... # Removed service and resource tags
  </rm>
  <totem token="122000"/>
  <quorumd device="/dev/emcpowera1" interval="1" min_score="1" tko="59" votes="1"/>
</cluster>
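To make the effective timeouts in this configuration easier to read, here is a rough sketch that pulls the timing attributes out of a trimmed copy of the file above with Python's standard `xml.etree.ElementTree`. The trimmed string and the `read_timeouts` helper are illustrative only; it treats the qdisk timeout as interval * tko, which is how the tko value was chosen above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above: only the timing-related elements.
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster alias="clu-informix" config_version="17" name="clu-informix">
  <fence_daemon clean_start="0" post_fail_delay="30" post_join_delay="5"/>
  <cman quorum_dev_poll="50000" expected_votes="3"/>
  <totem token="122000"/>
  <quorumd device="/dev/emcpowera1" interval="1" min_score="1" tko="59" votes="1"/>
</cluster>
"""

def read_timeouts(conf_xml: str) -> dict:
    """Return the timing settings of a cluster.conf document, in seconds."""
    root = ET.fromstring(conf_xml)
    quorumd = root.find("quorumd")
    # qdisk declares a node dead after `tko` missed cycles of `interval` seconds.
    qdisk_s = int(quorumd.get("interval")) * int(quorumd.get("tko"))
    return {
        "qdisk_timeout_s": qdisk_s,
        # token and quorum_dev_poll are given in milliseconds.
        "token_s": int(root.find("totem").get("token")) / 1000,
        "quorum_dev_poll_s": int(root.find("cman").get("quorum_dev_poll")) / 1000,
        "post_fail_delay_s": int(root.find("fence_daemon").get("post_fail_delay")),
    }

summary = read_timeouts(CLUSTER_CONF)
for name, value in summary.items():
    print(f"{name}: {value}")
```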
(3) Heartbeat tests:
[root@gemini ~]# clustat
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 clu-urano                             1 Online, rgmanager
 clu-gemini                            2 Online, Local, rgmanager
 /dev/emcpowera1                       0 Online, Quorum Disk

 Service Name              Owner (Last)              State
 ------- ----              ----- ------              -----
 service:srvkrm            clu-urano                 started
 service:srvvdsa           clu-urano                 started
(3.1) Removed the heartbeat interface on the gemini server at Jun 7, 13:55:07.
(3.2) About 60-80 seconds later, gemini logged 'token lost'.
Jun 7 13:56:28 gemini openais[5922]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun 7 13:56:28 gemini openais[5922]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Jun 7 13:56:28 gemini openais[5922]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jun 7 13:56:28 gemini openais[5922]: [TOTEM] entering GATHER state from 2.
(3.2) Then, 121 seconds later, urano logged the second 'token lost'.
Jun 7 13:58:29 urano openais[5837]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun 7 13:58:29 urano openais[5837]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Jun 7 13:58:29 urano openais[5837]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jun 7 13:58:29 urano openais[5837]: [TOTEM] entering GATHER state from 2.
(3.3) About 122 seconds after that, gemini saw node urano leave the cluster.
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] entering GATHER state from 0.
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] Creating commit token because I am the rep.
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] Saving state aru 34 high seq received 34
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] Storing new sequence id for ring 140
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] entering COMMIT state.
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] entering RECOVERY state.
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] position [0] member 10.1.1.32:
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] previous ring seq 316 rep 10.1.1.32
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] aru 34 high delivered 34 received flag 1
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] Did not need to originate any messages in recovery.
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] Sending initial ORF token
Jun 7 14:00:32 gemini openais[5922]: [CLM ] CLM CONFIGURATION CHANGE
Jun 7 14:00:32 gemini openais[5922]: [CLM ] New Configuration:
Jun 7 14:00:32 gemini openais[5922]: [CLM ] r(0) ip(10.1.1.32)
Jun 7 14:00:32 gemini openais[5922]: [CLM ] Members Left:
Jun 7 14:00:32 gemini openais[5922]: [CLM ] r(0) ip(10.1.1.39)
Jun 7 14:00:32 gemini openais[5922]: [CLM ] Members Joined:
Jun 7 14:00:32 gemini openais[5922]: [CLM ] CLM CONFIGURATION CHANGE
Jun 7 14:00:32 gemini openais[5922]: [CLM ] New Configuration:
Jun 7 14:00:32 gemini kernel: dlm: closing connection to node 1
Jun 7 14:00:32 gemini openais[5922]: [CLM ] r(0) ip(10.1.1.32)
Jun 7 14:00:32 gemini openais[5922]: [CLM ] Members Left:
Jun 7 14:00:32 gemini openais[5922]: [CLM ] Members Joined:
Jun 7 14:00:32 gemini openais[5922]: [SYNC ] This node is within the primary component and will provide service.
Jun 7 14:00:32 gemini openais[5922]: [TOTEM] entering OPERATIONAL state.
Jun 7 14:00:32 gemini openais[5922]: [CLM ] got nodejoin message 10.1.1.32
Jun 7 14:00:32 gemini openais[5922]: [CPG ] got joinlist message from node 2
(3.3) 48 seconds later (post_fail_delay), urano was fenced.
Jun 7 14:01:20 gemini fenced[5971]: clu-urano not a cluster member after 48 sec post_fail_delay
Jun 7 14:01:20 gemini fenced[5971]: fencing node "clu-urano"
Jun 7 14:01:20 gemini fenced[5971]: fence "clu-urano" success
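To put numbers on the "about 6 minutes", here is a small sketch that adds up the gaps between the timestamps quoted above (the interface removal time from 3.1 and the first log line of each excerpt). The event labels and the `elapsed_seconds` helper are only for illustration, and all timestamps are assumed to fall on the same day.

```python
from datetime import datetime

# (label, time) pairs copied from the test description and log excerpts above.
EVENTS = [
    ("heartbeat interface removed on gemini", "13:55:07"),
    ("token lost on gemini", "13:56:28"),
    ("token lost on urano", "13:58:29"),
    ("urano left the cluster (seen from gemini)", "14:00:32"),
    ("urano fenced", "14:01:20"),
]

def elapsed_seconds(events):
    """Return the number of seconds between each pair of consecutive events."""
    times = [datetime.strptime(t, "%H:%M:%S") for _, t in events]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

steps = elapsed_seconds(EVENTS)
total = sum(steps)
for (label, _), secs in zip(EVENTS[1:], steps):
    print(f"+{secs:>3} s  {label}")
print(f"total: {total} s ({total // 60} min {total % 60} s)")
```

By these timestamps the two middle steps are each close to the 122-second token value, and the whole sequence from interface removal to fencing takes a little over 6 minutes.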
*Ricardo Masashi Maeda*
Oracle Consultant / DBA
ricardo.maeda at webbertek.com.br
*Webbertek - Professional IT Services*
+55 (41) 4063-8448 - landline
+55 (41) 8834-8354 - mobile