[Linux-cluster] cman + qdisk timeouts....

Juan Ramon Martin Blanco robejrm at gmail.com
Tue Jul 7 10:21:02 UTC 2009


On Mon, Jun 15, 2009 at 4:17 PM, Moralejo, Alfredo <
alfredo.moralejo at roche.com> wrote:

>  Hi,
>
>
>
> I’m having what I think is a timeout issue in my cluster.
>
>
>
> I have a two-node cluster using qdisk. Every time the node holding the
> qdisk master role goes down (because of a failure, or even when qdiskd is
> stopped manually), the packages on the healthy node are stopped due to
> loss of quorum, because qdiskd becomes unresponsive until the second node
> becomes master and starts working properly. Once qdiskd is working again
> (usually after 5-6 seconds) the packages are started again.
>
>
>
> I’ve read the section “CMAN membership timeout
> value” in the cluster manual, and I think this is the problem. I’m using
> RHEL 5.3, and I understand this parameter to be the totem token, which I
> have set much longer than needed:
>
>
>
> <cluster alias="CLUSTER_ENG" config_version="75" name="CLUSTER_ENG">
>
>         <totem token="50000"/>
>
>
>
>         <quorumd device="/dev/mapper/mpathquorump1" interval="3"
> status_file="/tmp/qdisk" tko="3" votes="5" log_level="7"
> log_facility="local4"/>
>
>
>
>
>
> The totem token is more than double the qdisk timeout, so I guess it
> should be enough, but every time qdisk dies on the master node I get the
> same result: services restarted on the healthy node:
>
>
>
> Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (2/3)
>
> Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (3/3)
>
> Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (4/3)
>
> Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 DOWN
>
> Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Making bid for master
>
> Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: <info> Executing
> /etc/init.d/watchdog status
>
> Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (5/3)
>
> Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (6/3)
>
> *Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <info> Assuming master role*
>
>
>
> Message from syslogd at rmamseslab07 at Jun 15 16:11:53 ...
>
>  clurgmgrd[18510]: <emerg> #1: Quorum Dissolved
>
> Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with
> quorum device
>
> Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking
> activity
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Membership Change
> Event
>
> *Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <emerg> #1: Quorum
> Dissolved*
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of
> service:Cluster_test_2
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of
> service:wdtcscript-rmamseslab05-ic
>
> Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of
> service:wdtcscript-rmamseslab07-ic
>
> Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of
> service:Logical volume 1
>
> Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update
> (7/3)
>
> Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <notice> Writing eviction
> notice for node 1
>
> Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Telling CMAN to kill
> the node
>
> *Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained,
> resuming activity*
>
>
>
> I’ve just logged a support case, but… any ideas?
>
>
>
> Regards,
>
Hi!

Have you set two_node="0" in the cman section?
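For a two-node cluster backed by a 5-vote quorum disk, the cman section would typically look something like this (a sketch only: expected_votes assumes one vote per node plus the five qdisk votes shown in your quorumd line — adjust to your real configuration):

```xml
<!-- Sketch: two_node is disabled because the quorum disk supplies the
     tie-breaking votes; expected_votes = 2 node votes + 5 qdisk votes. -->
<cman two_node="0" expected_votes="7"/>
```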
Why don't you use any heuristics in the quorumd configuration? E.g.
pinging a router...
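A ping heuristic can be added inside the existing quorumd element, for example (a sketch: the router address 192.168.1.1 is a placeholder, and the score/interval/tko values are illustrative, not tuned recommendations):

```xml
<quorumd device="/dev/mapper/mpathquorump1" interval="3" tko="3" votes="5"
         log_level="7" log_facility="local4" status_file="/tmp/qdisk">
        <!-- Placeholder router IP: the node must be able to ping it to be
             considered healthy enough to keep its quorum-disk votes. -->
        <heuristic program="ping -c1 -t1 192.168.1.1" score="1"
                   interval="2" tko="3"/>
</quorumd>
```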
Could you paste us your cluster.conf?

Greetings,
Juanra


>
>
>
>
> *Alfredo Moralejo*
> Business Platforms Engineering - OS Servers - UNIX Senior Specialist
>
> F. Hoffmann-La Roche Ltd.
>
> Global Informatics Group Infrastructure
> Josefa Valcárcel, 40
> 28027 Madrid SPAIN
>
> Phone: +34 91 305 97 87
>
> alfredo.moralejo at roche.com
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


More information about the Linux-cluster mailing list