[Linux-cluster] Problem with "<emerg> #1: Quorum Dissolved"
Agnieszka Kukałowicz
qqlka at nask.pl
Tue Mar 11 15:59:55 UTC 2008
Hi,
During some tests I got errors like "<emerg> #1: Quorum Dissolved" ...
My cluster has 6 nodes that are virtual services running on 2 physical
nodes. On each node there are 3 virtual services:
Member Name   ID   Status
------ ----   ---- ------
w2.local      1    Online, Local, rgmanager
w1.local      2    Online, rgmanager
Service Name          Owner (Last)   State
------- ----          ----- ------   -----
vm:VM_Work11_RHEL51   w1.local       started
vm:VM_Work12_RHEL51   w1.local       started
vm:VM_Work13_RHEL51   w1.local       started
vm:VM_Work21_RHEL51   w2.local       started
vm:VM_Work22_RHEL51   w2.local       started
vm:VM_Work23_RHEL51   w2.local       started
On the 6-node cluster I am running 2 httpd services (in a restricted
failover domain).
Member Status: Quorate
Member Name   ID   Status
------ ----   ---- ------
w11.local     1    Online, rgmanager
w12.local     2    Online, rgmanager
w13.local     3    Online, rgmanager
w21.local     4    Online, Local, rgmanager
w22.local     5    Online, rgmanager
w23.local     6    Online, rgmanager
/dev/xvdd1    0    Online, Quorum Disk
Service Name        Owner (Last)   State
------- ----        ----- ------   -----
service:httpd_w11   w11.local      started
service:httpd_w21   w21.local      started
After shutting down the w11.local node, this cluster should keep running
normally because there is still quorum (the quorum device has 5 votes).
The httpd_w11 service should go down, but the httpd_w21 service should
stay up (the w21.local node is running). That does not happen.
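As a rough check of the vote arithmetic behind that expectation, here is a minimal sketch. It assumes each of the 6 nodes has 1 vote (the post only states the quorum device's 5 votes) and that the quorum threshold is the simple majority floor(expected_votes / 2) + 1. The function names are hypothetical and are not part of any cluster tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum, assuming the majority rule floor(n / 2) + 1."""
    return expected_votes // 2 + 1


def is_quorate(online_nodes: int, node_votes: int, qdisk_votes: int,
               total_nodes: int, qdisk_online: bool = True) -> bool:
    """Compare the votes currently present against the quorum threshold."""
    # Expected votes: every configured node plus the quorum disk.
    expected = total_nodes * node_votes + qdisk_votes
    # Present votes: online nodes, plus the quorum disk if it is counted.
    present = online_nodes * node_votes + (qdisk_votes if qdisk_online else 0)
    return present >= quorum_threshold(expected)


if __name__ == "__main__":
    # Assumption: 6 nodes with 1 vote each, quorum disk with 5 votes.
    for online in (5, 3):
        print(online, "nodes online, quorate:", is_quorate(online, 1, 5, 6))
```

Under these assumptions the threshold is 6 of 11 expected votes, so 5 nodes plus the quorum disk, or even 3 nodes plus the quorum disk, remain quorate. If the quorum disk's votes are not counted, 3 nodes alone fall below the threshold.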
On w21.local I get an error that quorum is dissolved and the cluster is
not quorate. It takes some time before the cluster is quorate again.
During that time rgmanager is not working and service httpd_w21 is down.
After regaining quorum I get the error:
Mar 11 16:16:20 w21 clurgmgrd[1946]: <err> #34: Cannot get status for
service service:httpd_w21
When all members of the cluster are online, clustat shows:
1. on w21.local
Member Status: Quorate
Member Name   ID   Status
------ ----   ---- ------
w11.local     1    Online, rgmanager
w12.local     2    Online, rgmanager
w13.local     3    Online, rgmanager
w21.local     4    Online, Local, rgmanager
w22.local     5    Online, rgmanager
w23.local     6    Online, rgmanager
/dev/xvdd1    0    Online, Quorum Disk
Service Name        Owner (Last)   State
------- ----        ----- ------   -----
service:httpd_w11   w11.local      started
2. on w11.local, w12.local and w13.local, which were fenced:
Member Status: Quorate
Member Name   ID   Status
------ ----   ---- ------
w11.local     1    Online, Local
w12.local     2    Online
w13.local     3    Online
w21.local     4    Online
w22.local     5    Online
w23.local     6    Online
/dev/xvdd1    0    Online, Quorum Disk
clustat shows that rgmanager is not running, but the logs contain:
Mar 11 14:56:27 w11 clurgmgrd[1942]: <notice> Resource Group Manager
Starting
Mar 11 14:56:37 w11 clurgmgrd[1942]: <err> #34: Cannot get status for
service service:httpd_w11
Mar 11 14:56:37 w11 clurgmgrd[1942]: <err> #34: Cannot get status for
service service:httpd_w21
3. on w23.local:
Member Status: Quorate
Member Name   ID   Status
------ ----   ---- ------
w11.local     1    Online, rgmanager
w12.local     2    Online, rgmanager
w13.local     3    Online, rgmanager
w21.local     4    Online, rgmanager
w22.local     5    Online, rgmanager
w23.local     6    Online, Local, rgmanager
/dev/xvdd1    0    Online, Quorum Disk
Service Name        Owner (Last)   State
------- ----        ----- ------   -----
service:httpd_w11   w11.local      started
service:httpd_w21   w21.local      started
So, depending on the node, the state of the cluster is different. The
problems are:
1. after fencing nodes w11, w12, w13 the quorum is dissolved
2. services that should run on the remaining working nodes go down
3. after bringing up the fenced nodes, rgmanager has a different view of
services on each node.
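To make problem 3 concrete, the per-node views can be compared mechanically. The following is only a minimal sketch, not a tool that ships with the cluster suite. It assumes the clustat text from each node has already been collected into strings, and the names `parse_services`, `diff_views` and `VIEWS` are hypothetical. The sample data is the service section of the three clustat outputs quoted above.

```python
def parse_services(clustat_text: str) -> dict:
    """Map service name -> (owner, state) from clustat-style text."""
    services = {}
    for line in clustat_text.splitlines():
        fields = line.split()
        # Service rows look like: service:httpd_w11 w11.local started
        if len(fields) == 3 and fields[0].startswith(("service:", "vm:")):
            name, owner, state = fields
            services[name] = (owner, state)
    return services


def diff_views(views: dict) -> dict:
    """Return the services on which the nodes' views disagree."""
    parsed = {node: parse_services(text) for node, text in views.items()}
    all_names = set()
    for svc in parsed.values():
        all_names.update(svc)
    report = {}
    for name in sorted(all_names):
        # None means the node's clustat output did not list the service.
        seen = {node: svc.get(name) for node, svc in parsed.items()}
        if len(set(seen.values())) > 1:
            report[name] = seen
    return report


# Service rows as reported on three of the nodes in this post.
VIEWS = {
    "w21.local": "service:httpd_w11 w11.local started\n",
    "w11.local": "",
    "w23.local": ("service:httpd_w11 w11.local started\n"
                  "service:httpd_w21 w21.local started\n"),
}

if __name__ == "__main__":
    for service, seen in diff_views(VIEWS).items():
        print(service, seen)
```

With the sample data, both httpd services are reported as inconsistent, since only w23.local lists httpd_w21 and w11.local lists no services at all.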
I can't always reproduce this bug. Sometimes everything goes OK, but
that happens quite rarely.
Cheers
Agnieszka Kukalowicz