[Linux-cluster] qdisk problems during/after network problems

Mon Dec 18 11:14:19 UTC 2006

Hi List,

I am currently testing Red Hat Cluster Suite for a number of two-node
clusters accessing EMC storage systems. Everything seems to be running
fine except for qdisk.

On Friday we had a network problem during which the nodes were still
able to see each other but could not reach any of the addresses used in
my qdisk heuristics. The result was not what I expected: when the
network came back, both nodes claimed to be master.

The quorumd part of my cluster.conf is shown below:
<snip>
        <quorumd interval="1" tko="10" votes="3" log_level="9" log_facility="local4" status_file="/qdisk_status" min_score="3" device="/dev/emcpowerk1">
                <heuristic program="ping 172.23.4.254 -c1 -t1" score="2" interval="2"/>
                <heuristic program="ping 130.246.8.13 -c1 -t3" score="1" interval="2"/>
                <heuristic program="ping 130.246.72.21 -c1 -t3" score="1" interval="2"/>
                <heuristic program="ping 172.23.5.120 -c1 -t1" score="2" interval="2"/>
        </quorumd>
</snip>
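
To make the score arithmetic in this configuration explicit, here is a minimal Python sketch. It assumes that a node's score is simply the sum of the scores of the heuristics that currently succeed, and that the node is eligible when that sum is at least min_score. The function name score_summary and the idea of passing in a set of reachable addresses are hypothetical, for illustration only; this is not how qdiskd itself is implemented.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the quorumd section above (only the attributes used here).
QUORUMD = """
<quorumd interval="1" tko="10" votes="3" min_score="3">
    <heuristic program="ping 172.23.4.254 -c1 -t1" score="2" interval="2"/>
    <heuristic program="ping 130.246.8.13 -c1 -t3" score="1" interval="2"/>
    <heuristic program="ping 130.246.72.21 -c1 -t3" score="1" interval="2"/>
    <heuristic program="ping 172.23.5.120 -c1 -t1" score="2" interval="2"/>
</quorumd>
"""

def score_summary(xml_text, reachable):
    """Return (current, min_score, max_score, eligible).

    `reachable` is a set of addresses assumed to answer ping.
    Assumption: each heuristic program has the form "ping <address> ...".
    """
    root = ET.fromstring(xml_text)
    min_score = int(root.get("min_score"))
    current = 0
    max_score = 0
    for h in root.findall("heuristic"):
        score = int(h.get("score"))
        target = h.get("program").split()[1]  # the pinged address
        max_score += score
        if target in reachable:
            current += score
    return current, min_score, max_score, current >= min_score

# No heuristic address reachable, as during the outage:
print(score_summary(QUORUMD, set()))
# One of the 1-point addresses unreachable:
print(score_summary(QUORUMD, {"172.23.4.254", "130.246.8.13", "172.23.5.120"}))
```

Under these assumptions, losing all four addresses gives 0 out of a maximum of 6, which is below the required 3, while losing a single 1-point address still leaves 5.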

/qdisk_status on one node while everything seems to be running fine:
<snip>
Node ID: 2
Score (current / min req. / max allowed): 6 / 3 / 6
Current state: Running
Current disk state: None
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 1 2 }
</snip>

After a "/etc/init.d/qdiskd restart" I find the following in the log
files (which looks fine to me):

Dec 18 10:50:40 duoserv2 qdiskd[24304]: <info> Quorum Daemon Initializing
Dec 18 10:50:40 duoserv2 qdiskd: Starting the Quorum Disk Daemon: succeeded
Dec 18 10:50:47 duoserv2 qdiskd[24304]: <info> Node 1 is the master
Dec 18 10:50:50 duoserv2 qdiskd[24304]: <info> Initial score 6/6
Dec 18 10:50:50 duoserv2 qdiskd[24304]: <info> Initialization complete

And finally during the network issue last week I found the following log
entries:

Dec 15 09:53:48 duoserv2 qdiskd[31393]: <info> Node 1 shutdown
Dec 15 09:53:48 duoserv2 qdiskd[31393]: <notice> Score insufficient for master operation (0/3; max=6); downgrading
Dec 15 09:53:48 duoserv2 clurgmgrd[7950]: <emerg> #1: Quorum Dissolved
Dec 15 09:53:48 duoserv2 kernel: CMAN: quorum lost, blocking activity
Dec 15 09:53:48 duoserv2 ccsd[5595]: Cluster is not quorate.  Refusing connection.
Dec 15 09:53:48 duoserv2 ccsd[5595]: Error while processing connect: Connection refused
Dec 15 09:53:48 duoserv2 ccsd[5595]: Invalid descriptor specified (-111).
Dec 15 09:53:48 duoserv2 ccsd[5595]: Someone may be attempting something evil.
Dec 15 09:53:48 duoserv2 ccsd[5595]: Error while processing get: Invalid request descriptor

And later when the network came back:
Dec 15 10:31:45 duoserv2 qdiskd[31393]: <notice> Score sufficient for master operation (6/3; max=6); upgrading
Dec 15 10:31:46 duoserv2 qdiskd[31393]: <info> Assuming master role
Dec 15 10:31:47 duoserv2 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <notice> Quorum Achieved
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Magma Event: Membership Change
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: Local UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: duoserv1 UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Loading Service Data
Dec 15 10:31:47 duoserv2 ccsd[5595]: Cluster is quorate.  Allowing connections.
Dec 15 10:31:50 duoserv2 clurgmgrd: [7950]: <info> /dev/mapper/logs1-logs1 is not mounted
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> Critical Error: More than one master found!
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> A master exists, but it's not me?!
Dec 15 10:31:52 duoserv2 qdiskd[31393]: <info> Node 1 is the master
...

At the same time on the second node:
Dec 15 10:31:45 duoserv1 qdiskd[316]: <notice> Score sufficient for master operation (5/3; max=6); upgrading
Dec 15 10:31:46 duoserv1 qdiskd[316]: <info> Assuming master role
Dec 15 10:31:47 duoserv1 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv1 ccsd[5624]: Cluster is quorate.  Allowing connections.
Dec 15 10:31:47 duoserv1 clurgmgrd[3631]: <notice> Quorum Achieved
Dec 15 10:31:51 duoserv1 qdiskd[316]: <crit> Critical Error: More than one master found!
Dec 15 10:31:52 duoserv1 qdiskd[316]: <info> Node 2 is the master
Dec 15 10:31:52 duoserv1 qdiskd[316]: <crit> Critical Error: More than one master found!
...

This continued until I finally noticed and restarted qdiskd on both
nodes, at which point they agreed on one master again.
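
Since I only spotted this by hand, something like the following could flag it earlier. This is a rough Python sketch, assuming syslog lines in the format shown above, collected from both nodes. The function name masters_claimed and the same-second matching rule are my own simplifications, not anything qdiskd provides.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Dec 15 10:31:46 duoserv2 qdiskd[31393]: <info> Assuming master role
LINE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) qdiskd\[\d+\]: <\w+> (.*)$"
)

def masters_claimed(lines):
    """Return {timestamp: hosts} for timestamps at which more than one
    host logged "Assuming master role".

    Simplification: claims are only compared within the same second.
    """
    claims = defaultdict(set)
    for line in lines:
        m = LINE.match(line)
        if m and m.group(3) == "Assuming master role":
            claims[m.group(1)].add(m.group(2))
    return {ts: hosts for ts, hosts in claims.items() if len(hosts) > 1}

sample = [
    "Dec 15 10:31:46 duoserv2 qdiskd[31393]: <info> Assuming master role",
    "Dec 15 10:31:46 duoserv1 qdiskd[316]: <info> Assuming master role",
    "Dec 15 10:31:52 duoserv1 qdiskd[316]: <info> Node 2 is the master",
]
print(masters_claimed(sample))
```

With the three sample lines, both duoserv1 and duoserv2 are reported for 10:31:46, matching what the logs above show.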

I have the following packages installed on both nodes:
ccs-1.0.7-0
rgmanager-1.9.54-1
lvm2-cluster-2.02.01-1.2.RHEL4
cman-1.0.11-0
cman-kernel-smp-2.6.9-43.8.5
fence-1.32.25-1
cman-kernel-smp-2.6.9-45.8

The running kernel is: 2.6.9-42.0.3.ELsmp

Does anyone have any idea what I could do to avoid this situation in the
future?

If I can provide any more information, please ask.

Many thanks,
Frederik

-- 
Frederik Ferner 
Systems Administrator                  Phone: +44 (0)1235-778624
Diamond Light Source                   Fax:   +44 (0)1235-778468