[Linux-cluster] CS5/ Question about behavior with a corrupted Quorum disk

Mon Feb 4 17:18:12 UTC 2008

On Mon, 2008-02-04 at 08:33 +0100, Alain Moulle wrote:
> Hi
> 
> Just for information, I wonder if this behavior is normal:
> I have a two-node cluster with a quorum disk, and the
> CS5 is started on both nodes with a service on each one.
> Quorum is working fine when I break the quorum disk format
> (with a mkfs on the device!) so that mkqdisk -L returns
> none.

It will keep *trying* to operate.

> The behavior is: the CS5 keeps working fine as if nothing
> has happened. I wonder if it is only due to the heuristics or
> if this behavior is simply the standard behavior of CS5 with
> regard to the quorum disk?

It /should/ throw warnings in the log for all the blocks that are
corrupt (and it will probably annoy you ;) ).  After 1 cycle, the blocks
corresponding to active cluster nodes will have correct/current data on
them, and life should continue, but reading the rest of the 16 node
blocks should continue throwing warnings:

[1533] warning: Error reading node ID block 3
[1533] warning: Error reading node ID block 4
[1533] warning: Error reading node ID block 5
[1533] warning: Error reading node ID block 6
[1533] warning: Error reading node ID block 7
...
[1533] warning: Error reading node ID block 16

(Granted, I used 'dd if=/dev/zero ...' instead of mkfs)
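
If the warnings get noisy, a small script can summarize which node ID blocks are being reported as unreadable. This is only a rough sketch: it assumes the log lines look exactly like the excerpt above (syslog may add its own prefix, which the search tolerates), and the function name is made up for illustration rather than part of any cluster tool.

```python
import re

# Matches messages such as "[1533] warning: Error reading node ID block 3".
# The pattern is taken from the excerpt above; other qdiskd versions may
# word the message differently (assumption).
BLOCK_ERROR = re.compile(r"\[(\d+)\] warning: Error reading node ID block (\d+)")


def unreadable_blocks(lines):
    """Return the sorted, de-duplicated node ID blocks reported as unreadable."""
    blocks = set()
    for line in lines:
        match = BLOCK_ERROR.search(line)
        if match:
            blocks.add(int(match.group(2)))
    return sorted(blocks)


sample = [
    "[1533] warning: Error reading node ID block 3",
    "[1533] warning: Error reading node ID block 4",
    "[1533] info: Node 1 is the master",
    "[1533] warning: Error reading node ID block 16",
]
print(unreadable_blocks(sample))  # [3, 4, 16]
```

Blocks that belong to active nodes should drop out of this list after one cycle, so anything that keeps showing up corresponds to a node ID that is not currently writing to the disk.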

Qdiskd will not function if you restart it, however, and nodes will be
unable to find the quorum disk after a reboot.  The header of the quorum
disk is not rewritten while qdiskd is running.  You'll have to run
mkqdisk to fix it - which should also work (but certainly isn't
recommended!).

This produced the following on the non-master node, but nothing
significant on the master node:

[1533] info: Node 1 shutdown
[1533] debug: Making bid for master
[1533] debug: Node 1 is marked master, but is dead.
[1533] debug: Node 1 is marked master, but is dead.
[1533] debug: Node 1 is marked master, but is dead.
[1533] debug: Node 1 is UP
[1533] info: Node 1 is the master
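
To get a feel for how long the non-master node considered the master dead, the messages above can be counted up to the point where a master is announced again. A minimal sketch, assuming the message wording shown in this excerpt; the helper name is hypothetical.

```python
import re

# Message formats copied from the excerpt above (assumed stable).
DEAD_MASTER = re.compile(r"Node (\d+) is marked master, but is dead")
NEW_MASTER = re.compile(r"info: Node (\d+) is the master")


def dead_master_cycles(lines):
    """Count 'marked master, but is dead' messages per node,
    stopping at the first 'is the master' announcement."""
    counts = {}
    for line in lines:
        if NEW_MASTER.search(line):
            break
        match = DEAD_MASTER.search(line)
        if match:
            node = int(match.group(1))
            counts[node] = counts.get(node, 0) + 1
    return counts


log = [
    "[1533] info: Node 1 shutdown",
    "[1533] debug: Making bid for master",
    "[1533] debug: Node 1 is marked master, but is dead.",
    "[1533] debug: Node 1 is marked master, but is dead.",
    "[1533] debug: Node 1 is marked master, but is dead.",
    "[1533] debug: Node 1 is UP",
    "[1533] info: Node 1 is the master",
]
print(dead_master_cycles(log))  # {1: 3}
```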

Looking at the code, if a node dies between the time you clobber the
quorum disk and the time qdiskd on that node writes a new block,
qdiskd won't evict that node.  Solution: Don't rub salt in cuts.

Also, intentionally corrupting your quorum disk could result in the
following:

https://bugzilla.redhat.com/show_bug.cgi?id=430264

-- Lon