[Linux-cluster] qdiskd master election and loss of quorum

Gianluca Cecchi gianluca.cecchi at gmail.com
Thu Nov 5 14:28:07 UTC 2009


On Tue, 03 Nov 2009 08:15:05 -0500 Lon Hohberger  wrote:

> Though it's a bit odd that stopping node 1 causes a loss of quorum on
> node2. :(


I'm experiencing the same behaviour with a two-node cluster on CentOS 5.4
openais-0.80.6-8.el5_4.1
cman-2.0.115-1.el5_4.3
rgmanager-2.0.52-1.el5.centos.2

Here are the relevant lines in cluster.conf; below is the simulated scenario.
[root at mork ~]# egrep "totem|quorum" /etc/cluster/cluster.conf
    <totem token="162000"/>
    <cman quorum_dev_poll="80000" expected_votes="3" two_node="0"/>
    <quorumd device="/dev/sda" interval="5" label="clummquorum" log_facility="local4" log_level="7" tko="16" votes="1">
    </quorumd>

The white paper Alain referred to, apart from being about multipath as he
already wrote, says only that quorum_dev_poll must be less than the totem
token, and that quorum_dev_poll should be configured to be greater than the
multipath failover value (but we don't have multipath here...).
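
As a sanity check of the values I'm using (my own arithmetic, not something
stated in the white paper): qdiskd should declare a node dead after roughly
interval * tko seconds, so

    interval * tko  =  5 s * 16    =  80 s    (qdisk failure detection)
    quorum_dev_poll =  80000 ms    =  80 s
    totem token     =  162000 ms   =  162 s

i.e. quorum_dev_poll matches the qdisk detection window and is less than the
token, which seems consistent with the recommendations.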

- mork is the second node; it has no services active and its qdiskd is not
the master at this moment:
logs on mork
[root at mork ~]# tail -f /var/log/messages
Nov  5 12:35:41 mork ricci: startup succeeded
Nov  5 12:35:42 mork clurgmgrd: [2633]: <err>   node2   owns vg_cl1/lv_cl1
unable to stop
Nov  5 12:35:42 mork clurgmgrd[2633]: <notice> stop on lvm "CL1" returned 1
(generic error)
Nov  5 12:35:42 mork clurgmgrd: [2633]: <err>   node2   owns vg_cl2/lv_cl2
unable to stop
Nov  5 12:35:42 mork clurgmgrd[2633]: <notice> stop on lvm "CL2" returned 1
(generic error)
Nov  5 12:36:02 mork qdiskd[2214]: <info> Node 2 is the master
Nov  5 12:36:52 mork qdiskd[2214]: <info> Initial score 1/1
Nov  5 12:36:52 mork qdiskd[2214]: <info> Initialization complete
Nov  5 12:36:52 mork openais[2185]: [CMAN ] quorum device registered
Nov  5 12:36:52 mork qdiskd[2214]: <notice> Score sufficient for master
operation (1/1; required=1); upgrading


- shutdown of the other node (mindy), which currently owns the three services
(note that mindy shuts down cleanly)
logs on mork
Nov  5 12:52:53 mork clurgmgrd[2633]: <notice> Member 2 shutting down
Nov  5 12:52:57 mork qdiskd[2214]: <info> Node 2 shutdown
Nov  5 12:52:58 mork clurgmgrd[2633]: <notice> Starting stopped service
service:MM1SRV
Nov  5 12:52:58 mork clurgmgrd[2633]: <notice> Starting stopped service
service:MM2SRV
Nov  5 12:52:58 mork clurgmgrd[2633]: <notice> Starting stopped service
service:MM3SRV
Nov  5 12:52:58 mork clurgmgrd: [2633]: <notice> Activating vg_cl1/lv_cl1
Nov  5 12:52:58 mork clurgmgrd: [2633]: <notice> Making resilient : lvchange
-ay vg_cl1/lv_cl1
Nov  5 12:52:59 mork clurgmgrd: [2633]: <notice> Activating vg_cl2/lv_cl2
Nov  5 12:52:59 mork clurgmgrd: [2633]: <notice> Resilient command: lvchange
-ay vg_cl1/lv_cl1 --config
devices{filter=["a|/dev/hda2|","a|/dev/hdb1|","a|/dev/sdb1|","a|/dev/sdc1|","r|.*|"]}

Nov  5 12:52:59 mork clurgmgrd: [2633]: <notice> Making resilient : lvchange
-ay vg_cl2/lv_cl2
Nov  5 12:52:59 mork clurgmgrd: [2633]: <notice> Resilient command: lvchange
-ay vg_cl2/lv_cl2 --config
devices{filter=["a|/dev/hda2|","a|/dev/hdb1|","a|/dev/sdb1|","a|/dev/sdc1|","r|.*|"]}

Nov  5 12:52:59 mork kernel: kjournald starting.  Commit interval 5 seconds
Nov  5 12:52:59 mork kernel: EXT3 FS on dm-3, internal journal
Nov  5 12:52:59 mork kernel: EXT3-fs: mounted filesystem with ordered data
mode.
Nov  5 12:52:59 mork kernel: kjournald starting.  Commit interval 5 seconds
Nov  5 12:52:59 mork kernel: EXT3 FS on dm-4, internal journal
Nov  5 12:52:59 mork kernel: EXT3-fs: mounted filesystem with ordered data
mode.
Nov  5 12:53:15 mork clurgmgrd[2633]: <err> #75: Failed changing service
status
Nov  5 12:53:30 mork clurgmgrd[2633]: <err> #75: Failed changing service
status
Nov  5 12:53:30 mork clurgmgrd[2633]: <notice> Stopping service
service:MM3SRV
Nov  5 12:53:32 mork qdiskd[2214]: <info> Assuming master role
Nov  5 12:53:45 mork clurgmgrd[2633]: <err> #52: Failed changing RG status
Nov  5 12:53:45 mork clurgmgrd[2633]: <crit> #13: Service service:MM3SRV
failed to stop cleanly

- clustat was run several times on mork during this phase (note the timeout
messages)
[root at mork ~]# clustat
Timed out waiting for a response from Resource Group Manager
Cluster Status for clumm @ Thu Nov  5 12:54:08 2009
Member Status: Quorate

 Member Name                                          ID   Status
 ------ ----                                          ---- ------
 node1                                                   1 Online, Local
 node2                                                   2 Offline
 /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_scsi0-hd0      0 Online, Quorum Disk

[root at mork ~]# clustat
Service states unavailable: Temporary failure; try again
Cluster Status for clumm @ Thu Nov  5 12:54:14 2009
Member Status: Quorate

 Member Name                                          ID   Status
 ------ ----                                          ---- ------
 node1                                                   1 Online, Local
 node2                                                   2 Offline
 /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_scsi0-hd0      0 Online, Quorum Disk

[root at mork ~]# clustat
Service states unavailable: Temporary failure; try again
Cluster Status for clumm @ Thu Nov  5 12:54:15 2009
Member Status: Quorate

 Member Name                                          ID   Status
 ------ ----                                          ---- ------
 node1                                                   1 Online, Local
 node2                                                   2 Offline
 /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_scsi0-hd0      0 Online, Quorum Disk


[root at mork ~]# clustat
Timed out waiting for a response from Resource Group Manager
Cluster Status for clumm @ Thu Nov  5 12:54:46 2009
Member Status: Quorate

 Member Name                                          ID   Status
 ------ ----                                          ---- ------
 node1                                                   1 Online, Local
 node2                                                   2 Offline
 /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_scsi0-hd0      0 Online, Quorum Disk

- service manager is running
[root at mork ~]# service rgmanager status
clurgmgrd (pid  2632) is running...

- cman_tool command outputs
[root at mork ~]# cman_tool services
type             level name       id       state
fence            0     default    00010001 none
[1]
dlm              1     rgmanager  00020001 none
[1]

[root at mork ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2009-11-05 12:36:52  /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_scsi0-hd0
   1   M     52   2009-11-05 12:35:30  node1
   2   X     56                        node2

[root at mork ~]# cman_tool status
Version: 6.2.0
Config Version: 7
Cluster Name: clumm
Cluster Id: 3243
Cluster Member: Yes
Cluster Generation: 56
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 2
Quorum: 2
Active subsystems: 9
Flags: Dirty
Ports Bound: 0 177
Node name: node1
Node ID: 1
Multicast addresses: 239.192.12.183
Node addresses: 172.16.0.11
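
If I read the vote math correctly (my interpretation of the numbers above):

    quorum      = expected_votes / 2 + 1  =  3 / 2 + 1  =  2
    total votes = node1 (1) + quorum disk (1)           =  2   >= quorum (2)

so the remaining node stays quorate after mindy leaves, as clustat also
reports (Member Status: Quorate).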

- now clustat gives output, but the services remain in "starting" and never
go to "started"
[root at mork ~]# clustat
Cluster Status for clumm @ Thu Nov  5 12:55:16 2009
Member Status: Quorate

 Member Name                                          ID   Status
 ------ ----                                          ---- ------
 node1                                                   1 Online, Local, rgmanager
 node2                                                   2 Offline
 /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_scsi0-hd0      0 Online, Quorum Disk

 Service Name                     Owner (Last)         State
 ------- ----                     ----- ------         -----
 service:MM1SRV                   node1                starting
 service:MM2SRV                   node1                starting
 service:MM3SRV                   node1                starting

- latest entries in messages
[root at mork ~]# tail -f  /var/log/messages
Nov  5 12:53:45 mork clurgmgrd[2633]: <crit> #13: Service service:MM3SRV
failed to stop cleanly
Nov  5 12:54:00 mork clurgmgrd[2633]: <err> #75: Failed changing service
status
Nov  5 12:54:15 mork clurgmgrd[2633]: <err> #57: Failed changing RG status
Nov  5 12:54:15 mork clurgmgrd[2633]: <notice> Stopping service
service:MM1SRV
Nov  5 12:54:30 mork clurgmgrd[2633]: <notice> Stopping service
service:MM2SRV
Nov  5 12:54:30 mork clurgmgrd[2633]: <err> #52: Failed changing RG status
Nov  5 12:54:30 mork clurgmgrd[2633]: <crit> #13: Service service:MM1SRV
failed to stop cleanly
Nov  5 12:54:45 mork clurgmgrd[2633]: <err> #52: Failed changing RG status
Nov  5 12:54:45 mork clurgmgrd[2633]: <crit> #13: Service service:MM2SRV
failed to stop cleanly
Nov  5 12:55:00 mork clurgmgrd[2633]: <err> #57: Failed changing RG status

- new entries in messages
[root at mork ~]# tail -f  /var/log/messages
Nov  5 12:54:30 mork clurgmgrd[2633]: <err> #52: Failed changing RG status
Nov  5 12:54:30 mork clurgmgrd[2633]: <crit> #13: Service service:MM1SRV
failed to stop cleanly
Nov  5 12:54:45 mork clurgmgrd[2633]: <err> #52: Failed changing RG status
Nov  5 12:54:45 mork clurgmgrd[2633]: <crit> #13: Service service:MM2SRV
failed to stop cleanly
Nov  5 12:55:00 mork clurgmgrd[2633]: <err> #57: Failed changing RG status
Nov  5 12:55:15 mork clurgmgrd[2633]: <err> #57: Failed changing RG status
Nov  5 12:55:41 mork openais[2185]: [TOTEM] The token was lost in the
OPERATIONAL state.
Nov  5 12:55:41 mork openais[2185]: [TOTEM] Receive multicast socket recv
buffer size (320000 bytes).
Nov  5 12:55:41 mork openais[2185]: [TOTEM] Transmit multicast socket send
buffer size (262142 bytes).
Nov  5 12:55:41 mork openais[2185]: [TOTEM] entering GATHER state from 2.
Nov  5 12:55:46 mork openais[2185]: [TOTEM] entering GATHER state from 0.
Nov  5 12:55:46 mork openais[2185]: [TOTEM] Creating commit token because I
am the rep.
Nov  5 12:55:46 mork openais[2185]: [TOTEM] Saving state aru 64 high seq
received 64
Nov  5 12:55:46 mork openais[2185]: [TOTEM] Storing new sequence id for ring
3c
Nov  5 12:55:46 mork openais[2185]: [TOTEM] entering COMMIT state.
Nov  5 12:55:46 mork openais[2185]: [TOTEM] entering RECOVERY state.
Nov  5 12:55:46 mork openais[2185]: [TOTEM] position [0] member 172.16.0.11:

Nov  5 12:55:46 mork openais[2185]: [TOTEM] previous ring seq 56 rep
172.16.0.11
Nov  5 12:55:46 mork openais[2185]: [TOTEM] aru 64 high delivered 64
received flag 1
Nov  5 12:55:46 mork openais[2185]: [TOTEM] Did not need to originate any
messages in recovery.
Nov  5 12:55:46 mork openais[2185]: [TOTEM] Sending initial ORF token
Nov  5 12:55:46 mork openais[2185]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  5 12:55:46 mork openais[2185]: [CLM  ] New Configuration:
Nov  5 12:55:46 mork kernel: dlm: closing connection to node 2
Nov  5 12:55:46 mork openais[2185]: [CLM  ]     r(0) ip(172.16.0.11)
Nov  5 12:55:46 mork openais[2185]: [CLM  ] Members Left:
Nov  5 12:55:46 mork openais[2185]: [CLM  ]     r(0) ip(172.16.0.12)
Nov  5 12:55:46 mork openais[2185]: [CLM  ] Members Joined:
Nov  5 12:55:46 mork openais[2185]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  5 12:55:46 mork openais[2185]: [CLM  ] New Configuration:
Nov  5 12:55:46 mork openais[2185]: [CLM  ]     r(0) ip(172.16.0.11)
Nov  5 12:55:46 mork openais[2185]: [CLM  ] Members Left:
Nov  5 12:55:46 mork openais[2185]: [CLM  ] Members Joined:
Nov  5 12:55:46 mork openais[2185]: [SYNC ] This node is within the primary
component and will provide service.
Nov  5 12:55:46 mork openais[2185]: [TOTEM] entering OPERATIONAL state.
Nov  5 12:55:46 mork openais[2185]: [CLM  ] got nodejoin message 172.16.0.11

Nov  5 12:55:46 mork openais[2185]: [CPG  ] got joinlist message from node 1


- services remain in "starting"
[root at mork ~]# clustat
Cluster Status for clumm @ Thu Nov  5 12:58:47 2009
Member Status: Quorate

 Member Name                                          ID   Status
 ------ ----                                          ---- ------
 node1                                                   1 Online, Local, rgmanager
 node2                                                   2 Offline
 /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_scsi0-hd0      0 Online, Quorum Disk

 Service Name                     Owner (Last)         State
 ------- ----                     ----- ------         -----
 service:MM1SRV                   node1                starting
 service:MM2SRV                   node1                starting
 service:MM3SRV                   node1                starting

- services MM1SRV and MM2SRV are ip+fs (/cl1 and /cl2 respectively): they are
active, so it seems everything was done correctly, except that they never
pass from "starting" to "started"....
Also MM3SRV, which is an ip-only service, has been started

[root at mork ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       5808616   4045884   1462908  74% /
/dev/hda1               101086     38786     57081  41% /boot
tmpfs                   447656         0    447656   0% /dev/shm
/dev/mapper/vg_cl1-lv_cl1
                       4124352   1258064   2656780  33% /cl1
/dev/mapper/vg_cl2-lv_cl2
                       4124352   1563032   2351812  40% /cl2

[root at mork ~]# ip addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen
1000
    link/ether 54:52:00:6a:cb:ba brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.101/24 brd 192.168.122.255 scope global eth0
    inet 192.168.122.113/24 scope global secondary eth0   <--- MM3SRV ip
    inet 192.168.122.111/24 scope global secondary eth0   <--- MM1SRV ip
    inet 192.168.122.112/24 scope global secondary eth0   <--- MM2SRV ip
    inet6 fe80::5652:ff:fe6a:cbba/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen
1000
    link/ether 54:52:00:00:0c:c5 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.11/12 brd 172.31.255.255 scope global eth1
    inet6 fe80::5652:ff:fe00:cc5/64 scope link
       valid_lft forever preferred_lft forever
4: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
[root at mork ~]#

- I wait a couple of hours
[root at mork ~]# clustat
Cluster Status for clumm @ Thu Nov  5 15:22:23 2009
Member Status: Quorate

 Member Name                                          ID   Status
 ------ ----                                          ---- ------
 node1                                                   1 Online, Local, rgmanager
 node2                                                   2 Offline
 /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_scsi0-hd0      0 Online, Quorum Disk

 Service Name                     Owner (Last)         State
 ------- ----                     ----- ------         -----
 service:MM1SRV                   node1                starting
 service:MM2SRV                   node1                starting
 service:MM3SRV                   node1                starting

- resource groups are unlocked:
[root at mork ~]# clusvcadm -S
Resource groups unlocked

- [root at mork ~]# clusvcadm -e MM3SRV
Local machine trying to enable service:MM3SRV...Service is already running

Note that the other node is still powered off.
- So to resolve the situation I have to do a disable/enable sequence, which
causes downtime (the IP alias is removed and the file systems are unmounted
in my case); a scripted sketch of the same sequence is at the end of this
message:
[root at mork ~]# clusvcadm -d MM3SRV
Local machine disabling service:MM3SRV...Success

[root at mork ~]# clustat
Cluster Status for clumm @ Thu Nov  5 15:25:49 2009
Member Status: Quorate

 Member Name                                          ID   Status
 ------ ----                                          ---- ------
 node1                                                   1 Online, Local, rgmanager
 node2                                                   2 Offline
 /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_scsi0-hd0      0 Online, Quorum Disk

 Service Name                     Owner (Last)         State
 ------- ----                     ----- ------         -----
 service:MM1SRV                   node1                starting
 service:MM2SRV                   node1                starting
 service:MM3SRV                   (node1)              disabled

[root at mork ~]# clusvcadm -e MM3SRV
Local machine trying to enable service:MM3SRV...Success
service:MM3SRV is now running on node1
[root at mork ~]# clusvcadm -d MM1SRV
Local machine disabling service:MM1SRV...Success
[root at mork ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       5808616   4047656   1461136  74% /
/dev/hda1               101086     38786     57081  41% /boot
tmpfs                   447656         0    447656   0% /dev/shm
/dev/mapper/vg_cl2-lv_cl2
                       4124352   1563032   2351812  40% /cl2
[root at mork ~]# clusvcadm -e MM1SRV
Local machine trying to enable service:MM1SRV...Success
service:MM1SRV is now running on node1
[root at mork ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       5808616   4047664   1461128  74% /
/dev/hda1               101086     38786     57081  41% /boot
tmpfs                   447656         0    447656   0% /dev/shm
/dev/mapper/vg_cl2-lv_cl2
                       4124352   1563032   2351812  40% /cl2
/dev/mapper/vg_cl1-lv_cl1
                       4124352   1258064   2656780  33% /cl1
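
For the record, the same disable/enable workaround written as a small loop
over the three services (just a sketch of what I did by hand; it still means
a short outage per service, since disabling removes the IP alias and unmounts
the file system):

    for svc in MM1SRV MM2SRV MM3SRV; do
        clusvcadm -d "$svc"    # disable the stuck service
        clusvcadm -e "$svc"    # re-enable it; it finally reaches "started"
    done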


Gianluca