[Linux-cluster] Qdisk and Reboot problem

jose nuno neto jose.neto at liber4e.com
Tue Jan 18 11:05:11 UTC 2011


Hi, we have several Red Hat clusters running cman-2.0.115-34.el5, and yesterday
we faced an incident I can't figure out:

We have a 2-node cluster plus qdisk, connected through iSCSI with dual paths.
We had a network issue and one of the paths to the quorum disk became
unavailable. After that I see some qdisk eviction messages and the system
rebooted shortly after, followed by fencing from the 2nd node some seconds
later.

That shouldn't be the correct behavior, right? I have reboot="1" in the qdisk
settings, but as far as I understand it that is only supposed to act on a
heuristics downgrade, if I've read it right.
I tried to simulate this but can't get the same behavior: on the test
systems nothing happens after the qdisk becomes unavailable, which is OK.
I'm posting some info below; any hints on how to investigate would help.
Disabling reboot should prevent this anyway, yes?
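
For reference, the change I have in mind is just flipping that flag. This is an
untested sketch; I'm assuming, from my reading of qdisk(5), that reboot only
controls whether qdiskd reboots the node itself when its score drops below
min_score, and that with reboot="0" it would only withdraw the quorum device
vote instead:

<quorumd max_error_cycles="10" tko_up="2" master_wait="2" allow_kill="0"
         interval="2" label="OracleOne_Quorum" min_score="1" reboot="0"
         tko="20" votes="1" log_level="7" status_file="/tmp/Quorumstatus">

(identical to what we run now, only reboot flipped to "0")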

cluster.conf:
<totem token="15000"/>
<quorumd max_error_cycles="10" tko_up="2" master_wait="2" allow_kill="0"
         interval="2" label="OracleOne_Quorum" min_score="1" reboot="1"
         tko="20" votes="1" log_level="7" status_file="/tmp/Quorumstatus">
  <heuristic interval="5"
             program="/bin/ping `cat /etc/sysconfig/network | awk -F '=' '/GATEWAY/ {print $2}'` -c1 -t2"
             score="1" tko="50"/>
........
<fence_daemon clean_start="0" post_fail_delay="60" post_join_delay="3"/>
<cman expected_votes="3"/>
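
For what it's worth, my own arithmetic on the timers above (please correct me
if these formulas are wrong):

  qdisk eviction window:    interval * tko             = 2 s * 20 = 40 s
  heuristic failure window: interval * tko (heuristic) = 5 s * 50 = 250 s
  totem token timeout:      token                      = 15 s

The 40 s window at least matches the qdiskd warnings in the log below
("read (system call) has hung for 20 seconds", "In 20 more seconds, we will
be evicted").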


cman_tool status:
Version: 6.2.0
Config Version: 23
Cluster Name: OracleOne
Cluster Id: 47365
Cluster Member: Yes
Cluster Generation: 444
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 9
Flags: Dirty
Ports Bound: 0 177
Node ID: 2
.....

cat /tmp/Quorumstatus
Time Stamp: Tue Jan 18 11:54:17 2011
Node ID: 2
Score: 1/1 (Minimum required = 1)
Current state: Running
Initializing Set: { }
Visible Set: { 1 2 }
Master Node ID: 1


/var/log/messages:
Jan 17 17:53:35 <kern.err> NODE2 kernel:  connection1:0: ping timeout of 5
secs expired, recv timeout 5, last rx 4554104323, last ping 4554109323,
now 4554114323
Jan 17 17:53:35 <kern.info> NODE2 kernel:  connection1:0: detected conn
error (1011)
Jan 17 17:53:36 <daemon.warn> NODE2 iscsid: Kernel reported iSCSI
connection 1:0 error (1011) state (3)
Jan 17 17:53:36 <daemon.warn> NODE2 iscsid: Kernel reported iSCSI
connection 1:0 error (1011) state (3)
Jan 17 17:53:37 <kern.err> NODE2 kernel:  connection2:0: ping timeout of 5
secs expired, recv timeout 5, last rx 4554106362, last ping 4554111362,
now 4554116362
Jan 17 17:53:37 <kern.info> NODE2 kernel:  connection2:0: detected conn
error (1011)
Jan 17 17:53:38 <daemon.warn> NODE2 iscsid: Kernel reported iSCSI
connection 2:0 error (1011) state (3)
Jan 17 17:53:38 <daemon.warn> NODE2 iscsid: Kernel reported iSCSI
connection 2:0 error (1011) state (3)
Jan 17 17:53:42 <kern.info> NODE2 kernel:  session2: session recovery
timed out after 5 secs
Jan 17 17:53:42 <kern.info> NODE2 kernel: sd 12:0:0:1: SCSI error: return
code = 0x000f0000
Jan 17 17:53:42 <kern.warn> NODE2 kernel: end_request: I/O error, dev
sdau, sector 8
Jan 17 17:53:42 <kern.warn> NODE2 kernel: device-mapper: multipath:
Failing path 66:224.
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: dm-116: remove map (uevent)
Jan 17 17:53:42 <daemon.warn> NODE2 multipathd: sdau: tur checker reports
path is down
Jan 17 17:53:42 <daemon.warn> NODE2 multipathd: sdau: tur checker reports
path is down
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: checker failed path
66:224 in map mpath_iSCSI_qdisk
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: checker failed path
66:224 in map mpath_iSCSI_qdisk
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: mpath_iSCSI_qdisk:
remaining active paths: 1
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: mpath_iSCSI_qdisk:
remaining active paths: 1
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: dm-75: add map (uevent)
Jan 17 17:53:42 <daemon.notice> NODE2 multipathd: dm-75: add map (uevent)
Jan 17 17:53:44 <local4.info> NODE2 openais[8952]: [CMAN ] lost contact
with quorum device
Jan 17 17:53:49 <daemon.warn> NODE2 qdiskd[8981]: <warning> qdiskd: read
(system call) has hung for 20 seconds
Jan 17 17:53:49 <daemon.warn> NODE2 qdiskd[8981]: <warning> qdiskd: read
(system call) has hung for 20 seconds
Jan 17 17:53:49 <daemon.warn> NODE2 qdiskd[8981]: <warning> In 20 more
seconds, we will be evicted
Jan 17 17:53:49 <daemon.warn> NODE2 qdiskd[8981]: <warning> In 20 more
seconds, we will be evicted
Jan 17 17:55:10 <kern.info> NODE2 kernel: md: stopping all md devices. (
Shutdown )
Jan 17 17:55:12 <kern.info> NODE2 kernel: bonding: bond2: link status down
for interface eth0, disabling it in 2000 ms.

2nd node /var/log/messages:
Jan 17 17:55:57 <kern.err> NODE1 kernel: dlm: closing connection to node 2
Jan 17 17:56:57 <daemon.info> NODE1 fenced[8982]: NODE2-cl not a cluster
member after 60 sec post_fail_delay
Jan 17 17:56:57 <daemon.info> NODE1 fenced[8982]: fencing node "NODE2-cl"




