[Linux-cluster] qdiskd does not call heuristics regularly?
Gerbatsch, Andre
Andre.Gerbatsch at globalfoundries.com
Fri May 13 12:00:23 UTC 2011
A small correction to the timing of the qdiskd heuristic script calls:
dummy: Fri May 13 08:59:16 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 <--qdiskd restart, rval=1
dummy: Fri May 13 08:59:21 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:26 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:31 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:36 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:41 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:51 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:56 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <--changed script, rval=0
dummy: Fri May 13 09:00:01 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:00:06 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:00:11 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- until this point ok (dt=5s)
dummy: Fri May 13 09:01:53 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- below: ?? about every 102s ?
dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- ?? no regular checks ?
dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
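To make the gaps easier to read than by eye, here is a minimal sketch that prints the number of seconds between consecutive log entries. It assumes the log layout shown above (time of day in the fifth whitespace-separated field) and that all entries fall on the same day; log_gaps is just a hypothetical helper name.

```shell
#!/bin/sh
# Print the gap in seconds between consecutive heuristic log entries.
# Assumes lines like "dummy: Fri May 13 08:59:16 CEST 2011 ...", i.e. the
# time of day is field 5, and that all entries are from the same day.
log_gaps() {
    awk '{
        split($5, t, ":")                      # HH:MM:SS -> t[1], t[2], t[3]
        s = t[1] * 3600 + t[2] * 60 + t[3]     # seconds since midnight
        if (NR > 1) print s - prev             # gap to the previous entry
        prev = s
    }' "$@"
}

# Usage: log_gaps /root/root/cluster/checkpvtlink.log
```

Run against the log above, the entries up to 09:00:11 are 5 s apart, and the stretch from 09:01:53 to 09:12:04 shows gaps of 101 to 102 s.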
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gerbatsch, Andre
Sent: Friday, 13 May 2011 12:10
To: 'linux-cluster at redhat.com'
Subject: [Linux-cluster] qdiskd does not call heuristics regularly?
Hello,
I'm at a point where I have received different answers from different experts and have read the qdiskd source code myself, and I would be happy if someone could help me:
I expected that, in my configuration (see below), the heuristic script would be called on a regular basis (every "interval" seconds), so that it has a chance to influence the quorumd score if something happens to the cluster node.
What I see is that there are some cycles during quorum device initialization; after that, the heuristic is called only "from time to time".
Question 1: Is this the expected behavior? If yes, is there a way to have the heuristic called regularly?
Question 2: How can I determine the cman/qdisk version I am using (cman_1_0_???; see rpm -qi cman below)?
The final effect is: if I disconnect one node of a 2-node cluster from the network, the "wrong" node wins, and the heuristic has no influence on the fencing decision.
Thank you in advance for any response
Andre
=================================================
== rpm -qi cman
Name : cman Relocations: (not relocatable)
Version : 2.0.115 Vendor: Red Hat, Inc.
Release : 68.el5_6.1 Build Date: Mon Dec 20 19:28:36 2010
Install Date: Thu Apr 28 11:11:43 2011 Build Host: ls20-bc2-14.build.redhat.com
Group : System Environment/Base Source RPM: cman-2.0.115-68.el5_6.1.src.rpm
Size : 2619414 License: GPL
Signature : DSA/SHA1, Fri Dec 31 06:29:03 2010, Key ID 5326810137017186
Packager : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL : http://sources.redhat.com/cluster/
Summary : cman - The Cluster Manager
Description :
cman - The Cluster Manager
==
cluster.conf:
..
<totem consensus="4800" join="60" token="60000" token_retransmits_before_loss_const="20"/>
<quorumd status_file="/tmp/qdiskd_status" log_level="7" interval="5" device="/dev/mapper/xp1_00p1" tko="5" votes="1">
<heuristic interval="5" program="/root/root/cluster/checkpvtlink.sh eth0" score="1" tko="3"/>
</quorumd>
..
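For reference, the time windows implied by these settings can be worked out with a small sketch. This assumes the usual reading of qdisk(5): a node is declared dead after interval * tko seconds without updates, a heuristic is considered down after tko consecutive failed runs, and the totem token value is in milliseconds. The window helper is hypothetical.

```shell
#!/bin/sh
# Timeouts implied by the cluster.conf excerpt, in seconds.
# window INTERVAL TKO prints INTERVAL * TKO.
window() {
    echo $(( $1 * $2 ))
}

qdisk_timeout=$(window 5 5)     # quorumd interval="5" tko="5"
heur_timeout=$(window 5 3)      # heuristic interval="5" tko="3"
token_s=$((60000 / 1000))       # totem token="60000" (milliseconds)

echo "qdisk eviction window: ${qdisk_timeout}s"
echo "heuristic down after:  ${heur_timeout}s"
echo "totem token timeout:   ${token_s}s"
```

Under those assumptions, a heuristic that really runs every 5 s would be marked down well before the 25 s qdisk window expires, which is why irregular heuristic calls matter for the fencing outcome.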
==
> ps -eLf | grep qdiskd
root 3976 1 3976 0 3 08:59 ? 00:00:00 qdiskd -Q
root 3976 1 3978 0 3 08:59 ? 00:00:00 qdiskd -Q
root 3976 1 4226 0 3 08:59 ? 00:00:00 qdiskd -Q
root 21613 12673 21613 0 1 10:45 pts/0 00:00:00 grep qdiskd
== strace of the "score thread" (hopefully :-)
= it seems to be simply waiting for some timer...
clock_gettime(CLOCK_MONOTONIC, {60774, 182881847}) = 0
clock_gettime(CLOCK_MONOTONIC, {60774, 182920847}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
nanosleep({1, 0}, {1, 0}) = 0
clock_gettime(CLOCK_MONOTONIC, {60775, 202918847}) = 0
clock_gettime(CLOCK_MONOTONIC, {60775, 202961847}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
nanosleep({1, 0}, {1, 0}) = 0
clock_gettime(CLOCK_MONOTONIC, {60776, 222868847}) = 0
clock_gettime(CLOCK_MONOTONIC, {60776, 222912847}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
nanosleep({1, 0}, <unfinished ...>
Process 3978 detached
.. it seems to me that this is the score thread with a "wrong" h->nextrun, but I think I am simply misunderstanding something.
cman/qdiskd/score.c: from http://git.fedorahosted.org/git/?p=cluster.git;a=summary
fork_heuristic(struct h_data *h)
{
...
	now = time(NULL);
	if (now < h->nextrun)
		return 0;

	h->nextrun = now + h->interval;

	pid = fork();
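Read on its own, the gate in this excerpt should let the heuristic through once per interval. Here is a minimal shell sketch of that logic with a simulated clock, assuming the score thread polls once per second (as the nanosleep({1, 0}) calls in the strace suggest) and that the clock never jumps. simulate_gate is a hypothetical helper, not part of qdiskd, and it models only the nextrun test, nothing else in score.c.

```shell
#!/bin/sh
# Simulate the nextrun gate from fork_heuristic() with a fake clock.
# simulate_gate INTERVAL TICKS prints the tick numbers (seconds) at which
# the heuristic would be forked, one per line.
simulate_gate() {
    interval=$1
    ticks=$2
    nextrun=0
    now=0
    while [ "$now" -lt "$ticks" ]; do
        # Same test as in the excerpt: do nothing while now < nextrun.
        if [ "$now" -ge "$nextrun" ]; then
            nextrun=$((now + interval))
            echo "$now"
        fi
        now=$((now + 1))    # one polling cycle, assumed to take 1 s
    done
}

simulate_gate 5 20
```

With interval=5 this prints 0, 5, 10 and 15, i.e. one run every 5 s. So the gate alone does not explain gaps of 100 s or more; either the function is not being reached that often, or something outside this excerpt is involved.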
== output from the heuristic test script
> cat checkpvtlink.sh
#!/bin/sh
rval=0
echo "dummy: $(date) $0 rval=$rval" >> /root/root/cluster/checkpvtlink.log
exit $rval
> tail checkpvtlink.log
dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== service qdiskd restart
dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== why so late ??
dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
Andre Gerbatsch
MTS IT Systems Engineer
Tel +49 (0) 351 277-1762
Fax +49 (0) 351 277-91762
andre.gerbatsch at globalfoundries.com
GLOBALFOUNDRIES Dresden Module Two GmbH & Co. KG
Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland, Sitz Dresden I Registergericht Dresden HRA 4896
--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster