[Linux-cluster] qdiskd does not call heuristics regularly?

Gerbatsch, Andre Andre.Gerbatsch at globalfoundries.com
Fri May 13 12:00:23 UTC 2011


A small correction to the qdiskd heuristic script timing:
dummy: Fri May 13 08:59:16 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 <--qdiskd restart, rval=1
dummy: Fri May 13 08:59:21 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:26 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:31 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:36 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:41 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:51 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:56 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <--changed script, rval=0
dummy: Fri May 13 09:00:01 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:00:06 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:00:11 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- until this point ok (dt=5s)
dummy: Fri May 13 09:01:53 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- below: ?? every ~102s ?
dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- ?? no regular checks ?
dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
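
To put numbers on the spacing, the gaps between consecutive log entries can be computed from the timestamps. This is a minimal sketch, assuming the log format shown above (time of day in the fifth field) and that all entries fall on the same day; the function name log_gaps is made up for illustration.

```shell
#!/bin/sh
# Print the gap in seconds between consecutive heuristic runs.
# Assumes lines like "dummy: Fri May 13 08:59:16 CEST 2011 ..." (time in field 5)
# and that all entries are from the same day (no midnight rollover handling).
log_gaps() {
    awk '{
        split($5, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) printf "%s dt=%ds\n", $5, now - prev
        prev = now
    }' "$@"
}
```

Run it as `log_gaps /root/root/cluster/checkpvtlink.log`. For the entries above it reports 5 s steps up to 09:00:11 and gaps of 101 to 102 s from 09:01:53 to 09:12:04.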

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gerbatsch, Andre
Sent: Friday, 13 May 2011 12:10
To: 'linux-cluster at redhat.com'
Subject: [Linux-cluster] qdiskd does not call heuristics regularly?

Hello,

I'm at a point where I have received different answers from different experts and have read the qdiskd source code myself, and I would be happy if someone could help me:

With my configuration (see below), I expected the heuristic script to be called on a regular basis (every "interval" seconds), so that it has a chance to influence the quorumd score if something happens to the cluster node.

What I see is that the heuristic is called for a few cycles during quorum device initialization; after that, it is only called "from time to time".

Question 1: Is this the expected behavior? If so, is there a way to have the heuristics called regularly?
Question 2: How can I determine which cman/qdisk version I am using, e.g. cman_1_0_??? (see rpm -qi cman below)

The net effect is: if I disconnect one node of a 2-node cluster from the network, the "wrong" node wins, and the heuristics have no influence on the fencing decision.

Thank you in advance for any response
Andre

=================================================
== rpm -qi cman
Name        : cman                         Relocations: (not relocatable)
Version     : 2.0.115                           Vendor: Red Hat, Inc.
Release     : 68.el5_6.1                    Build Date: Mon Dec 20 19:28:36 2010
Install Date: Thu Apr 28 11:11:43 2011         Build Host: ls20-bc2-14.build.redhat.com
Group       : System Environment/Base       Source RPM: cman-2.0.115-68.el5_6.1.src.rpm
Size        : 2619414                          License: GPL
Signature   : DSA/SHA1, Fri Dec 31 06:29:03 2010, Key ID 5326810137017186
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://sources.redhat.com/cluster/
Summary     : cman - The Cluster Manager
Description :
cman - The Cluster Manager

==
cluster.conf:
..
<totem consensus="4800" join="60" token="60000" token_retransmits_before_loss_const="20"/>

<quorumd status_file="/tmp/qdiskd_status" log_level="7" interval="5" device="/dev/mapper/xp1_00p1" tko="5" votes="1">
                <heuristic interval="5" program="/root/root/cluster/checkpvtlink.sh eth0" score="1" tko="3"/>
</quorumd>
..
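
As a quick sanity check of the timing these settings imply: qdiskd gives up on a node after tko missed cycles of interval seconds, and on a heuristic after tko failed runs, so the windows are simply interval * tko. The sketch below only does that arithmetic with the values from the excerpt above; the variable names are made up for illustration.

```shell
#!/bin/sh
# Values copied from the cluster.conf excerpt.
qdisk_interval=5      # <quorumd interval="5">
qdisk_tko=5           # <quorumd tko="5">
heur_interval=5       # <heuristic interval="5">
heur_tko=3            # <heuristic tko="3">
token_ms=60000        # <totem token="60000">

# Timeout windows in seconds.
qdisk_timeout=$((qdisk_interval * qdisk_tko))
heur_timeout=$((heur_interval * heur_tko))
token_s=$((token_ms / 1000))

echo "qdiskd timeout:    ${qdisk_timeout}s"
echo "heuristic timeout: ${heur_timeout}s"
echo "totem token:       ${token_s}s"
```

If the heuristic really ran every 5 s, a failing check should take effect after about 15 s, well inside the 60 s totem token.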
==
> ps -eLf | grep qdiskd
root      3976     1  3976  0    3 08:59 ?        00:00:00 qdiskd -Q
root      3976     1  3978  0    3 08:59 ?        00:00:00 qdiskd -Q
root      3976     1  4226  0    3 08:59 ?        00:00:00 qdiskd -Q
root     21613 12673 21613  0    1 10:45 pts/0    00:00:00 grep qdiskd

== strace of the "score thread" (hopefully :-)
=  it seems to be simply waiting for some timer...
clock_gettime(CLOCK_MONOTONIC, {60774, 182881847}) = 0
clock_gettime(CLOCK_MONOTONIC, {60774, 182920847}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})               = 0
clock_gettime(CLOCK_MONOTONIC, {60775, 202918847}) = 0
clock_gettime(CLOCK_MONOTONIC, {60775, 202961847}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})               = 0
clock_gettime(CLOCK_MONOTONIC, {60776, 222868847}) = 0
clock_gettime(CLOCK_MONOTONIC, {60776, 222912847}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
nanosleep({1, 0},  <unfinished ...>
Process 3978 detached


It seems to me that this is the score thread with a "wrong" h->nextrun, but I think I am simply not understanding something.

cman/qdiskd/score.c: from http://git.fedorahosted.org/git/?p=cluster.git;a=summary

fork_heuristic(struct h_data *h)
{
...
        now = time(NULL);
        if (now < h->nextrun)
                return 0;

        h->nextrun = now + h->interval;

        pid = fork();
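
A minimal sketch of the gate in these lines, modeled in shell with a simulated clock instead of time(NULL). The function simulate_gate and its arguments are made up for illustration; it assumes the gate is reached once per poll period and is not a model of qdiskd as a whole.

```shell
#!/bin/sh
# simulate_gate INTERVAL POLL DURATION
# Models:  if (now < h->nextrun) return 0;  h->nextrun = now + h->interval;
# "now" is a simulated clock in seconds, advanced by POLL after every check.
simulate_gate() {
    interval=$1
    poll=$2
    duration=$3
    now=0
    nextrun=0
    while [ "$now" -le "$duration" ]; do
        if [ "$now" -ge "$nextrun" ]; then
            echo "t=${now}s: heuristic would be forked"
            nextrun=$((now + interval))
        fi
        now=$((now + poll))
    done
}

# Gate checked once per second, heuristic interval of 5 s.
simulate_gate 5 1 20
```

Checked once per second, this prints a run every 5 s (t=0, 5, 10, 15, 20). Calling `simulate_gate 5 102 408` instead gives a run every 102 s. Under this model, gaps longer than the interval would mean, for example, that the gate is not being reached that often or that h->nextrun ended up further ahead than now + interval.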


== output from the heuristic test script
> cat checkpvtlink.sh
#!/bin/sh
rval=0
echo "dummy: $(date) $0 rval=$rval" >> /root/root/cluster/checkpvtlink.log
exit $rval
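
The test script above only logs and returns a fixed value. For comparison, here is a minimal sketch of what a real link check could look like, assuming a Linux sysfs that exposes /sys/class/net/<interface>/carrier. The function name link_is_up and the SYSFS_NET override (there only to make the sketch testable) are made up for illustration and are not part of qdiskd or of the original script.

```shell
#!/bin/sh
# Return 0 (heuristic passes) if the interface reports carrier, 1 otherwise.
# SYSFS_NET is a hypothetical override of the sysfs path, for testing only.
link_is_up() {
    carrier_file="${SYSFS_NET:-/sys/class/net}/$1/carrier"
    # No such file: the interface does not exist.
    [ -r "$carrier_file" ] || return 1
    # A read error (empty result) or a value other than 1 counts as "no link".
    [ "$(cat "$carrier_file" 2>/dev/null)" = "1" ]
}
```

In a heuristic script this could be used as `link_is_up "$1"; exit $?`, with the interface name (eth0 in the heuristic line of cluster.conf) passed as the first argument.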

> tail checkpvtlink.log
dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== service qdiskd restart
dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== why so late ??
dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0





Andre Gerbatsch
MTS IT Systems Engineer
Tel  +49 (0) 351 277-1762
Fax +49 (0) 351 277-91762
andre.gerbatsch at globalfoundries.com  

GLOBALFOUNDRIES Dresden Module Two GmbH & Co. KG
Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland, Sitz Dresden I Registergericht Dresden HRA 4896

