[Linux-cluster] Problem with ping as an heuristic with qdiskd

Fri Mar 9 14:14:17 UTC 2012

Hello,
I have a cluster on RHEL 5.7 with a quorum disk and a heuristic.
Current versions of the main cluster packages are:
rgmanager-2.0.52-21.el5_7.1
cman-2.0.115-85.el5_7.3

This is the loaded heuristic:

Heuristic: 'ping -c1 -w1 10.4.5.250' score=1 interval=2 tko=200

Line in cluster.conf:
<heuristic interval="2" program="ping -c1 -w1 10.4.5.250" score="1" tko="200"/>

where 10.4.5.250 is the gateway of the production LAN.

From the ping man page:
 -c count
    Stop after sending count ECHO_REQUEST packets. With deadline (-w)
    option, ping waits for count ECHO_REPLY packets, until the timeout
    expires.
 -w deadline
    Specify a timeout, in seconds, before ping exits regardless of how
    many packets have been sent or received. In this case ping does not
    stop after count packet are sent, it waits either for deadline
    expire or until count probes are answered or for some error
    notification from network.

So I would expect the single ping command, executed as a sanity
check, to exit with a return code after at most 1 second,
regardless of whether an echo reply has been received.
And in fact I had no particular problems for many months.
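
The expectation above (ping returns within about 1 second, with some exit code) can be checked directly. The following is only a minimal sketch: timed_run is a hypothetical helper name, not part of ping or qdiskd, and it reports whole seconds because it relies on date +%s.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption, not part of any cluster tool):
# run a command, discard its output, then print its exit status and the
# wall-clock time it took, in whole seconds.
timed_run() {
    local start end status
    start=$(date +%s)
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    end=$(date +%s)
    echo "exit=$status elapsed=$((end - start))s"
}
```

For example, "timed_run ping -c1 -w1 10.4.5.250" should print an elapsed time of 1 second or less if ping honours the deadline; a much larger value would point at ping itself (or name resolution, routing, etc.) blocking.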

As a test, I pinged an IP on an unreachable LAN (say 10.4.6.5):
date
n=0
while [ $n -lt 20 ]
do
  ping -c1 -w1 10.4.6.5
  sleep 2
  n=$(expr $n + 1)
done
date

Output is
Fri Mar  9 11:59:02 CET 2012
PING 10.4.6.5 (10.4.6.5) 56(84) bytes of data.

--- 10.4.6.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1000ms

...

--- 10.4.6.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

Fri Mar  9 12:00:02 CET 2012

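So 60 seconds in total: 20 iterations, each taking at most 1 second of
ping plus 2 seconds of sleep, which is what I expected.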

In case of gateway reachability problems (also tested with an iptables
rule that drops outgoing ICMP echo requests) I would then get:

qdiskd[2780]: <debug> Heuristic: 'ping -c1 -w1 10.4.5.250' missed (1/200)

The strange thing is that last night I got only this single line:

qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds

and the node fenced itself, causing relocation of some services.
So I presume that for some reason the ping command was not able to exit
at all, despite the -c and -w options.

I suppose this is a condition that triggers an internal timeout defined
for the monitor operation itself (defaulting to 75 seconds?),
something like a pacemaker directive
op monitor interval="20" timeout="40"

and that the cluster at this point considers the heuristic failed
altogether and self-fences.
Is this right?
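
If the suspicion is that ping occasionally fails to exit, one possible mitigation is to run it under a hard time limit. The sketch below is an assumption-laden illustration, not something qdiskd provides: run_with_limit is a hypothetical function name, and the coreutils "timeout" command is deliberately not used because it may not be available on RHEL 5.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of qdiskd): run a command, but kill it
# with SIGKILL if it is still running after a hard limit.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
    local limit=$1 cmd_pid dog_pid status
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: sleeps for the limit, then kills the command if it is
    # still alive. Output is redirected so it never holds a pipe open.
    ( sleep "$limit"; kill -KILL "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    dog_pid=$!
    status=0
    wait "$cmd_pid" || status=$?
    # The command has finished: stop the watchdog if it is still pending.
    kill "$dog_pid" 2>/dev/null || true
    wait "$dog_pid" 2>/dev/null || true
    return "$status"
}
```

A script calling "run_with_limit 3 ping -c1 -w1 10.4.5.250" could then be named in the heuristic's program attribute instead of the bare ping; a killed command returns status 137 (128 + 9). Note that a process stuck in uninterruptible sleep would not react even to SIGKILL, so this is not a complete safeguard.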

My default quorumd directive is this one, btw:

<quorumd device="/dev/mapper/mpquorum" interval="5" label="oraprquorum"
log_facility="local4" log_level="7" tko="16" votes="1">

And in fact, when for some reason I have temporary problems with my
SAN, I get something like:

qdiskd[1339]: <warning> qdisk cycle took more than 5 seconds to complete (34.540000)

and on the other node:
qdiskd[6025]: <debug> Node 1 missed an update (2/200)
qdiskd[6025]: <debug> Node 1 missed an update (3/200)
...
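
As a side note on the 75-second figure: it happens to match the quorumd settings shown above, if one assumes the heuristic limit is derived as interval * (tko - 1). That derivation is only a guess based on the numbers in this post, not a confirmed description of qdiskd; the arithmetic is simply:

```shell
#!/bin/bash
# Values copied from the quorumd directive in this post.
interval=5
tko=16
# Assumption (not confirmed here): limit = interval * (tko - 1).
candidate=$(( interval * (tko - 1) ))
echo "candidate heuristic timeout: ${candidate}s"
```

With interval="5" and tko="16" this gives 75, the value in the log message, but the match could be a coincidence.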

Can anyone give any insight into the message I got yesterday, which I
had never seen before?

qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds

Should I suspect a bug in the ping command?

Thanks in advance,
Gianluca