[Linux-cluster] Fw: STONITH

Grant Waters gwaters1 at csc.com
Fri Oct 6 12:03:00 UTC 2006


 Forgot to say - I also get the following messages in syslog when I telnet 
to the NPS....

Oct  6 12:53:34 node1 cluquorumd[27339]: Cannot log into WTI 
Network/Telnet Power Switch.
Oct  6 12:53:34 node1 cluquorumd[27339]: <err> STONITH: Device at 
xx.xxx.xxx.xxx controlling node2-h FAILED status check: Bad configuration
Oct  6 12:53:47 node1 cluquorumd[2384]: <crit> Error returned from STONITH 
device(s) controlling node1-h. See system logs on node2-h for more 
information.

I obscured the IP address in there - but it is the correct address of the 
NPS.

What could this "Bad Config" be - is it the /etc/cluster.xml?
 
Regards,
GXW  :o)
----- Forwarded by Grant Waters/GIS/CSC on 06/10/2006 13:00 -----

Grant Waters/GIS/CSC 
06/10/2006 12:11

To: linux-cluster at redhat.com
cc:
Subject: STONITH

I had a quick search through the list archives but couldn't find an exact 
match that included a resolution, so I thought I'd try posting it here.

We have a two-node RH ES 3.0 cluster which uses an MSA 500 G2 shared array 
with a single LUN, and a crossover cable set up as eth1 for the heartbeat. 
Both nodes are dual-fed through an NPS power switch.

Everything has worked fine for 18 months, but we've recently had two 
outages in which the following happens...

We appear to lose eth1 and the MSA 500 G2 starts timing out. By the time I 
get in in the morning, the MSA 500 G2 LCDs show "43 REDUNDANCY FAILED" on 
the secondary controller and "POWER OK" on the primary.

Both servers are up, but the failover node appears to have been forcibly 
rebooted by STONITH, with two plugs on the NPS having been turned off and 
on again.

This leaves neither node able to talk to the shared array, and the service 
down.

Power cycling both nodes and the array fixes the problem, but I want to 
know what's causing it in the first place.  It doesn't appear to be related 
to load, although I can't rule that out - both outages were at approximately 
04:40 on a Friday.
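
One thing I still need to do is check whether anything scheduled kicks off 
around that time on either node - for example something along these lines:

  # look for jobs scheduled around 04:40 (illustrative, not exhaustive)
  cat /etc/crontab
  ls /etc/cron.d /etc/cron.daily /etc/cron.weekly
  crontab -l

in case a backup, updatedb or similar job is hammering the array or the 
network at that point.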

Here are the key messages from syslog...

Sep 29 04:44:50 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:44:51 node1 kernel: cciss: cmd f79252b0 timedout
.......~100 of these
Sep 29 04:44:51 node1 kernel: cciss: cmd f79216f8 timedout
Sep 29 04:44:53 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full 
duplex.
Sep 29 04:44:53 node1 kernel: tg3: eth1: Flow control is off for TX and 
off for RX.
Sep 29 04:45:03 node1 clumembd[2411]: <info> Membership View #3:0x00000001
Sep 29 04:45:04 node1 cluquorumd[2389]: <warning> --> Commencing STONITH 
<--
Sep 29 04:45:06 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned 
/Off.
Sep 29 04:45:07 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:45:08 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned 
/Off.
Sep 29 04:45:08 node1 cluquorumd[2389]: <notice> STONITH: node2-h has been 
fenced!
Sep 29 04:45:10 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned 
/On.
Sep 29 04:45:12 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned 
/On.
Sep 29 04:45:12 node1 cluquorumd[2389]: <notice> STONITH: node2-h is no 
longer fenced off.
Sep 29 04:45:14 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full 
duplex.
Sep 29 04:45:14 node1 kernel: tg3: eth1: Flow control is off for TX and 
off for RX.
Sep 29 04:47:41 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:47:44 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full 
duplex.
Sep 29 04:47:44 node1 kernel: tg3: eth1: Flow control is on for TX and on 
for RX.

I thought it would fail again this morning, so I turned up the cluster 
daemon log levels. Unfortunately it didn't crash, but I did spot this in 
the debug messages....

Oct  6 04:39:31 node1 clulockd[2462]: <debug> ioctl(fd,SIOCGARP,ar 
[eth1]): No such device or address
Oct  6 04:39:31 node1 clulockd[2462]: <debug> Connect: Member #1 
(192.168.100.101) [IPv4]
Oct  6 04:39:31 node1 clulockd[2462]: <debug> Processing message on 11
Oct  6 04:39:31 node1 clulockd[2462]: <debug> Received 188 bytes from peer
Oct  6 04:39:31 node1 clulockd[2462]: <debug> LOCK_LOCK | LOCK_TRYLOCK
Oct  6 04:39:31 node1 clulockd[2462]: <debug> lockd_trylock: member #1 
lock 0
Oct  6 04:39:31 node1 clulockd[2462]: <debug> Replying ACK
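
For reference, I turned the levels up by raising the loglevel attributes on 
the daemon entries in /etc/cluster.xml, roughly like this (quoting from 
memory, so the exact element names may be off):

  <cluquorumd loglevel="7"/>
  <clulockd loglevel="7"/>
  <clumembd loglevel="7"/>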

The point is that the cluster is otherwise working fine, and fails over 
and back without any problems.  I can telnet to the NPS from both nodes, 
so that's OK too.
As far as I can tell eth1 is set up correctly and is working across the 
192.168 addresses.
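
For what it's worth, the sort of checks I've been doing from each node are 
along these lines:

  # link state, speed and duplex on the heartbeat interface
  ethtool eth1

  # reach the other node's heartbeat address (as seen in the debug output)
  ping -c 3 -I eth1 192.168.100.101

  # confirm the NPS answers on its telnet interface
  telnet xx.xxx.xxx.xxx

and nothing obviously wrong shows up.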

Any ideas where to start looking at this?

Regards,
GXW  :o)
