[Linux-cluster] Freeze with cluster-2.03.11

Fri Mar 27 20:01:24 UTC 2009

On Fri, 27 Mar 2009, Bob Peterson wrote:

> | Combing through the log files I found the following:
> | 
> | Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member
> | after 0 sec post_fail_delay
> | Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs"
> | Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node
> | e1÷?e1÷? 
> | Mar 27 13:31:56 lxserv0 fenced[3833]: fence "web1-gfs" success
> | 
> | The line saying "can't get node number for node e1÷?e1÷?" might be 
> | innocent, but looks suspicious. Why could fenced not get the victim's
> | name?
> 
> This leads me to believe that this is a cluster problem,
> not a GFS problem.  If a node is fenced, GFS can't give out
> new locks until the fenced node is properly dealt with by
> the cluster software.  Therefore, GFS can appear to hang until
> the dead node is resolved.  Did web1-gfs get rebooted and
> brought back in to the cluster?

Yes. It's probably worth summarizing what's happening here:

- A full, healthy-looking cluster with all five nodes joined runs
  smoothly.
- One node freezes out of the blue; the freeze can reliably be triggered
  at any time by starting mailman, which works over GFS.
- The frozen node gets fenced off. I assume the order is not reversed,
  i.e. that the node does not freeze *because* it got fenced.

As we use AOE, the fencing happens at the AOE level: the node is *not* 
rebooted automatically, but its access rights to the AOE devices are 
withdrawn. "Freeze" means there's no response at the console. The node 
still answers ping, but nothing else. There's not a single error message 
in the kernel log or on the console screen.

GFS does not freeze at all. There's a short pause, but then it works fine 
until the quorum is lost as more nodes fall out.

We tried vanilla kernels 2.6.27.14 and 2.6.27.21 with the same results, so 
I don't think it's a kernel problem. It >looks< like either a GFS kernel 
module or an openais problem, if the latter (as the victim machine is 
fenced off) can cause a system freeze.

In the daytime (with active users) it was like an infection: within ten 
minutes of bringing the machines back, one failed, then shortly afterwards 
another one too. Now, since 17:22 (more than three hours), the cluster has 
been running smoothly, but it's lightly used. However, a node can be killed 
at any time by starting that damned mailman, which should be running.
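
In case it helps to check whether the odd "can't get node number" message 
from fenced shows up in the logs of the other nodes too, here is a minimal 
Python sketch. It is only an illustration, not something from the cluster 
tools: the function name suspicious_fenced_lines, the rule for what counts 
as a sane node name, and the assumption that each syslog message sits on 
one line (the excerpt above is wrapped by the mailer) are all mine.

```python
import re
import sys

# Matches syslog lines written by fenced, e.g.:
#   Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs"
FENCED_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"fenced\[\d+\]:\s+(?P<msg>.*)$"
)

# The message of interest; whatever follows "node" is taken as the name.
NODE_NUMBER_RE = re.compile(r"can't get node number for node\s*(?P<name>.*)$")

# Assumption: a sane node name has only letters, digits, '.', '_' and '-'.
SANE_NAME_RE = re.compile(r"^[A-Za-z0-9._-]+$")


def suspicious_fenced_lines(lines):
    """Return (timestamp, host, name) for each fenced "can't get node
    number" message whose node name does not look like a plain host name."""
    hits = []
    for line in lines:
        m = FENCED_RE.match(line.rstrip("\n"))
        if not m:
            continue
        n = NODE_NUMBER_RE.search(m.group("msg"))
        if not n:
            continue
        name = n.group("name").strip()
        if not SANE_NAME_RE.match(name):
            hits.append((m.group("ts"), m.group("host"), name))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py <logfile> [<logfile> ...]
    for path in sys.argv[1:]:
        # errors="replace" keeps the scan going over undecodable bytes.
        with open(path, errors="replace") as f:
            for ts, host, name in suspicious_fenced_lines(f):
                print(f"{path}: {ts} {host}: odd node name {name!r}")
```

It would be run with the syslog file(s) of each node as arguments; where 
fenced logs to depends on the local syslog configuration.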

Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary