[Linux-cluster] Freeze with cluster-2.03.11

Kadlecsik Jozsef kadlec at mail.kfki.hu
Tue Mar 31 09:50:45 UTC 2009


On Mon, 30 Mar 2009, David Teigland wrote:

> On Thu, Mar 26, 2009 at 11:47:00PM +0100, Kadlecsik Jozsef wrote:
> > 
> > Freshly built cluster-2.03.11 reproducibly freezes as mailman is started. 
> > The versions are:
> > 
> > linux-2.6.27.21
> > cluster-2.03.11
> > openais from svn, subrev 1152 version 0.80
> 
> So, in summary:
> - nodes 1-5 are correctly forming a cluster, and appear to be stable
> - nodes 1-5 all correctly mount the gfs file system
> - node5 runs: init.d/mailman start
> - node5 "freezes completely"
> - node5 is fenced by another node, e.g. node4
> - sometimes, node4 then freezes completely

Yes, exactly. The freeze can reliably be triggered by starting mailman, 
but it can occur (and did) otherwise as well. The fact that node4 
sometimes freezes too is not related to the fact that it fenced off 
node5.
 
> You're using STABLE2 code, which is equivalent to RHEL5 code *except* 
> for the gfs-kernel patches that are necessary to make gfs run on recent 
> kernels.  The RHEL5 code is thoroughly tested, but the STABLE2 code is 
> not, so any differences between them (i.e. the gfs-kernel patches for 
> recent kernels) are the most likely causes for regression bugs.

That's bad because then one cannot check it by just removing patches - the 
kernel must be changed as well.
 
> It's always possible that a patch like the one in bz 466645 could be 
> responsible, but it's less likely since it does go through a QE process 
> unlike the patches for kernel updates.

I'll try to find a reliable way to crash the kernel without mailman. 
That would make the bug hunting easier.
 
> Aside from gfs, the fact that you're running AoE over the same network 
> as openais does raise some flags.  We've seen problems with openais in 
> the past when block i/o is sent over the same network causing load 
> problems.  It seems unlikely to be your problem, though, since it works 
> fine with the previous version, and the freezing symptoms aren't what 
> we'd expect to see from openais trouble.

AoE and openais seem to work side by side just fine. I can imagine that 
iSCSI and openais have more trouble together, because iSCSI is much more 
heavyweight than AoE. We pondered the setup a lot, but decided to go 
with it, and so far it has caused no problems. (Not absolutely true: an 
AoE ethernet interface coming up at 10 Mbps(!) instead of 10000 Mbps can 
practically kill AoE ;-).
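
A link that negotiates the wrong speed, as in the parenthetical above, 
could be caught with a small check like the one below. This is only a 
sketch under assumptions: it assumes the running kernel exposes the 
negotiated speed in /sys/class/net/<iface>/speed (older kernels may not, 
in which case ethtool would be the place to look), and the interface name 
"eth1" and the helper names are hypothetical, not taken from this setup.

```python
import os


def link_speed_mbps(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mb/s, or None if unknown.

    Assumes the kernel provides <sysfs_root>/<iface>/speed. Reading the
    file can fail (e.g. link down) or yield -1 when the speed is unknown.
    """
    path = os.path.join(sysfs_root, iface, "speed")
    try:
        with open(path) as f:
            value = int(f.read().strip())
    except (OSError, ValueError):
        return None
    return value if value > 0 else None


def speed_ok(iface, expected_mbps, sysfs_root="/sys/class/net"):
    """True only if the link is up at expected_mbps or faster."""
    speed = link_speed_mbps(iface, sysfs_root)
    return speed is not None and speed >= expected_mbps


if __name__ == "__main__":
    # "eth1" is a placeholder for the AoE interface.
    if not speed_ok("eth1", 10000):
        print("warning: eth1 is not running at 10000 Mb/s:",
              link_speed_mbps("eth1"))
```

Run before the AoE devices are brought up, such a check would flag an 
interface that came up at 10 Mb/s instead of silently degrading storage.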

Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary
