[Linux-cluster] Freeze with cluster-2.03.11

David Teigland teigland at redhat.com
Mon Mar 30 18:07:40 UTC 2009


On Thu, Mar 26, 2009 at 11:47:00PM +0100, Kadlecsik Jozsef wrote:
> Hi,
> 
> Freshly built cluster-2.03.11 reproducibly freezes when mailman is started. 
> The versions are:
> 
> linux-2.6.27.21
> cluster-2.03.11
> openais from svn, subrev 1152 version 0.80

So, in summary:
- nodes 1-5 are correctly forming a cluster, and appear to be stable
- nodes 1-5 all correctly mount the gfs file system
- node5 runs: init.d/mailman start
- node5 "freezes completely"
- node5 is fenced by another node, e.g. node4
- sometimes, node4 then freezes completely

You're using STABLE2 code, which is equivalent to RHEL5 code *except* for the
gfs-kernel patches that are necessary to make gfs run on recent kernels.  The
RHEL5 code is thoroughly tested, but the STABLE2 code is not, so any
differences between them (i.e. the gfs-kernel patches for recent kernels) are
the most likely cause of regressions.

It's always possible that a patch like the one in bz 466645 is responsible,
but that's less likely, since such patches go through a QE process, unlike
the patches for recent kernels.

Hopefully some gfs developers can look at the backtraces (which, as Wendy
points out, do look suspicious) and try to reproduce this problem with recent
kernels.

Aside from gfs, the fact that you're running AoE over the same network as
openais does raise some flags.  In the past we've seen openais problems when
block i/o sent over the same network caused load problems.  That seems
unlikely to be your problem, though, since the same setup works fine with the
previous version, and the freezing symptoms aren't what we'd expect to see
from openais trouble.

Dave



