[Linux-cluster] Inconsistent cluster view, shutdown, kernel panic

Moreno Baricevic baro at democritos.it
Sun May 28 22:20:31 UTC 2006


We are trying to install GFS (cluster-1.02 on a vanilla kernel) on a 
CentOS cluster of 70 "diskless" nodes.

The structure is something like this:

+---+   GNBD-SERVERS                  GNBD CLIENTS
|   |-----[node63]-----[node64 node65 node66 node67 node68 node69]
| S |.....
| A |.....
| N |-----[node07]-----[node08 node09 node10 node11 node12 node13]
|   |-----[node00]-----[node01 node02 node03 node04 node05 node06]
+---+

All the nodes have a gigabit NIC and all the nodes see each other.
Only the gnbd-servers have a fiber adapter to connect to the SAN.

Everything works fine as long as we test on 33 nodes: 9 nodes with the 
fiber adapter (acting as both GFS nodes and gnbd-servers) and 24 gnbd 
clients (connected to 4 of the gnbd-servers). "Fine" means that we have 
been able to mount and use the GFS filesystem.

When we try to start cman on 39 nodes (or worse, when we try with 63 
nodes), roughly half of the nodes soon get this:

 	"kernel panic - not syncing: membership stopped responding"

(/etc/init.d/cman), but the problem persists.

We tried booting the nodes 10 at a time, with a 2-minute delay between 
groups. As soon as we reach quorum (or one of the timeouts?), the nodes 
start collapsing with "Inconsistent cluster view", "Shutdown", "No 
response to messages".
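
For reference, the staggered start described above can be sketched as 
follows. This is only an illustration, not the script we actually ran: 
the function names are made up, and it assumes one vote per node and a 
simple-majority quorum (expected_votes / 2 + 1, integer division), which 
may not match the values in a given cluster.conf.

```python
# Sketch of a staggered boot: groups of 10 nodes, 2 minutes apart.
# ASSUMPTIONS (not taken from the real cluster.conf): every node has one
# vote, and quorum is a simple majority of the expected votes.

def boot_schedule(total_nodes, batch_size=10, delay_s=120):
    """Yield (start_time_in_seconds, nodes_started_so_far) per boot group."""
    started = 0
    t = 0
    while started < total_nodes:
        started = min(started + batch_size, total_nodes)
        yield t, started
        t += delay_s


def simple_majority(expected_votes):
    """Votes needed for quorum under the one-vote-per-node assumption."""
    return expected_votes // 2 + 1


def quorum_time(total_nodes, batch_size=10, delay_s=120):
    """Return (time_s, nodes_started, votes_needed) for the first boot
    group that brings the cluster up to quorum."""
    needed = simple_majority(total_nodes)
    for t, started in boot_schedule(total_nodes, batch_size, delay_s):
        if started >= needed:
            return t, started, needed
    return None


for n in (33, 39, 63):
    t, started, needed = quorum_time(n)
    print(f"{n} nodes: quorum needs {needed} votes, "
          f"reached at t={t}s with {started} nodes started")
```

Under these assumptions, quorum is crossed while later groups are still 
booting, which is roughly when we see the nodes start to collapse.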

We also tried the patch supplied as a solution for bug report 187777, 
but nothing changed.

Is there a limit on the number of nodes, a timeout, or any other issue 
that we didn't consider?

Here you can find the cluster.conf, logs from surviving and dead nodes, 
a tcpdump for UDP:6809, and the nodes' /proc/cluster/{status,nodes,services}:


There's a lot of material; let me know if you need something more specific.

RTFMs are welcome.

Thanks in advance


