[Linux-cluster] cluster instability

Shawn Hood shawnlhood at gmail.com
Mon Jun 16 15:54:36 UTC 2008


All,

This message was originally sent to my office, so the voice may seem a bit
odd.  We have a four-node cluster running RHEL4U6 on Dell PowerEdge
1950s.  Fencing is done via DRAC.

Using packages (from RHN):

cman-kernel-smp-2.6.9-53.13
cman-1.0.17-0.el4_6.5
ccs-1.0.11-1.el4_6.1
fence-1.32.50-2.el4_6.1
lvm2-cluster-2.02.27-2.el4_6.2
dlm-kernel-smp-2.6.9-52.9
dlm-kernheaders-2.6.9-52.9

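Our cluster became unstable on Saturday morning (June 14).  Apparently
hugin stopped sending heartbeats, which caused it to be fenced.  hugin
was under heavy load (load average ~10) at the time; the last three
columns of the sar output below are the 1-, 5-, and 15-minute load averages: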

03:30:02 AM         6       453      9.35     10.29     10.51
03:40:01 AM        12       465     11.02     11.00     10.75
03:50:02 AM         3       446      9.75     10.80     10.86
04:00:01 AM         5       430      9.23      9.47     10.07
Average:            7       455     10.19     10.32     10.28

04:09:35 AM       LINUX RESTART

As you can see, hugin was fenced and restarted at about 04:09.  Around the
same time, the other nodes logged the following (this excerpt is from munin):

Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59
Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60
Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61
Jun 14 04:09:06 munin kernel: CMAN: too many transition restarts - will die
Jun 14 04:09:06 munin kernel: CMAN: we are leaving the cluster. Inconsistent cluster view

After repeated 'Initiating transition' messages, 15 seconds apart, CMAN
gave up with 'too many transition restarts' and the cluster died.  Our
network utilization was very low at the time.
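
For anyone who wants to check the spacing of transition restarts in their
own logs, here is a minimal Python sketch.  It is not part of any cluster
tooling.  The function name transition_gaps and the year argument are my
own assumptions (syslog timestamps carry no year), and the sample input is
just the munin excerpt above.

```python
import re
from datetime import datetime

# Sample input: the munin excerpt quoted above.
LOG = """\
Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59
Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60
Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61
Jun 14 04:09:06 munin kernel: CMAN: too many transition restarts - will die
"""

# Matches only the "Initiating transition" lines and captures the
# timestamp and the generation number.
PATTERN = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"CMAN: Initiating transition, generation (\d+)"
)

def transition_gaps(text, year=2008):
    """Return (generation, seconds since previous transition) pairs.

    The year is an assumption supplied by the caller, because syslog
    timestamps do not include one.
    """
    stamps = []
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            stamps.append((int(m.group(2)), ts))
    # Pair each transition with the one before it and take the difference.
    return [
        (gen, (ts - prev_ts).total_seconds())
        for (_, prev_ts), (gen, ts) in zip(stamps, stamps[1:])
    ]

gaps = transition_gaps(LOG)
for generation, seconds in gaps:
    print(f"generation {generation}: {seconds:.0f}s after the previous one")
```

On the excerpt above, every restart comes 15 seconds after the previous
one.  That looks more like a fixed retry timer expiring than like random
packet loss.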

Any ideas?

Shawn



