[Linux-cluster] cluster instability

Christine Caulfield ccaulfie at redhat.com
Tue Jun 17 07:29:00 UTC 2008


GS R wrote:
> 
> 
> On 6/16/08, Shawn Hood <shawnlhood at gmail.com> wrote:
> 
>     All,
> 
>     This message was sent out to my office, so the voice may seem a bit
>     odd.  We have a 4-node cluster running RHEL4U6 on Dell PowerEdge
>     1950s.  Fencing is done via DRAC.
> 
>     Using packages (from RHN):
> 
>     cman-kernel-smp-2.6.9-53.13
>     cman-1.0.17-0.el4_6.5
>     ccs-1.0.11-1.el4_6.1
>     fence-1.32.50-2.el4_6.1
>     lvm2-cluster-2.02.27-2.el4_6.2
>     dlm-kernel-smp-2.6.9-52.9
>     dlm-kernheaders-2.6.9-52.9
> 
>     Our cluster became unstable on Saturday morning.  Apparently hugin
>     stopped sending out heartbeats, causing it to be fenced.  hugin was
>     under heavy load (~10) at the time:
> 
>     03:30:02 AM         6       453      9.35     10.29     10.51
>     03:40:01 AM        12       465     11.02     11.00     10.75
>     03:50:02 AM         3       446      9.75     10.80     10.86
>     04:00:01 AM         5       430      9.23      9.47     10.07
>     Average:            7       455     10.19     10.32     10.28
> 
>     04:09:35 AM       LINUX RESTART
> 
>     As you can see, hugin was fenced at 4:09.  The other nodes then began
>     logging the following:
> 
>     Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
>     Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59
>     Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60
>     Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61
>     Jun 14 04:09:06 munin kernel: CMAN: too many transition restarts - will die
>     Jun 14 04:09:06 munin kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
> 
>  
> I guess this is down to a network issue, though its utilization was low
> when this was logged.
> The node is not able to receive messages.
> 

I suspect you've hit this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=444751


There's a patch in the bugzilla, and a workaround program you can run
which should help if you can't upgrade the kernel module (see comment #10).
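
To check whether the other nodes logged the same run of transition
restarts as the munin lines quoted above, something like the following
could be pointed at each node's syslog. This is only a rough sketch and
is not the workaround program from the bugzilla: the regular expression
is based solely on the log lines quoted in this thread, and the function
names and the year default are made up for illustration.

```python
import re
from datetime import datetime

# Pattern derived only from the syslog lines quoted in this thread;
# other CMAN versions may word the message differently (assumption).
TRANSITION = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) kernel: "
    r"CMAN: Initiating transition, generation (\d+)"
)


def find_transitions(lines, year=2008):
    """Return (timestamp, host, generation) for each CMAN transition line.

    Syslog timestamps carry no year, so one has to be supplied.
    """
    found = []
    for line in lines:
        m = TRANSITION.match(line.strip())
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            found.append((stamp, m.group(2), int(m.group(3))))
    return found


def restart_gaps(transitions):
    """Seconds between consecutive transition messages."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(transitions, transitions[1:])
    ]


# The four transition lines from the original report.
sample = [
    "Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58",
    "Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59",
    "Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60",
    "Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61",
]

transitions = find_transitions(sample)
print([gen for _, _, gen in transitions])  # generations seen
print(restart_gaps(transitions))           # seconds between restarts
```

In practice you would pass in the lines of /var/log/messages from each
node instead of the sample list. A run of consecutive generations at a
fixed short interval, as in the report, means the transition kept
restarting rather than completing.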

-- 

Chrissie



