[Linux-cluster] Corosync memory problem

Tue Dec 27 16:00:25 UTC 2011

On 12/21/2011 11:04 AM, Chris Alexander wrote:
> An update in case anyone ever runs into something like this: we had
> corosync-notify running on the servers, and once we removed it and
> restarted the cluster stack, corosync seemed to return to normal.
> 
> Additionally, according to the corosync mailing list, the 1.2.3 version
> in our cluster is very similar to (if not the same as) the 1.4 that
> they currently have released, because someone has been backporting.
> 

The upstream 1.2.3 version hasn't had any backports applied to it.  Only
the RHEL 1.2.3-z versions have had backports applied.

Regards
-steve

> Cheers
> 
> Chris
> 
> On 19 December 2011 19:01, Chris Alexander
> <chris.alexander at kusiri.com> wrote:
> 
>     Hi all,
> 
>     You may remember our recent issue. I believe it is being worsened,
>     if not caused, by another problem we have encountered.
> 
>     Every few days our nodes are (non-simultaneously) being fenced due
>     to corosync taking up vast amounts of memory (i.e. 100% of the box).
>     Please see a sample log message [1] from when this happens; we have
>     several just like it. Note that it is not always corosync that is
>     being killed, but it is clearly corosync eating all the memory (see
>     top output from three servers at various times since their last
>     reboot, [2] [3] [4]).
> 
>     The corosync version is 1.2.3:
>     [g at cluster1 ~]$ corosync -v
>     Corosync Cluster Engine, version '1.2.3'
>     Copyright (c) 2006-2009 Red Hat, Inc.
> 
>     We had a bit of a dig around and there are a significant number of
>     bugfix updates which address various segfaults, crashes, memory
>     leaks etc. in this minor as well as subsequent minor versions. [5] [6]
> 
>     We're trialling the Fedora 14 (fc14) RPMs for corosync and
>     corosynclib (v1.4.2) to see if they fix the particular issue we are
>     seeing (i.e. whether or not the memory keeps spiralling out of
>     control).
> 
>     Has anyone else seen an issue like this, and is there any known way
>     to debug or fix it? If we can assist debugging by providing further
>     information, please specify what this is (and, if non-obvious, how
>     to get it).
> 
>     Thanks again for your help
> 
>     Chris
> 
>     [1] http://pastebin.com/CbyERaRT
>     [2] http://pastebin.com/uk9ZGL7H
>     [3] http://pastebin.com/H4w5Zg46
>     [4] http://pastebin.com/KPZxL6UB
>     [5] http://rhn.redhat.com/errata/RHBA-2011-1361.html
>     [6] http://rhn.redhat.com/errata/RHBA-2011-1515.html
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
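
For anyone trying to confirm memory growth like that described in the
thread before a node gets fenced, here is a minimal sketch that samples a
process's resident memory from /proc/<pid>/status on Linux and estimates
its growth rate. The function names (parse_vmrss_kib, read_rss_kib,
growth_kib_per_hour, watch) are hypothetical helpers written for this
illustration. They are not part of corosync or the cluster stack, and the
script only observes the process; it does not diagnose the cause.

```python
import re
import time
from pathlib import Path
from typing import List, Optional, Tuple


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current resident set size of a process (Linux only)."""
    try:
        return parse_vmrss_kib(Path(f"/proc/{pid}/status").read_text())
    except OSError:
        # The process has exited or /proc is not available.
        return None


def growth_kib_per_hour(samples: List[Tuple[float, int]]) -> float:
    """Average growth between the first and last (time_s, rss_kib) sample."""
    if len(samples) < 2:
        return 0.0
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (rss1 - rss0) * 3600.0 / (t1 - t0)


def watch(pid: int, interval_s: float = 60.0, count: int = 10) -> List[Tuple[float, int]]:
    """Take up to `count` samples, `interval_s` seconds apart, printing each."""
    samples: List[Tuple[float, int]] = []
    for _ in range(count):
        rss = read_rss_kib(pid)
        if rss is None:
            break
        samples.append((time.time(), rss))
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} pid={pid} rss={rss} KiB")
        time.sleep(interval_s)
    return samples
```

One way to use it would be to look up the PID with `pidof corosync`, call
watch(pid) from a Python shell on each node, and pass the returned samples
to growth_kib_per_hour(). A steadily positive rate over several hours
would support the leak theory, while a flat one would point elsewhere.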