[Linux-cluster] Corosync memory problem
chris.alexander at kusiri.com
Wed Dec 21 18:04:55 UTC 2011
An update in case anyone ever runs into something like this: we had
corosync-notify running on the servers, and once we removed that and
restarted the cluster stack, corosync seemed to return to normal.
Additionally, according to the corosync mailing list, the 1.2.3 version
we are running is very similar to (if not the same as) the 1.4 that they
currently have released, because someone has been backporting fixes.
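For anyone wanting to check whether they are hitting the same growth before a node gets fenced, here is a rough sketch of how the resident memory of a process could be logged over time. It is not something we ran ourselves; the function names (parse_status, find_pids, log_rss), the default process name and the sampling interval are all assumptions made for illustration, and it only relies on the Name and VmRSS fields of /proc/<pid>/status on Linux.

```python
import re
import time
from pathlib import Path


def parse_status(status_text):
    """Extract (process name, VmRSS in kB) from /proc/<pid>/status text.

    The RSS is None when the file has no VmRSS line (e.g. kernel threads).
    """
    name = re.search(r"^Name:\s+(.+)$", status_text, re.MULTILINE)
    rss = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return (
        name.group(1).strip() if name else None,
        int(rss.group(1)) if rss else None,
    )


def find_pids(process_name):
    """Return (pid, rss_kb) pairs for every process with the given name."""
    found = []
    for status_file in Path("/proc").glob("[0-9]*/status"):
        try:
            name, rss_kb = parse_status(status_file.read_text())
        except OSError:
            continue  # the process exited while we were scanning
        if name == process_name:
            found.append((int(status_file.parent.name), rss_kb))
    return found


def log_rss(process_name="corosync", interval_s=60, samples=5):
    """Print a timestamped RSS reading per matching process, per sample."""
    for _ in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for pid, rss_kb in find_pids(process_name):
            print(stamp, process_name, pid, rss_kb, "kB")
        time.sleep(interval_s)
```

Calling log_rss() from a cron job or a screen session and keeping the output should show whether the figure climbs steadily, which is easier to compare across nodes than occasional top snapshots.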
On 19 December 2011 19:01, Chris Alexander <chris.alexander at kusiri.com> wrote:
> Hi all,
> You may remember our recent issue. I believe it is being worsened, if not
> caused, by another problem we have encountered.
> Every few days our nodes are (non-simultaneously) being fenced due to
> corosync taking up vast amounts of memory (i.e. 100% of the box). Please
> see the sample log message, which is what we get when this happens; we have
> several just like it. Note that it is not always corosync being killed, but
> it is clearly corosync eating all the memory (see the top output from three
> servers at various times since their last reboot).
> The corosync version is 1.2.3:
> [g at cluster1 ~]$ corosync -v
> Corosync Cluster Engine, version '1.2.3'
> Copyright (c) 2006-2009 Red Hat, Inc.
> We had a bit of a dig around and there are a significant number of bugfix
> updates which address various segfaults, crashes, memory leaks etc. in this
> minor version as well as in subsequent ones.
> We're trialling the Fedora 14 (fc14) RPMs for corosync and corosynclib
> (v1.4.2) to see if it fixes the particular issue we are seeing (i.e.
> whether or not the memory keeps spiralling way out of control).
> Has anyone else seen an issue like this, and is there any known way to
> debug or fix it? If we can assist debugging by providing further
> information, please specify what this is (and, if non-obvious, how to get
> it).
> Thanks again for your help
>  http://pastebin.com/CbyERaRT
>  http://pastebin.com/uk9ZGL7H
>  http://pastebin.com/H4w5Zg46
>  http://pastebin.com/KPZxL6UB
>  http://rhn.redhat.com/errata/RHBA-2011-1361.html
>  http://rhn.redhat.com/errata/RHBA-2011-1515.html