[Linux-cluster] Corosync memory problem
Steven Dake
sdake at redhat.com
Tue Dec 27 16:00:25 UTC 2011
On 12/21/2011 11:04 AM, Chris Alexander wrote:
> An update in case anyone ever runs into something like this - we had
> corosync-notify running on the servers and once we removed that and
> restarted the cluster stack, corosync seemed to return to normal.
>
> Additionally, according to the corosync mailing list, the cluster 1.2.3
> version is very similar to (if not the same as) the 1.4 that they have
> currently released; someone has been backporting.
>
The upstream 1.2.3 version hasn't had any backports applied to it. Only
the RHEL 1.2.3-z versions have been backported.
Regards
-steve
> Cheers
>
> Chris
>
> On 19 December 2011 19:01, Chris Alexander <chris.alexander at kusiri.com> wrote:
>
> Hi all,
>
> You may remember our recent issue; I believe it is being worsened,
> if not caused, by another problem we have encountered.
>
> Every few days our nodes are (non-simultaneously) being fenced due
> to corosync taking up vast amounts of memory (i.e. 100% of the box).
> Please see a sample log message [1] from when this happens (we have
> several just like it). Note that it is not always corosync
> being killed - but it is clearly corosync eating all the memory (see
> top output from three servers at various times since their last
> reboot, [2] [3] [4]).
>
> The corosync version is 1.2.3:
> [g at cluster1 ~]$ corosync -v
> Corosync Cluster Engine, version '1.2.3'
> Copyright (c) 2006-2009 Red Hat, Inc.
>
> We had a bit of a dig around and there are a significant number of
> bugfix updates which address various segfaults, crashes, memory
> leaks etc. in this minor as well as subsequent minor versions. [5] [6]
>
> We're trialling the Fedora 14 (fc14) RPMs for corosync and
> corosynclib (v1.4.2) to see if it fixes the particular issue we are
> seeing (i.e. whether or not the memory keeps spiralling way out of
> control).
>
> Has anyone else seen an issue like this, and is there any known way
> to debug or fix it? If we can assist debugging by providing further
> information, please specify what this is (and, if non-obvious, how
> to get it).
>
> Thanks again for your help
>
> Chris
>
> [1] http://pastebin.com/CbyERaRT
> [2] http://pastebin.com/uk9ZGL7H
> [3] http://pastebin.com/H4w5Zg46
> [4] http://pastebin.com/KPZxL6UB
> [5] http://rhn.redhat.com/errata/RHBA-2011-1361.html
> [6] http://rhn.redhat.com/errata/RHBA-2011-1515.html
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
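
For the memory growth described in the quoted message, one way to gather data between fencing events is to log the resident memory of the corosync process at intervals. The following is only a minimal sketch, not something from the thread: it assumes Python 3 and a Linux /proc filesystem, and the function names (parse_status, rss_kb, sample, watch) are hypothetical helpers introduced for illustration.

```python
import os
import time


def parse_status(status_text):
    """Turn the text of /proc/<pid>/status into a dict of field -> value."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields


def rss_kb(fields):
    """Return the resident set size in kB, or None if VmRSS is absent
    (kernel threads, for example, have no VmRSS line)."""
    value = fields.get("VmRSS")
    return int(value.split()[0]) if value else None


def sample(name="corosync", proc_root="/proc"):
    """Return (pid, rss_kb) for every process whose Name field equals name."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                fields = parse_status(f.read())
        except OSError:
            # The process exited or is unreadable; skip it.
            continue
        if fields.get("Name") == name:
            results.append((int(entry), rss_kb(fields)))
    return results


def watch(name="corosync", interval_s=60, count=None):
    """Print one timestamped line per matching process every interval_s
    seconds. Runs forever when count is None."""
    taken = 0
    while count is None or taken < count:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for pid, kb in sample(name):
            print("%s pid=%d rss_kb=%s" % (stamp, pid, kb))
        taken += 1
        if count is None or taken < count:
            time.sleep(interval_s)
```

Calling watch() on each node and redirecting the output to a file would give a per-minute record of resident memory, which could then be compared against the times corosync-notify was running or stopped.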