[Linux-cluster] Nodes leaving and re-joining intermittently

Chris Alexander chris.alexander at kusiri.com
Sun Dec 11 15:26:23 UTC 2011


Please find below the cluster.conf Matt mentioned.

Regarding logs, I have verified that the two SNMP trap notifications Matt
posted in his first message are the only ones our script processed anywhere
near this event window (the previous one was days earlier, and there have
been none since). I will have a look at the on-disk logs tomorrow and see
whether there is anything of value over that period on any of the cluster
nodes.
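
Roughly, the plan is to grep the syslog on each node for anything from the
cluster stack over that window. A rough sketch, assuming the usual RHEL
layout (note rgmanager is set to log to local4 here, so its messages may be
routed elsewhere depending on the syslog config):

    # membership, fencing and service-manager messages around the event window
    grep -E 'corosync|openais|fenced|rgmanager|cman' /var/log/messages*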

Thanks,

Chris

<?xml version="1.0"?>
<cluster config_version="30" name="camra">

        <fence_daemon clean_start="1" post_fail_delay="30" post_join_delay="30" override_time="30"/>

        <clusternodes>
                <clusternode name="xxx.xxx.xxx.1" nodeid="1">
                        <fence>
                                <method name="ilo">
                                        <device name="ilo1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="xxx.xxx.xxx.2" nodeid="2">
                        <fence>
                                <method name="ilo">
                                        <device name="ilo2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="xxx.xxx.xxx.3" nodeid="3">
                        <fence>
                                <method name="ilo">
                                        <device name="ilo3"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>

        <rm log_facility="local4" log_level="7">
                <failoverdomains>
                        <failoverdomain name="mysql" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.1" priority="1"/>
                                <failoverdomainnode name="xxx.xxx.xxx.2" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="solr" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.2" priority="2"/>
                                <failoverdomainnode name="xxx.xxx.xxx.1" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="cluster1" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.1"/>
                        </failoverdomain>
                        <failoverdomain name="cluster2" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.2"/>
                        </failoverdomain>
                        <failoverdomain name="cluster3" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.3"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <script file="/etc/init.d/fstest" name="fstest"/>
                        <script file="/etc/init.d/postfixstatus" name="postfixstatus"/>
                        <script file="/etc/init.d/snmpdstatus" name="snmpdstatus"/>
                        <script file="/etc/init.d/snmptrapdstatus" name="snmptrapdstatus"/>
                        <script file="/etc/init.d/foghornstatus" name="foghornstatus"/>
                </resources>
                <service domain="cluster1" max_restarts="50" name="snmptrap1" recovery="restart-disable" restart_expire_time="900">
                        <script ref="fstest"/>
                        <script ref="foghornstatus"/>
                        <script ref="snmpdstatus"/>
                        <script ref="postfixstatus"/>
                        <script ref="snmptrapdstatus"/>
                </service>
                <service domain="cluster2" max_restarts="50" name="snmptrap2" recovery="restart-disable" restart_expire_time="900">
                        <script ref="fstest"/>
                        <script ref="foghornstatus"/>
                        <script ref="snmpdstatus"/>
                        <script ref="postfixstatus"/>
                        <script ref="snmptrapdstatus"/>
                </service>
                <service domain="cluster3" max_restarts="50" name="snmptrap3" recovery="restart-disable" restart_expire_time="900">
                        <script ref="fstest"/>
                        <script ref="foghornstatus"/>
                        <script ref="snmpdstatus"/>
                        <script ref="postfixstatus"/>
                        <script ref="snmptrapdstatus"/>
                </service>
        </rm>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.101" login="x" name="ilo1" passwd="x"/>
                <fencedevice agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.102" login="x" name="ilo2" passwd="x"/>
                <fencedevice agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.103" login="x" name="ilo3" passwd="x"/>
        </fencedevices>
</cluster>
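
(When the next leave/join happens I will also try to grab each node's own
view at that moment, roughly the following, assuming the standard
cman/rgmanager command-line tools on these boxes:

    cman_tool status   # quorum and membership summary
    cman_tool nodes    # per-node join/leave state as cman sees it
    clustat            # rgmanager's view of members and services

in case that narrows down whether it is corosync membership or just
rgmanager misbehaving.)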

On 11 December 2011 11:12, Matthew Painter <matthew.painter at kusiri.com> wrote:

> Thank you for your input :)
>
> The nodes are synced using NTP, although I am unsure about the respective
> run levels.
>
> I will look into this, thank you.
>
>
> On Sun, Dec 11, 2011 at 7:16 AM, Dukhan, Meir <Mdukhan at nds.com> wrote:
>
>>
>> Are your nodes time-synced, and if so, how?
>>
>> We ran into problems with nodes being fenced because of an NTP problem.
>>
>> The solution (AFAIR, from the Red Hat knowledge base) was to start ntpd
>> _before_ cman.
>> I'm not sure, but there may have been an update to openais or ntpd
>> regarding this issue.
>>
>> For those of you who have a Red Hat account, see the Red Hat KB article:
>>
>>        Does cman need to have the time of nodes in sync?
>>        https://access.redhat.com/kb/docs/DOC-42471
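>>
>> (To check the start ordering on a chkconfig-based init, something like
>>
>>         ls /etc/rc3.d/ | grep -Ei 'ntp|cman'
>>
>> should show ntpd with a lower S number than cman if it starts first;
>> adjust the runlevel directory to whatever the nodes actually boot into.)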
>>
>> Hope this helps,
>>
>> Regards,
>> -- Meir R. Dukhan
>>
>> |-----Original Message-----
>> |From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-
>> |bounces at redhat.com] On Behalf Of Digimer
>> |Sent: Sunday, December 11, 2011 0:23 AM
>> |To: Matthew Painter
>> |Cc: linux clustering
>> |Subject: Re: [Linux-cluster] Nodes leaving and re-joining intermittently
>> |
>> |On 12/10/2011 05:00 PM, Matthew Painter wrote:
>> |> The switch was our first thought, but that has been swapped, and while
>> |> we are not having nodes fenced anymore (we were daily), this anomaly
>> |> remains.
>> |>
>> |> I will ask for those logs and conf on Monday.
>> |>
>> |> I think it might be worth reinstalling corosync on this box anyway?
>> |> It can't be healthy if it is exiting uncleanly. I have had reports of
>> |> rgmanager dying on this box (pid file present but not running). Could
>> |> that be related?
>> |>
>> |> Thanks :)
>> |
>> |It's impossible to say without knowing your configuration. Please share the
>> |cluster.conf (only obfuscate passwords, please) along with the log files.
>> |The more detail, the better. Versions, distros, network config, etc.
>> |
>> |Reinstalling corosync is not likely to help. RGManager sits fairly high up
>> |in the stack, so it's not likely the cause either.
>> |
>> |Did you configure the timeouts to be very high, by chance? I'm finding it
>> |difficult to fathom how the node can withdraw without being fenced, short
>> |of cleanly stopping the cluster stack. I suspect there is something
>> |important not being said, which the configuration information, versions,
>> |and logs will hopefully expose.
>> |
>> |--
>> |Digimer
>> |E-Mail:              digimer at alteeve.com
>> |Freenode handle:     digimer
>> |Papers and Projects: http://alteeve.com
>> |Node Assassin:       http://nodeassassin.org
>> |"omg my singularity battery is dead again.
>> |stupid hawking radiation." - epitron
>> |
>>
>
>
>