[Linux-cluster] Corosync fails to start using cman

David david at adurotec.com
Tue Aug 2 15:35:29 UTC 2011


Here is my cluster.conf:

<?xml version="1.0"?>
<cluster config_version="33" name="GFSpfsCluster">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="pfs03.ns.gfs2.us" nodeid="1" votes="1">
      <fence>
        <method name="single">
          <device name="pfs03.ns.us.ctidata.net_vmware"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pfs04.ns.gfs2.us" nodeid="2" votes="1">
      <fence>
        <method name="single">
          <device name="pfs04.ns.us.ctidata.net_vmware"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pfs05.ns.gfs2.us" nodeid="3" votes="1">
      <fence>
        <method name="single">
          <device name="pfs05.ns.us.ctidata.net_vmware"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_vmware" ipaddr="10.50.6.20"
        login="administrator" name="pfs03.ns.us.ctidata.net_vmware"
        passwd="secret" port="pfs03.ns.us.ctidata.net"/>
    <fencedevice agent="fence_vmware" ipaddr="10.50.6.20"
        login="administrator" name="pfs04.ns.us.ctidata.net_vmware"
        passwd="secret" port="pfs04.ns.us.ctidata.net"/>
    <fencedevice agent="fence_vmware" ipaddr="10.50.6.20"
        login="administrator" name="pfs05.ns.us.ctidata.net_vmware"
        passwd="secret" port="pfs05.ns.us.ctidata.net"/>
  </fencedevices>
  <rm>
    <resources>
      <script file="/etc/init.d/httpd" name="httpd"/>
    </resources>
    <failoverdomains>
      <failoverdomain name="pfs03_only" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="pfs03.ns.gfs2.us" priority="1"/>
      </failoverdomain>
      <failoverdomain name="pfs04_only" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="pfs04.ns.gfs2.us" priority="1"/>
      </failoverdomain>
      <failoverdomain name="pfs05_only" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="pfs05.ns.gfs2.us" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <service autostart="1" domain="pfs03_only" exclusive="0"
        name="pfs03_apache" recovery="restart">
      <script ref="httpd"/>
    </service>
    <service autostart="1" domain="pfs04_only" exclusive="0"
        name="pfs04_apache" recovery="restart">
      <script ref="httpd"/>
    </service>
<service autostart="1" domain="pfs05_only" exclusive="0" 
name="pfs05_apache" recovery="restart">
</service>
  </rm>
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <cman/>
</cluster>
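
(To rule out the file itself, the cman tooling on CentOS 6 ships a schema
check; a quick sketch, noting that the -f form for validating a copy before
distributing it is from memory, so treat it as an assumption:

# validate the active /etc/cluster/cluster.conf against the cluster schema
ccs_config_validate
# validate a candidate copy first, assuming this build supports -f
ccs_config_validate -f /tmp/cluster.conf.new

The corosync log below does say the cman config parsed successfully, so this
is mostly to rule out schema drift after edits.)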

uname -n = pfs05.ns.us.ctidata.net

As I am sure you will notice, cluster.conf has the node name set to
pfs05.ns.gfs2.us while the hostname is set to pfs05.ns.us.ctidata.net.
This was working previously, is still working on the other two nodes, and
is configured this way so that the cluster uses a private vlan set up
specifically for cluster communications.
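
(One common way to pin the gfs2.us names onto that private vlan, if they are
not served by DNS, is an /etc/hosts entry on every node; a sketch for this
node, using the 10.50.3.70 address listed below:

10.50.3.70   pfs05.ns.gfs2.us
# pfs03 and pfs04 carry analogous entries on the same /27; exact addresses not shown here
)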

The network is set up as follows:

eth0 = 10.50.10.32/24 - the production traffic interface
eth1 = 10.50.20.32/24 - the interface used for iSCSI connections to our SAN
eth2 = 10.50.6.32/24 - the interface set up for FreeIPA-authenticated ssh
access in from our mgmt vlan
eth3 = 10.50.1.32/24 - a legacy interface used during the transition from
the old environment to this new one
eth4 = 10.50.3.70/27 - the interface that pfs05.ns.gfs2.us resolves to,
used for cluster communications
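
To double-check that the failing node still resolves its cluster name onto
eth4, something like:

getent hosts pfs05.ns.gfs2.us   # expect 10.50.3.70
ip addr show eth4               # expect 10.50.3.70/27 on this interface

which matches the 10.50.3.70 that corosync reports binding in the log below.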

David



On 08/01/2011 08:56 PM, Digimer wrote:
> On 08/01/2011 09:50 PM, David wrote:
>> I have the RHCS installed on CentOS6 x86_64.
>>
>> One of the nodes in a 3 node cluster won't start after I moved the nodes
>> to a new vlan.
>>
>> When I start cman this is what I get:
>>
>> Starting cluster:
>>     Checking Network Manager...                             [  OK  ]
>>     Global setup...                                         [  OK  ]
>>     Loading kernel modules...                               [  OK  ]
>>     Mounting configfs...                                    [  OK  ]
>>     Starting cman... Aug 02 01:45:17 corosync [MAIN  ] Corosync Cluster
>> Engine ('1.2.3'): started and ready to provide service.
>> Aug 02 01:45:17 corosync [MAIN  ] Corosync built-in features: nss rdma
>> Aug 02 01:45:17 corosync [MAIN  ] Successfully read config from
>> /etc/cluster/cluster.conf
>> Aug 02 01:45:17 corosync [MAIN  ] Successfully parsed cman config
>> Aug 02 01:45:17 corosync [TOTEM ] Token Timeout (10000 ms) retransmit
>> timeout (2380 ms)
>> Aug 02 01:45:17 corosync [TOTEM ] token hold (1894 ms) retransmits
>> before loss (4 retrans)
>> Aug 02 01:45:17 corosync [TOTEM ] join (60 ms) send_join (0 ms)
>> consensus (12000 ms) merge (200 ms)
>> Aug 02 01:45:17 corosync [TOTEM ] downcheck (1000 ms) fail to recv const
>> (2500 msgs)
>> Aug 02 01:45:17 corosync [TOTEM ] seqno unchanged const (30 rotations)
>> Maximum network MTU 1402
>> Aug 02 01:45:17 corosync [TOTEM ] window size per rotation (50 messages)
>> maximum messages per rotation (17 messages)
>> Aug 02 01:45:17 corosync [TOTEM ] missed count const (5 messages)
>> Aug 02 01:45:17 corosync [TOTEM ] send threads (0 threads)
>> Aug 02 01:45:17 corosync [TOTEM ] RRP token expired timeout (2380 ms)
>> Aug 02 01:45:17 corosync [TOTEM ] RRP token problem counter (2000 ms)
>> Aug 02 01:45:17 corosync [TOTEM ] RRP threshold (10 problem count)
>> Aug 02 01:45:17 corosync [TOTEM ] RRP mode set to none.
>> Aug 02 01:45:17 corosync [TOTEM ] heartbeat_failures_allowed (0)
>> Aug 02 01:45:17 corosync [TOTEM ] max_network_delay (50 ms)
>> Aug 02 01:45:17 corosync [TOTEM ] HeartBeat is Disabled. To enable set
>> heartbeat_failures_allowed > 0
>> Aug 02 01:45:17 corosync [TOTEM ] Initializing transport (UDP/IP).
>> Aug 02 01:45:17 corosync [TOTEM ] Initializing transmit/receive
>> security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
>> Aug 02 01:45:17 corosync [IPC   ] you are using ipc api v2
>> Aug 02 01:45:18 corosync [TOTEM ] Receive multicast socket recv buffer
>> size (262142 bytes).
>> Aug 02 01:45:18 corosync [TOTEM ] Transmit multicast socket send buffer
>> size (262142 bytes).
>> corosync: totemsrp.c:3091: memb_ring_id_create_or_load: Assertion `res
>> == sizeof (unsigned long long)' failed.
>> Aug 02 01:45:18 corosync [TOTEM ] The network interface [10.50.3.70] is
>> now up.
>> corosync died with signal: 6 Check cluster logs for details
>>
>>
>> Any idea what the issue could be?
>>
>> Thanks
>> David
> What is your cluster.conf file (please obscure passwords only), what
> does `uname -n` return and what is your network configuration (interface
> names and IPs)?
>
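
For the archives: the failed assertion in memb_ring_id_create_or_load fires
when corosync reads its stored ring id back and gets something other than 8
bytes, which typically points at a truncated or zero-length ring id file
left behind by an unclean shutdown. A sketch of the check and cleanup,
assuming the default /var/lib/corosync state directory and the
ringid_<address> naming used by corosync 1.x:

# look for a zero-length ring id file for the cluster address
ls -l /var/lib/corosync/
# with cman stopped, remove the stale file; corosync recreates it on start
rm /var/lib/corosync/ringid_10.50.3.70
service cman start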