From neale at sinenomine.net  Thu Oct 2 19:30:03 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Thu, 2 Oct 2014 19:30:03 +0000
Subject: [Linux-cluster] Fencing of node
Message-ID: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net>

After creating a simple two-node cluster, one node is being fenced continually. I'm running pacemaker (1.1.10-29) with two nodes and the following corosync.conf:

totem {
  version: 2
  secauth: off
  cluster_name: rh7cluster
  transport: udpu
}

nodelist {
  node {
    ring0_addr: rh7cn1.devlab.sinenomine.net
    nodeid: 1
  }
  node {
    ring0_addr: rh7cn2.devlab.sinenomine.net
    nodeid: 2
  }
}

quorum {
  provider: corosync_votequorum
  two_node: 1
}

logging {
  to_syslog: yes
}

Starting the cluster shows the following in the logs of both nodes:

Oct 2 15:17:47 rh7cn1 kernel: dlm: connect from non cluster node

Both nodes then try to bring up resources (dlm, clvmd, and a cluster fs). Just prior to a node being fenced, both nodes show the following:

# pcs resource show
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     clvmd  (ocf::heartbeat:clvm):  FAILED
     Started: [ rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]

Shortly after, there is a clvmd timeout message in one of the logs and then that node gets fenced. I had added the high-availability firewalld service to both nodes.

Running crm_simulate -SL -VV shows:

warning: unpack_rsc_op: Processing failed op start for clvmd:1 on rh7cn1.devlab.sinenomine.net: unknown error (1)

Current cluster status:
Online: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]

 ZVMPOWER  (stonith:fence_zvm):  Started rh7cn2.devlab.sinenomine.net
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     clvmd  (ocf::heartbeat:clvm):  FAILED rh7cn1.devlab.sinenomine.net
     Started: [ rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]

warning: common_apply_stickiness: Forcing clvmd-clone away from rh7cn1.devlab.sinenomine.net after 1000000 failures (max=1000000)
warning: common_apply_stickiness: Forcing clvmd-clone away from rh7cn1.devlab.sinenomine.net after 1000000 failures (max=1000000)

Transition Summary:
 * Stop    clvmd:1  (rh7cn1.devlab.sinenomine.net)

Executing cluster transition:
 * Pseudo action:   clvmd-clone_stop_0
 * Resource action: clvmd           stop on rh7cn1.devlab.sinenomine.net
 * Pseudo action:   clvmd-clone_stopped_0
 * Pseudo action:   all_stopped

Revised cluster status:
warning: unpack_rsc_op: Processing failed op start for clvmd:1 on rh7cn1.devlab.sinenomine.net: unknown error (1)
Online: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]

 ZVMPOWER  (stonith:fence_zvm):  Started rh7cn2.devlab.sinenomine.net
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]

With RHEL 6 I would use a qdisk, but this has been replaced by corosync_votequorum. This is my first RHEL 7 HA cluster so I'm at the beginning of my learning. Any pointers as to what I should look at or what I need to read?

Neale
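[Editorial note: for anyone retracing this setup, the usual first sanity checks on a two-node votequorum cluster are the commands below. These are standard RHEL 7 tools, but the exact output will differ per site; treat this as a sketch, not a transcript from Neale's nodes:]

  # corosync-quorumtool -s     # should report the 2Node and WaitForAll flags and 2 total votes
  # pcs stonith show --full    # lists the configured fence devices and their options
  # pcs status --full          # node, resource and failed-action overview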
From neale at sinenomine.net  Thu Oct 2 19:44:24 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Thu, 2 Oct 2014 19:44:24 +0000
Subject: [Linux-cluster] Fencing of node
In-Reply-To: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net>
References: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net>
Message-ID: <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net>

Forgot to include cib.xml:

From ccaulfie at redhat.com  Fri Oct 3 07:34:49 2014
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Fri, 03 Oct 2014 08:34:49 +0100
Subject: [Linux-cluster] Fencing of node
In-Reply-To: <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net>
References: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net> <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net>
Message-ID: <542E5199.3000901@redhat.com>

I think you're hitting this bug:

https://www.redhat.com/archives/cluster-devel/2014-September/msg00031.html

The fix is in git, but no packages are available yet, sadly.

Chrissie

On 02/10/14 20:44, Neale Ferguson wrote:
> Forgot to include cib.xml:
From daniel.dehennin at baby-gnu.org  Fri Oct 3 14:35:36 2014
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Fri, 03 Oct 2014 16:35:36 +0200
Subject: [Linux-cluster] cLVM unusable on quorated cluster
Message-ID: <87egupfcg7.fsf@hati.baby-gnu.org>

Hello,

I'm trying to set up pacemaker+corosync on Debian Wheezy to access a SAN for an OpenNebula cluster.

As I'm new to the cluster world, I have a hard time figuring out why things sometimes go really wrong and where I must look to find answers.

My OpenNebula frontend, running in a VM, does not manage to run the resources and my syslog has a lot of:

#+begin_src
ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
#+end_src

When this happens, other nodes have problems:

#+begin_src
root at nebula3:~# LANG=C vgscan
  cluster request failed: Host is down
  Unable to obtain global lock.
#+end_src

But things look fine in "crm_mon":

#+begin_src
root at nebula3:~# crm_mon -1
============
Last updated: Fri Oct 3 16:25:43 2014
Last change: Fri Oct 3 14:51:59 2014 via cibadmin on nebula1
Stack: openais
Current DC: nebula3 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
5 Nodes configured, 5 expected votes
32 Resources configured.
============

Node quorum: standby
Online: [ nebula3 nebula2 nebula1 ]
OFFLINE: [ one ]

Stonith-nebula3-IPMILAN (stonith:external/ipmi): Started nebula2
Stonith-nebula2-IPMILAN (stonith:external/ipmi): Started nebula3
Stonith-nebula1-IPMILAN (stonith:external/ipmi): Started nebula2
Clone Set: ONE-Storage-Clone [ONE-Storage]
    Started: [ nebula1 nebula3 nebula2 ]
    Stopped: [ ONE-Storage:3 ONE-Storage:4 ]
Quorum-Node (ocf::heartbeat:VirtualDomain): Started nebula3
Stonith-Quorum-Node (stonith:external/libvirt): Started nebula3
#+end_src

I don't know how to interpret the dlm_tool information:

#+begin_src
root at nebula3:~# dlm_tool ls -n
dlm lockspaces
name          CCB10CE8D4FF489B9A2ECB288DACF2D7
id            0x09250e49
flags         0x00000008 fs_reg
change        member 3 joined 1 remove 0 failed 0 seq 2,2
members       1189587136 1206364352 1223141568
all nodes
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none

name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 3 joined 0 remove 1 failed 0 seq 4,4
members       1189587136 1206364352 1223141568
all nodes
nodeid 1172809920 member 0 failed 0 start 0 seq_add 3 seq_rem 4 check none
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
#+end_src

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dlm_tool-dump.txt
URL: 
-------------- next part --------------
Is there any documentation on troubleshooting DLM/cLVM?

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 
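[Editorial note: there is no single troubleshooting document for this; in practice the useful primitives are the dlm_controld state dumps and running clvmd with debug logging. A minimal sketch, using tool names that appear elsewhere in this thread; option syntax may differ slightly across versions:]

  # dlm_tool ls -n      # lockspace membership, as shown above
  # dlm_tool dump       # dlm_controld's internal debug buffer, useful after a failure
  # clvmd -d 1          # run clvmd with debug logging enabled to see lock traffic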
From lists at alteeve.ca  Fri Oct 3 14:38:14 2014
From: lists at alteeve.ca (Digimer)
Date: Fri, 03 Oct 2014 10:38:14 -0400
Subject: [Linux-cluster] cLVM unusable on quorated cluster
In-Reply-To: <87egupfcg7.fsf@hati.baby-gnu.org>
References: <87egupfcg7.fsf@hati.baby-gnu.org>
Message-ID: <542EB4D6.4030008@alteeve.ca>

On 03/10/14 10:35 AM, Daniel Dehennin wrote:
> Hello,
>
> I'm trying to set up pacemaker+corosync on Debian Wheezy to access a SAN
> for an OpenNebula cluster.
(snip)
> Is there any documentation on troubleshooting DLM/cLVM?
>
> Regards.

Can you paste your full pacemaker config and the logs from the other nodes starting just before the lost node went away?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From daniel.dehennin at baby-gnu.org  Fri Oct 3 15:05:23 2014
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Fri, 03 Oct 2014 17:05:23 +0200
Subject: [Linux-cluster] cLVM unusable on quorated cluster
In-Reply-To: <542EB4D6.4030008@alteeve.ca> (Digimer's message of "Fri, 03 Oct 2014 10:38:14 -0400")
References: <87egupfcg7.fsf@hati.baby-gnu.org> <542EB4D6.4030008@alteeve.ca>
Message-ID: <87a95dfb2k.fsf@hati.baby-gnu.org>

Digimer writes:

> Can you paste your full pacemaker config and the logs from the other
> nodes starting just before the lost node went away?

Sorry, I forgot to attach it:

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pcmk.conf
URL: 

Here are the logs on the 3 hypervisors; note that pacemaker does not start at boot time:

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nebula1.log
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nebula2.log
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nebula3.log
URL: 
-------------- next part --------------
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 
From neale at sinenomine.net  Fri Oct 3 15:14:44 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Fri, 3 Oct 2014 15:14:44 +0000
Subject: [Linux-cluster] Fencing of node
In-Reply-To: <542E5199.3000901@redhat.com>
References: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net> <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net> <542E5199.3000901@redhat.com>
Message-ID: <616EF409-A374-4BD1-8418-012B3756F56C@sinenomine.net>

That was the problem! I applied a local patch, rebuilt, restarted, and we're up fine and dandy!

Thanks very much... Neale

On Oct 3, 2014, at 3:34 AM, Christine Caulfield wrote:

> I think you're hitting this bug:
>
> https://www.redhat.com/archives/cluster-devel/2014-September/msg00031.html
>
> The fix is in git, but no packages are available yet, sadly.
>
> Chrissie

From ccaulfie at redhat.com  Fri Oct 3 15:29:56 2014
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Fri, 03 Oct 2014 16:29:56 +0100
Subject: [Linux-cluster] Fencing of node
In-Reply-To: <616EF409-A374-4BD1-8418-012B3756F56C@sinenomine.net>
References: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net> <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net> <542E5199.3000901@redhat.com> <616EF409-A374-4BD1-8418-012B3756F56C@sinenomine.net>
Message-ID: <542EC0F4.7050202@redhat.com>

Great! I'm pleased to hear it :-)

Chrissie

On 03/10/14 16:14, Neale Ferguson wrote:
> That was the problem! I applied a local patch, rebuilt, restarted, and we're up fine and dandy!
>
> Thanks very much... Neale
(snip)

From manish631 at rediffmail.com  Fri Oct 3 16:57:04 2014
From: manish631 at rediffmail.com (manish vaidya)
Date: 3 Oct 2014 16:57:04 -0000
Subject: [Linux-cluster] Linux-cluster Digest, Vol 124, Issue 7
In-Reply-To:
Message-ID: <1409414978.S.13239.Z.15124.F.H.TmxpbnV4LWNsdXN0ZXItcmVxdWVzdEByZWRoYXQuYwBMaW51eC1jbHVzdGVyIERpZ2VzdCwgVm9sIDEyNCw_.RU.rfs294, rfs294, 408, 370.f5-224-163.old.1412355424.6200@webmail.rediffmail.com>

First, I apologise for the late reply; the delay was because I couldn't believe I'd get any response from the site. I am a newcomer and had already posted this problem on many online forums, but they didn't give any response. Thank you all for taking my problem seriously.

** Response from you:

are you using clvmd? if your answer is = yes, you need to be sure, you pv is visibile to your cluster nodes

*** I am using clvmd, and when I use the pvscan command the cluster hangs.

I want to reproduce this situation again for perfection: when I try to run the pvcreate command in the cluster, the messages "lock from node2" and "lock from node3" should appear. I have created a new cluster and it is working fine. How do I do this? Is there any setting in lvm.conf?
On Sat, 30 Aug 2014 21:39:38 +0530, linux-cluster-request at redhat.com wrote:

>Send Linux-cluster mailing list submissions to
>	linux-cluster at redhat.com
>
>To subscribe or unsubscribe via the World Wide Web, visit
>	https://www.redhat.com/mailman/listinfo/linux-cluster
>or, via email, send a message with subject or body 'help' to
>	linux-cluster-request at redhat.com
>
>You can reach the person managing the list at
>	linux-cluster-owner at redhat.com
>
>When replying, please edit your Subject line so it is more specific
>than "Re: Contents of Linux-cluster digest..."
>
>Today's Topics:
>
>   1. Please help me on cluster error (manish vaidya)
>   2. Re: Please help me on cluster error (emmanuel segura)
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: 30 Aug 2014 14:12:42 -0000
>From: "manish vaidya"
>To:
>Subject: [Linux-cluster] Please help me on cluster error
>Message-ID:
>Content-Type: text/plain; charset="utf-8"
>
>I created a four-node cluster in a KVM environment, but I faced an error
>when creating a new pv, such as pvcreate /dev/sdb1; I got the errors
>"lock from node 2" and "lock from node3".
>
>There are also strange cluster logs:
>
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e 5f
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5f 60
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 61
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 63 64
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 69 6a
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 78
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 84 85
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 9a 9b
>
>Please help me on this issue
>
>-------------- next part --------------
>An HTML attachment was scrubbed...
>URL:
>
>------------------------------
>
>Message: 2
>Date: Sat, 30 Aug 2014 16:53:08 +0200
>From: emmanuel segura
>To: linux clustering
>Subject: Re: [Linux-cluster] Please help me on cluster error
>Message-ID:
>Content-Type: text/plain; charset="utf-8"
>
>are you using clvmd? if your answer is = yes, you need to be sure, you pv
>is visibile to your cluster nodes
>
>2014-08-30 16:12 GMT+02:00 manish vaidya:
>> I created a four-node cluster in a KVM environment, but I faced an error
>> when creating a new pv, such as pvcreate /dev/sdb1; I got the errors
>> "lock from node 2" and "lock from node3".
>(snip)
>> Please help me on this issue
>
>--
>esta es mi vida e me la vivo hasta que dios quiera
>
>-------------- next part --------------
>An HTML attachment was scrubbed...
>URL:
>
>------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>End of Linux-cluster Digest, Vol 124, Issue 7
>*********************************************

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manish631 at rediffmail.com  Fri Oct 3 17:03:15 2014
From: manish631 at rediffmail.com (manish vaidya)
Date: 3 Oct 2014 17:03:15 -0000
Subject: [Linux-cluster] Linux-cluster Digest, Vol 124, Issue 8
In-Reply-To:
Message-ID: <1409501135.S.10373.25687.F.H.TmxpbnV4LWNsdXN0ZXItcmVxdWVzdEByZWRoYXQuYwBMaW51eC1jbHVzdGVyIERpZ2VzdCwgVm9sIDEyNCw_.RU.rfs294, rfs294, 441, 91.f5-224-149.old.1412355795.30029@webmail.rediffmail.com>

First, I apologise for the late reply; the delay was because I couldn't believe I'd get any response from the site. I am a newcomer and had already posted this problem on many online forums, but they didn't give any response. Thank you all for taking my problem seriously.

** Currently using Red Hat version 6.5.

I have created a new cluster and it is working fine, but I want to recreate this situation for proper understanding: when using the pvcreate command, the messages "lock from node2" and "lock from node3" should appear. How do I do this?

On Sun, 31 Aug 2014 21:35:35 +0530, linux-cluster-request at redhat.com wrote:

>Send Linux-cluster mailing list submissions to
>	linux-cluster at redhat.com
>
>To subscribe or unsubscribe via the World Wide Web, visit
>	https://www.redhat.com/mailman/listinfo/linux-cluster
>or, via email, send a message with subject or body 'help' to
>	linux-cluster-request at redhat.com
>
>You can reach the person managing the list at
>	linux-cluster-owner at redhat.com
>
>When replying, please edit your Subject line so it is more specific
>than "Re: Contents of Linux-cluster digest..."
>
>Today's Topics:
>
>   1. Re: Please help me on cluster error (Digimer)
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Sat, 30 Aug 2014 12:35:52 -0400
>From: Digimer
>To: linux clustering
>Subject: Re: [Linux-cluster] Please help me on cluster error
>Message-ID:
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>Can you share your cluster information please?
>
>This could be a network problem, as the messages below happen when the
>network between the nodes isn't fast enough or has too long latency and
>cluster traffic is considered lost and re-requested.
>
>If you don't have fencing working properly, and if a network issue
>caused a node to be declared lost, clustered LVM (and anything else
>using cluster locking) will fail (by design).
>
>If you share your configuration and more of your logs, it will help us
>understand what is happening. Please also tell us what version of the
>cluster software you're using.
>
>digimer
>
>On 30/08/14 10:12 AM, manish vaidya wrote:
>> I created a four-node cluster in a KVM environment, but I faced an error
>> when creating a new pv, such as pvcreate /dev/sdb1; I got the errors
>> "lock from node 2" and "lock from node3".
>(snip)
>> Please help me on this issue
>
>--
>Digimer
>Papers and Projects: https://alteeve.ca/w/
>What if the cure for cancer is trapped in the mind of a person without
>access to education?
>
>------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>End of Linux-cluster Digest, Vol 124, Issue 8
>*********************************************

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lists at alteeve.ca  Fri Oct 3 17:56:57 2014
From: lists at alteeve.ca (Digimer)
Date: Fri, 03 Oct 2014 13:56:57 -0400
Subject: [Linux-cluster] clvmd issues
In-Reply-To: <1409414978.S.13239.Z.15124.F.H.TmxpbnV4LWNsdXN0ZXItcmVxdWVzdEByZWRoYXQuYwBMaW51eC1jbHVzdGVyIERpZ2VzdCwgVm9sIDEyNCw_.RU.rfs294, rfs294, 408, 370.f5-224-163.old.1412355424.6200@webmail.rediffmail.com>
References: <1409414978.S.13239.Z.15124.F.H.TmxpbnV4LWNsdXN0ZXItcmVxdWVzdEByZWRoYXQuYwBMaW51eC1jbHVzdGVyIERpZ2VzdCwgVm9sIDEyNCw_.RU.rfs294, rfs294, 408, 370.f5-224-163.old.1412355424.6200@webmail.rediffmail.com>
Message-ID: <542EE369.1080102@alteeve.ca>

On 03/10/14 12:57 PM, manish vaidya wrote:
> First, I apologise for the late reply; the delay was because I couldn't
> believe I'd get any response from the site.
(snip)
> *** I am using clvmd, and when I use the pvscan command the cluster hangs.
>
> I want to reproduce this situation again for perfection: when I try to run
> the pvcreate command in the cluster, the messages "lock from node2" and
> "lock from node3" should appear. I have created a new cluster and it is
> working fine. How do I do this? Is there any setting in lvm.conf?

Can you share your setup please? What kind of cluster? What version? What is the configuration file? Was there anything interesting in the system logs? etc.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
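[Editorial note: on the lvm.conf question above, the supported way to switch LVM to clustered locking on RHEL 6 is lvmconf --enable-cluster, which amounts to roughly the following settings in /etc/lvm/lvm.conf. This is a sketch of the commonly recommended values, not an excerpt from manish's configuration:]

  locking_type = 3                 # route LVM locking through clvmd/DLM
  fallback_to_local_locking = 0    # fail hard instead of quietly using local locks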
From neale at sinenomine.net  Fri Oct 3 19:32:34 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Fri, 3 Oct 2014 19:32:34 +0000
Subject: [Linux-cluster] gfs2 resource not mounting
Message-ID: <0339B794-F2BA-4BC7-88BB-E6016B3DD2CB@sinenomine.net>

Using the same two-node configuration I described in an earlier post to this forum, I'm having problems getting a gfs2 resource started on one of the nodes. The resource in question:

 Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime
  Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)
              stop interval=0s timeout=60 (clusterfs-stop-timeout-60)
              monitor interval=10s on-fail=fence (clusterfs-monitor-interval-10s)

pcs status shows:

 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn1.devlab.sinenomine.net ]
     Stopped: [ rh7cn2.devlab.sinenomine.net ]

Failed actions:
    clusterfs_start_0 on rh7cn2.devlab.sinenomine.net 'unknown error' (1): call=46, status=complete, last-rc-change='Fri Oct 3 14:41:26 2014', queued=4702ms, exec=0ms

Using pcs resource debug-start I see:

Operation start for clusterfs:0 (ocf:heartbeat:Filesystem) returned 1
 >  stderr: INFO: Running start for /dev/vg_cluster/ha_lv on /mnt/gfs2-demo
 >  stderr: mount: permission denied
 >  stderr: ERROR: Couldn't mount filesystem /dev/vg_cluster/ha_lv on /mnt/gfs2-demo

The log on the node shows:

Oct 3 14:57:37 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
Oct 3 14:57:38 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
Oct 3 14:57:38 rh7cn2 dlm_controld[5857]: 1564 cpg_dispatch error 9

On the other node:

Oct 3 15:09:47 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 14 done
Oct 3 15:09:48 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 15 done

I'm assuming I didn't define the gfs2 resource such that it could be used concurrently by both nodes. Here's the cib.xml definition for it:

-------------------------------

Unrelated (I believe) to the above, I also note the following messages in /var/log/messages which appear to be related to pacemaker and http (another resource I have defined):

Oct 3 15:05:06 rh7cn2 systemd: pacemaker.service: Got notification message from PID 6036, but reception only permitted for PID 5575

I'm running systemd-208-11.el7_0.2. A bugzilla search matches one report, but the fix was put into -11.

Neale
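[Editorial note: a cloned GFS2 Filesystem resource needs ordering and colocation against clvmd and dlm so the stack starts in the right sequence on every node. A sketch of the usual RHEL 7 constraints, using the resource names from this thread; the interleave options on the clones are assumed to be set already. If these constraints are already in place the failed action above points elsewhere, but it is a cheap thing to rule out:]

  # pcs constraint order start dlm-clone then clvmd-clone
  # pcs constraint colocation add clvmd-clone with dlm-clone
  # pcs constraint order start clvmd-clone then clusterfs-clone
  # pcs constraint colocation add clusterfs-clone with clvmd-clone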
From neale at sinenomine.net  Mon Oct 6 02:30:51 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 6 Oct 2014 02:30:51 +0000
Subject: [Linux-cluster] gfs2 resource not mounting
In-Reply-To: <0339B794-F2BA-4BC7-88BB-E6016B3DD2CB@sinenomine.net>
References: <0339B794-F2BA-4BC7-88BB-E6016B3DD2CB@sinenomine.net>
Message-ID: <5025AC66-8F86-49D0-B223-ACD9B2E428CC@sinenomine.net>

I found the problem. It was a configuration error I made when I modified the gfs2 resource. Everything is working correctly now. If I want to change a setting like the token timeout, do I simply edit corosync.conf and sync the changes, or can I use the pcs cluster setup command to modify an existing configuration?

Neale

On Oct 3, 2014, at 3:32 PM, Neale Ferguson wrote:

> Using the same two-node configuration I described in an earlier post to this forum, I'm having problems getting a gfs2 resource started on one of the nodes. The resource in question:
>
>  Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
>   Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime
>   Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)
>               stop interval=0s timeout=60 (clusterfs-stop-timeout-60)
>               monitor interval=10s on-fail=fence (clusterfs-monitor-interval-10s)

From bubble at hoster-ok.com  Mon Oct 6 05:28:04 2014
From: bubble at hoster-ok.com (Vladislav Bogdanov)
Date: Mon, 06 Oct 2014 08:28:04 +0300
Subject: [Linux-cluster] cLVM unusable on quorated cluster
In-Reply-To: <87a95dfb2k.fsf@hati.baby-gnu.org>
References: <87egupfcg7.fsf@hati.baby-gnu.org> <542EB4D6.4030008@alteeve.ca> <87a95dfb2k.fsf@hati.baby-gnu.org>
Message-ID: <54322864.5040706@hoster-ok.com>

03.10.2014 18:05, Daniel Dehennin wrote:

I'd recommend making sure that:

1. clvmd runs in 'corosync' mode, not 'openais' (controlled by the -I command-line switch), because otherwise it uses the buggy LCK AIS service instead of the well-tested CPG+DLM.

2. You have a recent enough version of lvm2. 2.02.102 should be OK; you need git commit 431eda6 (https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=431eda63cc0ebff7c62dacb313cabcffbda6573a), introduced somewhere between 2.02.99 and 2.02.102. I didn't test that commit with corosync 1, but it should work there as well.

Hope this helps,
Vladislav

> Digimer writes:
>
>> Can you paste your full pacemaker config and the logs from the other
>> nodes starting just before the lost node went away?
>
> Sorry, I forgot to attach it:
>
> Here are the logs on the 3 hypervisors; note that pacemaker does not start at boot time:
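[Editorial note: a quick way to verify which cluster interface a running clvmd is actually using, per Vladislav's point 1. A sketch; init-script integration differs between Debian and RHEL, so where the -I flag gets set varies:]

  # ps -o args= -C clvmd      # shows the running command line and any -I flag
  # clvmd -V                  # reports the clvmd and LVM library versions
  # clvmd -I corosync -d 1    # foreground restart in corosync mode with debug output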
From neale at sinenomine.net  Mon Oct 13 15:20:05 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 15:20:05 +0000
Subject: [Linux-cluster] Permission denied
Message-ID:

I reported last week that I was getting "permission denied" when pcs was starting a gfs2 resource. I thought it was due to the resource being defined incorrectly, but that doesn't appear to be the case. On rare occasions the mount works, but most of the time one node gets it mounted and the other gets denied. I've enabled a number of logging options and done straces on both sides, but I'm not getting anywhere.

My cluster looks like:

# pcs resource show
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Resource Group: apachegroup
     VirtualIP  (ocf::heartbeat:IPaddr2):  Started
     Website    (ocf::heartbeat:apache):   Started
     httplvm    (ocf::heartbeat:LVM):      Started
     http_fs    (ocf::heartbeat:Filesystem):  Started
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn1.devlab.sinenomine.net ]
     Stopped: [ rh7cn2.devlab.sinenomine.net ]

The gfs2 resource is defined:

# pcs resource show clusterfs
 Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime
  Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)
              stop interval=0s timeout=60 (clusterfs-stop-timeout-60)
              monitor interval=10s on-fail=fence (clusterfs-monitor-interval-10s)

When the mount is attempted on node 2 the log contains:

Oct 13 11:10:42 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
Oct 13 11:10:42 rh7cn2 corosync[47978]: [QB ] ipc_setup.c:handle_new_connection:485 IPC credentials authenticated (47978-48271-30)
Oct 13 11:10:42 rh7cn2 corosync[47978]: [QB ] ipc_shm.c:qb_ipcs_shm_connect:294 connecting to client [48271]
Oct 13 11:10:42 rh7cn2 corosync[47978]: [QB ] ringbuffer.c:qb_rb_open_2:236 shm size:1048589; real_size:1052672; rb->word_size:263168
Oct 13 11:10:42 rh7cn2 corosync[47978]: message repeated 2 times: [[QB ] ringbuffer.c:qb_rb_open_2:236 shm size:1048589; real_size:1052672; rb->word_size:263168]
Oct 13 11:10:42 rh7cn2 corosync[47978]: [MAIN ] ipc_glue.c:cs_ipcs_connection_created:272 connection created
Oct 13 11:10:42 rh7cn2 corosync[47978]: [CPG ] cpg.c:cpg_lib_init_fn:1532 lib_init_fn: conn=0x2ab16a953a0, cpd=0x2ab16a95a64
Oct 13 11:10:42 rh7cn2 corosync[47978]: [CPG ] cpg.c:message_handler_req_exec_cpg_procjoin:1349 got procjoin message from cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
Oct 13 11:10:43 rh7cn2 corosync[47978]: [CPG ] cpg.c:message_handler_req_lib_cpg_leave:1617 got leave request on 0x2ab16a953a0
Oct 13 11:10:43 rh7cn2 corosync[47978]: [CPG ] cpg.c:message_handler_req_exec_cpg_procleave:1365 got procleave message from cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn2 corosync[47978]: [CPG ] cpg.c:message_handler_req_lib_cpg_finalize:1655 cpg finalize for conn=0x2ab16a953a0
Oct 13 11:10:43 rh7cn2 dlm_controld[48271]: 251492 cpg_dispatch error 9

Is the "leave request" symptomatic or causal? If the latter, why is it generated?
On the other side:

Oct 13 11:10:41 rh7cn1 corosync[10423]: [QUORUM] vsf_quorum.c:message_handler_req_lib_quorum_getquorate:395 got quorate request on 0x2ab0e33c8b0
Oct 13 11:10:41 rh7cn1 corosync[10423]: [QUORUM] vsf_quorum.c:message_handler_req_lib_quorum_getquorate:395 got quorate request on 0x2ab0e33c8b0
Oct 13 11:10:42 rh7cn1 corosync[10423]: [CPG ] cpg.c:message_handler_req_exec_cpg_procjoin:1349 got procjoin message from cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 6 done
Oct 13 11:10:43 rh7cn1 corosync[10423]: [CPG ] cpg.c:message_handler_req_exec_cpg_procleave:1365 got procleave message from cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 7 done

dlm_tool dump shows:

251469 dlm:ls:vol1 conf 2 1 0 memb 1 2 join 2 left
251469 vol1 add_change cg 6 joined nodeid 2
251469 vol1 add_change cg 6 counts member 2 joined 1 remove 0 failed 0
251469 vol1 stop_kernel cg 6
251469 write "0" to "/sys/kernel/dlm/vol1/control"
251469 vol1 check_ringid done cluster 43280 cpg 1:43280
251469 vol1 check_fencing done
251469 vol1 send_start 1:6 counts 5 2 1 0 0
251469 vol1 receive_start 1:6 len 80
251469 vol1 match_change 1:6 matches cg 6
251469 vol1 wait_messages cg 6 need 1 of 2
251469 vol1 receive_start 2:1 len 80
251469 vol1 match_change 2:1 matches cg 6
251469 vol1 wait_messages cg 6 got all 2
251469 vol1 start_kernel cg 6 member_count 2
251469 dir_member 1
251469 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/vol1/nodes/2"
251469 write "1" to "/sys/kernel/dlm/vol1/control"
251469 vol1 prepare_plocks
251469 vol1 set_plock_data_node from 1 to 1
251469 vol1 send_all_plocks_data 1:6
251469 vol1 send_all_plocks_data 1:6 0 done
251469 vol1 send_plocks_done 1:6 counts 5 2 1 0 0 plocks_data 0
251469 vol1 receive_plocks_done 1:6 flags 2 plocks_data 0 need 0 save 0
251470 dlm:ls:vol1 conf 1 0 1 memb 1 join left 2
251470 vol1 add_change cg 7 remove nodeid 2 reason leave
251470 vol1 add_change cg 7 counts member 1 joined 0 remove 1 failed 0
251470 vol1 stop_kernel cg 7
251470 write "0" to "/sys/kernel/dlm/vol1/control"
251470 vol1 purged 0 plocks for 2
251470 vol1 check_ringid done cluster 43280 cpg 1:43280
251470 vol1 check_fencing done
251470 vol1 send_start 1:7 counts 6 1 0 1 0
251470 vol1 receive_start 1:7 len 76
251470 vol1 match_change 1:7 matches cg 7
251470 vol1 wait_messages cg 7 got all 1
251470 vol1 start_kernel cg 7 member_count 1
251470 dir_member 2
251470 dir_member 1
251470 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/vol1/nodes/2"
251470 write "1" to "/sys/kernel/dlm/vol1/control"
251470 vol1 prepare_plocks

I would appreciate any debugging suggestions. I've straced dlm_controld/corosync but not gained much clarity.

Neale

From neale at sinenomine.net  Mon Oct 13 15:33:57 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 15:33:57 +0000
Subject: [Linux-cluster] Permission denied
Message-ID:

Software levels:

pacemaker-1.1.10-29
pcs-0.9.115-32
dlm-4.0.2-4
corosync-2.3.3-2
lvm2-cluster-2.02.105-14

On 10/13/14, 11:20 AM, "Neale Ferguson" wrote:

>I reported last week that I was getting "permission denied" when pcs was
>starting a gfs2 resource. I thought it was due to the resource being
>defined incorrectly, but that doesn't appear to be the case. On rare
>occasions the mount works, but most of the time one node gets it mounted
>and the other gets denied. I've enabled a number of logging options and
>done straces on both sides, but I'm not getting anywhere.
From emi2fast at gmail.com  Mon Oct 13 15:52:36 2014
From: emi2fast at gmail.com (emmanuel segura)
Date: Mon, 13 Oct 2014 17:52:36 +0200
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References:
Message-ID:

have you configured the fencing?

2014-10-13 17:33 GMT+02:00 Neale Ferguson:
> Software levels:
>
> pacemaker-1.1.10-29
> pcs-0.9.115-32
> dlm-4.0.2-4
> corosync-2.3.3-2
> lvm2-cluster-2.02.105-14
(snip)
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
esta es mi vida e me la vivo hasta que dios quiera

From neale at sinenomine.net  Mon Oct 13 16:05:56 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 16:05:56 +0000
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References:
Message-ID:

Yep:

# pcs stonith show ZVMPOWER
 Resource: ZVMPOWER (class=stonith type=fence_zvm)
  Attributes: ipaddr=VSMREQIU pcmk_host_map=rh7cn1.devlab.sinenomine.net:RH7CN1;rh7cn2.devlab.sinenomine.net:RH7CN2 pcmk_host_list=rh7cn1.devlab.sinenomine.net;rh7cn2.devlab.sinenomine.net pcmk_host_check=static-list
  Operations: monitor interval=60s (ZVMPOWER-monitor-interval-60s)

I've verified its operation by causing fencing to be triggered.

On 10/13/14, 11:52 AM, "emmanuel segura" wrote:

>have you configured the fencing?

From rpeterso at redhat.com  Mon Oct 13 16:16:30 2014
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 13 Oct 2014 12:16:30 -0400 (EDT)
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References:
Message-ID: <1982114005.2362400.1413216990955.JavaMail.zimbra@redhat.com>

----- Original Message -----
> I would appreciate any debugging suggestions. I've straced
> dlm_controld/corosync but not gained much clarity.
>
> Neale

Hi Neale,

1. What does it say if you try to mount the GFS2 file system manually
   rather than from the configured service?
2. After the failure, what does dmesg on all the nodes show?
3. What kernel is this?

I would:
(1) Check to make sure the file system has enough journals for all nodes.
    You can do gfs2_edit -p journals <device>. If your version of gfs2-utils
    doesn't have that option, you can alternately do: gfs2_edit -p jindex <device>
    and see how many journals are in the index.
(2) Check to make sure the locking protocol is lock_dlm in the file system
    superblock. You can get that from gfs2_edit -p sb <device>.
(3) Check to make sure the cluster name in the file system superblock
    matches the configured cluster name. That's also in the superblock.

Regards,

Bob Peterson
Red Hat File Systems
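[Editorial note: concretely, Bob's checks against the device used earlier in this thread would look something like the following. The device path comes from the resource definition above; the commands are as Bob describes them:]

  # gfs2_edit -p jindex /dev/vg_cluster/ha_lv   # count the journalN entries
  # gfs2_edit -p sb /dev/vg_cluster/ha_lv       # shows sb_lockproto and sb_locktable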
From neale at sinenomine.net  Mon Oct 13 16:47:14 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 16:47:14 +0000
Subject: [Linux-cluster] Permission denied
Message-ID:

Thanks Bob, answers inline...

On 10/13/14, 12:16 PM, "Bob Peterson" wrote:

>----- Original Message -----
>> I would appreciate any debugging suggestions. I've straced
>> dlm_controld/corosync but not gained much clarity.
>>
>> Neale
>
>Hi Neale,
>
>1. What does it say if you try to mount the GFS2 file system manually
>   rather than from the configured service?

Permission denied. (I also used resource debug-start and that's the message it gets as well.) I disabled the resource and then tried mounting it manually, and I was successful once but not a second time. As I mentioned, on rare occasions both sides do mount on cluster start-up, which is worse than it never mounting!

>2. After the failure, what does dmesg on all the nodes show?

Node 1 -

[256184.632116] dlm: vol1: dlm_recover 15
[256184.633300] dlm: vol1: add member 2
[256184.636944] dlm: vol1: dlm_recover_members 2 nodes
[256184.664495] dlm: vol1: generation 8 slots 2 1:1 2:2
[256184.664531] dlm: vol1: dlm_recover_directory
[256184.668865] dlm: vol1: dlm_recover_directory 0 in 0 new
[256184.703328] dlm: vol1: dlm_recover_directory 10 out 1 messages
[256184.784404] dlm: vol1: dlm_recover 15 generation 8 done: 120 ms
[256184.785050] GFS2: fsid=rh7cluster:vol1.0: recover generation 8 done
[256185.375091] dlm: vol1: dlm_recover 17
[256185.375655] dlm: vol1: dlm_clear_toss 1 done
[256185.376263] dlm: vol1: remove member 2
[256185.376339] dlm: vol1: dlm_recover_members 1 nodes
[256185.376403] dlm: vol1: generation 9 slots 1 1:1
[256185.376430] dlm: vol1: dlm_recover_directory
[256185.376458] dlm: vol1: dlm_recover_directory 0 in 0 new
[256185.376490] dlm: vol1: dlm_recover_directory 0 out 0 messages
[256185.376638] dlm: vol1: dlm_recover_purge 6 locks for 1 nodes
[256185.376664] dlm: vol1: dlm_recover_masters
[256185.376714] dlm: vol1: dlm_recover_masters 0 of 26
[256185.376746] dlm: vol1: dlm_recover_locks 0 out
[256185.376778] dlm: vol1: dlm_recover_locks 0 in
[256185.376831] dlm: vol1: dlm_recover_rsbs 26 done
[256185.377444] dlm: vol1: dlm_recover 17 generation 9 done: 0 ms
[256185.377833] GFS2: fsid=rh7cluster:vol1.0: recover generation 9 done

Node 2 (failing) -

[256206.973005] GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
[256206.973105] GFS2: fsid=rh7cluster:vol1: In gdlm_mount
[256207.019743] dlm: vol1: joining the lockspace group...
[256207.169061] dlm: vol1: group event done 0 0
[256207.169135] dlm: vol1: dlm_recover 1
[256207.170735] dlm: vol1: add member 2
[256207.170822] dlm: vol1: add member 1
[256207.174493] dlm: vol1: dlm_recover_members 2 nodes
[256207.174798] dlm: vol1: join complete
[256207.205167] dlm: vol1: dlm_recover_directory
[256207.208924] dlm: vol1: dlm_recover_directory 10 in 10 new
[256207.245335] dlm: vol1: dlm_recover_directory 0 out 1 messages
[256207.329101] dlm: vol1: dlm_recover 1 generation 8 done: 120 ms
[256207.851390] GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
[256207.881216] dlm: vol1: leaving the lockspace group...
[256207.947479] dlm: vol1: group event done 0 0
[256207.949530] dlm: vol1: release_lockspace final free

>3. What kernel is this?
>
>I would:
>(1) Check to make sure the file system has enough journals for all nodes.
>    You can do gfs2_edit -p journals <device>. If your version of gfs2-utils
>    doesn't have that option, you can alternately do: gfs2_edit -p jindex <device>
>    and see how many journals are in the index.

3/3 [fc7745eb] 4/21 (0x4/0x15): File    journal0
4/4 [8b70757d] 5/4127 (0x5/0x101f): File    journal1

It was made via:

mkfs.gfs2 -j 2 -J 16 -r 32 -t rh7cluster:vol1 /dev/mapper/vg_cluster-ha_lv

>(2) Check to make sure the locking protocol is lock_dlm in the file system
>    superblock. You can get that from gfs2_edit -p sb <device>.

sb_lockproto          lock_dlm

>(3) Check to make sure the cluster name in the file system superblock
>    matches the configured cluster name. That's also in the superblock.

sb_locktable          rh7cluster:vol1

Strangely, while /etc/corosync/corosync.conf has the cluster name specified, pcs status reports it as blank:

# pcs status
Cluster name:
Last updated: Mon Oct 13 12:40:47 2014

-------------- next part --------------
A non-text attachment was scrubbed...
Name: default.xml
Type: application/xml
Size: 3101 bytes
Desc: default.xml
URL: 
From rpeterso at redhat.com  Mon Oct 13 17:58:46 2014
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 13 Oct 2014 13:58:46 -0400 (EDT)
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References:
Message-ID: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>

----- Original Message -----
(snip)
> >3. What kernel is this?

Make sure both nodes are running the same kernel, at any rate.

> It was made via:
>
> mkfs.gfs2 -j 2 -J 16 -r 32 -t rh7cluster:vol1 /dev/mapper/vg_cluster-ha_lv

Hm. This must be a small SSD device or embedded or something. That's a pretty non-standard journal size (and resource group size). I'm not worried about the resource group size of 32; that shouldn't be an issue. The journal size, on the other hand, is a little concerning.

Can you try with the standard 128MB journal size, just as an experiment, to see if it mounts more consistently or if you get the same error? Maybe GFS2's recovery code is sending an error back for some reason due to its size...

Regards,

Bob Peterson
Red Hat File Systems

From neale at sinenomine.net  Mon Oct 13 18:13:35 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 18:13:35 +0000
Subject: [Linux-cluster] Permission denied
In-Reply-To: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
Message-ID:

On 10/13/14, 1:58 PM, "Bob Peterson" wrote:

>----- Original Message -----
>(snip)
>> >3. What kernel is this?
>
>Make sure both nodes are running the same kernel, at any rate.

Both running 3.10.0-123.8.1.

>> It was made via:
>> mkfs.gfs2 -j 2 -J 16 -r 32 -t rh7cluster:vol1
>> /dev/mapper/vg_cluster-ha_lv
>
>Hm. This must be a small SSD device or embedded or something.
>That's a pretty non-standard journal size (and resource group size).
(snip)
>Can you try with the standard 128MB journal size, just as an experiment,
>to see if it mounts more consistently or if you get the same error?
>Maybe GFS2's recovery code is sending an error back for some reason
>due to its size...

Will do. It's just a demo system to verify the bits and pieces before rolling out something more serious. I did the same with the first cman system I built for RHEL 6, so I just used the same sizes for things.

Thanks again Bob
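[Editorial note: for reference, remaking the filesystem with the default journal size would be something like the line below; mkfs.gfs2 uses 128MB journals when -J is omitted, and of course this destroys the existing contents:]

  # mkfs.gfs2 -j 2 -r 32 -t rh7cluster:vol1 /dev/mapper/vg_cluster-ha_lv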
From neale at sinenomine.net  Mon Oct 13 18:50:55 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 18:50:55 +0000
Subject: [Linux-cluster] Permission denied
In-Reply-To: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
Message-ID:

On 10/13/14, 1:58 PM, "Bob Peterson" wrote:

>----- Original Message -----
>Can you try with the standard 128MB journal size, just as an experiment,
>to see if it mounts more consistently or if you get the same error?
>Maybe GFS2's recovery code is sending an error back for some reason
>due to its size...

Disabled the resource; remade the filesystem; re-enabled the resource. It mounted on both systems. I disabled/enabled again; it only mounted on one. dmesg from the failing node, showing the success followed by the failure:

[  469.968521] GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
[  469.968638] GFS2: fsid=rh7cluster:vol1: In gdlm_mount
[  469.979229] dlm: vol1: joining the lockspace group...
[  470.065511] dlm: vol1: group event done 0 0
[  470.065644] dlm: vol1: dlm_recover 1
[  470.066623] dlm: vol1: add member 2
[  470.066688] dlm: vol1: dlm_recover_members 1 nodes
[  470.066749] dlm: vol1: generation 1 slots 1 1:2
[  470.066787] dlm: vol1: dlm_recover_directory
[  470.066819] dlm: vol1: dlm_recover_directory 0 in 0 new
[  470.066852] dlm: vol1: dlm_recover_directory 0 out 0 messages
[  470.067350] dlm: vol1: dlm_recover 1 generation 1 done: 0 ms
[  470.067674] dlm: vol1: join complete
[  470.282466] dlm: vol1: dlm_recover 3
[  470.283380] dlm: vol1: add member 1
[  470.289840] dlm: vol1: dlm_recover_members 2 nodes
[  470.327670] dlm: vol1: dlm_recover_directory
[  470.330863] dlm: vol1: dlm_recover_directory 0 in 0 new
[  470.406706] dlm: vol1: dlm_recover_directory 1 out 1 messages
[  470.567983] dlm: vol1: dlm_process_requestqueue msg 11 from 1 lkid 1 remid 0 result 0 seq 3
[  470.568520] dlm: vol1: dlm_recover 3 generation 2 done: 240 ms
[  470.578773] GFS2: fsid=rh7cluster:vol1: first mounter control generation 0
[  470.578856] GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
[  470.788200] GFS2: fsid=rh7cluster:vol1.0: jid=0, already locked for use
[  470.788293] GFS2: fsid=rh7cluster:vol1.0: jid=0: Looking at journal...
[  470.843038] GFS2: fsid=rh7cluster:vol1.0: jid=0: Done
[  470.851041] GFS2: fsid=rh7cluster:vol1.0: jid=1: Trying to acquire journal lock...
[  470.858019] GFS2: fsid=rh7cluster:vol1.0: jid=1: Looking at journal...
[  470.953275] GFS2: fsid=rh7cluster:vol1.0: jid=1: Done
[  470.962088] GFS2: fsid=rh7cluster:vol1.0: first mount done, others may mount
[  471.132738] SELinux: initialized (dev dm-5, type gfs2), uses xattr
[  524.435169] dlm: vol1: leaving the lockspace group...
[  524.495477] dlm: vol1: group event done 0 0
[  524.497957] dlm: vol1: release_lockspace final free
[  540.342079] GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
[  540.342156] GFS2: fsid=rh7cluster:vol1: In gdlm_mount
[  540.361232] dlm: vol1: joining the lockspace group...
[  540.450770] dlm: vol1: group event done 0 0
[  540.450834] dlm: vol1: dlm_recover 1
[  540.451553] dlm: vol1: add member 2
[  540.451975] dlm: vol1: add member 1
[  540.453783] dlm: vol1: dlm_recover_members 2 nodes
[  540.454073] dlm: vol1: join complete
[  540.486970] dlm: vol1: dlm_recover_directory
[  540.489807] dlm: vol1: dlm_recover_directory 1 in 1 new
[  540.516820] dlm: vol1: dlm_recover_directory 0 out 1 messages
[  540.576710] dlm: vol1: dlm_recover 1 generation 2 done: 90 ms
[  541.105327] GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
[  541.202840] dlm: vol1: leaving the lockspace group...
[  541.215728] dlm: vol1: group event done 0 0
[  541.217632] dlm: vol1: release_lockspace final free
From thomasmeier1976 at gmx.de  Mon Oct 13 19:10:27 2014
From: thomasmeier1976 at gmx.de (Thomas Meier)
Date: Mon, 13 Oct 2014 21:10:27 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
Message-ID:

Hi,

When configuring PDU fencing in my 2-node cluster I ran into some problems with the fence_apc_snmp agent. Turning a node off works fine, but fence_apc_snmp then exits with an error.

When I do this manually (from node2):

fence_apc_snmp -a node1 -n 1 -o off

the output of the command is not the expected:

Success: Powered OFF

but in my case:

Returned 2: Error in packet.
Reason: (genError) A general failure occured
Failed object: .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21

When I check the PDU, the port is without power, so this part works. But it seems that the fence agent can't read the status of the PDU and then exits with an error. The same seems to happen when fenced calls the agent: the agent exits with an error, fencing can't succeed, and the cluster hangs.

From the logfile:

fenced[2100]: fence node1 dev 1.0 agent fence_apc_snmp result: error from agent

My setup:

- CentOS 6.5 with fence-agents-3.1.5-35.el6_5.4.x86_64 installed
- APC AP8953 PDU with firmware 6.1
- 2-node cluster based on https://alteeve.ca/w/AN!Cluster_Tutorial_2
- fencing agents in use: fence_ipmilan (working) and fence_apc_snmp

I did some research, and it looks to me like my fence-agents package is too old for my APC firmware.

I've already found the fence-agents repo: https://git.fedorahosted.org/cgit/fence-agents.git/

Here https://git.fedorahosted.org/cgit/fence-agents.git/commit/?id=55ccdd79f530092af06eea5b4ce6a24bd82c0875 it says: "fence_apc_snmp: Add support for firmware 6.x"

I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build of fence_apc_snmp doesn't work.

It gives:

[root at box1]# fence_apc_snmp -v -a node1 -n 1 -o status
Traceback (most recent call last):
  File "/usr/sbin/fence_apc_snmp", line 223, in <module>
    main()
  File "/usr/sbin/fence_apc_snmp", line 197, in main
    options = check_input(device_opt, process_input(device_opt))
  File "/usr/share/fence/fencing.py", line 705, in check_input
    logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
TypeError: __init__() got an unexpected keyword argument 'stream'

I'd really like to see if a patched fence_apc_snmp agent fixes my problem and, if so, install the right version of fence_apc_snmp on the cluster without breaking things, but I'm a bit clueless about how to build a working version.

Maybe you have some tips?

Thanks in advance,
Thomas

From neale at sinenomine.net  Mon Oct 13 21:10:16 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 21:10:16 +0000
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
Message-ID:

I put some debug code into the gfs2 module and I see it failing the mount at this point:

	/*
	 * If user space has failed to join the cluster or some similar
	 * failure has occurred, then the journal id will contain a
	 * negative (error) number. This will then be returned to the
	 * caller (of the mount syscall). We do this even for spectator
	 * mounts (which just write a jid of 0 to indicate "ok" even though
	 * the jid is unused in the spectator case)
	 */
	if (sdp->sd_lockstruct.ls_jid < 0) {

Now to find out who's sticking -EPERM into ls_jid.

Neale
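[Editorial note: a lighter-weight alternative to ad-hoc debug code is gfs2's own logging helper, which David Teigland mentions later in the thread. A sketch of the kind of one-liner that could be dropped in just before that check; the placement and the 3.10-era field layout are assumptions:]

	/* hypothetical debug aid, placed just before the ls_jid check;
	 * fs_info() is gfs2's info-level log macro */
	fs_info(sdp, "lock module mount done: ls_jid=%d\n", sdp->sd_lockstruct.ls_jid);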
From kgronlund at suse.com  Tue Oct 14 05:58:19 2014
From: kgronlund at suse.com (Kristoffer Grönlund)
Date: Tue, 14 Oct 2014 07:58:19 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To:
References:
Message-ID: <87oatfdwg4.fsf@krigpad.site>

Thomas Meier writes:

> I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build
> of fence_apc_snmp doesn't work.
[...]
> File "/usr/share/fence/fencing.py", line 705, in check_input
>   logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
> TypeError: __init__() got an unexpected keyword argument 'stream'

Your version of Python is too old. Possibly you have a newer version of Python installed, but by default the older version is used. I think the stream argument was added in Python 2.7.

-- 
// Kristoffer Grönlund
// kgronlund at suse.com

From lists at alteeve.ca  Tue Oct 14 11:01:42 2014
From: lists at alteeve.ca (Digimer)
Date: Tue, 14 Oct 2014 07:01:42 -0400
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To:
References:
Message-ID: <543D0296.8090606@alteeve.ca>

On 13/10/14 03:10 PM, Thomas Meier wrote:
> Hi,
>
> When configuring PDU fencing in my 2-node cluster I ran into some problems with
> the fence_apc_snmp agent. Turning a node off works fine, but
> fence_apc_snmp then exits with an error.
(snip)
> Maybe you have some tips?
>
> Thanks in advance,
> Thomas

Hi Marek et al.,

This is a RHEL 6.5 install, so Kristoffer's comment about needing a newer version of Python is a bit of a concern. Has this been tested on RHEL 6 with an APC with the 6.x firmware?

cheeps

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
From lists at alteeve.ca Tue Oct 14 11:01:42 2014
From: lists at alteeve.ca (Digimer)
Date: Tue, 14 Oct 2014 07:01:42 -0400
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To:
References:
Message-ID: <543D0296.8090606@alteeve.ca>

On 13/10/14 03:10 PM, Thomas Meier wrote:
> Hi
>
> When configuring PDU fencing in my 2-node-cluster I ran into some problems with
> the fence_apc_snmp agent. Turning a node off works fine, but
> fence_apc_snmp then exits with error.
>
> When I do this manually (from node2):
>
> fence_apc_snmp -a node1 -n 1 -o off
>
> the output of the command is not an expected:
>
> Success: Powered OFF
>
> but in my case:
>
> Returned 2: Error in packet.
> Reason: (genError) A general failure occured
> Failed object: .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21
>
> When I check the PDU, the port is without power, so this part works.
> But it seems that the fence agent can't read the status of the PDU
> and then exits with error. The same seems to happen when fenced
> is calling the agent. The agent also exits with an error and fencing can't succeed
> and the cluster hangs.
>
> From the logfile:
>
> fenced[2100]: fence node1 dev 1.0 agent fence_apc_snmp result: error from agent
>
> My Setup: - CentOS 6.5 with fence-agents-3.1.5-35.el6_5.4.x86_64 installed.
> - APC AP8953 PDU with firmware 6.1
> - 2-node-cluster based on https://alteeve.ca/w/AN!Cluster_Tutorial_2
> - fencing agents in use: fence_ipmilan (working) and fence_apc_snmp
>
> I did some recherche, and for me it looks like that my fence-agents package is too old for my APC firmware.
>
> I've already found the fence-agents repo: https://git.fedorahosted.org/cgit/fence-agents.git/
>
> Here https://git.fedorahosted.org/cgit/fence-agents.git/commit/?id=55ccdd79f530092af06eea5b4ce6a24bd82c0875
> it says: "fence_apc_snmp: Add support for firmware 6.x"
>
> I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build
> of fence_apc_snmp doesn't work.
>
> It gives:
>
> [root at box1]# fence_apc_snmp -v -a node1 -n 1 -o status
> Traceback (most recent call last):
>   File "/usr/sbin/fence_apc_snmp", line 223, in <module>
>     main()
>   File "/usr/sbin/fence_apc_snmp", line 197, in main
>     options = check_input(device_opt, process_input(device_opt))
>   File "/usr/share/fence/fencing.py", line 705, in check_input
>     logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
> TypeError: __init__() got an unexpected keyword argument 'stream'
>
> I'd really like to see if a patched fence_apc_snmp agent fixes my problem, and if so,
> install the right version of fence_apc_snmp on the cluster without breaking things,
> but I'm a bit clueless how to build me a working version.
>
> Maybe you have some tips?
>
> Thanks in advance
>
> Thomas

Hi Marek et al.,

This is a RHEL 6.5 install, so Kristoffer's comment about needing a newer
version of python is a bit of a concern. Has this been tested on RHEL 6 with an
APC with the 6.x firmware?

cheers

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From thomasmeier1976 at gmx.de Tue Oct 14 11:04:10 2014
From: thomasmeier1976 at gmx.de (Thomas Meier)
Date: Tue, 14 Oct 2014 13:04:10 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To: <87oatfdwg4.fsf@krigpad.site>
References: , <87oatfdwg4.fsf@krigpad.site>
Message-ID:

Hi

My installed Python is version 2.6.6 (the system Python of RHEL 6).

The latest stable version is fence-agents-4.0.10. The problem is that
fence_apc_snmp from release 4.0.10 does not contain the code for APC firmware
6.x yet, and fence-agents 4.0.11 is not yet released, so it may still have bugs
(or I just don't get it right).

I've tried version 4.0.10, too (untar - autogen.sh - configure - make - make
install), but I don't expect this version to work. It fails like this:

[root at box1 fence-agents-4.0.10]# fence_apc_snmp -v -a 10.124.0.246 -n 1 -o status
DEBUG:root:/usr/bin/snmpwalk -m '' -Oeqn -v '1' -c 'private' '10.124.0.246:161' '.1.3.6.1.2.1.1.2.0'
DEBUG:root:.1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.318.1.3.4.6
Traceback (most recent call last):
  File "/usr/sbin/fence_apc_snmp", line 209, in <module>
    main()
  File "/usr/sbin/fence_apc_snmp", line 205, in main
    result = fence_action(FencingSnmp(options), options, set_power_status, get_power_status, get_outlets_status)
  File "/usr/share/fence/fencing.py", line 880, in fence_action
    status = get_multi_power_fn(tn, options, get_power_fn)
  File "/usr/share/fence/fencing.py", line 800, in get_multi_power_fn
    plug_status = get_power_fn(tn, options)
  File "/usr/sbin/fence_apc_snmp", line 138, in get_power_status
    apc_resolv_port_id(conn, options)
  File "/usr/sbin/fence_apc_snmp", line 113, in apc_resolv_port_id
    apc_set_device(conn)
  File "/usr/sbin/fence_apc_snmp", line 107, in apc_set_device
    conn.log_command("Trying %s"%(device.ident_str))
AttributeError: FencingSnmp instance has no attribute 'log_command'

Regards
Thomas

Sent: Tuesday, 14 October 2014 at 07:58
From: "Kristoffer Grönlund"
To: "Thomas Meier" , linux-cluster at redhat.com
Subject: Re: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)

Thomas Meier writes:

> I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build
> of fence_apc_snmp doesn't work.

[...]
> File "/usr/share/fence/fencing.py", line 705, in check_input > logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr)) > TypeError: __init__() got an unexpected keyword argument 'stream' Your version of Python is too old. Possibly you have a newer version of python installed, but by default the older version is used. I think the stream argument was added in Python 2.6. -- // Kristoffer Gr?nlund // kgronlund at suse.com From neale at sinenomine.net Tue Oct 14 15:00:57 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 14 Oct 2014 15:00:57 +0000 Subject: [Linux-cluster] Permission denied In-Reply-To: References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com> Message-ID: Following this thread a bit further I find that the jid is set to -1 because the ?our_slot? value being passed to gdlm_recover_done is 0. dlm_recoverd is retrieving the clvmd lockspace and its ls_slot value is 0 (which is the source of our_slot): [73115.541794] name: clvmd [73115.541847] global_id: 4104eefa [73115.541893] node_count: 2 [73115.541937] low node: 1 [73115.541986] slot: 0 (00000000263d5268) [73115.542031] n'slots: 0 dlm_tool ls reports: dlm lockspaces name clvmd id 0x4104eefa flags 0x00000000 change member 2 joined 1 remove 0 failed 0 seq 1,1 members 1 2 Now to determine why ls_slot is 0. Neale On 10/13/14, 5:10 PM, "Neale Ferguson" wrote: >I put some debug code into the gfs2 module and I see it failing the mount >at this point: > >/* > * If user space has failed to join the cluster or some similar > * failure has occurred, then the journal id will contain a > * negative (error) number. This will then be returned to the > * caller (of the mount syscall). We do this even for spectator > * mounts (which just write a jid of 0 to indicate "ok" even >though > * the jid is unused in the spectator case) > */ > if (sdp->sd_lockstruct.ls_jid < 0) { > >Now to find out who?s stick -PERM into ls_jid. > >Neale > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From neale at sinenomine.net Tue Oct 14 19:40:42 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 14 Oct 2014 19:40:42 +0000 Subject: [Linux-cluster] Permission denied In-Reply-To: <20141014192057.GA10594@redhat.com> References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com> <20141014192057.GA10594@redhat.com> Message-ID: Yeah, I noted I was looking at the wrong lockspace. The gfs2 lockspace in this cluster is vol1. Once I corrected at what I was looking at, I think I solved my problem: I believe the problem is an endian thing. In set_rcom_status: rs->rs_flags = cpu_to_le32(flags) However, in receive_rcom_status() flags are checked: if (!(rs->rs_flags & DLM_RSF_NEED_SLOTS)) { But it should be: if (!(le32_to_cpu(rs->rs_flags) & DLM_RSF_NEED_SLOTS)) { I made this change and now the gfs2 volume is being mounted correctly on both nodes. I?ve repeated it a number of times and it?s kept working. Neale On 10/14/14, 3:20 PM, "David Teigland" wrote: >clvmd is a userland lockspace and does not use lockspace_ops or slots/jids >like a gfs2 (kernel) lockspace. > >To debug the dlm/gfs2 control mechanism, which assigns gfs2 a jid based on >dlm slots, enable the fs_info() lines in gfs2/lock_dlm.c. (Make sure that >you're not somehow running gfs_controld on these nodes; we quit using that >in RHEL7.) 
From teigland at redhat.com Tue Oct 14 20:15:05 2014 From: teigland at redhat.com (David Teigland) Date: Tue, 14 Oct 2014 15:15:05 -0500 Subject: [Linux-cluster] [PATCH] dlm: fix missing endian conversion of rcom_status flags In-Reply-To: References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com> <20141014192057.GA10594@redhat.com> Message-ID: <20141014201505.GC10594@redhat.com> The flags are already converted to le when being sent, but are not being converted back to cpu when received. Signed-off-by: Neale Ferguson Signed-off-by: David Teigland --- fs/dlm/rcom.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/dlm/rcom.c b/fs/dlm/rcom.c index 9d61947d473a..f3f5e72a29ba 100644 --- a/fs/dlm/rcom.c +++ b/fs/dlm/rcom.c @@ -206,7 +206,7 @@ static void receive_rcom_status(struct dlm_ls *ls, struct dlm_rcom *rc_in) rs = (struct rcom_status *)rc_in->rc_buf; - if (!(rs->rs_flags & DLM_RSF_NEED_SLOTS)) { + if (!(le32_to_cpu(rs->rs_flags) & DLM_RSF_NEED_SLOTS)) { status = dlm_recover_status(ls); goto do_create; } -- 1.8.3.1 From rpeterso at redhat.com Tue Oct 14 20:22:36 2014 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 14 Oct 2014 16:22:36 -0400 (EDT) Subject: [Linux-cluster] [PATCH] dlm: fix missing endian conversion of rcom_status flags In-Reply-To: <20141014201505.GC10594@redhat.com> References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com> <20141014192057.GA10594@redhat.com> <20141014201505.GC10594@redhat.com> Message-ID: <1398269457.3271283.1413318156147.JavaMail.zimbra@redhat.com> ----- Original Message ----- > The flags are already converted to le when being sent, > but are not being converted back to cpu when received. > > Signed-off-by: Neale Ferguson > Signed-off-by: David Teigland > --- > fs/dlm/rcom.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/dlm/rcom.c b/fs/dlm/rcom.c > index 9d61947d473a..f3f5e72a29ba 100644 > --- a/fs/dlm/rcom.c > +++ b/fs/dlm/rcom.c > @@ -206,7 +206,7 @@ static void receive_rcom_status(struct dlm_ls *ls, struct > dlm_rcom *rc_in) > > rs = (struct rcom_status *)rc_in->rc_buf; > > - if (!(rs->rs_flags & DLM_RSF_NEED_SLOTS)) { > + if (!(le32_to_cpu(rs->rs_flags) & DLM_RSF_NEED_SLOTS)) { > status = dlm_recover_status(ls); > goto do_create; > } > -- > 1.8.3.1 > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hi Dave, Did you mean for this patch to go to cluster-devel? Bob Peterson Red Hat File Systems From mgrac at redhat.com Wed Oct 15 14:12:13 2014 From: mgrac at redhat.com (Marek "marx" Grac) Date: Wed, 15 Oct 2014 16:12:13 +0200 Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x) In-Reply-To: References: Message-ID: <543E80BD.8000705@redhat.com> Hi, On 10/13/2014 09:10 PM, Thomas Meier wrote: > Hi > > When configuring PDU fencing in my 2-node-cluster I ran into some problems with > the fence_apc_snmp agent. Turning a node off works fine, but > fence_apc_snmp then exits with error. > > > > When I do this manually (from node2): > > fence_apc_snmp -a node1 -n 1 -o off > > the output of the command is not an expected: > > Success: Powered OFF > > but in my case: > > Returned 2: Error in packet. > Reason: (genError) A general failure occured > Failed object: .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21 > > > When I check the PDU, the port is without power, so this part works. > But it seems that the fence agent can't read the status of the PDU > and then exits with error. 
The same seems to happen when fenced
> is calling the agent. The agent also exits with an error and fencing can't succeed
> and the cluster hangs.

Yes, this is a known bug, as APC changed an information table in the 6.x
firmware.

> I've already found the fence-agents repo: https://git.fedorahosted.org/cgit/fence-agents.git/
>
> Here https://git.fedorahosted.org/cgit/fence-agents.git/commit/?id=55ccdd79f530092af06eea5b4ce6a24bd82c0875
> it says: "fence_apc_snmp: Add support for firmware 6.x"

yes, this should fix the issue

> I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build
> of fence_apc_snmp doesn't work.
>
> It gives:
>
> [root at box1]# fence_apc_snmp -v -a node1 -n 1 -o status
> Traceback (most recent call last):
>   File "/usr/sbin/fence_apc_snmp", line 223, in <module>
>     main()
>   File "/usr/sbin/fence_apc_snmp", line 197, in main
>     options = check_input(device_opt, process_input(device_opt))
>   File "/usr/share/fence/fencing.py", line 705, in check_input
>     logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
> TypeError: __init__() got an unexpected keyword argument 'stream'

Feel free to remove the logging if it does not work. The other option is to
just take the patch from git and backport it. There should be no big
differences (I expect only very minor changes).

> I'd really like to see if a patched fence_apc_snmp agent fixes my problem, and if so,
> install the right version of fence_apc_snmp on the cluster without breaking things,
> but I'm a bit clueless how to build me a working version.

Sure, there will be a new official release for RHEL 6.7 (as 6.6 was released a
few days ago). So until that time, only upstream or patches.

m,

From mgrac at redhat.com Wed Oct 15 14:15:10 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Wed, 15 Oct 2014 16:15:10 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To: <543D0296.8090606@alteeve.ca>
References: <543D0296.8090606@alteeve.ca>
Message-ID: <543E816E.20608@redhat.com>

On 10/14/2014 01:01 PM, Digimer wrote:
>
> Hi Marek et. al.,
>
> This is a RHEL 6.5 install, so Kristoffer's comment about needing a
> newer version of python is a bit of a concern. Has this been tested on
> RHEL 6 with an APC with the 6.x firmware?

The current release does not contain the required patch; it will be in the
next one (or a z-stream if someone requests it). The upstream release works as
expected (retested today) on Fedora 20/RHEL 7. The fact that the upstream
release cannot be run on RHEL 6 is a new issue for me, but we did not try that
before.

m,

From lists at alteeve.ca Wed Oct 15 14:35:30 2014
From: lists at alteeve.ca (Digimer)
Date: Wed, 15 Oct 2014 10:35:30 -0400
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To: <543E816E.20608@redhat.com>
References: <543D0296.8090606@alteeve.ca> <543E816E.20608@redhat.com>
Message-ID: <543E8632.4040206@alteeve.ca>

On 15/10/14 10:15 AM, Marek "marx" Grac wrote:
>
> On 10/14/2014 01:01 PM, Digimer wrote:
>>
>> Hi Marek et. al.,
>>
>> This is a RHEL 6.5 install, so Kristoffer's comment about needing a
>> newer version of python is a bit of a concern. Has this been tested on
>> RHEL 6 with an APC with the 6.x firmware?
>
> Current release do not contain required patch, it will be in next one
> (or z-stream if someone request it). The upstream release work as
> expected (retested today) on Fedora20/RHEL7. Fact that upstream release
> can not be run on RHEL6 is new issue for me but we did not try that before.
>
> m,

Consider it officially requested. We use APC switched PDUs as backup fence
devices extensively, so this would pretty heavily hurt us if we started
getting v6 firmware.

Should I open a RHBZ?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From mgrac at redhat.com Thu Oct 16 07:53:37 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Thu, 16 Oct 2014 09:53:37 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To: <543E8632.4040206@alteeve.ca>
References: <543D0296.8090606@alteeve.ca> <543E816E.20608@redhat.com> <543E8632.4040206@alteeve.ca>
Message-ID: <543F7981.9070901@redhat.com>

Hi,

On 10/15/2014 04:35 PM, Digimer wrote:
> Consider it officially requested. We use APC switched PDUs as backup
> fence devices extensively, so this would pretty heavily hurt us if we
> started getting v6 firmware.

To summarize, support for v6 firmware over SNMP:

* is not in RHEL 6.6
* should be in RHEL 7.1
* should be in RHEL 6.7
* can be part of a z-stream

> Should I open a RHBZ?

The bug is already opened; perhaps it is not cloned everywhere. So only
raising a z-stream request will change something.

m,

From mgrac at redhat.com Thu Oct 16 13:11:04 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Thu, 16 Oct 2014 15:11:04 +0200
Subject: [Linux-cluster] fence-agents-4.0.12 stable release
Message-ID: <543FC3E8.9020409@redhat.com>

Welcome to the fence-agents 4.0.12 release.

This release includes some new features and several bugfixes:

* new up-to-date wiki page with STDIN / command line arguments:
  http://fedorahosted.org/cluster/wiki/FenceArguments
* Fence agent fence_pve now supports --ssl-secure and --ssl-insecure
  (check certificate or not)
* Fence agent for RHEV-M supports cookie-based authentication (--use-cookies)
* improvements in the build system
* Fix issue with regular expression in fence_rsb
* Fix uninitialized EOL in fence_wti

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.12.tar.xz

To report bugs or issues:

https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community? Join us
on IRC (irc.freenode.net #linux-cluster) and share your experience with other
system administrators or power users.

Thanks/congratulations to all the people who contributed to this great
milestone.

m,
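[For the curious, fetching and unpacking the announced release is just (URL
taken from the announcement above):

    curl -O https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.12.tar.xz
    tar xJf fence-agents-4.0.12.tar.xz
]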
From sunhux at gmail.com Wed Oct 22 08:44:33 2014
From: sunhux at gmail.com (Sunhux G)
Date: Wed, 22 Oct 2014 16:44:33 +0800
Subject: [Linux-cluster] Rhel BootLoader, Single-user mode password & Interactive Boot in a Cloud environment
Message-ID:

We run a cloud service & our vCenter is not accessible to our tenants and
their IT support, so I would say console access is not feasible unless the
tenant/customer IT come to our DC.

If the following 3 hardenings are done on our tenant/customer RHEL Linux VMs,
what's the impact on the tenants' sysadmin & IT operations?

a) CIS 1.5.3 Set Boot Loader Password:
   if this password is set, when tenants reboot (shutdown -r) their VM, will
   it prompt for the bootloader password at the console each time? If so, is
   there any way the tenant could still get their VM booted up if they have
   no access to vCenter's console?

b) CIS 1.5.4 Require Authentication for Single-User Mode:
   Does Linux allow ssh access while in single-user mode, & can this
   'single-user mode password' be entered via an ssh session (without access
   to the console), assuming a certain 'terminal' service is started up /
   running while in single-user mode?

c) CIS 1.5.5 Disable Interactive Boot:
   what's the general consensus on this? Disable or enable? Our corporate
   hardening guide does not mention this item. So if the tenant wishes to
   boot up step by step (ie pausing at each startup script), they can't do it?

Feel free to add any other impacts that anyone can think of.

Lastly, how do people out there grant console access to their tenants in a
cloud environment without security compromise (I mean without granting
vCenter access)? I heard that we can customize vCenter to grant tenants
limited access; is this so?

Sun

From lists at alteeve.ca Wed Oct 22 10:46:22 2014
From: lists at alteeve.ca (Digimer)
Date: Wed, 22 Oct 2014 06:46:22 -0400
Subject: [Linux-cluster] Rhel BootLoader, Single-user mode password & Interactive Boot in a Cloud environment
In-Reply-To:
References:
Message-ID: <54478AFE.3030506@alteeve.ca>

On 22/10/14 04:44 AM, Sunhux G wrote:
> We run cloud service & our vCenter is not accessible to our tenants
> and their IT support; so I would say console access is not feasible
> unless the tenant/customer IT come to our DC.
>
> If the following 3 hardenings are done our tenant/customer RHEL
> Linux VM, what's the impact to the tenant's sysadmin & IT operation?
>
> a) CIS 1.5.3 Set Boot Loader Password *:*
> if this password is set, when tenant reboot (shutdown -r)
> their VM each time, will it prompt for the bootloader
> password at console? If so, is there any way the tenant,
> could still get their VM booted up if they have no access
> to vCenter's console?
>
> b) CIS 1.5.4 Require Authentication for Single-User Mode *:*
> Does Linux allow ssh access while in single-user mode &
> can this 'single-user mode password' be entered via an
> ssh session (without access to console), assuming certain
> 'terminal' service is started up / running while in single
> user mode
>
> c) CIS 1.5.5 Disable Interactive Boot *:*
> what's the general consensus on this? Disable or enable?
> Our corporate hardening guide does not mention this item.
> So if the tenant wishes to boot up step by step (ie pausing
> at each startup script), they can't do it?
>
> Feel free to add any other impacts that anyone can think of
>
> Lastly, how do people out there grant console access to their
> tenants in Cloud environment without security compromise
> (I mean without granting vCenter access) : I heard that we can
> customize vCenter to grant limited access of vCenter to
> tenants, is this so?
>
> Sun

Hi Sun,

Did you mean to post this to the vmware mailing list?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From alanoe at linux.vnet.ibm.com Thu Oct 23 13:49:45 2014
From: alanoe at linux.vnet.ibm.com (Alan Evangelista)
Date: Thu, 23 Oct 2014 11:49:45 -0200
Subject: [Linux-cluster] Problems building fence-agents from source
Message-ID: <54490779.9080408@linux.vnet.ibm.com>

Hi.

I'm trying to build fence-agents from source (master branch) on CentOS 6.5.
I already installed the following rpm packages (dependencies): autoconf,
automake, gcc, libtool, nss, nss-devel.
When I tried to run ./autogen.sh, I got: configure.ac:162: error: possibly undefined macro: AC_PYTHON_MODULE If this token and others are legitimate, please use m4_pattern_allow. See the Autoconf documentation. I then run $ autoreconf --install and autogen worked. Then, I have a problem running ./configure: ./configure: line 18284: syntax error near unexpected token `suds,' ./configure: line 18284: `AC_PYTHON_MODULE(suds, 1)' I never had this problem before with earlier fence-agents versions. Am I missing something or is there an issue with upstream code? RPM dependencies versions: autoconf-2.63-5.1.el6.noarch automake-1.11.1-4.el6.noarch libtool-2.2.6-15.5.el6.x86_64 Regards, Alan Evangelista From bmr at redhat.com Thu Oct 23 14:45:56 2014 From: bmr at redhat.com (Bryn M. Reeves) Date: Thu, 23 Oct 2014 15:45:56 +0100 Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: <54490779.9080408@linux.vnet.ibm.com> References: <54490779.9080408@linux.vnet.ibm.com> Message-ID: <20141023144555.GB26744@localhost.localdomain> On Thu, Oct 23, 2014 at 11:49:45AM -0200, Alan Evangelista wrote: > I'm trying to build fence-agents from source (master branch) on CentOS 6.5. > I already installed the following rpm packages (dependencies): autoconf, > automake, gcc, libtool, nss, nss-devel. When I tried to run ./autogen.sh, > I got: You might find it easier to just rebuild the RPMs using either rpmbuild or a tool like mock[1]. > ./configure: line 18284: syntax error near unexpected token `suds,' > ./configure: line 18284: `AC_PYTHON_MODULE(suds, 1)' I'd guess this is because you're missing the python-suds package: # yum list | grep python-suds python-suds.noarch 0.4.1-3.el6 @rhel-x86_64-server-6 python-suds is a Python SOAP client library. If you check the BuildRequires in the fence-agents.spec file you'll see: # Build dependencies BuildRequires: perl python BuildRequires: glibc-devel BuildRequires: nss-devel nspr-devel BuildRequires: libxslt pexpect BuildRequires: python-pycurl BuildRequires: python-suds BuildRequires: automake autoconf pkgconfig libtool BuildRequires: net-snmp-utils perl-Net-Telnet > I never had this problem before with earlier fence-agents versions. > Am I missing something or is there an issue with upstream code? I'd guess it's a required dependency for the fence_vmware_soap agent. The BuildRequires and f_v_s scripts were added in 3.1.4-1.el6 back in 2011: * Tue Jun 7 2011 Fabio M. Di Nitto - 3.1.4-1 - Rebase package on top of new upstream - spec file update: * update spec file copyright date * update upstream URL * drop all patches * update list of fence_agents (ibmblade listed twice, bladecenter_snmp deprecated) * drop libxml2-devel libvirt-devel clusterlib-devel corosynclib-devel and openaislib-devel from BuildRequires * make ready to enable fence_vmware_soap * update and clean configure and build section. * create bladecenter_snmp compat symlink at rpm install time * update file list to include scsi_check script Regards, Bryn. 
[1] http://fedoraproject.org/wiki/Projects/Mock From bcodding at redhat.com Thu Oct 23 14:46:49 2014 From: bcodding at redhat.com (Benjamin Coddington) Date: Thu, 23 Oct 2014 10:46:49 -0400 (EDT) Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: <54490779.9080408@linux.vnet.ibm.com> References: <54490779.9080408@linux.vnet.ibm.com> Message-ID: Hi Alan, I don't know how well the upstream fence-agents will work or build on CentOS 6.5, but I can tell you that the way to resolve this particular problem would be to find the m4 for AC_PYTHON_MODULE and drop it in your build's m4/ directory.. Ben On Thu, 23 Oct 2014, Alan Evangelista wrote: > Hi. > > I'm trying to build fence-agents from source (master branch) on CentOS 6.5. > I already installed the following rpm packages (dependencies): autoconf, > automake, gcc, libtool, nss, nss-devel. When I tried to run ./autogen.sh, > I got: > > configure.ac:162: error: possibly undefined macro: AC_PYTHON_MODULE > If this token and others are legitimate, please use m4_pattern_allow. > See the Autoconf documentation. > > > I then run > > $ autoreconf --install > > and autogen worked. Then, I have a problem running ./configure: > > ./configure: line 18284: syntax error near unexpected token `suds,' > ./configure: line 18284: `AC_PYTHON_MODULE(suds, 1)' > > I never had this problem before with earlier fence-agents versions. > Am I missing something or is there an issue with upstream code? > > > RPM dependencies versions: > autoconf-2.63-5.1.el6.noarch > automake-1.11.1-4.el6.noarch > libtool-2.2.6-15.5.el6.x86_64 > > > Regards, > Alan Evangelista > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From alanoe at linux.vnet.ibm.com Thu Oct 23 15:03:17 2014 From: alanoe at linux.vnet.ibm.com (Alan Evangelista) Date: Thu, 23 Oct 2014 13:03:17 -0200 Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: <20141023144555.GB26744@localhost.localdomain> References: <54490779.9080408@linux.vnet.ibm.com> <20141023144555.GB26744@localhost.localdomain> Message-ID: <544918B5.50902@linux.vnet.ibm.com> On 10/23/2014 12:45 PM, Bryn M. Reeves wrote: >> ./configure: line 18284: syntax error near unexpected token `suds,' >> ./configure: line 18284: `AC_PYTHON_MODULE(suds, 1)' > I'd guess this is because you're missing the python-suds package: > > # yum list | grep python-suds > python-suds.noarch 0.4.1-3.el6 @rhel-x86_64-server-6 No, it is not, I already installed that rpm package. $ rpm -qa | grep suds python-suds-0.4.1-3.el6.noarch > >> I never had this problem before with earlier fence-agents versions. >> Am I missing something or is there an issue with upstream code? I forgot to mention, previous installations were done in RHEL 6.5. Maybe fence-agents does not work out of the box in CentOS 6.5. 
Regards,
Alan Evangelista

From alanoe at linux.vnet.ibm.com Thu Oct 23 15:28:55 2014
From: alanoe at linux.vnet.ibm.com (Alan Evangelista)
Date: Thu, 23 Oct 2014 13:28:55 -0200
Subject: [Linux-cluster] Problems building fence-agents from source
In-Reply-To:
References: <54490779.9080408@linux.vnet.ibm.com>
Message-ID: <54491EB7.40300@linux.vnet.ibm.com>

On 10/23/2014 12:46 PM, Benjamin Coddington wrote:
> Hi Alan,
>
> I don't know how well the upstream fence-agents will work or build on
> CentOS 6.5, but I can tell you that the way to resolve this particular
> problem would be to find the m4 for AC_PYTHON_MODULE and drop it in
> your build's m4/ directory..

I already see the m4 file in make/ac_python_module.m4. Copying/moving the
file to the m4 directory didn't solve the problem.

FYI, this macro was introduced in a patch today (commit
5a87866c70e3dc77798d3e6fd77e2607757d26b5). Maybe the macro is broken?

AC_DEFUN([AC_PYTHON_MODULE],[
    AC_MSG_CHECKING(python module: $1)
    python -c "import $1" 2>/dev/null
    if test $? -eq 0;
    then
        AC_MSG_RESULT(yes)
        eval AS_TR_CPP(HAVE_PYMOD_$1)=yes
    else
        AC_MSG_RESULT(no)
        eval AS_TR_CPP(HAVE_PYMOD_$1)=no
        #
        if test -n "$2"
        then
            AC_MSG_ERROR(failed to find required module $1)
            exit 1
        fi
    fi
])

Regards,
Alan Evangelista

From bcodding at redhat.com Thu Oct 23 15:45:07 2014
From: bcodding at redhat.com (Benjamin Coddington)
Date: Thu, 23 Oct 2014 11:45:07 -0400 (EDT)
Subject: [Linux-cluster] Problems building fence-agents from source
In-Reply-To: <54491EB7.40300@linux.vnet.ibm.com>
References: <54490779.9080408@linux.vnet.ibm.com> <54491EB7.40300@linux.vnet.ibm.com>
Message-ID:

On Thu, 23 Oct 2014, Alan Evangelista wrote:

> On 10/23/2014 12:46 PM, Benjamin Coddington wrote:
>> Hi Alan,
>>
>> I don't know how well the upstream fence-agents will work or build on
>> CentOS 6.5, but I can tell you that the way to resolve this particular
>> problem would be to find the m4 for AC_PYTHON_MODULE and drop it in your
>> build's m4/ directory..
>
> I already see the m4 file in make/ac_python_module.m4. Copying/moving
> file to m4 directory didnt solve the problem.
>
> FYI this macro was introduced in a patch today (commit
> 5a87866c70e3dc77798d3e6fd77e2607757d26b5).
> Maybe the macro is broken?
>
> AC_DEFUN([AC_PYTHON_MODULE],[
> AC_MSG_CHECKING(python module: $1)
> python -c "import $1" 2>/dev/null
> if test $? -eq 0;
> then
> AC_MSG_RESULT(yes)
> eval AS_TR_CPP(HAVE_PYMOD_$1)=yes
> else
> AC_MSG_RESULT(no)
> eval AS_TR_CPP(HAVE_PYMOD_$1)=no
> #
> if test -n "$2"
> then
> AC_MSG_ERROR(failed to find required module $1)
> exit 1
> fi
> fi
> ])

Ah, looking at the second portion of your original error report now.. it
looks like you have AC rules in your configure script. That indicates
that configure wasn't correctly created.. delete your configure script
and run autogen.sh again now that you have ac_python_module.m4 in the
AC_CONFIG_MACRO_DIR (which is m4/).
Ben From alanoe at linux.vnet.ibm.com Thu Oct 23 15:55:21 2014 From: alanoe at linux.vnet.ibm.com (Alan Evangelista) Date: Thu, 23 Oct 2014 13:55:21 -0200 Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: References: <54490779.9080408@linux.vnet.ibm.com> <54491EB7.40300@linux.vnet.ibm.com> Message-ID: <544924E9.9060306@linux.vnet.ibm.com> On 10/23/2014 01:45 PM, Benjamin Coddington wrote: > > > On Thu, 23 Oct 2014, Alan Evangelista wrote: > >> On 10/23/2014 12:46 PM, Benjamin Coddington wrote: >>> Hi Alan, >>> >>> I don't know how well the upstream fence-agents will work or build on >>> CentOS 6.5, but I can tell you that the way to resolve this particular >>> problem would be to find the m4 for AC_PYTHON_MODULE and drop it in >>> your >>> build's m4/ directory.. >> >> I already see the m4 file in make/ac_python_module.m4. Copying/moving >> file to m4 directory didnt solve the problem. >> >> FYI this macro was introduced in a patch today (commit >> 5a87866c70e3dc77798d3e6fd77e2607757d26b5). >> Maybe the macro is broken? >> >> AC_DEFUN([AC_PYTHON_MODULE],[ >> AC_MSG_CHECKING(python module: $1) >> python -c "import $1" 2>/dev/null >> if test $? -eq 0; >> then >> AC_MSG_RESULT(yes) >> eval AS_TR_CPP(HAVE_PYMOD_$1)=yes >> else >> AC_MSG_RESULT(no) >> eval AS_TR_CPP(HAVE_PYMOD_$1)=no >> # >> if test -n "$2" >> then >> AC_MSG_ERROR(failed to find required module $1) >> exit 1 >> fi >> fi >> ] ) > > Ah, looking at the second portion of your original error report now.. it > looks like you have AC rules in your configure script. That indicates > that configure wasn't correctly created.. delete your configure script > and run autogen.sh again now that you have ac_python_module.m4 in the > AC_CONFIG_MACRO_DIR (which is m4/). That worked. I didn't know I had to run ./autogen.sh again after moving the m4 file to the correct directory. I'll send an email in fence-agents-devel about the incorrect directory of the new m4 file added today. Thanks for the help! Regards, Alan Evangelista From bmr at redhat.com Thu Oct 23 15:59:04 2014 From: bmr at redhat.com (Bryn M. Reeves) Date: Thu, 23 Oct 2014 16:59:04 +0100 Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: <544918B5.50902@linux.vnet.ibm.com> References: <54490779.9080408@linux.vnet.ibm.com> <20141023144555.GB26744@localhost.localdomain> <544918B5.50902@linux.vnet.ibm.com> Message-ID: <20141023155904.GI26744@localhost.localdomain> On Thu, Oct 23, 2014 at 01:03:17PM -0200, Alan Evangelista wrote: > >>I never had this problem before with earlier fence-agents versions. > >>Am I missing something or is there an issue with upstream code? > > I forgot to mention, previous installations were done in RHEL 6.5. Maybe > fence-agents does > not work out of the box in CentOS 6.5. > What version of the package are you actually trying to build? If it's the native RHEL-6.5 package then I would expect that to build out-of-the box on unmodified CentOS 6.5. If it is some later upstream version then you may find there are considerable changes in dependencies needed to build and those may require a large number of package updates to enable building the later version (which you'd need to also build from source). The autoconf problems you're hitting make it sound like that may be the case (although you could encounter similar problems if e.g. the package is from 6.6 beta and there was also an updated autotools in that release). Regards, Bryn. 
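[Spelled out, the clean re-bootstrap Benjamin describes looks roughly like
this -- a sketch; the cp line assumes you are carrying the macro in make/, as
discussed above:

    cd fence-agents
    cp make/ac_python_module.m4 m4/   # put the macro where AC_CONFIG_MACRO_DIR looks
    rm -f configure                   # drop the stale, half-generated script
    ./autogen.sh                      # regenerates configure with the macro visible
    ./configure
]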
From mgrac at redhat.com Mon Oct 27 15:57:37 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Mon, 27 Oct 2014 16:57:37 +0100
Subject: [Linux-cluster] Problems building fence-agents from source
In-Reply-To:
References: <54490779.9080408@linux.vnet.ibm.com> <54491EB7.40300@linux.vnet.ibm.com>
Message-ID: <544E6B71.1020300@redhat.com>

On 10/23/2014 05:45 PM, Benjamin Coddington wrote:
>
> On Thu, 23 Oct 2014, Alan Evangelista wrote:
>
>> On 10/23/2014 12:46 PM, Benjamin Coddington wrote:
>>> Hi Alan,
>>>
>>> I don't know how well the upstream fence-agents will work or build on
>>> CentOS 6.5, but I can tell you that the way to resolve this particular
>>> problem would be to find the m4 for AC_PYTHON_MODULE and drop it in your
>>> build's m4/ directory..
>>
>> I already see the m4 file in make/ac_python_module.m4. Copying/moving
>> file to m4 directory didnt solve the problem.
>>
>> FYI this macro was introduced in a patch today (commit
>> 5a87866c70e3dc77798d3e6fd77e2607757d26b5).
>> Maybe the macro is broken?
>>
>> AC_DEFUN([AC_PYTHON_MODULE],[
>> AC_MSG_CHECKING(python module: $1)
>> python -c "import $1" 2>/dev/null
>> if test $? -eq 0;
>> then
>> AC_MSG_RESULT(yes)
>> eval AS_TR_CPP(HAVE_PYMOD_$1)=yes
>> else
>> AC_MSG_RESULT(no)
>> eval AS_TR_CPP(HAVE_PYMOD_$1)=no
>> #
>> if test -n "$2"
>> then
>> AC_MSG_ERROR(failed to find required module $1)
>> exit 1
>> fi
>> fi
>> ])
>
> Ah, looking at the second portion of your original error report now.. it
> looks like you have AC rules in your configure script. That indicates
> that configure wasn't correctly created.. delete your configure script
> and run autogen.sh again now that you have ac_python_module.m4 in the
> AC_CONFIG_MACRO_DIR (which is m4/).

autogen.sh was modified so that it also uses (-I make) to pick up macros from
the make/ directory, because the m4/ directory is only created after
autogen.sh has been run.

m,

From mgrac at redhat.com Wed Oct 29 08:37:31 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Wed, 29 Oct 2014 09:37:31 +0100
Subject: [Linux-cluster] How we conform OCF in fence agents and what to do with it
Message-ID: <5450A74B.20701@redhat.com>

Hi,

I took a look at the OCF specification for resource agents from
https://github.com/ClusterLabs/OCF-spec

I rewrote it from DTD to Relax NG (XML form) and attempted to modify it until
it accepted the current resource agents. These changes are put up for
discussion, and I will mark those that are important for fence agents with an
asterisk.

resource-agent is the root element.

1*) new actions required: on, off, reboot, monitor, list, metadata

2) "timeout" for service should be only optional?

3) I don't understand the element "version" directly under the root element,
as it also has an attribute "version"

4) we have added the elements "vendor-url" and "longdesc" directly under the
root element. This is inconsistent with "shortdesc", which is an attribute,
but a long description really should not be an attribute.

5) we have added the attribute "automatic" to actions (e.g. fence_scsi)

6) our parameters use only "shortdesc", so perhaps "longdesc" can be optional

7*) a "getopt" element for parameters and how they can be called from the
command line (used for man page generation)

8) add a "required" attribute for each parameter

9) add a "default" value for the content element

10) make the content element optional. what should be inside?
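[To make the list above concrete, this is roughly the metadata shape the
points are describing -- a hand-written sketch following the thread, not a
ratified OCF schema; the agent name and parameter are made up:

    <resource-agent name="fence_example" shortdesc="Example fence agent">
      <longdesc>Fence agent for an imaginary power switch.</longdesc>
      <vendor-url>http://www.example.com</vendor-url>
      <parameters>
        <parameter name="ipaddr" unique="1" required="1">
          <getopt mixed="-a, --ip=[ip]"/>
          <content type="string"/>
          <shortdesc lang="en">IP address or hostname of the device</shortdesc>
        </parameter>
      </parameters>
      <actions>
        <action name="on" automatic="0"/>
        <action name="off"/>
        <action name="reboot"/>
        <action name="monitor"/>
        <action name="list"/>
        <action name="metadata"/>
      </actions>
    </resource-agent>
]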
11) the root element does not have only longdesc but also shortdesc (single-line)

m,

From andrew at beekhof.net Wed Oct 29 09:46:26 2014
From: andrew at beekhof.net (Andrew Beekhof)
Date: Wed, 29 Oct 2014 20:46:26 +1100
Subject: [Linux-cluster] How we conform OCF in fence agents and what to do with it
In-Reply-To: <5450A74B.20701@redhat.com>
References: <5450A74B.20701@redhat.com>
Message-ID:

> On 29 Oct 2014, at 7:37 pm, Marek marx Grac wrote:
>
> Hi,
>
> I took a look at the OCF specification for resource agents from
> https://github.com/ClusterLabs/OCF-spec
>
> I rewrote it from DTD to Relax NG

Please don't.
It's hard enough getting any change in, let alone coupling it with a
translation to another format.

Please just leave it as DTD for now.
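[For readers who have not met both schema languages, the same constraint
written both ways -- fragments only, illustrative:

    <!-- DTD: the form Andrew wants to keep -->
    <!ELEMENT resource-agent (longdesc?, vendor-url?)>
    <!ATTLIST resource-agent name CDATA #REQUIRED>

    # Relax NG compact syntax: the form Marek experimented with
    element resource-agent {
      attribute name { text },
      element longdesc { text }?,
      element vendor-url { text }?
    }
]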
URL: From andrew at beekhof.net Wed Oct 29 21:42:06 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 30 Oct 2014 08:42:06 +1100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: > On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: > > Hi All, > > In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > > Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? > > > Thanks > Lax > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From lkota at cisco.com Wed Oct 29 22:06:43 2014 From: lkota at cisco.com (Lax Kota (lkota)) Date: Wed, 29 Oct 2014 22:06:43 +0000 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: > I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. Thanks Lax -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof Sent: Wednesday, October 29, 2014 2:42 PM To: linux clustering Subject: Re: [Linux-cluster] daemon cpg_join error retrying > On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: > > Hi All, > > In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > > Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? 
> > > Thanks > Lax > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From andrew at beekhof.net Wed Oct 29 22:16:35 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 30 Oct 2014 09:16:35 +1100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: > On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: > >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. I don't really recall. Hopefully someone more familiar with GFS2 can chime in. > > Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) > > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. It does not sound like your network is particularly healthy. Are you using multicast or udpu? If multicast, it might be worth trying udpu > > Thanks > Lax > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Wednesday, October 29, 2014 2:42 PM > To: linux clustering > Subject: Re: [Linux-cluster] daemon cpg_join error retrying > > >> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >> >> Hi All, >> >> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. > > I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > >> >> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? 
>> >> >> Thanks >> Lax >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Wed Oct 29 22:29:38 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 29 Oct 2014 18:29:38 -0400 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: <54516A52.5020901@alteeve.ca> On 29/10/14 06:16 PM, Andrew Beekhof wrote: > >> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >> >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. > > I don't really recall. Hopefully someone more familiar with GFS2 can chime in. # gfs2_tool sb /dev/c01n01_vg0/shared table current lock table name = "an-cluster-01:shared" Replace with your device, of course. :) > >> >> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >> >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > > It does not sound like your network is particularly healthy. > Are you using multicast or udpu? If multicast, it might be worth trying udpu Agreed. Persistent multicast required? >> Thanks >> Lax >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Wednesday, October 29, 2014 2:42 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >> >> >>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>> >>> Hi All, >>> >>> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >> >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> >>> >>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? 
>>> >>> >>> Thanks >>> Lax >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From lkota at cisco.com Wed Oct 29 22:32:28 2014 From: lkota at cisco.com (Lax Kota (lkota)) Date: Wed, 29 Oct 2014 22:32:28 +0000 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. > I don't really recall. Hopefully someone more familiar with GFS2 can chime in. Ok. >> >> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >> >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >It does not sound like your network is particularly healthy. >Are you using multicast or udpu? If multicast, it might be worth trying udpu I am using udpu and I also have firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM envirornment and even if I switch to other node within same VM I keep getting same failure. Thanks Lax -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof Sent: Wednesday, October 29, 2014 3:17 PM To: linux clustering Subject: Re: [Linux-cluster] daemon cpg_join error retrying > On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: > >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. I don't really recall. Hopefully someone more familiar with GFS2 can chime in. 
> > Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) > > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. It does not sound like your network is particularly healthy. Are you using multicast or udpu? If multicast, it might be worth trying udpu > > Thanks > Lax > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Wednesday, October 29, 2014 2:42 PM > To: linux clustering > Subject: Re: [Linux-cluster] daemon cpg_join error retrying > > >> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >> >> Hi All, >> >> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. > > I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > >> >> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >> >> >> Thanks >> Lax >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From andrew at beekhof.net Wed Oct 29 22:38:00 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 30 Oct 2014 09:38:00 +1100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> > On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote: > > >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. > >> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. > Ok. 
> >>> >>> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >>> >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > >> It does not sound like your network is particularly healthy. >> Are you using multicast or udpu? If multicast, it might be worth trying udpu > > I am using udpu and I also have firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM envirornment and even if I switch to other node within same VM I keep getting same failure. Depending on what the host and VMs are doing, that might be your problem. In any case, I will defer to the corosync guys at this point. > > Thanks > Lax > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Wednesday, October 29, 2014 3:17 PM > To: linux clustering > Subject: Re: [Linux-cluster] daemon cpg_join error retrying > > >> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >> >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. > > I don't really recall. Hopefully someone more familiar with GFS2 can chime in. > >> >> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >> >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. 
>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > > It does not sound like your network is particularly healthy. > Are you using multicast or udpu? If multicast, it might be worth trying udpu > >> >> Thanks >> Lax >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Wednesday, October 29, 2014 2:42 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >> >> >>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>> >>> Hi All, >>> >>> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >> >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> >>> >>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >>> >>> >>> Thanks >>> Lax >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jfriesse at redhat.com Thu Oct 30 08:23:29 2014 From: jfriesse at redhat.com (Jan Friesse) Date: Thu, 30 Oct 2014 09:23:29 +0100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> Message-ID: <5451F581.5050100@redhat.com> > >> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote: >> >> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >> >>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >> Ok. >> >>>> >>>> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >>>> >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. 
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>
>>> It does not sound like your network is particularly healthy.
>>> Are you using multicast or udpu? If multicast, it might be worth trying udpu
>>
>> I am using udpu and I also have the firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM environment and even if I switch to the other node within the same VM I keep getting the same failure.
>
> Depending on what the host and VMs are doing, that might be your problem.
> In any case, I will defer to the corosync guys at this point.

Lax, the usual reasons for this problem are:

1. The MTU is too high and fragmented packets are not enabled (take a look at the netmtu configuration option).

2. The config file on the nodes is not in sync, and one node may contain more node entries than the others (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster).

3. The firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.

I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.

Regards,
Honza
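(A note for readers working through Honza's list: points 1 and 3 translate roughly into the sketch below. The numbers and the peer address are illustrative, not taken from this cluster; netmtu, token and consensus are documented totem options in corosync.conf(5), and consensus must stay larger than token.)

totem {
        version: 2
        token: 10000      # token timeout in ms, the value mentioned in this thread
        consensus: 12000  # must be greater than token; corosync defaults it to 1.2 * token
        netmtu: 1472      # lower this if something in the path drops full 1500-byte frames
}

# Quick checks on each node (RHEL 6 style; the peer address is only an example):
ping -M do -s 1472 172.28.0.65   # DF bit set; fails if the path MTU is below 1500
service iptables stop            # testing only, per the recommendation above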
>> >> Thanks >> Lax >> >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com >> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Wednesday, October 29, 2014 3:17 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >> >> >>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >>> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >> >> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >> >>> >>> Also one more issue I am seeing in one other setup a repeated flood >>> of 'A processor joined or left the membership and a new membership >>> was formed' messages for every 4secs. I am running with default >>> TOTEM settings with token time out as 10 secs. Even after I increase >>> the token, consensus values to be higher. It goes on flooding the >>> same message after newer consensus defined time (eg: if I increase >>> it to be 10secs, then I see new membership formed messages for every >>> 10secs) >>> >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> It does not sound like your network is particularly healthy. >> Are you using multicast or udpu? If multicast, it might be worth trying udpu >> >>> >>> Thanks >>> Lax >>> >>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>> Sent: Wednesday, October 29, 2014 2:42 PM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >>> >>> >>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>>> >>>> Hi All, >>>> >>>> In one of my setup, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >>> >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> >>>> >>>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >>>> >>>> >>>> Thanks >>>> Lax >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster

From mgrac at redhat.com Thu Oct 30 15:40:05 2014 From: mgrac at redhat.com (Marek "marx" Grac) Date: Thu, 30 Oct 2014 16:40:05 +0100 Subject: [Linux-cluster] Building upstream fence agents on RHEL/CentOS 6 Message-ID: <54525BD5.9060409@redhat.com>

Hi,

After a small investigation on RHEL 6.6 with fence agents from upstream (latest git):

Summary: Yes, it should work.

Details:

* it is required to fix the auto* stuff, as Alan found; the fix will very likely be in the next release:
  change ACLOCAL_AMFLAGS from -I m4 to -I make
  change AC_CONFIG_MACRO_DIR from m4 to make
* a) fence_vmware_soap requires the package python-requests (+deps), available only in EPEL
  b) or ignore fence_vmware_soap (fix: from configure.ac remove AC_PYTHON_MODULE(requests, 1))
* in lib/fencing.py.py replace 'stream=sys.stderr' with 'sys.stderr' (one occurrence)
* standard ./autogen.sh; ./configure; make

m,
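(Condensed into commands, the recipe above might look like this sketch, run from a fence-agents git checkout on RHEL/CentOS 6. The sed patterns and paths are illustrative assumptions; check the actual tree before running them.)

sed -i 's/-I m4/-I make/' Makefile.am                                    # ACLOCAL_AMFLAGS
sed -i 's/AC_CONFIG_MACRO_DIR(\[m4\])/AC_CONFIG_MACRO_DIR([make])/' configure.ac
yum install python-requests                                              # from EPEL, for fence_vmware_soap
# ...or ignore fence_vmware_soap and drop the check instead:
# sed -i '/AC_PYTHON_MODULE(requests, 1)/d' configure.ac
sed -i 's/stream=sys.stderr/sys.stderr/' lib/fencing.py.py               # one occurrence
./autogen.sh && ./configure && make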
From lkota at cisco.com Thu Oct 30 17:46:30 2014 From: lkota at cisco.com (Lax Kota (lkota)) Date: Thu, 30 Oct 2014 17:46:30 +0000 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: <5451F581.5050100@redhat.com> References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> <5451F581.5050100@redhat.com> Message-ID:

Thanks Honza. Here is what I was doing,

> the usual reasons for this problem are:
> 1. The MTU is too high and fragmented packets are not enabled (take a look at the netmtu configuration option).

I am running with default mtu settings, which is 1500. And I do see my interface (eth1) on the box has an MTU of 1500 too.

> 2. The config file on the nodes is not in sync, and one node may contain more node entries than the others (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster).
> 3. The firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.

Verified my config files cluster.conf and cib.xml, and both have the same number of node entries (2).

> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.

I also ran tests with the firewall off on both participating nodes, and I still see the same issue.

In the corosync log I see the following set of messages repeated; hoping these will give some more pointers.

Oct 29 22:11:02 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01) Oct 29 22:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service. Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0 Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11. Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10. Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0. Oct 29 22:11:05 corosync [TOTEM ] got commit token Oct 29 22:11:05 corosync [TOTEM ] Saving state aru 1b high seq received 1b Oct 29 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708 Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state. Oct 29 22:11:05 corosync [TOTEM ] got commit token Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state. Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64: Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65: Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64: Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64 Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1 Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65: Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64 Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1 Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery.
Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Oct 29 22:11:05 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Oct 29 22:11:05 corosync [TOTEM ] Resetting old ring state Oct 29 22:11:05 corosync [TOTEM ] recovery to regular 1-0 Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 1, seq=333576 Oct 29 22:11:05 corosync [TOTEM ] waiting_trans_ack changed to 1 Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 0, seq=333576 Oct 29 22:11:05 corosync [CMAN ] ais: last memb_count = 2, current = 2 Oct 29 22:11:05 corosync [CMAN ] memb: sending TRANSITION message. cluster_name = vsomcluster Oct 29 22:11:05 corosync [CMAN ] ais: comms send message 0x7fff8185ca00 len = 65 Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 24 Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 34 Oct 29 22:11:05 corosync [SYNC ] This node is within the primary component and will provide service. Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state. Oct 29 22:11:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=81, endian_conv=0 Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5 Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 2 Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0 Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=2, incarnation = 333576 Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 1, len=81, endian_conv=0 Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5 Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 1 Oct 29 22:11:05 corosync [CMAN ] Completed first transition with nodes on the same config versions Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0 Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=1, incarnation = 333576 Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. 
Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy CLM service) Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy CLM service) Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy AMF service) Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy AMF service) Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (openais checkpoint service B.01.01) Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (openais checkpoint service B.01.01) Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy EVT service) Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. 
Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy EVT service) Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (corosync cluster closed process group service v1.01) Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 1 Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.65) ; members(old:2 left:0) Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) Oct 29 22:11:05 corosync [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 2 Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [CPG ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(172.28.0.65) , pid:9198 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(172.28.0.65) , pid:9196 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(172.28.0.65) , pid:9194 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(172.28.0.65) , pid:9193 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[4] group:pcmk\x00, ip:r(0) ip(172.28.0.65) , pid:9187 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[5] group:gfs:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9111 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[6] group:dlm:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9057 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[7] group:fenced:default\x00, ip:r(0) ip(172.28.0.65) , pid:9040 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[8] group:fenced:daemon\x00, ip:r(0) ip(172.28.0.65) , pid:9040 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[9] group:crmd\x00, ip:r(0) ip(172.28.0.64) , pid:14530 Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01) Oct 29 22:11:05 corosync [MAIN ] Completed service synchronization, ready to provide service. Thanks Lax -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse Sent: Thursday, October 30, 2014 1:23 AM To: linux clustering Subject: Re: [Linux-cluster] daemon cpg_join error retrying > >> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote: >> >> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >> >>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >> Ok. >> >>>> >>>> Also one more issue I am seeing in one other setup a repeated flood >>>> of 'A processor joined or left the membership and a new membership >>>> was formed' messages for every 4secs. 
I am running with default >>>> TOTEM settings with token time out as 10 secs. Even after I >>>> increase the token, consensus values to be higher. It goes on >>>> flooding the same message after newer consensus defined time (eg: >>>> if I increase it to be 10secs, then I see new membership formed >>>> messages for every 10secs) >>>> >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>>> >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >>> It does not sound like your network is particularly healthy. >>> Are you using multicast or udpu? If multicast, it might be worth >>> trying udpu >> >> I am using udpu and I also have firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM envirornment and even if I switch to other node within same VM I keep getting same failure. > > Depending on what the host and VMs are doing, that might be your problem. > In any case, I will defer to the corosync guys at this point. > Lax, usual reasons for this problem: 1. mtu is too high and fragmented packets are not enabled (take a look to netmtu configuration option) 2. config file on nodes are not in sync and one node may contain more node entries then other nodes (this may be also the case if you have two clusters and one cluster contains entry of one node for other cluster) 3. firewall is asymmetrically blocked (so node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending. I would recommend to disable firewall completely (for testing) and if everything will work, you just need to adjust firewall. Regards, Honza >> >> Thanks >> Lax >> >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com >> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Wednesday, October 29, 2014 3:17 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >> >> >>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >>> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >> >> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >> >>> >>> Also one more issue I am seeing in one other setup a repeated flood >>> of 'A processor joined or left the membership and a new membership >>> was formed' messages for every 4secs. I am running with default >>> TOTEM settings with token time out as 10 secs. Even after I increase >>> the token, consensus values to be higher. 
It goes on flooding the >>> same message after newer consensus defined time (eg: if I increase >>> it to be 10secs, then I see new membership formed messages for every >>> 10secs) >>> >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> It does not sound like your network is particularly healthy. >> Are you using multicast or udpu? If multicast, it might be worth >> trying udpu >> >>> >>> Thanks >>> Lax >>> >>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew >>> Beekhof >>> Sent: Wednesday, October 29, 2014 2:42 PM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >>> >>> >>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>>> >>>> Hi All, >>>> >>>> In one of my setup, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >>> >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> >>>> >>>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >>>> >>>> >>>> Thanks >>>> Lax >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From fanyunfeng.ce at gmail.com Fri Oct 31 05:35:24 2014 From: fanyunfeng.ce at gmail.com (Yunfeng Fan) Date: Fri, 31 Oct 2014 13:35:24 +0800 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: On Oct 30, 2014 5:44 AM, "Lax Kota (lkota)" wrote: > Hi All, > > > In one of my setup, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. > > > > Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue?
> > > > > Thanks > > Lax > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL:

From jfriesse at redhat.com Fri Oct 31 16:43:29 2014 From: jfriesse at redhat.com (Jan Friesse) Date: Fri, 31 Oct 2014 17:43:29 +0100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> <5451F581.5050100@redhat.com> Message-ID: <5453BC31.2000102@redhat.com>

Lax,

> Thanks Honza. Here is what I was doing,
>
>> the usual reasons for this problem are:
>> 1. The MTU is too high and fragmented packets are not enabled (take a look at the netmtu configuration option).
> I am running with default mtu settings, which is 1500. And I do see my interface (eth1) on the box has an MTU of 1500 too.

Keep in mind that if the nodes are not directly connected, a switch can drop packets because of the MTU.

>> 2. The config file on the nodes is not in sync, and one node may contain more node entries than the others (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster).
>> 3. The firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.
> Verified my config files cluster.conf and cib.xml, and both have the same number of node entries (2).
>
>> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.
> I also ran tests with the firewall off on both participating nodes, and I still see the same issue.
>
> In the corosync log I see the following set of messages repeated; hoping these will give some more pointers.
>
> Oct 29 22:11:02 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11.
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10.
> Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0.

This is just weird. What exact version of corosync are you running? Do you have the latest Z stream?

Regards,
Honza
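(For completeness: the version and Z-stream level Honza asks about can be read straight from the installed packages; a sketch, RHEL 6 style, with an illustrative release string:)

rpm -q corosync pacemaker   # the release field shows the Z stream, e.g. corosync-1.4.1-17.el6_5.1
corosync -v                 # prints the version of the corosync binary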
> Oct 29 22:11:05 corosync [TOTEM ] got commit token > Oct 29 22:11:05 corosync [TOTEM ] Saving state aru 1b high seq received 1b > Oct 29 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708 > Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state. > Oct 29 22:11:05 corosync [TOTEM ] got commit token > Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state. > Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64: > Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65: > Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64: > Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64 > Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1 > Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65: > Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64 > Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1 > Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery. > Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff > Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 > Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 > Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 > Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Oct 29 22:11:05 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 > Oct 29 22:11:05 corosync [TOTEM ] Resetting old ring state > Oct 29 22:11:05 corosync [TOTEM ] recovery to regular 1-0 > Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 1, seq=333576 > Oct 29 22:11:05 corosync [TOTEM ] waiting_trans_ack changed to 1 > Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 0, seq=333576 > Oct 29 22:11:05 corosync [CMAN ] ais: last memb_count = 2, current = 2 > Oct 29 22:11:05 corosync [CMAN ] memb: sending TRANSITION message. cluster_name = vsomcluster > Oct 29 22:11:05 corosync [CMAN ] ais: comms send message 0x7fff8185ca00 len = 65 > Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 24 > Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 34 > Oct 29 22:11:05 corosync [SYNC ] This node is within the primary component and will provide service. > Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state. > Oct 29 22:11:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=81, endian_conv=0 > Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5 > Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 2 > Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0 > Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=2, incarnation = 333576 > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 1, len=81, endian_conv=0 > Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5 > Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 1 > Oct 29 22:11:05 corosync [CMAN ] Completed first transition with nodes on the same config versions > Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0 > Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=1, incarnation = 333576 > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy CLM service) > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy CLM service) > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy AMF service) > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy AMF service) > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (openais checkpoint service B.01.01) > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (openais checkpoint service B.01.01) > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy EVT service) > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. 
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy EVT service) > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (corosync cluster closed process group service v1.01) > Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 1 > Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.65) ; members(old:2 left:0) > Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 22:11:05 corosync [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 2 > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(172.28.0.65) , pid:9198 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(172.28.0.65) , pid:9196 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(172.28.0.65) , pid:9194 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(172.28.0.65) , pid:9193 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[4] group:pcmk\x00, ip:r(0) ip(172.28.0.65) , pid:9187 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[5] group:gfs:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9111 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[6] group:dlm:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9057 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[7] group:fenced:default\x00, ip:r(0) ip(172.28.0.65) , pid:9040 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[8] group:fenced:daemon\x00, ip:r(0) ip(172.28.0.65) , pid:9040 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[9] group:crmd\x00, ip:r(0) ip(172.28.0.64) , pid:14530 > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01) > Oct 29 22:11:05 corosync [MAIN ] Completed service synchronization, ready to provide service. > > Thanks > Lax > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse > Sent: Thursday, October 30, 2014 1:23 AM > To: linux clustering > Subject: Re: [Linux-cluster] daemon cpg_join error retrying > >> >>> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote: >>> >>> >>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >>> >>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >>> Ok. 
>>> >>>>> >>>>> Also one more issue I am seeing in one other setup a repeated flood >>>>> of 'A processor joined or left the membership and a new membership >>>>> was formed' messages for every 4secs. I am running with default >>>>> TOTEM settings with token time out as 10 secs. Even after I >>>>> increase the token, consensus values to be higher. It goes on >>>>> flooding the same message after newer consensus defined time (eg: >>>>> if I increase it to be 10secs, then I see new membership formed >>>>> messages for every 10secs) >>>>> >>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>>>> >>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>>> It does not sound like your network is particularly healthy. >>>> Are you using multicast or udpu? If multicast, it might be worth >>>> trying udpu >>> >>> I am using udpu and I also have firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM envirornment and even if I switch to other node within same VM I keep getting same failure. >> >> Depending on what the host and VMs are doing, that might be your problem. >> In any case, I will defer to the corosync guys at this point. >> > > Lax, > usual reasons for this problem: > 1. mtu is too high and fragmented packets are not enabled (take a look to netmtu configuration option) 2. config file on nodes are not in sync and one node may contain more node entries then other nodes (this may be also the case if you have two clusters and one cluster contains entry of one node for other cluster) 3. firewall is asymmetrically blocked (so node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending. > > I would recommend to disable firewall completely (for testing) and if everything will work, you just need to adjust firewall. > > Regards, > Honza > > > >>> >>> Thanks >>> Lax >>> >>> >>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>> Sent: Wednesday, October 29, 2014 3:17 PM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >>> >>> >>>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >>>> >>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >>> >>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >>> >>>> >>>> Also one more issue I am seeing in one other setup a repeated flood >>>> of 'A processor joined or left the membership and a new membership >>>> was formed' messages for every 4secs. 
I am running with default >>>> TOTEM settings with token time out as 10 secs. Even after I increase >>>> the token, consensus values to be higher. It goes on flooding the >>>> same message after newer consensus defined time (eg: if I increase >>>> it to be 10secs, then I see new membership formed messages for every >>>> 10secs) >>>> >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>>> >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>> It does not sound like your network is particularly healthy. >>> Are you using multicast or udpu? If multicast, it might be worth >>> trying udpu >>> >>>> >>>> Thanks >>>> Lax >>>> >>>> >>>> -----Original Message----- >>>> From: linux-cluster-bounces at redhat.com >>>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew >>>> Beekhof >>>> Sent: Wednesday, October 29, 2014 2:42 PM >>>> To: linux clustering >>>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >>>> >>>> >>>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>>>> >>>>> Hi All, >>>>> >>>>> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >>>> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>> >>>>> >>>>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >>>>> >>>>> >>>>> Thanks >>>>> Lax >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From lkota at cisco.com Fri Oct 31 19:41:24 2014 From: lkota at cisco.com (Lax Kota (lkota)) Date: Fri, 31 Oct 2014 19:41:24 +0000 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: <5453BC31.2000102@redhat.com> References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> <5451F581.5050100@redhat.com> <5453BC31.2000102@redhat.com> Message-ID: > This is just weird. What exact version of corosync are you running? Do you have latest Z stream? 
I am running on Corosync 1.4.1 and pacemaker version is 1.1.8-7.el6 Thanks Lax -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse Sent: Friday, October 31, 2014 9:43 AM To: linux clustering Subject: Re: [Linux-cluster] daemon cpg_join error retrying Lax, > Thanks Honza. Here is what I was doing, > >> usual reasons for this problem: >> 1. mtu is too high and fragmented packets are not enabled (take a >> look to netmtu configuration option) > I am running with default mtu settings which is 1500. And I do see my interface(eth1) on the box does have MTU as 1500 too. > Keep in mind that if they are not directly connected, switch can throw packets because of MTU. > > 2. config file on nodes are not in sync and one node may contain more node entries then other nodes (this may be also the case if you have two > clusters and one cluster contains entry of one node for other cluster) 3. firewall is asymmetrically blocked (so node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending. > Verfiifed my config files cluster.conf and cib.xml and both have same > no of node entries (2) > >> I would recommend to disable firewall completely (for testing) and if everything will work, you just need to adjust firewall. > I also ran tests with firewall off too on both the participating > nodes, still see same issue > > In corosync log I see repeated set of these messages, hoping these will give some more pointers. > > Oct 29 22:11:02 corosync [SYNC ] Committing synchronization for > (corosync cluster closed process group service v1.01) Oct 29 22:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service. > Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0 Oct > 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11. > Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10. > Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0. This is just weird. What exact version of corosync are you running? Do you have latest Z stream? Regards, Honza > Oct 29 22:11:05 corosync [TOTEM ] got commit token Oct 29 22:11:05 > corosync [TOTEM ] Saving state aru 1b high seq received 1b Oct 29 > 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708 Oct > 29 22:11:05 corosync [TOTEM ] entering COMMIT state. > Oct 29 22:11:05 corosync [TOTEM ] got commit token Oct 29 22:11:05 > corosync [TOTEM ] entering RECOVERY state. > Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64: > Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65: > Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64: > Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep > 172.28.0.64 Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b > received flag 1 Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65: > Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep > 172.28.0.64 Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b > received flag 1 Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery. 
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> Oct 29 22:11:05 corosync [TOTEM ] Resetting old ring state
> Oct 29 22:11:05 corosync [TOTEM ] recovery to regular 1-0
> Oct 29 22:11:05 corosync [CMAN  ] ais: confchg_fn called type = 1, seq=333576
> Oct 29 22:11:05 corosync [TOTEM ] waiting_trans_ack changed to 1
> Oct 29 22:11:05 corosync [CMAN  ] ais: confchg_fn called type = 0, seq=333576
> Oct 29 22:11:05 corosync [CMAN  ] ais: last memb_count = 2, current = 2
> Oct 29 22:11:05 corosync [CMAN  ] memb: sending TRANSITION message. cluster_name = vsomcluster
> Oct 29 22:11:05 corosync [CMAN  ] ais: comms send message 0x7fff8185ca00 len = 65
> Oct 29 22:11:05 corosync [CMAN  ] daemon: sending reply 103 to fd 24
> Oct 29 22:11:05 corosync [CMAN  ] daemon: sending reply 103 to fd 34
> Oct 29 22:11:05 corosync [SYNC  ] This node is within the primary component and will provide service.
> Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state.
> Oct 29 22:11:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct 29 22:11:05 corosync [CMAN  ] ais: deliver_fn source nodeid = 2, len=81, endian_conv=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: Message on port 0 is 5
> Oct 29 22:11:05 corosync [CMAN  ] memb: got TRANSITION from node 2
> Oct 29 22:11:05 corosync [CMAN  ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: add_ais_node ID=2, incarnation = 333576
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [CMAN  ] ais: deliver_fn source nodeid = 1, len=81, endian_conv=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: Message on port 0 is 5
> Oct 29 22:11:05 corosync [CMAN  ] memb: got TRANSITION from node 1
> Oct 29 22:11:05 corosync [CMAN  ] Completed first transition with nodes on the same config versions
> Oct 29 22:11:05 corosync [CMAN  ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: add_ais_node ID=1, incarnation = 333576
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy CLM service)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy CLM service)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy AMF service)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy AMF service)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (openais checkpoint service B.01.01)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (openais checkpoint service B.01.01)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy EVT service)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy EVT service)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:05 corosync [CPG   ] got joinlist message from node 1
> Oct 29 22:11:05 corosync [CPG   ] comparing: sender r(0) ip(172.28.0.65) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG   ] comparing: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG   ] got joinlist message from node 2
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(172.28.0.65) , pid:9198
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(172.28.0.65) , pid:9196
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(172.28.0.65) , pid:9194
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(172.28.0.65) , pid:9193
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[4] group:pcmk\x00, ip:r(0) ip(172.28.0.65) , pid:9187
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[5] group:gfs:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9111
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[6] group:dlm:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9057
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[7] group:fenced:default\x00, ip:r(0) ip(172.28.0.65) , pid:9040
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[8] group:fenced:daemon\x00, ip:r(0) ip(172.28.0.65) , pid:9040
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[9] group:crmd\x00, ip:r(0) ip(172.28.0.64) , pid:14530
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:05 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>
> Thanks
> Lax
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse
> Sent: Thursday, October 30, 2014 1:23 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>
>>
>>> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote:
>>>
>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>>>> How do I check the cluster name of a GFS file system? I had a similar configuration running fine in multiple other setups with no such issue.
>>>
>>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
>>> Ok.
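As an aside on the question above: the cluster name a GFS2 filesystem was created with is stored in the lock-table field of its superblock, so it can be compared directly against cluster.conf. A minimal sketch, with an illustrative device path (gfs2_tool ships in gfs2-utils on RHEL 6; tunegfs2 is the newer equivalent):

    # Print the lock table ("clustername:fsname") recorded in the superblock.
    gfs2_tool sb /dev/mapper/clustervg-gfs2lv table
    # Or, where tunegfs2 is available:
    tunegfs2 -l /dev/mapper/clustervg-gfs2lv | grep -i table
    # The part before the colon must match the cluster name in cluster.conf;
    # cman_tool status also prints the running cluster's name for comparison.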
>>>
>>>>> Also, one more issue I am seeing in one other setup: a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages every 4 secs. I am running with default TOTEM settings, with the token timeout as 10 secs, and the flood continues even after I increase the token and consensus values. It goes on repeating the same message at whatever the newly defined consensus time is (e.g. if I increase it to 10 secs, then I see new-membership-formed messages every 10 secs).
>>>>>
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>
>>>> It does not sound like your network is particularly healthy.
>>>> Are you using multicast or udpu? If multicast, it might be worth trying udpu.
>>>
>>> I am using udpu and I also have the firewall opened for ports 5404 & 5405. Tcpdump looks fine too; it does not complain of any issues. This is a VM environment, and even if I switch to the other node within the same VM I keep getting the same failure.
>>
>> Depending on what the host and VMs are doing, that might be your problem.
>> In any case, I will defer to the corosync guys at this point.
>>
>
> Lax,
> usual reasons for this problem:
> 1. mtu is too high and fragmented packets are not enabled (take a look at the netmtu configuration option)
> 2. config files on the nodes are not in sync and one node may contain more node entries than the other nodes (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster)
> 3. the firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.
>
> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.
>
> Regards,
> Honza
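A concrete sketch of points 1 and 3 follows; the MTU value and firewall rules are illustrative only, and the node addresses are simply the ones that appear in the log above:

    # Point 1: force a smaller totem MTU. In corosync.conf this is the netmtu
    # option of the totem section; with cman the same option is, as far as I
    # know, set as an attribute of the <totem/> element in cluster.conf,
    # e.g. <totem netmtu="1400"/>.
    #
    # Point 3: for udpu, allow all corosync UDP traffic between the node
    # addresses rather than only ports 5404/5405 (run on each node):
    iptables -I INPUT -p udp -s 172.28.0.64 -j ACCEPT
    iptables -I INPUT -p udp -s 172.28.0.65 -j ACCEPT
    #
    # Honza's "disable completely for testing" step, on RHEL 6:
    service iptables stop
    service iptables start   # re-enable after the test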
>>>> -----Original Message-----
>>>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof
>>>> Sent: Wednesday, October 29, 2014 2:42 PM
>>>> To: linux clustering
>>>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>>>>
>>>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> In one of my setups, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2-node setup with pacemaker and corosync.
>>>>
>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>>>
>>>>> Even after I force-kill the pacemaker processes, reboot the server, and bring pacemaker back up, it keeps giving the cpg_join error. Is there any way to fix this issue?
>>>>>
>>>>> Thanks
>>>>> Lax

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From lkota at cisco.com Fri Oct 31 19:43:01 2014
From: lkota at cisco.com (Lax Kota (lkota))
Date: Fri, 31 Oct 2014 19:43:01 +0000
Subject: [Linux-cluster] daemon cpg_join error retrying
References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> <5451F581.5050100@redhat.com> <5453BC31.2000102@redhat.com>
Message-ID:

> This is just weird. What exact version of corosync are you running?
> Do you have latest Z stream?

I am running Corosync 1.4.1 and the pacemaker version is 1.1.8-7.el6. How should I get access to the Z stream? Is there a specific directory I should pick this Z stream from?

Thanks
Lax

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse
Sent: Friday, October 31, 2014 9:43 AM
To: linux clustering
Subject: Re: [Linux-cluster] daemon cpg_join error retrying

Lax,
> Thanks Honza. Here is what I was doing,
>
>> usual reasons for this problem:
>> 1. mtu is too high and fragmented packets are not enabled (take a look at the netmtu configuration option)
>
> I am running with the default MTU setting, which is 1500. And I do see that my interface (eth1) on the box has an MTU of 1500 too.

Keep in mind that if the nodes are not directly connected, the switch can drop packets because of MTU.

>> 2. config files on the nodes are not in sync and one node may contain more node entries than the other nodes (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster)
>> 3. the firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.
>
> Verified my config files cluster.conf and cib.xml, and both have the same number of node entries (2).
>
>> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.
>
> I also ran tests with the firewall off on both participating nodes and still see the same issue.
>
> In the corosync log I see a repeated set of these messages; hoping these will give some more pointers.
>
> Oct 29 22:11:02 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:02 corosync [MAIN  ] Completed service synchronization, ready to provide service.
> Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11.
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10.
> Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0.

This is just weird. What exact version of corosync are you running? Do you have latest Z stream?

Regards,
Honza

> Oct 29 22:11:05 corosync [TOTEM ] got commit token
> Oct 29 22:11:05 corosync [TOTEM ] Saving state aru 1b high seq received 1b
> Oct 29 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708
> Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state.
> Oct 29 22:11:05 corosync [TOTEM ] got commit token
> Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state.
> Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64:
> Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65:
> Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64:
> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
> Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65:
> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
> Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery.
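Honza's MTU caveat above can also be tested directly. A sketch of a path-MTU check between the two nodes, using the addresses from the log (payload sizes are illustrative):

    # 1472 bytes of ICMP payload + 28 bytes of headers = one full 1500-byte frame.
    # -M do sets the DF bit, so the ping fails if any hop cannot pass it unfragmented.
    ping -M do -s 1472 -c 3 172.28.0.65
    # If this fails while a smaller payload (e.g. -s 1372) succeeds, the path MTU
    # is below 1500 and lowering netmtu (or fixing the switch) is the likely remedy.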
> [snip: remainder of the quoted corosync log and earlier thread, unchanged from the previous message]

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster