From neale at sinenomine.net  Thu Oct 2 19:30:03 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Thu, 2 Oct 2014 19:30:03 +0000
Subject: [Linux-cluster] Fencing of node
Message-ID: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net>

After creating a simple two-node cluster, one node is being fenced continually. I'm running pacemaker (1.1.10-29) with two nodes and the following corosync.conf:

totem {
  version: 2
  secauth: off
  cluster_name: rh7cluster
  transport: udpu
}

nodelist {
  node {
    ring0_addr: rh7cn1.devlab.sinenomine.net
    nodeid: 1
  }
  node {
    ring0_addr: rh7cn2.devlab.sinenomine.net
    nodeid: 2
  }
}

quorum {
  provider: corosync_votequorum
  two_node: 1
}

logging {
  to_syslog: yes
}

Starting the cluster shows the following in the logs of both nodes:

Oct 2 15:17:47 rh7cn1 kernel: dlm: connect from non cluster node

Both nodes then try to bring up resources (dlm, clvmd, and a cluster fs). Just prior to a node being fenced, both nodes show the following:

# pcs resource show
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     clvmd  (ocf::heartbeat:clvm):  FAILED
     Started: [ rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]

Shortly after, there is a clvmd timeout message in one of the logs and then that node gets fenced. I had added the high-availability firewalld service to both nodes.

Running crm_simulate -SL -VV shows:

warning: unpack_rsc_op: Processing failed op start for clvmd:1 on rh7cn1.devlab.sinenomine.net: unknown error (1)

Current cluster status:
Online: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]

 ZVMPOWER  (stonith:fence_zvm):  Started rh7cn2.devlab.sinenomine.net
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     clvmd  (ocf::heartbeat:clvm):  FAILED rh7cn1.devlab.sinenomine.net
     Started: [ rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]

warning: common_apply_stickiness: Forcing clvmd-clone away from rh7cn1.devlab.sinenomine.net after 1000000 failures (max=1000000)
warning: common_apply_stickiness: Forcing clvmd-clone away from rh7cn1.devlab.sinenomine.net after 1000000 failures (max=1000000)

Transition Summary:
 * Stop    clvmd:1  (rh7cn1.devlab.sinenomine.net)

Executing cluster transition:
 * Pseudo action:   clvmd-clone_stop_0
 * Resource action: clvmd           stop on rh7cn1.devlab.sinenomine.net
 * Pseudo action:   clvmd-clone_stopped_0
 * Pseudo action:   all_stopped

Revised cluster status:
warning: unpack_rsc_op: Processing failed op start for clvmd:1 on rh7cn1.devlab.sinenomine.net: unknown error (1)
Online: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]

 ZVMPOWER  (stonith:fence_zvm):  Started rh7cn2.devlab.sinenomine.net
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn2.devlab.sinenomine.net ]
     Stopped: [ rh7cn1.devlab.sinenomine.net ]

With RHEL 6 I would use a qdisk, but this has been replaced by corosync_votequorum. This is my first RHEL 7 HA cluster so I'm at the beginning of my learning. Any pointers as to what I should look at or what I need to read?

Neale
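[Editorial note: for anyone retracing this setup, the usual first sanity checks on a two-node votequorum cluster are the commands below. These are standard RHEL 7 tools, but the exact output will differ per site; treat this as a sketch, not a transcript from Neale's nodes:]

  # corosync-quorumtool -s     # should report the 2Node and WaitForAll flags and 2 total votes
  # pcs stonith show --full    # lists the configured fence devices and their options
  # pcs status --full          # node, resource and failed-action overview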
From neale at sinenomine.net  Thu Oct 2 19:44:24 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Thu, 2 Oct 2014 19:44:24 +0000
Subject: [Linux-cluster] Fencing of node
In-Reply-To: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net>
References: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net>
Message-ID: <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net>

Forgot to include cib.xml:

From ccaulfie at redhat.com  Fri Oct 3 07:34:49 2014
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Fri, 03 Oct 2014 08:34:49 +0100
Subject: [Linux-cluster] Fencing of node
In-Reply-To: <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net>
References: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net> <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net>
Message-ID: <542E5199.3000901@redhat.com>

I think you're hitting this bug:

https://www.redhat.com/archives/cluster-devel/2014-September/msg00031.html

The fix is in git, but no packages are available yet, sadly.

Chrissie

On 02/10/14 20:44, Neale Ferguson wrote:
> Forgot to include cib.xml:
From daniel.dehennin at baby-gnu.org  Fri Oct 3 14:35:36 2014
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Fri, 03 Oct 2014 16:35:36 +0200
Subject: [Linux-cluster] cLVM unusable on quorated cluster
Message-ID: <87egupfcg7.fsf@hati.baby-gnu.org>

Hello,

I'm trying to set up pacemaker+corosync on Debian Wheezy to access a SAN for an OpenNebula cluster.

As I'm new to the cluster world, I have a hard time figuring out why things sometimes go really wrong and where I must look to find answers.

My OpenNebula frontend, running in a VM, does not manage to run the resources and my syslog has a lot of:

#+begin_src
ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
#+end_src

When this happens, other nodes have problems:

#+begin_src
root at nebula3:~# LANG=C vgscan
  cluster request failed: Host is down
  Unable to obtain global lock.
#+end_src

But things look fine in "crm_mon":

#+begin_src
root at nebula3:~# crm_mon -1
============
Last updated: Fri Oct 3 16:25:43 2014
Last change: Fri Oct 3 14:51:59 2014 via cibadmin on nebula1
Stack: openais
Current DC: nebula3 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
5 Nodes configured, 5 expected votes
32 Resources configured.
============

Node quorum: standby
Online: [ nebula3 nebula2 nebula1 ]
OFFLINE: [ one ]

Stonith-nebula3-IPMILAN (stonith:external/ipmi): Started nebula2
Stonith-nebula2-IPMILAN (stonith:external/ipmi): Started nebula3
Stonith-nebula1-IPMILAN (stonith:external/ipmi): Started nebula2
Clone Set: ONE-Storage-Clone [ONE-Storage]
    Started: [ nebula1 nebula3 nebula2 ]
    Stopped: [ ONE-Storage:3 ONE-Storage:4 ]
Quorum-Node (ocf::heartbeat:VirtualDomain): Started nebula3
Stonith-Quorum-Node (stonith:external/libvirt): Started nebula3
#+end_src

I don't know how to interpret the dlm_tool information:

#+begin_src
root at nebula3:~# dlm_tool ls -n
dlm lockspaces
name          CCB10CE8D4FF489B9A2ECB288DACF2D7
id            0x09250e49
flags         0x00000008 fs_reg
change        member 3 joined 1 remove 0 failed 0 seq 2,2
members       1189587136 1206364352 1223141568
all nodes
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none

name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 3 joined 0 remove 1 failed 0 seq 4,4
members       1189587136 1206364352 1223141568
all nodes
nodeid 1172809920 member 0 failed 0 start 0 seq_add 3 seq_rem 4 check none
nodeid 1189587136 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
nodeid 1206364352 member 1 failed 0 start 1 seq_add 2 seq_rem 0 check none
nodeid 1223141568 member 1 failed 0 start 1 seq_add 1 seq_rem 0 check none
#+end_src

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dlm_tool-dump.txt
URL: 
-------------- next part --------------
Is there any documentation on troubleshooting DLM/cLVM?

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 
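[Editorial note: there is no single troubleshooting document for this; in practice the useful primitives are the dlm_controld state dumps and running clvmd with debug logging. A minimal sketch, using tool names that appear elsewhere in this thread; option syntax may differ slightly across versions:]

  # dlm_tool ls -n      # lockspace membership, as shown above
  # dlm_tool dump       # dlm_controld's internal debug buffer, useful after a failure
  # clvmd -d 1          # run clvmd with debug logging enabled to see lock traffic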
From lists at alteeve.ca  Fri Oct 3 14:38:14 2014
From: lists at alteeve.ca (Digimer)
Date: Fri, 03 Oct 2014 10:38:14 -0400
Subject: [Linux-cluster] cLVM unusable on quorated cluster
In-Reply-To: <87egupfcg7.fsf@hati.baby-gnu.org>
References: <87egupfcg7.fsf@hati.baby-gnu.org>
Message-ID: <542EB4D6.4030008@alteeve.ca>

On 03/10/14 10:35 AM, Daniel Dehennin wrote:
> Hello,
>
> I'm trying to set up pacemaker+corosync on Debian Wheezy to access a SAN
> for an OpenNebula cluster.
(snip)
> Is there any documentation on troubleshooting DLM/cLVM?
>
> Regards.

Can you paste your full pacemaker config and the logs from the other nodes starting just before the lost node went away?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From daniel.dehennin at baby-gnu.org  Fri Oct 3 15:05:23 2014
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Fri, 03 Oct 2014 17:05:23 +0200
Subject: [Linux-cluster] cLVM unusable on quorated cluster
In-Reply-To: <542EB4D6.4030008@alteeve.ca> (Digimer's message of "Fri, 03 Oct 2014 10:38:14 -0400")
References: <87egupfcg7.fsf@hati.baby-gnu.org> <542EB4D6.4030008@alteeve.ca>
Message-ID: <87a95dfb2k.fsf@hati.baby-gnu.org>

Digimer writes:

> Can you paste your full pacemaker config and the logs from the other
> nodes starting just before the lost node went away?

Sorry, I forgot to attach it:

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pcmk.conf
URL: 

Here are the logs on the 3 hypervisors; note that pacemaker does not start at boot time:

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nebula1.log
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nebula2.log
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nebula3.log
URL: 
-------------- next part --------------
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 
From neale at sinenomine.net  Fri Oct 3 15:14:44 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Fri, 3 Oct 2014 15:14:44 +0000
Subject: [Linux-cluster] Fencing of node
In-Reply-To: <542E5199.3000901@redhat.com>
References: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net> <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net> <542E5199.3000901@redhat.com>
Message-ID: <616EF409-A374-4BD1-8418-012B3756F56C@sinenomine.net>

That was the problem! I applied a local patch, rebuilt, restarted, and we're up fine and dandy!

Thanks very much... Neale

On Oct 3, 2014, at 3:34 AM, Christine Caulfield wrote:

> I think you're hitting this bug:
>
> https://www.redhat.com/archives/cluster-devel/2014-September/msg00031.html
>
> The fix is in git, but no packages are available yet, sadly.
>
> Chrissie

From ccaulfie at redhat.com  Fri Oct 3 15:29:56 2014
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Fri, 03 Oct 2014 16:29:56 +0100
Subject: [Linux-cluster] Fencing of node
In-Reply-To: <616EF409-A374-4BD1-8418-012B3756F56C@sinenomine.net>
References: <555388B6-13CB-48F1-B6C7-6496A4C90B1D@sinenomine.net> <17363D5B-1124-492E-8AB4-1A00B84CC38A@sinenomine.net> <542E5199.3000901@redhat.com> <616EF409-A374-4BD1-8418-012B3756F56C@sinenomine.net>
Message-ID: <542EC0F4.7050202@redhat.com>

Great! I'm pleased to hear it :-)

Chrissie

On 03/10/14 16:14, Neale Ferguson wrote:
> That was the problem! I applied a local patch, rebuilt, restarted, and we're up fine and dandy!
>
> Thanks very much... Neale
(snip)

From manish631 at rediffmail.com  Fri Oct 3 16:57:04 2014
From: manish631 at rediffmail.com (manish vaidya)
Date: 3 Oct 2014 16:57:04 -0000
Subject: [Linux-cluster] Linux-cluster Digest, Vol 124, Issue 7
In-Reply-To:
Message-ID: <1409414978.S.13239.Z.15124.F.H.TmxpbnV4LWNsdXN0ZXItcmVxdWVzdEByZWRoYXQuYwBMaW51eC1jbHVzdGVyIERpZ2VzdCwgVm9sIDEyNCw_.RU.rfs294, rfs294, 408, 370.f5-224-163.old.1412355424.6200@webmail.rediffmail.com>

First, I apologise for the late reply; the delay was because I couldn't believe I'd get any response from the site. I am a newcomer and had already posted this problem on many online forums, but they didn't give any response. Thank you all for taking my problem seriously.

** Response from you:

are you using clvmd? if your answer is = yes, you need to be sure, you pv is visibile to your cluster nodes

*** I am using clvmd, and when I use the pvscan command the cluster hangs.

I want to reproduce this situation again for perfection: when I try to run the pvcreate command in the cluster, the messages "lock from node2" and "lock from node3" should appear. I have created a new cluster and it is working fine. How do I do this? Is there any setting in lvm.conf?
On Sat, 30 Aug 2014 21:39:38 +0530, linux-cluster-request at redhat.com wrote:

>Send Linux-cluster mailing list submissions to
>	linux-cluster at redhat.com
>
>To subscribe or unsubscribe via the World Wide Web, visit
>	https://www.redhat.com/mailman/listinfo/linux-cluster
>or, via email, send a message with subject or body 'help' to
>	linux-cluster-request at redhat.com
>
>You can reach the person managing the list at
>	linux-cluster-owner at redhat.com
>
>When replying, please edit your Subject line so it is more specific
>than "Re: Contents of Linux-cluster digest..."
>
>Today's Topics:
>
>   1. Please help me on cluster error (manish vaidya)
>   2. Re: Please help me on cluster error (emmanuel segura)
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: 30 Aug 2014 14:12:42 -0000
>From: "manish vaidya"
>To:
>Subject: [Linux-cluster] Please help me on cluster error
>Message-ID:
>Content-Type: text/plain; charset="utf-8"
>
>I created a four-node cluster in a KVM environment, but I faced an error
>when creating a new pv, such as pvcreate /dev/sdb1; I got the errors
>"lock from node 2" and "lock from node3".
>
>There are also strange cluster logs:
>
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e 5f
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5f 60
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 61
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 63 64
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 69 6a
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 78
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 84 85
>Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 9a 9b
>
>Please help me on this issue
>
>-------------- next part --------------
>An HTML attachment was scrubbed...
>URL:
>
>------------------------------
>
>Message: 2
>Date: Sat, 30 Aug 2014 16:53:08 +0200
>From: emmanuel segura
>To: linux clustering
>Subject: Re: [Linux-cluster] Please help me on cluster error
>Message-ID:
>Content-Type: text/plain; charset="utf-8"
>
>are you using clvmd? if your answer is = yes, you need to be sure, you pv
>is visibile to your cluster nodes
>
>2014-08-30 16:12 GMT+02:00 manish vaidya:
>> I created a four-node cluster in a KVM environment, but I faced an error
>> when creating a new pv, such as pvcreate /dev/sdb1; I got the errors
>> "lock from node 2" and "lock from node3".
>(snip)
>> Please help me on this issue
>
>--
>esta es mi vida e me la vivo hasta que dios quiera
>
>-------------- next part --------------
>An HTML attachment was scrubbed...
>URL:
>
>------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>End of Linux-cluster Digest, Vol 124, Issue 7
>*********************************************

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From manish631 at rediffmail.com  Fri Oct 3 17:03:15 2014
From: manish631 at rediffmail.com (manish vaidya)
Date: 3 Oct 2014 17:03:15 -0000
Subject: [Linux-cluster] Linux-cluster Digest, Vol 124, Issue 8
In-Reply-To:
Message-ID: <1409501135.S.10373.25687.F.H.TmxpbnV4LWNsdXN0ZXItcmVxdWVzdEByZWRoYXQuYwBMaW51eC1jbHVzdGVyIERpZ2VzdCwgVm9sIDEyNCw_.RU.rfs294, rfs294, 441, 91.f5-224-149.old.1412355795.30029@webmail.rediffmail.com>

First, I apologise for the late reply; the delay was because I couldn't believe I'd get any response from the site. I am a newcomer and had already posted this problem on many online forums, but they didn't give any response. Thank you all for taking my problem seriously.

** Currently using Red Hat version 6.5.

I have created a new cluster and it is working fine, but I want to recreate this situation for proper understanding: when using the pvcreate command, the messages "lock from node2" and "lock from node3" should appear. How do I do this?

On Sun, 31 Aug 2014 21:35:35 +0530, linux-cluster-request at redhat.com wrote:

>Send Linux-cluster mailing list submissions to
>	linux-cluster at redhat.com
>
>To subscribe or unsubscribe via the World Wide Web, visit
>	https://www.redhat.com/mailman/listinfo/linux-cluster
>or, via email, send a message with subject or body 'help' to
>	linux-cluster-request at redhat.com
>
>You can reach the person managing the list at
>	linux-cluster-owner at redhat.com
>
>When replying, please edit your Subject line so it is more specific
>than "Re: Contents of Linux-cluster digest..."
>
>Today's Topics:
>
>   1. Re: Please help me on cluster error (Digimer)
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Sat, 30 Aug 2014 12:35:52 -0400
>From: Digimer
>To: linux clustering
>Subject: Re: [Linux-cluster] Please help me on cluster error
>Message-ID:
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>Can you share your cluster information please?
>
>This could be a network problem, as the messages below happen when the
>network between the nodes isn't fast enough or has too long latency and
>cluster traffic is considered lost and re-requested.
>
>If you don't have fencing working properly, and if a network issue
>caused a node to be declared lost, clustered LVM (and anything else
>using cluster locking) will fail (by design).
>
>If you share your configuration and more of your logs, it will help us
>understand what is happening. Please also tell us what version of the
>cluster software you're using.
>
>digimer
>
>On 30/08/14 10:12 AM, manish vaidya wrote:
>> I created a four-node cluster in a KVM environment, but I faced an error
>> when creating a new pv, such as pvcreate /dev/sdb1; I got the errors
>> "lock from node 2" and "lock from node3".
>(snip)
>> Please help me on this issue
>
>--
>Digimer
>Papers and Projects: https://alteeve.ca/w/
>What if the cure for cancer is trapped in the mind of a person without
>access to education?
>
>------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>End of Linux-cluster Digest, Vol 124, Issue 8
>*********************************************

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lists at alteeve.ca  Fri Oct 3 17:56:57 2014
From: lists at alteeve.ca (Digimer)
Date: Fri, 03 Oct 2014 13:56:57 -0400
Subject: [Linux-cluster] clvmd issues
In-Reply-To: <1409414978.S.13239.Z.15124.F.H.TmxpbnV4LWNsdXN0ZXItcmVxdWVzdEByZWRoYXQuYwBMaW51eC1jbHVzdGVyIERpZ2VzdCwgVm9sIDEyNCw_.RU.rfs294, rfs294, 408, 370.f5-224-163.old.1412355424.6200@webmail.rediffmail.com>
References: <1409414978.S.13239.Z.15124.F.H.TmxpbnV4LWNsdXN0ZXItcmVxdWVzdEByZWRoYXQuYwBMaW51eC1jbHVzdGVyIERpZ2VzdCwgVm9sIDEyNCw_.RU.rfs294, rfs294, 408, 370.f5-224-163.old.1412355424.6200@webmail.rediffmail.com>
Message-ID: <542EE369.1080102@alteeve.ca>

On 03/10/14 12:57 PM, manish vaidya wrote:
> First, I apologise for the late reply; the delay was because I couldn't
> believe I'd get any response from the site.
(snip)
> *** I am using clvmd, and when I use the pvscan command the cluster hangs.
>
> I want to reproduce this situation again for perfection: when I try to run
> the pvcreate command in the cluster, the messages "lock from node2" and
> "lock from node3" should appear. I have created a new cluster and it is
> working fine. How do I do this? Is there any setting in lvm.conf?

Can you share your setup please? What kind of cluster? What version? What is the configuration file? Was there anything interesting in the system logs? etc.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
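[Editorial note: on the lvm.conf question above, the supported way to switch LVM to clustered locking on RHEL 6 is lvmconf --enable-cluster, which amounts to roughly the following settings in /etc/lvm/lvm.conf. This is a sketch of the commonly recommended values, not an excerpt from manish's configuration:]

  locking_type = 3                 # route LVM locking through clvmd/DLM
  fallback_to_local_locking = 0    # fail hard instead of quietly using local locks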
From neale at sinenomine.net  Fri Oct 3 19:32:34 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Fri, 3 Oct 2014 19:32:34 +0000
Subject: [Linux-cluster] gfs2 resource not mounting
Message-ID: <0339B794-F2BA-4BC7-88BB-E6016B3DD2CB@sinenomine.net>

Using the same two-node configuration I described in an earlier post to this forum, I'm having problems getting a gfs2 resource started on one of the nodes. The resource in question:

 Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime
  Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)
              stop interval=0s timeout=60 (clusterfs-stop-timeout-60)
              monitor interval=10s on-fail=fence (clusterfs-monitor-interval-10s)

pcs status shows:

 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn1.devlab.sinenomine.net ]
     Stopped: [ rh7cn2.devlab.sinenomine.net ]

Failed actions:
    clusterfs_start_0 on rh7cn2.devlab.sinenomine.net 'unknown error' (1): call=46, status=complete, last-rc-change='Fri Oct 3 14:41:26 2014', queued=4702ms, exec=0ms

Using pcs resource debug-start I see:

Operation start for clusterfs:0 (ocf:heartbeat:Filesystem) returned 1
 >  stderr: INFO: Running start for /dev/vg_cluster/ha_lv on /mnt/gfs2-demo
 >  stderr: mount: permission denied
 >  stderr: ERROR: Couldn't mount filesystem /dev/vg_cluster/ha_lv on /mnt/gfs2-demo

The log on the node shows:

Oct 3 14:57:37 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
Oct 3 14:57:38 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
Oct 3 14:57:38 rh7cn2 dlm_controld[5857]: 1564 cpg_dispatch error 9

On the other node:

Oct 3 15:09:47 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 14 done
Oct 3 15:09:48 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 15 done

I'm assuming I didn't define the gfs2 resource such that it could be used concurrently by both nodes. Here's the cib.xml definition for it:

-------------------------------

Unrelated (I believe) to the above, I also note the following messages in /var/log/messages which appear to be related to pacemaker and http (another resource I have defined):

Oct 3 15:05:06 rh7cn2 systemd: pacemaker.service: Got notification message from PID 6036, but reception only permitted for PID 5575

I'm running systemd-208-11.el7_0.2. A bugzilla search matches one report, but the fix was put into -11.

Neale
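[Editorial note: a cloned GFS2 Filesystem resource needs ordering and colocation against clvmd and dlm so the stack starts in the right sequence on every node. A sketch of the usual RHEL 7 constraints, using the resource names from this thread; the interleave options on the clones are assumed to be set already. If these constraints are already in place the failed action above points elsewhere, but it is a cheap thing to rule out:]

  # pcs constraint order start dlm-clone then clvmd-clone
  # pcs constraint colocation add clvmd-clone with dlm-clone
  # pcs constraint order start clvmd-clone then clusterfs-clone
  # pcs constraint colocation add clusterfs-clone with clvmd-clone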
From neale at sinenomine.net  Mon Oct 6 02:30:51 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 6 Oct 2014 02:30:51 +0000
Subject: [Linux-cluster] gfs2 resource not mounting
In-Reply-To: <0339B794-F2BA-4BC7-88BB-E6016B3DD2CB@sinenomine.net>
References: <0339B794-F2BA-4BC7-88BB-E6016B3DD2CB@sinenomine.net>
Message-ID: <5025AC66-8F86-49D0-B223-ACD9B2E428CC@sinenomine.net>

I found the problem. It was a configuration error I made when I modified the gfs2 resource. Everything is working correctly now. If I want to change a setting like the token timeout, do I simply edit corosync.conf and sync the changes, or can I use the pcs cluster setup command to modify an existing configuration?

Neale

On Oct 3, 2014, at 3:32 PM, Neale Ferguson wrote:

> Using the same two-node configuration I described in an earlier post to this forum, I'm having problems getting a gfs2 resource started on one of the nodes. The resource in question:
>
>  Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
>   Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime
>   Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)
>               stop interval=0s timeout=60 (clusterfs-stop-timeout-60)
>               monitor interval=10s on-fail=fence (clusterfs-monitor-interval-10s)

From bubble at hoster-ok.com  Mon Oct 6 05:28:04 2014
From: bubble at hoster-ok.com (Vladislav Bogdanov)
Date: Mon, 06 Oct 2014 08:28:04 +0300
Subject: [Linux-cluster] cLVM unusable on quorated cluster
In-Reply-To: <87a95dfb2k.fsf@hati.baby-gnu.org>
References: <87egupfcg7.fsf@hati.baby-gnu.org> <542EB4D6.4030008@alteeve.ca> <87a95dfb2k.fsf@hati.baby-gnu.org>
Message-ID: <54322864.5040706@hoster-ok.com>

03.10.2014 18:05, Daniel Dehennin wrote:

I'd recommend making sure that:

1. clvmd runs in 'corosync' mode, not 'openais' (controlled by the -I command-line switch), because otherwise it uses the buggy LCK AIS service instead of the well-tested CPG+DLM.

2. You have a recent enough version of lvm2. 2.02.102 should be OK; you need git commit 431eda6 (https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=431eda63cc0ebff7c62dacb313cabcffbda6573a), introduced somewhere between 2.02.99 and 2.02.102. I didn't test that commit with corosync 1, but it should work there as well.

Hope this helps,
Vladislav

> Digimer writes:
>
>> Can you paste your full pacemaker config and the logs from the other
>> nodes starting just before the lost node went away?
>
> Sorry, I forgot to attach it:
>
> Here are the logs on the 3 hypervisors; note that pacemaker does not start at boot time:
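[Editorial note: a quick way to verify which cluster interface a running clvmd is actually using, per Vladislav's point 1. A sketch; init-script integration differs between Debian and RHEL, so where the -I flag gets set varies:]

  # ps -o args= -C clvmd      # shows the running command line and any -I flag
  # clvmd -V                  # reports the clvmd and LVM library versions
  # clvmd -I corosync -d 1    # foreground restart in corosync mode with debug output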
From neale at sinenomine.net  Mon Oct 13 15:20:05 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 15:20:05 +0000
Subject: [Linux-cluster] Permission denied
Message-ID:

I reported last week that I was getting "permission denied" when pcs was starting a gfs2 resource. I thought it was due to the resource being defined incorrectly, but that doesn't appear to be the case. On rare occasions the mount works, but most of the time one node gets it mounted and the other gets denied. I've enabled a number of logging options and done straces on both sides, but I'm not getting anywhere.

My cluster looks like:

# pcs resource show
 Clone Set: dlm-clone [dlm]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Resource Group: apachegroup
     VirtualIP  (ocf::heartbeat:IPaddr2):  Started
     Website    (ocf::heartbeat:apache):   Started
     httplvm    (ocf::heartbeat:LVM):      Started
     http_fs    (ocf::heartbeat:Filesystem):  Started
 Clone Set: clvmd-clone [clvmd]
     Started: [ rh7cn1.devlab.sinenomine.net rh7cn2.devlab.sinenomine.net ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ rh7cn1.devlab.sinenomine.net ]
     Stopped: [ rh7cn2.devlab.sinenomine.net ]

The gfs2 resource is defined:

# pcs resource show clusterfs
 Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/vg_cluster/ha_lv directory=/mnt/gfs2-demo fstype=gfs2 options=noatime
  Operations: start interval=0s timeout=60 (clusterfs-start-timeout-60)
              stop interval=0s timeout=60 (clusterfs-stop-timeout-60)
              monitor interval=10s on-fail=fence (clusterfs-monitor-interval-10s)

When the mount is attempted on node 2 the log contains:

Oct 13 11:10:42 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
Oct 13 11:10:42 rh7cn2 corosync[47978]: [QB ] ipc_setup.c:handle_new_connection:485 IPC credentials authenticated (47978-48271-30)
Oct 13 11:10:42 rh7cn2 corosync[47978]: [QB ] ipc_shm.c:qb_ipcs_shm_connect:294 connecting to client [48271]
Oct 13 11:10:42 rh7cn2 corosync[47978]: [QB ] ringbuffer.c:qb_rb_open_2:236 shm size:1048589; real_size:1052672; rb->word_size:263168
Oct 13 11:10:42 rh7cn2 corosync[47978]: message repeated 2 times: [[QB ] ringbuffer.c:qb_rb_open_2:236 shm size:1048589; real_size:1052672; rb->word_size:263168]
Oct 13 11:10:42 rh7cn2 corosync[47978]: [MAIN ] ipc_glue.c:cs_ipcs_connection_created:272 connection created
Oct 13 11:10:42 rh7cn2 corosync[47978]: [CPG ] cpg.c:cpg_lib_init_fn:1532 lib_init_fn: conn=0x2ab16a953a0, cpd=0x2ab16a95a64
Oct 13 11:10:42 rh7cn2 corosync[47978]: [CPG ] cpg.c:message_handler_req_exec_cpg_procjoin:1349 got procjoin message from cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn2 kernel: GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
Oct 13 11:10:43 rh7cn2 corosync[47978]: [CPG ] cpg.c:message_handler_req_lib_cpg_leave:1617 got leave request on 0x2ab16a953a0
Oct 13 11:10:43 rh7cn2 corosync[47978]: [CPG ] cpg.c:message_handler_req_exec_cpg_procleave:1365 got procleave message from cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn2 corosync[47978]: [CPG ] cpg.c:message_handler_req_lib_cpg_finalize:1655 cpg finalize for conn=0x2ab16a953a0
Oct 13 11:10:43 rh7cn2 dlm_controld[48271]: 251492 cpg_dispatch error 9

Is the "leave request" symptomatic or causal? If the latter, why is it generated?
On the other side:

Oct 13 11:10:41 rh7cn1 corosync[10423]: [QUORUM] vsf_quorum.c:message_handler_req_lib_quorum_getquorate:395 got quorate request on 0x2ab0e33c8b0
Oct 13 11:10:41 rh7cn1 corosync[10423]: [QUORUM] vsf_quorum.c:message_handler_req_lib_quorum_getquorate:395 got quorate request on 0x2ab0e33c8b0
Oct 13 11:10:42 rh7cn1 corosync[10423]: [CPG ] cpg.c:message_handler_req_exec_cpg_procjoin:1349 got procjoin message from cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 6 done
Oct 13 11:10:43 rh7cn1 corosync[10423]: [CPG ] cpg.c:message_handler_req_exec_cpg_procleave:1365 got procleave message from cluster node 0x2 (r(0) ip(172.17.16.148) ) for pid 48271
Oct 13 11:10:43 rh7cn1 kernel: GFS2: fsid=rh7cluster:vol1.0: recover generation 7 done

dlm_tool dump shows:

251469 dlm:ls:vol1 conf 2 1 0 memb 1 2 join 2 left
251469 vol1 add_change cg 6 joined nodeid 2
251469 vol1 add_change cg 6 counts member 2 joined 1 remove 0 failed 0
251469 vol1 stop_kernel cg 6
251469 write "0" to "/sys/kernel/dlm/vol1/control"
251469 vol1 check_ringid done cluster 43280 cpg 1:43280
251469 vol1 check_fencing done
251469 vol1 send_start 1:6 counts 5 2 1 0 0
251469 vol1 receive_start 1:6 len 80
251469 vol1 match_change 1:6 matches cg 6
251469 vol1 wait_messages cg 6 need 1 of 2
251469 vol1 receive_start 2:1 len 80
251469 vol1 match_change 2:1 matches cg 6
251469 vol1 wait_messages cg 6 got all 2
251469 vol1 start_kernel cg 6 member_count 2
251469 dir_member 1
251469 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/vol1/nodes/2"
251469 write "1" to "/sys/kernel/dlm/vol1/control"
251469 vol1 prepare_plocks
251469 vol1 set_plock_data_node from 1 to 1
251469 vol1 send_all_plocks_data 1:6
251469 vol1 send_all_plocks_data 1:6 0 done
251469 vol1 send_plocks_done 1:6 counts 5 2 1 0 0 plocks_data 0
251469 vol1 receive_plocks_done 1:6 flags 2 plocks_data 0 need 0 save 0
251470 dlm:ls:vol1 conf 1 0 1 memb 1 join left 2
251470 vol1 add_change cg 7 remove nodeid 2 reason leave
251470 vol1 add_change cg 7 counts member 1 joined 0 remove 1 failed 0
251470 vol1 stop_kernel cg 7
251470 write "0" to "/sys/kernel/dlm/vol1/control"
251470 vol1 purged 0 plocks for 2
251470 vol1 check_ringid done cluster 43280 cpg 1:43280
251470 vol1 check_fencing done
251470 vol1 send_start 1:7 counts 6 1 0 1 0
251470 vol1 receive_start 1:7 len 76
251470 vol1 match_change 1:7 matches cg 7
251470 vol1 wait_messages cg 7 got all 1
251470 vol1 start_kernel cg 7 member_count 1
251470 dir_member 2
251470 dir_member 1
251470 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/vol1/nodes/2"
251470 write "1" to "/sys/kernel/dlm/vol1/control"
251470 vol1 prepare_plocks

I would appreciate any debugging suggestions. I've straced dlm_controld/corosync but not gained much clarity.

Neale

From neale at sinenomine.net  Mon Oct 13 15:33:57 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 15:33:57 +0000
Subject: [Linux-cluster] Permission denied
Message-ID:

Software levels:

pacemaker-1.1.10-29
pcs-0.9.115-32
dlm-4.0.2-4
corosync-2.3.3-2
lvm2-cluster-2.02.105-14

On 10/13/14, 11:20 AM, "Neale Ferguson" wrote:

>I reported last week that I was getting "permission denied" when pcs was
>starting a gfs2 resource. I thought it was due to the resource being
>defined incorrectly, but that doesn't appear to be the case. On rare
>occasions the mount works, but most of the time one node gets it mounted
>and the other gets denied. I've enabled a number of logging options and
>done straces on both sides, but I'm not getting anywhere.
From emi2fast at gmail.com  Mon Oct 13 15:52:36 2014
From: emi2fast at gmail.com (emmanuel segura)
Date: Mon, 13 Oct 2014 17:52:36 +0200
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References:
Message-ID:

have you configured the fencing?

2014-10-13 17:33 GMT+02:00 Neale Ferguson:
> Software levels:
>
> pacemaker-1.1.10-29
> pcs-0.9.115-32
> dlm-4.0.2-4
> corosync-2.3.3-2
> lvm2-cluster-2.02.105-14
(snip)
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
esta es mi vida e me la vivo hasta que dios quiera

From neale at sinenomine.net  Mon Oct 13 16:05:56 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 16:05:56 +0000
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References:
Message-ID:

Yep:

# pcs stonith show ZVMPOWER
 Resource: ZVMPOWER (class=stonith type=fence_zvm)
  Attributes: ipaddr=VSMREQIU pcmk_host_map=rh7cn1.devlab.sinenomine.net:RH7CN1;rh7cn2.devlab.sinenomine.net:RH7CN2 pcmk_host_list=rh7cn1.devlab.sinenomine.net;rh7cn2.devlab.sinenomine.net pcmk_host_check=static-list
  Operations: monitor interval=60s (ZVMPOWER-monitor-interval-60s)

I've verified its operation by causing fencing to be triggered.

On 10/13/14, 11:52 AM, "emmanuel segura" wrote:

>have you configured the fencing?

From rpeterso at redhat.com  Mon Oct 13 16:16:30 2014
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 13 Oct 2014 12:16:30 -0400 (EDT)
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References:
Message-ID: <1982114005.2362400.1413216990955.JavaMail.zimbra@redhat.com>

----- Original Message -----
> I would appreciate any debugging suggestions. I've straced
> dlm_controld/corosync but not gained much clarity.
>
> Neale

Hi Neale,

1. What does it say if you try to mount the GFS2 file system manually
   rather than from the configured service?
2. After the failure, what does dmesg on all the nodes show?
3. What kernel is this?

I would:
(1) Check to make sure the file system has enough journals for all nodes.
    You can do gfs2_edit -p journals <device>. If your version of gfs2-utils
    doesn't have that option, you can alternately do: gfs2_edit -p jindex <device>
    and see how many journals are in the index.
(2) Check to make sure the locking protocol is lock_dlm in the file system
    superblock. You can get that from gfs2_edit -p sb <device>.
(3) Check to make sure the cluster name in the file system superblock
    matches the configured cluster name. That's also in the superblock.

Regards,

Bob Peterson
Red Hat File Systems
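[Editorial note: concretely, Bob's checks against the device used earlier in this thread would look something like the following. The device path comes from the resource definition above; the commands are as Bob describes them:]

  # gfs2_edit -p jindex /dev/vg_cluster/ha_lv   # count the journalN entries
  # gfs2_edit -p sb /dev/vg_cluster/ha_lv       # shows sb_lockproto and sb_locktable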
From neale at sinenomine.net  Mon Oct 13 16:47:14 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 16:47:14 +0000
Subject: [Linux-cluster] Permission denied
Message-ID:

Thanks Bob, answers inline...

On 10/13/14, 12:16 PM, "Bob Peterson" wrote:

>----- Original Message -----
>> I would appreciate any debugging suggestions. I've straced
>> dlm_controld/corosync but not gained much clarity.
>>
>> Neale
>
>Hi Neale,
>
>1. What does it say if you try to mount the GFS2 file system manually
>   rather than from the configured service?

Permission denied. (I also used resource debug-start and that's the message it gets as well.) I disabled the resource and then tried mounting it manually, and I was successful once but not a second time. As I mentioned, on rare occasions both sides do mount on cluster start-up, which is worse than it never mounting!

>2. After the failure, what does dmesg on all the nodes show?

Node 1 -

[256184.632116] dlm: vol1: dlm_recover 15
[256184.633300] dlm: vol1: add member 2
[256184.636944] dlm: vol1: dlm_recover_members 2 nodes
[256184.664495] dlm: vol1: generation 8 slots 2 1:1 2:2
[256184.664531] dlm: vol1: dlm_recover_directory
[256184.668865] dlm: vol1: dlm_recover_directory 0 in 0 new
[256184.703328] dlm: vol1: dlm_recover_directory 10 out 1 messages
[256184.784404] dlm: vol1: dlm_recover 15 generation 8 done: 120 ms
[256184.785050] GFS2: fsid=rh7cluster:vol1.0: recover generation 8 done
[256185.375091] dlm: vol1: dlm_recover 17
[256185.375655] dlm: vol1: dlm_clear_toss 1 done
[256185.376263] dlm: vol1: remove member 2
[256185.376339] dlm: vol1: dlm_recover_members 1 nodes
[256185.376403] dlm: vol1: generation 9 slots 1 1:1
[256185.376430] dlm: vol1: dlm_recover_directory
[256185.376458] dlm: vol1: dlm_recover_directory 0 in 0 new
[256185.376490] dlm: vol1: dlm_recover_directory 0 out 0 messages
[256185.376638] dlm: vol1: dlm_recover_purge 6 locks for 1 nodes
[256185.376664] dlm: vol1: dlm_recover_masters
[256185.376714] dlm: vol1: dlm_recover_masters 0 of 26
[256185.376746] dlm: vol1: dlm_recover_locks 0 out
[256185.376778] dlm: vol1: dlm_recover_locks 0 in
[256185.376831] dlm: vol1: dlm_recover_rsbs 26 done
[256185.377444] dlm: vol1: dlm_recover 17 generation 9 done: 0 ms
[256185.377833] GFS2: fsid=rh7cluster:vol1.0: recover generation 9 done

Node 2 (failing) -

[256206.973005] GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
[256206.973105] GFS2: fsid=rh7cluster:vol1: In gdlm_mount
[256207.019743] dlm: vol1: joining the lockspace group...
[256207.169061] dlm: vol1: group event done 0 0
[256207.169135] dlm: vol1: dlm_recover 1
[256207.170735] dlm: vol1: add member 2
[256207.170822] dlm: vol1: add member 1
[256207.174493] dlm: vol1: dlm_recover_members 2 nodes
[256207.174798] dlm: vol1: join complete
[256207.205167] dlm: vol1: dlm_recover_directory
[256207.208924] dlm: vol1: dlm_recover_directory 10 in 10 new
[256207.245335] dlm: vol1: dlm_recover_directory 0 out 1 messages
[256207.329101] dlm: vol1: dlm_recover 1 generation 8 done: 120 ms
[256207.851390] GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
[256207.881216] dlm: vol1: leaving the lockspace group...
[256207.947479] dlm: vol1: group event done 0 0
[256207.949530] dlm: vol1: release_lockspace final free

>3. What kernel is this?
>
>I would:
>(1) Check to make sure the file system has enough journals for all nodes.
>    You can do gfs2_edit -p journals <device>. If your version of gfs2-utils
>    doesn't have that option, you can alternately do: gfs2_edit -p jindex <device>
>    and see how many journals are in the index.

3/3 [fc7745eb] 4/21 (0x4/0x15): File    journal0
4/4 [8b70757d] 5/4127 (0x5/0x101f): File    journal1

It was made via:

mkfs.gfs2 -j 2 -J 16 -r 32 -t rh7cluster:vol1 /dev/mapper/vg_cluster-ha_lv

>(2) Check to make sure the locking protocol is lock_dlm in the file system
>    superblock. You can get that from gfs2_edit -p sb <device>.

sb_lockproto          lock_dlm

>(3) Check to make sure the cluster name in the file system superblock
>    matches the configured cluster name. That's also in the superblock.

sb_locktable          rh7cluster:vol1

Strangely, while /etc/corosync/corosync.conf has the cluster name specified, pcs status reports it as blank:

# pcs status
Cluster name:
Last updated: Mon Oct 13 12:40:47 2014

-------------- next part --------------
A non-text attachment was scrubbed...
Name: default.xml
Type: application/xml
Size: 3101 bytes
Desc: default.xml
URL: 
From rpeterso at redhat.com  Mon Oct 13 17:58:46 2014
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 13 Oct 2014 13:58:46 -0400 (EDT)
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References:
Message-ID: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>

----- Original Message -----
(snip)
> >3. What kernel is this?

Make sure both nodes are running the same kernel, at any rate.

> It was made via:
>
> mkfs.gfs2 -j 2 -J 16 -r 32 -t rh7cluster:vol1 /dev/mapper/vg_cluster-ha_lv

Hm. This must be a small SSD device or embedded or something. That's a pretty non-standard journal size (and resource group size). I'm not worried about the resource group size of 32; that shouldn't be an issue. The journal size, on the other hand, is a little concerning.

Can you try with the standard 128MB journal size, just as an experiment, to see if it mounts more consistently or if you get the same error? Maybe GFS2's recovery code is sending an error back for some reason due to its size...

Regards,

Bob Peterson
Red Hat File Systems

From neale at sinenomine.net  Mon Oct 13 18:13:35 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 18:13:35 +0000
Subject: [Linux-cluster] Permission denied
In-Reply-To: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
Message-ID:

On 10/13/14, 1:58 PM, "Bob Peterson" wrote:

>----- Original Message -----
>(snip)
>> >3. What kernel is this?
>
>Make sure both nodes are running the same kernel, at any rate.

Both running 3.10.0-123.8.1.

>> It was made via:
>> mkfs.gfs2 -j 2 -J 16 -r 32 -t rh7cluster:vol1
>> /dev/mapper/vg_cluster-ha_lv
>
>Hm. This must be a small SSD device or embedded or something.
>That's a pretty non-standard journal size (and resource group size).
(snip)
>Can you try with the standard 128MB journal size, just as an experiment,
>to see if it mounts more consistently or if you get the same error?
>Maybe GFS2's recovery code is sending an error back for some reason
>due to its size...

Will do. It's just a demo system to verify the bits and pieces before rolling out something more serious. I did the same with the first cman system I built for RHEL 6, so I just used the same sizes for things.

Thanks again Bob
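[Editorial note: for reference, remaking the filesystem with the default journal size would be something like the line below; mkfs.gfs2 uses 128MB journals when -J is omitted, and of course this destroys the existing contents:]

  # mkfs.gfs2 -j 2 -r 32 -t rh7cluster:vol1 /dev/mapper/vg_cluster-ha_lv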
From neale at sinenomine.net  Mon Oct 13 18:50:55 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 18:50:55 +0000
Subject: [Linux-cluster] Permission denied
In-Reply-To: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
Message-ID:

On 10/13/14, 1:58 PM, "Bob Peterson" wrote:

>----- Original Message -----
>Can you try with the standard 128MB journal size, just as an experiment,
>to see if it mounts more consistently or if you get the same error?
>Maybe GFS2's recovery code is sending an error back for some reason
>due to its size...

Disabled the resource; remade the filesystem; re-enabled the resource. It mounted on both systems. I disabled/enabled again; it only mounted on one. dmesg from the failing node, showing the success followed by the failure:

[  469.968521] GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
[  469.968638] GFS2: fsid=rh7cluster:vol1: In gdlm_mount
[  469.979229] dlm: vol1: joining the lockspace group...
[  470.065511] dlm: vol1: group event done 0 0
[  470.065644] dlm: vol1: dlm_recover 1
[  470.066623] dlm: vol1: add member 2
[  470.066688] dlm: vol1: dlm_recover_members 1 nodes
[  470.066749] dlm: vol1: generation 1 slots 1 1:2
[  470.066787] dlm: vol1: dlm_recover_directory
[  470.066819] dlm: vol1: dlm_recover_directory 0 in 0 new
[  470.066852] dlm: vol1: dlm_recover_directory 0 out 0 messages
[  470.067350] dlm: vol1: dlm_recover 1 generation 1 done: 0 ms
[  470.067674] dlm: vol1: join complete
[  470.282466] dlm: vol1: dlm_recover 3
[  470.283380] dlm: vol1: add member 1
[  470.289840] dlm: vol1: dlm_recover_members 2 nodes
[  470.327670] dlm: vol1: dlm_recover_directory
[  470.330863] dlm: vol1: dlm_recover_directory 0 in 0 new
[  470.406706] dlm: vol1: dlm_recover_directory 1 out 1 messages
[  470.567983] dlm: vol1: dlm_process_requestqueue msg 11 from 1 lkid 1 remid 0 result 0 seq 3
[  470.568520] dlm: vol1: dlm_recover 3 generation 2 done: 240 ms
[  470.578773] GFS2: fsid=rh7cluster:vol1: first mounter control generation 0
[  470.578856] GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
[  470.788200] GFS2: fsid=rh7cluster:vol1.0: jid=0, already locked for use
[  470.788293] GFS2: fsid=rh7cluster:vol1.0: jid=0: Looking at journal...
[  470.843038] GFS2: fsid=rh7cluster:vol1.0: jid=0: Done
[  470.851041] GFS2: fsid=rh7cluster:vol1.0: jid=1: Trying to acquire journal lock...
[  470.858019] GFS2: fsid=rh7cluster:vol1.0: jid=1: Looking at journal...
[  470.953275] GFS2: fsid=rh7cluster:vol1.0: jid=1: Done
[  470.962088] GFS2: fsid=rh7cluster:vol1.0: first mount done, others may mount
[  471.132738] SELinux: initialized (dev dm-5, type gfs2), uses xattr
[  524.435169] dlm: vol1: leaving the lockspace group...
[  524.495477] dlm: vol1: group event done 0 0
[  524.497957] dlm: vol1: release_lockspace final free
[  540.342079] GFS2: fsid=rh7cluster:vol1: Trying to join cluster "lock_dlm", "rh7cluster:vol1"
[  540.342156] GFS2: fsid=rh7cluster:vol1: In gdlm_mount
[  540.361232] dlm: vol1: joining the lockspace group...
[  540.450770] dlm: vol1: group event done 0 0
[  540.450834] dlm: vol1: dlm_recover 1
[  540.451553] dlm: vol1: add member 2
[  540.451975] dlm: vol1: add member 1
[  540.453783] dlm: vol1: dlm_recover_members 2 nodes
[  540.454073] dlm: vol1: join complete
[  540.486970] dlm: vol1: dlm_recover_directory
[  540.489807] dlm: vol1: dlm_recover_directory 1 in 1 new
[  540.516820] dlm: vol1: dlm_recover_directory 0 out 1 messages
[  540.576710] dlm: vol1: dlm_recover 1 generation 2 done: 90 ms
[  541.105327] GFS2: fsid=rh7cluster:vol1: Joined cluster. Now mounting FS...
[  541.202840] dlm: vol1: leaving the lockspace group...
[  541.215728] dlm: vol1: group event done 0 0
[  541.217632] dlm: vol1: release_lockspace final free
From thomasmeier1976 at gmx.de  Mon Oct 13 19:10:27 2014
From: thomasmeier1976 at gmx.de (Thomas Meier)
Date: Mon, 13 Oct 2014 21:10:27 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
Message-ID:

Hi,

When configuring PDU fencing in my 2-node cluster I ran into some problems with the fence_apc_snmp agent. Turning a node off works fine, but fence_apc_snmp then exits with an error.

When I do this manually (from node2):

fence_apc_snmp -a node1 -n 1 -o off

the output of the command is not the expected:

Success: Powered OFF

but in my case:

Returned 2: Error in packet.
Reason: (genError) A general failure occured
Failed object: .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21

When I check the PDU, the port is without power, so this part works. But it seems that the fence agent can't read the status of the PDU and then exits with an error. The same seems to happen when fenced calls the agent: the agent exits with an error, fencing can't succeed, and the cluster hangs.

From the logfile:

fenced[2100]: fence node1 dev 1.0 agent fence_apc_snmp result: error from agent

My setup:

- CentOS 6.5 with fence-agents-3.1.5-35.el6_5.4.x86_64 installed
- APC AP8953 PDU with firmware 6.1
- 2-node cluster based on https://alteeve.ca/w/AN!Cluster_Tutorial_2
- fencing agents in use: fence_ipmilan (working) and fence_apc_snmp

I did some research, and it looks to me like my fence-agents package is too old for my APC firmware.

I've already found the fence-agents repo: https://git.fedorahosted.org/cgit/fence-agents.git/

Here https://git.fedorahosted.org/cgit/fence-agents.git/commit/?id=55ccdd79f530092af06eea5b4ce6a24bd82c0875 it says: "fence_apc_snmp: Add support for firmware 6.x"

I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build of fence_apc_snmp doesn't work.

It gives:

[root at box1]# fence_apc_snmp -v -a node1 -n 1 -o status
Traceback (most recent call last):
  File "/usr/sbin/fence_apc_snmp", line 223, in <module>
    main()
  File "/usr/sbin/fence_apc_snmp", line 197, in main
    options = check_input(device_opt, process_input(device_opt))
  File "/usr/share/fence/fencing.py", line 705, in check_input
    logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
TypeError: __init__() got an unexpected keyword argument 'stream'

I'd really like to see if a patched fence_apc_snmp agent fixes my problem and, if so, install the right version of fence_apc_snmp on the cluster without breaking things, but I'm a bit clueless about how to build a working version.

Maybe you have some tips?

Thanks in advance,
Thomas

From neale at sinenomine.net  Mon Oct 13 21:10:16 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Mon, 13 Oct 2014 21:10:16 +0000
Subject: [Linux-cluster] Permission denied
In-Reply-To:
References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com>
Message-ID:

I put some debug code into the gfs2 module and I see it failing the mount at this point:

	/*
	 * If user space has failed to join the cluster or some similar
	 * failure has occurred, then the journal id will contain a
	 * negative (error) number. This will then be returned to the
	 * caller (of the mount syscall). We do this even for spectator
	 * mounts (which just write a jid of 0 to indicate "ok" even though
	 * the jid is unused in the spectator case)
	 */
	if (sdp->sd_lockstruct.ls_jid < 0) {

Now to find out who's sticking -EPERM into ls_jid.

Neale
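[Editorial note: a lighter-weight alternative to ad-hoc debug code is gfs2's own logging helper, which David Teigland mentions later in the thread. A sketch of the kind of one-liner that could be dropped in just before that check; the placement and the 3.10-era field layout are assumptions:]

	/* hypothetical debug aid, placed just before the ls_jid check;
	 * fs_info() is gfs2's info-level log macro */
	fs_info(sdp, "lock module mount done: ls_jid=%d\n", sdp->sd_lockstruct.ls_jid);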
From kgronlund at suse.com  Tue Oct 14 05:58:19 2014
From: kgronlund at suse.com (Kristoffer Grönlund)
Date: Tue, 14 Oct 2014 07:58:19 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To:
References:
Message-ID: <87oatfdwg4.fsf@krigpad.site>

Thomas Meier writes:

> I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build
> of fence_apc_snmp doesn't work.
[...]
> File "/usr/share/fence/fencing.py", line 705, in check_input
>   logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
> TypeError: __init__() got an unexpected keyword argument 'stream'

Your version of Python is too old. Possibly you have a newer version of Python installed, but by default the older version is used. I think the stream argument was added in Python 2.7.

-- 
// Kristoffer Grönlund
// kgronlund at suse.com

From lists at alteeve.ca  Tue Oct 14 11:01:42 2014
From: lists at alteeve.ca (Digimer)
Date: Tue, 14 Oct 2014 07:01:42 -0400
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To:
References:
Message-ID: <543D0296.8090606@alteeve.ca>

On 13/10/14 03:10 PM, Thomas Meier wrote:
> Hi,
>
> When configuring PDU fencing in my 2-node cluster I ran into some problems with
> the fence_apc_snmp agent. Turning a node off works fine, but
> fence_apc_snmp then exits with an error.
(snip)
> Maybe you have some tips?
>
> Thanks in advance,
> Thomas

Hi Marek et al.,

This is a RHEL 6.5 install, so Kristoffer's comment about needing a newer version of Python is a bit of a concern. Has this been tested on RHEL 6 with an APC with the 6.x firmware?

cheeps

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
From lists at alteeve.ca Tue Oct 14 11:01:42 2014
From: lists at alteeve.ca (Digimer)
Date: Tue, 14 Oct 2014 07:01:42 -0400
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To:
References:
Message-ID: <543D0296.8090606@alteeve.ca>

On 13/10/14 03:10 PM, Thomas Meier wrote:
> Hi
>
> When configuring PDU fencing in my 2-node-cluster I ran into some problems with
> the fence_apc_snmp agent. Turning a node off works fine, but
> fence_apc_snmp then exits with error.
>
> When I do this manually (from node2):
>
> fence_apc_snmp -a node1 -n 1 -o off
>
> the output of the command is not an expected:
>
> Success: Powered OFF
>
> but in my case:
>
> Returned 2: Error in packet.
> Reason: (genError) A general failure occured
> Failed object: .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21
>
> When I check the PDU, the port is without power, so this part works.
> But it seems that the fence agent can't read the status of the PDU
> and then exits with error. The same seems to happen when fenced
> is calling the agent. The agent also exits with an error and fencing can't succeed
> and the cluster hangs.
>
> From the logfile:
>
> fenced[2100]: fence node1 dev 1.0 agent fence_apc_snmp result: error from agent
>
> My Setup: - CentOS 6.5 with fence-agents-3.1.5-35.el6_5.4.x86_64 installed.
> - APC AP8953 PDU with firmware 6.1
> - 2-node-cluster based on https://alteeve.ca/w/AN!Cluster_Tutorial_2
> - fencing agents in use: fence_ipmilan (working) and fence_apc_snmp
>
> I did some recherche, and for me it looks like that my fence-agents package is too old for my APC firmware.
>
> I've already found the fence-agents repo: https://git.fedorahosted.org/cgit/fence-agents.git/
>
> Here https://git.fedorahosted.org/cgit/fence-agents.git/commit/?id=55ccdd79f530092af06eea5b4ce6a24bd82c0875
> it says: "fence_apc_snmp: Add support for firmware 6.x"
>
> I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build
> of fence_apc_snmp doesn't work.
>
> It gives:
>
> [root at box1]# fence_apc_snmp -v -a node1 -n 1 -o status
> Traceback (most recent call last):
>   File "/usr/sbin/fence_apc_snmp", line 223, in <module>
>     main()
>   File "/usr/sbin/fence_apc_snmp", line 197, in main
>     options = check_input(device_opt, process_input(device_opt))
>   File "/usr/share/fence/fencing.py", line 705, in check_input
>     logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
> TypeError: __init__() got an unexpected keyword argument 'stream'
>
> I'd really like to see if a patched fence_apc_snmp agent fixes my problem, and if so,
> install the right version of fence_apc_snmp on the cluster without breaking things,
> but I'm a bit clueless how to build me a working version.
>
> Maybe you have some tips?
>
> Thanks in advance
>
> Thomas

Hi Marek et al.,

This is a RHEL 6.5 install, so Kristoffer's comment about needing a newer
version of python is a bit of a concern. Has this been tested on RHEL 6 with an
APC with the 6.x firmware?

cheers

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From thomasmeier1976 at gmx.de Tue Oct 14 11:04:10 2014
From: thomasmeier1976 at gmx.de (Thomas Meier)
Date: Tue, 14 Oct 2014 13:04:10 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To: <87oatfdwg4.fsf@krigpad.site>
References: , <87oatfdwg4.fsf@krigpad.site>
Message-ID:

Hi

My installed Python is version 2.6.6 (the system Python of RHEL 6).

The latest stable version is fence-agents-4.0.10. The problem is that
fence_apc_snmp from release 4.0.10 does not contain the code for APC firmware
6.x yet, and fence-agents 4.0.11 is not yet released, so it may still have bugs
(or I just don't get it right).

I've tried version 4.0.10, too (untar - autogen.sh - configure - make - make
install), but I don't expect this version to work. It fails like this:

[root at box1 fence-agents-4.0.10]# fence_apc_snmp -v -a 10.124.0.246 -n 1 -o status
DEBUG:root:/usr/bin/snmpwalk -m '' -Oeqn -v '1' -c 'private' '10.124.0.246:161' '.1.3.6.1.2.1.1.2.0'
DEBUG:root:.1.3.6.1.2.1.1.2.0 .1.3.6.1.4.1.318.1.3.4.6
Traceback (most recent call last):
  File "/usr/sbin/fence_apc_snmp", line 209, in <module>
    main()
  File "/usr/sbin/fence_apc_snmp", line 205, in main
    result = fence_action(FencingSnmp(options), options, set_power_status, get_power_status, get_outlets_status)
  File "/usr/share/fence/fencing.py", line 880, in fence_action
    status = get_multi_power_fn(tn, options, get_power_fn)
  File "/usr/share/fence/fencing.py", line 800, in get_multi_power_fn
    plug_status = get_power_fn(tn, options)
  File "/usr/sbin/fence_apc_snmp", line 138, in get_power_status
    apc_resolv_port_id(conn, options)
  File "/usr/sbin/fence_apc_snmp", line 113, in apc_resolv_port_id
    apc_set_device(conn)
  File "/usr/sbin/fence_apc_snmp", line 107, in apc_set_device
    conn.log_command("Trying %s"%(device.ident_str))
AttributeError: FencingSnmp instance has no attribute 'log_command'

Regards
Thomas

Sent: Tuesday, 14 October 2014 at 07:58
From: "Kristoffer Grönlund"
To: "Thomas Meier" , linux-cluster at redhat.com
Subject: Re: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)

Thomas Meier writes:

> I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build
> of fence_apc_snmp doesn't work.

[...]
> File "/usr/share/fence/fencing.py", line 705, in check_input > logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr)) > TypeError: __init__() got an unexpected keyword argument 'stream' Your version of Python is too old. Possibly you have a newer version of python installed, but by default the older version is used. I think the stream argument was added in Python 2.6. -- // Kristoffer Gr?nlund // kgronlund at suse.com From neale at sinenomine.net Tue Oct 14 15:00:57 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 14 Oct 2014 15:00:57 +0000 Subject: [Linux-cluster] Permission denied In-Reply-To: References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com> Message-ID: Following this thread a bit further I find that the jid is set to -1 because the ?our_slot? value being passed to gdlm_recover_done is 0. dlm_recoverd is retrieving the clvmd lockspace and its ls_slot value is 0 (which is the source of our_slot): [73115.541794] name: clvmd [73115.541847] global_id: 4104eefa [73115.541893] node_count: 2 [73115.541937] low node: 1 [73115.541986] slot: 0 (00000000263d5268) [73115.542031] n'slots: 0 dlm_tool ls reports: dlm lockspaces name clvmd id 0x4104eefa flags 0x00000000 change member 2 joined 1 remove 0 failed 0 seq 1,1 members 1 2 Now to determine why ls_slot is 0. Neale On 10/13/14, 5:10 PM, "Neale Ferguson" wrote: >I put some debug code into the gfs2 module and I see it failing the mount >at this point: > >/* > * If user space has failed to join the cluster or some similar > * failure has occurred, then the journal id will contain a > * negative (error) number. This will then be returned to the > * caller (of the mount syscall). We do this even for spectator > * mounts (which just write a jid of 0 to indicate "ok" even >though > * the jid is unused in the spectator case) > */ > if (sdp->sd_lockstruct.ls_jid < 0) { > >Now to find out who?s stick -PERM into ls_jid. > >Neale > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From neale at sinenomine.net Tue Oct 14 19:40:42 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 14 Oct 2014 19:40:42 +0000 Subject: [Linux-cluster] Permission denied In-Reply-To: <20141014192057.GA10594@redhat.com> References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com> <20141014192057.GA10594@redhat.com> Message-ID: Yeah, I noted I was looking at the wrong lockspace. The gfs2 lockspace in this cluster is vol1. Once I corrected at what I was looking at, I think I solved my problem: I believe the problem is an endian thing. In set_rcom_status: rs->rs_flags = cpu_to_le32(flags) However, in receive_rcom_status() flags are checked: if (!(rs->rs_flags & DLM_RSF_NEED_SLOTS)) { But it should be: if (!(le32_to_cpu(rs->rs_flags) & DLM_RSF_NEED_SLOTS)) { I made this change and now the gfs2 volume is being mounted correctly on both nodes. I?ve repeated it a number of times and it?s kept working. Neale On 10/14/14, 3:20 PM, "David Teigland" wrote: >clvmd is a userland lockspace and does not use lockspace_ops or slots/jids >like a gfs2 (kernel) lockspace. > >To debug the dlm/gfs2 control mechanism, which assigns gfs2 a jid based on >dlm slots, enable the fs_info() lines in gfs2/lock_dlm.c. (Make sure that >you're not somehow running gfs_controld on these nodes; we quit using that >in RHEL7.) 
From teigland at redhat.com Tue Oct 14 20:15:05 2014 From: teigland at redhat.com (David Teigland) Date: Tue, 14 Oct 2014 15:15:05 -0500 Subject: [Linux-cluster] [PATCH] dlm: fix missing endian conversion of rcom_status flags In-Reply-To: References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com> <20141014192057.GA10594@redhat.com> Message-ID: <20141014201505.GC10594@redhat.com> The flags are already converted to le when being sent, but are not being converted back to cpu when received. Signed-off-by: Neale Ferguson Signed-off-by: David Teigland --- fs/dlm/rcom.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/dlm/rcom.c b/fs/dlm/rcom.c index 9d61947d473a..f3f5e72a29ba 100644 --- a/fs/dlm/rcom.c +++ b/fs/dlm/rcom.c @@ -206,7 +206,7 @@ static void receive_rcom_status(struct dlm_ls *ls, struct dlm_rcom *rc_in) rs = (struct rcom_status *)rc_in->rc_buf; - if (!(rs->rs_flags & DLM_RSF_NEED_SLOTS)) { + if (!(le32_to_cpu(rs->rs_flags) & DLM_RSF_NEED_SLOTS)) { status = dlm_recover_status(ls); goto do_create; } -- 1.8.3.1 From rpeterso at redhat.com Tue Oct 14 20:22:36 2014 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 14 Oct 2014 16:22:36 -0400 (EDT) Subject: [Linux-cluster] [PATCH] dlm: fix missing endian conversion of rcom_status flags In-Reply-To: <20141014201505.GC10594@redhat.com> References: <565341126.2424782.1413223126854.JavaMail.zimbra@redhat.com> <20141014192057.GA10594@redhat.com> <20141014201505.GC10594@redhat.com> Message-ID: <1398269457.3271283.1413318156147.JavaMail.zimbra@redhat.com> ----- Original Message ----- > The flags are already converted to le when being sent, > but are not being converted back to cpu when received. > > Signed-off-by: Neale Ferguson > Signed-off-by: David Teigland > --- > fs/dlm/rcom.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/dlm/rcom.c b/fs/dlm/rcom.c > index 9d61947d473a..f3f5e72a29ba 100644 > --- a/fs/dlm/rcom.c > +++ b/fs/dlm/rcom.c > @@ -206,7 +206,7 @@ static void receive_rcom_status(struct dlm_ls *ls, struct > dlm_rcom *rc_in) > > rs = (struct rcom_status *)rc_in->rc_buf; > > - if (!(rs->rs_flags & DLM_RSF_NEED_SLOTS)) { > + if (!(le32_to_cpu(rs->rs_flags) & DLM_RSF_NEED_SLOTS)) { > status = dlm_recover_status(ls); > goto do_create; > } > -- > 1.8.3.1 > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hi Dave, Did you mean for this patch to go to cluster-devel? Bob Peterson Red Hat File Systems From mgrac at redhat.com Wed Oct 15 14:12:13 2014 From: mgrac at redhat.com (Marek "marx" Grac) Date: Wed, 15 Oct 2014 16:12:13 +0200 Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x) In-Reply-To: References: Message-ID: <543E80BD.8000705@redhat.com> Hi, On 10/13/2014 09:10 PM, Thomas Meier wrote: > Hi > > When configuring PDU fencing in my 2-node-cluster I ran into some problems with > the fence_apc_snmp agent. Turning a node off works fine, but > fence_apc_snmp then exits with error. > > > > When I do this manually (from node2): > > fence_apc_snmp -a node1 -n 1 -o off > > the output of the command is not an expected: > > Success: Powered OFF > > but in my case: > > Returned 2: Error in packet. > Reason: (genError) A general failure occured > Failed object: .1.3.6.1.4.1.318.1.1.4.4.2.1.3.21 > > > When I check the PDU, the port is without power, so this part works. > But it seems that the fence agent can't read the status of the PDU > and then exits with error. 
The same seems to happen when fenced
> is calling the agent. The agent also exits with an error and fencing can't succeed
> and the cluster hangs.

Yes, this is a known bug, as APC changed an information table in the 6.x
firmware.

> I've already found the fence-agents repo: https://git.fedorahosted.org/cgit/fence-agents.git/
>
> Here https://git.fedorahosted.org/cgit/fence-agents.git/commit/?id=55ccdd79f530092af06eea5b4ce6a24bd82c0875
> it says: "fence_apc_snmp: Add support for firmware 6.x"

yes, this should fix the issue

> I've managed to build fence-agents-4.0.11.tar.gz on a CentOS 6.5 test box, but my build
> of fence_apc_snmp doesn't work.
>
> It gives:
>
> [root at box1]# fence_apc_snmp -v -a node1 -n 1 -o status
> Traceback (most recent call last):
>   File "/usr/sbin/fence_apc_snmp", line 223, in <module>
>     main()
>   File "/usr/sbin/fence_apc_snmp", line 197, in main
>     options = check_input(device_opt, process_input(device_opt))
>   File "/usr/share/fence/fencing.py", line 705, in check_input
>     logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stderr))
> TypeError: __init__() got an unexpected keyword argument 'stream'

Feel free to remove the logging if it does not work. The other option is to
just take the patch from git and backport it. There should be no big
differences (I expect only very minor changes).

> I'd really like to see if a patched fence_apc_snmp agent fixes my problem, and if so,
> install the right version of fence_apc_snmp on the cluster without breaking things,
> but I'm a bit clueless how to build me a working version.

Sure, there will be a new official release for RHEL 6.7 (as 6.6 was released a
few days ago). So until that time, only upstream or patches.

m,

From mgrac at redhat.com Wed Oct 15 14:15:10 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Wed, 15 Oct 2014 16:15:10 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To: <543D0296.8090606@alteeve.ca>
References: <543D0296.8090606@alteeve.ca>
Message-ID: <543E816E.20608@redhat.com>

On 10/14/2014 01:01 PM, Digimer wrote:
>
> Hi Marek et. al.,
>
> This is a RHEL 6.5 install, so Kristoffer's comment about needing a
> newer version of python is a bit of a concern. Has this been tested on
> RHEL 6 with an APC with the 6.x firmware?

The current release does not contain the required patch; it will be in the
next one (or a z-stream if someone requests it). The upstream release works as
expected (retested today) on Fedora 20/RHEL 7. The fact that the upstream
release cannot be run on RHEL 6 is a new issue for me, but we did not try that
before.

m,

From lists at alteeve.ca Wed Oct 15 14:35:30 2014
From: lists at alteeve.ca (Digimer)
Date: Wed, 15 Oct 2014 10:35:30 -0400
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To: <543E816E.20608@redhat.com>
References: <543D0296.8090606@alteeve.ca> <543E816E.20608@redhat.com>
Message-ID: <543E8632.4040206@alteeve.ca>

On 15/10/14 10:15 AM, Marek "marx" Grac wrote:
>
> On 10/14/2014 01:01 PM, Digimer wrote:
>>
>> Hi Marek et. al.,
>>
>> This is a RHEL 6.5 install, so Kristoffer's comment about needing a
>> newer version of python is a bit of a concern. Has this been tested on
>> RHEL 6 with an APC with the 6.x firmware?
>
> Current release do not contain required patch, it will be in next one
> (or z-stream if someone request it). The upstream release work as
> expected (retested today) on Fedora20/RHEL7. Fact that upstream release
> can not be run on RHEL6 is new issue for me but we did not try that before.
>
> m,

Consider it officially requested. We use APC switched PDUs as backup fence
devices extensively, so this would pretty heavily hurt us if we started
getting v6 firmware.

Should I open a RHBZ?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From mgrac at redhat.com Thu Oct 16 07:53:37 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Thu, 16 Oct 2014 09:53:37 +0200
Subject: [Linux-cluster] Fencing issues with fence_apc_snmp (APC Firmware 6.x)
In-Reply-To: <543E8632.4040206@alteeve.ca>
References: <543D0296.8090606@alteeve.ca> <543E816E.20608@redhat.com> <543E8632.4040206@alteeve.ca>
Message-ID: <543F7981.9070901@redhat.com>

Hi,

On 10/15/2014 04:35 PM, Digimer wrote:
> Consider it officially requested. We use APC switched PDUs as backup
> fence devices extensively, so this would pretty heavily hurt us if we
> started getting v6 firmware.

To summarize, support for v6 firmware over SNMP:

* is not in RHEL 6.6
* should be in RHEL 7.1
* should be in RHEL 6.7
* can be part of a z-stream

> Should I open a RHBZ?

The bug is already opened; perhaps it is not cloned everywhere. So only
raising a z-stream request will change something.

m,

From mgrac at redhat.com Thu Oct 16 13:11:04 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Thu, 16 Oct 2014 15:11:04 +0200
Subject: [Linux-cluster] fence-agents-4.0.12 stable release
Message-ID: <543FC3E8.9020409@redhat.com>

Welcome to the fence-agents 4.0.12 release.

This release includes some new features and several bugfixes:

* new up-to-date wiki page with STDIN / command line arguments:
  http://fedorahosted.org/cluster/wiki/FenceArguments
* Fence agent fence_pve now supports --ssl-secure and --ssl-insecure
  (check certificate or not)
* Fence agent for RHEV-M supports cookie-based authentication (--use-cookies)
* improvements in the build system
* Fix issue with regular expression in fence_rsb
* Fix uninitialized EOL in fence_wti

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.12.tar.xz

To report bugs or issues:

https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community? Join us
on IRC (irc.freenode.net #linux-cluster) and share your experience with other
system administrators or power users.

Thanks/congratulations to all the people who contributed to this great
milestone.

m,
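[For the curious, fetching and unpacking the announced release is just (URL
taken from the announcement above):

    curl -O https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.12.tar.xz
    tar xJf fence-agents-4.0.12.tar.xz
]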
From sunhux at gmail.com Wed Oct 22 08:44:33 2014
From: sunhux at gmail.com (Sunhux G)
Date: Wed, 22 Oct 2014 16:44:33 +0800
Subject: [Linux-cluster] Rhel BootLoader, Single-user mode password & Interactive Boot in a Cloud environment
Message-ID:

We run a cloud service & our vCenter is not accessible to our tenants and
their IT support, so I would say console access is not feasible unless the
tenant/customer IT come to our DC.

If the following 3 hardenings are done on our tenant/customer RHEL Linux VMs,
what's the impact on the tenants' sysadmin & IT operations?

a) CIS 1.5.3 Set Boot Loader Password:
   if this password is set, when tenants reboot (shutdown -r) their VM, will
   it prompt for the bootloader password at the console each time? If so, is
   there any way the tenant could still get their VM booted up if they have
   no access to vCenter's console?

b) CIS 1.5.4 Require Authentication for Single-User Mode:
   Does Linux allow ssh access while in single-user mode, & can this
   'single-user mode password' be entered via an ssh session (without access
   to the console), assuming a certain 'terminal' service is started up /
   running while in single-user mode?

c) CIS 1.5.5 Disable Interactive Boot:
   what's the general consensus on this? Disable or enable? Our corporate
   hardening guide does not mention this item. So if the tenant wishes to
   boot up step by step (ie pausing at each startup script), they can't do it?

Feel free to add any other impacts that anyone can think of.

Lastly, how do people out there grant console access to their tenants in a
cloud environment without security compromise (I mean without granting
vCenter access)? I heard that we can customize vCenter to grant tenants
limited access; is this so?

Sun

From lists at alteeve.ca Wed Oct 22 10:46:22 2014
From: lists at alteeve.ca (Digimer)
Date: Wed, 22 Oct 2014 06:46:22 -0400
Subject: [Linux-cluster] Rhel BootLoader, Single-user mode password & Interactive Boot in a Cloud environment
In-Reply-To:
References:
Message-ID: <54478AFE.3030506@alteeve.ca>

On 22/10/14 04:44 AM, Sunhux G wrote:
> We run cloud service & our vCenter is not accessible to our tenants
> and their IT support; so I would say console access is not feasible
> unless the tenant/customer IT come to our DC.
>
> If the following 3 hardenings are done our tenant/customer RHEL
> Linux VM, what's the impact to the tenant's sysadmin & IT operation?
>
> a) CIS 1.5.3 Set Boot Loader Password *:*
> if this password is set, when tenant reboot (shutdown -r)
> their VM each time, will it prompt for the bootloader
> password at console? If so, is there any way the tenant,
> could still get their VM booted up if they have no access
> to vCenter's console?
>
> b) CIS 1.5.4 Require Authentication for Single-User Mode *:*
> Does Linux allow ssh access while in single-user mode &
> can this 'single-user mode password' be entered via an
> ssh session (without access to console), assuming certain
> 'terminal' service is started up / running while in single
> user mode
>
> c) CIS 1.5.5 Disable Interactive Boot *:*
> what's the general consensus on this? Disable or enable?
> Our corporate hardening guide does not mention this item.
> So if the tenant wishes to boot up step by step (ie pausing
> at each startup script), they can't do it?
>
> Feel free to add any other impacts that anyone can think of
>
> Lastly, how do people out there grant console access to their
> tenants in Cloud environment without security compromise
> (I mean without granting vCenter access) : I heard that we can
> customize vCenter to grant limited access of vCenter to
> tenants, is this so?
>
> Sun

Hi Sun,

Did you mean to post this to the vmware mailing list?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From alanoe at linux.vnet.ibm.com Thu Oct 23 13:49:45 2014
From: alanoe at linux.vnet.ibm.com (Alan Evangelista)
Date: Thu, 23 Oct 2014 11:49:45 -0200
Subject: [Linux-cluster] Problems building fence-agents from source
Message-ID: <54490779.9080408@linux.vnet.ibm.com>

Hi.

I'm trying to build fence-agents from source (master branch) on CentOS 6.5.
I already installed the following rpm packages (dependencies): autoconf,
automake, gcc, libtool, nss, nss-devel.
When I tried to run ./autogen.sh, I got: configure.ac:162: error: possibly undefined macro: AC_PYTHON_MODULE If this token and others are legitimate, please use m4_pattern_allow. See the Autoconf documentation. I then run $ autoreconf --install and autogen worked. Then, I have a problem running ./configure: ./configure: line 18284: syntax error near unexpected token `suds,' ./configure: line 18284: `AC_PYTHON_MODULE(suds, 1)' I never had this problem before with earlier fence-agents versions. Am I missing something or is there an issue with upstream code? RPM dependencies versions: autoconf-2.63-5.1.el6.noarch automake-1.11.1-4.el6.noarch libtool-2.2.6-15.5.el6.x86_64 Regards, Alan Evangelista From bmr at redhat.com Thu Oct 23 14:45:56 2014 From: bmr at redhat.com (Bryn M. Reeves) Date: Thu, 23 Oct 2014 15:45:56 +0100 Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: <54490779.9080408@linux.vnet.ibm.com> References: <54490779.9080408@linux.vnet.ibm.com> Message-ID: <20141023144555.GB26744@localhost.localdomain> On Thu, Oct 23, 2014 at 11:49:45AM -0200, Alan Evangelista wrote: > I'm trying to build fence-agents from source (master branch) on CentOS 6.5. > I already installed the following rpm packages (dependencies): autoconf, > automake, gcc, libtool, nss, nss-devel. When I tried to run ./autogen.sh, > I got: You might find it easier to just rebuild the RPMs using either rpmbuild or a tool like mock[1]. > ./configure: line 18284: syntax error near unexpected token `suds,' > ./configure: line 18284: `AC_PYTHON_MODULE(suds, 1)' I'd guess this is because you're missing the python-suds package: # yum list | grep python-suds python-suds.noarch 0.4.1-3.el6 @rhel-x86_64-server-6 python-suds is a Python SOAP client library. If you check the BuildRequires in the fence-agents.spec file you'll see: # Build dependencies BuildRequires: perl python BuildRequires: glibc-devel BuildRequires: nss-devel nspr-devel BuildRequires: libxslt pexpect BuildRequires: python-pycurl BuildRequires: python-suds BuildRequires: automake autoconf pkgconfig libtool BuildRequires: net-snmp-utils perl-Net-Telnet > I never had this problem before with earlier fence-agents versions. > Am I missing something or is there an issue with upstream code? I'd guess it's a required dependency for the fence_vmware_soap agent. The BuildRequires and f_v_s scripts were added in 3.1.4-1.el6 back in 2011: * Tue Jun 7 2011 Fabio M. Di Nitto - 3.1.4-1 - Rebase package on top of new upstream - spec file update: * update spec file copyright date * update upstream URL * drop all patches * update list of fence_agents (ibmblade listed twice, bladecenter_snmp deprecated) * drop libxml2-devel libvirt-devel clusterlib-devel corosynclib-devel and openaislib-devel from BuildRequires * make ready to enable fence_vmware_soap * update and clean configure and build section. * create bladecenter_snmp compat symlink at rpm install time * update file list to include scsi_check script Regards, Bryn. 
[1] http://fedoraproject.org/wiki/Projects/Mock From bcodding at redhat.com Thu Oct 23 14:46:49 2014 From: bcodding at redhat.com (Benjamin Coddington) Date: Thu, 23 Oct 2014 10:46:49 -0400 (EDT) Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: <54490779.9080408@linux.vnet.ibm.com> References: <54490779.9080408@linux.vnet.ibm.com> Message-ID: Hi Alan, I don't know how well the upstream fence-agents will work or build on CentOS 6.5, but I can tell you that the way to resolve this particular problem would be to find the m4 for AC_PYTHON_MODULE and drop it in your build's m4/ directory.. Ben On Thu, 23 Oct 2014, Alan Evangelista wrote: > Hi. > > I'm trying to build fence-agents from source (master branch) on CentOS 6.5. > I already installed the following rpm packages (dependencies): autoconf, > automake, gcc, libtool, nss, nss-devel. When I tried to run ./autogen.sh, > I got: > > configure.ac:162: error: possibly undefined macro: AC_PYTHON_MODULE > If this token and others are legitimate, please use m4_pattern_allow. > See the Autoconf documentation. > > > I then run > > $ autoreconf --install > > and autogen worked. Then, I have a problem running ./configure: > > ./configure: line 18284: syntax error near unexpected token `suds,' > ./configure: line 18284: `AC_PYTHON_MODULE(suds, 1)' > > I never had this problem before with earlier fence-agents versions. > Am I missing something or is there an issue with upstream code? > > > RPM dependencies versions: > autoconf-2.63-5.1.el6.noarch > automake-1.11.1-4.el6.noarch > libtool-2.2.6-15.5.el6.x86_64 > > > Regards, > Alan Evangelista > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From alanoe at linux.vnet.ibm.com Thu Oct 23 15:03:17 2014 From: alanoe at linux.vnet.ibm.com (Alan Evangelista) Date: Thu, 23 Oct 2014 13:03:17 -0200 Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: <20141023144555.GB26744@localhost.localdomain> References: <54490779.9080408@linux.vnet.ibm.com> <20141023144555.GB26744@localhost.localdomain> Message-ID: <544918B5.50902@linux.vnet.ibm.com> On 10/23/2014 12:45 PM, Bryn M. Reeves wrote: >> ./configure: line 18284: syntax error near unexpected token `suds,' >> ./configure: line 18284: `AC_PYTHON_MODULE(suds, 1)' > I'd guess this is because you're missing the python-suds package: > > # yum list | grep python-suds > python-suds.noarch 0.4.1-3.el6 @rhel-x86_64-server-6 No, it is not, I already installed that rpm package. $ rpm -qa | grep suds python-suds-0.4.1-3.el6.noarch > >> I never had this problem before with earlier fence-agents versions. >> Am I missing something or is there an issue with upstream code? I forgot to mention, previous installations were done in RHEL 6.5. Maybe fence-agents does not work out of the box in CentOS 6.5. 
Regards,
Alan Evangelista

From alanoe at linux.vnet.ibm.com Thu Oct 23 15:28:55 2014
From: alanoe at linux.vnet.ibm.com (Alan Evangelista)
Date: Thu, 23 Oct 2014 13:28:55 -0200
Subject: [Linux-cluster] Problems building fence-agents from source
In-Reply-To:
References: <54490779.9080408@linux.vnet.ibm.com>
Message-ID: <54491EB7.40300@linux.vnet.ibm.com>

On 10/23/2014 12:46 PM, Benjamin Coddington wrote:
> Hi Alan,
>
> I don't know how well the upstream fence-agents will work or build on
> CentOS 6.5, but I can tell you that the way to resolve this particular
> problem would be to find the m4 for AC_PYTHON_MODULE and drop it in
> your build's m4/ directory..

I already see the m4 file in make/ac_python_module.m4. Copying/moving the
file to the m4 directory didn't solve the problem.

FYI, this macro was introduced in a patch today (commit
5a87866c70e3dc77798d3e6fd77e2607757d26b5). Maybe the macro is broken?

AC_DEFUN([AC_PYTHON_MODULE],[
    AC_MSG_CHECKING(python module: $1)
    python -c "import $1" 2>/dev/null
    if test $? -eq 0;
    then
        AC_MSG_RESULT(yes)
        eval AS_TR_CPP(HAVE_PYMOD_$1)=yes
    else
        AC_MSG_RESULT(no)
        eval AS_TR_CPP(HAVE_PYMOD_$1)=no
        #
        if test -n "$2"
        then
            AC_MSG_ERROR(failed to find required module $1)
            exit 1
        fi
    fi
])

Regards,
Alan Evangelista

From bcodding at redhat.com Thu Oct 23 15:45:07 2014
From: bcodding at redhat.com (Benjamin Coddington)
Date: Thu, 23 Oct 2014 11:45:07 -0400 (EDT)
Subject: [Linux-cluster] Problems building fence-agents from source
In-Reply-To: <54491EB7.40300@linux.vnet.ibm.com>
References: <54490779.9080408@linux.vnet.ibm.com> <54491EB7.40300@linux.vnet.ibm.com>
Message-ID:

On Thu, 23 Oct 2014, Alan Evangelista wrote:

> On 10/23/2014 12:46 PM, Benjamin Coddington wrote:
>> Hi Alan,
>>
>> I don't know how well the upstream fence-agents will work or build on
>> CentOS 6.5, but I can tell you that the way to resolve this particular
>> problem would be to find the m4 for AC_PYTHON_MODULE and drop it in your
>> build's m4/ directory..
>
> I already see the m4 file in make/ac_python_module.m4. Copying/moving
> file to m4 directory didnt solve the problem.
>
> FYI this macro was introduced in a patch today (commit
> 5a87866c70e3dc77798d3e6fd77e2607757d26b5).
> Maybe the macro is broken?
>
> AC_DEFUN([AC_PYTHON_MODULE],[
> AC_MSG_CHECKING(python module: $1)
> python -c "import $1" 2>/dev/null
> if test $? -eq 0;
> then
> AC_MSG_RESULT(yes)
> eval AS_TR_CPP(HAVE_PYMOD_$1)=yes
> else
> AC_MSG_RESULT(no)
> eval AS_TR_CPP(HAVE_PYMOD_$1)=no
> #
> if test -n "$2"
> then
> AC_MSG_ERROR(failed to find required module $1)
> exit 1
> fi
> fi
> ])

Ah, looking at the second portion of your original error report now.. it
looks like you have AC rules in your configure script. That indicates
that configure wasn't correctly created.. delete your configure script
and run autogen.sh again now that you have ac_python_module.m4 in the
AC_CONFIG_MACRO_DIR (which is m4/).
Ben From alanoe at linux.vnet.ibm.com Thu Oct 23 15:55:21 2014 From: alanoe at linux.vnet.ibm.com (Alan Evangelista) Date: Thu, 23 Oct 2014 13:55:21 -0200 Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: References: <54490779.9080408@linux.vnet.ibm.com> <54491EB7.40300@linux.vnet.ibm.com> Message-ID: <544924E9.9060306@linux.vnet.ibm.com> On 10/23/2014 01:45 PM, Benjamin Coddington wrote: > > > On Thu, 23 Oct 2014, Alan Evangelista wrote: > >> On 10/23/2014 12:46 PM, Benjamin Coddington wrote: >>> Hi Alan, >>> >>> I don't know how well the upstream fence-agents will work or build on >>> CentOS 6.5, but I can tell you that the way to resolve this particular >>> problem would be to find the m4 for AC_PYTHON_MODULE and drop it in >>> your >>> build's m4/ directory.. >> >> I already see the m4 file in make/ac_python_module.m4. Copying/moving >> file to m4 directory didnt solve the problem. >> >> FYI this macro was introduced in a patch today (commit >> 5a87866c70e3dc77798d3e6fd77e2607757d26b5). >> Maybe the macro is broken? >> >> AC_DEFUN([AC_PYTHON_MODULE],[ >> AC_MSG_CHECKING(python module: $1) >> python -c "import $1" 2>/dev/null >> if test $? -eq 0; >> then >> AC_MSG_RESULT(yes) >> eval AS_TR_CPP(HAVE_PYMOD_$1)=yes >> else >> AC_MSG_RESULT(no) >> eval AS_TR_CPP(HAVE_PYMOD_$1)=no >> # >> if test -n "$2" >> then >> AC_MSG_ERROR(failed to find required module $1) >> exit 1 >> fi >> fi >> ] ) > > Ah, looking at the second portion of your original error report now.. it > looks like you have AC rules in your configure script. That indicates > that configure wasn't correctly created.. delete your configure script > and run autogen.sh again now that you have ac_python_module.m4 in the > AC_CONFIG_MACRO_DIR (which is m4/). That worked. I didn't know I had to run ./autogen.sh again after moving the m4 file to the correct directory. I'll send an email in fence-agents-devel about the incorrect directory of the new m4 file added today. Thanks for the help! Regards, Alan Evangelista From bmr at redhat.com Thu Oct 23 15:59:04 2014 From: bmr at redhat.com (Bryn M. Reeves) Date: Thu, 23 Oct 2014 16:59:04 +0100 Subject: [Linux-cluster] Problems building fence-agents from source In-Reply-To: <544918B5.50902@linux.vnet.ibm.com> References: <54490779.9080408@linux.vnet.ibm.com> <20141023144555.GB26744@localhost.localdomain> <544918B5.50902@linux.vnet.ibm.com> Message-ID: <20141023155904.GI26744@localhost.localdomain> On Thu, Oct 23, 2014 at 01:03:17PM -0200, Alan Evangelista wrote: > >>I never had this problem before with earlier fence-agents versions. > >>Am I missing something or is there an issue with upstream code? > > I forgot to mention, previous installations were done in RHEL 6.5. Maybe > fence-agents does > not work out of the box in CentOS 6.5. > What version of the package are you actually trying to build? If it's the native RHEL-6.5 package then I would expect that to build out-of-the box on unmodified CentOS 6.5. If it is some later upstream version then you may find there are considerable changes in dependencies needed to build and those may require a large number of package updates to enable building the later version (which you'd need to also build from source). The autoconf problems you're hitting make it sound like that may be the case (although you could encounter similar problems if e.g. the package is from 6.6 beta and there was also an updated autotools in that release). Regards, Bryn. 
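[Spelled out, the clean re-bootstrap Benjamin describes looks roughly like
this -- a sketch; the cp line assumes you are carrying the macro in make/, as
discussed above:

    cd fence-agents
    cp make/ac_python_module.m4 m4/   # put the macro where AC_CONFIG_MACRO_DIR looks
    rm -f configure                   # drop the stale, half-generated script
    ./autogen.sh                      # regenerates configure with the macro visible
    ./configure
]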
From mgrac at redhat.com Mon Oct 27 15:57:37 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Mon, 27 Oct 2014 16:57:37 +0100
Subject: [Linux-cluster] Problems building fence-agents from source
In-Reply-To:
References: <54490779.9080408@linux.vnet.ibm.com> <54491EB7.40300@linux.vnet.ibm.com>
Message-ID: <544E6B71.1020300@redhat.com>

On 10/23/2014 05:45 PM, Benjamin Coddington wrote:
>
> On Thu, 23 Oct 2014, Alan Evangelista wrote:
>
>> On 10/23/2014 12:46 PM, Benjamin Coddington wrote:
>>> Hi Alan,
>>>
>>> I don't know how well the upstream fence-agents will work or build on
>>> CentOS 6.5, but I can tell you that the way to resolve this particular
>>> problem would be to find the m4 for AC_PYTHON_MODULE and drop it in your
>>> build's m4/ directory..
>>
>> I already see the m4 file in make/ac_python_module.m4. Copying/moving
>> file to m4 directory didnt solve the problem.
>>
>> FYI this macro was introduced in a patch today (commit
>> 5a87866c70e3dc77798d3e6fd77e2607757d26b5).
>> Maybe the macro is broken?
>>
>> AC_DEFUN([AC_PYTHON_MODULE],[
>> AC_MSG_CHECKING(python module: $1)
>> python -c "import $1" 2>/dev/null
>> if test $? -eq 0;
>> then
>> AC_MSG_RESULT(yes)
>> eval AS_TR_CPP(HAVE_PYMOD_$1)=yes
>> else
>> AC_MSG_RESULT(no)
>> eval AS_TR_CPP(HAVE_PYMOD_$1)=no
>> #
>> if test -n "$2"
>> then
>> AC_MSG_ERROR(failed to find required module $1)
>> exit 1
>> fi
>> fi
>> ])
>
> Ah, looking at the second portion of your original error report now.. it
> looks like you have AC rules in your configure script. That indicates
> that configure wasn't correctly created.. delete your configure script
> and run autogen.sh again now that you have ac_python_module.m4 in the
> AC_CONFIG_MACRO_DIR (which is m4/).

autogen.sh was modified so that it also uses (-I make) to pick up macros from
the make/ directory, because the m4/ directory is only created after
autogen.sh has been run.

m,

From mgrac at redhat.com Wed Oct 29 08:37:31 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Wed, 29 Oct 2014 09:37:31 +0100
Subject: [Linux-cluster] How we conform OCF in fence agents and what to do with it
Message-ID: <5450A74B.20701@redhat.com>

Hi,

I took a look at the OCF specification for resource agents from
https://github.com/ClusterLabs/OCF-spec

I rewrote it from DTD to Relax NG (XML form) and attempted to modify it until
it accepted the current resource agents. These changes are put up for
discussion, and I will mark those that are important for fence agents with an
asterisk.

resource-agent is the root element.

1*) new actions required: on, off, reboot, monitor, list, metadata

2) "timeout" for service should be only optional?

3) I don't understand the element "version" directly under the root element,
as it also has an attribute "version"

4) we have added the elements "vendor-url" and "longdesc" directly under the
root element. This is inconsistent with "shortdesc", which is an attribute,
but a long description really should not be an attribute.

5) we have added the attribute "automatic" to actions (e.g. fence_scsi)

6) our parameters use only "shortdesc", so perhaps "longdesc" can be optional

7*) a "getopt" element for parameters and how they can be called from the
command line (used for man page generation)

8) add a "required" attribute for each parameter

9) add a "default" value for the content element

10) make the content element optional. what should be inside?
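[To make the list above concrete, this is roughly the metadata shape the
points are describing -- a hand-written sketch following the thread, not a
ratified OCF schema; the agent name and parameter are made up:

    <resource-agent name="fence_example" shortdesc="Example fence agent">
      <longdesc>Fence agent for an imaginary power switch.</longdesc>
      <vendor-url>http://www.example.com</vendor-url>
      <parameters>
        <parameter name="ipaddr" unique="1" required="1">
          <getopt mixed="-a, --ip=[ip]"/>
          <content type="string"/>
          <shortdesc lang="en">IP address or hostname of the device</shortdesc>
        </parameter>
      </parameters>
      <actions>
        <action name="on" automatic="0"/>
        <action name="off"/>
        <action name="reboot"/>
        <action name="monitor"/>
        <action name="list"/>
        <action name="metadata"/>
      </actions>
    </resource-agent>
]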
11) the root element does not have only longdesc but also shortdesc (single-line)

m,

From andrew at beekhof.net Wed Oct 29 09:46:26 2014
From: andrew at beekhof.net (Andrew Beekhof)
Date: Wed, 29 Oct 2014 20:46:26 +1100
Subject: [Linux-cluster] How we conform OCF in fence agents and what to do with it
In-Reply-To: <5450A74B.20701@redhat.com>
References: <5450A74B.20701@redhat.com>
Message-ID:

> On 29 Oct 2014, at 7:37 pm, Marek marx Grac wrote:
>
> Hi,
>
> I took a look at the OCF specification for resource agents from
> https://github.com/ClusterLabs/OCF-spec
>
> I rewrote it from DTD to Relax NG

Please don't.
It's hard enough getting any change in, let alone coupling it with a
translation to another format.

Please just leave it as DTD for now.
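[For readers who have not met both schema languages, the same constraint
written both ways -- fragments only, illustrative:

    <!-- DTD: the form Andrew wants to keep -->
    <!ELEMENT resource-agent (longdesc?, vendor-url?)>
    <!ATTLIST resource-agent name CDATA #REQUIRED>

    # Relax NG compact syntax: the form Marek experimented with
    element resource-agent {
      attribute name { text },
      element longdesc { text }?,
      element vendor-url { text }?
    }
]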
URL: From andrew at beekhof.net Wed Oct 29 21:42:06 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 30 Oct 2014 08:42:06 +1100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: > On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: > > Hi All, > > In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > > Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? > > > Thanks > Lax > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From lkota at cisco.com Wed Oct 29 22:06:43 2014 From: lkota at cisco.com (Lax Kota (lkota)) Date: Wed, 29 Oct 2014 22:06:43 +0000 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: > I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. Thanks Lax -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof Sent: Wednesday, October 29, 2014 2:42 PM To: linux clustering Subject: Re: [Linux-cluster] daemon cpg_join error retrying > On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: > > Hi All, > > In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > > Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? 
> > > Thanks > Lax > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From andrew at beekhof.net Wed Oct 29 22:16:35 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 30 Oct 2014 09:16:35 +1100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: > On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: > >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. I don't really recall. Hopefully someone more familiar with GFS2 can chime in. > > Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) > > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. It does not sound like your network is particularly healthy. Are you using multicast or udpu? If multicast, it might be worth trying udpu > > Thanks > Lax > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Wednesday, October 29, 2014 2:42 PM > To: linux clustering > Subject: Re: [Linux-cluster] daemon cpg_join error retrying > > >> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >> >> Hi All, >> >> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. > > I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > >> >> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? 
>> >> >> Thanks >> Lax >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From lists at alteeve.ca Wed Oct 29 22:29:38 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 29 Oct 2014 18:29:38 -0400 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: <54516A52.5020901@alteeve.ca> On 29/10/14 06:16 PM, Andrew Beekhof wrote: > >> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >> >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. > > I don't really recall. Hopefully someone more familiar with GFS2 can chime in. # gfs2_tool sb /dev/c01n01_vg0/shared table current lock table name = "an-cluster-01:shared" Replace with your device, of course. :) > >> >> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >> >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > > It does not sound like your network is particularly healthy. > Are you using multicast or udpu? If multicast, it might be worth trying udpu Agreed. Persistent multicast required? >> Thanks >> Lax >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Wednesday, October 29, 2014 2:42 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >> >> >>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>> >>> Hi All, >>> >>> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >> >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> >>> >>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? 
>>> >>> >>> Thanks >>> Lax >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From lkota at cisco.com Wed Oct 29 22:32:28 2014 From: lkota at cisco.com (Lax Kota (lkota)) Date: Wed, 29 Oct 2014 22:32:28 +0000 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. > I don't really recall. Hopefully someone more familiar with GFS2 can chime in. Ok. >> >> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >> >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >It does not sound like your network is particularly healthy. >Are you using multicast or udpu? If multicast, it might be worth trying udpu I am using udpu and I also have firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM envirornment and even if I switch to other node within same VM I keep getting same failure. Thanks Lax -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof Sent: Wednesday, October 29, 2014 3:17 PM To: linux clustering Subject: Re: [Linux-cluster] daemon cpg_join error retrying > On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: > >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. I don't really recall. Hopefully someone more familiar with GFS2 can chime in. 
> > Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) > > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. It does not sound like your network is particularly healthy. Are you using multicast or udpu? If multicast, it might be worth trying udpu > > Thanks > Lax > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Wednesday, October 29, 2014 2:42 PM > To: linux clustering > Subject: Re: [Linux-cluster] daemon cpg_join error retrying > > >> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >> >> Hi All, >> >> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. > > I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. > >> >> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >> >> >> Thanks >> Lax >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From andrew at beekhof.net Wed Oct 29 22:38:00 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 30 Oct 2014 09:38:00 +1100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> > On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote: > > >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. > >> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. > Ok. 
> >>> >>> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >>> >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > >> It does not sound like your network is particularly healthy. >> Are you using multicast or udpu? If multicast, it might be worth trying udpu > > I am using udpu and I also have firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM envirornment and even if I switch to other node within same VM I keep getting same failure. Depending on what the host and VMs are doing, that might be your problem. In any case, I will defer to the corosync guys at this point. > > Thanks > Lax > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Wednesday, October 29, 2014 3:17 PM > To: linux clustering > Subject: Re: [Linux-cluster] daemon cpg_join error retrying > > >> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >> >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. > > I don't really recall. Hopefully someone more familiar with GFS2 can chime in. > >> >> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >> >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. 
>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. > > It does not sound like your network is particularly healthy. > Are you using multicast or udpu? If multicast, it might be worth trying udpu > >> >> Thanks >> Lax >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Wednesday, October 29, 2014 2:42 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >> >> >>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>> >>> Hi All, >>> >>> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >> >> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >> >>> >>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >>> >>> >>> Thanks >>> Lax >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jfriesse at redhat.com Thu Oct 30 08:23:29 2014 From: jfriesse at redhat.com (Jan Friesse) Date: Thu, 30 Oct 2014 09:23:29 +0100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> Message-ID: <5451F581.5050100@redhat.com> > >> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote: >> >> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >> >>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >> Ok. >> >>>> >>>> Also one more issue I am seeing in one other setup a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages for every 4secs. I am running with default TOTEM settings with token time out as 10 secs. Even after I increase the token, consensus values to be higher. It goes on flooding the same message after newer consensus defined time (eg: if I increase it to be 10secs, then I see new membership formed messages for every 10secs) >>>> >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. 
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>
>>> It does not sound like your network is particularly healthy.
>>> Are you using multicast or udpu? If multicast, it might be worth trying udpu
>>
>> I am using udpu and I also have the firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM environment and even if I switch to the other node within the same VM I keep getting the same failure.
>
> Depending on what the host and VMs are doing, that might be your problem.
> In any case, I will defer to the corosync guys at this point.

Lax, the usual reasons for this problem are:

1. The MTU is too high and fragmented packets are not enabled (take a look at the netmtu configuration option).

2. The config file on the nodes is not in sync, and one node may contain more node entries than the others (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster).

3. The firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.

I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.

Regards,
Honza
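(A note for readers working through Honza's list: points 1 and 3 translate roughly into the sketch below. The numbers and the peer address are illustrative, not taken from this cluster; netmtu, token and consensus are documented totem options in corosync.conf(5), and consensus must stay larger than token.)

totem {
        version: 2
        token: 10000      # token timeout in ms, the value mentioned in this thread
        consensus: 12000  # must be greater than token; corosync defaults it to 1.2 * token
        netmtu: 1472      # lower this if something in the path drops full 1500-byte frames
}

# Quick checks on each node (RHEL 6 style; the peer address is only an example):
ping -M do -s 1472 172.28.0.65   # DF bit set; fails if the path MTU is below 1500
service iptables stop            # testing only, per the recommendation above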
>> >> Thanks >> Lax >> >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com >> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Wednesday, October 29, 2014 3:17 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >> >> >>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >>> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >> >> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >> >>> >>> Also one more issue I am seeing in one other setup a repeated flood >>> of 'A processor joined or left the membership and a new membership >>> was formed' messages for every 4secs. I am running with default >>> TOTEM settings with token time out as 10 secs. Even after I increase >>> the token, consensus values to be higher. It goes on flooding the >>> same message after newer consensus defined time (eg: if I increase >>> it to be 10secs, then I see new membership formed messages for every >>> 10secs) >>> >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> It does not sound like your network is particularly healthy. >> Are you using multicast or udpu? If multicast, it might be worth trying udpu >> >>> >>> Thanks >>> Lax >>> >>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>> Sent: Wednesday, October 29, 2014 2:42 PM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >>> >>> >>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>>> >>>> Hi All, >>>> >>>> In one of my setup, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >>> >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> >>>> >>>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >>>> >>>> >>>> Thanks >>>> Lax >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster

From mgrac at redhat.com Thu Oct 30 15:40:05 2014 From: mgrac at redhat.com (Marek "marx" Grac) Date: Thu, 30 Oct 2014 16:40:05 +0100 Subject: [Linux-cluster] Building upstream fence agents on RHEL/CentOS 6 Message-ID: <54525BD5.9060409@redhat.com>

Hi,

After a small investigation on RHEL 6.6 with fence agents from upstream (latest git):

Summary: Yes, it should work.

Details:

* it is required to fix the auto* stuff, as Alan found; the fix will very likely be in the next release:
  change ACLOCAL_AMFLAGS from -I m4 to -I make
  change AC_CONFIG_MACRO_DIR from m4 to make
* a) fence_vmware_soap requires the package python-requests (+deps), available only in EPEL
  b) or ignore fence_vmware_soap (fix: from configure.ac remove AC_PYTHON_MODULE(requests, 1))
* in lib/fencing.py.py replace 'stream=sys.stderr' with 'sys.stderr' (one occurrence)
* standard ./autogen.sh; ./configure; make

m,
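(Condensed into commands, the recipe above might look like this sketch, run from a fence-agents git checkout on RHEL/CentOS 6. The sed patterns and paths are illustrative assumptions; check the actual tree before running them.)

sed -i 's/-I m4/-I make/' Makefile.am                                    # ACLOCAL_AMFLAGS
sed -i 's/AC_CONFIG_MACRO_DIR(\[m4\])/AC_CONFIG_MACRO_DIR([make])/' configure.ac
yum install python-requests                                              # from EPEL, for fence_vmware_soap
# ...or ignore fence_vmware_soap and drop the check instead:
# sed -i '/AC_PYTHON_MODULE(requests, 1)/d' configure.ac
sed -i 's/stream=sys.stderr/sys.stderr/' lib/fencing.py.py               # one occurrence
./autogen.sh && ./configure && make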
From lkota at cisco.com Thu Oct 30 17:46:30 2014 From: lkota at cisco.com (Lax Kota (lkota)) Date: Thu, 30 Oct 2014 17:46:30 +0000 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: <5451F581.5050100@redhat.com> References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> <5451F581.5050100@redhat.com> Message-ID:

Thanks Honza. Here is what I was doing,

> the usual reasons for this problem are:
> 1. The MTU is too high and fragmented packets are not enabled (take a look at the netmtu configuration option).

I am running with default mtu settings, which is 1500. And I do see my interface (eth1) on the box has an MTU of 1500 too.

> 2. The config file on the nodes is not in sync, and one node may contain more node entries than the others (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster).
> 3. The firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.

Verified my config files cluster.conf and cib.xml, and both have the same number of node entries (2).

> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.

I also ran tests with the firewall off on both participating nodes, and I still see the same issue.

In the corosync log I see the following set of messages repeated; hoping these will give some more pointers.

Oct 29 22:11:02 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01) Oct 29 22:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service. Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0 Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11. Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10. Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0. Oct 29 22:11:05 corosync [TOTEM ] got commit token Oct 29 22:11:05 corosync [TOTEM ] Saving state aru 1b high seq received 1b Oct 29 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708 Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state. Oct 29 22:11:05 corosync [TOTEM ] got commit token Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state. Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64: Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65: Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64: Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64 Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1 Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65: Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64 Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1 Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery.
Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 Oct 29 22:11:05 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Oct 29 22:11:05 corosync [TOTEM ] Resetting old ring state Oct 29 22:11:05 corosync [TOTEM ] recovery to regular 1-0 Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 1, seq=333576 Oct 29 22:11:05 corosync [TOTEM ] waiting_trans_ack changed to 1 Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 0, seq=333576 Oct 29 22:11:05 corosync [CMAN ] ais: last memb_count = 2, current = 2 Oct 29 22:11:05 corosync [CMAN ] memb: sending TRANSITION message. cluster_name = vsomcluster Oct 29 22:11:05 corosync [CMAN ] ais: comms send message 0x7fff8185ca00 len = 65 Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 24 Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 34 Oct 29 22:11:05 corosync [SYNC ] This node is within the primary component and will provide service. Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state. Oct 29 22:11:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=81, endian_conv=0 Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5 Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 2 Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0 Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=2, incarnation = 333576 Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 1, len=81, endian_conv=0 Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5 Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 1 Oct 29 22:11:05 corosync [CMAN ] Completed first transition with nodes on the same config versions Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0 Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=1, incarnation = 333576 Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. 
Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy CLM service) Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy CLM service) Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy AMF service) Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy AMF service) Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (openais checkpoint service B.01.01) Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (openais checkpoint service B.01.01) Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy EVT service) Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. 
Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy EVT service) Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (corosync cluster closed process group service v1.01) Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 1 Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.65) ; members(old:2 left:0) Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) Oct 29 22:11:05 corosync [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 2 Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed Oct 29 22:11:05 corosync [CPG ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(172.28.0.65) , pid:9198 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(172.28.0.65) , pid:9196 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(172.28.0.65) , pid:9194 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(172.28.0.65) , pid:9193 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[4] group:pcmk\x00, ip:r(0) ip(172.28.0.65) , pid:9187 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[5] group:gfs:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9111 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[6] group:dlm:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9057 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[7] group:fenced:default\x00, ip:r(0) ip(172.28.0.65) , pid:9040 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[8] group:fenced:daemon\x00, ip:r(0) ip(172.28.0.65) , pid:9040 Oct 29 22:11:05 corosync [CPG ] joinlist_messages[9] group:crmd\x00, ip:r(0) ip(172.28.0.64) , pid:14530 Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01) Oct 29 22:11:05 corosync [MAIN ] Completed service synchronization, ready to provide service. Thanks Lax -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse Sent: Thursday, October 30, 2014 1:23 AM To: linux clustering Subject: Re: [Linux-cluster] daemon cpg_join error retrying > >> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote: >> >> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >> >>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >> Ok. >> >>>> >>>> Also one more issue I am seeing in one other setup a repeated flood >>>> of 'A processor joined or left the membership and a new membership >>>> was formed' messages for every 4secs. 
I am running with default >>>> TOTEM settings with token time out as 10 secs. Even after I >>>> increase the token, consensus values to be higher. It goes on >>>> flooding the same message after newer consensus defined time (eg: >>>> if I increase it to be 10secs, then I see new membership formed >>>> messages for every 10secs) >>>> >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>>> >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >>> It does not sound like your network is particularly healthy. >>> Are you using multicast or udpu? If multicast, it might be worth >>> trying udpu >> >> I am using udpu and I also have firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM envirornment and even if I switch to other node within same VM I keep getting same failure. > > Depending on what the host and VMs are doing, that might be your problem. > In any case, I will defer to the corosync guys at this point. > Lax, usual reasons for this problem: 1. mtu is too high and fragmented packets are not enabled (take a look to netmtu configuration option) 2. config file on nodes are not in sync and one node may contain more node entries then other nodes (this may be also the case if you have two clusters and one cluster contains entry of one node for other cluster) 3. firewall is asymmetrically blocked (so node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending. I would recommend to disable firewall completely (for testing) and if everything will work, you just need to adjust firewall. Regards, Honza >> >> Thanks >> Lax >> >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com >> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Wednesday, October 29, 2014 3:17 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >> >> >>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >>> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >> >> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >> >>> >>> Also one more issue I am seeing in one other setup a repeated flood >>> of 'A processor joined or left the membership and a new membership >>> was formed' messages for every 4secs. I am running with default >>> TOTEM settings with token time out as 10 secs. Even after I increase >>> the token, consensus values to be higher. 
It goes on flooding the >>> same message after newer consensus defined time (eg: if I increase >>> it to be 10secs, then I see new membership formed messages for every >>> 10secs) >>> >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >> >> It does not sound like your network is particularly healthy. >> Are you using multicast or udpu? If multicast, it might be worth >> trying udpu >> >>> >>> Thanks >>> Lax >>> >>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew >>> Beekhof >>> Sent: Wednesday, October 29, 2014 2:42 PM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >>> >>> >>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>>> >>>> Hi All, >>>> >>>> In one of my setup, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >>> >>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>> >>>> >>>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >>>> >>>> >>>> Thanks >>>> Lax >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From fanyunfeng.ce at gmail.com Fri Oct 31 05:35:24 2014 From: fanyunfeng.ce at gmail.com (Yunfeng Fan) Date: Fri, 31 Oct 2014 13:35:24 +0800 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: Message-ID: On Oct 30, 2014 5:44 AM, "Lax Kota (lkota)" wrote: > Hi All, > > > In one of my setup, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. > > > > Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue?
> > > > > Thanks > > Lax > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL:

From jfriesse at redhat.com Fri Oct 31 16:43:29 2014 From: jfriesse at redhat.com (Jan Friesse) Date: Fri, 31 Oct 2014 17:43:29 +0100 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> <5451F581.5050100@redhat.com> Message-ID: <5453BC31.2000102@redhat.com>

Lax,

> Thanks Honza. Here is what I was doing,
>
>> the usual reasons for this problem are:
>> 1. The MTU is too high and fragmented packets are not enabled (take a look at the netmtu configuration option).
> I am running with default mtu settings, which is 1500. And I do see my interface (eth1) on the box has an MTU of 1500 too.

Keep in mind that if the nodes are not directly connected, a switch can drop packets because of the MTU.

>> 2. The config file on the nodes is not in sync, and one node may contain more node entries than the others (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster).
>> 3. The firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.
> Verified my config files cluster.conf and cib.xml, and both have the same number of node entries (2).
>
>> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.
> I also ran tests with the firewall off on both participating nodes, and I still see the same issue.
>
> In the corosync log I see the following set of messages repeated; hoping these will give some more pointers.
>
> Oct 29 22:11:02 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11.
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10.
> Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0.

This is just weird. What exact version of corosync are you running? Do you have the latest Z stream?

Regards,
Honza
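(For completeness: the version and Z-stream level Honza asks about can be read straight from the installed packages; a sketch, RHEL 6 style, with an illustrative release string:)

rpm -q corosync pacemaker   # the release field shows the Z stream, e.g. corosync-1.4.1-17.el6_5.1
corosync -v                 # prints the version of the corosync binary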
> Oct 29 22:11:05 corosync [TOTEM ] got commit token > Oct 29 22:11:05 corosync [TOTEM ] Saving state aru 1b high seq received 1b > Oct 29 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708 > Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state. > Oct 29 22:11:05 corosync [TOTEM ] got commit token > Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state. > Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64: > Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65: > Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64: > Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64 > Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1 > Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65: > Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64 > Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1 > Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery. > Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff > Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 > Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 > Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 > Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0 > Oct 29 22:11:05 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 > Oct 29 22:11:05 corosync [TOTEM ] Resetting old ring state > Oct 29 22:11:05 corosync [TOTEM ] recovery to regular 1-0 > Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 1, seq=333576 > Oct 29 22:11:05 corosync [TOTEM ] waiting_trans_ack changed to 1 > Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 0, seq=333576 > Oct 29 22:11:05 corosync [CMAN ] ais: last memb_count = 2, current = 2 > Oct 29 22:11:05 corosync [CMAN ] memb: sending TRANSITION message. cluster_name = vsomcluster > Oct 29 22:11:05 corosync [CMAN ] ais: comms send message 0x7fff8185ca00 len = 65 > Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 24 > Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 34 > Oct 29 22:11:05 corosync [SYNC ] This node is within the primary component and will provide service. > Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state. > Oct 29 22:11:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. > Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=81, endian_conv=0 > Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5 > Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 2 > Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0 > Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=2, incarnation = 333576 > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 1, len=81, endian_conv=0 > Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5 > Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 1 > Oct 29 22:11:05 corosync [CMAN ] Completed first transition with nodes on the same config versions > Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0 > Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=1, incarnation = 333576 > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy CLM service) > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy CLM service) > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy AMF service) > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy AMF service) > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (openais checkpoint service B.01.01) > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (openais checkpoint service B.01.01) > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy EVT service) > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. 
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy EVT service) > Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (corosync cluster closed process group service v1.01) > Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 1 > Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.65) ; members(old:2 left:0) > Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 22:11:05 corosync [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) > Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 2 > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0. > Oct 29 22:11:05 corosync [SYNC ] confchg entries 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2 > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1. > Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1. > Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(172.28.0.65) , pid:9198 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(172.28.0.65) , pid:9196 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(172.28.0.65) , pid:9194 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(172.28.0.65) , pid:9193 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[4] group:pcmk\x00, ip:r(0) ip(172.28.0.65) , pid:9187 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[5] group:gfs:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9111 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[6] group:dlm:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9057 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[7] group:fenced:default\x00, ip:r(0) ip(172.28.0.65) , pid:9040 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[8] group:fenced:daemon\x00, ip:r(0) ip(172.28.0.65) , pid:9040 > Oct 29 22:11:05 corosync [CPG ] joinlist_messages[9] group:crmd\x00, ip:r(0) ip(172.28.0.64) , pid:14530 > Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01) > Oct 29 22:11:05 corosync [MAIN ] Completed service synchronization, ready to provide service. > > Thanks > Lax > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse > Sent: Thursday, October 30, 2014 1:23 AM > To: linux clustering > Subject: Re: [Linux-cluster] daemon cpg_join error retrying > >> >>> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote: >>> >>> >>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >>> >>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >>> Ok. 
>>> >>>>> >>>>> Also one more issue I am seeing in one other setup a repeated flood >>>>> of 'A processor joined or left the membership and a new membership >>>>> was formed' messages for every 4secs. I am running with default >>>>> TOTEM settings with token time out as 10 secs. Even after I >>>>> increase the token, consensus values to be higher. It goes on >>>>> flooding the same message after newer consensus defined time (eg: >>>>> if I increase it to be 10secs, then I see new membership formed >>>>> messages for every 10secs) >>>>> >>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>>>> >>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>>> It does not sound like your network is particularly healthy. >>>> Are you using multicast or udpu? If multicast, it might be worth >>>> trying udpu >>> >>> I am using udpu and I also have firewall opened for ports 5404 & 5405. Tcpdump looks fine too, it does not complain of any issues. This is a VM envirornment and even if I switch to other node within same VM I keep getting same failure. >> >> Depending on what the host and VMs are doing, that might be your problem. >> In any case, I will defer to the corosync guys at this point. >> > > Lax, > usual reasons for this problem: > 1. mtu is too high and fragmented packets are not enabled (take a look to netmtu configuration option) 2. config file on nodes are not in sync and one node may contain more node entries then other nodes (this may be also the case if you have two clusters and one cluster contains entry of one node for other cluster) 3. firewall is asymmetrically blocked (so node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending. > > I would recommend to disable firewall completely (for testing) and if everything will work, you just need to adjust firewall. > > Regards, > Honza > > > >>> >>> Thanks >>> Lax >>> >>> >>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>> Sent: Wednesday, October 29, 2014 3:17 PM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >>> >>> >>>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) wrote: >>>> >>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>> How to check cluster name of GFS file system? I had similar configuration running fine in multiple other setups with no such issue. >>> >>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in. >>> >>>> >>>> Also one more issue I am seeing in one other setup a repeated flood >>>> of 'A processor joined or left the membership and a new membership >>>> was formed' messages for every 4secs. 
I am running with default >>>> TOTEM settings with token time out as 10 secs. Even after I increase >>>> the token, consensus values to be higher. It goes on flooding the >>>> same message after newer consensus defined time (eg: if I increase >>>> it to be 10secs, then I see new membership formed messages for every >>>> 10secs) >>>> >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>>> >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed. >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0) >>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service. >>> >>> It does not sound like your network is particularly healthy. >>> Are you using multicast or udpu? If multicast, it might be worth >>> trying udpu >>> >>>> >>>> Thanks >>>> Lax >>>> >>>> >>>> -----Original Message----- >>>> From: linux-cluster-bounces at redhat.com >>>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew >>>> Beekhof >>>> Sent: Wednesday, October 29, 2014 2:42 PM >>>> To: linux clustering >>>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying >>>> >>>> >>>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote: >>>>> >>>>> Hi All, >>>>> >>>>> In one of my setup, I keep getting getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2 Node setup with pacemaker and corosync. >>>> >>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with. >>>> >>>>> >>>>> Even after I force kill the pacemaker processes and reboot the server and bring the pacemaker back up, it keeps giving cpg_join error. Is there any way to fix this issue? >>>>> >>>>> >>>>> Thanks >>>>> Lax >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From lkota at cisco.com Fri Oct 31 19:41:24 2014 From: lkota at cisco.com (Lax Kota (lkota)) Date: Fri, 31 Oct 2014 19:41:24 +0000 Subject: [Linux-cluster] daemon cpg_join error retrying In-Reply-To: <5453BC31.2000102@redhat.com> References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> <5451F581.5050100@redhat.com> <5453BC31.2000102@redhat.com> Message-ID: > This is just weird. What exact version of corosync are you running? Do you have latest Z stream? 
I am running on Corosync 1.4.1 and pacemaker version is 1.1.8-7.el6 Thanks Lax -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse Sent: Friday, October 31, 2014 9:43 AM To: linux clustering Subject: Re: [Linux-cluster] daemon cpg_join error retrying Lax, > Thanks Honza. Here is what I was doing, > >> usual reasons for this problem: >> 1. mtu is too high and fragmented packets are not enabled (take a >> look to netmtu configuration option) > I am running with default mtu settings which is 1500. And I do see my interface(eth1) on the box does have MTU as 1500 too. > Keep in mind that if they are not directly connected, switch can throw packets because of MTU. > > 2. config file on nodes are not in sync and one node may contain more node entries then other nodes (this may be also the case if you have two > clusters and one cluster contains entry of one node for other cluster) 3. firewall is asymmetrically blocked (so node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending. > Verfiifed my config files cluster.conf and cib.xml and both have same > no of node entries (2) > >> I would recommend to disable firewall completely (for testing) and if everything will work, you just need to adjust firewall. > I also ran tests with firewall off too on both the participating > nodes, still see same issue > > In corosync log I see repeated set of these messages, hoping these will give some more pointers. > > Oct 29 22:11:02 corosync [SYNC ] Committing synchronization for > (corosync cluster closed process group service v1.01) Oct 29 22:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service. > Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0 Oct > 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11. > Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10. > Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0. This is just weird. What exact version of corosync are you running? Do you have latest Z stream? Regards, Honza > Oct 29 22:11:05 corosync [TOTEM ] got commit token Oct 29 22:11:05 > corosync [TOTEM ] Saving state aru 1b high seq received 1b Oct 29 > 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708 Oct > 29 22:11:05 corosync [TOTEM ] entering COMMIT state. > Oct 29 22:11:05 corosync [TOTEM ] got commit token Oct 29 22:11:05 > corosync [TOTEM ] entering RECOVERY state. > Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64: > Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65: > Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64: > Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep > 172.28.0.64 Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b > received flag 1 Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65: > Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep > 172.28.0.64 Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b > received flag 1 Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery. 
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> Oct 29 22:11:05 corosync [TOTEM ] Resetting old ring state
> Oct 29 22:11:05 corosync [TOTEM ] recovery to regular 1-0
> Oct 29 22:11:05 corosync [CMAN  ] ais: confchg_fn called type = 1, seq=333576
> Oct 29 22:11:05 corosync [TOTEM ] waiting_trans_ack changed to 1
> Oct 29 22:11:05 corosync [CMAN  ] ais: confchg_fn called type = 0, seq=333576
> Oct 29 22:11:05 corosync [CMAN  ] ais: last memb_count = 2, current = 2
> Oct 29 22:11:05 corosync [CMAN  ] memb: sending TRANSITION message. cluster_name = vsomcluster
> Oct 29 22:11:05 corosync [CMAN  ] ais: comms send message 0x7fff8185ca00 len = 65
> Oct 29 22:11:05 corosync [CMAN  ] daemon: sending reply 103 to fd 24
> Oct 29 22:11:05 corosync [CMAN  ] daemon: sending reply 103 to fd 34
> Oct 29 22:11:05 corosync [SYNC  ] This node is within the primary component and will provide service.
> Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state.
> Oct 29 22:11:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct 29 22:11:05 corosync [CMAN  ] ais: deliver_fn source nodeid = 2, len=81, endian_conv=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: Message on port 0 is 5
> Oct 29 22:11:05 corosync [CMAN  ] memb: got TRANSITION from node 2
> Oct 29 22:11:05 corosync [CMAN  ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: add_ais_node ID=2, incarnation = 333576
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [CMAN  ] ais: deliver_fn source nodeid = 1, len=81, endian_conv=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: Message on port 0 is 5
> Oct 29 22:11:05 corosync [CMAN  ] memb: got TRANSITION from node 1
> Oct 29 22:11:05 corosync [CMAN  ] Completed first transition with nodes on the same config versions
> Oct 29 22:11:05 corosync [CMAN  ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
> Oct 29 22:11:05 corosync [CMAN  ] memb: add_ais_node ID=1, incarnation = 333576
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy CLM service)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy CLM service)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy AMF service)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy AMF service)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (openais checkpoint service B.01.01)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (openais checkpoint service B.01.01)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (dummy EVT service)
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (dummy EVT service)
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization actions starting for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:05 corosync [CPG   ] got joinlist message from node 1
> Oct 29 22:11:05 corosync [CPG   ] comparing: sender r(0) ip(172.28.0.65) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG   ] comparing: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG   ] got joinlist message from node 2
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC  ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC  ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(172.28.0.65) , pid:9198
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(172.28.0.65) , pid:9196
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(172.28.0.65) , pid:9194
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(172.28.0.65) , pid:9193
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[4] group:pcmk\x00, ip:r(0) ip(172.28.0.65) , pid:9187
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[5] group:gfs:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9111
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[6] group:dlm:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9057
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[7] group:fenced:default\x00, ip:r(0) ip(172.28.0.65) , pid:9040
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[8] group:fenced:daemon\x00, ip:r(0) ip(172.28.0.65) , pid:9040
> Oct 29 22:11:05 corosync [CPG   ] joinlist_messages[9] group:crmd\x00, ip:r(0) ip(172.28.0.64) , pid:14530
> Oct 29 22:11:05 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:05 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>
> Thanks
> Lax
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse
> Sent: Thursday, October 30, 2014 1:23 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>
>>
>>> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) wrote:
>>>
>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>>>> How do I check the cluster name of a GFS file system? I had a similar configuration running fine in multiple other setups with no such issue.
>>>
>>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
>>> Ok.
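As an aside on the question above: the cluster name a GFS2 filesystem was created with is stored in the lock-table field of its superblock, so it can be compared directly against cluster.conf. A minimal sketch, with an illustrative device path (gfs2_tool ships in gfs2-utils on RHEL 6; tunegfs2 is the newer equivalent):

    # Print the lock table ("clustername:fsname") recorded in the superblock.
    gfs2_tool sb /dev/mapper/clustervg-gfs2lv table
    # Or, where tunegfs2 is available:
    tunegfs2 -l /dev/mapper/clustervg-gfs2lv | grep -i table
    # The part before the colon must match the cluster name in cluster.conf;
    # cman_tool status also prints the running cluster's name for comparison.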
>>>
>>>>> Also, one more issue I am seeing in one other setup: a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages every 4 secs. I am running with default TOTEM settings, with the token timeout as 10 secs, and the flood continues even after I increase the token and consensus values. It goes on repeating the same message at whatever the newly defined consensus time is (e.g. if I increase it to 10 secs, then I see new-membership-formed messages every 10 secs).
>>>>>
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>
>>>> It does not sound like your network is particularly healthy.
>>>> Are you using multicast or udpu? If multicast, it might be worth trying udpu.
>>>
>>> I am using udpu and I also have the firewall opened for ports 5404 & 5405. Tcpdump looks fine too; it does not complain of any issues. This is a VM environment, and even if I switch to the other node within the same VM I keep getting the same failure.
>>
>> Depending on what the host and VMs are doing, that might be your problem.
>> In any case, I will defer to the corosync guys at this point.
>>
>
> Lax,
> usual reasons for this problem:
> 1. mtu is too high and fragmented packets are not enabled (take a look at the netmtu configuration option)
> 2. config files on the nodes are not in sync and one node may contain more node entries than the other nodes (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster)
> 3. the firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.
>
> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.
>
> Regards,
> Honza
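A concrete sketch of points 1 and 3 follows; the MTU value and firewall rules are illustrative only, and the node addresses are simply the ones that appear in the log above:

    # Point 1: force a smaller totem MTU. In corosync.conf this is the netmtu
    # option of the totem section; with cman the same option is, as far as I
    # know, set as an attribute of the <totem/> element in cluster.conf,
    # e.g. <totem netmtu="1400"/>.
    #
    # Point 3: for udpu, allow all corosync UDP traffic between the node
    # addresses rather than only ports 5404/5405 (run on each node):
    iptables -I INPUT -p udp -s 172.28.0.64 -j ACCEPT
    iptables -I INPUT -p udp -s 172.28.0.65 -j ACCEPT
    #
    # Honza's "disable completely for testing" step, on RHEL 6:
    service iptables stop
    service iptables start   # re-enable after the test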
>>>> -----Original Message-----
>>>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof
>>>> Sent: Wednesday, October 29, 2014 2:42 PM
>>>> To: linux clustering
>>>> Subject: Re: [Linux-cluster] daemon cpg_join error retrying
>>>>
>>>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> In one of my setups, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2-node setup with pacemaker and corosync.
>>>>
>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>>>
>>>>> Even after I force-kill the pacemaker processes, reboot the server, and bring pacemaker back up, it keeps giving the cpg_join error. Is there any way to fix this issue?
>>>>>
>>>>> Thanks
>>>>> Lax

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From lkota at cisco.com Fri Oct 31 19:43:01 2014
From: lkota at cisco.com (Lax Kota (lkota))
Date: Fri, 31 Oct 2014 19:43:01 +0000
Subject: [Linux-cluster] daemon cpg_join error retrying
References: <68ABE774-8755-416F-829B-CED002B14D03@beekhof.net> <5451F581.5050100@redhat.com> <5453BC31.2000102@redhat.com>
Message-ID:

> This is just weird. What exact version of corosync are you running?
> Do you have latest Z stream?

I am running Corosync 1.4.1 and the pacemaker version is 1.1.8-7.el6. How should I get access to the Z stream? Is there a specific directory I should pick this Z stream from?

Thanks
Lax

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jan Friesse
Sent: Friday, October 31, 2014 9:43 AM
To: linux clustering
Subject: Re: [Linux-cluster] daemon cpg_join error retrying

Lax,
> Thanks Honza. Here is what I was doing,
>
>> usual reasons for this problem:
>> 1. mtu is too high and fragmented packets are not enabled (take a look at the netmtu configuration option)
>
> I am running with the default MTU setting, which is 1500. And I do see that my interface (eth1) on the box has an MTU of 1500 too.

Keep in mind that if the nodes are not directly connected, the switch can drop packets because of MTU.

>> 2. config files on the nodes are not in sync and one node may contain more node entries than the other nodes (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster)
>> 3. the firewall is asymmetrically blocked (so a node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.
>
> Verified my config files cluster.conf and cib.xml, and both have the same number of node entries (2).
>
>> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.
>
> I also ran tests with the firewall off on both participating nodes and still see the same issue.
>
> In the corosync log I see a repeated set of these messages; hoping these will give some more pointers.
>
> Oct 29 22:11:02 corosync [SYNC  ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:02 corosync [MAIN  ] Completed service synchronization, ready to provide service.
> Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11.
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10.
> Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0.

This is just weird. What exact version of corosync are you running? Do you have latest Z stream?

Regards,
Honza

> Oct 29 22:11:05 corosync [TOTEM ] got commit token
> Oct 29 22:11:05 corosync [TOTEM ] Saving state aru 1b high seq received 1b
> Oct 29 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708
> Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state.
> Oct 29 22:11:05 corosync [TOTEM ] got commit token
> Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state.
> Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64:
> Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65:
> Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64:
> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
> Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65:
> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
> Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery.
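Honza's MTU caveat above can also be tested directly. A sketch of a path-MTU check between the two nodes, using the addresses from the log (payload sizes are illustrative):

    # 1472 bytes of ICMP payload + 28 bytes of headers = one full 1500-byte frame.
    # -M do sets the DF bit, so the ping fails if any hop cannot pass it unfragmented.
    ping -M do -s 1472 -c 3 172.28.0.65
    # If this fails while a smaller payload (e.g. -s 1372) succeeds, the path MTU
    # is below 1500 and lowering netmtu (or fixing the switch) is the likely remedy.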
> [snip: remainder of the quoted corosync log and earlier thread, unchanged from the previous message]

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster