From lists at alteeve.ca Mon Jan 6 16:11:13 2014 From: lists at alteeve.ca (Digimer) Date: Mon, 06 Jan 2014 11:11:13 -0500 Subject: [Linux-cluster] Announcing a new HA KVM tutorial! Message-ID: <52CAD5A1.8070602@alteeve.ca> Almost exactly two years ago, I released the first tutorial for building an HA platform for KVM VMs. In that time, I have learned a lot, created some tools to simplify management and refined the design to handle corner cases seen in the field. Today, the culmination of that learning is summed up in the "2nd Edition" of that tutorial, now called "AN!Cluster Tutorial 2". https://alteeve.ca/w/AN!Cluster_Tutorial_2 These HA KVM platforms have been in production for over two years now in facilities all over the world: universities, municipal governments, corporate DCs, manufacturing facilities, etc. I've gotten wonderful feedback from users, and all that real-world experience has been integrated into this new tutorial. As always, everything is 100% open source and free-as-in-beer! The major changes are: * SELinux and iptables are enabled and used. * Numerous slight changes made to the OS and cluster stack configuration to provide better corner-case fault handling. * Architecture refinements; ** Redundant PSUs, UPSes and fence methods emphasized. ** Monitoring of multiple UPSes added via a modified apcupsd ** Detailed monitoring of LSI-based RAID controllers and drives ** Discussion of hardware considerations for VM performance based on anticipated workloads * Naming convention changes to support the new AN!CDB dashboard[1] ** New alert system covered, with fault and notable-event alerting * A wider array of guest OSes is covered; ** Windows 7 ** Windows 8 ** Windows 2008 R2 ** Windows 2012 ** Solaris 11 ** FreeBSD 9 ** RHEL 6 ** SLES 11 Beyond that, the formatting of the tutorial itself has been slightly modified. I think it is the easiest-to-follow tutorial I have yet been able to produce. I am very proud of this one! :D As always, feedback is very much appreciated. Everything from typos and grammar mistakes to functional problems or anything else is very valuable. I take all the feedback I get and use it to help make the tutorials better. Enjoy! Digimer, who can now start the next tutorial in earnest! 1. https://alteeve.ca/w/AN!CDB -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From nishantsingal at gmail.com Wed Jan 8 03:14:16 2014 From: nishantsingal at gmail.com (Nishant Singhal) Date: Wed, 8 Jan 2014 11:14:16 +0800 Subject: [Linux-cluster] help Message-ID: Hi Team, I am getting an error while configuring Active/Active GFS2 clustering by following the Cluster Labs "Clusters from Scratch" guide. I am using RHEL 6.2 and I have installed the full HA package. 
my configure is below node pcmk-1 \ attributes standby="off" node pcmk-2 \ attributes standby="off" primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip="192.168.123.175" cidr_netmask="32" \ op monitor interval="30s" primitive WebData ocf:linbit:drbd \ params drbd_resource="wwwdata" \ op monitor interval="60s" primitive WebFS ocf:heartbeat:Filesystem \ params device="/dev/drbd/by-res/wwwdata" directory="/var/www/html" fstype="ext4" \ meta target-role="Stopped" primitive WebSite ocf:heartbeat:apache \ params configfile="/etc/httpd/conf/httpd.conf" statusurl=" http://localhost/server-status" \ op monitor interval="1min" primitive dlm ocf:pacemaker:controld \ op monitor interval="60s" ms WebDataClone WebData \ meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" clone dlm_clone dlm \ meta clone-max="2" clone-node-max="1" location prefer-pcmk-1 WebSite 50: pcmk-1 colocation WebSite-with-WebFS inf: WebSite WebFS colocation fs_on_drbd inf: WebFS WebDataClone:Master colocation website-with-ip inf: WebSite ClusterIP order WebFS-after-WebData inf: WebDataClone:promote WebFS:start order WebSite-after-WebFS inf: WebFS WebSite order apache-after-ip inf: ClusterIP WebSite property $id="cib-bootstrap-options" \ dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ stonith-enabled="false" \ no-quorum-policy="ignore" rsc_defaults $id="rsc-options" \ resource-stickiness="100" op_defaults $id="op-options" \ timeout="240s" crm_mon -1 ------------------- Last updated: Wed Jan 8 11:11:43 2014 Last change: Tue Jan 7 17:59:29 2014 via crm_shadow on pcmk-1 Stack: openais Current DC: pcmk-1 - partition with quorum Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558 2 Nodes configured, 2 expected votes 7 Resources configured. ============ Online: [ pcmk-1 pcmk-2 ] ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 Master/Slave Set: WebDataClone [WebData] Masters: [ pcmk-1 ] Slaves: [ pcmk-2 ] Failed actions: dlm:0_monitor_0 (node=pcmk-1, call=6, rc=5, status=complete): not installed dlm:0_monitor_0 (node=pcmk-2, call=6, rc=5, status=complete): not installed Kindly help me to resolve the dlm issue. Thanks & Regards Nishant Kumar -------------- next part -------------- An HTML attachment was scrubbed... URL: From bjoern.teipel at internetbrands.com Wed Jan 8 05:57:32 2014 From: bjoern.teipel at internetbrands.com (Bjoern Teipel) Date: Tue, 7 Jan 2014 21:57:32 -0800 Subject: [Linux-cluster] Adding node to clvm cluster Message-ID: I'm trying to join a new node into an existing 5 node CLVM cluster but I just can't get it to work. When ever I add a new node (I put into the cluster.conf and reloaded with cman_tool version -r -S) I end up with situations like the new node wants to gain the quorum and starts to fence the existing pool master and appears to generate some sort of split cluster. Does it work at all, corosync and dlm do not know about the recently added node ? New Node ========== Node Sts Inc Joined Name 1 X 0 hv-b1clcy1 2 X 0 hv-b1flcy1 3 X 0 hv-b1fmcy1 4 X 0 hv-b1dmcy1 5 X 0 hv-b1fkcy1 6 M 80 2014-01-07 21:37:42 hv-b1dkcy1 <--- host added Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [TOTEM ] The network interface [10.14.18.77] is now up. 
Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [CMAN ] CMAN 3.0.12.1 (built Sep 3 2013 09:17:34) started Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90 Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [SERV ] Service engine loaded: openais checkpoint service B.01.01 Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [SERV ] Service engine loaded: corosync extended virtual synchrony service Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [SERV ] Service engine loaded: corosync configuration service Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster config database access v1.01 Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [SERV ] Service engine loaded: corosync profile loading service Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.65} Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.67} Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.68} Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.70} Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.66} Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.77} Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [CMAN ] quorum regained, resuming activity Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [QUORUM] This node is within the primary component and will provide service. Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [QUORUM] Members[1]: 6 Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [QUORUM] Members[1]: 6 Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.77) ; members(old:0 left:0) Jan 7 21:37:42 hv-b1dkcy1 corosync[12564]: [MAIN ] Completed service synchronization, ready to provide service. 
Jan 7 21:37:46 hv-b1dkcy1 fenced[12620]: fenced 3.0.12.1 started Jan 7 21:37:46 hv-b1dkcy1 dlm_controld[12643]: dlm_controld 3.0.12.1 started Jan 7 21:37:47 hv-b1dkcy1 gfs_controld[12695]: gfs_controld 3.0.12.1 started Jan 7 21:37:54 hv-b1dkcy1 fenced[12620]: fencing node hv-b1clcy1 sudo -i corosync-objctl |grep member totem.interface.member.memberaddr=hv-b1clcy1 totem.interface.member.memberaddr=hv-b1fmcy1 totem.interface.member.memberaddr=hv-b1dmcy1 totem.interface.member.memberaddr=hv-b1fkcy1 totem.interface.member.memberaddr=hv-b1flcy1 totem.interface.member.memberaddr=hv-b1dkcy1 runtime.totem.pg.mrp.srp.members.6.ip=r(0) ip(10.14.18.77) runtime.totem.pg.mrp.srp.members.6.join_count=1 runtime.totem.pg.mrp.srp.members.6.status=joined Existing Node ============= member 6 has not been added to the quorum list : Jan 7 21:36:28 hv-b1clcy1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5 Jan 7 21:37:54 hv-b1clcy1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 7 21:37:54 hv-b1clcy1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0) Node Sts Inc Joined Name 1 M 4468 2013-12-10 14:33:27 hv-b1clcy1 2 M 4468 2013-12-10 14:33:27 hv-b1flcy1 3 M 5036 2014-01-07 17:51:26 hv-b1fmcy1 4 X 4468 hv-b1dmcy1 (dead at the moment) 5 M 4468 2013-12-10 14:33:27 hv-b1fkcy1 6 X 0 hv-b1dkcy1 <--- added Jan 7 21:36:28 hv-b1clcy1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5 Jan 7 21:37:54 hv-b1clcy1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Jan 7 21:37:54 hv-b1clcy1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0) Jan 7 21:37:54 hv-b1clcy1 corosync[7769]: [MAIN ] Completed service synchronization, ready to provide service. totem.interface.member.memberaddr=hv-b1clcy1 totem.interface.member.memberaddr=hv-b1fmcy1 totem.interface.member.memberaddr=hv-b1dmcy1 totem.interface.member.memberaddr=hv-b1fkcy1 totem.interface.member.memberaddr=hv-b1flcy1. runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(10.14.18.65) runtime.totem.pg.mrp.srp.members.1.join_count=1 runtime.totem.pg.mrp.srp.members.1.status=joined runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(10.14.18.66) runtime.totem.pg.mrp.srp.members.2.join_count=1 runtime.totem.pg.mrp.srp.members.2.status=joined runtime.totem.pg.mrp.srp.members.4.ip=r(0) ip(10.14.18.68) runtime.totem.pg.mrp.srp.members.4.join_count=1 runtime.totem.pg.mrp.srp.members.4.status=left runtime.totem.pg.mrp.srp.members.5.ip=r(0) ip(10.14.18.70) runtime.totem.pg.mrp.srp.members.5.join_count=1 runtime.totem.pg.mrp.srp.members.5.status=joined runtime.totem.pg.mrp.srp.members.3.ip=r(0) ip(10.14.18.67) runtime.totem.pg.mrp.srp.members.3.join_count=3 runtime.totem.pg.mrp.srp.members.3.status=joined cluster.conf: (manual fencing just for testing) corosync.conf: compatibility: whitetank totem { version: 2 secauth: off threads: 0 # fail_recv_const: 5000 interface { ringnumber: 0 bindnetaddr: 10.14.18.0 mcastaddr: 239.0.0.4 mcastport: 5405 } } logging { fileline: off to_stderr: no to_logfile: yes to_syslog: yes # the pathname of the log file logfile: /var/log/cluster/corosync.log debug: off timestamp: on logger_subsys { subsys: AMF debug: off } } amf { mode: disabled } Many thanks, Bjoern -------------- next part -------------- An HTML attachment was scrubbed... 
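For reference, with the RHEL 6 cman stack corosync takes its totem and interface settings from /etc/cluster/cluster.conf; a separate /etc/corosync/corosync.conf (as posted above) is ignored. A minimal pre-flight check before starting cman on a freshly added node might look like the following sketch; the two addresses are taken from the log above, everything else is stock RHEL 6 tooling:

# On every existing member and on the new node: the configuration must
# validate and carry the same version number everywhere
ccs_config_validate
cman_tool version

# Push the updated cluster.conf from a node that is already a member
cman_tool version -r -S

# Confirm the new node and an existing member can reach each other over the
# cluster interconnect (omping reports both unicast and multicast results)
omping 10.14.18.65 10.14.18.77

If the new node cannot see the established members when corosync starts, it forms its own one-node membership and its fenced will try to fence them, which matches the behaviour in the log above.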
URL: From emi2fast at gmail.com Wed Jan 8 08:31:55 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 8 Jan 2014 09:31:55 +0100 Subject: [Linux-cluster] help In-Reply-To: References: Message-ID: I tried on centos 6.3 "yum search dlm", but nothing found, on suse i got something different pcmk01:~ # which dlm_controld /usr/sbin/dlm_controld pcmk01:~ # rpm -qf /usr/sbin/dlm_controld libdlm-4.0.2-1.2.x86_64 2014/1/8 Nishant Singhal > Hi Team, > > I am getting one error while i am configuring Active/Active Gfs2 > clustering via clusterlab from scratch. I am using RHEL6.2 and i have > install HA full Package. > > my configure is below > node pcmk-1 \ > attributes standby="off" > node pcmk-2 \ > attributes standby="off" > primitive ClusterIP ocf:heartbeat:IPaddr2 \ > params ip="192.168.123.175" cidr_netmask="32" \ > op monitor interval="30s" > primitive WebData ocf:linbit:drbd \ > params drbd_resource="wwwdata" \ > op monitor interval="60s" > primitive WebFS ocf:heartbeat:Filesystem \ > params device="/dev/drbd/by-res/wwwdata" directory="/var/www/html" > fstype="ext4" \ > meta target-role="Stopped" > primitive WebSite ocf:heartbeat:apache \ > params configfile="/etc/httpd/conf/httpd.conf" statusurl=" > http://localhost/server-status" \ > op monitor interval="1min" > primitive dlm ocf:pacemaker:controld \ > op monitor interval="60s" > ms WebDataClone WebData \ > meta master-max="1" master-node-max="1" clone-max="2" > clone-node-max="1" notify="true" > clone dlm_clone dlm \ > meta clone-max="2" clone-node-max="1" > location prefer-pcmk-1 WebSite 50: pcmk-1 > colocation WebSite-with-WebFS inf: WebSite WebFS > colocation fs_on_drbd inf: WebFS WebDataClone:Master > colocation website-with-ip inf: WebSite ClusterIP > order WebFS-after-WebData inf: WebDataClone:promote WebFS:start > order WebSite-after-WebFS inf: WebFS WebSite > order apache-after-ip inf: ClusterIP WebSite > property $id="cib-bootstrap-options" \ > dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \ > cluster-infrastructure="openais" \ > expected-quorum-votes="2" \ > stonith-enabled="false" \ > no-quorum-policy="ignore" > rsc_defaults $id="rsc-options" \ > resource-stickiness="100" > op_defaults $id="op-options" \ > timeout="240s" > > crm_mon -1 > ------------------- > Last updated: Wed Jan 8 11:11:43 2014 > Last change: Tue Jan 7 17:59:29 2014 via crm_shadow on pcmk-1 > Stack: openais > Current DC: pcmk-1 - partition with quorum > Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558 > 2 Nodes configured, 2 expected votes > 7 Resources configured. > ============ > > Online: [ pcmk-1 pcmk-2 ] > > ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 > Master/Slave Set: WebDataClone [WebData] > Masters: [ pcmk-1 ] > Slaves: [ pcmk-2 ] > > Failed actions: > dlm:0_monitor_0 (node=pcmk-1, call=6, rc=5, status=complete): not > installed > dlm:0_monitor_0 (node=pcmk-2, call=6, rc=5, status=complete): not > installed > > > Kindly help me to resolve the dlm issue. > > Thanks & Regards > Nishant Kumar > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... 
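For what it is worth, an rc=5 ("not installed") result from ocf:pacemaker:controld normally means that either the agent or the daemon it wraps is missing on that node. A quick check, assuming the stock RHEL 6 package layout:

# Is the resource agent itself installed?
ls /usr/lib/ocf/resource.d/pacemaker/controld

# Is the daemon it manages available, and which package provides it?
which dlm_controld
yum provides '*/dlm_controld'

On RHEL 6, dlm_controld ships in the cman package, and when pacemaker runs on top of cman (as the next reply points out) the cman init script starts it, so the dlm clone in the configuration above is normally not needed at all.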
URL: From emi2fast at gmail.com Wed Jan 8 08:44:28 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 8 Jan 2014 09:44:28 +0100 Subject: [Linux-cluster] help In-Reply-To: References: Message-ID: I think, if you are using cman+pacemaker you don't need pacemaker:controld look this link http://www.drbd.org/users-guide/s-gfs-configure-cman.html 2014/1/8 emmanuel segura > I tried on centos 6.3 "yum search dlm", but nothing found, on suse i got > something different > > pcmk01:~ # which dlm_controld > /usr/sbin/dlm_controld > pcmk01:~ # rpm -qf /usr/sbin/dlm_controld > libdlm-4.0.2-1.2.x86_64 > > > > 2014/1/8 Nishant Singhal > >> Hi Team, >> >> I am getting one error while i am configuring Active/Active Gfs2 >> clustering via clusterlab from scratch. I am using RHEL6.2 and i have >> install HA full Package. >> >> my configure is below >> node pcmk-1 \ >> attributes standby="off" >> node pcmk-2 \ >> attributes standby="off" >> primitive ClusterIP ocf:heartbeat:IPaddr2 \ >> params ip="192.168.123.175" cidr_netmask="32" \ >> op monitor interval="30s" >> primitive WebData ocf:linbit:drbd \ >> params drbd_resource="wwwdata" \ >> op monitor interval="60s" >> primitive WebFS ocf:heartbeat:Filesystem \ >> params device="/dev/drbd/by-res/wwwdata" >> directory="/var/www/html" fstype="ext4" \ >> meta target-role="Stopped" >> primitive WebSite ocf:heartbeat:apache \ >> params configfile="/etc/httpd/conf/httpd.conf" statusurl=" >> http://localhost/server-status" \ >> op monitor interval="1min" >> primitive dlm ocf:pacemaker:controld \ >> op monitor interval="60s" >> ms WebDataClone WebData \ >> meta master-max="1" master-node-max="1" clone-max="2" >> clone-node-max="1" notify="true" >> clone dlm_clone dlm \ >> meta clone-max="2" clone-node-max="1" >> location prefer-pcmk-1 WebSite 50: pcmk-1 >> colocation WebSite-with-WebFS inf: WebSite WebFS >> colocation fs_on_drbd inf: WebFS WebDataClone:Master >> colocation website-with-ip inf: WebSite ClusterIP >> order WebFS-after-WebData inf: WebDataClone:promote WebFS:start >> order WebSite-after-WebFS inf: WebFS WebSite >> order apache-after-ip inf: ClusterIP WebSite >> property $id="cib-bootstrap-options" \ >> dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" >> \ >> cluster-infrastructure="openais" \ >> expected-quorum-votes="2" \ >> stonith-enabled="false" \ >> no-quorum-policy="ignore" >> rsc_defaults $id="rsc-options" \ >> resource-stickiness="100" >> op_defaults $id="op-options" \ >> timeout="240s" >> >> crm_mon -1 >> ------------------- >> Last updated: Wed Jan 8 11:11:43 2014 >> Last change: Tue Jan 7 17:59:29 2014 via crm_shadow on pcmk-1 >> Stack: openais >> Current DC: pcmk-1 - partition with quorum >> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558 >> 2 Nodes configured, 2 expected votes >> 7 Resources configured. >> ============ >> >> Online: [ pcmk-1 pcmk-2 ] >> >> ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 >> Master/Slave Set: WebDataClone [WebData] >> Masters: [ pcmk-1 ] >> Slaves: [ pcmk-2 ] >> >> Failed actions: >> dlm:0_monitor_0 (node=pcmk-1, call=6, rc=5, status=complete): not >> installed >> dlm:0_monitor_0 (node=pcmk-2, call=6, rc=5, status=complete): not >> installed >> >> >> Kindly help me to resolve the dlm issue. 
>> >> Thanks & Regards >> Nishant Kumar >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Wed Jan 8 09:39:56 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 8 Jan 2014 10:39:56 +0100 Subject: [Linux-cluster] help In-Reply-To: References: Message-ID: sorry for the short answer, but if you want more information read the script /etc/init.d/cman, if you are using cman+pacemaker for gfs2 2014/1/8 emmanuel segura > I think, if you are using cman+pacemaker you don't need pacemaker:controld > > look this link http://www.drbd.org/users-guide/s-gfs-configure-cman.html > > > 2014/1/8 emmanuel segura > >> I tried on centos 6.3 "yum search dlm", but nothing found, on suse i got >> something different >> >> pcmk01:~ # which dlm_controld >> /usr/sbin/dlm_controld >> pcmk01:~ # rpm -qf /usr/sbin/dlm_controld >> libdlm-4.0.2-1.2.x86_64 >> >> >> >> 2014/1/8 Nishant Singhal >> >>> Hi Team, >>> >>> I am getting one error while i am configuring Active/Active Gfs2 >>> clustering via clusterlab from scratch. I am using RHEL6.2 and i have >>> install HA full Package. >>> >>> my configure is below >>> node pcmk-1 \ >>> attributes standby="off" >>> node pcmk-2 \ >>> attributes standby="off" >>> primitive ClusterIP ocf:heartbeat:IPaddr2 \ >>> params ip="192.168.123.175" cidr_netmask="32" \ >>> op monitor interval="30s" >>> primitive WebData ocf:linbit:drbd \ >>> params drbd_resource="wwwdata" \ >>> op monitor interval="60s" >>> primitive WebFS ocf:heartbeat:Filesystem \ >>> params device="/dev/drbd/by-res/wwwdata" >>> directory="/var/www/html" fstype="ext4" \ >>> meta target-role="Stopped" >>> primitive WebSite ocf:heartbeat:apache \ >>> params configfile="/etc/httpd/conf/httpd.conf" statusurl=" >>> http://localhost/server-status" \ >>> op monitor interval="1min" >>> primitive dlm ocf:pacemaker:controld \ >>> op monitor interval="60s" >>> ms WebDataClone WebData \ >>> meta master-max="1" master-node-max="1" clone-max="2" >>> clone-node-max="1" notify="true" >>> clone dlm_clone dlm \ >>> meta clone-max="2" clone-node-max="1" >>> location prefer-pcmk-1 WebSite 50: pcmk-1 >>> colocation WebSite-with-WebFS inf: WebSite WebFS >>> colocation fs_on_drbd inf: WebFS WebDataClone:Master >>> colocation website-with-ip inf: WebSite ClusterIP >>> order WebFS-after-WebData inf: WebDataClone:promote WebFS:start >>> order WebSite-after-WebFS inf: WebFS WebSite >>> order apache-after-ip inf: ClusterIP WebSite >>> property $id="cib-bootstrap-options" \ >>> >>> dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \ >>> cluster-infrastructure="openais" \ >>> expected-quorum-votes="2" \ >>> stonith-enabled="false" \ >>> no-quorum-policy="ignore" >>> rsc_defaults $id="rsc-options" \ >>> resource-stickiness="100" >>> op_defaults $id="op-options" \ >>> timeout="240s" >>> >>> crm_mon -1 >>> ------------------- >>> Last updated: Wed Jan 8 11:11:43 2014 >>> Last change: Tue Jan 7 17:59:29 2014 via crm_shadow on pcmk-1 >>> Stack: openais >>> Current DC: pcmk-1 - partition with quorum >>> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558 >>> 2 Nodes configured, 2 expected votes >>> 7 Resources configured. 
>>> ============ >>> >>> Online: [ pcmk-1 pcmk-2 ] >>> >>> ClusterIP (ocf::heartbeat:IPaddr2): Started pcmk-1 >>> Master/Slave Set: WebDataClone [WebData] >>> Masters: [ pcmk-1 ] >>> Slaves: [ pcmk-2 ] >>> >>> Failed actions: >>> dlm:0_monitor_0 (node=pcmk-1, call=6, rc=5, status=complete): not >>> installed >>> dlm:0_monitor_0 (node=pcmk-2, call=6, rc=5, status=complete): not >>> installed >>> >>> >>> Kindly help me to resolve the dlm issue. >>> >>> Thanks & Regards >>> Nishant Kumar >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> >> -- >> esta es mi vida e me la vivo hasta que dios quiera >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From mgrac at redhat.com Fri Jan 10 13:15:47 2014 From: mgrac at redhat.com (Marek Grac) Date: Fri, 10 Jan 2014 14:15:47 +0100 Subject: [Linux-cluster] fence-agents-4.0.6 stable release Message-ID: <52CFF283.5040303@redhat.com> Welcome to the fence-agents 4.0.6 release. This release includes quite a lot of new features and small bugfixes: * support for Dell Drac MC was added to fence_drac5 * support for Tripplite PDU was added to fence_apc (thanks to Bogdan Dobrelya) * support for AMT was added as new fence agent fence_amt (thanks to Ondrej Mular) * support for identification of virtual machine using UUID was added to fence_virsh (thanks to Bogdan) * fence_ipmilan was ported from C to Python using standard fencing library (thanks to Ondrej) * problem with using password on fence device where private key was accepted was fixed The new source tarball can be downloaded here: https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.6.tar.xz To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. m, From yamato at redhat.com Wed Jan 15 08:16:46 2014 From: yamato at redhat.com (Masatake YAMATO) Date: Wed, 15 Jan 2014 17:16:46 +0900 (JST) Subject: [Linux-cluster] HaveState flag in cman_tool status Message-ID: <20140115.171646.323221883839177210.yamato@redhat.com> Hi, I read "What is the "Dirty" flag that cman_tool shows, and should I be worried?" in https://fedorahosted.org/cluster/wiki/FAQ/CMAN to know when the flag is shown. In my understanding cman_tool status doesn't report HaveState on RHEL6 because our cluster product uses cpg instead of groupd. Is my understanding correct? Regards, Masatake YAMATO From paul.lambert at mac.com Wed Jan 15 13:53:22 2014 From: paul.lambert at mac.com (Paul Lambert) Date: Wed, 15 Jan 2014 13:53:22 +0000 Subject: [Linux-cluster] RHCS guest cluster on OracleVM Message-ID: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> Don't ask why - I know its crazy but ... Can anyone assist in configuring fencing of RHCS cluster nodes on OracleVM. 
From linuxtovishesh at gmail.com Wed Jan 15 14:03:41 2014 From: linuxtovishesh at gmail.com (Vishesh kumar) Date: Wed, 15 Jan 2014 19:33:41 +0530 Subject: [Linux-cluster] RHCS guest cluster on OracleVM In-Reply-To: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> References: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> Message-ID: Let us know which sort of fencing you want to use in clustering here. -- Regards, Vishesh Kumar http://linuxmantra.com On Wed, Jan 15, 2014 at 7:23 PM, Paul Lambert wrote: > Don't ask why - I know its crazy but ... > Can anyone assist in configuring fencing of RHCS cluster nodes on OracleVM. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Regards, Vishesh Kumar http://linuxmantra.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.lambert at mac.com Wed Jan 15 14:15:09 2014 From: paul.lambert at mac.com (Paul Lambert) Date: Wed, 15 Jan 2014 14:15:09 +0000 Subject: [Linux-cluster] RHCS guest cluster on OracleVM In-Reply-To: References: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> Message-ID: <42AB554F-763E-4BA4-B6E5-E0FA0503DB7B@mac.com> any fencing method that will prevail OracleVM hosts OEL6 Linux guests running RHCS > On 15 Jan 2014, at 14:03, Vishesh kumar wrote: > > Let us know which sort of fencing you want to use in clustering here. > > -- > Regards, > Vishesh Kumar > http://linuxmantra.com > >> On Wed, Jan 15, 2014 at 7:23 PM, Paul Lambert wrote: >> Don't ask why - I know its crazy but ... >> Can anyone assist in configuring fencing of RHCS cluster nodes on OracleVM. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Regards, > Vishesh Kumar > http://linuxmantra.com > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Wed Jan 15 14:29:22 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 15 Jan 2014 15:29:22 +0100 Subject: [Linux-cluster] RHCS guest cluster on OracleVM In-Reply-To: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> References: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> Message-ID: you can use fence_xvm, in internet you can find many how-to https://alteeve.ca/w/Fencing_KVM_Virtual_Servers 2014/1/15 Paul Lambert > Don't ask why - I know its crazy but ... > Can anyone assist in configuring fencing of RHCS cluster nodes on OracleVM. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From linuxtovishesh at gmail.com Wed Jan 15 14:39:00 2014 From: linuxtovishesh at gmail.com (Vishesh kumar) Date: Wed, 15 Jan 2014 20:09:00 +0530 Subject: [Linux-cluster] RHCS guest cluster on OracleVM In-Reply-To: References: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> Message-ID: I guess fence_xvm works with KVM only. Correct me if I am wrong. Thanks On Wed, Jan 15, 2014 at 7:59 PM, emmanuel segura wrote: > you can use fence_xvm, in internet you can find many how-to > https://alteeve.ca/w/Fencing_KVM_Virtual_Servers > > > 2014/1/15 Paul Lambert > >> Don't ask why - I know its crazy but ... 
>> Can anyone assist in configuring fencing of RHCS cluster nodes on >> OracleVM. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Regards, Vishesh Kumar http://linuxmantra.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.lambert at mac.com Wed Jan 15 14:43:26 2014 From: paul.lambert at mac.com (Paul Lambert) Date: Wed, 15 Jan 2014 14:43:26 +0000 Subject: [Linux-cluster] RHCS guest cluster on OracleVM In-Reply-To: References: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> Message-ID: <6CB9F284-CBBB-4259-837A-253F9061A893@mac.com> Yes I have fence_xvm as a possible solution in mind thanks. > On 15 Jan 2014, at 14:29, emmanuel segura wrote: > > you can use fence_xvm, in internet you can find many how-to https://alteeve.ca/w/Fencing_KVM_Virtual_Servers > > > 2014/1/15 Paul Lambert >> Don't ask why - I know its crazy but ... >> Can anyone assist in configuring fencing of RHCS cluster nodes on OracleVM. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Wed Jan 15 14:54:00 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 15 Jan 2014 15:54:00 +0100 Subject: [Linux-cluster] RHCS guest cluster on OracleVM In-Reply-To: <6CB9F284-CBBB-4259-837A-253F9061A893@mac.com> References: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> <6CB9F284-CBBB-4259-837A-253F9061A893@mac.com> Message-ID: I think fence_xvm works with xen too http://miao5.blogspot.co.uk/2008/12/xen-fencing-in-rhel5.html 2014/1/15 Paul Lambert > Yes I have fence_xvm as a possible solution in mind thanks. > > On 15 Jan 2014, at 14:29, emmanuel segura wrote: > > you can use fence_xvm, in internet you can find many how-to > https://alteeve.ca/w/Fencing_KVM_Virtual_Servers > > > 2014/1/15 Paul Lambert > >> Don't ask why - I know its crazy but ... >> Can anyone assist in configuring fencing of RHCS cluster nodes on >> OracleVM. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... 
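If the OracleVM dom0 can run fence_virtd (the host-side daemon that fence_xvm talks to) with a backend that can see the guests, for example the libvirt backend, then the guest-side configuration is small. A sketch of the relevant cluster.conf fragment, using a hypothetical guest name "guest01" and the fence_virt default key location:

<clusternode name="guest01" nodeid="1">
    <fence>
        <method name="virt">
            <device name="xvm" port="guest01"/>
        </method>
    </fence>
</clusternode>
...
<fencedevices>
    <fencedevice name="xvm" agent="fence_xvm" key_file="/etc/cluster/fence_xvm.key"/>
</fencedevices>

Running "fence_xvm -o list" inside a guest should show the domains the host advertises, and "fence_xvm -o reboot -H guest01" is a quick end-to-end test (it really will reboot that guest).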
URL: From emi2fast at gmail.com Wed Jan 15 15:01:07 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 15 Jan 2014 16:01:07 +0100 Subject: [Linux-cluster] RHCS guest cluster on OracleVM In-Reply-To: References: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> <6CB9F284-CBBB-4259-837A-253F9061A893@mac.com> Message-ID: Sorry for the spamming, but this link convert kvm and xen https://www.google.it/url?sa=t&rct=j&q=&esrc=s&source=web&cd=6&ved=0CGQQFjAF&url=https%3A%2F%2Ffedorahosted.org%2Fcluster%2Fwiki%2FXVM_FencingConfig%3Fformat%3Dtxt&ei=NKLWUr-4DaSW0AX-uIHgCg&usg=AFQjCNE91XrUNPD4sG26isgSv9FLuQPcdQ&sig2=V_5hWR8Q4QwtIz43UI9a6A&cad=rja 2014/1/15 emmanuel segura > I think fence_xvm works with xen too > http://miao5.blogspot.co.uk/2008/12/xen-fencing-in-rhel5.html > > > 2014/1/15 Paul Lambert > >> Yes I have fence_xvm as a possible solution in mind thanks. >> >> On 15 Jan 2014, at 14:29, emmanuel segura wrote: >> >> you can use fence_xvm, in internet you can find many how-to >> https://alteeve.ca/w/Fencing_KVM_Virtual_Servers >> >> >> 2014/1/15 Paul Lambert >> >>> Don't ask why - I know its crazy but ... >>> Can anyone assist in configuring fencing of RHCS cluster nodes on >>> OracleVM. >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> >> -- >> esta es mi vida e me la vivo hasta que dios quiera >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From teigland at redhat.com Wed Jan 15 16:06:45 2014 From: teigland at redhat.com (David Teigland) Date: Wed, 15 Jan 2014 10:06:45 -0600 Subject: [Linux-cluster] HaveState flag in cman_tool status In-Reply-To: <20140115.171646.323221883839177210.yamato@redhat.com> References: <20140115.171646.323221883839177210.yamato@redhat.com> Message-ID: <20140115160645.GA7285@redhat.com> On Wed, Jan 15, 2014 at 05:16:46PM +0900, Masatake YAMATO wrote: > Hi, > > I read "What is the "Dirty" flag that cman_tool shows, and should I be worried?" > in https://fedorahosted.org/cluster/wiki/FAQ/CMAN to know when the flag is shown. > > In my understanding cman_tool status doesn't report HaveState on RHEL6 > because our cluster product uses cpg instead of groupd. Is my understanding > correct? Yes, in RHEL6, cluster daemons use cpg directly, not groupd. More to the point, the cluster daemons detect and deal with the merging of cluster partitions themselves. If you have a cluster merge, you'll see log messages saying things like: "daemon node N kill due to stateful merge" "telling cman to remove nodeid N from cluster" Dave From lists at alteeve.ca Wed Jan 15 18:34:35 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 15 Jan 2014 13:34:35 -0500 Subject: [Linux-cluster] RHCS guest cluster on OracleVM In-Reply-To: References: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> Message-ID: <52D6D4BB.4020901@alteeve.ca> I really need to finish that... consider that tutorial a "work in progress" please. Yes, fence_xvm is specific to libvirtd/virsh. If you can use 'virsh' to manager oracleVM, then it will work. 
If it's a unique hypervisor though, then you will need to write a fence agent to handle it. I know nothing about OracleVM though, so I am only guessing. On the good side, if there is no existing fence agent, they're fairly easy to write. I could answer any specific questions if this was needed, I've written a few myself (and that's an indication of how easy it is, as I'm a sysadmin, not a programmer :) ). digimer On 15/01/14 09:29 AM, emmanuel segura wrote: > you can use fence_xvm, in internet you can find many how-to > https://alteeve.ca/w/Fencing_KVM_Virtual_Servers > > > 2014/1/15 Paul Lambert > > > Don't ask why - I know its crazy but ... > Can anyone assist in configuring fencing of RHCS cluster nodes on > OracleVM. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From emi2fast at gmail.com Thu Jan 16 10:21:10 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Thu, 16 Jan 2014 11:21:10 +0100 Subject: [Linux-cluster] RHCS guest cluster on OracleVM In-Reply-To: <52D6D4BB.4020901@alteeve.ca> References: <9DCFF8BD-7613-426E-97EE-9D2AE16BAAA9@mac.com> <52D6D4BB.4020901@alteeve.ca> Message-ID: i know OracleVM is based on xen, maybe i'm wrong 2014/1/15 Digimer > I really need to finish that... consider that tutorial a "work in > progress" please. > > Yes, fence_xvm is specific to libvirtd/virsh. If you can use 'virsh' to > manager oracleVM, then it will work. > > If it's a unique hypervisor though, then you will need to write a fence > agent to handle it. I know nothing about OracleVM though, so I am only > guessing. > > On the good side, if there is no existing fence agent, they're fairly easy > to write. I could answer any specific questions if this was needed, I've > written a few myself (and that's an indication of how easy it is, as I'm a > sysadmin, not a programmer :) ). > > digimer > > > On 15/01/14 09:29 AM, emmanuel segura wrote: > >> you can use fence_xvm, in internet you can find many how-to >> https://alteeve.ca/w/Fencing_KVM_Virtual_Servers >> >> >> 2014/1/15 Paul Lambert > >> >> >> >> Don't ask why - I know its crazy but ... >> Can anyone assist in configuring fencing of RHCS cluster nodes on >> OracleVM. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> >> >> -- >> esta es mi vida e me la vivo hasta que dios quiera >> >> >> > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben at zentrix.be Thu Jan 16 17:04:18 2014 From: ben at zentrix.be (Benjamin Budts) Date: Thu, 16 Jan 2014 18:04:18 +0100 Subject: [Linux-cluster] 2 node cluster questions Message-ID: <002801cf12dc$fe8d1e40$fba75ac0$@zentrix.be> Hey All, About my setup : I created a cluster in luci with shared storage, added 2 nodes (reachable), added 2 fence devices (idrac). Now, I can't seem to add my fence devices to my nodes. If I click on a node it times out in the gui. 
So I checked my node logs and found in /var/log/messages ricci hangs on /etc/init.d/clvmd status When I run the command manually I get the PID and it hangs. I also seem to have a split brain config : # clustat on node 1 shows : node 1 Online,Local Node 2 Offline # clustat on node 2 shows : node 1 offline Node 2 online, Local My next step was testing if my multicast was working correctly (I suspect it isn't). Would you guys have any recommendations besides the following redhat multicast test link ? : Https://access.redhat.com/site/articles/22304 Some info about my systems : ---------------------------------- . Redhat 6.5 on 2 nodes with resilient storage & HA addon licenses . Mgmt. station redhat 6.5 with Luci Thx a lot -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Thu Jan 16 17:17:11 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Thu, 16 Jan 2014 18:17:11 +0100 Subject: [Linux-cluster] 2 node cluster questions In-Reply-To: <002801cf12dc$fe8d1e40$fba75ac0$@zentrix.be> References: <002801cf12dc$fe8d1e40$fba75ac0$@zentrix.be> Message-ID: If you think your problem is the multicast, try to use unicast https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s1-unicast-traffic-CA.html, if you are using redhat 6.X 2014/1/16 Benjamin Budts > > > Hey All, > > > > > > About my setup : > > > > I created a cluster in luci with shared storage, added 2 nodes > (reachable), added 2 fence devices (idrac). > > > > Now, I can?t seem to add my fence devices to my nodes. If I click on a > node it times out in the gui. > > > > So I checked my node logs and found in /var/log/messages ricci hangs on > /etc/init.d/clvmd status > > > > When I run the command manually I get the PID and it hangs? > > > > > > I also seem to have a split brain config : > > > > # clustat on node 1 shows : node 1 Online,Local > > Node 2 Offline > > > > # clustat on node 2 shows : node 1 offline > > Node 2 online, Local > > > > > > My next step was testing if my multicast was working correctly (I suspect > it isn?t). Would you guys have any recommendations besides the following > redhat multicast test link ? : > > > > *Https://access.redhat.com/site/articles/22304 > * > > > > > > > > Some info about my systems : > > ---------------------------------- > > > > ? Redhat 6.5 on 2 nodes with resilient storage & HA addon licenses > > ? Mgmt. station redhat 6.5 with Luci > > > > > > > > Thx a lot > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From jplorier at gmail.com Fri Jan 17 11:10:42 2014 From: jplorier at gmail.com (Juan Pablo Lorier) Date: Fri, 17 Jan 2014 09:10:42 -0200 Subject: [Linux-cluster] Is it gfs2 able to perform on a 24TB partition? In-Reply-To: References: Message-ID: <52D90FB2.4060906@gmail.com> Hi, I've been using gfs2 on top of a 24 TB lvm volume used as a file server for several month now and I see a lot of io related to glock_workqueue when there's file transfers. The threads even get to be the top ones in read ops in iotop. Is that normal? how can I debugg this? Regards, From swhiteho at redhat.com Fri Jan 17 11:16:28 2014 From: swhiteho at redhat.com (Steven Whitehouse) Date: Fri, 17 Jan 2014 11:16:28 +0000 Subject: [Linux-cluster] Is it gfs2 able to perform on a 24TB partition? 
In-Reply-To: <52D90FB2.4060906@gmail.com> References: <52D90FB2.4060906@gmail.com> Message-ID: <1389957388.2734.20.camel@menhir> Hi, On Fri, 2014-01-17 at 09:10 -0200, Juan Pablo Lorier wrote: > Hi, > > I've been using gfs2 on top of a 24 TB lvm volume used as a file server > for several month now and I see a lot of io related to glock_workqueue > when there's file transfers. The threads even get to be the top ones in > read ops in iotop. > Is that normal? how can I debugg this? > Regards, > Well the glock_workqueue threads are running the internal state machine that controls the caching of data. So depending on the workload in question, yes it is expected that you'll see a fair amount of activity there. If the issue is that you are seeing the cache being used inefficiently, then it maybe possible to improve that by looking at the i/o pattern generated by the application. There are also other things which can make a difference, such as setting noatime on the mount. You don't mention what version of the kernel you are using, but more recent kernels have better tools (such as tracepoints) which can assist in debugging this kind of issue, Steve. From ben at zentrix.be Fri Jan 17 14:01:22 2014 From: ben at zentrix.be (Benjamin Budts) Date: Fri, 17 Jan 2014 15:01:22 +0100 Subject: [Linux-cluster] 2 node cluster questions In-Reply-To: <52D8FAD7.5030707@redhat.com> References: <002801cf12dc$fe8d1e40$fba75ac0$@zentrix.be> <52D8FAD7.5030707@redhat.com> Message-ID: <005e01cf138c$9a981100$cfc83300$@zentrix.be> Found my problem When running cman_tool status I saw that each node only saw 1 node. I have 2 pairs of network on each machine (heartbeat network 1gbit multicast enabled/application network 10Gbit) with their own host mapping in /etc/hosts I used the the non-heartbeat hostname to add nodes into cluster via luci which caused issues. Removed luci & /var/lib/luci on nodes, as well as cat /dev/null /etc/cluster/cluster.conf Reinstalled luci, created cluster, added 2 nodes with correct hostnames, problem solved. Could anyone point me to a good guide for using cluster lvm ? Thx From: Elvir Kuric [mailto:ekuric at redhat.com] Sent: vrijdag 17 januari 2014 10:42 To: ben at zentrix.be Subject: Re: [Linux-cluster] 2 node cluster questions On 01/16/2014 06:04 PM, Benjamin Budts wrote: Hey All, About my setup : I created a cluster in luci with shared storage, added 2 nodes (reachable), added 2 fence devices (idrac). Now, I can't seem to add my fence devices to my nodes. If I click on a node it times out in the gui. So I checked my node logs and found in /var/log/messages ricci hangs on /etc/init.d/clvmd status When I run the command manually I get the PID and it hangs. I also seem to have a split brain config : # clustat on node 1 shows : node 1 Online,Local Node 2 Offline # clustat on node 2 shows : node 1 offline Node 2 online, Local My next step was testing if my multicast was working correctly (I suspect it isn't). Would you guys have any recommendations besides the following redhat multicast test link ? : Https://access.redhat.com/site/articles/22304 Some info about my systems : ---------------------------------- . Redhat 6.5 on 2 nodes with resilient storage & HA addon licenses . Mgmt. station redhat 6.5 with Luci Thx a lot Hard to say what issue is based on this without checking fully logs, can you please open case via Red Hat customer portal https://access.redhat.com ( I assume you have valid subscription ) so we can take deeper look into this problem. 
Thank you in advance, Kind regards, -- Elvir Kuric,Sr. TSE / Red Hat / GSS EMEA / -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Fri Jan 17 14:27:42 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Fri, 17 Jan 2014 15:27:42 +0100 Subject: [Linux-cluster] 2 node cluster questions In-Reply-To: <005e01cf138c$9a981100$cfc83300$@zentrix.be> References: <002801cf12dc$fe8d1e40$fba75ac0$@zentrix.be> <52D8FAD7.5030707@redhat.com> <005e01cf138c$9a981100$cfc83300$@zentrix.be> Message-ID: https://access.redhat.com/site/documentation/it-IT/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/LVM_Cluster_Overview.html 2014/1/17 Benjamin Budts > > > Found my problem > > > > When running cman_tool status I saw that each node only saw 1 node. > > I have 2 pairs of network on each machine (heartbeat network 1gbit > multicast enabled/application network 10Gbit) with their own host mapping > in /etc/hosts > > > > I used the the non-heartbeat hostname to add nodes into cluster via luci > which caused issues. > > > > Removed luci & /var/lib/luci on nodes, as well as cat /dev/null > /etc/cluster/cluster.conf > > Reinstalled luci, created cluster, added 2 nodes with correct hostnames, > problem solved. > > > > Could anyone point me to a good guide for using cluster lvm ? > > > > Thx > > > > > > > > *From:* Elvir Kuric [mailto:ekuric at redhat.com] > *Sent:* vrijdag 17 januari 2014 10:42 > *To:* ben at zentrix.be > *Subject:* Re: [Linux-cluster] 2 node cluster questions > > > > On 01/16/2014 06:04 PM, Benjamin Budts wrote: > > > > Hey All, > > > > > > About my setup : > > > > I created a cluster in luci with shared storage, added 2 nodes > (reachable), added 2 fence devices (idrac). > > > > Now, I can?t seem to add my fence devices to my nodes. If I click on a > node it times out in the gui. > > > > So I checked my node logs and found in /var/log/messages ricci hangs on > /etc/init.d/clvmd status > > > > When I run the command manually I get the PID and it hangs? > > > > > > I also seem to have a split brain config : > > > > # clustat on node 1 shows : node 1 Online,Local > > Node 2 Offline > > > > # clustat on node 2 shows : node 1 offline > > Node 2 online, Local > > > > > > My next step was testing if my multicast was working correctly (I suspect > it isn?t). Would you guys have any recommendations besides the following > redhat multicast test link ? : > > > > *Https://access.redhat.com/site/articles/22304 > * > > > > > > > > Some info about my systems : > > ---------------------------------- > > > > ? Redhat 6.5 on 2 nodes with resilient storage & HA addon licenses > > ? Mgmt. station redhat 6.5 with Luci > > > > > > > > Thx a lot > > > > Hard to say what issue is based on this without checking fully logs, can > you please open case via Red Hat customer portal https://access.redhat.com( I assume you have valid subscription ) so we can take deeper look into > this problem. > > Thank you in advance, > > Kind regards, > > > > -- > > Elvir Kuric,Sr. TSE / Red Hat / GSS EMEA / > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... 
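To go with the documentation link above, the usual RHEL 6 sequence for bringing up clustered LVM is short. A sketch, where the shared device path and the volume names are placeholders:

# On every node: switch LVM to cluster-wide locking (locking_type = 3)
lvmconf --enable-cluster

# clvmd (from the lvm2-cluster package) needs cman running first
service cman start && service clvmd start
chkconfig cman on
chkconfig clvmd on

# Create the volume group as clustered (-cy) on the shared LUN, then carve LVs
vgcreate -cy cluster_vg /dev/mapper/shared_lun
lvcreate -n data_lv -L 100G cluster_vg

After that the volume group is visible on every member and logical volume activation is coordinated through the DLM instead of by hand.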
URL: From jplorier at gmail.com Fri Jan 17 15:12:38 2014 From: jplorier at gmail.com (Juan Pablo Lorier) Date: Fri, 17 Jan 2014 13:12:38 -0200 Subject: [Linux-cluster] Is it gfs2 able to perform on a 24TB partition? In-Reply-To: <1389957388.2734.20.camel@menhir> References: <52D90FB2.4060906@gmail.com> <1389957388.2734.20.camel@menhir> Message-ID: <52D94866.3090505@gmail.com> Hi Steve, Thanks for the reply. I can change fstab to add noatime option. The kernel I'm running is 2.6.32-358.23.2.el6.x86_64 (centos 6.5) so don't know if it's new enough. The services running are vsftp, samba4 and rsync, they all serve files for different users. I'll have to do some studying to learn how to find out the cache usage of those services. Regards, From vijay.240385 at gmail.com Fri Jan 17 17:34:33 2014 From: vijay.240385 at gmail.com (vijay singh) Date: Fri, 17 Jan 2014 23:04:33 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 117, Issue 6 In-Reply-To: References: Message-ID: Hi, According to the clustat status it seems there is some issue with the multi casting. Plz check the multicast using omping On Jan 17, 2014 8:04 PM, wrote: > Send Linux-cluster mailing list submissions to > linux-cluster at redhat.com > > To subscribe or unsubscribe via the World Wide Web, visit > https://www.redhat.com/mailman/listinfo/linux-cluster > or, via email, send a message with subject or body 'help' to > linux-cluster-request at redhat.com > > You can reach the person managing the list at > linux-cluster-owner at redhat.com > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Linux-cluster digest..." > > > Today's Topics: > > 1. 2 node cluster questions (Benjamin Budts) > 2. Re: 2 node cluster questions (emmanuel segura) > 3. Is it gfs2 able to perform on a 24TB partition? > (Juan Pablo Lorier) > 4. Re: Is it gfs2 able to perform on a 24TB partition? > (Steven Whitehouse) > 5. Re: 2 node cluster questions (Benjamin Budts) > 6. Re: 2 node cluster questions (emmanuel segura) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Thu, 16 Jan 2014 18:04:18 +0100 > From: "Benjamin Budts" > To: > Subject: [Linux-cluster] 2 node cluster questions > Message-ID: <002801cf12dc$fe8d1e40$fba75ac0$@zentrix.be> > Content-Type: text/plain; charset="us-ascii" > > > > Hey All, > > > > > > About my setup : > > > > I created a cluster in luci with shared storage, added 2 nodes (reachable), > added 2 fence devices (idrac). > > > > Now, I can't seem to add my fence devices to my nodes. If I click on a node > it times out in the gui. > > > > So I checked my node logs and found in /var/log/messages ricci hangs on > /etc/init.d/clvmd status > > > > When I run the command manually I get the PID and it hangs. > > > > > > I also seem to have a split brain config : > > > > # clustat on node 1 shows : node 1 Online,Local > > Node 2 Offline > > > > # clustat on node 2 shows : node 1 offline > > Node 2 online, Local > > > > > > My next step was testing if my multicast was working correctly (I suspect > it > isn't). Would you guys have any recommendations besides the following > redhat > multicast test link ? : > > > > Https://access.redhat.com/site/articles/22304 > > > > > > > > Some info about my systems : > > ---------------------------------- > > > > . Redhat 6.5 on 2 nodes with resilient storage & HA addon licenses > > . Mgmt. 
[... remainder of the quoted digest trimmed; it repeats, verbatim, the messages already shown above ...]
station redhat 6.5 with Luci > > > > > > > > > > > > > > > > Thx a lot > > > > > > > > Hard to say what issue is based on this without checking fully logs, can > > you please open case via Red Hat customer portal > https://access.redhat.com( I assume you have valid subscription ) so we > can take deeper look into > > this problem. > > > > Thank you in advance, > > > > Kind regards, > > > > > > > > -- > > > > Elvir Kuric,Sr. TSE / Red Hat / GSS EMEA / > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > https://www.redhat.com/archives/linux-cluster/attachments/20140117/6fc1d7e7/attachment.html > > > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 117, Issue 6 > ********************************************* > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yamato at redhat.com Mon Jan 20 09:01:27 2014 From: yamato at redhat.com (Masatake YAMATO) Date: Mon, 20 Jan 2014 18:01:27 +0900 (JST) Subject: [Linux-cluster] HaveState flag in cman_tool status In-Reply-To: <20140115160645.GA7285@redhat.com> References: <20140115.171646.323221883839177210.yamato@redhat.com> <20140115160645.GA7285@redhat.com> Message-ID: <20140120.180127.1495679642863598913.yamato@redhat.com> Thank you. Masatake YAMATO > On Wed, Jan 15, 2014 at 05:16:46PM +0900, Masatake YAMATO wrote: >> Hi, >> >> I read "What is the "Dirty" flag that cman_tool shows, and should I be worried?" >> in https://fedorahosted.org/cluster/wiki/FAQ/CMAN to know when the flag is shown. >> >> In my understanding cman_tool status doesn't report HaveState on RHEL6 >> because our cluster product uses cpg instead of groupd. Is my understanding >> correct? > > Yes, in RHEL6, cluster daemons use cpg directly, not groupd. > More to the point, the cluster daemons detect and deal with the > merging of cluster partitions themselves. If you have a cluster > merge, you'll see log messages saying things like: > "daemon node N kill due to stateful merge" > "telling cman to remove nodeid N from cluster" > > Dave From info at innova-studios.com Mon Jan 20 10:21:52 2014 From: info at innova-studios.com (=?iso-8859-1?Q?J=FCrgen_Ladst=E4tter?=) Date: Mon, 20 Jan 2014 11:21:52 +0100 Subject: [Linux-cluster] GFS2 and kernel 2.6.32-431 Message-ID: <3f6601cf15c9$70935fb0$51ba1f10$@innova-studios.com> Hey guys, is anyone running gfs2 with kernel version 2.6.32-431 yet? 358 was unusable due to bugs, 279 was working quite well. Anyone tested the new 431? Is it stable enough for a productive environment? Thanks for your input, J?rgen -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben at zentrix.be Tue Jan 21 08:33:07 2014 From: ben at zentrix.be (Benjamin Budts) Date: Tue, 21 Jan 2014 09:33:07 +0100 Subject: [Linux-cluster] need advice on configuring service groups Message-ID: <000c01cf1683$6979cab0$3c6d6010$@zentrix.be> Hey guys, I?d like your advice on configuring my defined resources optimally as well as the last man standing configuration (howto somewhere ?) In a nutshell : ? I have 2 nodes (with a quorum device) ? I use redhat 6.5 with luci ? 
It?s a coldfailover setup, so only 1 node will have the floating ip + all services The question I have is basically, should I configure 1 global service group that contains all of the stuff I described down here ? Only 1 node is supposed to run all the services. At the moment I even configured the clustered enabled volume groups as resources, as well as their filesystem. It seems to work, but a bit excessive no ? Configuring the filesystems as resources, might that not be enough ? I will have to install the application on both nodes because of libraries etc... But I?m a bit confused on how to present the application lun. Should I present 2 separate application luns to each node or install the application on both nodes and only share the 1st nodes application lun to both nodes. Thx for clearing this up guys Here some info : Service 1 ------------ ? Samba + vg_share ( -c y ) + ext4 filesystem coming from a netapp lun ? lun visible to both nodes, multipathing enabled Service 2 ----------- ? Tomcat, a custom database, some applications (all dependencies of one another to make application work) ? 3 vg?s for : db index, db archive logs, application logs ? all visible to both nodes, vg?s have been created with ?c y and all have ext4 FS -------------- next part -------------- An HTML attachment was scrubbed... URL: From alain.moulle at bull.net Tue Jan 21 12:16:36 2014 From: alain.moulle at bull.net (=?ISO-8859-1?Q?Moull=E9_Alain?=) Date: Tue, 21 Jan 2014 13:16:36 +0100 Subject: [Linux-cluster] Question about gfs2 and Option 4 of HA stack Message-ID: <52DE6524.90300@bull.net> Hi, As far as I know the final stack option 4 (with Pacemaker and quorum API of corosync instead of cman) will be the stack delivered on RHEL7, always right ? So my question is about GFS2 : will it be working together with this last stack ? (knowing that GFS2 was "a little" tied with cman) Thanks Alain From swhiteho at redhat.com Tue Jan 21 12:28:10 2014 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 21 Jan 2014 12:28:10 +0000 Subject: [Linux-cluster] Question about gfs2 and Option 4 of HA stack In-Reply-To: <52DE6524.90300@bull.net> References: <52DE6524.90300@bull.net> Message-ID: <1390307290.2741.15.camel@menhir> Hi, On Tue, 2014-01-21 at 13:16 +0100, Moull? Alain wrote: > Hi, > > As far as I know the final stack option 4 (with Pacemaker and quorum API > of corosync instead of cman) will be the stack delivered on RHEL7, > always right ? > > So my question is about GFS2 : will it be working together with this > last stack ? (knowing that GFS2 was "a little" tied with cman) > > Thanks > Alain > There is nothing cman specific about GFS2. It can work with pcs quite happily, and that is how current upstream/Fedora works these days, Steve. From teigland at redhat.com Tue Jan 21 17:13:21 2014 From: teigland at redhat.com (David Teigland) Date: Tue, 21 Jan 2014 11:13:21 -0600 Subject: [Linux-cluster] Question about gfs2 and Option 4 of HA stack In-Reply-To: <1390307290.2741.15.camel@menhir> References: <52DE6524.90300@bull.net> <1390307290.2741.15.camel@menhir> Message-ID: <20140121171321.GA9144@redhat.com> On Tue, Jan 21, 2014 at 12:28:10PM +0000, Steven Whitehouse wrote: > Hi, > > On Tue, 2014-01-21 at 13:16 +0100, Moull? Alain wrote: > > Hi, > > > > As far as I know the final stack option 4 (with Pacemaker and quorum API > > of corosync instead of cman) will be the stack delivered on RHEL7, > > always right ? > > > > So my question is about GFS2 : will it be working together with this > > last stack ? 
(knowing that GFS2 was "a little" tied with cman) > > > > Thanks > > Alain > > > > There is nothing cman specific about GFS2. It can work with pcs quite > happily, and that is how current upstream/Fedora works these days, Upstream also happily works more independently, as shown here: http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt Dave From alain.moulle at bull.net Wed Jan 22 07:41:20 2014 From: alain.moulle at bull.net (=?ISO-8859-1?Q?Moull=E9_Alain?=) Date: Wed, 22 Jan 2014 08:41:20 +0100 Subject: [Linux-cluster] Question about gfs2 and Option 4 of HA stack In-Reply-To: <20140121171321.GA9144@redhat.com> References: <52DE6524.90300@bull.net> <1390307290.2741.15.camel@menhir> <20140121171321.GA9144@redhat.com> Message-ID: <52DF7620.1060700@bull.net> Hi OK but except if I miss something in the responses ... my question was more about managing GFS2 stack and FS as Pacemaker resources , no problem at all ? Thanks again Alain On Tue, Jan 21, 2014 at 12:28:10PM +0000, Steven Whitehouse wrote: >> Hi, >> >> On Tue, 2014-01-21 at 13:16 +0100, Moull? Alain wrote: >>> Hi, >>> >>> As far as I know the final stack option 4 (with Pacemaker and quorum API >>> of corosync instead of cman) will be the stack delivered on RHEL7, >>> always right ? >>> >>> So my question is about GFS2 : will it be working together with this >>> last stack ? (knowing that GFS2 was "a little" tied with cman) >>> >>> Thanks >>> Alain >>> >> There is nothing cman specific about GFS2. It can work with pcs quite >> happily, and that is how current upstream/Fedora works these days, > Upstream also happily works more independently, as shown here: > http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt > > Dave From gq at cs.msu.su Fri Jan 24 14:27:01 2014 From: gq at cs.msu.su (Alexander GQ Gerasiov) Date: Fri, 24 Jan 2014 18:27:01 +0400 Subject: [Linux-cluster] Running clvm on part of the cluster Message-ID: <20140124182701.5915bfd1@snail> //Second try, because previous mail was hold at moderation I think. //Excuse me if there will be dups. Hello there. I need your help to solve some issues I met. I use redhat-cluster (as part of Proxmox VE) in our virtualization environment. I have several servers with SAN storage attached and CLVM managed volumes. In general it works. Today I had to attach one more box to Proxmox instance and found blocking issue: this node joined cluster, proxmosFS started and everything is ok with this host. But it does not have SAN connection, so I didn't start CLVM on it. And when I try to do some lvm-related work on other host I got "clvmd not running on node " and locking failed. Ok, I thought, and started CLVM on that host... and locking still fails with Error locking on node : Volume group for uuid not found: So my question is: How to handle situation, when only part of cluster nodes has access to particular LUN, but need to run CLVM and use CLVM locking over it? -- Best regards, Alexander. From emi2fast at gmail.com Fri Jan 24 15:28:37 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Fri, 24 Jan 2014 16:28:37 +0100 Subject: [Linux-cluster] Running clvm on part of the cluster In-Reply-To: <20140124182701.5915bfd1@snail> References: <20140124182701.5915bfd1@snail> Message-ID: are you sure you can add a node in redhat cluster without reboot the whole cluster? use cman_tool status on old nodes 2014/1/24 Alexander GQ Gerasiov > //Second try, because previous mail was hold at moderation I think. > //Excuse me if there will be dups. > > Hello there. 
> > I need your help to solve some issues I met. > > I use redhat-cluster (as part of Proxmox VE) in our virtualization > environment. > > I have several servers with SAN storage attached and CLVM managed > volumes. In general it works. > > Today I had to attach one more box to Proxmox instance and found > blocking issue: > > this node joined cluster, proxmosFS started and everything is ok with > this host. But it does not have SAN connection, so I didn't start CLVM > on it. And when I try to do some lvm-related work on other host I got > "clvmd not running on node " > and locking failed. > > Ok, I thought, and started CLVM on that host... > > and locking still fails with > Error locking on node : Volume group for uuid not > found: > > > So my question is: > > How to handle situation, when only part of cluster nodes has access to > particular LUN, but need to run CLVM and use CLVM locking over it? > > > -- > Best regards, Alexander. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From bubble at hoster-ok.com Sat Jan 25 22:34:04 2014 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Sun, 26 Jan 2014 01:34:04 +0300 Subject: [Linux-cluster] Running clvm on part of the cluster In-Reply-To: <20140124182701.5915bfd1@snail> References: <20140124182701.5915bfd1@snail> Message-ID: <52E43BDC.3010808@hoster-ok.com> 24.01.2014 17:27, Alexander GQ Gerasiov wrote: > //Second try, because previous mail was hold at moderation I think. > //Excuse me if there will be dups. > > Hello there. > > I need your help to solve some issues I met. > > I use redhat-cluster (as part of Proxmox VE) in our virtualization > environment. > > I have several servers with SAN storage attached and CLVM managed > volumes. In general it works. > > Today I had to attach one more box to Proxmox instance and found > blocking issue: > > this node joined cluster, proxmosFS started and everything is ok with > this host. But it does not have SAN connection, so I didn't start CLVM > on it. And when I try to do some lvm-related work on other host I got > "clvmd not running on node " > and locking failed. > > Ok, I thought, and started CLVM on that host... > > and locking still fails with > Error locking on node : Volume group for uuid not > found: > > > So my question is: > > How to handle situation, when only part of cluster nodes has access to > particular LUN, but need to run CLVM and use CLVM locking over it? > > I think this is possible only with corosync driver which has commit from Christine Caulfield dated 2013-09-23 (https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=431eda63cc0ebff7c62dacb313cabcffbda6573a). In all other cases you have to run clvmd on all cluster nodes. I may misread that commit, but I do not have any problems putting pacemaker node to standby (clvmd is managed as a cluster resource) after it, although it was hell to do that before: lvm is stuck until second node in a two-node cluster returns back. From ben at zentrix.be Tue Jan 28 12:41:25 2014 From: ben at zentrix.be (Benjamin Budts) Date: Tue, 28 Jan 2014 13:41:25 +0100 Subject: [Linux-cluster] executing /usr/bin/virsh nodeinfo but have physical servers ? Message-ID: <000a01cf1c26$424bc2b0$c6e34810$@zentrix.be> Gents, I have a cluster, working fine with 2 physical machines & a luci mgmt. station. 
I see a lot of ricci : executing '/usr/bin/virsh nodeinfo' in /var/log/messages If I execute this command manually I get : Error : Failed to reconnect to the hypervisor Error: no valid connection Error: internal error Unable to locate libvirtd daemon in /usr/sbin Now. I'm not running hypervisors, everything is physical. How do I get rid of this ? And why is luci trying to check this if my servers are physical ? Thx a lot B -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Tue Jan 28 12:53:21 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Tue, 28 Jan 2014 13:53:21 +0100 Subject: [Linux-cluster] executing /usr/bin/virsh nodeinfo but have physical servers ? In-Reply-To: <000a01cf1c26$424bc2b0$c6e34810$@zentrix.be> References: <000a01cf1c26$424bc2b0$c6e34810$@zentrix.be> Message-ID: /etc/init.d/libvirtd status 2014-01-28 Benjamin Budts > > > Gents, > > > > I have a cluster, working fine with 2 physical machines & a luci mgmt. > station. > > > > I see a lot of ricci : executing '/usr/bin/virsh nodeinfo' in > /var/log/messages > > > > If I execute this command manually I get : > > > > Error : Failed to reconnect to the hypervisor > > Error: no valid connection > > Error: internal error Unable to locate libvirtd daemon in /usr/sbin > > > > Now... I'm not running hypervisors, everything is physical. How do I get rid > of this ? And why is luci trying to check this if my servers are physical ? > > > > Thx a lot > > B > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben at zentrix.be Tue Jan 28 13:27:29 2014 From: ben at zentrix.be (Benjamin Budts) Date: Tue, 28 Jan 2014 14:27:29 +0100 Subject: [Linux-cluster] fencing debugging ipmilan/idrac Message-ID: <004901cf1c2c$b194ee20$14beca60$@zentrix.be> Gents, How can I see the exact command that is sent my ipmilan/idrac interface from the active node trying to fence the other one ? I have a DELL R610, tried configuring idrac as fencing but I get an error, now I tried via ipmilan but still errors. Webinterface works fine, telnet is not enabled but ssh is. Increased debug loglevel of fenced, but still can't see the exact command that is sent to fence JAN 28 12 :13 :04 fenced Fencing node2.test.local JAN 28 12 :13 :09 fenced fence node2.test.local dev 0/0 agent fence_ipmilan result: error from agent Thx Ben -------------- next part -------------- An HTML attachment was scrubbed... URL: From dpant at intracom-telecom.com Tue Jan 28 14:28:21 2014 From: dpant at intracom-telecom.com (Demetres Pantermalis) Date: Tue, 28 Jan 2014 16:28:21 +0200 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. Message-ID: <52E7BE85.1030207@intracom-telecom.com> Hello. We have a strange situation with rgmanager on a two node (active/passive) cluster, on physical servers. Nodes are N1 and N2. Out investigation is to simulate how the cluster will react after a kernel panic of the active node. Kernel panic is "simulated" with echo b > /proc/sysrq-trigger Fence agent used: fence_scsi Common storage is: EMC (with powerPath installed on both nodes) The scenarios are: S1. Active node is N1 passive mode is N2 After kernel panic of N1, the N2 resumes the services previously run on N1 (Expected behavior). N1 re-boots and after a while re-joins the cluster. 
(Expected behavior) S2. Now, active node is N2. We perform a kernel panic on N2. N1 resumes (correctly) the services previously run on N2. After the reboot of N2, cman starts OK (with all other processes), as well as clvmd. But the rgmanager process seems to hang and in 'ps' it appears three times (the normal is two). The logs from N2 show: rgmanager[4985]: I am node #2 rgmanager[4985]: Resource Group Manager Starting rgmanager[4985]: Loading Service Data ps -ef | grep rgmanager root 4983 1 0 15:19 ? 00:00:00 rgmanager root 4985 4983 0 15:19 ? 00:00:00 rgmanager root 5118 4985 0 15:19 ? 00:00:00 rgmanager Versions: rgmanager-3.0.12.1-19.el6.x86_64 cman-3.0.12.1-59.el6.x86_64 corosync-1.4.1-17.el6.x86_64 fence-agents-3.1.5-35.el6.x86_64 clusterlib-3.0.12.1-59.el6.x86_64 lvm2-cluster-2.02.87-6.el6.x86_64 Any help appreciated. Demetres. From lists at alteeve.ca Tue Jan 28 14:48:33 2014 From: lists at alteeve.ca (Digimer) Date: Tue, 28 Jan 2014 09:48:33 -0500 Subject: [Linux-cluster] fencing debugging ipmilan/idrac In-Reply-To: <004901cf1c2c$b194ee20$14beca60$@zentrix.be> References: <004901cf1c2c$b194ee20$14beca60$@zentrix.be> Message-ID: <52E7C341.9000600@alteeve.ca> On 28/01/14 08:27 AM, Benjamin Budts wrote: > Gents, > > How can I see the exact command that is sent my ipmilan/idrac interface > from the active node trying to fence the other one ? > > I have a DELL R610, tried configuring idrac as fencing but I get an > error, now I tried via ipmilan but still errors? > > Webinterface works fine, telnet is not enabled but ssh is. > > Increased debug loglevel of fenced, but still can?t see the exact > command that is sent to fence > > JAN 28 12 :13 :04 fenced Fencing node2.test.local > > JAN 28 12 :13 :09 fenced fence node2.test.local dev 0/0 agent > fence_ipmilan result: error from agent > > Thx > > Ben If you paste your cluster config, it's pretty easy to convert that to the command line call. The attributes have matching command line switches (the man page should cover them). As for Dell, I believe you need to add an attribute to tell it what prompt string to look for (not 100% on that though). If you can sort out the 'fence_ipmilan .... -o status' command that works, we can help you fairly easily convert it to the cluster config version. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From emi2fast at gmail.com Tue Jan 28 16:14:35 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Tue, 28 Jan 2014 17:14:35 +0100 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. In-Reply-To: <52E7BE85.1030207@intracom-telecom.com> References: <52E7BE85.1030207@intracom-telecom.com> Message-ID: no cluster config, no logs :( 2014-01-28 Demetres Pantermalis > Hello. > > We have a strange situation with rgmanager on a two node > (active/passive) cluster, on physical servers. > Nodes are N1 and N2. > Out investigation is to simulate how the cluster will react after a > kernel panic of the active node. > Kernel panic is "simulated" with echo b > /proc/sysrq-trigger > Fence agent used: fence_scsi > Common storage is: EMC (with powerPath installed on both nodes) > > The scenarios are: > S1. Active node is N1 passive mode is N2 > After kernel panic of N1, the N2 resumes the services previously run on > N1 (Expected behavior). > N1 re-boots and after a while re-joins the cluster. (Expected behavior) > > S2. Now, active node is N2. > We perform a kernel panic on N2. 
N1 resumes (correctly) the services > previously run on N2. > After the reboot of N2, cman starts OK (with all other processes), as > well as clvmd. > But the rgmanager process seems to hang and in 'ps' it appears three > times (the normal is two). > The logs from N2 show: > rgmanager[4985]: I am node #2 > rgmanager[4985]: Resource Group Manager Starting > rgmanager[4985]: Loading Service Data > > ps -ef | grep rgmanager > root 4983 1 0 15:19 ? 00:00:00 rgmanager > root 4985 4983 0 15:19 ? 00:00:00 rgmanager > root 5118 4985 0 15:19 ? 00:00:00 rgmanager > > Versions: > rgmanager-3.0.12.1-19.el6.x86_64 > cman-3.0.12.1-59.el6.x86_64 > corosync-1.4.1-17.el6.x86_64 > fence-agents-3.1.5-35.el6.x86_64 > clusterlib-3.0.12.1-59.el6.x86_64 > lvm2-cluster-2.02.87-6.el6.x86_64 > > > Any help appreciated. > Demetres. > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From dan131riley at gmail.com Wed Jan 29 00:58:16 2014 From: dan131riley at gmail.com (Dan Riley) Date: Tue, 28 Jan 2014 19:58:16 -0500 Subject: [Linux-cluster] GFS2 and kernel 2.6.32-431 In-Reply-To: <3f6601cf15c9$70935fb0$51ba1f10$@innova-studios.com> References: <3f6601cf15c9$70935fb0$51ba1f10$@innova-studios.com> Message-ID: <07305588-C449-4808-9E02-FB3AF6FACF2F@gmail.com> On Jan 20, 2014, at 5:21 AM, J?rgen Ladst?tter wrote: > is anyone running gfs2 with kernel version 2.6.32-431 yet? 358 was unusable due to bugs, 279 was working quite well. Anyone tested the new 431? Is it stable enough for a productive environment? 358 actually works okay for us, so YMMV. 431 we found to be unusable under heavy load--any kind of backup-like activity would cause large load fluctuations, high glock_workqueue activity, and frequent fencings. The one cluster where we rely heavily on GFS2 lasted 5 days (with ~8 fencings) on 431 before we backed off to 358. Dunno when we'll have the time to investigate further, we do need a better simulation of our production loads on our test cluster. -dan From dpant at intracom-telecom.com Wed Jan 29 10:36:59 2014 From: dpant at intracom-telecom.com (Demetres Pantermalis) Date: Wed, 29 Jan 2014 12:36:59 +0200 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. In-Reply-To: References: <52E7BE85.1030207@intracom-telecom.com> Message-ID: <52E8D9CB.7060302@intracom-telecom.com> Please find attached the cluster.conf file and the relevant logs from both servers. There are two scenarios executed: 1) From 11:48:00 till 11:55 (This is a normal/expected situation) app01 is active. Kernel panic at 11:48:00 app02 resumes normally the service app01 re-joins the cluster at 11:50:00 Kernel panic on app02 at 11:50:45 app01 starts normally the service app02 re-joins the cluster correctly 2) From 11:55:30 till end (This is where the problem appear) app01 is active. 
Kernel panic at 11:55:30 app02 resumes normally the service app01 re-joins the cluster at 11:57:07 Manually migrate the service to app01 at 11:58:40 Service start normally on app01 kernel panic on app01 at 12:00:35 service resumes normally on app02 app01 re-joins the cluster at 12:02:09 After that, the clustat output on node app02 is: Cluster Status for par_clu @ Wed Jan 29 12:30:46 2014 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ adr-par-app01-hb 1 Online adr-par-app02-hb 2 Online, Local, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:sv-CPAR adr-par-app02-hb started and on node app01 is: Cluster Status for par_clu @ Wed Jan 29 12:30:43 2014 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ adr-par-app01-hb 1 Online, Local adr-par-app02-hb 2 Online The output of "ps -ef | grep rgmanager" on node app01 is: root 4034 1 0 12:02 ? 00:00:00 rgmanager root 4036 4034 0 12:02 ? 00:00:00 rgmanager root 4175 4036 0 12:02 ? 00:00:00 rgmanager The problem is that rgmanager is not active anymore on node app01. As a workaround, killing the last process (pid 4175) resumes the rgmanager without restart. Thanks for your help. BR, Demetres -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: text/xml Size: 1766 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages_app01.txt.gz Type: application/x-gzip Size: 42096 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages_app02.txt.gz Type: application/x-gzip Size: 15480 bytes Desc: not available URL: From fdinitto at redhat.com Wed Jan 29 11:29:47 2014 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 29 Jan 2014 12:29:47 +0100 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. In-Reply-To: <52E8D9CB.7060302@intracom-telecom.com> References: <52E7BE85.1030207@intracom-telecom.com> <52E8D9CB.7060302@intracom-telecom.com> Message-ID: <52E8E62B.6020808@redhat.com> On 1/29/2014 11:36 AM, Demetres Pantermalis wrote: > Please find attached the cluster.conf file and the relevant logs from > both servers. I didn?t have a chance to look at the whole thread but please be aware that: is something we do not support or test at Red Hat. It might work or not (tho i am doubt it?s related to the problem you see). Fabio From emi2fast at gmail.com Wed Jan 29 11:57:54 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 29 Jan 2014 12:57:54 +0100 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. In-Reply-To: <52E8E62B.6020808@redhat.com> References: <52E7BE85.1030207@intracom-telecom.com> <52E8D9CB.7060302@intracom-telecom.com> <52E8E62B.6020808@redhat.com> Message-ID: the only thing i this is, fence_scsi works on dm and from your log i saw many " reservation conflict" 2014-01-29 Fabio M. Di Nitto > On 1/29/2014 11:36 AM, Demetres Pantermalis wrote: > > Please find attached the cluster.conf file and the relevant logs from > > both servers. > > > I didn?t have a chance to look at the whole thread but please be aware > that: > > > logfile="/var/log/cluster/fence_scsi.log" name="FenceSCSI"/> > > > is something we do not support or test at Red Hat. It might work or not > (tho i am doubt it?s related to the problem you see). 
> > Fabio > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From dpant at intracom-telecom.com Wed Jan 29 11:58:16 2014 From: dpant at intracom-telecom.com (Demetres Pantermalis) Date: Wed, 29 Jan 2014 13:58:16 +0200 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. In-Reply-To: <52E8E62B.6020808@redhat.com> References: <52E7BE85.1030207@intracom-telecom.com> <52E8D9CB.7060302@intracom-telecom.com> <52E8E62B.6020808@redhat.com> Message-ID: <52E8ECD8.50009@intracom-telecom.com> Hello Fabio, I was under the impression that this is supported, because there is extensive documentation and reference in redhat site. In any case, thank you for the information. Could you please suggest alternative ways, using the common storage as a fence device (for example do you support /storage-based death (/sbd)??? BR, Demetres From emi2fast at gmail.com Wed Jan 29 11:58:48 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 29 Jan 2014 12:58:48 +0100 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. In-Reply-To: References: <52E7BE85.1030207@intracom-telecom.com> <52E8D9CB.7060302@intracom-telecom.com> <52E8E62B.6020808@redhat.com> Message-ID: the only thing i know is, fence_scsi works on dm and from your log i saw many " reservation conflict" 2014-01-29 emmanuel segura > the only thing i this is, fence_scsi works on dm and from your log i saw > many " > > reservation conflict" > > > > 2014-01-29 Fabio M. Di Nitto > > On 1/29/2014 11:36 AM, Demetres Pantermalis wrote: >> > Please find attached the cluster.conf file and the relevant logs from >> > both servers. >> >> >> I didn?t have a chance to look at the whole thread but please be aware >> that: >> >> >> > logfile="/var/log/cluster/fence_scsi.log" name="FenceSCSI"/> >> >> >> is something we do not support or test at Red Hat. It might work or not >> (tho i am doubt it?s related to the problem you see). >> >> Fabio >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Wed Jan 29 12:04:19 2014 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 29 Jan 2014 13:04:19 +0100 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. In-Reply-To: <52E8ECD8.50009@intracom-telecom.com> References: <52E7BE85.1030207@intracom-telecom.com> <52E8D9CB.7060302@intracom-telecom.com> <52E8E62B.6020808@redhat.com> <52E8ECD8.50009@intracom-telecom.com> Message-ID: <52E8EE43.8030600@redhat.com> On 1/29/2014 12:58 PM, Demetres Pantermalis wrote: > Hello Fabio, > > I was under the impression that this is supported, because there is > extensive documentation and reference in redhat site. Maybe I should have been more clear, but it?s the fence_scsi on emcpower path that is untested/unsupportedd. Both taken separately are. It might very well work, but I can?t sign it off :) > In any case, thank you for the information. 
> Could you please suggest alternative ways, using the common storage as a > fence device (for example do you support /storage-based death (/sbd)??? you can try fence_sanlock, but again, untested over emcpowerpath (Tech Preview in RHEL). https://alteeve.ca/w/Watchdog_Recovery has a good tutorial. If you are deploying with RHEL (and not Centos) then I suggest you also contact GSS to have a review of your cluster architecture and make sure there are no other caveats in that setup. Fabio From swhiteho at redhat.com Wed Jan 29 12:05:52 2014 From: swhiteho at redhat.com (Steven Whitehouse) Date: Wed, 29 Jan 2014 12:05:52 +0000 Subject: [Linux-cluster] GFS2 and kernel 2.6.32-431 In-Reply-To: <07305588-C449-4808-9E02-FB3AF6FACF2F@gmail.com> References: <3f6601cf15c9$70935fb0$51ba1f10$@innova-studios.com> <07305588-C449-4808-9E02-FB3AF6FACF2F@gmail.com> Message-ID: <1390997152.2729.9.camel@menhir> Hi, On Tue, 2014-01-28 at 19:58 -0500, Dan Riley wrote: > On Jan 20, 2014, at 5:21 AM, J?rgen Ladst?tter wrote: > > > is anyone running gfs2 with kernel version 2.6.32-431 yet? 358 was unusable due to bugs, 279 was working quite well. Anyone tested the new 431? Is it stable enough for a productive environment? > > 358 actually works okay for us, so YMMV. 431 we found to be unusable under heavy load--any kind of backup-like activity would cause large load fluctuations, high glock_workqueue activity, and frequent fencings. The one cluster where we rely heavily on GFS2 lasted 5 days (with ~8 fencings) on 431 before we backed off to 358. Dunno when we'll have the time to investigate further, we do need a better simulation of our production loads on our test cluster. > > -dan > > Do let us know what you find... I'm not aware of any issues relating to changes in behaviour under heavy load in recent kernels, other than an improvement in some specific cases. If you have a Red Hat support contract then please do contact our support team in the first instance, since they should be able to assist in resolving this kind of thing, Steve. From info at innova-studios.com Wed Jan 29 12:13:18 2014 From: info at innova-studios.com (=?UTF-8?Q?J=C3=BCrgen_Ladst=C3=A4tter?=) Date: Wed, 29 Jan 2014 13:13:18 +0100 Subject: [Linux-cluster] GFS2 and kernel 2.6.32-431 In-Reply-To: <1390997152.2729.9.camel@menhir> References: <3f6601cf15c9$70935fb0$51ba1f10$@innova-studios.com> <07305588-C449-4808-9E02-FB3AF6FACF2F@gmail.com> <1390997152.2729.9.camel@menhir> Message-ID: <122d01cf1ceb$7fc171b0$7f445510$@innova-studios.com> Hi Steve, we're going to start our tests today by switch some nodes to the new kernel version. I'll keep you updated. Juergen -----Urspr?ngliche Nachricht----- Von: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von Steven Whitehouse Gesendet: Mittwoch, 29. J?nner 2014 13:06 An: Dan Riley Cc: linux clustering Betreff: Re: [Linux-cluster] GFS2 and kernel 2.6.32-431 Hi, On Tue, 2014-01-28 at 19:58 -0500, Dan Riley wrote: > On Jan 20, 2014, at 5:21 AM, J?rgen Ladst?tter wrote: > > > is anyone running gfs2 with kernel version 2.6.32-431 yet? 358 was unusable due to bugs, 279 was working quite well. Anyone tested the new 431? Is it stable enough for a productive environment? > > 358 actually works okay for us, so YMMV. 431 we found to be unusable under heavy load--any kind of backup-like activity would cause large load fluctuations, high glock_workqueue activity, and frequent fencings. 
The one cluster where we rely heavily on GFS2 lasted 5 days (with ~8 fencings) on 431 before we backed off to 358. Dunno when we'll have the time to investigate further, we do need a better simulation of our production loads on our test cluster. > > -dan > > Do let us know what you find... I'm not aware of any issues relating to changes in behaviour under heavy load in recent kernels, other than an improvement in some specific cases. If you have a Red Hat support contract then please do contact our support team in the first instance, since they should be able to assist in resolving this kind of thing, Steve. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From dpant at intracom-telecom.com Wed Jan 29 13:32:25 2014 From: dpant at intracom-telecom.com (Demetres Pantermalis) Date: Wed, 29 Jan 2014 15:32:25 +0200 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. In-Reply-To: <52E8ECD8.50009@intracom-telecom.com> References: <52E7BE85.1030207@intracom-telecom.com> <52E8D9CB.7060302@intracom-telecom.com> <52E8E62B.6020808@redhat.com> <52E8ECD8.50009@intracom-telecom.com> Message-ID: <52E902E9.80400@intracom-telecom.com> Hello Fabio, regarding the support from RH of scsi_fence with powerpath, please have a look at https://access.redhat.com/site/articles/40112 (which is Updated November 21 2013 at 2:16 AM) Paragraph 4. Limitations: First bullet: Multipath devices can be used with fence_scsi. But they must be dm-multipath or powerpath devices. No other types of multipath devices are currently supported. This clearly states it is supported. BR, Demetres From nicolas.kukolja at gmail.com Wed Jan 29 15:14:41 2014 From: nicolas.kukolja at gmail.com (Nicolas Kukolja) Date: Wed, 29 Jan 2014 16:14:41 +0100 Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? Message-ID: Hello, I have a cluster with three nodes (rhel 5.5) and every server has an ipmilan-module configured as fencing device in my cluster-config. Now, if one of the nodes is not reachable and its fencing device is not reachable, too, then the other two nodes try to fence this node again and again... without stopping it. Only when this node is reachable (& fenceable) again, the fencing proceeds sucessfully and the cluster service moves to another node. Why does the service not move to another node earlier? I think, its a common error scenario, that one node and its fencing device are not reachable maybe due to power problems e.g. How do I have to change the cluster configuration to retrieve my expected behaviour? Thanks in advance for any suggestions... Kind regards, Nicolas -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Wed Jan 29 15:25:55 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 29 Jan 2014 16:25:55 +0100 Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? In-Reply-To: References: Message-ID: i think is really hard help you with this kind of infomartion 2014-01-29 Nicolas Kukolja > Hello, > > I have a cluster with three nodes (rhel 5.5) and every server has an > ipmilan-module configured as fencing device in my cluster-config. > Now, if one of the nodes is not reachable and its fencing device is not > reachable, too, then the other two nodes try to fence this node again and > again... without stopping it. 
> > Only when this node is reachable (& fenceable) again, the fencing proceeds > sucessfully and the cluster service moves to another node. > > Why does the service not move to another node earlier? I think, its a > common error scenario, that one node and its fencing device are not > reachable maybe due to power problems e.g. > How do I have to change the cluster configuration to retrieve my expected > behaviour? > > Thanks in advance for any suggestions... > > Kind regards, > Nicolas > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From nicolas.kukolja at gmail.com Wed Jan 29 15:34:54 2014 From: nicolas.kukolja at gmail.com (Nicolas Kukolja) Date: Wed, 29 Jan 2014 16:34:54 +0100 Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? In-Reply-To: References: Message-ID: What information do think you will need to be able to help me? I'll try to provide whatever may be needed... Nicolas 2014-01-29 emmanuel segura > i think is really hard help you with this kind of infomartion > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nicolas.kukolja at gmail.com Wed Jan 29 15:41:47 2014 From: nicolas.kukolja at gmail.com (Nicolas Kukolja) Date: Wed, 29 Jan 2014 15:41:47 +0000 (UTC) Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? References: Message-ID: emmanuel segura gmail.com> writes: > > > i think is really hard help you with this kind of infomartion What information do think you will need to be able to understand my problem/help? I'll try to provide whatever may be needed... Nicolas From lists at alteeve.ca Wed Jan 29 15:43:48 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 29 Jan 2014 10:43:48 -0500 Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? In-Reply-To: References: Message-ID: <52E921B4.60302@alteeve.ca> On 29/01/14 10:14 AM, Nicolas Kukolja wrote: > Hello, > > I have a cluster with three nodes (rhel 5.5) and every server has an > ipmilan-module configured as fencing device in my cluster-config. > Now, if one of the nodes is not reachable and its fencing device is not > reachable, too, then the other two nodes try to fence this node again > and again... without stopping it. > > Only when this node is reachable (& fenceable) again, the fencing > proceeds sucessfully and the cluster service moves to another node. > > Why does the service not move to another node earlier? I think, its a > common error scenario, that one node and its fencing device are not > reachable maybe due to power problems e.g. > How do I have to change the cluster configuration to retrieve my > expected behaviour? > > Thanks in advance for any suggestions... > > Kind regards, > Nicolas This behaviour is expected and by design. The healthy nodes can't safely recover until they know what state the lost node is in. The cluster is not allowed to simply assume that the lost node is dead (no way to tell "disconnected but working" from "smouldering pile of rubble"). The way I deal with this is a second fence method. I use a pair of switched PDUs behind each node (one PDU for the first PSU in each node and the second PDU for the second PSU in each node). 
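As a rough illustration of that layered setup, a minimal cluster.conf fence section could look like the sketch below; every node name, device name, address, port and credential in it is a placeholder, not taken from anyone's actual configuration:

  <clusternode name="node1.example.com" nodeid="1">
    <fence>
      <!-- primary method: IPMI/iDRAC, tried first -->
      <method name="ipmi">
        <device name="ipmi_n1" action="reboot"/>
      </method>
      <!-- backup method: cut both power rails via the switched PDUs, then power back on -->
      <method name="pdu">
        <device name="pdu_a" port="1" action="off"/>
        <device name="pdu_b" port="1" action="off"/>
        <device name="pdu_a" port="1" action="on"/>
        <device name="pdu_b" port="1" action="on"/>
      </method>
    </fence>
  </clusternode>
  <!-- second node declared the same way, pointing at its own IPMI device and PDU ports -->

  <fencedevices>
    <fencedevice agent="fence_ipmilan" name="ipmi_n1" ipaddr="10.20.0.1" login="admin" passwd="secret" lanplus="1"/>
    <fencedevice agent="fence_apc" name="pdu_a" ipaddr="10.20.0.3" login="apc" passwd="secret"/>
    <fencedevice agent="fence_apc" name="pdu_b" ipaddr="10.20.0.4" login="apc" passwd="secret"/>
  </fencedevices>

fenced works through the methods in order, so the "pdu" method is only attempted after the "ipmi" method has failed, and a method only succeeds when every device line inside it succeeds.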
This way, if IPMI fencing fails, the nodes will connect to the PDUs and cut the power to the lost node, thus ensuring it's off and allowing prompt recovery of services. This might help: * https://alteeve.ca/w/AN!Cluster_Tutorial_2#Why_Switched_PDUs.3F * https://alteeve.ca/w/AN!Cluster_Tutorial_2#A_Map.21 * https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_the_Fence_Devices Cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From emi2fast at gmail.com Wed Jan 29 16:00:47 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 29 Jan 2014 17:00:47 +0100 Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? In-Reply-To: References: Message-ID: as Digimer told, maybe a fencing had failed, but just say i got a cluster with a service that doesn't switch to other nodes after a fencing happen, without info and cluster log and without conf, is not good 2014-01-29 Nicolas Kukolja > emmanuel segura gmail.com> writes: > > > > > > > i think is really hard help you with this kind of infomartion > > > What information do think you will need to be able to understand my > problem/help? > I'll try to provide whatever may be needed... > > Nicolas > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Wed Jan 29 16:53:41 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 29 Jan 2014 11:53:41 -0500 Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? In-Reply-To: References: Message-ID: <52E93215.6030600@alteeve.ca> 99% of the time, I agree totally. Logs and configs are super helpful. In this case though, I am pretty sure I know exactly what's happening. :) digimer On 29/01/14 11:00 AM, emmanuel segura wrote: > as Digimer told, maybe a fencing had failed, but just say i got a > cluster with a service that doesn't switch to other nodes after a > fencing happen, without info and cluster log and without conf, is not good > > > 2014-01-29 Nicolas Kukolja > > > emmanuel segura gmail.com > writes: > > > > > > > i think is really hard help you with this kind of infomartion > > > What information do think you will need to be able to understand my > problem/help? > I'll try to provide whatever may be needed... > > Nicolas > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From nicolas.kukolja at gmail.com Wed Jan 29 17:42:13 2014 From: nicolas.kukolja at gmail.com (Nicolas Kukolja) Date: Wed, 29 Jan 2014 17:42:13 +0000 (UTC) Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? References: <52E93215.6030600@alteeve.ca> Message-ID: Digimer alteeve.ca> writes: > > 99% of the time, I agree totally. Logs and configs are super helpful. In > this case though, I am pretty sure I know exactly what's happening. :) > > digimer Thanks for the explanation, digimer. You got exactly what I mean an what happens. Unfortunately, that was, what I was afraid of... 
The three nodes in my scenario are located about 200km from each other. If one of the nodes with all infrastructure around it (PDUs, Switches, IPMI...) is not reachable any longer because of a power outage or a full network outage at this location, switching a PDU is not possible, too... That would mean, that in this (very probably) case, the cluster will not help me? Do you have any suggestions, what I can do to workaround this case? Kind regards, Nicolas From lists at alteeve.ca Wed Jan 29 17:46:44 2014 From: lists at alteeve.ca (Digimer) Date: Wed, 29 Jan 2014 12:46:44 -0500 Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? In-Reply-To: References: <52E93215.6030600@alteeve.ca> Message-ID: <52E93E84.3070500@alteeve.ca> On 29/01/14 12:42 PM, Nicolas Kukolja wrote: > Digimer alteeve.ca> writes: > >> >> 99% of the time, I agree totally. Logs and configs are super helpful. In >> this case though, I am pretty sure I know exactly what's happening. :) >> >> digimer > > Thanks for the explanation, digimer. You got exactly what I mean an what > happens. Unfortunately, that was, what I was afraid of... > > The three nodes in my scenario are located about 200km from each other. > If one of the nodes with all infrastructure around it (PDUs, Switches, > IPMI...) is not reachable any longer because of a power outage or a full > network outage at this location, switching a PDU is not possible, too... > > That would mean, that in this (very probably) case, the cluster will not > help me? > > Do you have any suggestions, what I can do to workaround this case? > > Kind regards, > Nicolas And this is the fundamental problem of stretch/geo-clusters. I am loath to recommend this, because it's soooo easy to screw it up in the heat of the moment, so please only ever do this after you are 100% sure the other node is dead; If you log into the 2 remaining nodes that are blocked (because of the inability to fence), you can type 'fence_ack_manual'. That will tell the cluster that you have manually confirmed the lost node is powered off. Again, USE THIS VERY CAREFULLY! It's tempting to make assumptions when you've got users and managers yelling at you to get services back up. So much so that Red Hat dropped 'fence_manual' entirely in RHEL 6 because it was too easy to blow things up. I can not stress it enough just how critical it is that you confirm that the remote location is truly off before doing this. If it's still on and you clear the fence action, then really bad things could happen when the link returns. digimer -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From fdinitto at redhat.com Thu Jan 30 05:16:33 2014 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 30 Jan 2014 06:16:33 +0100 Subject: [Linux-cluster] Multiple "rgmanager" instances after re-booting from a kernel panic. In-Reply-To: <52E902E9.80400@intracom-telecom.com> References: <52E7BE85.1030207@intracom-telecom.com> <52E8D9CB.7060302@intracom-telecom.com> <52E8E62B.6020808@redhat.com> <52E8ECD8.50009@intracom-telecom.com> <52E902E9.80400@intracom-telecom.com> Message-ID: <52E9E031.6070905@redhat.com> On 01/29/2014 02:32 PM, Demetres Pantermalis wrote: > Hello Fabio, > > regarding the support from RH of scsi_fence with powerpath, please have > a look at > https://access.redhat.com/site/articles/40112 (which is Updated November > 21 2013 at 2:16 AM) > > Paragraph 4. 
Limitations: > First bullet: > Multipath devices can be used with fence_scsi. But they must be > dm-multipath or powerpath devices. No other types of multipath devices > are currently supported. > > > This clearly states it is supported. feh, ok i'll look at the doc and see if it was added by mistake. Or my memory is corrupted :) Thanks Fabio From nicolas.kukolja at gmail.com Thu Jan 30 12:00:29 2014 From: nicolas.kukolja at gmail.com (Nicolas Kukolja) Date: Thu, 30 Jan 2014 12:00:29 +0000 (UTC) Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer? References: <52E93215.6030600@alteeve.ca> <52E93E84.3070500@alteeve.ca> Message-ID: Digimer alteeve.ca> writes: > And this is the fundamental problem of stretch/geo-clusters. > > I am loath to recommend this, because it's soooo easy to screw it up in > the heat of the moment, so please only ever do this after you are 100% > sure the other node is dead; > > If you log into the 2 remaining nodes that are blocked (because of the > inability to fence), you can type 'fence_ack_manual'. That will tell the > cluster that you have manually confirmed the lost node is powered off. > > Again, USE THIS VERY CAREFULLY! > > It's tempting to make assumptions when you've got users and managers > yelling at you to get services back up. So much so that Red Hat dropped > 'fence_manual' entirely in RHEL 6 because it was too easy to blow things > up. I can not stress it enough just how critical it is that you confirm > that the remote location is truly off before doing this. If it's still > on and you clear the fence action, then really bad things could happen > when the link returns. > > digimer Thanks a lot for your support and explanations... So I will try to explain it to my stakeholders... One little question is still in my mind: If in a three nodes scenario one node is not reachable and fencable, but two other nodes are still alive and able to communicate to each other, where is the risc of a "split-brain" situation? The "lost" third node will, if it is still running but not accessable from the others, disable the service because it has no contact to any other nodes, right? So if two nodes are connected, isn't it guaranteed, that the third node is no longer providing the service? Kind regards, Nicolas From ben at zentrix.be Thu Jan 30 13:51:36 2014 From: ben at zentrix.be (Benjamin Budts) Date: Thu, 30 Jan 2014 14:51:36 +0100 Subject: [Linux-cluster] fencing debugging ipmilan/idrac In-Reply-To: <52E7C341.9000600@alteeve.ca> References: <004901cf1c2c$b194ee20$14beca60$@zentrix.be> <52E7C341.9000600@alteeve.ca> Message-ID: <006c01cf1dc2$65043830$2f0ca890$@zentrix.be> Gents, I didn't debug in the end, but got it working by disabling LANPLUS option in Luci -> Fencing devices -> iDRAC, less secure but it's a private vlan anyway. Thx -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer Sent: dinsdag 28 januari 2014 15:49 To: linux clustering Subject: Re: [Linux-cluster] fencing debugging ipmilan/idrac On 28/01/14 08:27 AM, Benjamin Budts wrote: > Gents, > > How can I see the exact command that is sent my ipmilan/idrac > interface from the active node trying to fence the other one ? > > I have a DELL R610, tried configuring idrac as fencing but I get an > error, now I tried via ipmilan but still errors. > > Webinterface works fine, telnet is not enabled but ssh is. 
> > Increased debug loglevel of fenced, but still can't see the exact > command that is sent to fence > > JAN 28 12 :13 :04 fenced Fencing node2.test.local > > JAN 28 12 :13 :09 fenced fence node2.test.local dev 0/0 agent > fence_ipmilan result: error from agent > > Thx > > Ben If you paste your cluster config, it's pretty easy to convert that to the command line call. The attributes have matching command line switches (the man page should cover them). As for Dell, I believe you need to add an attribute to tell it what prompt string to look for (not 100% on that though). If you can sort out the 'fence_ipmilan .... -o status' command that works, we can help you fairly easily convert it to the cluster config version. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From info at innova-studios.com Thu Jan 30 10:17:40 2014 From: info at innova-studios.com (=?UTF-8?Q?J=C3=BCrgen_Ladst=C3=A4tter?=) Date: Thu, 30 Jan 2014 11:17:40 +0100 Subject: [Linux-cluster] GFS2 and kernel 2.6.32-431 In-Reply-To: <122d01cf1ceb$7fc171b0$7f445510$@innova-studios.com> References: <3f6601cf15c9$70935fb0$51ba1f10$@innova-studios.com> <07305588-C449-4808-9E02-FB3AF6FACF2F@gmail.com> <1390997152.2729.9.camel@menhir> <122d01cf1ceb$7fc171b0$7f445510$@innova-studios.com> Message-ID: <00a701cf1da4$834b35a0$89e1a0e0$@innova-studios.com> Hi Steve, 1 node in our 5 Node cluster is now running 431 for 18 hours. No real load difference so far, no io_wait difference, stable. I'll keep you updated, but as it looks, we're going to switch all nodes to 431 next week. Juergen -----Urspr?ngliche Nachricht----- Von: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von J?rgen Ladst?tter Gesendet: Mittwoch, 29. J?nner 2014 13:13 An: 'linux clustering' Betreff: Re: [Linux-cluster] GFS2 and kernel 2.6.32-431 Hi Steve, we're going to start our tests today by switch some nodes to the new kernel version. I'll keep you updated. Juergen -----Urspr?ngliche Nachricht----- Von: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] Im Auftrag von Steven Whitehouse Gesendet: Mittwoch, 29. J?nner 2014 13:06 An: Dan Riley Cc: linux clustering Betreff: Re: [Linux-cluster] GFS2 and kernel 2.6.32-431 Hi, On Tue, 2014-01-28 at 19:58 -0500, Dan Riley wrote: > On Jan 20, 2014, at 5:21 AM, J?rgen Ladst?tter wrote: > > > is anyone running gfs2 with kernel version 2.6.32-431 yet? 358 was unusable due to bugs, 279 was working quite well. Anyone tested the new 431? Is it stable enough for a productive environment? > > 358 actually works okay for us, so YMMV. 431 we found to be unusable under heavy load--any kind of backup-like activity would cause large load fluctuations, high glock_workqueue activity, and frequent fencings. The one cluster where we rely heavily on GFS2 lasted 5 days (with ~8 fencings) on 431 before we backed off to 358. Dunno when we'll have the time to investigate further, we do need a better simulation of our production loads on our test cluster. > > -dan > > Do let us know what you find... I'm not aware of any issues relating to changes in behaviour under heavy load in recent kernels, other than an improvement in some specific cases. 
If you have a Red Hat support contract then please do contact our support team in the first instance, since they should be able to assist in resolving this kind of thing.

Steve.

-- 
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From lists at alteeve.ca Thu Jan 30 17:08:15 2014
From: lists at alteeve.ca (Digimer)
Date: Thu, 30 Jan 2014 12:08:15 -0500
Subject: [Linux-cluster] Clusterbehaviour if one node is not reachable & fenceable any longer?
In-Reply-To:
References: <52E93215.6030600@alteeve.ca> <52E93E84.3070500@alteeve.ca>
Message-ID: <52EA86FF.30404@alteeve.ca>

On 30/01/14 07:00 AM, Nicolas Kukolja wrote:
> Digimer <lists at alteeve.ca> writes:
>
>> And this is the fundamental problem of stretch/geo-clusters.
>>
>> I am loath to recommend this, because it's soooo easy to screw it up in
>> the heat of the moment, so please only ever do this after you are 100%
>> sure the other node is dead;
>>
>> If you log into the 2 remaining nodes that are blocked (because of the
>> inability to fence), you can type 'fence_ack_manual'. That will tell the
>> cluster that you have manually confirmed the lost node is powered off.
>>
>> Again, USE THIS VERY CAREFULLY!
>>
>> It's tempting to make assumptions when you've got users and managers
>> yelling at you to get services back up. So much so that Red Hat dropped
>> 'fence_manual' entirely in RHEL 6 because it was too easy to blow things
>> up. I can not stress it enough just how critical it is that you confirm
>> that the remote location is truly off before doing this. If it's still
>> on and you clear the fence action, then really bad things could happen
>> when the link returns.
>>
>> digimer
>
> Thanks a lot for your support and explanations... I will try to explain
> it to my stakeholders.
>
> One little question is still on my mind:
> If, in a three-node scenario, one node is neither reachable nor fenceable,
> but the two other nodes are still alive and able to communicate with each
> other, where is the risk of a "split-brain" situation?

Depending on what happened at the far end, the node could be in a state where it could try to provide or access HA services before realizing it has lost quorum. Quorum only works when the node is behaving in an expected manner. If the node isn't responding, you have to assume it has entered an undefined state, in which case quorum may or may not save you.

A classic example, though I suspect it doesn't cleanly apply here, would be a node that froze mid-write to shared storage. It's not dead, it's just hung. The other nodes decide it's dead, recover the shared FS and go about their business. At some point later, the hung node recovers, has no idea that time has passed, so it has no reason to think its locks are invalid or to re-check quorum, and it finishes the write it was in the middle of. You now have a corrupted FS.

Again, this probably doesn't map to your setup, but there are other scenarios where things can get equally messed up in the window between when a node recovers and when it realizes it has lost quorum. The only safe protection is fencing, as it puts the node into a clean state (off or a fresh boot).

> The "lost" third node will, if it is still running but not accessible from
> the others, disable the service because it has no contact with any other
> nodes, right?
> So if two nodes are connected, isn't it guaranteed that the third node is
> no longer providing the service?

Nope, the only way to guarantee that is to put the node into a known state.

Quorum == useful when nodes are in a defined state
Fencing == useful when a node is in an undefined state

hth

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
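Digimer's point above, that fencing rather than quorum is what returns an unresponsive node to a known state, is also what the fence_ipmilan thread earlier in this digest was working toward. As a rough sketch of the command-line-to-cluster.conf mapping described there (the IP address, credentials and node id below are made-up placeholders; node2.test.local is the node name from Benjamin's logs, and attribute names can vary between fence-agents releases):

    # Hypothetical manual test, run from the node that would do the fencing;
    # lanplus (-P) is left off to match the working setup reported above
    fence_ipmilan -a 10.20.30.12 -l root -p secret -o status

If that returns the power status, the same values map onto cluster.conf fragments along these lines:

    <fencedevice agent="fence_ipmilan" name="idrac_node2"
                 ipaddr="10.20.30.12" login="root" passwd="secret"/>

    <clusternode name="node2.test.local" nodeid="2">
        <fence>
            <method name="ipmi">
                <device name="idrac_node2" action="reboot"/>
            </method>
        </fence>
    </clusternode>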