From ooolinux at 163.com Tue Mar 1 02:03:44 2011 From: ooolinux at 163.com (yue) Date: Tue, 1 Mar 2011 10:03:44 +0800 (CST) Subject: [Linux-cluster] if there is no cman.ko anymore In-Reply-To: <4D6BDD25.8040506@alteeve.com> References: <4D6BDD25.8040506@alteeve.com> <3f1dd29a.50.12e6d450d86.Coremail.ooolinux@163.com> Message-ID: <7976fa5c.2b36.12e6f2823bf.Coremail.ooolinux@163.com> 1.the document says there is a cman.ko ,rhel5. link is http://www.linuxtopia.org/online_books/rhel5/rhel5_clustering_guide/rhel5_cluster_s1-ha-components-CSO.html 2. i use fedara12. cman version is cman-3.0.13-1.x86_64.rpm so i want to know when redhat do not need cman.ko anymore? 3. i am going to test gfs2+clvm. would you give me any suggestion on optimization? thanks At 2011-03-01 01:36:37?Digimer wrote: >On 02/28/2011 12:16 PM, yue wrote: >> my kernel 2.6.32,fc12 >> cman 3.0.17 >> i install cman.rpm >> but i search no cman.ko, redhat cluster can work . >> if there is not cman.ko anymore? >> >> thanks > >I can't speak to this specific version, but I can confirm that CMAN is >going away (in fact, I think it is already gone from 3.1). > >-- >Digimer >E-Mail: digimer at alteeve.com >AN!Whitepapers: http://alteeve.com >Node Assassin: http://nodeassassin.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From kawasaki at redhat.com Tue Mar 1 04:20:50 2011 From: kawasaki at redhat.com (Tatsuo Kawasaki) Date: Tue, 01 Mar 2011 13:20:50 +0900 Subject: [Linux-cluster] if there is no cman.ko anymore In-Reply-To: <7976fa5c.2b36.12e6f2823bf.Coremail.ooolinux@163.com> References: <4D6BDD25.8040506@alteeve.com> <3f1dd29a.50.12e6d450d86.Coremail.ooolinux@163.com> <7976fa5c.2b36.12e6f2823bf.Coremail.ooolinux@163.com> Message-ID: <4D6C7422.6020203@redhat.com> Hi yue, I think this is an error in the document. cman.ko kernel module is required for RHEL4 based cluster suite. ex. RHEL4.8: cman-1.0.27-1.el4.x86_64.rpm cman-kernel-2.6.9-56.7.el4_8.9.x86_64.rpm rpm -ql cman-kernel /lib/modules/2.6.9-89.0.16.EL/kernel/cluster /lib/modules/2.6.9-89.0.16.EL/kernel/cluster/cman.ko /lib/modules/2.6.9-89.0.16.EL/kernel/cluster/cman.symvers Regards, -- Tatsuo Kawasaki On 03/01/2011 11:03 AM, yue wrote: > 1.the document says there is a cman.ko ,rhel5. link is > http://www.linuxtopia.org/online_books/rhel5/rhel5_clustering_guide/rhel5_cluster_s1-ha-components-CSO.html > > 2. i use fedara12. cman version is cman-3.0.13-1.x86_64.rpm > > so i want to know when redhat do not need cman.ko anymore? > > 3. i am going to test gfs2+clvm. > would you give me any suggestion on optimization? > > thanks > > > At 2011-03-01 01:36:37??Digimer wrote: > >>On 02/28/2011 12:16 PM, yue wrote: >>> my kernel 2.6.32,fc12 >>> cman 3.0.17 >>> i install cman.rpm >>> but i search no cman.ko, redhat cluster can work . >>> if there is not cman.ko anymore? >>> >>> thanks >> >>I can't speak to this specific version, but I can confirm that CMAN is >>going away (in fact, I think it is already gone from 3.1). From ccaulfie at redhat.com Tue Mar 1 09:25:15 2011 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 01 Mar 2011 09:25:15 +0000 Subject: [Linux-cluster] if there is no cman.ko anymore In-Reply-To: <3f1dd29a.50.12e6d450d86.Coremail.ooolinux@163.com> References: <3f1dd29a.50.12e6d450d86.Coremail.ooolinux@163.com> Message-ID: <4D6CBB7B.2020801@redhat.com> On 28/02/11 17:16, yue wrote: > my kernel 2.6.32,fc12 > cman 3.0.17 > i install cman.rpm > but i search no cman.ko, redhat cluster can work . > if there is not cman.ko anymore? 
> thanks > There is no cman kernel module in RHEL5 and above. It's a module of openais/corosync and runs in userspace. Chrissie From szhargrave at ybs.co.uk Tue Mar 1 11:10:45 2011 From: szhargrave at ybs.co.uk (Simon Hargrave) Date: Tue, 1 Mar 2011 11:10:45 +0000 Subject: [Linux-cluster] lvm2-cluster not syncing correctly? In-Reply-To: <20110225085519297.00000004632@H04405> References: <20110224161352225.00000004632@H04405> <1151718125.172523.1298566173100.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <20110224173814596.00000004632@H04405> <09f63f66e569f668e219115b5e26f55f@sjolshagen.net> <20110225085519297.00000004632@H04405> Message-ID: <20110301111045292.00000001752@H04405> Please read the warning at the end of this email ________________________________________________ I just noticed the following errata released which appears to describe my problem: - http://rhn.redhat.com/errata/RHBA-2011-0288.html Suggesting that reads of metadata were no always using O_DIRECT and doing buffered reads. However, having applied this update, the symptoms persist. I'll raise this as a support call. - Simon Hargrave szhargrave at ybs.co.uk Technical Services Team Leader x2831 Yorkshire Building Society 01274 472831 http://wwwtech/sysint/tsgcore.asp ________________________________________________ This email and any attachments are confidential and may contain privileged information. If you are not the person for whom they are intended please return the email and then delete all material from any computer. You must not use the email or attachments for any purpose, nor disclose its contents to anyone other than the intended recipient. Any statements made by an individual in this email do not necessarily reflect the views of the Yorkshire Building Society Group. ________________________________________________ Yorkshire Building Society, which is authorised and regulated by the Financial Services Authority, chooses to introduce its customers to Legal & General for the purposes of advising on and arranging life assurance and investment products bearing Legal & General?s name. We are entered in the FSA Register and our FSA registration number is 106085 http://www.fsa.gov.uk/register Head Office: Yorkshire Building Society, Yorkshire House, Yorkshire Drive, Bradford, BD5 8LJ Tel: 0845 1 200 100 Visit Our Website http://www.ybs.co.uk All communications with us may be monitored/recorded to improve the quality of our service and for your protection and security. ________________________________________________________________________ This e-mail has been scanned for all viruses by Star. The service is powered by MessageLabs. For more information on a proactive anti-virus service working around the clock, around the globe, visit: http://www.star.net.uk ________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From parvez.h.shaikh at gmail.com Tue Mar 1 13:20:18 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Tue, 1 Mar 2011 18:50:18 +0530 Subject: [Linux-cluster] SNMP support with IBM Blade Center Fence Agent In-Reply-To: <20110228161406.GA14120@redhat.com> References: <20110228161406.GA14120@redhat.com> Message-ID: Hi Ryan, Thank you for response. Does it mean there is no way to intimate administrator about failure of fencing as of now? Let me give more information about my cluster - I have set of nodes in cluster with only IP resource being protected. 
I have two levels of fencing: first BladeCenter fencing, and the second one is manual fencing. At times, if a machine is already down (either power failure or turned off abruptly), BladeCenter fencing times out and manual fencing happens. At this point the administrator is expected to run fence_ack_manual. Clearly this is not desirable, as the downtime of services lasts until the administrator runs fence_ack_manual.

What is the recommended method to deal with BladeCenter fencing failure in this situation? Do I have to add another level of fencing (between BladeCenter and manual) which can fence automatically (not requiring manual intervention)?

Thanks

On Mon, Feb 28, 2011 at 9:44 PM, Ryan O'Hara wrote:
> On Mon, Feb 28, 2011 at 12:43:10PM +0530, Parvez Shaikh wrote:
> > Hi all,
> >
> > I have a question related to fence agents and SNMP alarms.
> >
> > A fence agent can fail to fence the failed node for various reasons; e.g. with
> > my BladeCenter fencing agent, I sometimes get a message saying BladeCenter
> > fencing failed because of a timeout, or because the fence device IP address/user
> > credentials are incorrect.
> >
> > In such a situation is it possible to generate an SNMP trap?
>
> This feature will be in RHEL6.1. There is a new project called
> 'foghorn' that creates SNMPv2 traps from dbus signals.
>
> git://git.fedorahosted.org/foghorn.git
>
> In RHEL6.1 (and the latest upstream release), certain cluster
> components will emit dbus signals when certain events occur. This
> includes fencing. So when a node is fenced, a dbus signal is generated
> by fenced. The foghorn service catches this signal and generates an
> SNMPv2 trap.
>
> Note that foghorn runs as an AgentX subagent, so snmpd must be running
> as the master agentx.
>
> Ryan
>
> > My cluster config file looks like below and in my case, if BladeCenter
> > fencing fails, manual fencing kicks in and requires the user to run
> > fence_ack_manual; for this the user must at least be notified via SNMP (or any
> > other mechanism?) to intervene -
> >
> > ....
> > login="USERID" name="BladeCenterFencing" passwd="PASSW0RD"/>
> > ....
> >
> > Thanks,
> > Parvez

From mailtoaneeshvs at gmail.com Tue Mar 1 13:47:44 2011
From: mailtoaneeshvs at gmail.com (aneesh vs)
Date: Tue, 1 Mar 2011 19:17:44 +0530
Subject: [Linux-cluster] lvm2-cluster not syncing correctly?
In-Reply-To: <20110301111045292.00000001752@H04405>
References: <20110224161352225.00000004632@H04405> <1151718125.172523.1298566173100.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <20110224173814596.00000004632@H04405> <09f63f66e569f668e219115b5e26f55f@sjolshagen.net> <20110225085519297.00000004632@H04405> <20110301111045292.00000001752@H04405>
Message-ID:

Hello,

Does "clvmd -R" on all nodes make any difference?
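A minimal sketch of that check, assuming the shared volume group is the vgHBPOCSHARED shown later in this thread; "clvmd -R" asks every clvmd daemon in the cluster to refresh its device cache, so run it once and then compare what each node reports:

  # on any one cluster node: tell every clvmd in the cluster to refresh its cache
  clvmd -R
  # then on each node, compare the logical volumes that are actually visible
  lvs vgHBPOCSHARED
  vgdisplay -v vgHBPOCSHARED | grep -i "LV Name"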
On Tue, Mar 1, 2011 at 4:40 PM, Simon Hargrave wrote: > Please read the warning at the end of this email > ________________________________________________ > > > I just noticed the following errata released which appears to describe my > problem: - > > http://rhn.redhat.com/errata/RHBA-2011-0288.html > > Suggesting that reads of metadata were no always using O_DIRECT and doing > buffered reads. However, having applied this update, the symptoms persist. > > I'll raise this as a support call. > > - > Simon Hargrave szhargrave at ybs.co.uk > Technical Services Team Leader x2831 > Yorkshire Building Society 01274 472831 > http://wwwtech/sysint/tsgcore.asp > > ________________________________________________ > > This email and any attachments are confidential and may contain privileged > information. > > If you are not the person for whom they are intended please return the > email and then delete all material from any computer. You must not use the > email or attachments for any purpose, nor disclose its contents to anyone > other than the intended recipient. > > Any statements made by an individual in this email do not necessarily > reflect the views of the Yorkshire Building Society Group. > > ________________________________________________ > > Yorkshire Building Society, which is authorised and regulated by the > Financial Services Authority, chooses to introduce its customers to Legal & > General for the purposes of advising on and arranging life assurance and > investment products bearing Legal & General?s name. > > > We are entered in the FSA Register and our FSA registration number is > 106085 http://www.fsa.gov.uk/register > > Head Office: Yorkshire Building Society, Yorkshire House, Yorkshire Drive, > Bradford, BD5 8LJ > Tel: 0845 1 200 100 > > Visit Our Website > http://www.ybs.co.uk > > All communications with us may be monitored/recorded to improve the quality > of our service and for your protection and security. > > > > ________________________________________________________________________ > This e-mail has been scanned for all viruses by Star. The > service is powered by MessageLabs. For more information on a proactive > anti-virus service working around the clock, around the globe, visit: > http://www.star.net.uk > ________________________________________________________________________ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajb2 at mssl.ucl.ac.uk Tue Mar 1 13:50:38 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Tue, 01 Mar 2011 13:50:38 +0000 Subject: [Linux-cluster] How fast can rsync be on GFS2? In-Reply-To: <4D686679.40104@logik-internet.rs> References: <984986171.189718.1298641413061.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <4D686679.40104@logik-internet.rs> Message-ID: <4D6CF9AE.8050004@mssl.ucl.ac.uk> Nikola Savic wrote: > Rsync is very slow in creating file > list, little faster than 100files/s. That's about what I see too. Ditto on reading. From szhargrave at ybs.co.uk Tue Mar 1 14:45:23 2011 From: szhargrave at ybs.co.uk (Simon Hargrave) Date: Tue, 1 Mar 2011 14:45:23 +0000 Subject: [Linux-cluster] lvm2-cluster not syncing correctly? 
In-Reply-To: References: <20110224161352225.00000004632@H04405> <1151718125.172523.1298566173100.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <20110224173814596.00000004632@H04405> <09f63f66e569f668e219115b5e26f55f@sjolshagen.net> <20110225085519297.00000004632@H04405> <20110301111045292.00000001752@H04405> Message-ID: <20110301144523300.00000004068@H04405> Please read the warning at the end of this email ________________________________________________ > Does "clvmd -R" on all nodes makes any difference? No it doesn't. Interesting that the lvs command in verbose mode does seem to see the test LV during scan, but not display it at the end: - [root at ybsxlx87 ~]# lvs -vv 2>&1 | tail -30 /dev/vgHBPOCSHARED/lvinv01: size is 2097152 sectors /dev/vgHBPOCSHARED/lvmanaged: size is 41943040 sectors /dev/vgHBPOCSHARED/lvmanaged: size is 41943040 sectors /dev/vgHBPOCSHARED/lvinstance: size is 41943040 sectors /dev/vgHBPOCSHARED/lvinstance: size is 41943040 sectors /dev/vgHBPOCSHARED/lvaserver: size is 41943040 sectors /dev/vgHBPOCSHARED/lvaserver: size is 41943040 sectors /dev/vgHBPOCSHARED/lvcluster: size is 41943040 sectors /dev/vgHBPOCSHARED/lvcluster: size is 41943040 sectors /dev/vgHBPOCSHARED/test: size is 2097152 sectors /dev/vgHBPOCSHARED/test: size is 2097152 sectors LV VG #Seg Attr LSize Maj Min KMaj KMin Origin Snap% Move Copy% Log Convert LV UUID esmlv vg00 1 -wi-ao 480.00M -1 -1 253 9 nMkgWo-FHiE-HcNR-Gejh-aGQu-5jdi-5WuzUl lvol1 vg00 1 -wi-ao 1.00G -1 -1 253 7 gGMvIQ-A1YZ-Rdfj-IuUI-r0pw-o2uE-0TO4t2 lvol2 vg00 1 -wi-ao 4.00G -1 -1 253 17 An7J2B-W2l6-tSFl-M8Ud-rf2o-1mX9-RqP2MX lvol3 vg00 1 -wi-ao 3.91G -1 -1 253 10 NuaTa1-L911-PAYp-x8DL-ccLI-rZvK-tJRNf2 lvol4 vg00 1 -wi-ao 1.00G -1 -1 253 8 dy6xdZ-DMYh-ykjw-FhvT-FeLa-SJSA-xn26qR lvol5 vg00 1 -wi-ao 1.00G -1 -1 253 11 jVBGid-k9ke-kxwt-TIHS-00jW-gHwU-MjyG3h lvol6 vg00 1 -wi-ao 256.00M -1 -1 253 14 36n8Dy-OtBd-QDs1-2L71-kmrx-5vLP-olvHcW netbackuplv vg00 1 -wi-ao 512.00M -1 -1 253 16 3XhLTa-KKOO-ldZd-bd2h-bQGQ-Nj2C-eiqLkK tivolilv vg00 1 -wi-ao 64.00M -1 -1 253 18 9wVbG3-AYzn-31VY-uS1v-nuFz-ZMnH-yk6obG u001lv vg00 1 -wi-ao 5.00G -1 -1 253 13 ETn1bt-GZlk-q67D-JjMJ-Tvse-RbUe-dZ82LJ u003lv vg00 1 -wi-ao 512.00M -1 -1 253 15 ecJ8f1-0FrY-YWYx-hLpe-kHyi-ghGo-qzO0UI ybslv vg00 1 -wi-ao 32.00M -1 -1 253 12 aPHyyp-kWCT-uuKc-qeH3-gFNw-mU29-l0472c fmw1 vgHBPOCSHARED 1 -wi-a- 20.00G -1 -1 253 19 1ychoA-hOjm-EiJ3-0yA1-6FmY-MPlZ-f4q8OB lvaserver vgHBPOCSHARED 1 -wi-ao 20.00G -1 -1 253 23 TOUmoW-xL3U-eozs-eHB8-jWvC-Pxwf-mJASYz lvcluster vgHBPOCSHARED 1 -wi-ao 20.00G -1 -1 253 24 AMDHIB-bXCu-18km-lGoL-Vzke-SIcD-KOEjVf lvinstance vgHBPOCSHARED 1 -wi-ao 20.00G -1 -1 253 22 IyXuAs-qIMS-xs8n-sGdZ-Sv7Z-9lUb-nweVvn lvinv01 vgHBPOCSHARED 1 -wi-a- 1.00G -1 -1 253 20 may1bW-gRcZ-sDnj-mbWH-um0B-hddu-HUY93C lvmanaged vgHBPOCSHARED 1 -wi-ao 20.00G -1 -1 253 21 R1gaUa-FDKx-1DEf-LT9d-vt86-o1Zz-l9LmuU I now have a case raised with RedHat. I'll update if we make any progress. Simon - Simon Hargrave szhargrave at ybs.co.uk Technical Services Team Leader x2831 Yorkshire Building Society 01274 472831 http://wwwtech/sysint/tsgcore.asp ________________________________________________ This email and any attachments are confidential and may contain privileged information. If you are not the person for whom they are intended please return the email and then delete all material from any computer. You must not use the email or attachments for any purpose, nor disclose its contents to anyone other than the intended recipient. 
Any statements made by an individual in this email do not necessarily reflect the views of the Yorkshire Building Society Group. ________________________________________________ Yorkshire Building Society, which is authorised and regulated by the Financial Services Authority, chooses to introduce its customers to Legal & General for the purposes of advising on and arranging life assurance and investment products bearing Legal & General?s name. We are entered in the FSA Register and our FSA registration number is 106085 http://www.fsa.gov.uk/register Head Office: Yorkshire Building Society, Yorkshire House, Yorkshire Drive, Bradford, BD5 8LJ Tel: 0845 1 200 100 Visit Our Website http://www.ybs.co.uk All communications with us may be monitored/recorded to improve the quality of our service and for your protection and security. ________________________________________________________________________ This e-mail has been scanned for all viruses by Star. The service is powered by MessageLabs. For more information on a proactive anti-virus service working around the clock, around the globe, visit: http://www.star.net.uk ________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Tue Mar 1 14:57:23 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 01 Mar 2011 15:57:23 +0100 Subject: [Linux-cluster] resource-agents 3.1.1 stable release Message-ID: <4D6D0953.2010208@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Welcome to the second resource-agents standalone release. this is a bug fix only release. The new source tarball can be downloaded here: https://fedorahosted.org/releases/r/e/resource-agents/resource-agents-3.1.1.tar.xz To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Fabio Under the hood (from 3.1.0): Fabio M. 
Di Nitto (1): fs-lib: fix do_monitor device mapping Lon Hohberger (3): resource-agents: Fix migrateuriopt setting resource-agents: Improve LD_LIBRARY_PATH handling by SAP* resource-agents: Use literal quotes for tr calls Marek 'marx' Grac (3): resource-agents: Add option disable_rdisc to ip.sh resource-agents: Apache resource with spaces in name fails to start resource-agents: Remove netmask from IP address when creating list of them rgmanager/src/resources/SAPDatabase | 19 ++++++++++--------- rgmanager/src/resources/SAPInstance | 5 +++-- rgmanager/src/resources/ip.sh | 21 +++++++++++++++++++-- rgmanager/src/resources/utils/config-utils.sh.in | 10 ++++++---- rgmanager/src/resources/utils/fs-lib.sh | 13 ++++++++++++- rgmanager/src/resources/vm.sh | 5 ++++- 6 files changed, 54 insertions(+), 19 deletions(-) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJNbQlRAAoJEFA6oBJjVJ+OcbEP/389Dhidd3GMfaO8hn+RPEe0 y0W7CpWR73f2hFVmptLDT4e5bKj5+TRPh5rr/V9y4weDXfmv/YGbsCPmyBGZCtiW DLkBuP8xnnb8M2pJtWM0T6SZLgR/iXviYchIj4D8F6zE2OsXQp7YOcfDeN/Xwe9J znilVw9shBVlV2SWA/2avl6MXnmnO1IypUkSZ4VQt7IiJUYP/CRdxiwJbWGRM7Sk rPeQArcdJ8xqKyPtmXslBiFNawFdw2rywGbRCeXo+IaWhw//urYDCuwSq+wwvsFq BMWNRwqGuvUAmKPnustekfGLcVWwK1SaAgzeiQh5PHr5p7bFk+mRBl5JW63yJsjT wQbSOTX4A6c7QmCGSlfuqz9sUbtb83bHPS3G9lvPiPFY8TpBl4XlAaEyEO9ipwJN k8ktQwNWhDsza8lFEQDoD0p/DQRLkEZ8KXscP7qtyPCQ8MYkxaGFPmxWhkmVp0/l liIoYu8W2wdTvOOcu4qdiuxV5Z9uBmMU6CSZmd3rG/Zg8h9oNHZU9FK6xncgHuxx XvH6MhyVZSYh9K6UHZDWIFCjjL5+H9dMVmeFAs9XEu1RNacOixMBgLoTFy91PInu GXDSmu1X4biiI5WbdahPWvgxxEQG2hgHrhQIVyzp+Lw2DRU1/f4vcOLCoklB69D7 NhJFuXZC7GF/uR7Zw+vU =KZCV -----END PGP SIGNATURE----- From fdinitto at redhat.com Wed Mar 2 09:25:06 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 02 Mar 2011 10:25:06 +0100 Subject: [Linux-cluster] fence-agents 3.1.2 stable release Message-ID: <4D6E0CF2.6060402@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Welcome to the fence-agents 3.1.2 release. This release contains a few bug fixes and a new watchdog/fence_scsi integration script (thanks to Ryan O?hara). The new source tarball can be downloaded here: https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-3.1.2.tar.xz To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Fabio Under the hood (from 3.1.1): Fabio M. 
Di Nitto (8): build: cleanup configure.ac fence_eaton_snmp: fix port number handling build: make ready for watchdog integration script build: fence_ipmilan does not need PYTHONPATH to generate manpage build: fix build dependecy build: plug in fence_scsi_check.pl and ship it in sharedir/cluster/ build: update .gitignore Fix .gitignore a bit more Marek 'marx' Grac (3): fence_rsa: Better error handling fence_wti: Unable to parse output when splitted into several screens fence_wti: Unable to parse output when splitted into several screens (2/2) Ryan O'Hara (4): fence_scsi: move key file to /var/run/cluster fence_scsi: write devices to tmp file on unfence fence_scsi: create /var/run/cluster if necessary fence_scsi_check: watchdog script for fence_scsi [fabbione at daikengo fence-agents]$ git diff --stat v3.1.1..v3.1.2 .gitignore | 2 +- configure.ac | 6 +- fence/agents/alom/Makefile.am | 4 +- fence/agents/apc/Makefile.am | 4 +- fence/agents/apc_snmp/Makefile.am | 4 +- fence/agents/baytech/Makefile.am | 4 +- fence/agents/bladecenter/Makefile.am | 4 +- fence/agents/brocade/Makefile.am | 4 +- fence/agents/bullpap/Makefile.am | 4 +- fence/agents/cisco_mds/Makefile.am | 4 +- fence/agents/cisco_ucs/Makefile.am | 4 +- fence/agents/cpint/Makefile.am | 4 +- fence/agents/drac/Makefile.am | 4 +- fence/agents/drac5/Makefile.am | 4 +- fence/agents/eaton_snmp/Makefile.am | 4 +- fence/agents/eaton_snmp/fence_eaton_snmp.py | 5 + fence/agents/egenera/Makefile.am | 4 +- fence/agents/eps/Makefile.am | 4 +- fence/agents/ibmblade/Makefile.am | 4 +- fence/agents/ifmib/Makefile.am | 5 +- fence/agents/ilo/Makefile.am | 4 +- fence/agents/ilo_mp/Makefile.am | 4 +- fence/agents/intelmodular/Makefile.am | 4 +- fence/agents/ipmilan/Makefile.am | 1 - fence/agents/ldom/Makefile.am | 4 +- fence/agents/lib/Makefile.am | 4 +- fence/agents/lpar/Makefile.am | 4 +- fence/agents/mcdata/Makefile.am | 4 +- fence/agents/rhevm/Makefile.am | 4 +- fence/agents/rsa/Makefile.am | 4 +- fence/agents/rsa/fence_rsa.py | 7 +- fence/agents/rsb/Makefile.am | 4 +- fence/agents/sanbox2/Makefile.am | 4 +- fence/agents/scsi/Makefile.am | 9 ++- fence/agents/scsi/fence_scsi.pl | 47 +++++++- fence/agents/scsi/fence_scsi_check.pl | 170 +++++++++++++++++++++++++++ fence/agents/virsh/Makefile.am | 4 +- fence/agents/vixel/Makefile.am | 4 +- fence/agents/vmware/Makefile.am | 4 +- fence/agents/wti/Makefile.am | 4 +- fence/agents/wti/fence_wti.py | 27 ++++- fence/agents/xcat/Makefile.am | 4 +- fence/agents/zvm/Makefile.am | 4 +- make/fencebuild.mk | 2 +- 44 files changed, 365 insertions(+), 48 deletions(-) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBCAAGBQJNbgzsAAoJEFA6oBJjVJ+OIMgQAJV4QUJh77yGIWmI+XM9z0KU +BRikjVKeiBGAET/Nm25P46fLt+x4JTh24QaAzD2ZMPvZKDqd9vOtuA2dij/SxIn vMn5K4+A1XOeew9Ji+48dDKW5vk6hyEQckJQ7/uTty2hJxNoWu1R2+VD59Pet61C lN9uXlJMBIVb8ckxSCD7h5i7FfFTNiDX2opYdtL6i0s8ysP5tlKDcph5LENHyxWM /ux8SXEdPCSwCMsdmSPglKLJcRwWLVRaVJgy+K+Mau94S9AjUZBr0ts3+FNKToAU 5dVa833bedgLtHsM9pGifazgvo7qOYMTgpnULyQF7bg2od6jzbs1CD4k66eSN9yg Qo/fAXiPYl/dNFLic7n6aepDEeAgBGdj2Llp9ien/XWLmkA+mxFuIQJqtYtPRFTl fiYrmL0yfq3jbAQaApeZgDtK9aOz7J+us6y++6TrVUQXhagXI9xW3mQskiigXwWN S+oW4ujAYcLTBlvTNvbHKEzFaq2gFke7goOJRahW+DsqOgvC8otL7f68QoCp5GAh EHbt/5FHi16JS69VY0dTU1kdpCEvtSWXDKfsmHY2SzoLsSesy2bIs6BbQFItZTcf krB/1F/oKogRAUPb09Pl2g8+Z9ruGcjyqhU+RCNkHdIhHC/IL6n3mc4Bqb1Q1qtu qOmKOddGpY2hQFcL1cjl =cXkC -----END PGP SIGNATURE----- From fdinitto at redhat.com Wed Mar 2 10:39:27 2011 From: fdinitto at 
redhat.com (Fabio M. Di Nitto) Date: Wed, 02 Mar 2011 11:39:27 +0100 Subject: [Linux-cluster] new resource agents repository Message-ID: <4D6E1E5F.6040507@redhat.com> Hello, There is a new repository for Resource Agents which contains RA sets from both Linux HA and Red Hat projects: git://github.com/ClusterLabs/resource-agents.git The purpose of the common repository is to share maintenance load and try to consolidate resource agents. There were no conflicts with the rgmanager RA set and both source layouts remain the same. It is only that autoconf bits were merged. The only difference is that if you want to get Linux HA set of resource agents installed, configure should be run like this: configure --with-ras-set=rgmanager ... The new repository is git ane the existing history is preserved. The existing repository at git.fedorahosted.org will be retired soon. Many thanks to Dejan for writing the original "new resource agents repository" email to linux-ha-dev for me to copy/paste almost pristine ;) more seriously, thanks to all people for helping in all various aspects of the merge. There are for sure corners that we need to smooth due to the merge. Please report any issue you find and we will try to address it as soon as possible. Cheers, Fabio From mika68vaan at gmail.com Wed Mar 2 15:07:38 2011 From: mika68vaan at gmail.com (Mika i) Date: Wed, 2 Mar 2011 17:07:38 +0200 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device Message-ID: Hi Is there a way to get cluster-suite Fence to work with Rhel 5.5 and iLo3 I have now in both clusters rhel 5.5 version with: kernel 2.6.18-194.el5 cman-2.0.115-68.el5_6.1 But in fence state i get allways message: Unable to connect/login to fencing device Any help - or must i update the cluster to rhel 5.6? -------------- next part -------------- An HTML attachment was scrubbed... URL: From sklemer at gmail.com Wed Mar 2 15:38:41 2011 From: sklemer at gmail.com (=?UTF-8?B?16nXnNeV150g16fXnNee16g=?=) Date: Wed, 2 Mar 2011 17:38:41 +0200 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: References: Message-ID: Hello. I think its not supported yet. You should use ifence_ipmilan. Regards On Wed, Mar 2, 2011 at 5:07 PM, Mika i wrote: > Hi > > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and iLo3 > I have now in both clusters rhel 5.5 version with: > kernel 2.6.18-194.el5 > cman-2.0.115-68.el5_6.1 > > But in fence state i get allways message: Unable to connect/login to > fencing device > > Any help - or must i update the cluster to rhel 5.6? > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jason_Henderson at Mitel.com Wed Mar 2 15:49:32 2011 From: Jason_Henderson at Mitel.com (Jason_Henderson at Mitel.com) Date: Wed, 2 Mar 2011 10:49:32 -0500 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: Message-ID: linux-cluster-bounces at redhat.com wrote on 03/02/2011 10:07:38 AM: > Hi > > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and iLo3 > I have now in both clusters rhel 5.5 version with: > kernel 2.6.18-194.el5 > cman-2.0.115-68.el5_6.1 > > But in fence state i get allways message: Unable to connect/login to > fencing device > > Any help - or must i update the cluster to rhel 5.6? What fence agent are you using, fence_ilo? 
You will need to use the fence_ipmilan agent for iLO3. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Ning.Bao at statcan.gc.ca Wed Mar 2 20:23:51 2011 From: Ning.Bao at statcan.gc.ca (Ning.Bao at statcan.gc.ca) Date: Wed, 2 Mar 2011 15:23:51 -0500 Subject: [Linux-cluster] question about backup/restore of GFS2 Message-ID: Hi Does anyone have experience with using Netbackup to backup/restore GFS2 file system in production environment? I noticed that GFS2 is not on the list of file compatitbitly of netbackup 7. Is there any alternative backup tool for GFS2 in enterprise settings if Netbackup can not do it? Thanks! -Ning -------------- next part -------------- An HTML attachment was scrubbed... URL: From vmutu at pcbi.upenn.edu Wed Mar 2 21:50:50 2011 From: vmutu at pcbi.upenn.edu (Valeriu Mutu) Date: Wed, 2 Mar 2011 16:50:50 -0500 Subject: [Linux-cluster] clvmd hangs on startup Message-ID: <20110302215050.GD10674@bsdera.pcbi.upenn.edu> Hi, I have a 2-node cluster setup and trying to get GFS2 working on top of an iSCSI volume. Each node is a Xen virtual machine. I am currently unable to get clvmd working on the 2nd node. It starts fine on the 1st node: [root at vm1 ~]# service clvmd start Starting clvmd: [ OK ] Activating VGs: Logging initialised at Wed Mar 2 15:25:07 2011 Set umask to 0077 Finding all volume groups Finding volume group "PcbiHomesVG" Activated 1 logical volumes in volume group PcbiHomesVG 1 logical volume(s) in volume group "PcbiHomesVG" now active Finding volume group "VolGroup00" 2 logical volume(s) in volume group "VolGroup00" already active 2 existing logical volume(s) in volume group "VolGroup00" monitored Activated 2 logical volumes in volume group VolGroup00 2 logical volume(s) in volume group "VolGroup00" now active Wiping internal VG cache [root at vm1 ~]# vgs Logging initialised at Wed Mar 2 15:25:12 2011 Set umask to 0077 Finding all volume groups Finding volume group "PcbiHomesVG" Finding volume group "VolGroup00" VG #PV #LV #SN Attr VSize VFree PcbiHomesVG 1 1 0 wz--nc 1.17T 0 VolGroup00 1 2 0 wz--n- 4.66G 0 Wiping internal VG cache But when I try to start clvmd on the 2nd node, it hangs: [root at vm2 ~]# service clvmd start Starting clvmd: [ OK ] ...hangs... I see the following in vm2:/var/log/messages: Mar 2 15:59:02 vm2 clvmd[2283]: Cluster LVM daemon started - connected to CMAN Mar 2 16:01:36 vm2 kernel: INFO: task clvmd:2302 blocked for more than 120 seconds. Mar 2 16:01:36 vm2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Mar 2 16:01:36 vm2 kernel: clvmd D 0022a86125f49a6a 0 2302 1 2299 (NOTLB) Mar 2 16:01:36 vm2 kernel: ffff880030cb7db8 0000000000000282 0000000000000000 0000000000000000 Mar 2 16:01:36 vm2 kernel: 0000000000000008 ffff880033e327e0 ffff880000033080 000000000001c2b2 Mar 2 16:01:36 vm2 kernel: ffff880033e329c8 ffffffff8029c48f Mar 2 16:01:36 vm2 kernel: Call Trace: Mar 2 16:01:36 vm2 kernel: [] autoremove_wake_function+0x0/0x2e Mar 2 16:01:36 vm2 kernel: [] __down_read+0x82/0x9a Mar 2 16:01:36 vm2 kernel: [] :dlm:dlm_user_request+0x2d/0x174 Mar 2 16:01:36 vm2 kernel: [] mntput_no_expire+0x19/0x89 Mar 2 16:01:36 vm2 kernel: [] sys_sendto+0x14a/0x164 Mar 2 16:01:36 vm2 kernel: [] :dlm:device_write+0x2f5/0x5e5 Mar 2 16:01:36 vm2 kernel: [] vfs_write+0xce/0x174 Mar 2 16:01:36 vm2 kernel: [] sys_write+0x45/0x6e Mar 2 16:01:36 vm2 kernel: [] tracesys+0xab/0xb6 [...] I also noticed that there's a waiting "vgscan" process that "clvmd" is waiting on: 1 1655 1655 1655 ? 
-1 Ss 0 0:00 /usr/sbin/sshd 1655 1801 1801 1801 ? -1 Ss 0 0:00 \_ sshd: root at pts/0 1801 1803 1803 1803 pts/0 2187 Ss 0 0:00 | \_ -bash 1803 2187 2187 1803 pts/0 2187 S+ 0 0:00 | \_ /bin/sh /sbin/service clvmd start 2187 2192 2187 1803 pts/0 2187 S+ 0 0:00 | \_ /bin/bash /etc/init.d/clvmd start 2192 2215 2187 1803 pts/0 2187 S+ 0 0:00 | \_ /usr/sbin/vgscan Before starting clvmd, cman is started and both nodes are cluster members: [root at vm1 ~]# cman_tool nodes Node Sts Inc Joined Name 1 M 544456 2011-03-02 15:24:31 172.16.50.32 2 M 544468 2011-03-02 15:52:29 172.16.50.33 Note that I'm using manual fencing in this configuration. Both nodes are running CentOS 5.5: # uname -a Linux vm2.pcbi.upenn.edu 2.6.18-194.32.1.el5xen #1 SMP Wed Jan 5 18:44:24 EST 2011 x86_64 x86_64 x86_64 GNU/Linux These package versions were installed on each node: cman-2.0.115-34.el5_5.4 cman-devel-2.0.115-34.el5_5.4 gfs2-utils-0.1.62-20.el5 lvm2-2.02.56-8.el5_5.6 lvm2-cluster-2.02.56-7.el5_5.4 rgmanager-2.0.52-6.el5.centos.8 system-config-cluster-1.0.57-3.el5_5.1 iptables is turned off on each node. Does anyone know why clvmd hangs on the 2nd node? Best, -- Valeriu Mutu From jeff.sturm at eprize.com Wed Mar 2 22:36:45 2011 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Wed, 2 Mar 2011 17:36:45 -0500 Subject: [Linux-cluster] clvmd hangs on startup In-Reply-To: <20110302215050.GD10674@bsdera.pcbi.upenn.edu> References: <20110302215050.GD10674@bsdera.pcbi.upenn.edu> Message-ID: <64D0546C5EBBD147B75DE133D798665F0855C290@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Valeriu Mutu > Sent: Wednesday, March 02, 2011 4:51 PM > > Does anyone know why clvmd hangs on the 2nd node? Double-check that the 2nd node can read and write the shared iSCSI storage. -Jeff From swap_project at yahoo.com Wed Mar 2 23:16:43 2011 From: swap_project at yahoo.com (Srija) Date: Wed, 2 Mar 2011 15:16:43 -0800 (PST) Subject: [Linux-cluster] Nodes are not joining to the cluster In-Reply-To: Message-ID: <852499.20065.qm@web112812.mail.gq1.yahoo.com> Hi all, Here is the issue with the cluster describing below: The cluster is built with 16 nodes. All rhel5.5 86_64 bit OS. yesterday night two servers were rebooted and after that these two servers are not joining to the cluster. I was not the part of the team when it is built. and my knowledge regarding cluster is also little bit. Here is the scenario: - There is no quorum disks. But the person who has built the cluster he is telling he has executed the quorum from command line, [ i am not sure of that ] - The errors in the message log are showing as ccsd[24182]: Unable to connect to cluster infrastructure after 12060 seconds , it is a continuous error message in the log file The cluster.conf are as follows: ................... [ all the other nodes ]................... .............................[ all the fence devices for other nodes ]................ It seems it is a very basic configuration. But at this stage more important is, to attach the two servers in the cluster environment. If more information is needed , i will provide. Any advice is appreciated. Thanks in advance From lmb at novell.com Thu Mar 3 09:37:25 2011 From: lmb at novell.com (Lars Marowsky-Bree) Date: Thu, 3 Mar 2011 10:37:25 +0100 Subject: [Linux-cluster] Announcement: Linux Foundation HA working group mailing lists Message-ID: <20110303093725.GA32146@suse.de> Hi everyone, please excuse the long Cc list. 
Behind the scenes, some of the projects that make up the cluster stack on Linux have been working together to converge and integrate the various projects. We have been meeting on and off for the last decade, and made some amazing progress over the years. However, we believe we could make even better progress if we had a common umbrella that did not try to take away any independence from the projects, but acted as a vendor-neutral forum for coordination. Many projects have chosen to create their own foundations these days, but we did not want this overhead. The Linux Foundation is a well established organization, and its board has graciously agreed to host the working group for us, and also offered further support. One of the first steps here is the creation of mailing lists. https://lists.linux-foundation.org/mailman/listinfo/ha-wg https://lists.linux-foundation.org/mailman/listinfo/ha-wg-technical These mailing lists are not intended to supersede any of the existing project mailing lists, but act as a place for the coordination of cross-project issues - such as distribution adoption, convergence of components and projects (like the on-going resource agent merge between RHCS & Linux-HA), discussion of summits, and so on. You are all invited to join these mailing lists, but the focus is on project maintainers, contributors, distribution packagers. Our immediate roadmap (of the non-technical kind) is to prepare a summary statement, a brief charta, agree on what we consider to be part of the "core" stack, and explore the options that the LF can offer to us (we have already discussed some of this in a smaller group, but it would be too long for this announcement), and announce this working group to a larger audience at the Collab Summit in April. Also, we plan to hold this year's face to face meeting along the Linux Foundation conferences in October in Prague, CZ. I look forward to the dialogue! Regards, Lars -- Architect Storage/HA, OPS Engineering, Novell, Inc. SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG N?rnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde From swhiteho at redhat.com Thu Mar 3 10:21:44 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 03 Mar 2011 10:21:44 +0000 Subject: [Linux-cluster] question about backup/restore of GFS2 In-Reply-To: References: Message-ID: <1299147704.2572.37.camel@dolmen> Hi, On Wed, 2011-03-02 at 15:23 -0500, Ning.Bao at statcan.gc.ca wrote: > Hi > > Does anyone have experience with using Netbackup to backup/restore > GFS2 file system in production environment? I noticed that GFS2 is > not on the list of file compatitbitly of netbackup 7. Is there any > alternative backup tool for GFS2 in enterprise settings if Netbackup > can not do it? > > Thanks! > > -Ning > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster The thing to watch out for with backup is how it affects the normal working set of the nodes in the cluster. Often you'll get much better performance with backup if you write a custom set of scripts which respects the working set on each node and which also allows the backup to proceed in parallel across the filesystem. What makes a good backup solution depends on the application and the I/O pattern, so it is tricky to make any generic suggestions, Steve. 
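A minimal sketch of the kind of split Steve describes, assuming one GFS2 filesystem mounted at /mnt/gfs2 on every node (the backup paths and directory names are made up for illustration): give each node a disjoint set of top-level directories so it mostly touches glocks it already holds, and run the copies in parallel.

  # on node1: back up one half of the tree
  tar -czf /backup/node1-$(date +%F).tar.gz -C /mnt/gfs2 homes_a homes_b
  # on node2, at the same time: back up the other half
  tar -czf /backup/node2-$(date +%F).tar.gz -C /mnt/gfs2 homes_c homes_d

Whether tar, rsync or a commercial agent does the copying matters less than keeping the per-node directory sets disjoint.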
From mailing.sr at gmail.com Thu Mar 3 10:20:37 2011 From: mailing.sr at gmail.com (Seb) Date: Thu, 3 Mar 2011 11:20:37 +0100 Subject: [Linux-cluster] Nodes are not joining to the cluster In-Reply-To: <852499.20065.qm@web112812.mail.gq1.yahoo.com> References: <852499.20065.qm@web112812.mail.gq1.yahoo.com> Message-ID: 2011/3/3 Srija > Hi all, > > Here is the issue with the cluster describing below: > > The cluster is built with 16 nodes. All rhel5.5 86_64 bit OS. > yesterday night two servers were rebooted and after that these > two servers are not joining to the cluster. > > I was not the part of the team when it is built. and my knowledge > regarding cluster is also little bit. > > Here is the scenario: > > - There is no quorum disks. But the person > who has built the cluster he is telling he has executed the quorum > from command line, [ i am not sure of that ] > > - The errors in the message log are showing as > > ccsd[24182]: Unable to connect to cluster infrastructure after 12060 > seconds , it is a continuous error message in the log file > > The cluster.conf are as follows: > [snip]config[/snip] There is no section in your config file? Have you been able to identify a quorum disk on the nodes? The host-priv.domain.org is in your /etc/hosts? on all nodes? Why have they been rebooted? for maintenance/upgrade? Any iptable used? Could you please provide the logs showing the start of the cluster service? > It seems it is a very basic configuration. But at this stage more important > is, to attach the two servers in the cluster environment. > > If more information is needed , i will provide. > > Any advice is appreciated. > > Thanks in advance > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mika68vaan at gmail.com Thu Mar 3 10:21:33 2011 From: mika68vaan at gmail.com (Mika i) Date: Thu, 3 Mar 2011 12:21:33 +0200 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: References: Message-ID: hmm. Okey: have someone good installation instructions to get this fence_ipmilan to work. 1. active IPMI/DCMI over LAN in iLo3 2. what should i install in server to get fence_ipmilan to work Now if i test connection it shows like this. ipmitool -v -H 17x.3x.7x.1xx -I lanplus -U admin mc info Password: Get Auth Capabilities error Get Auth Capabilities error Error issuing Get Channel Authentication Capabilies request Error: Unable to establish IPMI v2 / RMCP+ session Get Device ID command failed Can someone help me! 2011/3/2 > > > linux-cluster-bounces at redhat.com wrote on 03/02/2011 10:07:38 AM: > > > Hi > > > > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and iLo3 > > I have now in both clusters rhel 5.5 version with: > > kernel 2.6.18-194.el5 > > cman-2.0.115-68.el5_6.1 > > > > But in fence state i get allways message: Unable to connect/login to > > fencing device > > > > Any help - or must i update the cluster to rhel 5.6? > > What fence agent are you using, fence_ilo? > You will need to use the fence_ipmilan agent for iLO3. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sklemer at gmail.com Thu Mar 3 13:00:43 2011 From: sklemer at gmail.com (=?UTF-8?B?16nXnNeV150g16fXnNee16g=?=) Date: Thu, 3 Mar 2011 15:00:43 +0200 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: References: Message-ID: Hi. for ilo3 testing you can use : # fence_ipmilan -a 17x.3x.7x.1xx -p "password" -o status fence_ipmilan -h usage: fence_ipmilan -A IPMI Lan Auth type (md5, password, or none) -a IPMI Lan IP to talk to -i IPMI Lan IP to talk to (deprecated, use -a) -p Password (if required) to control power on IPMI device -P Use Lanplus -S Script to retrieve password (if required) -l Username/Login (if required) to control power on IPMI device -o Operation to perform. Valid operations: on, off, reboot, status -t Timeout (sec) for IPMI operation (default 20) -C Ciphersuite to use (same as ipmitool -C parameter) -M Method to fence (onoff or cycle (default onoff) -V Print version and exit -v Verbose mode If no options are specified, the following options will be read from standard input (one per line): auth= Same as -A ipaddr=<#> Same as -a passwd= Same as -p passwd_script= Same as -S lanplus Same as -P login= Same as -u option= Same as -o operation= Same as -o action= Same as -o timeout= Same as -t cipher= Same as -C method= Same as -M verbose Same as -v On Thu, Mar 3, 2011 at 12:21 PM, Mika i wrote: > hmm. > Okey: have someone good installation instructions to get this fence_ipmilan to > work. > 1. active IPMI/DCMI over LAN in iLo3 > 2. what should i install in server to get fence_ipmilan to work > Now if i test connection it shows like this. > > ipmitool -v -H 17x.3x.7x.1xx -I lanplus -U admin mc info > Password: > Get Auth Capabilities error > Get Auth Capabilities error > Error issuing Get Channel Authentication Capabilies request > Error: Unable to establish IPMI v2 / RMCP+ session > Get Device ID command failed > > Can someone help me! > 2011/3/2 > >> >> >> linux-cluster-bounces at redhat.com wrote on 03/02/2011 10:07:38 AM: >> >> > Hi >> > >> > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and iLo3 >> > I have now in both clusters rhel 5.5 version with: >> > kernel 2.6.18-194.el5 >> > cman-2.0.115-68.el5_6.1 >> > >> > But in fence state i get allways message: Unable to connect/login to >> > fencing device >> > >> > Any help - or must i update the cluster to rhel 5.6? >> >> What fence agent are you using, fence_ilo? >> You will need to use the fence_ipmilan agent for iLO3. >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mika68vaan at gmail.com Thu Mar 3 13:53:32 2011 From: mika68vaan at gmail.com (Mika i) Date: Thu, 3 Mar 2011 15:53:32 +0200 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: References: Message-ID: this works: ipmitool -H xxxxxilo -I lanplus -U admin -P xxxxxx chassis power cycle Server is rebooted..... but not this: root at fff fence_ipmilan -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v Rebooting machine @ IPMI:xxixxxxxilo...Spawning: '/usr/bin/ipmitool -I lan -H 'xxxxxxx' -U 'admin' -P 'xxxxx41!' -v chassis power status'... Spawning: '/usr/bin/ipmitool -I lan -H 'xxxxxxx' -U 'admin' -P 'xxxxxx1!' -v chassis power cycle'... Failed cluster.conf .... 
...... 2011/3/3 ???? ???? > Hi. > for ilo3 testing you can use : > # fence_ipmilan -a 17x.3x.7x.1xx -p "password" -o status > > fence_ipmilan -h > usage: fence_ipmilan > -A IPMI Lan Auth type (md5, password, or none) > -a IPMI Lan IP to talk to > -i IPMI Lan IP to talk to (deprecated, use -a) > -p Password (if required) to control power on > IPMI device > -P Use Lanplus > -S Script to retrieve password (if required) > -l Username/Login (if required) to control power > on IPMI device > -o Operation to perform. > Valid operations: on, off, reboot, status > -t Timeout (sec) for IPMI operation (default 20) > -C Ciphersuite to use (same as ipmitool -C parameter) > -M Method to fence (onoff or cycle (default onoff) > -V Print version and exit > -v Verbose mode > > If no options are specified, the following options will be read > from standard input (one per line): > > auth= Same as -A > ipaddr=<#> Same as -a > passwd= Same as -p > passwd_script= Same as -S > lanplus Same as -P > login= Same as -u > option= Same as -o > operation= Same as -o > action= Same as -o > timeout= Same as -t > cipher= Same as -C > method= Same as -M > verbose Same as -v > > > On Thu, Mar 3, 2011 at 12:21 PM, Mika i wrote: > >> hmm. >> Okey: have someone good installation instructions to get this fence_ipmilan to >> work. >> 1. active IPMI/DCMI over LAN in iLo3 >> 2. what should i install in server to get fence_ipmilan to work >> Now if i test connection it shows like this. >> >> ipmitool -v -H 17x.3x.7x.1xx -I lanplus -U admin mc info >> Password: >> Get Auth Capabilities error >> Get Auth Capabilities error >> Error issuing Get Channel Authentication Capabilies request >> Error: Unable to establish IPMI v2 / RMCP+ session >> Get Device ID command failed >> >> Can someone help me! >> 2011/3/2 >> >>> >>> >>> linux-cluster-bounces at redhat.com wrote on 03/02/2011 10:07:38 AM: >>> >>> > Hi >>> > >>> > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and >>> iLo3 >>> > I have now in both clusters rhel 5.5 version with: >>> > kernel 2.6.18-194.el5 >>> > cman-2.0.115-68.el5_6.1 >>> > >>> > But in fence state i get allways message: Unable to connect/login to >>> > fencing device >>> > >>> > Any help - or must i update the cluster to rhel 5.6? >>> >>> What fence agent are you using, fence_ilo? >>> You will need to use the fence_ipmilan agent for iLO3. >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swap_project at yahoo.com Thu Mar 3 15:37:12 2011 From: swap_project at yahoo.com (Srija) Date: Thu, 3 Mar 2011 07:37:12 -0800 (PST) Subject: [Linux-cluster] Nodes are not joining to the cluster In-Reply-To: Message-ID: <352869.58978.qm@web112815.mail.gq1.yahoo.com> Thanks for your reply. --- On Thu, 3/3/11, Seb wrote: > > There is no section in your config > file? No > Have you been able to identify a quorum disk on the > nodes? There is no quorum disk allocated for this configuration. As mentioned, only I know, quotum was alocated through command line etc. > > The host-priv.domain.org > is in your /etc/hosts? on all nodes? > Yes. > Why have they been rebooted? for > maintenance/upgrade? 
> For maintenance. But before the reboot, the cluster service on that node was not shutdown. > Any iptable used? > No. > Could you please provide the logs showing the start > of the cluster service? > I am mentioning here one of the server's log , when ccs started. _______________________________________________________________________________________________________ Mar 1 20:20:39 host ccsd[5287]: Starting ccsd 2.0.115: Mar 1 20:20:39 host ccsd[5287]: Built: May 25 2010 04:32:00 Mar 1 20:20:39 host ccsd[5287]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. Mar 1 20:20:39 host ccsd[5287]: cluster.conf (cluster name = xxxxxxx, version = 21) found. Mar 1 20:20:40 host openais[5302]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6' Mar 1 20:20:40 host openais[5302]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Mar 1 20:20:40 host openais[5302]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Mar 1 20:20:40 host openais[5302]: [MAIN ] AIS Executive Service: started and ready to provide service. Mar 1 20:20:40 host openais[5302]: [MAIN ] Using default multicast address of xxx.xxx.xxx.xx Mar 1 20:20:40 host openais[5302]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) Mar 1 20:20:40 host openais[5302]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans) Mar 1 20:20:40 host openais[5302]: [TOTEM] join (60 ms) send_join (0 ms) consensus (20000 ms) merge (200 ms) Mar 1 20:20:40 host openais[5302]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) Mar 1 20:20:40 host openais[5302]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1402 Mar 1 20:20:40 host openais[5302]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages) Mar 1 20:20:40 host openais[5302]: [TOTEM] send threads (0 threads) Mar 1 20:20:40 host openais[5302]: [TOTEM] RRP token expired timeout (495 ms) Mar 1 20:20:40 host openais[5302]: [TOTEM] RRP token problem counter (2000 ms) Mar 1 20:20:40 host openais[5302]: [TOTEM] RRP threshold (10 problem count) Mar 1 20:20:40 host openais[5302]: [TOTEM] RRP mode set to none. Mar 1 20:20:40 host openais[5302]: [TOTEM] heartbeat_failures_allowed (0) Mar 1 20:20:40 host openais[5302]: [TOTEM] max_network_delay (50 ms) Mar 1 20:20:40 host openais[5302]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0 Mar 1 20:20:40 host openais[5302]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes). Mar 1 20:20:40 host openais[5302]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Mar 1 20:20:40 host openais[5302]: [TOTEM] The network interface [192.168.xxx.x] is now up. Mar 1 20:20:40 host openais[5302]: [TOTEM] Created or loaded sequence id 6160.192.168.xxx.x for this ring. Mar 1 20:20:40 host openais[5302]: [TOTEM] entering GATHER state from 15. 
Mar 1 20:20:40 host openais[5302]: [CMAN ] CMAN 2.0.115 (built May 25 2010 04:32:02) started Mar 1 20:20:40 host openais[5302]: [MAIN ] Service initialized 'openais CMAN membership service 2.01' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais extended virtual synchrony service' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais cluster membership service B.01.01' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais availability management framework B.01.01' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais checkpoint service B.01.01' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais event service B.01.01' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais distributed locking service B.01.01' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais message service B.01.01' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais configuration service' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais cluster closed process group service v1.01' Mar 1 20:20:40 host openais[5302]: [SERV ] Service initialized 'openais cluster config database access v1.01' Mar 1 20:20:40 host openais[5302]: [SYNC ] Not using a virtual synchrony filter. Mar 1 20:20:40 host openais[5302]: [TOTEM] Creating commit token because I am the rep. Mar 1 20:20:40 host openais[5302]: [TOTEM] Saving state aru 0 high seq received 0 Mar 1 20:20:40 host openais[5302]: [TOTEM] Storing new sequence id for ring 1814 Mar 1 20:20:40 host openais[5302]: [TOTEM] entering COMMIT state. Mar 1 20:20:40 host openais[5302]: [TOTEM] entering RECOVERY state. Mar 1 20:20:40 host openais[5302]: [TOTEM] position [0] member 192.168.xxx.x: Mar 1 20:20:40 host openais[5302]: [TOTEM] previous ring seq 6160 rep 192.168.xxx.x Mar 1 20:20:40 host openais[5302]: [TOTEM] aru 0 high delivered 0 received flag 1 Mar 1 20:20:40 host openais[5302]: [TOTEM] Did not need to originate any messages in recovery. Mar 1 20:20:40 host openais[5302]: [TOTEM] Sending initial ORF token Mar 1 20:20:40 host openais[5302]: [CLM ] CLM CONFIGURATION CHANGE Mar 1 20:20:40 host openais[5302]: [CLM ] New Configuration: Mar 1 20:20:40 host openais[5302]: [CLM ] Members Left: Mar 1 20:20:40 host openais[5302]: [CLM ] Members Joined: Mar 1 20:20:40 host openais[5302]: [CLM ] CLM CONFIGURATION CHANGE Mar 1 20:20:40 host openais[5302]: [CLM ] New Configuration: Mar 1 20:20:40 host openais[5302]: [CLM ] r(0) ip(192.168.xxx.x) Mar 1 20:20:40 host openais[5302]: [CLM ] Members Left: Mar 1 20:20:40 host openais[5302]: [CLM ] Members Joined: Mar 1 20:20:40 host openais[5302]: [CLM ] r(0) ip(192.168.xxx.x) Mar 1 20:20:40 host openais[5302]: [SYNC ] This node is within the primary component and will provide service. Mar 1 20:20:40 host openais[5302]: [TOTEM] entering OPERATIONAL state. Mar 1 20:20:40 host openais[5302]: [CLM ] got nodejoin message 192.168.xxx.x Mar 1 20:20:41 host ccsd[5287]: Initial status:: Inquorate Mar 1 20:20:41 host ccsd[5287]: Cluster is not quorate. Refusing connection. Mar 1 20:20:41 host ccsd[5287]: Error while processing connect: Connection refused Mar 1 20:20:42 host ccsd[5287]: Cluster is not quorate. Refusing connection. Mar 1 20:20:42 host ccsd[5287]: Error while processing connect: Connection refused Mar 1 20:20:42 host ccsd[5287]: Cluster is not quorate. Refusing connection. 
Mar 1 20:20:42 host ccsd[5287]: Error while processing connect: Connection refused _______________________________________________________________________________________________________ Thanks again From pradhanparas at gmail.com Thu Mar 3 16:15:43 2011 From: pradhanparas at gmail.com (Paras pradhan) Date: Thu, 3 Mar 2011 10:15:43 -0600 Subject: [Linux-cluster] question about backup/restore of GFS2 In-Reply-To: <1299147704.2572.37.camel@dolmen> References: <1299147704.2572.37.camel@dolmen> Message-ID: We had a whole cluster lockdown when we forgot to exclude one of the GFS partitions in netbackup when it was trying to lock the fs when it the backup started. Then I had to restart one of the nodes. I am still not sure why the lock was not released. Paras. On Thu, Mar 3, 2011 at 4:21 AM, Steven Whitehouse wrote: > Hi, > > On Wed, 2011-03-02 at 15:23 -0500, Ning.Bao at statcan.gc.ca wrote: >> Hi >> >> Does anyone have experience with using Netbackup to backup/restore >> GFS2 file system in production environment? ?I noticed that GFS2 is >> not on the list of file compatitbitly of netbackup 7. Is there any >> alternative backup tool for GFS2 in enterprise settings if Netbackup >> can not do it? >> >> Thanks! >> >> -Ning >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > The thing to watch out for with backup is how it affects the normal > working set of the nodes in the cluster. Often you'll get much better > performance with backup if you write a custom set of scripts which > respects the working set on each node and which also allows the backup > to proceed in parallel across the filesystem. > > What makes a good backup solution depends on the application and the I/O > pattern, so it is tricky to make any generic suggestions, > > Steve. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From gianluca.cecchi at gmail.com Thu Mar 3 16:16:06 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Thu, 3 Mar 2011 17:16:06 +0100 Subject: [Linux-cluster] Info on vm definitions and options in stable3 Message-ID: Hello, in stable 3 I can have this kind of config for a KVM virtual machine to manage live migration: It works ok, but I would like to know the possible parameters I can set. At http://sources.redhat.com/cluster/wiki/VirtualMachineBehaviors I can see this piece "..Most of the behaviors are common with normal services.." with a reference to start, stop, status monitoring, relocation, recovery Where could I find a complete list? For example are failover domains usable inside the line? Or autostart option? I know how to manage autostart in a standalone virt-manager environment, but when in a cluster of hosts? Or dependency lines such as < vm name="vm1" ... > to power on vm2 only after power on of vm1? About "transient domain support": In the stable3 implementation of rhel6 (or in general in stable 3 if it applies generally) a line such as this: where /etc/libvirt/qemu/myvm.xm is not on a shared path, is it supposed that if I have myvm on node 1 and run clusvcadm -M vm:myvm -m node2 the file is deleted from node 1 and created in node 2 automatically or not? 
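One hedged way to get at the "complete list" part of the question, assuming the stable3 rgmanager agents sit in the usual /usr/share/cluster location: each agent, including vm.sh, prints the parameters it accepts as metadata, and rg_test can be pointed at a cluster.conf that uses them (the exact attribute names are whatever the metadata on your build reports):

  # list the attributes the vm resource agent itself advertises
  /usr/share/cluster/vm.sh meta-data | less
  # sanity-check the cluster.conf containing the vm definition
  rg_test test /etc/cluster/cluster.conf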
Thanks in advance, Gianluca From vmutu at pcbi.upenn.edu Thu Mar 3 16:50:57 2011 From: vmutu at pcbi.upenn.edu (Valeriu Mutu) Date: Thu, 3 Mar 2011 11:50:57 -0500 Subject: [Linux-cluster] clvmd hangs on startup In-Reply-To: <64D0546C5EBBD147B75DE133D798665F0855C290@hugo.eprize.local> References: <20110302215050.GD10674@bsdera.pcbi.upenn.edu> <64D0546C5EBBD147B75DE133D798665F0855C290@hugo.eprize.local> Message-ID: <20110303165056.GF10674@bsdera.pcbi.upenn.edu> On Wed, Mar 02, 2011 at 05:36:45PM -0500, Jeff Sturm wrote: > Double-check that the 2nd node can read and write the shared iSCSI > storage. Reading/writing from/to the iSCSI storage device works as seen below. On the 1st node: [root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes 10000+0 records in 10000+0 records out 10240000 bytes (10 MB) copied, 3.39855 seconds, 3.0 MB/s [root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null 10000+0 records in 10000+0 records out 10240000 bytes (10 MB) copied, 0.331069 seconds, 30.9 MB/s On the 2nd node: [root at vm2 ~]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes 10000+0 records in 10000+0 records out 10240000 bytes (10 MB) copied, 3.2465 seconds, 3.2 MB/s [root at vm2 ~]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null 10000+0 records in 10000+0 records out 10240000 bytes (10 MB) copied, 0.223337 seconds, 45.8 MB/s -- Valeriu Mutu From sklemer at gmail.com Thu Mar 3 17:25:12 2011 From: sklemer at gmail.com (=?UTF-8?B?16nXnNeV150g16fXnNee16g=?=) Date: Thu, 3 Mar 2011 19:25:12 +0200 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: References: Message-ID: H. Maybe iLo3 dont support cycle. why not to use the default , which is "onoff" . Try it I think its good enough . -M method Method to fence (onoff or cycle). Default is onoff. Use cycle in case your management card will power off with default method so there will be no chance to power machine on by IPMI. On Thu, Mar 3, 2011 at 3:53 PM, Mika i wrote: > this works: > ipmitool -H xxxxxilo -I lanplus -U admin -P xxxxxx chassis power cycle > Server is rebooted..... > > but not this: > root at fff fence_ipmilan -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v > Rebooting machine @ IPMI:xxixxxxxilo...Spawning: '/usr/bin/ipmitool -I lan > -H 'xxxxxxx' -U 'admin' -P 'xxxxx41!' -v chassis power status'... > Spawning: '/usr/bin/ipmitool -I lan -H 'xxxxxxx' -U 'admin' -P 'xxxxxx1!' > -v chassis power cycle'... > Failed > > cluster.conf > .... > > name="xxxxxx_xxxxdev" timeout="20"/> > ...... > > login="admin" method="cycle" name="xxxxuxx32_fencedev" passwd="xxxxx!"/> > > > > > > > 2011/3/3 ???? ???? > > Hi. >> for ilo3 testing you can use : >> # fence_ipmilan -a 17x.3x.7x.1xx -p "password" -o status >> >> fence_ipmilan -h >> usage: fence_ipmilan >> -A IPMI Lan Auth type (md5, password, or none) >> -a IPMI Lan IP to talk to >> -i IPMI Lan IP to talk to (deprecated, use -a) >> -p Password (if required) to control power on >> IPMI device >> -P Use Lanplus >> -S Script to retrieve password (if required) >> -l Username/Login (if required) to control power >> on IPMI device >> -o Operation to perform. 
>> Valid operations: on, off, reboot, status >> -t Timeout (sec) for IPMI operation (default 20) >> -C Ciphersuite to use (same as ipmitool -C parameter) >> -M Method to fence (onoff or cycle (default onoff) >> -V Print version and exit >> -v Verbose mode >> >> If no options are specified, the following options will be read >> from standard input (one per line): >> >> auth= Same as -A >> ipaddr=<#> Same as -a >> passwd= Same as -p >> passwd_script= Same as -S >> lanplus Same as -P >> login= Same as -u >> option= Same as -o >> operation= Same as -o >> action= Same as -o >> timeout= Same as -t >> cipher= Same as -C >> method= Same as -M >> verbose Same as -v >> >> >> On Thu, Mar 3, 2011 at 12:21 PM, Mika i wrote: >> >>> hmm. >>> Okey: have someone good installation instructions to get this fence_ipmilan to >>> work. >>> 1. active IPMI/DCMI over LAN in iLo3 >>> 2. what should i install in server to get fence_ipmilan to work >>> Now if i test connection it shows like this. >>> >>> ipmitool -v -H 17x.3x.7x.1xx -I lanplus -U admin mc info >>> Password: >>> Get Auth Capabilities error >>> Get Auth Capabilities error >>> Error issuing Get Channel Authentication Capabilies request >>> Error: Unable to establish IPMI v2 / RMCP+ session >>> Get Device ID command failed >>> >>> Can someone help me! >>> 2011/3/2 >>> >>>> >>>> >>>> linux-cluster-bounces at redhat.com wrote on 03/02/2011 10:07:38 AM: >>>> >>>> > Hi >>>> > >>>> > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and >>>> iLo3 >>>> > I have now in both clusters rhel 5.5 version with: >>>> > kernel 2.6.18-194.el5 >>>> > cman-2.0.115-68.el5_6.1 >>>> > >>>> > But in fence state i get allways message: Unable to connect/login to >>>> > fencing device >>>> > >>>> > Any help - or must i update the cluster to rhel 5.6? >>>> >>>> What fence agent are you using, fence_ilo? >>>> You will need to use the fence_ipmilan agent for iLO3. >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From raju.rajsand at gmail.com Thu Mar 3 17:56:27 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Thu, 3 Mar 2011 23:26:27 +0530 Subject: [Linux-cluster] SNMP support with IBM Blade Center Fence Agent In-Reply-To: References: <20110228161406.GA14120@redhat.com> Message-ID: Greetings, On 3/1/11, Parvez Shaikh wrote: > Hi Ryan, > > > What is recommended method to deal with blade center fencing failure in > this situation? Do I have to add another level of fencing(between blade > center and manual) which can fence automatically(not requiring manual > interference)? > IIRC, I had touched upon similar fencing post some time back. AFAIK, Manual fencing is not supported by Redhat. Having said that, manual fencing has no place in production. At best it is ok for PHP's salestalk POC. The short answer: Two levels of fencing is NOT possible in blades within the same enclosure. 
Two levels of fencing is possible in two blades housed in two different enclosures provided it does not bother other servers when you yank the power chords from the enclosure. Let me explain with another example. The solution you are trying to achieve, is possible in 2 individual physical servers in rack. One fencing level would be the management port which I would call (for the sake of this post) as "in-band" fencing device. Second would be Power fencing using power strips similar to: http://www.apc.com/products/family/index.cfm?id=70 (Disclaimer: I get zilch from any vendor for that matter so you can pick the most buttery vendor you are comfortable with) Now this I call Power fencing or "out-of-band" fencing. Now You have the the members of clluster across racks/continents (with a unbreakable redundant datalinks and power control links) and we can talk about two layer fencing. First layer or level would be the in-band fencing using IPMI/Bladecenter management port/DRAC/ILO/ALOM/RSA etc. Second would be from the power control network which would yank the power chord, as it were, off the server. So we are possibly talking about three vlans here: power control vlan, in-band vlan and data vlan. Hope it is clear now to you now. And Parvez, Yes, I do happen to know couple of HA fundas -- I have deployed and managed few RHCS clusters in the past. But then enclosure are SPOF if the member nodes are in the same enclosure anyway. phew!! Regards, Rajagopal From mika68vaan at gmail.com Thu Mar 3 19:44:18 2011 From: mika68vaan at gmail.com (Mika i) Date: Thu, 3 Mar 2011 21:44:18 +0200 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: References: Message-ID: When i added -P like down here: fence_ipmilan -P -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v Everything works, server reboots. But how do i get this "-P" option included in fence_ipmilan. How should my cluster.conf look like..that's the question... 2011/3/3 ???? ???? > H. > > Maybe iLo3 dont support cycle. why not to use the default , which is > "onoff" . Try it > > I think its good enough . > > -M method > Method to fence (onoff or cycle). Default is onoff. Use cycle > in > case your management card will power off with default method > so > there will be no chance to power machine on by IPMI. > > > On Thu, Mar 3, 2011 at 3:53 PM, Mika i wrote: > >> this works: >> ipmitool -H xxxxxilo -I lanplus -U admin -P xxxxxx chassis power cycle >> Server is rebooted..... >> >> but not this: >> root at fff fence_ipmilan -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v >> Rebooting machine @ IPMI:xxixxxxxilo...Spawning: '/usr/bin/ipmitool -I lan >> -H 'xxxxxxx' -U 'admin' -P 'xxxxx41!' -v chassis power status'... >> Spawning: '/usr/bin/ipmitool -I lan -H 'xxxxxxx' -U 'admin' -P 'xxxxxx1!' >> -v chassis power cycle'... >> Failed >> >> cluster.conf >> .... >> >> > name="xxxxxx_xxxxdev" timeout="20"/> >> ...... >> >> > login="admin" method="cycle" name="xxxxuxx32_fencedev" passwd="xxxxx!"/> >> >> >> >> >> >> >> 2011/3/3 ???? ???? >> >> Hi. 
>>> for ilo3 testing you can use : >>> # fence_ipmilan -a 17x.3x.7x.1xx -p "password" -o status >>> >>> fence_ipmilan -h >>> usage: fence_ipmilan >>> -A IPMI Lan Auth type (md5, password, or none) >>> -a IPMI Lan IP to talk to >>> -i IPMI Lan IP to talk to (deprecated, use -a) >>> -p Password (if required) to control power on >>> IPMI device >>> -P Use Lanplus >>> -S Script to retrieve password (if required) >>> -l Username/Login (if required) to control power >>> on IPMI device >>> -o Operation to perform. >>> Valid operations: on, off, reboot, status >>> -t Timeout (sec) for IPMI operation (default 20) >>> -C Ciphersuite to use (same as ipmitool -C parameter) >>> -M Method to fence (onoff or cycle (default onoff) >>> -V Print version and exit >>> -v Verbose mode >>> >>> If no options are specified, the following options will be read >>> from standard input (one per line): >>> >>> auth= Same as -A >>> ipaddr=<#> Same as -a >>> passwd= Same as -p >>> passwd_script= Same as -S >>> lanplus Same as -P >>> login= Same as -u >>> option= Same as -o >>> operation= Same as -o >>> action= Same as -o >>> timeout= Same as -t >>> cipher= Same as -C >>> method= Same as -M >>> verbose Same as -v >>> >>> >>> On Thu, Mar 3, 2011 at 12:21 PM, Mika i wrote: >>> >>>> hmm. >>>> Okey: have someone good installation instructions to get this fence_ipmilan to >>>> work. >>>> 1. active IPMI/DCMI over LAN in iLo3 >>>> 2. what should i install in server to get fence_ipmilan to work >>>> Now if i test connection it shows like this. >>>> >>>> ipmitool -v -H 17x.3x.7x.1xx -I lanplus -U admin mc info >>>> Password: >>>> Get Auth Capabilities error >>>> Get Auth Capabilities error >>>> Error issuing Get Channel Authentication Capabilies request >>>> Error: Unable to establish IPMI v2 / RMCP+ session >>>> Get Device ID command failed >>>> >>>> Can someone help me! >>>> 2011/3/2 >>>> >>>>> >>>>> >>>>> linux-cluster-bounces at redhat.com wrote on 03/02/2011 10:07:38 AM: >>>>> >>>>> > Hi >>>>> > >>>>> > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and >>>>> iLo3 >>>>> > I have now in both clusters rhel 5.5 version with: >>>>> > kernel 2.6.18-194.el5 >>>>> > cman-2.0.115-68.el5_6.1 >>>>> > >>>>> > But in fence state i get allways message: Unable to connect/login to >>>>> > fencing device >>>>> > >>>>> > Any help - or must i update the cluster to rhel 5.6? >>>>> >>>>> What fence agent are you using, fence_ilo? >>>>> You will need to use the fence_ipmilan agent for iLO3. >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Jason_Henderson at Mitel.com Thu Mar 3 20:18:29 2011 From: Jason_Henderson at Mitel.com (Jason_Henderson at Mitel.com) Date: Thu, 3 Mar 2011 15:18:29 -0500 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: Message-ID: linux-cluster-bounces at redhat.com wrote on 03/03/2011 02:44:18 PM: > When i added -P like down here: > fence_ipmilan -P -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v > Everything works, server reboots. But how do i get this "-P" option > included in fence_ipmilan. > How should my cluster.conf look like..that's the question... Here is an example with passwords removed: > 2011/3/3 ???? ???? > H. > > Maybe iLo3 dont support cycle. why not to use the default , which is > "onoff" . Try it > > I think its good enough . > > -M method > Method to fence (onoff or cycle). Default is onoff. Use cycle in > case your management card will power off with defaultmethod so > there will be no chance to power machine on by IPMI. > > On Thu, Mar 3, 2011 at 3:53 PM, Mika i wrote: > this works: > ipmitool -H xxxxxilo -I lanplus -U admin -P xxxxxx chassis power cycle > Server is rebooted..... > > but not this: > root at fff fence_ipmilan -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v > Rebooting machine @ IPMI:xxixxxxxilo...Spawning: '/usr/bin/ipmitool > -I lan -H 'xxxxxxx' -U 'admin' -P 'xxxxx41!' -v chassis power status'... > Spawning: '/usr/bin/ipmitool -I lan -H 'xxxxxxx' -U 'admin' -P > 'xxxxxx1!' -v chassis power cycle'... > Failed > > cluster.conf > .... > > name="xxxxxx_xxxxdev" timeout="20"/> > ...... > > login="admin" method="cycle" name="xxxxuxx32_fencedev" passwd="xxxxx!"/> > > > > > > > 2011/3/3 ???? ???? > > Hi. > for ilo3 testing you can use : > # fence_ipmilan -a 17x.3x.7x.1xx -p "password" -o status > > fence_ipmilan -h > usage: fence_ipmilan > -A IPMI Lan Auth type (md5, password, or none) > -a IPMI Lan IP to talk to > -i IPMI Lan IP to talk to (deprecated, use -a) > -p Password (if required) to control power on > IPMI device > -P Use Lanplus > -S Script to retrieve password (if required) > -l Username/Login (if required) to control power > on IPMI device > -o Operation to perform. > Valid operations: on, off, reboot, status > -t Timeout (sec) for IPMI operation (default 20) > -C Ciphersuite to use (same as ipmitool -C parameter) > -M Method to fence (onoff or cycle (default onoff) > -V Print version and exit > -v Verbose mode > > If no options are specified, the following options will be read > from standard input (one per line): > > auth= Same as -A > ipaddr=<#> Same as -a > passwd= Same as -p > passwd_script= Same as -S > lanplus Same as -P > login= Same as -u > option= Same as -o > operation= Same as -o > action= Same as -o > timeout= Same as -t > cipher= Same as -C > method= Same as -M > verbose Same as -v > > On Thu, Mar 3, 2011 at 12:21 PM, Mika i wrote: > hmm. > Okey: have someone good installation instructions to get this > fence_ipmilan to work. > 1. active IPMI/DCMI over LAN in iLo3 > 2. what should i install in server to get fence_ipmilan to work > Now if i test connection it shows like this. > > ipmitool -v -H 17x.3x.7x.1xx -I lanplus -U admin mc info > Password: > Get Auth Capabilities error > Get Auth Capabilities error > Error issuing Get Channel Authentication Capabilies request > Error: Unable to establish IPMI v2 / RMCP+ session > Get Device ID command failed > > Can someone help me! 
> 2011/3/2 > > > linux-cluster-bounces at redhat.com wrote on 03/02/2011 10:07:38 AM: > > > Hi > > > > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and iLo3 > > I have now in both clusters rhel 5.5 version with: > > kernel 2.6.18-194.el5 > > cman-2.0.115-68.el5_6.1 > > > > But in fence state i get allways message: Unable to connect/login to > > fencing device > > > > Any help - or must i update the cluster to rhel 5.6? > What fence agent are you using, fence_ilo? > You will need to use the fence_ipmilan agent for iLO3. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlopmart at gmail.com Thu Mar 3 21:27:27 2011 From: carlopmart at gmail.com (carlopmart) Date: Thu, 03 Mar 2011 22:27:27 +0100 Subject: [Linux-cluster] How can I change status check for a script?? Message-ID: <4D7007BF.3000105@gmail.com> Hi all, How can I change status interval for a certain service?? I have tried to insert: under a service without luck. I am using rgmanager-3.0.12-10.el6.i686 and cman-3.0.12-23.el6_0.4.i686 under two RHEL6 hosts. Is it only possible to accomplish this changing /usr/share/cluster/script.sh directly?? Thanks. -- CL Martinez carlopmart {at} gmail {d0t} com From gregory.lee.bartholomew at gmail.com Thu Mar 3 21:29:22 2011 From: gregory.lee.bartholomew at gmail.com (Gregory Bartholomew) Date: Thu, 03 Mar 2011 15:29:22 -0600 Subject: [Linux-cluster] Error: "ailed actions: dlm:1_monitor_0/gfs-control:1_monitor_0 ... not installed". Message-ID: <4D700832.9040305@gmail.com> Hi, I'm trying to follow the "clusters from scratch" guide and I'm running Fedora 14. When I try to add the DLM and GFS2 services, crm_mon keeps reporting "Failed actions: dlm:1_monitor_0/gfs-control:1_monitor_0 ... not installed". Does anyone know what I'm missing? Thanks, gb From omerfsen at gmail.com Thu Mar 3 22:08:41 2011 From: omerfsen at gmail.com (Omer Faruk SEN) Date: Fri, 4 Mar 2011 00:08:41 +0200 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: References: Message-ID: See https://access.redhat.com/kb/docs/DOC-39336 2011/3/3 : > > > linux-cluster-bounces at redhat.com wrote on 03/03/2011 02:44:18 PM: > >> When i added -P like?down?here: >> fence_ipmilan -P -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v >> Everything works, server reboots. But how do i get this "-P" option >> included in fence_ipmilan. >> How should my cluster.conf look like..that's the question... > > Here is an example with passwords removed: > > > > ? > ? > ? > > ? > ? ? > ? ? > ? ? > ? > > ? > ? ? > ? ? ? > ? ? ? ? > ? ? ? ? ? login="user" ipaddr="10.39.170.233"/> > ? ? ? ? > ? ? ? > ? ? > ? ? > ? ? ? > ? ? ? ? > ? ? ? ? ? login="user" ipaddr="10.39.170.234"/> > ? ? ? ? > ? ? ? > ? ? > ? > > > > >> 2011/3/3 ???? ???? 
>> H. >> >> Maybe iLo3 dont support cycle. why not to use the default , which is >> "onoff" . Try it >> >> I think its good enough . >> >> -M method >> ?? ? ? ? ? ? ?Method to fence (onoff or cycle). Default is onoff. Use >> cycle in >> ?? ? ? ? ? ? ?case ?your management card will power off with defaultmethod >> so >> ?? ? ? ? ? ? ?there will be no chance to power machine on by IPMI. >> >> On Thu, Mar 3, 2011 at 3:53 PM, Mika i wrote: >> this works: >> ipmitool -H xxxxxilo -I lanplus -U admin -P xxxxxx chassis power cycle >> Server is rebooted..... >> >> but not this: >> root at fff fence_ipmilan -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v >> Rebooting machine @ IPMI:xxixxxxxilo...Spawning: '/usr/bin/ipmitool >> -I lan -H 'xxxxxxx' -U 'admin' -P 'xxxxx41!' -v chassis power status'... >> Spawning: '/usr/bin/ipmitool -I lan -H 'xxxxxxx' -U 'admin' -P >> 'xxxxxx1!' -v chassis power cycle'... >> Failed >> >> cluster.conf >> .... >> ??????????????????????????????? >> ??????????????????????????????????????? > name="xxxxxx_xxxxdev" timeout="20"/> >> ...... >> >> > login="admin" method="cycle" name="xxxxuxx32_fencedev" passwd="xxxxx!"/> >> >> >> >> >> >> >> 2011/3/3 ???? ???? >> >> Hi. >> for ilo3 testing you can use : >> #?fence_ipmilan -a?17x.3x.7x.1xx -p "password" -o status >> >> ?fence_ipmilan -h >> usage: fence_ipmilan >> ?? -A ?IPMI Lan Auth type (md5, password, or none) >> ?? -a ? ?IPMI Lan IP to talk to >> ?? -i ? ?IPMI Lan IP to talk to (deprecated, use -a) >> ?? -p ?Password (if required) to control power on >> ?? ? ? ? ? ? ? ? ?IPMI device >> ?? -P ? ? ? ? ? ? Use Lanplus >> ?? -S ? ? ?Script to retrieve password (if required) >> ?? -l ? ? Username/Login (if required) to control power >> ?? ? ? ? ? ? ? ? ?on IPMI device >> ?? -o ? ? ? ?Operation to perform. >> ?? ? ? ? ? ? ? ? ?Valid operations: on, off, reboot, status >> ?? -t ? Timeout (sec) for IPMI operation (default 20) >> ?? -C ? ?Ciphersuite to use (same as ipmitool -C parameter) >> ?? -M ? ?Method to fence (onoff or cycle (default onoff) >> ?? -V ? ? ? ? ? ? Print version and exit >> ?? -v ? ? ? ? ? ? Verbose mode >> >> If no options are specified, the following options will be read >> from standard input (one per line): >> >> ?? auth= ? ? ? ? ? Same as -A >> ?? ipaddr=<#> ? ? ? ? ? ?Same as -a >> ?? passwd= ? ? ? ? Same as -p >> ?? passwd_script= ?Same as -S >> ?? lanplus ? ? ? ? ? ? ? Same as -P >> ?? login= ? ? ? ? Same as -u >> ?? option= ? ? ? ? ? Same as -o >> ?? operation= ? ? ? ?Same as -o >> ?? action= ? ? ? ? ? Same as -o >> ?? timeout= ? ? Same as -t >> ?? cipher= ? ? ? Same as -C >> ?? method= ? ? ? Same as -M >> ?? verbose ? ? ? ? ? ? ? Same as -v >> >> On Thu, Mar 3, 2011 at 12:21 PM, Mika i wrote: >> hmm. >> Okey: have someone good installation instructions to get this >> fence_ipmilan?to work. >> 1. active IPMI/DCMI over LAN in iLo3 >> 2. what should i install in server to get fence_ipmilan to work >> Now if i test connection it shows like this. >> >> ipmitool -v -H 17x.3x.7x.1xx -I lanplus -U admin mc info >> Password: >> Get Auth Capabilities error >> Get Auth Capabilities error >> Error issuing Get Channel Authentication Capabilies request >> Error: Unable to establish IPMI v2 / RMCP+ session >> Get Device ID command failed >> >> Can someone help me! 
>> 2011/3/2 >> >> >> linux-cluster-bounces at redhat.com wrote on 03/02/2011 10:07:38 AM: >> >> > Hi >> > >> > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and iLo3 >> > I have now in both clusters?rhel 5.5 version with: >> > kernel 2.6.18-194.el5 >> > cman-2.0.115-68.el5_6.1 >> > >> > But in fence state i get allways message: Unable to connect/login to >> > fencing device >> > >> > Any help - or?must i update the?cluster to rhel 5.6? > >> What fence agent are you using, fence_ilo? >> You will need to use the fence_ipmilan agent for iLO3. >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From mika68vaan at gmail.com Fri Mar 4 11:14:40 2011 From: mika68vaan at gmail.com (Mika i) Date: Fri, 4 Mar 2011 13:14:40 +0200 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: References: Message-ID: Hi and thanks to all, i get this worked... this needed to active in cluster.conf, then everything started to work. power_wait="15" -Mika 2011/3/4 Omer Faruk SEN > See https://access.redhat.com/kb/docs/DOC-39336 > > 2011/3/3 : > > > > > > linux-cluster-bounces at redhat.com wrote on 03/03/2011 02:44:18 PM: > > > >> When i added -P like down here: > >> fence_ipmilan -P -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v > >> Everything works, server reboots. But how do i get this "-P" option > >> included in fence_ipmilan. > >> How should my cluster.conf look like..that's the question... > > > > Here is an example with passwords removed: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > login="user" ipaddr="10.39.170.233"/> > > > > > > > > > > > > > > > login="user" ipaddr="10.39.170.234"/> > > > > > > > > > > > > > > > > > >> 2011/3/3 ???? ???? > >> H. > >> > >> Maybe iLo3 dont support cycle. why not to use the default , which is > >> "onoff" . Try it > >> > >> I think its good enough . > >> > >> -M method > >> Method to fence (onoff or cycle). Default is onoff. Use > >> cycle in > >> case your management card will power off with > defaultmethod > >> so > >> there will be no chance to power machine on by IPMI. > >> > >> On Thu, Mar 3, 2011 at 3:53 PM, Mika i wrote: > >> this works: > >> ipmitool -H xxxxxilo -I lanplus -U admin -P xxxxxx chassis power cycle > >> Server is rebooted..... > >> > >> but not this: > >> root at fff fence_ipmilan -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' > -v > >> Rebooting machine @ IPMI:xxixxxxxilo...Spawning: '/usr/bin/ipmitool > >> -I lan -H 'xxxxxxx' -U 'admin' -P 'xxxxx41!' -v chassis power status'... > >> Spawning: '/usr/bin/ipmitool -I lan -H 'xxxxxxx' -U 'admin' -P > >> 'xxxxxx1!' -v chassis power cycle'... > >> Failed > >> > >> cluster.conf > >> .... 
> >> > >> >> name="xxxxxx_xxxxdev" timeout="20"/> > >> ...... > >> > >> >> login="admin" method="cycle" name="xxxxuxx32_fencedev" passwd="xxxxx!"/> > >> > >> > >> > >> > >> > >> > >> 2011/3/3 ???? ???? > >> > >> Hi. > >> for ilo3 testing you can use : > >> # fence_ipmilan -a 17x.3x.7x.1xx -p "password" -o status > >> > >> fence_ipmilan -h > >> usage: fence_ipmilan > >> -A IPMI Lan Auth type (md5, password, or none) > >> -a IPMI Lan IP to talk to > >> -i IPMI Lan IP to talk to (deprecated, use -a) > >> -p Password (if required) to control power on > >> IPMI device > >> -P Use Lanplus > >> -S Script to retrieve password (if required) > >> -l Username/Login (if required) to control power > >> on IPMI device > >> -o Operation to perform. > >> Valid operations: on, off, reboot, status > >> -t Timeout (sec) for IPMI operation (default 20) > >> -C Ciphersuite to use (same as ipmitool -C parameter) > >> -M Method to fence (onoff or cycle (default onoff) > >> -V Print version and exit > >> -v Verbose mode > >> > >> If no options are specified, the following options will be read > >> from standard input (one per line): > >> > >> auth= Same as -A > >> ipaddr=<#> Same as -a > >> passwd= Same as -p > >> passwd_script= Same as -S > >> lanplus Same as -P > >> login= Same as -u > >> option= Same as -o > >> operation= Same as -o > >> action= Same as -o > >> timeout= Same as -t > >> cipher= Same as -C > >> method= Same as -M > >> verbose Same as -v > >> > >> On Thu, Mar 3, 2011 at 12:21 PM, Mika i wrote: > >> hmm. > >> Okey: have someone good installation instructions to get this > >> fence_ipmilan to work. > >> 1. active IPMI/DCMI over LAN in iLo3 > >> 2. what should i install in server to get fence_ipmilan to work > >> Now if i test connection it shows like this. > >> > >> ipmitool -v -H 17x.3x.7x.1xx -I lanplus -U admin mc info > >> Password: > >> Get Auth Capabilities error > >> Get Auth Capabilities error > >> Error issuing Get Channel Authentication Capabilies request > >> Error: Unable to establish IPMI v2 / RMCP+ session > >> Get Device ID command failed > >> > >> Can someone help me! > >> 2011/3/2 > >> > >> > >> linux-cluster-bounces at redhat.com wrote on 03/02/2011 10:07:38 AM: > >> > >> > Hi > >> > > >> > Is there a way to get cluster-suite Fence to work with Rhel 5.5 and > iLo3 > >> > I have now in both clusters rhel 5.5 version with: > >> > kernel 2.6.18-194.el5 > >> > cman-2.0.115-68.el5_6.1 > >> > > >> > But in fence state i get allways message: Unable to connect/login to > >> > fencing device > >> > > >> > Any help - or must i update the cluster to rhel 5.6? > > > >> What fence agent are you using, fence_ilo? > >> You will need to use the fence_ipmilan agent for iLO3. 
> >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Fri Mar 4 17:06:48 2011 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 4 Mar 2011 12:06:48 -0500 Subject: [Linux-cluster] Service location (colocation) In-Reply-To: <4D6A7709.6060108@gmail.com> References: <4D6A7709.6060108@gmail.com> Message-ID: <20110304170647.GC14803@redhat.com> On Sun, Feb 27, 2011 at 06:08:41PM +0200, Budai Laszlo wrote: > Hi all, > > is there a way to define location dependencies among services? for > instance how can I define that Service A should run on the same node as > service B? Or the opposite: Service C should run on a different node > than service D? > rgmanager doesn't have this feature built-in; you can define 'collocated services' by simply creating one large service comprising all of the resources for both services. You could probably trivially extend central_processing mode to do "anti collocation" (i.e. run on another node). The 'follow_service.sl' script is an example of how to do part of 'anti-collocation'. The way it works, it starts service A on a different node from service B. If the node running service A fails, it is started on the same node as service B, then service B is moved away to another (empty, usually) node in the cluster. Alternatively, pacemaker supports this functionality. -- Lon Hohberger - Red Hat, Inc. From lhh at redhat.com Fri Mar 4 17:08:54 2011 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 4 Mar 2011 12:08:54 -0500 Subject: [Linux-cluster] GFS2 In-Reply-To: <4D6AED6F.9000203@gmail.com> References: <4D6AED6F.9000203@gmail.com> Message-ID: <20110304170853.GD14803@redhat.com> On Mon, Feb 28, 2011 at 02:33:51AM +0200, Budai Laszlo wrote: > Hi all, > > in which version of RHEL GFS2 is considered production ready? 5.3? > RHEL 5.3 and later, gfs2 moved in to 'full support'. However, 5.3 EUS (async errata, updates/fixes/etc) closed in January, so ideally, you should move to RHEL 5.6 - which has loads of fixes for gfs2 compared to RHEL 5.3. -- Lon Hohberger - Red Hat, Inc. 
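As a rough sketch of the "one large service" approach to collocation described a couple of messages back (the service name, IP address and script paths below are made-up placeholders, not taken from anyone's configuration):

    <service name="svcA-and-svcB" autostart="1">
        <ip address="10.0.0.50" monitor_link="1"/>
        <script name="serviceA" file="/etc/init.d/service-a"/>
        <script name="serviceB" file="/etc/init.d/service-b"/>
    </service>

Because both scripts live inside the same <service>, rgmanager starts, stops and relocates them together, so they always land on the same node.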
From lhh at redhat.com Fri Mar 4 17:15:45 2011 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 4 Mar 2011 12:15:45 -0500 Subject: [Linux-cluster] SNMP support with IBM Blade Center Fence Agent In-Reply-To: References: <20110228161406.GA14120@redhat.com> Message-ID: <20110304171545.GE14803@redhat.com> On Tue, Mar 01, 2011 at 06:50:18PM +0530, Parvez Shaikh wrote: > Hi Ryan, > > Thank you for response. Does it mean there is no way to intimate > administrator about failure of fencing as of now? > > Let me give more information about my cluster - > > I have set of nodes in cluster with only IP resource being protected. I have > two levels of fencing, first bladecenter fencing and second one is manual > fencing. If the problem you have with fence_bladecenter is intermittent - for example, if it fails 1/2 the time, fence_manual is going to *detract* from your cluster's ability to recover automatically. Ordinarily, if a fencing action fails, fenced will automatically retry the operation. When you configure fence_manual as a backup, this retry will *never* occur, meaning your cluster hangs. > At times if machine is already down(either power failure or turned off > abrupty); blade center fencing timesout and manual fencing happens. At this > time, administrator is expected to run fence_ack_manual. > Clearly this is not something which is desirable, as downtime of services is > as long as administrator runs fence_ack_manual. > What is recommended method to deal with blade center fencing failure in > this situation? Do I have to add another level of fencing(between blade > center and manual) which can fence automatically(not requiring manual > interference)? Start with removing fence_manual. If fencing is failing (permanently), you can still run: fence_ack_manual -e -n > > > my bladecenter fencing agent, I sometimes get message saying bladecenter > > > fencing failed because of timeout or fence device IP address/user > > > credentials are incorrect. ^^ This is why I think fence_manual is, in your specific case, very likely hurting your availability. -- Lon Hohberger - Red Hat, Inc. From lhh at redhat.com Fri Mar 4 17:18:22 2011 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 4 Mar 2011 12:18:22 -0500 Subject: [Linux-cluster] Nodes are not joining to the cluster In-Reply-To: References: <852499.20065.qm@web112812.mail.gq1.yahoo.com> Message-ID: <20110304171822.GF14803@redhat.com> On Thu, Mar 03, 2011 at 11:20:37AM +0100, Seb wrote: > [snip]config[/snip] > > There is no section in your config file? > Have you been able to identify a quorum disk on the nodes? Small nitpick - I'd really recommend against even trying to start qdiskd / use a quorum disk in a 16 node cluster. -- Lon Hohberger - Red Hat, Inc. From lhh at redhat.com Fri Mar 4 17:20:40 2011 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 4 Mar 2011 12:20:40 -0500 Subject: [Linux-cluster] iLo3 and RedHat 5.5 : Unable to connect/login to fencing device In-Reply-To: References: Message-ID: <20110304172040.GG14803@redhat.com> On Thu, Mar 03, 2011 at 03:53:32PM +0200, Mika i wrote: > this works: > ipmitool -H xxxxxilo -I lanplus -U admin -P xxxxxx chassis power cycle > Server is rebooted..... > > but not this: > root at fff fence_ipmilan -a xxxxxxxxlo -l admin -p xxxxxxxxx -M 'cycle' -v You forgot -P (for lanplus) -- Lon Hohberger - Red Hat, Inc. 
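Pulling the working pieces of this iLO3 thread together, a minimal sketch might look like the following — the address, login and password are placeholders, and the power_wait value is a site-specific setting to tune, not a recommendation:

    # command-line test, with lanplus enabled via -P:
    fence_ipmilan -P -a ilo3.example.com -l admin -p secret -o status -v

    # roughly equivalent cluster.conf fence device; lanplus="1" corresponds to -P:
    <fencedevice agent="fence_ipmilan" name="node1_ilo3" ipaddr="ilo3.example.com"
                 login="admin" passwd="secret" lanplus="1" power_wait="15"/>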
From parvez.h.shaikh at gmail.com Fri Mar 4 17:45:07 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Fri, 4 Mar 2011 23:15:07 +0530 Subject: [Linux-cluster] SNMP support with IBM Blade Center Fence Agent In-Reply-To: <20110304171545.GE14803@redhat.com> References: <20110228161406.GA14120@redhat.com> <20110304171545.GE14803@redhat.com> Message-ID: Hi Lon, Thank you for reply. What I gathered from your response is to remove manual fencing at once. This will cause fence daemon to retry fence_bladecenter until the node is fenced. More likely the fenced will succeed in fencing the failed node(provided IP, user name and password for bladecenter management module are right); even if it times out for the first time. Am I right? I will try removing manual fencing and see how things go. >> If fencing is failing (permanently), you can still run: >> fence_ack_manual -e -n By the way as per my understanding fence_ack_manual -n can be executed to acknowledge only manually fenced node(and not bladecenter fenced node), correct me if this understanding is wrong. So God forbid, if fence_bladecenter fails for some reason; we still have option to run fence_manual and then fence_ack_manual, so cluster is back to working. Thanks again and have great weekend ahead Yours truly, Parvez On Fri, Mar 4, 2011 at 10:45 PM, Lon Hohberger wrote: > On Tue, Mar 01, 2011 at 06:50:18PM +0530, Parvez Shaikh wrote: > > Hi Ryan, > > > > Thank you for response. Does it mean there is no way to intimate > > administrator about failure of fencing as of now? > > > > Let me give more information about my cluster - > > > > I have set of nodes in cluster with only IP resource being protected. I > have > > two levels of fencing, first bladecenter fencing and second one is manual > > fencing. > > If the problem you have with fence_bladecenter is intermittent - for > example, if it fails 1/2 the time, fence_manual is going to *detract* > from your cluster's ability to recover automatically. > > Ordinarily, if a fencing action fails, fenced will automatically retry > the operation. > > When you configure fence_manual as a backup, this retry will *never* > occur, meaning your cluster hangs. > > > > At times if machine is already down(either power failure or turned off > > abrupty); blade center fencing timesout and manual fencing happens. At > this > > time, administrator is expected to run fence_ack_manual. > > > Clearly this is not something which is desirable, as downtime of services > is > > as long as administrator runs fence_ack_manual. > > > What is recommended method to deal with blade center fencing failure in > > this situation? Do I have to add another level of fencing(between blade > > center and manual) which can fence automatically(not requiring manual > > interference)? > > Start with removing fence_manual. > > If fencing is failing (permanently), you can still run: > > fence_ack_manual -e -n > > > > > my bladecenter fencing agent, I sometimes get message saying > bladecenter > > > > fencing failed because of timeout or fence device IP address/user > > > > credentials are incorrect. > > ^^ This is why I think fence_manual is, in your specific case, very > likely hurting your availability. > > -- > Lon Hohberger - Red Hat, Inc. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lhh at redhat.com Fri Mar 4 18:01:20 2011 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 4 Mar 2011 13:01:20 -0500 Subject: [Linux-cluster] Info on vm definitions and options in stable3 In-Reply-To: References: Message-ID: <20110304180120.GH14803@redhat.com> On Thu, Mar 03, 2011 at 05:16:06PM +0100, Gianluca Cecchi wrote: > Hello, > in stable 3 I can have this kind of config for a KVM virtual machine > to manage live migration: > > > > > > It works ok, but I would like to know the possible parameters I can set. > At http://sources.redhat.com/cluster/wiki/VirtualMachineBehaviors I > can see this piece > "..Most of the behaviors are common with normal services.." > with a reference to start, stop, status monitoring, relocation, recovery > > Where could I find a complete list? > For example are failover domains usable inside the line? > Or autostart option? I know how to manage autostart in a standalone > virt-manager environment, but when in a cluster of hosts? http://sources.redhat.com/cluster/wiki/ServiceOperationalBehaviors http://sources.redhat.com/cluster/wiki/ServicePolicies http://sources.redhat.com/cluster/wiki/FailoverDomains > Or dependency lines such as > > < vm name="vm1" ... > > > If you add a child of a VM, you can no longer live-migrate it. > to power on vm2 only after power on of vm1? you can use 'depend=' if you want, but rgmanager's handling of this sort of dependency is rudimentary at best: Will work, but if you stop vm2, vm1 will be stopped after vm2. > About "transient domain support": > In the stable3 implementation of rhel6 (or in general in stable 3 if > it applies generally) a line such as this: > > > > where /etc/libvirt/qemu/myvm.xm is not on a shared path, is it > supposed that if I have myvm on node 1 and run > clusvcadm -M vm:myvm -m node2 > > the file is deleted from node 1 and created in node 2 automatically or not? You need to have the description on each host in the cluster. -- Lon Hohberger - Red Hat, Inc. From lhh at redhat.com Fri Mar 4 18:04:55 2011 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 4 Mar 2011 13:04:55 -0500 Subject: [Linux-cluster] How can I change status check for a script?? In-Reply-To: <4D7007BF.3000105@gmail.com> References: <4D7007BF.3000105@gmail.com> Message-ID: <20110304180455.GI14803@redhat.com> On Thu, Mar 03, 2011 at 10:27:27PM +0100, carlopmart wrote: > Hi all, > > How can I change status interval for a certain service?? I have > tried to insert: > > > > under a service without luck. I am using > rgmanager-3.0.12-10.el6.i686 and cman-3.0.12-23.el6_0.4.i686 under > two RHEL6 hosts. Checks are per-resource; the "service" meta-resource is largely a no-op for "status"; you'd have to redefine it for each child of the service. For example: ... will effectively do nothing; you'd have to do: Additionally, you can't redefine actions in a "ref"; you must do it where the resource is defined: http://sources.redhat.com/cluster/wiki/ResourceActions -- Lon Hohberger - Red Hat, Inc. 
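The inline cluster.conf examples in the reply above appear to have been eaten by the list archiver; as a generic illustration of overriding the status action where the resource is defined (the service/script names and the 120-second interval are invented for the example):

    <service name="myservice" autostart="1">
        <script name="myscript" file="/etc/init.d/myscript">
            <!-- override the status check interval for this resource -->
            <action name="status" depth="*" interval="120"/>
        </script>
    </service>

Putting the <action> element directly under <service>, or under a <script ref="..."/>, appears to have no effect, which matches the behaviour described above.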
From iarlyy at gmail.com Fri Mar 4 18:17:51 2011 From: iarlyy at gmail.com (iarly selbir) Date: Fri, 4 Mar 2011 15:17:51 -0300 Subject: [Linux-cluster] Nodes are not joining to the cluster In-Reply-To: References: <852499.20065.qm@web112812.mail.gq1.yahoo.com> Message-ID: Can you tell me where I can found more information about this secion ( quorumd ), I'm having a similar issue, some switches is failing, in this moment the cluster is unable to check status of the nodes, the cluster hangs and on /var/log/messages repeats this message ( unable to connect to cluster... ) . Thank you so much. - - iarlyy selbir :wq! On Thu, Mar 3, 2011 at 7:20 AM, Seb wrote: > 2011/3/3 Srija > > Hi all, >> >> Here is the issue with the cluster describing below: >> >> The cluster is built with 16 nodes. All rhel5.5 86_64 bit OS. >> yesterday night two servers were rebooted and after that these >> two servers are not joining to the cluster. >> >> I was not the part of the team when it is built. and my knowledge >> regarding cluster is also little bit. >> >> Here is the scenario: >> >> - There is no quorum disks. But the person >> who has built the cluster he is telling he has executed the quorum >> from command line, [ i am not sure of that ] >> >> - The errors in the message log are showing as >> >> ccsd[24182]: Unable to connect to cluster infrastructure after 12060 >> seconds , it is a continuous error message in the log file >> >> The cluster.conf are as follows: >> > [snip]config[/snip] > > There is no section in your config file? > Have you been able to identify a quorum disk on the nodes? > > The host-priv.domain.org is in your /etc/hosts? on all nodes? > > Why have they been rebooted? for maintenance/upgrade? > > Any iptable used? > > Could you please provide the logs showing the start of the cluster service? > > >> It seems it is a very basic configuration. But at this stage more >> important >> is, to attach the two servers in the cluster environment. >> >> If more information is needed , i will provide. >> >> Any advice is appreciated. >> >> Thanks in advance >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cos at aaaaa.org Fri Mar 4 19:49:23 2011 From: cos at aaaaa.org (Ofer Inbar) Date: Fri, 4 Mar 2011 14:49:23 -0500 Subject: [Linux-cluster] rg_test for testing other resource agent functions? Message-ID: <20110304194923.GX934@mip.aaaaa.org> I can do this: sudo rg_test test /etc/cluster/cluster.conf status service [servicename] To see what happens when I run a resource agent with the "status" command line argument, in exactly the same context as RHCS would run it - using the environment variables derived from cluster.conf, and potentially running multiple resource agents or the same one more than once with different variables, depending on what resources are defined for that service. It would be very useful to be able to use a similar framework to run an arbitrary script, or an arbitrary resource agent command line option, with the same automatic expansion from cluster.conf. Unfortunately, rg_test only supports a short hardcoded set of options: stop, start, and status. 
For example, I want to add a "verify" procedure to my resource agent, that I'd like to kick off from a monitoring script on my own schedule, but I want to make sure that it is run in the same context as the resource agent's status check is normally run. I could write some separate cluster.conf parser that simulates what I think rgmanager would do, but I might get it wrong. Or rgmanager might change in a future version and I wouldn't track the change. Is there anything like rg_test that might let me do this, or has anyone patched rg_test to allow it? Something as simple as: sudo rg_test test /etc/cluster/cluster.conf [foo] service [servicename] ... where it would simply call the resource agent the same way as it does for status/start/stop, but substitute whatever command line argument I give it. Or do I have to reverse-engineer my own cluster.conf parsing to set up the environment and run the script(s) myself (duplicating what rg_test already does for status/start/stop) ? -- Cos From swap_project at yahoo.com Fri Mar 4 21:23:46 2011 From: swap_project at yahoo.com (Srija) Date: Fri, 4 Mar 2011 13:23:46 -0800 (PST) Subject: [Linux-cluster] Nodes are not joining to the cluster In-Reply-To: <20110304171822.GF14803@redhat.com> Message-ID: <19011.9401.qm@web112811.mail.gq1.yahoo.com> Hi, It will be really appreciated if you send the documentation of building cluster. I think max 16 nodes are permitable for cluster. If you think it is better to divide into two clusters that is also ok. But I need some running ( i mean without any issue) configuration to follow. There are many docs in the web, but it is difficult to follow those docs specially on cluster . Once I get a running cluster doc , after that on that basis , I can go further for enhancing the knowledge on cluster. Thanks again --- On Fri, 3/4/11, Lon Hohberger wrote: > From: Lon Hohberger > Subject: Re: [Linux-cluster] Nodes are not joining to the cluster > To: "linux clustering" > Date: Friday, March 4, 2011, 12:18 PM > On Thu, Mar 03, 2011 at 11:20:37AM > +0100, Seb wrote: > > [snip]config[/snip] > > > > There is no section in your config > file? > > Have you been able to identify a quorum disk on the > nodes? > > Small nitpick - > > I'd really recommend against even trying to start qdiskd / > use a quorum > disk in a 16 node cluster. > > -- > Lon Hohberger - Red Hat, Inc. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From pradhanparas at gmail.com Fri Mar 4 22:40:36 2011 From: pradhanparas at gmail.com (Paras pradhan) Date: Fri, 4 Mar 2011 16:40:36 -0600 Subject: [Linux-cluster] GFS2 write Message-ID: Hi. I was trying to copy a 400 GB file to a gfs2 share. It was copying at 50MB/s approx. Suddenly after copying 80% ,the rate dropped to 30KB/s and stayed like that. I tried to kill the process but could't (which is normal) and after few minutes it was killed. Then I tried it again after few minutes and it was successfully copied at 50MB/s. But then after it looks like accessing the GFS share (even ls -l /gfsmount) takes 10-15 seconds to complete. Then I rebooted this node and everything is back normal. I am really confused what has gone wrong. GFS is running with all default parameters . Thanks! Paras. 
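Not a fix, but if the slowdown happens again it may help to capture GFS2's glock state while the copy is crawling; a minimal sketch, assuming a filesystem whose debugfs directory is named mycluster:mygfs (both names are placeholders):

    # mount debugfs if it is not already mounted
    mount -t debugfs none /sys/kernel/debug 2>/dev/null

    # snapshot the glock dump for the filesystem (directory is <clustername>:<fsname>)
    cat /sys/kernel/debug/gfs2/mycluster:mygfs/glocks > /tmp/glocks.$(date +%s)

Glocks that accumulate long waiter lists in that dump usually point at the inode or resource group the node is stuck on.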
From gianluca.cecchi at gmail.com Sat Mar 5 14:03:25 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Sat, 5 Mar 2011 15:03:25 +0100 Subject: [Linux-cluster] Info on vm definitions and options in stable3 In-Reply-To: References: Message-ID: On Fri, 4 Mar 2011 13:01:20 -0500 Lon Hohberger wrote: > http://sources.redhat.com/cluster/wiki/ServiceOperationalBehaviors > http://sources.redhat.com/cluster/wiki/ServicePolicies > http://sources.redhat.com/cluster/wiki/FailoverDomains Thanks for the links Some comments: 1) http://sources.redhat.com/cluster/wiki/ServicePolicies probably to correct near the end from The above service tolerance is 3 restarts in 10 minutes. to The above service tolerance is 3 restarts in 5 minutes. 3) http://sources.redhat.com/cluster/wiki/FailoverDomains It could be useful to add at the top a comment such as the one in 1) (Note: These policies also apply to virtual machine resources.) for example something like: Note: Failover Domains concepts also apply to virtual machine resources. 2) http://sources.redhat.com/cluster/wiki/ServiceOperationalBehaviors Here the application to virtual resources is implicit due to the various references inside the page itself Cheers, Gianluca From gianluca.cecchi at gmail.com Sat Mar 5 14:30:47 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Sat, 5 Mar 2011 15:30:47 +0100 Subject: [Linux-cluster] unable to live migrate a vm in rh el 6: Migration unexpectedly failed Message-ID: I have two rh el 6 systems configured with rhcs and clvmd. General cluster services seems to be ok. As I'm not able to successfully migrate a vm through clusvcadm, I'm now downsizing the problem to direct virsh command that fails when called from clusvcadm. The guest's storage is composed by two disks that are clustered logical volumes vm definition is the xml file is the same at both hosts At first I verified correct startup on both nodes this way: - vm running on host2 with resource recovery policy set "relocate" - shutdown vm from inside its operating system - the cluster notices this and correctly restarts it on host1 - shutdown vm from inside its operating system - the cluster notices this and correctly restarts it on host1 I have also ssh equivalence in place (for the intracluster names) so that I can run from host2: [host2 ] # virsh -c qemu+ssh://intrarhev1/system list without need of password input. If I try the command used by the cluster itself (after stopping the vm from clusvcadm): # virsh migrate --live exorapr1 qemu+ssh://intrarhev1/system I receive: error: operation failed: Migration unexpectedly failed On host2: [host2 ] # virsh list Id Name State ---------------------------------- 3 exorapr1 running In messages: Mar 4 14:27:30 host2 libvirtd: 14:27:30.527: error : qemuDomainWaitForMigrationComplete:5394 : operation failed: Migration unexpectedly failed Setting this: [root at host2 libvirt]# export LIBVIRT_DEBUG=1 [root at host2 libvirt]# export LIBVIRT_LOG_OUTPUTS="1:file:/tmp/virsh.log" I get the file I'm going to attach (due to migration happening with intracluster network, the names are intrarhev1 and intrarhev2 on that LAN) It seems no more information in the file.... Any hints on further debugging? If there is not any big mistake at my side I could also open an official case, as these two systems are under subscription maintenance... Thanks in advance, Gianluca -------------- next part -------------- A non-text attachment was scrubbed... 
Name: virsh.log Type: application/octet-stream Size: 14718 bytes Desc: not available URL: From ra at ra.is Sat Mar 5 16:36:10 2011 From: ra at ra.is (Richard Allen) Date: Sat, 05 Mar 2011 11:36:10 -0500 Subject: [Linux-cluster] RHEL6 HA addon In-Reply-To: <4D66863E.3070304@alteeve.com> References: <4D667CE7.1050501@ra.is> <4D66863E.3070304@alteeve.com> Message-ID: <4D72667A.90803@ra.is> On 02/24/2011 11:24 AM, Digimer wrote: > On 02/24/2011 10:44 AM, Richard Allen wrote: >> Hi all >> >> I notice in the Release Notes for RHEL6 that many changes have been made >> to the Cluster Suite (HA Addon) but I am unable to find any mention of >> how the new suite does heartbeat. >> In previous versions the Cluster could only do heartbeats (node >> intercommunication) on one network link and for redundancy the only >> option was to use bonded network devices. >> There was a way to add a second heartbeat using altnode directives in >> the XML config file but that always felt a bit hackish and was only >> limited to only one altnode, giving two heartbeat paths. >> >> So I would like to ask how RHEL6 does this. If I have nodes with 4 10Gb >> NIC's, one connected to an admin network, another to a Database network >> and one to the Application network and the last one connected directly >> to the other node with a crossover cable, can the cluster now use all >> possible paths to communicate to the other nodes or will one of those >> paths become a single point of failure in the cluster? >> >> I'm used to using Clusters like HP's ServiceGuard where I can easily >> define which links to use as heartbeat. It can even use a serial >> connection (in a two node cluster) as a additional heartbeat and I have >> always felt this is quite a big limitation in Red Hat's cluster suite up >> to RHEL6 atleast. >> >> Thanks in advance >> Richard. > Hi Richard, > > Can I assume that you are talking about High Availability in general, > as opposed to Heartbeat specifically? If not, the rest won't be too > relevant. > > As you know, the 'altnode' parameter is how you assign a second link. > This is still the case (as is bonding to get more links, but that > requires common subnets which you don't have). > > Corosync is used as the cluster communication layer (as opposed to > openais from RHEL 5.x). It supports one or two interfaces for "totem" > communication. If the main fails, the second link will be used > automatically. However, when the main is restored, totem must be > manually moved back to the original link. > > So in short; as it was in 5, so it is in 6. That said, the 'altname' > is perfectly valid way of removing that SPF. :) > Thanks for the reply. I was reading up on this and I noticed something new in the cman(5) man page. Quote: Multi-home configuration It is quite common to use multiple ethernet adapters for cluster nodes, so they will toler- ate the failure of one link. A common way to do this is to use ethernet bonding. Alterna- tively you can get corosync to run in redundant ring mode by specifying an ?altname? for the node. This is an alternative name by which the node is known, that resolves to another IP address used on the other ethernet adapter(s). You can optionally specify a different port and/or multicast address for each altname in use. Up to 9 altnames (10 interfaces in total) can be used. Note that if you are using the DLM with cman/corosync then you MUST tell it to use SCTP as it?s communications protocol as TCP does not support multihoming. So I can use up to 9 altnames now? 
Is true, it would be fantastic :) -- Rikki. -- RHCE, RHCX, HP-UX Certified Administrator. -- Solaris 7 Certified Systems and Network Administrator. Bell Labs Unix -- Reach out and grep someone. Those who do not understand Unix are condemned to reinvent it, poorly. From swap_project at yahoo.com Sat Mar 5 22:53:51 2011 From: swap_project at yahoo.com (Srija) Date: Sat, 5 Mar 2011 14:53:51 -0800 (PST) Subject: [Linux-cluster] Nodes are not joining to the cluster In-Reply-To: <20110304171822.GF14803@redhat.com> Message-ID: <90386.57864.qm@web112805.mail.gq1.yahoo.com> Hi, Just a query , I am not very much clear so asking, > I'd really recommend against even trying to start qdiskd / > use a quorum > disk in a 16 node cluster. Did you ask not to use qdiskd /quorum disk in the 16 nodes cluster ? Thanks --- On Fri, 3/4/11, Lon Hohberger wrote: > From: Lon Hohberger > Subject: Re: [Linux-cluster] Nodes are not joining to the cluster > To: "linux clustering" > Date: Friday, March 4, 2011, 12:18 PM > On Thu, Mar 03, 2011 at 11:20:37AM > +0100, Seb wrote: > > [snip]config[/snip] > > > > There is no section in your config > file? > > Have you been able to identify a quorum disk on the > nodes? > > Small nitpick - > > I'd really recommend against even trying to start qdiskd / > use a quorum > disk in a 16 node cluster. > > -- > Lon Hohberger - Red Hat, Inc. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From ooolinux at 163.com Sun Mar 6 05:30:01 2011 From: ooolinux at 163.com (yue) Date: Sun, 6 Mar 2011 13:30:01 +0800 (CST) Subject: [Linux-cluster] is ocfs2 is limited 16T Message-ID: <56a50421.709d.12e89a4c7cb.Coremail.ooolinux@163.com> if there is a limit on ocfs2'volume? it must less 16T? thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From wen.gang.wang at oracle.com Sun Mar 6 11:35:57 2011 From: wen.gang.wang at oracle.com (Wengang Wang) Date: Sun, 6 Mar 2011 19:35:57 +0800 Subject: [Linux-cluster] is ocfs2 is limited 16T In-Reply-To: <56a50421.709d.12e89a4c7cb.Coremail.ooolinux@163.com> References: <56a50421.709d.12e89a4c7cb.Coremail.ooolinux@163.com> Message-ID: <20110306113557.GC2756@laptop> Hi, For mainline kernel, there is no such limit. But for existing ocfs2 1.2/1.4/1.6, there is a 16TB limit. thanks, wengang. if there is a limit on ocfs2'volume? it must less 16T? thanks On 11-03-06 13:30, yue wrote: > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jakov.sosic at srce.hr Sun Mar 6 12:38:55 2011 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Sun, 06 Mar 2011 13:38:55 +0100 Subject: [Linux-cluster] is ocfs2 is limited 16T In-Reply-To: <56a50421.709d.12e89a4c7cb.Coremail.ooolinux@163.com> References: <56a50421.709d.12e89a4c7cb.Coremail.ooolinux@163.com> Message-ID: <4D73805F.8020308@srce.hr> On 03/06/2011 06:30 AM, yue wrote: > if there is a limit on ocfs2'volume? it must less 16T? For RHEL v5.x and derivateves yes. But you can hack it and rebuild kernel modules without limitation. You also need to patch kernel-sources and rebuild kernel too. 
-- Jakov Sosic www.srce.hr From balajisundar at midascomm.com Mon Mar 7 08:33:41 2011 From: balajisundar at midascomm.com (Balaji Sundar) Date: Mon, 7 Mar 2011 14:03:41 +0530 (IST) Subject: [Linux-cluster] rgmanager not running Message-ID: <38415.59.90.241.47.1299486821.squirrel@59.90.241.47> Dear All, I have using RHEL6 Linux and Kernel Version is 2.6.32-71.el6.i686 I have configured Cluster Suite with 2 servers Server 1 : 192.168.13.131 IP Address and hostname is primary Server 2 : 192.168.13.132 IP Address and hostname is secondary Floating : 192.168.13.133 IP Address (Assumed by currently active server) I have verified that service cman is running and cluster.conf is valid using ccs_config_validate command Finally i found that rgmanager is not running and services are not started [root at primary cluster]# service rgmanager status rgmanager dead but pid file exists [root at primary cluster]# [root at primary cluster]# cman_tool services [root at primary cluster]# [root at primary cluster]# cman_tool status Version: 6.2.0 Config Version: 1 Cluster Name: EMSCluster Cluster Id: 808 Cluster Member: Yes Cluster Generation: 96 Membership state: Cluster-Member Nodes: 1 Expected votes: 1 Total votes: 1 Node votes: 1 Quorum: 1 Active subsystems: 7 Flags: 2node Ports Bound: 0 Node name: primary Node ID: 1 Multicast addresses: 239.192.3.43 Node addresses: 192.168.13.131 [root at primary cluster]# Found some error messages in "/var/log/messages" file Mar 7 14:39:42 primary corosync[7155]: [CMAN ] quorum regained, resuming activity Mar 7 14:39:42 primary corosync[7155]: [QUORUM] This node is within the primary component and will provide service. Mar 7 14:39:42 primary corosync[7155]: [QUORUM] Members[1]: 1 Mar 7 14:39:42 primary corosync[7155]: [QUORUM] Members[1]: 1 Mar 7 14:39:42 primary corosync[7155]: [CPG ] downlist received left_list: 0 Mar 7 14:39:42 primary corosync[7155]: [CPG ] chosen downlist from node r(0) ip(192.168.13.131) Mar 7 14:39:42 primary corosync[7155]: [MAIN ] Completed service synchronization, ready to provide service. Mar 7 14:39:44 primary fenced[7210]: fenced 3.0.12 started Mar 7 14:39:45 primary dlm_controld[7224]: dlm_controld 3.0.12 started Mar 7 14:39:45 primary gfs_controld[7254]: gfs_controld 3.0.12 started Mar 7 14:39:45 primary kernel: dlm: Using TCP for communications Mar 7 14:39:45 primary dlm_controld[7224]: dlm_join_lockspace no fence domain Mar 7 14:39:45 primary dlm_controld[7224]: process_uevent online@ error -1 errno 2 Mar 7 14:39:45 primary kernel: dlm: rgmanager: group join failed -1 -1 Found some error messages in "/var/log/cluster/dlm_controld.log" file Mar 07 14:39:45 dlm_controld dlm_controld 3.0.12 started Mar 07 14:39:45 dlm_controld dlm_join_lockspace no fence domain Mar 07 14:39:45 dlm_controld process_uevent online@ error -1 errno 2 I don't know what is the problem and Can some one throw light on this peculiar problem Thanks in Advance --Regards S.Balaji From sdake at redhat.com Mon Mar 7 15:09:39 2011 From: sdake at redhat.com (Steven Dake) Date: Mon, 07 Mar 2011 08:09:39 -0700 Subject: [Linux-cluster] RHEL6 HA addon In-Reply-To: <4D72667A.90803@ra.is> References: <4D667CE7.1050501@ra.is> <4D66863E.3070304@alteeve.com> <4D72667A.90803@ra.is> Message-ID: <4D74F533.70907@redhat.com> > Note that if you are using the DLM with cman/corosync then you MUST tell it > to use SCTP as > it?s communications protocol as TCP does not support multihoming. > > > So I can use up to 9 altnames now? 
Is true, it would be fantastic :) > corosync supports a max of two interfaces. If we ever get around to supporting redundant ring well, we will add a larger number of redundant rings (ie: make it configurable in the packet data). Regards -steve From gregory.lee.bartholomew at gmail.com Mon Mar 7 16:15:29 2011 From: gregory.lee.bartholomew at gmail.com (Gregory Bartholomew) Date: Mon, 07 Mar 2011 10:15:29 -0600 Subject: [Linux-cluster] Error: "Failed actions: dlm:1_monitor_0/gfs-control:1_monitor_0 ... not installed". Message-ID: <4D7504A1.3090603@gmail.com> Hi All, I'm trying to follow the "clusters from scratch" guide and I'm running Fedora 14. When I try to add the DLM and GFS2 services, crm_mon keeps reporting "Failed actions: dlm:1_monitor_0/gfs-control:1_monitor_0 ... not installed". Does anyone know what I'm missing? Thanks, gb From rpeterso at redhat.com Mon Mar 7 16:35:58 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 7 Mar 2011 11:35:58 -0500 (EST) Subject: [Linux-cluster] Error: "Failed actions: dlm:1_monitor_0/gfs-control:1_monitor_0 ... not installed". In-Reply-To: <4D7504A1.3090603@gmail.com> Message-ID: <1229003616.324573.1299515758308.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | Hi All, | | I'm trying to follow the "clusters from scratch" guide and I'm | running Fedora 14. | | When I try to add the DLM and GFS2 services, crm_mon keeps reporting | "Failed actions: dlm:1_monitor_0/gfs-control:1_monitor_0 ... not | installed". | | Does anyone know what I'm missing? | | Thanks, | gb Hm, it sounds like you don't have the debugfs mounted and some piece of software (likely crm_mon) is expecting it. Try adding something like this to /etc/fstab: debugfs /sys/kernel/debug debugfs defaults 0 0 and doing mount -a Regards, Bob Peterson Red Hat File Systems From rpeterso at redhat.com Mon Mar 7 18:43:34 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 7 Mar 2011 13:43:34 -0500 (EST) Subject: [Linux-cluster] GFS2 write In-Reply-To: Message-ID: <1152011131.327272.1299523414066.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | Hi. | | I was trying to copy a 400 GB file to a gfs2 share. It was copying at | 50MB/s approx. Suddenly after copying 80% ,the rate dropped to 30KB/s | and stayed like that. I tried to kill the process but could't (which | is normal) and after few minutes it was killed. Then I tried it again | after few minutes and it was successfully copied at 50MB/s. But then | after it looks like accessing the GFS share (even ls -l /gfsmount) | takes 10-15 seconds to complete. Then I rebooted this node and | everything is back normal. | | I am really confused what has gone wrong. GFS is running with all | default parameters . | | Thanks! | Paras. Hi Paras, I think I've recreated the problem and I'm investigating it now. I hope to have an answer soon (maybe today). Looks like a bug to me, and so I'll see if I can generate a patch to fix it. That may take a few days. Regards, Bob Peterson Red Hat File Systems From gregory.lee.bartholomew at gmail.com Mon Mar 7 19:26:13 2011 From: gregory.lee.bartholomew at gmail.com (Gregory Bartholomew) Date: Mon, 07 Mar 2011 13:26:13 -0600 Subject: [Linux-cluster] Error: "Failed actions: dlm:1_monitor_0/gfs-control:1_monitor_0 ... not installed". 
In-Reply-To: <1229003616.324573.1299515758308.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1229003616.324573.1299515758308.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <4D753155.1090809@gmail.com> Thanks for offering me something to try Bob, but it still doesn't seem to work. Here is the exact output of crm_mon and "crm configure show": ============ Last updated: Mon Mar 7 13:20:47 2011 Stack: openais Current DC: eb2024-58.cs.siue.edu - partition with quorum Version: 1.1.4-ac608e3491c7dfc3b3e3c36d966ae9b016f77065 2 Nodes configured, 2 expected votes 4 Resources configured. ============ Online: [ eb2024-58.cs.siue.edu eb2024-59.cs.siue.edu ] ClusterIP (ocf::heartbeat:IPaddr2): Started eb2024-58.cs.siue.edu WebSite (ocf::heartbeat:apache): Started eb2024-58.cs.siue.edu Failed actions: dlm:0_monitor_0 (node=eb2024-59.cs.siue.edu, call=4, rc=5, status=complete): not installed gfs-control:0_monitor_0 (node=eb2024-59.cs.siue.edu, call=5, rc=5, status=complete): not installed dlm:1_monitor_0 (node=eb2024-58.cs.siue.edu, call=4, rc=5, status=complete): not installed gfs-control:1_monitor_0 (node=eb2024-58.cs.siue.edu, call=5, rc=5, status=complete): not installed [root at eb2024-58 ~]# crm configure show node eb2024-58.cs.siue.edu node eb2024-59.cs.siue.edu primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip="146.163.150.57" cidr_netmask="32" \ op monitor interval="30s" primitive WebSite ocf:heartbeat:apache \ params configfile="/etc/httpd/conf/httpd.conf" \ op start interval="0" timeout="40s" \ op stop interval="0" timeout="60s" \ op monitor interval="1min" primitive dlm ocf:pacemaker:controld \ op start interval="0" timeout="90s" \ op stop interval="0" timeout="100s" \ op monitor interval="120s" primitive gfs-control ocf:pacemaker:controld \ params daemon="gfs_controld.pcmk" args="-g 0" \ op start interval="0" timeout="90s" \ op stop interval="0" timeout="100s" \ op monitor interval="120s" clone dlm-clone dlm \ meta interleave="true" clone gfs-clone gfs-control \ meta interleave="true" location prefer-node1 WebSite 50: eb2024-58.cs.siue.edu colocation gfs-with-dlm inf: gfs-clone dlm-clone colocation website-with-ip inf: WebSite ClusterIP order apache-after-ip inf: ClusterIP WebSite order start-gfs-after-dlm inf: dlm-clone gfs-clone property $id="cib-bootstrap-options" \ dc-version="1.1.4-ac608e3491c7dfc3b3e3c36d966ae9b016f77065" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ stonith-enabled="false" \ no-quorum-policy="ignore" Has anyone got this to work on Fedora 14? gb On 03/07/2011 10:35 AM, Bob Peterson wrote: > ----- Original Message ----- > | Hi All, > | > | I'm trying to follow the "clusters from scratch" guide and I'm > | running Fedora 14. > | > | When I try to add the DLM and GFS2 services, crm_mon keeps reporting > | "Failed actions: dlm:1_monitor_0/gfs-control:1_monitor_0 ... not > | installed". > | > | Does anyone know what I'm missing? > | > | Thanks, > | gb > > Hm, it sounds like you don't have the debugfs mounted > and some piece of software (likely crm_mon) is expecting it. 
> Try adding something like this to /etc/fstab: > > debugfs /sys/kernel/debug debugfs defaults 0 0 > > and doing mount -a > > Regards, > > Bob Peterson > Red Hat File Systems > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From pradhanparas at gmail.com Mon Mar 7 20:33:24 2011 From: pradhanparas at gmail.com (Paras pradhan) Date: Mon, 7 Mar 2011 14:33:24 -0600 Subject: [Linux-cluster] GFS2 write In-Reply-To: <1152011131.327272.1299523414066.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1152011131.327272.1299523414066.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: Thanks Bob. Please let me know if you need any other info. Paras. On Mon, Mar 7, 2011 at 12:43 PM, Bob Peterson wrote: > ----- Original Message ----- > | Hi. > | > | I was trying to copy a 400 GB file to a gfs2 share. It was copying at > | 50MB/s approx. Suddenly after copying 80% ,the rate dropped to 30KB/s > | and stayed like that. I tried to kill the process but could't (which > | is normal) and after few minutes it was killed. Then I tried it again > | after few minutes and it was successfully copied at 50MB/s. But then > | after it looks like accessing the GFS share (even ls -l /gfsmount) > | takes 10-15 seconds to complete. Then I rebooted this node and > | everything is back normal. > | > | I am really confuseTd what has gone wrong. GFS is running with all > | default parameters . > | > | Thanks! > | Paras. > > Hi Paras, > > I think I've recreated the problem and I'm investigating it now. > I hope to have an answer soon (maybe today). ?Looks like a bug to > me, and so I'll see if I can generate a patch to fix it. ?That > may take a few days. > > Regards, > > Bob Peterson > Red Hat File Systems > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From lhh at redhat.com Mon Mar 7 21:36:01 2011 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 7 Mar 2011 16:36:01 -0500 Subject: [Linux-cluster] Nodes are not joining to the cluster In-Reply-To: <90386.57864.qm@web112805.mail.gq1.yahoo.com> References: <20110304171822.GF14803@redhat.com> <90386.57864.qm@web112805.mail.gq1.yahoo.com> Message-ID: <20110307213601.GH17423@redhat.com> On Sat, Mar 05, 2011 at 02:53:51PM -0800, Srija wrote: > Hi, > > Just a query , I am not very much clear so asking, > > > I'd really recommend against even trying to start qdiskd / > > use a quorum > > disk in a 16 node cluster. > > Did you ask not to use qdiskd /quorum disk in the 16 nodes cluster ? Right - qdiskd was designed for 2- and 4-node clusters to expand the failure tolerances a little bit. It will *work* in a 16 node cluster, but is unlikely to provide any practical benefit. -- Lon Hohberger - Red Hat, Inc. 
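To put some numbers on that: with 16 nodes each holding one vote, quorum is already 9 of 16, so a one-vote quorum disk barely moves the math. In the two-node case qdiskd was built for, the extra vote is what lets a lone surviving node keep quorum (2 of 3). A minimal sketch of that two-node setup, with made-up label, heuristic and addresses, might look roughly like this in cluster.conf:

    <cman expected_votes="3"/>
    <quorumd interval="1" tko="10" votes="1" label="qdisk1">
        <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2" tko="3"/>
    </quorumd>

(Attribute names are the ones described in qdisk(5); check the man page on your release before copying anything.)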
From lhh at redhat.com Mon Mar 7 21:42:12 2011 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 7 Mar 2011 16:42:12 -0500 Subject: [Linux-cluster] Info on vm definitions and options in stable3 In-Reply-To: References: Message-ID: <20110307214212.GI17423@redhat.com> On Sat, Mar 05, 2011 at 03:03:25PM +0100, Gianluca Cecchi wrote: > On Fri, 4 Mar 2011 13:01:20 -0500 Lon Hohberger wrote: > > http://sources.redhat.com/cluster/wiki/ServiceOperationalBehaviors > > http://sources.redhat.com/cluster/wiki/ServicePolicies > > http://sources.redhat.com/cluster/wiki/FailoverDomains > > Thanks for the links > Some comments: > 1) http://sources.redhat.com/cluster/wiki/ServicePolicies > probably to correct near the end from > The above service tolerance is 3 restarts in 10 minutes. > to > The above service tolerance is 3 restarts in 5 minutes. Fixed. > 3) http://sources.redhat.com/cluster/wiki/FailoverDomains > It could be useful to add at the top a comment such as the one in 1) > (Note: These policies also apply to virtual machine resources.) > for example something like: > Note: Failover Domains concepts also apply to virtual machine resources. Done. > 2) http://sources.redhat.com/cluster/wiki/ServiceOperationalBehaviors > Here the application to virtual resources is implicit due to the > various references inside the page itself Added a note anyway. -- Lon Hohberger - Red Hat, Inc. From lhh at redhat.com Mon Mar 7 21:49:19 2011 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 7 Mar 2011 16:49:19 -0500 Subject: [Linux-cluster] rg_test for testing other resource agent functions? In-Reply-To: <20110304194923.GX934@mip.aaaaa.org> References: <20110304194923.GX934@mip.aaaaa.org> Message-ID: <20110307214919.GJ17423@redhat.com> On Fri, Mar 04, 2011 at 02:49:23PM -0500, Ofer Inbar wrote: > > For example, I want to add a "verify" procedure to my resource agent, > that I'd like to kick off from a monitoring script on my own schedule, > but I want to make sure that it is run in the same context as the > resource agent's status check is normally run. I could write some > separate cluster.conf parser that simulates what I think rgmanager > would do, but I might get it wrong. Or rgmanager might change in a > future version and I wouldn't track the change. rg_test exposes the operations rgmanager performs. rgmanager doesn't actually call 'validate-all' - it expects RAs to do this, or at least report when parameters are invalid if start/status/stop operations are called. > Is there anything like rg_test that might let me do this, or has > anyone patched rg_test to allow it? Something as simple as: > sudo rg_test test /etc/cluster/cluster.conf [foo] service [servicename] rgmanager does implicit start/status/stop ordering based on service tree structures, which is why those are the only operations that are currently done. > .. where it would simply call the resource agent the same way as it > does for status/start/stop, but substitute whatever command line > argument I give it. You could just do: OCF_RESKEY_x=y OCF_RESKEY_a=b /path/to/agent.sh > Or do I have to reverse-engineer my own cluster.conf parsing to set up > the environment and run the script(s) myself (duplicating what rg_test > already does for status/start/stop) ? Pacemaker has ocf-tester as well; maybe that would be useful? I have a tool that will flatten a cluster.conf for you, resolving rgmanager's entire resource tree structure and flattening the result. -- Lon Hohberger - Red Hat, Inc. 
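Tying those two suggestions together, a monitoring job could either drive the agent directly with the OCF environment set by hand, or lean on rg_test for the operations it does expose. A rough sketch, where the agent path, resource parameters and service name are all hypothetical:

    # call the agent directly, passing parameters the OCF way
    OCF_RESKEY_name=myapp OCF_RESKEY_config_file=/etc/myapp.conf \
        /usr/share/cluster/myapp.sh verify

    # or exercise status exactly as rgmanager would, via rg_test
    rg_test test /etc/cluster/cluster.conf status service myservice

The first form runs whatever extra actions the agent implements (here a custom "verify"), but it is only as faithful as the parameters you set by hand; the second parses cluster.conf for you, but is limited to start/status/stop as Lon notes.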
From lhh at redhat.com Mon Mar 7 21:52:00 2011 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 7 Mar 2011 16:52:00 -0500 Subject: [Linux-cluster] unable to live migrate a vm in rh el 6: Migration unexpectedly failed In-Reply-To: References: Message-ID: <20110307215200.GK17423@redhat.com> On Sat, Mar 05, 2011 at 03:30:47PM +0100, Gianluca Cecchi wrote: > It seems no more information in the file.... > Any hints on further debugging? Check /var/log/audit/audit.log for an AVC denial around self:capability setpcap for xm_t? -- Lon Hohberger - Red Hat, Inc. From lhh at redhat.com Mon Mar 7 21:55:38 2011 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 7 Mar 2011 16:55:38 -0500 Subject: [Linux-cluster] rgmanager not running In-Reply-To: <38415.59.90.241.47.1299486821.squirrel@59.90.241.47> References: <38415.59.90.241.47.1299486821.squirrel@59.90.241.47> Message-ID: <20110307215538.GL17423@redhat.com> On Mon, Mar 07, 2011 at 02:03:41PM +0530, Balaji Sundar wrote: > > Found some error messages in "/var/log/messages" file > Mar 7 14:39:42 primary corosync[7155]: [CMAN ] quorum regained, > resuming activity How much time between: [DATE] corosync[7155]: [MAIN ] Corosync Cluster Engine ('1.2.3'): started and ready to provide service. and the above message? -- Lon Hohberger - Red Hat, Inc. From gianluca.cecchi at gmail.com Mon Mar 7 22:10:08 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Mon, 7 Mar 2011 23:10:08 +0100 Subject: [Linux-cluster] unable to live migrate a vm in rh el 6: Migration unexpectedly failed Message-ID: On Mon, 7 Mar 2011 16:52:00 -0500 Lon Hohberger wrote: > Check /var/log/audit/audit.log for an AVC denial around self:capability > setpcap for xm_t? Uhm, SElinux is disabled on both nodes (I'll cross check tomorrow anyway) and auditd is chkconfig off too (even if I notice in rh el 6 many audit messages related to cron writing in /var/log/messages...) Could it be of any help an "strace -f" of the virsh command where I can see the ssh and netcat forked calls but am not able to identify the point where eventually there is something strange? Gianluca From rpeterso at redhat.com Mon Mar 7 22:14:55 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 7 Mar 2011 17:14:55 -0500 (EST) Subject: [Linux-cluster] GFS2 write In-Reply-To: Message-ID: <1326901784.331890.1299536095096.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | Thanks Bob. Please let me know if you need any other info. | | Paras. | > Hi Paras, | > | > I think I've recreated the problem and I'm investigating it now. | > I hope to have an answer soon (maybe today). Looks like a bug to | > me, and so I'll see if I can generate a patch to fix it. That | > may take a few days. | > | > Regards, | > | > Bob Peterson | > Red Hat File Systems Hi Paras, Your block allocation problem will probably be fixed by this upstream patch to GFS2: http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=9cabcdbd4638cf884839ee4cd15780800c223b90 I tracked it down. I ported the patch to RHEL5 and now it doesn't happen. Unfortunately, my ported patch needs cleaning: I've got a bunch of instrumentation for other reasons in there. 
Regards, Bob Peterson Red Hat File Systems From pradhanparas at gmail.com Mon Mar 7 22:21:31 2011 From: pradhanparas at gmail.com (Paras pradhan) Date: Mon, 7 Mar 2011 16:21:31 -0600 Subject: [Linux-cluster] GFS2 write In-Reply-To: <1326901784.331890.1299536095096.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1326901784.331890.1299536095096.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: We are running Redhat 5. Do you think this patch has already been applied to the GFS that redhat ships? Paras. On Mon, Mar 7, 2011 at 4:14 PM, Bob Peterson wrote: > ----- Original Message ----- > | Thanks Bob. Please let me know if you need any other info. > | > | Paras. > | > Hi Paras, > | > > | > I think I've recreated the problem and I'm investigating it now. > | > I hope to have an answer soon (maybe today). Looks like a bug to > | > me, and so I'll see if I can generate a patch to fix it. That > | > may take a few days. > | > > | > Regards, > | > > | > Bob Peterson > | > Red Hat File Systems > > Hi Paras, > > Your block allocation problem will probably be fixed by this upstream > patch to GFS2: > > http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=9cabcdbd4638cf884839ee4cd15780800c223b90 > > I tracked it down. ?I ported the patch to RHEL5 and now it doesn't > happen. ?Unfortunately, my ported ?patch needs cleaning: I've got > a bunch of instrumentation for other reasons in there. > > Regards, > > Bob Peterson > Red Hat File Systems > From rpeterso at redhat.com Mon Mar 7 22:42:22 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 7 Mar 2011 17:42:22 -0500 (EST) Subject: [Linux-cluster] GFS2 write In-Reply-To: Message-ID: <805163997.332464.1299537742195.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | We are running Redhat 5. Do you think this patch has already been | applied to the GFS that redhat ships? | | Paras. Hi Paras, If this is RHEL5, you should contact Red Hat support and open a ticket. After all, you're paying for support, so why not use it? That patch is in the upstream (kernel.org) kernel, not in RHEL5. I ported the patch to RHEL5 for testing purposes and I'm planning to put a test version on my people page for some of our customers to try out. I don't know what kernel you're running, but I can do the same for your kernel. If you open a support ticket, ask them to attach your case to bugzilla bug 681261, which is likely private because it contains confidential customer information. Regards, Bob Peterson Red Hat File Systems From pradhanparas at gmail.com Mon Mar 7 22:49:00 2011 From: pradhanparas at gmail.com (Paras pradhan) Date: Mon, 7 Mar 2011 16:49:00 -0600 Subject: [Linux-cluster] GFS2 write In-Reply-To: <805163997.332464.1299537742195.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <805163997.332464.1299537742195.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: Thanks Bob. Will do. I don't know how quickly my issue gets resolved , I am using kernel 2.6.18-238.1.1.el5xen Thanks Paras. On Mon, Mar 7, 2011 at 4:42 PM, Bob Peterson wrote: > ----- Original Message ----- > | We are running Redhat 5. Do you think this patch has already been > | applied to the GFS that redhat ships? > | > | Paras. > > Hi Paras, > > If this is RHEL5, you should contact Red Hat support and open > a ticket. ?After all, you're paying for support, so why not use it? > That patch is in the upstream (kernel.org) kernel, not in RHEL5. 
> I ported the patch to RHEL5 for testing purposes and I'm > planning to put a test version on my people page for some of > our customers to try out. ?I don't know what kernel you're > running, but I can do the same for your kernel. > If you open a support ticket, ask them to attach your case > to bugzilla bug 681261, which is likely private because it contains > confidential customer information. > > Regards, > > Bob Peterson > Red Hat File Systems > From scooter at cgl.ucsf.edu Tue Mar 8 01:01:02 2011 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Mon, 07 Mar 2011 17:01:02 -0800 Subject: [Linux-cluster] GFS2 write In-Reply-To: <805163997.332464.1299537742195.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <805163997.332464.1299537742195.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <4D757FCE.7070305@cgl.ucsf.edu> Hi Bob, I think we're seeing this also, but bugzilla #681261 is currently private. Could we open that up (or add us to the cc list)? If it does look like our problem, I'll open up a ticket. Thanks! -- scooter On 03/07/2011 02:42 PM, Bob Peterson wrote: > ----- Original Message ----- > | We are running Redhat 5. Do you think this patch has already been > | applied to the GFS that redhat ships? > | > | Paras. > > Hi Paras, > > If this is RHEL5, you should contact Red Hat support and open > a ticket. After all, you're paying for support, so why not use it? > That patch is in the upstream (kernel.org) kernel, not in RHEL5. > I ported the patch to RHEL5 for testing purposes and I'm > planning to put a test version on my people page for some of > our customers to try out. I don't know what kernel you're > running, but I can do the same for your kernel. > If you open a support ticket, ask them to attach your case > to bugzilla bug 681261, which is likely private because it contains > confidential customer information. > > Regards, > > Bob Peterson > Red Hat File Systems > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From fdinitto at redhat.com Tue Mar 8 08:11:04 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 08 Mar 2011 09:11:04 +0100 Subject: [Linux-cluster] cluster 3.1.1 stable release Message-ID: <4D75E498.9000206@redhat.com> Welcome to the cluster 3.1.1 release. This release contains dozens of bug fixes and improvements, including dbus notifications of cluster events, that in conjunction with project Foghorn, they can be translated into SNMP events. The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.1.1.tar.xz ChangeLog: https://fedorahosted.org/releases/c/l/cluster/Changelog-3.1.1 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Fabio From krishnanand.linux at gmail.com Tue Mar 8 08:20:28 2011 From: krishnanand.linux at gmail.com (krishnanand gouri) Date: Tue, 8 Mar 2011 13:50:28 +0530 Subject: [Linux-cluster] samba-cluster Issue Message-ID: Hi, I have configured 2-Node cluster. Every thing is working fine even the fail over cases also workign fine but i am facing a issue when ever I stop CTDB service in server 1, the user are not able to acces samba share at all. 
even after the CTDB IP is switched over. But where as if at all i stop CTDB service in server2 the CTDB IP will switch over to other server and the users are able to access the samba share normally. Why is it so happening only for server 1. Public IP's - 192.168.129.10 / 192.168.129.10 Heart Beat Ip's : 10.0.0.10 / 10.0.0.20 CTDB IP's - 192.168.129.14 / 192.168.129.15 Please help in solving this issue.... Thanks & Regards Krishnanand G -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Tue Mar 8 14:51:20 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 8 Mar 2011 09:51:20 -0500 (EST) Subject: [Linux-cluster] GFS2 write In-Reply-To: <4D757FCE.7070305@cgl.ucsf.edu> Message-ID: <192026736.348110.1299595880572.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | Hi Bob, | I think we're seeing this also, but bugzilla #681261 is currently | private. Could we open that up (or add us to the cc list)? If it does | look like our problem, I'll open up a ticket. | | Thanks! | | -- scooter Hi Scooter, It's almost impossible to determine if you're experiencing a particular problem without doing an in-depth analysis of your data. It's probably best if you open a support ticket and collect the info they request so we (as a team) can analyze it. Regards, Bob Peterson Red Hat File Systems From szekelyi at niif.hu Tue Mar 8 16:35:41 2011 From: szekelyi at niif.hu (=?iso-8859-1?q?Sz=E9kelyi_Szabolcs?=) Date: Tue, 8 Mar 2011 17:35:41 +0100 Subject: [Linux-cluster] wait states? Message-ID: <201103081735.41537.szekelyi@niif.hu> Hi all, I've setup a simple two node cluster, and did some testing on it, but I'm having problems interpreting the results. Unfortunately I couldn't find any documentation answering my questions, so I'll post them here. I don't want to do any complicated stuff, just run CLVM properly to serve logical volumes as iSCSI targets. The iSCSI target software should run on both nodes, independent of the cluster stack. The cluster is needed only because of CLVM. I tried to avoid using fencing as much as possible since I don't really see the need for it. My cluster.conf looks like this: To test things, I broke the connection between the nodes for a while, and then restored it. I expected the cluster to return to normal state, but it didn't. The main difference between the nodes is the "wait state", about what I could hardly find any documentation. On one node it's "messages", on the other it's "quorum". Could you explain what these mean and how to return the cluster into normal state? Thanks, -- cc From vmutu at pcbi.upenn.edu Tue Mar 8 17:11:53 2011 From: vmutu at pcbi.upenn.edu (Valeriu Mutu) Date: Tue, 8 Mar 2011 12:11:53 -0500 Subject: [Linux-cluster] clvmd hangs on startup In-Reply-To: <20110303165056.GF10674@bsdera.pcbi.upenn.edu> References: <20110302215050.GD10674@bsdera.pcbi.upenn.edu> <64D0546C5EBBD147B75DE133D798665F0855C290@hugo.eprize.local> <20110303165056.GF10674@bsdera.pcbi.upenn.edu> Message-ID: <20110308171153.GB272@bsdera.pcbi.upenn.edu> Hi, I think the problem is solved. I was using a 9000bytes MTU on the Xen virtual machines' iSCSI interface. Switching back to 1500bytes MTU caused the clvmd to start working. On Thu, Mar 03, 2011 at 11:50:57AM -0500, Valeriu Mutu wrote: > On Wed, Mar 02, 2011 at 05:36:45PM -0500, Jeff Sturm wrote: > > Double-check that the 2nd node can read and write the shared iSCSI > > storage. 
> > Reading/writing from/to the iSCSI storage device works as seen below. > > On the 1st node: > [root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes > 10000+0 records in > 10000+0 records out > 10240000 bytes (10 MB) copied, 3.39855 seconds, 3.0 MB/s > > [root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null > 10000+0 records in > 10000+0 records out > 10240000 bytes (10 MB) copied, 0.331069 seconds, 30.9 MB/s > > On the 2nd node: > [root at vm2 ~]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes > 10000+0 records in > 10000+0 records out > 10240000 bytes (10 MB) copied, 3.2465 seconds, 3.2 MB/s > > [root at vm2 ~]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null > 10000+0 records in > 10000+0 records out > 10240000 bytes (10 MB) copied, 0.223337 seconds, 45.8 MB/s -- Valeriu Mutu From jeff.sturm at eprize.com Tue Mar 8 19:02:35 2011 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Tue, 8 Mar 2011 14:02:35 -0500 Subject: [Linux-cluster] clvmd hangs on startup In-Reply-To: <20110308171153.GB272@bsdera.pcbi.upenn.edu> References: <20110302215050.GD10674@bsdera.pcbi.upenn.edu><64D0546C5EBBD147B75DE133D798665F0855C290@hugo.eprize.local><20110303165056.GF10674@bsdera.pcbi.upenn.edu> <20110308171153.GB272@bsdera.pcbi.upenn.edu> Message-ID: <64D0546C5EBBD147B75DE133D798665F0855C339@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Valeriu Mutu > Sent: Tuesday, March 08, 2011 12:12 PM > > I think the problem is solved. I was using a 9000bytes MTU on the Xen virtual > machines' iSCSI interface. Switching back to 1500bytes MTU caused the clvmd to start > working. That'll do it. Jumbo frames with Xen are a little tricky, so it's easiest to stick with MTU 1500, as you have done. (If you really want jumbo frames, change the MTU on the dom0 vif interfaces to match, and if you use bridging, ditto for the bridge device and real interfaces.) I suspect if your "dd" test had used a block size like 4096, rather than 1024, it would have similarly failed. -Jeff From gregory.lee.bartholomew at gmail.com Tue Mar 8 19:53:47 2011 From: gregory.lee.bartholomew at gmail.com (Gregory Bartholomew) Date: Tue, 08 Mar 2011 13:53:47 -0600 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes Message-ID: <4D76894B.6010809@gmail.com> Hi Fabio M. Di Nitto, FYI, I was just trying to set up gfs2 under pacemaker on Fedora 14 X86_64 and although yum provides '*/gfs_controld.pcmk' showed that I needed the dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 packages, yum install dlm-pcmk gfs-pcmk would simply report "Nothing to do". rpm -q showed that I didn't have the packages installed. I tried installing the cman package but that didn't help. I finally got it working by downloading the packages with wget and installing them with rpm -ivh. FYI, the dlm-pcmk and gfs-pcmk packages seem to be broken in the Fedora 14 x86_64 database at the moment. gb From fdinitto at redhat.com Tue Mar 8 19:55:37 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 08 Mar 2011 20:55:37 +0100 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: <4D76894B.6010809@gmail.com> References: <4D76894B.6010809@gmail.com> Message-ID: <4D7689B9.7070906@redhat.com> On 03/08/2011 08:53 PM, Gregory Bartholomew wrote: > Hi Fabio M. 
Di Nitto, > > FYI, I was just trying to set up gfs2 under pacemaker on Fedora 14 > X86_64 and although yum provides '*/gfs_controld.pcmk' showed that I > needed the dlm-pcmk-3.0.17-1.fc14.x86_64 and > gfs-pcmk-3.0.17-1.fc14.x86_64 packages, yum install dlm-pcmk gfs-pcmk > would simply report "Nothing to do". rpm -q showed that I didn't have > the packages installed. I tried installing the cman package but that > didn't help. I finally got it working by downloading the packages with > wget and installing them with rpm -ivh. > > FYI, the dlm-pcmk and gfs-pcmk packages seem to be broken in the Fedora > 14 x86_64 database at the moment. No, those packages have been removed intentionally since pacemaker now supports cman cluster manager and they become obsoleted. So very short summary: configure cman for clusternodes start cman (including dlm/gfs controld) tell pacemaker to use cman configure fencing and all services. Fabio From lhh at redhat.com Tue Mar 8 22:17:45 2011 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 8 Mar 2011 17:17:45 -0500 Subject: [Linux-cluster] unable to live migrate a vm in rh el 6: Migration unexpectedly failed In-Reply-To: References: Message-ID: <20110308221744.GA6659@redhat.com> On Mon, Mar 07, 2011 at 11:10:08PM +0100, Gianluca Cecchi wrote: > On Mon, 7 Mar 2011 16:52:00 -0500 Lon Hohberger wrote: > > > Check /var/log/audit/audit.log for an AVC denial around self:capability > > setpcap for xm_t? > > Uhm, > SElinux is disabled on both nodes (I'll cross check tomorrow anyway) > and auditd is chkconfig off too (even if I notice in rh el 6 many > audit messages related to cron writing in /var/log/messages...) > Could it be of any help an "strace -f" of the virsh command where I > can see the ssh and netcat forked calls but am not able to identify > the point where eventually there is something strange? > Nothing comes to mind; in my RHEL6 development cluster, I have a custom SELinux policy: #==== cut module clusterlocal 1.0; require { type xm_t; type debugfs_t; type fenced_t; type mount_t; type telnetd_port_t; class capability setpcap; class tcp_socket name_connect; class dir mounton; } allow fenced_t telnetd_port_t:tcp_socket name_connect; allow mount_t debugfs_t:dir mounton; allow xm_t self:capability setpcap; #=== end cut And the following firewall rules: -A INPUT -p tcp -m state --state NEW -m multiport --dports 21064 -j ACCEPT -A INPUT -p tcp -m state --state NEW -m multiport --dports 11111 -j ACCEPT -A INPUT -p udp -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT I'm using bridging (as documented in the RHEL6 documentation) and everything pretty much just works. Are you seeing any other notable behaviors, besides the migration failing? -- Lon Hohberger - Red Hat, Inc. From Sunil_Gupta2 at Dell.com Wed Mar 9 07:02:51 2011 From: Sunil_Gupta2 at Dell.com (Sunil_Gupta2 at Dell.com) Date: Tue, 8 Mar 2011 23:02:51 -0800 Subject: [Linux-cluster] rgmanager not running In-Reply-To: <38415.59.90.241.47.1299486821.squirrel@59.90.241.47> References: <38415.59.90.241.47.1299486821.squirrel@59.90.241.47> Message-ID: <8EF1FE59C3C8694E94F558EB27E464B71D130C73DD@BLRX7MCDC201.AMER.DELL.COM> The rgmanager service is not necessary if the cluster has no resources to manage....further more info on cluster status is needed like #clustat If it says all the nodes are online then more debug logs will be needed to find out the problem. 
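For the "more debug logs" part, a reasonable first pass (commands from the cman/fence tool set; the logging syntax is the cluster.conf(5) one for cluster 3, so double-check it on the installed release) would be:

    clustat                 # are both nodes shown Online?
    cman_tool nodes         # membership as cman/corosync sees it
    fence_tool ls           # has this node actually joined the fence domain?

The "dlm_join_lockspace no fence domain" messages in the original post suggest fenced never joined a fence domain, which would also explain why the rgmanager lockspace join fails. To get more verbose daemon logs, something like the following in cluster.conf should raise the log level:

    <logging debug="on">
        <logging_daemon name="rgmanager" debug="on"/>
        <logging_daemon name="fenced" debug="on"/>
    </logging>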
--Sunil -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji Sundar Sent: Monday, March 07, 2011 2:04 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] rgmanager not running Dear All, I have using RHEL6 Linux and Kernel Version is 2.6.32-71.el6.i686 I have configured Cluster Suite with 2 servers Server 1 : 192.168.13.131 IP Address and hostname is primary Server 2 : 192.168.13.132 IP Address and hostname is secondary Floating : 192.168.13.133 IP Address (Assumed by currently active server) I have verified that service cman is running and cluster.conf is valid using ccs_config_validate command Finally i found that rgmanager is not running and services are not started [root at primary cluster]# service rgmanager status rgmanager dead but pid file exists [root at primary cluster]# [root at primary cluster]# cman_tool services [root at primary cluster]# [root at primary cluster]# cman_tool status Version: 6.2.0 Config Version: 1 Cluster Name: EMSCluster Cluster Id: 808 Cluster Member: Yes Cluster Generation: 96 Membership state: Cluster-Member Nodes: 1 Expected votes: 1 Total votes: 1 Node votes: 1 Quorum: 1 Active subsystems: 7 Flags: 2node Ports Bound: 0 Node name: primary Node ID: 1 Multicast addresses: 239.192.3.43 Node addresses: 192.168.13.131 [root at primary cluster]# Found some error messages in "/var/log/messages" file Mar 7 14:39:42 primary corosync[7155]: [CMAN ] quorum regained, resuming activity Mar 7 14:39:42 primary corosync[7155]: [QUORUM] This node is within the primary component and will provide service. Mar 7 14:39:42 primary corosync[7155]: [QUORUM] Members[1]: 1 Mar 7 14:39:42 primary corosync[7155]: [QUORUM] Members[1]: 1 Mar 7 14:39:42 primary corosync[7155]: [CPG ] downlist received left_list: 0 Mar 7 14:39:42 primary corosync[7155]: [CPG ] chosen downlist from node r(0) ip(192.168.13.131) Mar 7 14:39:42 primary corosync[7155]: [MAIN ] Completed service synchronization, ready to provide service. 
Mar 7 14:39:44 primary fenced[7210]: fenced 3.0.12 started Mar 7 14:39:45 primary dlm_controld[7224]: dlm_controld 3.0.12 started Mar 7 14:39:45 primary gfs_controld[7254]: gfs_controld 3.0.12 started Mar 7 14:39:45 primary kernel: dlm: Using TCP for communications Mar 7 14:39:45 primary dlm_controld[7224]: dlm_join_lockspace no fence domain Mar 7 14:39:45 primary dlm_controld[7224]: process_uevent online@ error -1 errno 2 Mar 7 14:39:45 primary kernel: dlm: rgmanager: group join failed -1 -1 Found some error messages in "/var/log/cluster/dlm_controld.log" file Mar 07 14:39:45 dlm_controld dlm_controld 3.0.12 started Mar 07 14:39:45 dlm_controld dlm_join_lockspace no fence domain Mar 07 14:39:45 dlm_controld process_uevent online@ error -1 errno 2 I don't know what is the problem and Can some one throw light on this peculiar problem Thanks in Advance --Regards S.Balaji -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From gianluca.cecchi at gmail.com Wed Mar 9 08:47:09 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Wed, 9 Mar 2011 09:47:09 +0100 Subject: [Linux-cluster] unable to live migrate a vm in rh el 6: Migration unexpectedly failed In-Reply-To: References: Message-ID: On Mon, Mar 7, 2011 at 11:10 PM, Gianluca Cecchi wrote: > Nothing comes to mind; in my RHEL6 development cluster, I have a > custom SELinux policy: I confirm that SElinux is disabled and [root at rhev1 ~]# chkconfig --list | grep audit auditd 0:off 1:off 2:off 3:off 4:off 5:off 6:off [root at rhev1 ~]# service auditd status auditd is stopped [root at rhev2 ~]# chkconfig --list | grep audit auditd 0:off 1:off 2:off 3:off 4:off 5:off 6:off [root at rhev2 ~]# service auditd status auditd is stopped No other problems apparently, apart the bug on bond+vlan+bridge: https://bugzilla.redhat.com/show_bug.cgi?id=623199 For which I have also a case open.. Cluster is sound and some test services worked ok. Strange thing is that at some point during my test I was able to live migrate this machine itself... apparently I did something that broke or one of my latest updates created a problem... Or something related with firewall perhaps. Can I stop firewall at all and have libvirtd working at the same time to test ...? I know libvirtd puts some iptables rules itself.. Gianluca From andrew at beekhof.net Wed Mar 9 08:48:03 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Wed, 9 Mar 2011 09:48:03 +0100 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: <4D7689B9.7070906@redhat.com> References: <4D76894B.6010809@gmail.com> <4D7689B9.7070906@redhat.com> Message-ID: On Tue, Mar 8, 2011 at 8:55 PM, Fabio M. Di Nitto wrote: > On 03/08/2011 08:53 PM, Gregory Bartholomew wrote: >> Hi Fabio M. Di Nitto, >> >> FYI, I was just trying to set up gfs2 under pacemaker on Fedora 14 >> X86_64 and although yum provides '*/gfs_controld.pcmk' showed that I >> needed the dlm-pcmk-3.0.17-1.fc14.x86_64 and >> gfs-pcmk-3.0.17-1.fc14.x86_64 packages, yum install dlm-pcmk gfs-pcmk >> would simply report "Nothing to do". ?rpm -q showed that I didn't have >> the packages installed. ?I tried installing the cman package but that >> didn't help. ?I finally got it working by downloading the packages with >> wget and installing them with rpm -ivh. >> >> FYI, the dlm-pcmk and gfs-pcmk packages seem to be broken in the Fedora >> 14 x86_64 database at the moment. 
> > No, those packages have been removed intentionally since pacemaker now > supports cman cluster manager and they become obsoleted. > > So very short summary: > > configure cman for clusternodes > start cman (including dlm/gfs controld) > tell pacemaker to use cman > configure fencing and all services. A week or so ago I added a big warning to the bottom of: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02.html and an appendix for configuring cman+pacemaker. Hopefully it will be of some help. From gianluca.cecchi at gmail.com Wed Mar 9 10:32:39 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Wed, 9 Mar 2011 11:32:39 +0100 Subject: [Linux-cluster] unable to live migrate a vm in rh el 6: Migration unexpectedly failed In-Reply-To: References: Message-ID: Here is the output of the command strace -f virsh migrate --live exorapr1 qemu+ssh://intrarhev1/system Note that if I run the same with rhev1 (main host name and not intracluster) instead of intrarhev1, I'm asked for the ssh password (ok because I set ssh equivalence only for intracluster) but at the end I get the same error: operation failed: Migration unexpectedly failed Gianluca From gianluca.cecchi at gmail.com Wed Mar 9 10:33:18 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Wed, 9 Mar 2011 11:33:18 +0100 Subject: [Linux-cluster] unable to live migrate a vm in rh el 6: Migration unexpectedly failed In-Reply-To: References: Message-ID: On Wed, Mar 9, 2011 at 11:32 AM, Gianluca Cecchi wrote: > Here is the output of the command > > strace -f virsh migrate --live exorapr1 qemu+ssh://intrarhev1/system > > Note that if I run the same with rhev1 (main host name and not > intracluster) instead of intrarhev1, I'm asked for the ssh password > (ok because I set ssh equivalence only for intracluster) but at the > end I get the same error: > operation failed: Migration unexpectedly failed > > Gianluca > I forgot the attachment... ;-( It is in zip format -------------- next part -------------- A non-text attachment was scrubbed... Name: strace.zip Type: application/zip Size: 20072 bytes Desc: not available URL: From balajisundar at midascomm.com Wed Mar 9 11:24:18 2011 From: balajisundar at midascomm.com (Balaji) Date: Wed, 09 Mar 2011 16:54:18 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 83, Issue 13 In-Reply-To: References: Message-ID: <4D776362.4020203@midascomm.com> Dear All, Please find attached log file for more analysis Please help me to solve this problem ASAP. * Clustat Command Output is below * [root at corviewprimary ~]# clustat Cluster Status for EMSCluster @ Wed Mar 9 17:00:03 2011 Member Status: Quorate Member Name ID Status ----------- ------- ---- ------ corviewprimary 1 Online, Local corviewsecondary 2 Offline [root at corviewprimary ~]# Regards, -S.Balaji linux-cluster-request at redhat.com wrote: >Send Linux-cluster mailaddr:115.249.107.179ing list submissions to > linux-cluster at redhat.com > >To subscribe or unsubscribe via the World Wide Web, visit > https://www.redhat.com/mailman/listinfo/linux-cluster >or, via email, send a message with subject or body 'help' to > linux-cluster-request at redhat.com > >You can reach the person managing the list at > linux-cluster-owner at redhat.com > >When replying, please edit your Subject line so it is more specific >than "Re: Contents of Linux-cluster digest..." > > >Today's Topics: > > 1. Re: clvmd hangs on startup (Valeriu Mutu) > 2. Re: clvmd hangs on startup (Jeff Sturm) > 3. 
dlm-pcmk-3.0.17-1.fc14.x86_64 and > gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Gregory Bartholomew) > 4. Re: dlm-pcmk-3.0.17-1.fc14.x86_64 and > gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Fabio M. Di Nitto) > 5. Re: unable to live migrate a vm in rh el 6: Migration > unexpectedly failed (Lon Hohberger) > 6. Re: rgmanager not running (Sunil_Gupta2 at Dell.com) > 7. Re: unable to live migrate a vm in rh el 6: Migration > unexpectedly failed (Gianluca Cecchi) > 8. Re: dlm-pcmk-3.0.17-1.fc14.x86_64 and > gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Andrew Beekhof) > 9. Re: unable to live migrate a vm in rh el 6: Migration > unexpectedly failed (Gianluca Cecchi) > 10. Re: unable to live migrate a vm in rh el 6: Migration > unexpectedly failed (Gianluca Cecchi) > > >---------------------------------------------------------------------- > >Message: 1 >Date: Tue, 8 Mar 2011 12:11:53 -0500 >From: Valeriu Mutu >To: linux clustering >Subject: Re: [Linux-cluster] clvmd hangs on startup >Message-ID: <20110308171153.GB272 at bsdera.pcbi.upenn.edu> >Content-Type: text/plain; charset=us-ascii > >Hi, > >I think the problem is solved. I was using a 9000bytes MTU on the Xen virtual machines' iSCSI interface. Switching back to 1500bytes MTU caused the clvmd to start working. > >On Thu, Mar 03, 2011 at 11:50:57AM -0500, Valeriu Mutu wrote: > > >>On Wed, Mar 02, 2011 at 05:36:45PM -0500, Jeff Sturm wrote: >> >> >>>Double-check that the 2nd node can read and write the shared iSCSI >>>storage. >>> >>> >>Reading/writing from/to the iSCSI storage device works as seen below. >> >>On the 1st node: >>[root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes >>10000+0 records in >>10000+0 records out >>10240000 bytes (10 MB) copied, 3.39855 seconds, 3.0 MB/s >> >>[root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null >>10000+0 records in >>10000+0 records out >>10240000 bytes (10 MB) copied, 0.331069 seconds, 30.9 MB/s >> >>On the 2nd node: >>[root at vm2 ~]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes >>10000+0 records in >>10000+0 records out >>10240000 bytes (10 MB) copied, 3.2465 seconds, 3.2 MB/s >> >>[root at vm2 ~]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null >>10000+0 records in >>10000+0 records out >>10240000 bytes (10 MB) copied, 0.223337 seconds, 45.8 MB/s >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: corosync.log Type: text/x-log Size: 2424 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: dlm_controld.log Type: text/x-log Size: 190 bytes Desc: not available URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: messages URL: From Sunil_Gupta2 at Dell.com Wed Mar 9 12:14:17 2011 From: Sunil_Gupta2 at Dell.com (Sunil_Gupta2 at Dell.com) Date: Wed, 9 Mar 2011 17:44:17 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 83, Issue 13 In-Reply-To: <4D776362.4020203@midascomm.com> References: <4D776362.4020203@midascomm.com> Message-ID: <8EF1FE59C3C8694E94F558EB27E464B71D130C752D@BLRX7MCDC201.AMER.DELL.COM> One node is offline cluster is not formed....check if multicast traffic is working... 
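A concrete way to check the multicast path (omping ships as its own package on Fedora/RHEL; the node names below are the ones from the clustat output above) is to run it on both nodes at the same time:

    omping corviewprimary corviewsecondary

Each side should report both unicast and multicast responses from the other; multicast loss or silence usually points at IGMP snooping on the switch or at the host firewall. While at it, make sure the corosync ports are open on both nodes, along the lines of the rule posted earlier in this digest:

    iptables -I INPUT -p udp -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT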
--Sunil From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji Sent: Wednesday, March 09, 2011 4:54 PM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Linux-cluster Digest, Vol 83, Issue 13 Dear All, Please find attached log file for more analysis Please help me to solve this problem ASAP. Clustat Command Output is below [root at corviewprimary ~]# clustat Cluster Status for EMSCluster @ Wed Mar 9 17:00:03 2011 Member Status: Quorate Member Name ID Status ----------- ------- ---- ------ corviewprimary 1 Online, Local corviewsecondary 2 Offline [root at corviewprimary ~]# Regards, -S.Balaji linux-cluster-request at redhat.com wrote: Send Linux-cluster mailaddr:115.249.107.179ing list submissions to linux-cluster at redhat.com To subscribe or unsubscribe via the World Wide Web, visit https://www.redhat.com/mailman/listinfo/linux-cluster or, via email, send a message with subject or body 'help' to linux-cluster-request at redhat.com You can reach the person managing the list at linux-cluster-owner at redhat.com When replying, please edit your Subject line so it is more specific than "Re: Contents of Linux-cluster digest..." Today's Topics: 1. Re: clvmd hangs on startup (Valeriu Mutu) 2. Re: clvmd hangs on startup (Jeff Sturm) 3. dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Gregory Bartholomew) 4. Re: dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Fabio M. Di Nitto) 5. Re: unable to live migrate a vm in rh el 6: Migration unexpectedly failed (Lon Hohberger) 6. Re: rgmanager not running (Sunil_Gupta2 at Dell.com) 7. Re: unable to live migrate a vm in rh el 6: Migration unexpectedly failed (Gianluca Cecchi) 8. Re: dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Andrew Beekhof) 9. Re: unable to live migrate a vm in rh el 6: Migration unexpectedly failed (Gianluca Cecchi) 10. Re: unable to live migrate a vm in rh el 6: Migration unexpectedly failed (Gianluca Cecchi) ---------------------------------------------------------------------- Message: 1 Date: Tue, 8 Mar 2011 12:11:53 -0500 From: Valeriu Mutu To: linux clustering Subject: Re: [Linux-cluster] clvmd hangs on startup Message-ID: <20110308171153.GB272 at bsdera.pcbi.upenn.edu> Content-Type: text/plain; charset=us-ascii Hi, I think the problem is solved. I was using a 9000bytes MTU on the Xen virtual machines' iSCSI interface. Switching back to 1500bytes MTU caused the clvmd to start working. On Thu, Mar 03, 2011 at 11:50:57AM -0500, Valeriu Mutu wrote: On Wed, Mar 02, 2011 at 05:36:45PM -0500, Jeff Sturm wrote: Double-check that the 2nd node can read and write the shared iSCSI storage. Reading/writing from/to the iSCSI storage device works as seen below. 
On the 1st node: [root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes 10000+0 records in 10000+0 records out 10240000 bytes (10 MB) copied, 3.39855 seconds, 3.0 MB/s [root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null 10000+0 records in 10000+0 records out 10240000 bytes (10 MB) copied, 0.331069 seconds, 30.9 MB/s On the 2nd node: [root at vm2 ~]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes 10000+0 records in 10000+0 records out 10240000 bytes (10 MB) copied, 3.2465 seconds, 3.2 MB/s [root at vm2 ~]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null 10000+0 records in 10000+0 records out 10240000 bytes (10 MB) copied, 0.223337 seconds, 45.8 MB/s -------------- next part -------------- An HTML attachment was scrubbed... URL: From ooolinux at 163.com Wed Mar 9 14:13:35 2011 From: ooolinux at 163.com (yue) Date: Wed, 9 Mar 2011 22:13:35 +0800 (CST) Subject: [Linux-cluster] which is better gfs2 and ocfs2? Message-ID: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> which is better gfs2 and ocfs2? i want to share fc-san, do you know which is better? stablility,performmance? thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff.sturm at eprize.com Wed Mar 9 14:48:03 2011 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Wed, 9 Mar 2011 09:48:03 -0500 Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> References: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> Message-ID: <64D0546C5EBBD147B75DE133D798665F0855C34D@hugo.eprize.local> Do you expect to get an objective answer to that from a Red Hat list? Most users on this forum are familiar with GFS2, some may have tried OCFS2 but there's bound to be a bias. GFS has been extremely stable for us (haven't migrated to GFS2 yet, went into production with GFS in 2008). Just last night in fact a single hardware node failed in one of our virtual test clusters, the fencing operations were successful and everything recovered nicely. The cluster never lost quorum and disruption was minimal. Performance is highly variable depending on the software application. We have developed our own application which gave us freedom to tailor it for GFS, improving performance and throughput significantly. Regardless of what you hear, why not give both a try? Your evaluation and feedback would be very useful to the cluster community. -Jeff From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of yue Sent: Wednesday, March 09, 2011 9:14 AM To: linux-cluster Subject: [Linux-cluster] which is better gfs2 and ocfs2? which is better gfs2 and ocfs2? i want to share fc-san, do you know which is better? stablility,performmance? thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.lackner at unileoben.ac.at Wed Mar 9 14:53:40 2011 From: michael.lackner at unileoben.ac.at (Michael Lackner) Date: Wed, 09 Mar 2011 15:53:40 +0100 Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> References: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> Message-ID: <4D779474.6020509@unileoben.ac.at> I guess not all usage scenarios are comparable, but I once tried to use GFS2 as well as OCFS2 to share a FC SAN to three nodes using 8GBit FC and 1GBit Ethernet for the cluster communication. 
Additionally, i compared it to a trial version of Dataplows SAN File System (SFS). I was also supposed to compare it to Quantum StorNext, but there just wasn't enough time for that. OS was CentOS 5.3 at that time. So I tried a lot of performance tuning settings for all three, and it was like this: 1.) SFS was the fastest, but caused reproducible kernel panics. Those were fixed by Dataplow, but then SFS produced corrupted data when writing large files. Unusable in that state, so we gave up. SFS uses NFS for lock management. Noteworthy: Writing data on the machine with the NFS lock manager also crippled the I/O performance for all the other nodes in a VERY, VERY bad way.. 2.) GFS2 was the slowest, and despite all the tunings I tried, it never came close to anything that any local FS would provide in terms of speed (compared to EXT3 and XFS). The statfs() calls pretty much crippled the FS. Multiple I/O streams on multiple nodes: Not a good idea it seems.. Sometimes you have to wait for minutes for the FS to just give you any feedback, when you're hammering it with let's say 30 sequential write streams across 3 nodes, with the streams equally distributed among them. 3.) OCFS2 was slightly faster than GFS2, especially when it came to statfs(), like ls -l. It did not slow down that much. But overall, it was still just far too slow. Our solution: Hook up the SAN on one node only, and share via NFS over GBit Ethernet. Overall, we are getting better results even with the obvious network overhead, especially when doing a lot of I/O on multiple clients. Our original goal was to provide a high-speed centralized storage solution for multiple nodes without having to use ethernet. This failed completely unfortunately. Hope this helps, it's just my experience though. As usual, mileage may vary... yue wrote: > which is better gfs2 and ocfs2? > i want to share fc-san, do you know which is better? > stablility,performmance? -- Michael Lackner Lehrstuhl f?r Informationstechnologie, Montanuniversit?t Leoben IT Administration michael.lackner at mu-leoben.at | +43 (0)3842/402-1505 From rhurst at bidmc.harvard.edu Wed Mar 9 15:23:31 2011 From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu) Date: Wed, 9 Mar 2011 10:23:31 -0500 Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> References: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> Message-ID: <50168EC934B8D64AA8D8DD37F840F3DE0568486C63@EVS2CCR.its.caregroup.org> Depends on the application's use of the filesystem and your processing usage patterns. You should evaluate both. We're using 8gbit FC and 1gb private networking between two pairs (test & production) of 4-node clusters on IBM BladeCenter. We used RHEL 4 GFS from 2007 - 2010 without issue. Upgraded to RHEL 5u5, then most of the GFS filesystems to GFS2 without issues (yet). We'd like to see a working cluster configuration using GFS2 on KVM guests. ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of yue Sent: Wednesday, March 09, 2011 9:14 AM To: linux-cluster Subject: [Linux-cluster] which is better gfs2 and ocfs2? which is better gfs2 and ocfs2? i want to share fc-san, do you know which is better? stablility,performmance? thanks -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gregory.lee.bartholomew at gmail.com Wed Mar 9 15:33:53 2011 From: gregory.lee.bartholomew at gmail.com (Gregory Bartholomew) Date: Wed, 09 Mar 2011 09:33:53 -0600 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: References: <4D76894B.6010809@gmail.com> <4D7689B9.7070906@redhat.com> Message-ID: <4D779DE1.7040306@gmail.com> Arrr ... I was just starting to get all this figured out and you've gone and changed EVERYTHING!!! :-) Since I'm now using cman, should I favor the RA's that are listed by "crm ra list ocf redhat" (ocf:redhat:ip.sh instead of ocf:heartbeat:IPaddr2, ocf:redhat:apache.sh instead of ocf:heartbeat:apache, etc.)? gb On 03/09/2011 02:48 AM, Andrew Beekhof wrote: > On Tue, Mar 8, 2011 at 8:55 PM, Fabio M. Di Nitto wrote: >> On 03/08/2011 08:53 PM, Gregory Bartholomew wrote: >>> Hi Fabio M. Di Nitto, >>> >>> FYI, I was just trying to set up gfs2 under pacemaker on Fedora 14 >>> X86_64 and although yum provides '*/gfs_controld.pcmk' showed that I >>> needed the dlm-pcmk-3.0.17-1.fc14.x86_64 and >>> gfs-pcmk-3.0.17-1.fc14.x86_64 packages, yum install dlm-pcmk gfs-pcmk >>> would simply report "Nothing to do". rpm -q showed that I didn't have >>> the packages installed. I tried installing the cman package but that >>> didn't help. I finally got it working by downloading the packages with >>> wget and installing them with rpm -ivh. >>> >>> FYI, the dlm-pcmk and gfs-pcmk packages seem to be broken in the Fedora >>> 14 x86_64 database at the moment. >> >> No, those packages have been removed intentionally since pacemaker now >> supports cman cluster manager and they become obsoleted. >> >> So very short summary: >> >> configure cman for clusternodes >> start cman (including dlm/gfs controld) >> tell pacemaker to use cman >> configure fencing and all services. > > A week or so ago I added a big warning to the bottom of: > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02.html > > and an appendix for configuring cman+pacemaker. > Hopefully it will be of some help. From thomas at sjolshagen.net Wed Mar 9 15:48:10 2011 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Wed, 09 Mar 2011 10:48:10 -0500 Subject: [Linux-cluster] =?utf-8?q?which_is_better_gfs2_and_ocfs2=3F?= In-Reply-To: <50168EC934B8D64AA8D8DD37F840F3DE0568486C63@EVS2CCR.its.caregroup.org> References: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> <50168EC934B8D64AA8D8DD37F840F3DE0568486C63@EVS2CCR.its.caregroup.org> Message-ID: On Wed, 9 Mar 2011 10:23:31 -0500, rhurst at bidmc.harvard.edu wrote: > We'd like to see a working cluster configuration using GFS2 on KVM guests. Got one of those although it's a very small-scale setup using Fedora 14. Bare metal uses gfs2 for hosting KVM image files. vm's managed by RH cluster3 stack. VMs use gfs2 for sharing Maildir spool for a small postfix/dovecot setup with webmail frontend. All on shared iSCSI based storage. -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrew at beekhof.net Wed Mar 9 15:54:42 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Wed, 9 Mar 2011 16:54:42 +0100 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: <4D779DE1.7040306@gmail.com> References: <4D76894B.6010809@gmail.com> <4D7689B9.7070906@redhat.com> <4D779DE1.7040306@gmail.com> Message-ID: On Wed, Mar 9, 2011 at 4:33 PM, Gregory Bartholomew wrote: > Arrr ... 
I was just starting to get all this figured out and you've gone and > changed EVERYTHING!!! :-) Very little is actually changed :-) These days cman is mostly just a small corosync plugin. I'm not sure if this was the case back when we ported Pacemaker to corosync, but it would have simplified a lot if we'd sucked in that little plugin instead of writing our own. > > Since I'm now using cman, should I favor the RA's that are listed by "crm ra > list ocf redhat" (ocf:redhat:ip.sh instead of ocf:heartbeat:IPaddr2, > ocf:redhat:apache.sh instead of ocf:heartbeat:apache, etc.)? No, we're only using cman for its quorum and membership information. And the only reason for doing that is so that everything is getting it from the same source (and the "native" pcmk variants aren't widely available). Everything else is unchanged. > > gb > > On 03/09/2011 02:48 AM, Andrew Beekhof wrote: >> >> On Tue, Mar 8, 2011 at 8:55 PM, Fabio M. Di Nitto >> ?wrote: >>> >>> On 03/08/2011 08:53 PM, Gregory Bartholomew wrote: >>>> >>>> Hi Fabio M. Di Nitto, >>>> >>>> FYI, I was just trying to set up gfs2 under pacemaker on Fedora 14 >>>> X86_64 and although yum provides '*/gfs_controld.pcmk' showed that I >>>> needed the dlm-pcmk-3.0.17-1.fc14.x86_64 and >>>> gfs-pcmk-3.0.17-1.fc14.x86_64 packages, yum install dlm-pcmk gfs-pcmk >>>> would simply report "Nothing to do". ?rpm -q showed that I didn't have >>>> the packages installed. ?I tried installing the cman package but that >>>> didn't help. ?I finally got it working by downloading the packages with >>>> wget and installing them with rpm -ivh. >>>> >>>> FYI, the dlm-pcmk and gfs-pcmk packages seem to be broken in the Fedora >>>> 14 x86_64 database at the moment. >>> >>> No, those packages have been removed intentionally since pacemaker now >>> supports cman cluster manager and they become obsoleted. >>> >>> So very short summary: >>> >>> configure cman for clusternodes >>> start cman (including dlm/gfs controld) >>> tell pacemaker to use cman >>> configure fencing and all services. >> >> A week or so ago I added a big warning to the bottom of: >> >> ?http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02.html >> >> and an appendix for configuring cman+pacemaker. >> Hopefully it will be of some help. > From gregory.lee.bartholomew at gmail.com Wed Mar 9 17:01:17 2011 From: gregory.lee.bartholomew at gmail.com (Gregory Bartholomew) Date: Wed, 09 Mar 2011 11:01:17 -0600 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: <4D7689B9.7070906@redhat.com> References: <4D76894B.6010809@gmail.com> <4D7689B9.7070906@redhat.com> Message-ID: <4D77B25D.5010603@gmail.com> On 03/08/2011 01:55 PM, Fabio M. Di Nitto wrote: > On 03/08/2011 08:53 PM, Gregory Bartholomew wrote: >> Hi Fabio M. Di Nitto, >> >> FYI, I was just trying to set up gfs2 under pacemaker on Fedora 14 >> X86_64 and although yum provides '*/gfs_controld.pcmk' showed that I >> needed the dlm-pcmk-3.0.17-1.fc14.x86_64 and >> gfs-pcmk-3.0.17-1.fc14.x86_64 packages, yum install dlm-pcmk gfs-pcmk >> would simply report "Nothing to do". rpm -q showed that I didn't have >> the packages installed. I tried installing the cman package but that >> didn't help. I finally got it working by downloading the packages with >> wget and installing them with rpm -ivh. >> >> FYI, the dlm-pcmk and gfs-pcmk packages seem to be broken in the Fedora >> 14 x86_64 database at the moment. 
> > No, those packages have been removed intentionally since pacemaker now > supports cman cluster manager and they become obsoleted. > > So very short summary: > > configure cman for clusternodes > start cman (including dlm/gfs controld) > tell pacemaker to use cman > configure fencing and all services. > > Fabio Hello again Linux Clustering group, So I've switched to using cman and I have an IP resource shared between my two nodes, but now I'm having trouble with GFS2 again. When I try to mount my active/active iscsi partition, I get: [root at eb2024-58 ~]# mount /dev/sda1 /mnt gfs_controld join connect error: Connection refused error mounting lockproto lock_dlm I was able to create a partition on my iscsi device and format it with "mkfs.gfs2 -p lock_dlm -j 2 -t pcmk:iscsi /dev/sda1" and I can see the partition on both nodes with "fdisk -l", so I think everything iscsi is working. I was said earlier that I needed to "start cman (including dlm/gfs controld)". I see the cman service and it is started and running, but the only dlm/gfs service that I see is one called "gfs2" and when I try to start it I get: [root at eb2024-58 ~]# service gfs2 start GFS2: no entries found in /etc/fstab So why can't I mount my iscsi partition and where are these elusive dlm/gfs crontrold services? Thanks, gb From gregory.lee.bartholomew at gmail.com Wed Mar 9 18:03:09 2011 From: gregory.lee.bartholomew at gmail.com (Gregory Bartholomew) Date: Wed, 09 Mar 2011 12:03:09 -0600 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: <4D77B25D.5010603@gmail.com> References: <4D76894B.6010809@gmail.com> <4D7689B9.7070906@redhat.com> <4D77B25D.5010603@gmail.com> Message-ID: <4D77C0DD.4070405@gmail.com> On 03/09/2011 11:01 AM, Gregory Bartholomew wrote: > Hello again Linux Clustering group, > > So I've switched to using cman and I have an IP resource shared between > my two nodes, but now I'm having trouble with GFS2 again. When I try to > mount my active/active iscsi partition, I get: > > [root at eb2024-58 ~]# mount /dev/sda1 /mnt > gfs_controld join connect error: Connection refused > error mounting lockproto lock_dlm > > I was able to create a partition on my iscsi device and format it with > "mkfs.gfs2 -p lock_dlm -j 2 -t pcmk:iscsi /dev/sda1" and I can see the > partition on both nodes with "fdisk -l", so I think everything iscsi is > working. > > I was said earlier that I needed to "start cman (including dlm/gfs > controld)". I see the cman service and it is started and running, but > the only dlm/gfs service that I see is one called "gfs2" and when I try > to start it I get: > > [root at eb2024-58 ~]# service gfs2 start > GFS2: no entries found in /etc/fstab > > So why can't I mount my iscsi partition and where are these elusive > dlm/gfs crontrold services? > > Thanks, > gb Never mind, I figured it out ... I needed to install the gfs2-cluster package and start its service and I also had a different name for my cluster in /etc/cluster/cluster.conf than what I was using in my mkfs.gfs2 command. It's all working now. Thanks to those who helped me get this going, gb From balajisundar at midascomm.com Thu Mar 10 04:42:31 2011 From: balajisundar at midascomm.com (Balaji) Date: Thu, 10 Mar 2011 10:12:31 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 83, Issue 15 In-Reply-To: References: Message-ID: <4D7856B7.4070105@midascomm.com> Dear All, Currently other node is shutdown. 
First of all we will check the cluster is up in simplex mode Regards -S.Balaji linux-cluster-request at redhat.com wrote: >Send Linux-cluster mailing list submissions to > linux-cluster at redhat.com > >To subscribe or unsubscribe via the World Wide Web, visit > https://www.redhat.com/mailman/listinfo/linux-cluster >or, via email, send a message with subject or body 'help' to > linux-cluster-request at redhat.com > >You can reach the person managing the list at > linux-cluster-owner at redhat.com > >When replying, please edit your Subject line so it is more specific >than "Re: Contents of Linux-cluster digest..." > > >Today's Topics: > > 1. Re: Linux-cluster Digest, Vol 83, Issue 13 (Sunil_Gupta2 at Dell.com) > 2. which is better gfs2 and ocfs2? (yue) > 3. Re: which is better gfs2 and ocfs2? (Jeff Sturm) > 4. Re: which is better gfs2 and ocfs2? (Michael Lackner) > 5. Re: which is better gfs2 and ocfs2? (rhurst at bidmc.harvard.edu) > 6. Re: dlm-pcmk-3.0.17-1.fc14.x86_64 and > gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Gregory Bartholomew) > 7. Re: which is better gfs2 and ocfs2? (Thomas Sjolshagen) > 8. Re: dlm-pcmk-3.0.17-1.fc14.x86_64 and > gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Andrew Beekhof) > > >---------------------------------------------------------------------- > >Message: 1 >Date: Wed, 9 Mar 2011 17:44:17 +0530 >From: >To: >Subject: Re: [Linux-cluster] Linux-cluster Digest, Vol 83, Issue 13 >Message-ID: > <8EF1FE59C3C8694E94F558EB27E464B71D130C752D at BLRX7MCDC201.AMER.DELL.COM> > >Content-Type: text/plain; charset="us-ascii" > >One node is offline cluster is not formed....check if multicast traffic is working... > >--Sunil > >From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Balaji >Sent: Wednesday, March 09, 2011 4:54 PM >To: linux-cluster at redhat.com >Subject: Re: [Linux-cluster] Linux-cluster Digest, Vol 83, Issue 13 > >Dear All, > > Please find attached log file for more analysis > Please help me to solve this problem ASAP. > > Clustat Command Output is below > [root at corviewprimary ~]# clustat > Cluster Status for EMSCluster @ Wed Mar 9 17:00:03 2011 > Member Status: Quorate > > Member Name ID Status > ----------- ------- ---- ------ > corviewprimary 1 Online, Local > corviewsecondary 2 Offline > > [root at corviewprimary ~]# > >Regards, >-S.Balaji > >linux-cluster-request at redhat.com wrote: > >Send Linux-cluster mailaddr:115.249.107.179ing list submissions to > > linux-cluster at redhat.com > > > >To subscribe or unsubscribe via the World Wide Web, visit > > https://www.redhat.com/mailman/listinfo/linux-cluster > >or, via email, send a message with subject or body 'help' to > > linux-cluster-request at redhat.com > > > >You can reach the person managing the list at > > linux-cluster-owner at redhat.com > > > >When replying, please edit your Subject line so it is more specific > >than "Re: Contents of Linux-cluster digest..." > > > > > >Today's Topics: > > > > 1. Re: clvmd hangs on startup (Valeriu Mutu) > > 2. Re: clvmd hangs on startup (Jeff Sturm) > > 3. dlm-pcmk-3.0.17-1.fc14.x86_64 and > > gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Gregory Bartholomew) > > 4. Re: dlm-pcmk-3.0.17-1.fc14.x86_64 and > > gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Fabio M. Di Nitto) > > 5. Re: unable to live migrate a vm in rh el 6: Migration > > unexpectedly failed (Lon Hohberger) > > 6. Re: rgmanager not running (Sunil_Gupta2 at Dell.com) > > 7. Re: unable to live migrate a vm in rh el 6: Migration > > unexpectedly failed (Gianluca Cecchi) > > 8. 
Re: dlm-pcmk-3.0.17-1.fc14.x86_64 and > > gfs-pcmk-3.0.17-1.fc14.x86_64 woes (Andrew Beekhof) > > 9. Re: unable to live migrate a vm in rh el 6: Migration > > unexpectedly failed (Gianluca Cecchi) > > 10. Re: unable to live migrate a vm in rh el 6: Migration > > unexpectedly failed (Gianluca Cecchi) > > > > > >---------------------------------------------------------------------- > > > >Message: 1 > >Date: Tue, 8 Mar 2011 12:11:53 -0500 > >From: Valeriu Mutu > >To: linux clustering > >Subject: Re: [Linux-cluster] clvmd hangs on startup > >Message-ID: <20110308171153.GB272 at bsdera.pcbi.upenn.edu> > >Content-Type: text/plain; charset=us-ascii > > > >Hi, > > > >I think the problem is solved. I was using a 9000bytes MTU on the Xen virtual machines' iSCSI interface. Switching back to 1500bytes MTU caused the clvmd to start working. > > > >On Thu, Mar 03, 2011 at 11:50:57AM -0500, Valeriu Mutu wrote: > > > >On Wed, Mar 02, 2011 at 05:36:45PM -0500, Jeff Sturm wrote: > > > >Double-check that the 2nd node can read and write the shared iSCSI > >storage. > > > >Reading/writing from/to the iSCSI storage device works as seen below. > > > >On the 1st node: > >[root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes > >10000+0 records in > >10000+0 records out > >10240000 bytes (10 MB) copied, 3.39855 seconds, 3.0 MB/s > > > >[root at vm1 cluster]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null > >10000+0 records in > >10000+0 records out > >10240000 bytes (10 MB) copied, 0.331069 seconds, 30.9 MB/s > > > >On the 2nd node: > >[root at vm2 ~]# dd count=10000 bs=1024 if=/dev/urandom of=/dev/mapper/pcbi-homes > >10000+0 records in > >10000+0 records out > >10240000 bytes (10 MB) copied, 3.2465 seconds, 3.2 MB/s > > > >[root at vm2 ~]# dd count=10000 bs=1024 if=/dev/mapper/pcbi-homes of=/dev/null > >10000+0 records in > >10000+0 records out > >10240000 bytes (10 MB) copied, 0.223337 seconds, 45.8 MB/s > > > > > > > >-------------- next part -------------- >An HTML attachment was scrubbed... >URL: > >------------------------------ > >Message: 2 >Date: Wed, 9 Mar 2011 22:13:35 +0800 (CST) >From: yue >To: linux-cluster >Subject: [Linux-cluster] which is better gfs2 and ocfs2? >Message-ID: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux at 163.com> >Content-Type: text/plain; charset="gbk" > >which is better gfs2 and ocfs2? >i want to share fc-san, do you know which is better? >stablility,performmance? > > >thanks >-------------- next part -------------- >An HTML attachment was scrubbed... >URL: > >------------------------------ > >Message: 3 >Date: Wed, 9 Mar 2011 09:48:03 -0500 >From: Jeff Sturm >To: linux clustering >Subject: Re: [Linux-cluster] which is better gfs2 and ocfs2? >Message-ID: > <64D0546C5EBBD147B75DE133D798665F0855C34D at hugo.eprize.local> >Content-Type: text/plain; charset="us-ascii" > >Do you expect to get an objective answer to that from a Red Hat list? >Most users on this forum are familiar with GFS2, some may have tried >OCFS2 but there's bound to be a bias. > > > >GFS has been extremely stable for us (haven't migrated to GFS2 yet, went >into production with GFS in 2008). Just last night in fact a single >hardware node failed in one of our virtual test clusters, the fencing >operations were successful and everything recovered nicely. The cluster >never lost quorum and disruption was minimal. > > > >Performance is highly variable depending on the software application. 
>We have developed our own application which gave us freedom to tailor it >for GFS, improving performance and throughput significantly. > > > >Regardless of what you hear, why not give both a try? Your evaluation >and feedback would be very useful to the cluster community. > > > >-Jeff > > > >From: linux-cluster-bounces at redhat.com >[mailto:linux-cluster-bounces at redhat.com] On Behalf Of yue >Sent: Wednesday, March 09, 2011 9:14 AM >To: linux-cluster >Subject: [Linux-cluster] which is better gfs2 and ocfs2? > > > >which is better gfs2 and ocfs2? > >i want to share fc-san, do you know which is better? > >stablility,performmance? > > > > > >thanks > > > >-------------- next part -------------- >An HTML attachment was scrubbed... >URL: > >------------------------------ > >Message: 4 >Date: Wed, 09 Mar 2011 15:53:40 +0100 >From: Michael Lackner >To: linux clustering >Subject: Re: [Linux-cluster] which is better gfs2 and ocfs2? >Message-ID: <4D779474.6020509 at unileoben.ac.at> >Content-Type: text/plain; charset=UTF-8; format=flowed > >I guess not all usage scenarios are comparable, but I once >tried to use GFS2 as well as OCFS2 to share a FC SAN to three >nodes using 8GBit FC and 1GBit Ethernet for the cluster >communication. Additionally, i compared it to a trial version >of Dataplows SAN File System (SFS). I was also supposed to >compare it to Quantum StorNext, but there just wasn't enough >time for that. > >OS was CentOS 5.3 at that time. > >So I tried a lot of performance tuning settings for all three, >and it was like this: > >1.) SFS was the fastest, but caused reproducible kernel panics. >Those were fixed by Dataplow, but then SFS produced corrupted data >when writing large files. Unusable in that state, so we gave up. >SFS uses NFS for lock management. Noteworthy: Writing data on the >machine with the NFS lock manager also crippled the I/O performance >for all the other nodes in a VERY, VERY bad way.. > >2.) GFS2 was the slowest, and despite all the tunings I tried, it >never came close to anything that any local FS would provide in >terms of speed (compared to EXT3 and XFS). The statfs() calls >pretty much crippled the FS. Multiple I/O streams on multiple nodes: >Not a good idea it seems.. Sometimes you have to wait for minutes >for the FS to just give you any feedback, when you're hammering >it with let's say 30 sequential write streams across 3 nodes, with >the streams equally distributed among them. > >3.) OCFS2 was slightly faster than GFS2, especially when it came >to statfs(), like ls -l. It did not slow down that much. But overall, >it was still just far too slow. > >Our solution: Hook up the SAN on one node only, and share via NFS >over GBit Ethernet. Overall, we are getting better results even >with the obvious network overhead, especially when doing a lot of >I/O on multiple clients. > >Our original goal was to provide a high-speed centralized storage >solution for multiple nodes without having to use ethernet. This >failed completely unfortunately. > >Hope this helps, it's just my experience though. As usual, mileage >may vary... > >yue wrote: > > >>which is better gfs2 and ocfs2? >>i want to share fc-san, do you know which is better? >>stablility,performmance? >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From andrew at beekhof.net Thu Mar 10 07:14:53 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 10 Mar 2011 08:14:53 +0100 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: <4D77C0DD.4070405@gmail.com> References: <4D76894B.6010809@gmail.com> <4D7689B9.7070906@redhat.com> <4D77B25D.5010603@gmail.com> <4D77C0DD.4070405@gmail.com> Message-ID: On Wed, Mar 9, 2011 at 7:03 PM, Gregory Bartholomew wrote: > Never mind, I figured it out ... I needed to install the gfs2-cluster > package and start its service and I also had a different name for my cluster > in /etc/cluster/cluster.conf than what I was using in my mkfs.gfs2 command. > > It's all working now. ?Thanks to those who helped me get this going, So you're still using Pacemaker to mount/unmount the filesystem and other services? If so, were there any discrepancies in the documentation describing how to configure this? From gianluca.cecchi at gmail.com Thu Mar 10 15:18:37 2011 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Thu, 10 Mar 2011 16:18:37 +0100 Subject: [Linux-cluster] unable to live migrate a vm in rh el 6: Migration unexpectedly failed In-Reply-To: References: Message-ID: On Wed, Mar 9, 2011 at 9:47 AM, Gianluca Cecchi wrote: [snip] > Or something related with firewall perhaps. > Can I stop firewall at all and have libvirtd working at the same time > to test ...? > I know libvirtd puts some iptables rules itself.. > > Gianluca > OK. It was indeed a problem related to iptables rules. After adding at both ends this rule about intracluster network tcp ports (.31 for the other node) I get live migration working ok using clusvcadm command iptables -t filter -I INPUT 17 -s 192.168.16.32/32 -p tcp -m multiport --dports 49152:49215 -j ACCEPT I'm going to put it in /etc/sysconfig/iptables in the middle of these two: -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT -A INPUT -j REJECT --reject-with icmp-host-prohibited I can also simulate the clusvcadm command with virsh (after freezing the resource) with virsh migrate --live exorapr1 qemu+ssh://intrarhev2/system tcp:intrarhev2 otherwise the ssh connection is tunneled through hostname in connection string, but data exchange happens anyway through the public lan (or what hostname resolves to, I suppose). BTW: I noticed that when you freeze a vm resource you don't get the [Z] notification at the right side of the corresponding line, as it happens with standard services... Is this intentional or could I post a bugzilla for it? For a service, when frozen: service:MYSRV intrarhev2 started [Z] [root at rhev2 ]# clusvcadm -Z vm:exorapr1 Local machine freezing vm:exorapr1...Success [root at rhev2 ]# clustat | grep orapr1 vm:exorapr1 intrarhev1 started Cheers, Gianluca From gregory.lee.bartholomew at gmail.com Thu Mar 10 15:52:26 2011 From: gregory.lee.bartholomew at gmail.com (Gregory Bartholomew) Date: Thu, 10 Mar 2011 09:52:26 -0600 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: References: <4D76894B.6010809@gmail.com> <4D7689B9.7070906@redhat.com> <4D77B25D.5010603@gmail.com> <4D77C0DD.4070405@gmail.com> Message-ID: <4D78F3BA.1000904@gmail.com> On 03/10/2011 01:14 AM, Andrew Beekhof wrote: > On Wed, Mar 9, 2011 at 7:03 PM, Gregory Bartholomew > wrote: >> Never mind, I figured it out ... 
I needed to install the gfs2-cluster >> package and start its service and I also had a different name for my cluster >> in /etc/cluster/cluster.conf than what I was using in my mkfs.gfs2 command. >> >> It's all working now. Thanks to those who helped me get this going, > > So you're still using Pacemaker to mount/unmount the filesystem and > other services? > If so, were there any discrepancies in the documentation describing > how to configure this? Good morning, This is what I did to get the file system going: ----- yum install -y httpd gfs2-cluster gfs2-utils chkconfig gfs2-cluster on service gfs2-cluster start mkfs.gfs2 -p lock_dlm -j 2 -t siue-cs:iscsi /dev/sda1 cat <<-END | crm configure primitive gfs ocf:heartbeat:Filesystem params device="/dev/sda1" directory="/var/www/html" fstype="gfs2" op start interval="0" timeout="60s" op stop interval="0" timeout="60s" configure clone dual-gfs gfs END ----- I think this sed command was also missing from the guide: sed -i '/^#/,/#<\/Location>/{s/^#//;s/Allow from .example.com/Allow from 127.0.0.1/}' /etc/httpd/conf/httpd.conf I've attached the full record of all the commands that I used to set up my nodes to this email. It has, at the end, the final result of "crm configure show". gb -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: node-config.txt URL: From jinzishuai at gmail.com Thu Mar 10 17:09:22 2011 From: jinzishuai at gmail.com (Shi Jin) Date: Thu, 10 Mar 2011 10:09:22 -0700 Subject: [Linux-cluster] What is the proper procedure to reboot a node in a cluster? Message-ID: Hi there, I've setup a two-node cluster with cman, clvmd and gfs2. I don't use qdisk but had I would like to know what is the proper procedure to reboot a node in the two-node cluster (maybe this applies for all size?) when both nodes are functioning fine but I just want to reboot one for some reason (for example, upgrade kernel). Is there a preferred/better way to reboot the machine rather than just running the "reboot" command as root. I have been doing the "reboot" command so far and it sometimes creates problems for us, including making the other node to fail. Thank you very much. Shi -- Shi Jin, Ph.D. -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Thu Mar 10 17:21:23 2011 From: linux at alteeve.com (Digimer) Date: Thu, 10 Mar 2011 12:21:23 -0500 Subject: [Linux-cluster] What is the proper procedure to reboot a node in a cluster? In-Reply-To: References: Message-ID: <4D790893.6000107@alteeve.com> On 03/10/2011 12:09 PM, Shi Jin wrote: > Hi there, > > I've setup a two-node cluster with cman, clvmd and gfs2. I don't use > qdisk but had > > > I would like to know what is the proper procedure to reboot a node in > the two-node cluster (maybe this applies for all size?) when both nodes > are functioning fine but I just want to reboot one for some reason (for > example, upgrade kernel). Is there a preferred/better way to reboot the > machine rather than just running the "reboot" command as root. I have > been doing the "reboot" command so far and it sometimes creates problems > for us, including making the other node to fail. > > Thank you very much. > Shi > -- > Shi Jin, Ph.D. What I do is migrate any services from the node to the other member, then stop rgmanager->gfs2->clvmd->cman (obviously adapt to what you are running). If you have DRBD, then stop it as well. 
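Spelled out as commands on the node being taken down, that sequence is roughly the following (a sketch only, assuming the stock init scripts and that rgmanager, gfs2, clvmd and cman are what is actually running; the service and node names below are placeholders):

# 1. Relocate anything this node is hosting to the surviving member, e.g.:
clusvcadm -r service:myservice -m othernode   # placeholder names

# 2. Stop the stack top-down, in the order above:
service rgmanager stop
service gfs2 stop        # unmounts the clustered GFS2 filesystems
service clvmd stop
service cman stop
# stop drbd here as well, if DRBD is in use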
At this point, the other node should be the only one in the cluster (confirm with
'cman_tool status'). If all is good, reboot. Once up, rejoin the cluster.

-- 
Digimer
E-Mail: digimer at alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin: http://nodeassassin.org

From ricks at nerd.com Thu Mar 10 17:40:09 2011
From: ricks at nerd.com (Rick Stevens)
Date: Thu, 10 Mar 2011 09:40:09 -0800
Subject: [Linux-cluster] Linux-cluster Digest, Vol 83, Issue 15
In-Reply-To: <4D7856B7.4070105@midascomm.com>
References: <4D7856B7.4070105@midascomm.com>
Message-ID: <4D790CF9.9070406@nerd.com>

On 03/09/2011 08:42 PM, Balaji wrote:
> Dear All,
>
> Currently other node is shutdown.
> First of all we will check the cluster is up in simplex mode

Please don't respond to message digests. Create a NEW message with an
appropriate subject line along with your question or comment.

-- 
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer, C2 Hosting          ricks at nerd.com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
- Never put off 'til tommorrow what you can forget altogether!       -
----------------------------------------------------------------------

From lomazzog at dteenergy.com Thu Mar 10 19:05:02 2011
From: lomazzog at dteenergy.com (Gino Lomazzo)
Date: Thu, 10 Mar 2011 14:05:02 -0500
Subject: [Linux-cluster] Performing fsck on large gfs file-systems.
In-Reply-To: <4D790CF9.9070406@nerd.com>
Message-ID:

Good Afternoon;

We currently have a critical Oracle application running on a two node
Red Hat Cluster environment. (RHEL5u5)
Our /oracle/d01 ( gfs file system) is about 2TB, when the system reboots
it takes a few hours to perform a fsck.
Is a fsck required on a gfs file system?

Thank you!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From gregory.lee.bartholomew at gmail.com Thu Mar 10 19:42:57 2011 From: gregory.lee.bartholomew at gmail.com (Gregory Bartholomew) Date: Thu, 10 Mar 2011 13:42:57 -0600 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: <4D78F3BA.1000904@gmail.com> References: <4D76894B.6010809@gmail.com> <4D7689B9.7070906@redhat.com> <4D77B25D.5010603@gmail.com> <4D77C0DD.4070405@gmail.com> <4D78F3BA.1000904@gmail.com> Message-ID: <4D7929C1.3080209@gmail.com> FYI, per: > Cluster shutdown tips > --------------------- > > * Avoiding a partly shutdown cluster due to lost quorum. > > There is a practical timing issue with respect to the shutdown steps being run > on all nodes when shutting down an entire cluster (or most of it). When > shutting down the entire cluster (or shutting down a node for an extended > period) use "cman_tool leave remove". This automatically reduces the number > of votes needed for quorum as each node leaves and prevents the loss of quorum > which could keep the last nodes from cleanly completing shutdown. > > Using the "remove" leave option should not be used in general since it > introduces potential split-brain risks. > > If the "remove" leave option is not used, quorum will be lost after enough > nodes have left the cluster. Once the cluster is inquorate, remaining members > that have not yet completed "fence_tool leave" in the steps above will be > stuck. Operations such as umounting gfs or leaving the fence domain will > block while the cluster is inquorate. They can continue and complete only > when quorum is regained. > > If this happens, one option is to join the cluster ("cman_tool join") on some > of the nodes that have left so that the cluster regains quorum and the stuck > nodes can complete their shutdown. Another option is to forcibly reduce the > number of expected votes for the cluster which allows the cluster to become > quorate again ("cman_tool expected "). > > ... > > Two node clusters > ----------------- > > Ordinarily the loss of quorum after one node fails out of two will prevent the > remaining node from continuing (if both nodes have one vote.) Some special > configuration options can be set to allow the one remaining node to continue > operating if the other fails. To do this only two nodes with one vote each can > be defined in cluster.conf. The two_node and expected_votes values must then be > set to 1 in the cman config section as follows. > > > > In http://sourceware.org/cluster/doc/usage.txt, it looks like example C.1 in http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Clusters_from_Scratch/index.html#ap-cman should be changed to: gb On 03/10/2011 09:52 AM, Gregory Bartholomew wrote: > On 03/10/2011 01:14 AM, Andrew Beekhof wrote: >> On Wed, Mar 9, 2011 at 7:03 PM, Gregory Bartholomew >> wrote: >>> Never mind, I figured it out ... I needed to install the gfs2-cluster >>> package and start its service and I also had a different name for my >>> cluster >>> in /etc/cluster/cluster.conf than what I was using in my mkfs.gfs2 >>> command. >>> >>> It's all working now. Thanks to those who helped me get this going, >> >> So you're still using Pacemaker to mount/unmount the filesystem and >> other services? >> If so, were there any discrepancies in the documentation describing >> how to configure this? 
> > Good morning, > > This is what I did to get the file system going: > > ----- > > yum install -y httpd gfs2-cluster gfs2-utils > chkconfig gfs2-cluster on > service gfs2-cluster start > > mkfs.gfs2 -p lock_dlm -j 2 -t siue-cs:iscsi /dev/sda1 > > cat <<-END | crm > configure primitive gfs ocf:heartbeat:Filesystem params > device="/dev/sda1" directory="/var/www/html" fstype="gfs2" op start > interval="0" timeout="60s" op stop interval="0" timeout="60s" > configure clone dual-gfs gfs > END > > ----- > > I think this sed command was also missing from the guide: > > sed -i '/^#/,/#<\/Location>/{s/^#//;s/Allow > from .example.com/Allow from 127.0.0.1/}' /etc/httpd/conf/httpd.conf > > I've attached the full record of all the commands that I used to set up > my nodes to this email. It has, at the end, the final result of "crm > configure show". > > gb From jinzishuai at gmail.com Thu Mar 10 19:58:56 2011 From: jinzishuai at gmail.com (Shi Jin) Date: Thu, 10 Mar 2011 12:58:56 -0700 Subject: [Linux-cluster] What is the proper procedure to reboot a node in a cluster? In-Reply-To: <4D790893.6000107@alteeve.com> References: <4D790893.6000107@alteeve.com> Message-ID: > > > > What I do is migrate any services from the node to the other member, > then stop rgmanager->gfs2->clvmd->cman (obviously adapt to what you are > running). If you have DRBD, then stop it as well. At this point, the > other node should be the only one in the cluster (confirm with > 'cman_tool status'). If all is good, reboot. Once up, rejoin the cluster. > > Thank you. I think what you did makes perfect sense but on the other hand shouldn't the reboot process stop the services in the right order in the first place? Maybe there is a timeout issue or they don't necessarily follow the right order? Do you let the boot system to start the services for you in the right order or you actually have to do it manually? Thanks. Shi -- Shi Jin, Ph.D. -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Thu Mar 10 20:02:28 2011 From: linux at alteeve.com (Digimer) Date: Thu, 10 Mar 2011 15:02:28 -0500 Subject: [Linux-cluster] What is the proper procedure to reboot a node in a cluster? In-Reply-To: References: <4D790893.6000107@alteeve.com> Message-ID: <4D792E54.1030109@alteeve.com> On 03/10/2011 02:58 PM, Shi Jin wrote: > Thank you. > I think what you did makes perfect sense but on the other hand shouldn't > the reboot process stop the services in the right order in the first > place? Maybe there is a timeout issue or they don't necessarily follow > the right order? > > Do you let the boot system to start the services for you in the right > order or you actually have to do it manually? > > Thanks. > > Shi It should, assuming that the KXXfoo entries exist for the cluster services in the right order, and that they all shut down properly. Of course, I find this not always is the case. So for that reason, I like to stop things manually, when I have the luxury. As for starting, it depends. If it's a machine I have ready access to, I generally like to start things manually. Mainly because of how heavily I use DRBD and it's tendency to have issues on startup. Not frequent, but frequent enough. If you do automatically start everything, pay close attention to the start order and do plenty of testing. 
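For what it's worth, the ordering the init scripts will actually use at boot and shutdown can be checked ahead of time rather than discovered the hard way. A rough sketch, assuming cman, clvmd, gfs2 and rgmanager are the services involved:

for svc in cman clvmd gfs2 rgmanager; do
    chkconfig --list "$svc"    # which runlevels start it
done

# The S##/K## prefixes on the symlinks give the real order: a lower S##
# starts earlier at boot, a lower K## is stopped earlier at shutdown.
ls /etc/rc.d/rc3.d/ | grep -Ei 'cman|clvmd|gfs2|rgmanager'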
-- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From alvaro.fernandez at sivsa.com Thu Mar 10 20:41:17 2011 From: alvaro.fernandez at sivsa.com (Alvaro Jose Fernandez) Date: Thu, 10 Mar 2011 21:41:17 +0100 Subject: [Linux-cluster] What is the proper procedure to reboot a node in acluster? References: Message-ID: <607D6181D9919041BE792D70EF2AEC48017568F1@LIMENS.sivsa.int> Hi, Given fencing is properly configured, I think the default boot/sshutdown RHCS scripts should work. I too use two_node (but no clvmd) in RHEL5.5 with latest updates to cman and rgmanager, and a shutdown -r works well (and a shutdown -h too). The other node cluster daemon should log this as a node shutdown in /var/log/messages, and it should adjust quorum, and not trigger a fencing action over the other node. If one halts and poweroff via shutdown -h one of the two nodes, and then reboots (via shutdown -r) the surviving node, the surviving node will fence the other. We have power switch fencing, and it should simply suceed (making a power off then a power on on the other node's outlets). Once this fencing suceeds, the boot sequence continues and the node assumes quorum. If later the other node is powered on, it should join the cluster without problems. alvaro, Hi there, I've setup a two-node cluster with cman, clvmd and gfs2. I don't use qdisk but had I would like to know what is the proper procedure to reboot a node in the two-node cluster (maybe this applies for all size?) when both nodes are functioning fine but I just want to reboot one for some reason (for example, upgrade kernel). Is there a preferred/better way to reboot the machine rather than just running the "reboot" command as root. I have been doing the "reboot" command so far and it sometimes creates problems for us, including making the other node to fail. Thank you very much. Shi -- Shi Jin, Ph.D. -------------- next part -------------- An HTML attachment was scrubbed... URL: From abhishekf2k1 at gmail.com Fri Mar 11 05:28:27 2011 From: abhishekf2k1 at gmail.com (abhishek .) Date: Thu, 10 Mar 2011 21:28:27 -0800 Subject: [Linux-cluster] disable me Message-ID: please dont send me more info about cluster remove my id from mailing list -- abhishek -------------- next part -------------- An HTML attachment was scrubbed... URL: From laszlo.budai at gmail.com Fri Mar 11 09:01:49 2011 From: laszlo.budai at gmail.com (Budai Laszlo) Date: Fri, 11 Mar 2011 11:01:49 +0200 Subject: [Linux-cluster] documentation needed Message-ID: <4D79E4FD.2060305@gmail.com> Hello, can you point me to some documentation of the new cluster architecture available in RHEL6? I'm interested to learn about the internals. 
I'm thinking about documents like the "Cluster2 architecture" (http://people.redhat.com/teigland/cluster2-arch.txt), or "Symmetric Cluster Architecture and Component Technical Specifications" (http://people.redhat.com/teigland/sca.pdf) Thank you, Laszlo From andrew at beekhof.net Fri Mar 11 10:06:22 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Fri, 11 Mar 2011 11:06:22 +0100 Subject: [Linux-cluster] dlm-pcmk-3.0.17-1.fc14.x86_64 and gfs-pcmk-3.0.17-1.fc14.x86_64 woes In-Reply-To: <4D78F3BA.1000904@gmail.com> References: <4D76894B.6010809@gmail.com> <4D7689B9.7070906@redhat.com> <4D77B25D.5010603@gmail.com> <4D77C0DD.4070405@gmail.com> <4D78F3BA.1000904@gmail.com> Message-ID: On Thu, Mar 10, 2011 at 4:52 PM, Gregory Bartholomew wrote: > On 03/10/2011 01:14 AM, Andrew Beekhof wrote: >> >> On Wed, Mar 9, 2011 at 7:03 PM, Gregory Bartholomew >> ?wrote: >>> >>> Never mind, I figured it out ... I needed to install the gfs2-cluster >>> package and start its service and I also had a different name for my >>> cluster >>> in /etc/cluster/cluster.conf than what I was using in my mkfs.gfs2 >>> command. >>> >>> It's all working now. ?Thanks to those who helped me get this going, >> >> So you're still using Pacemaker to mount/unmount the filesystem and >> other services? >> If so, were there any discrepancies in the documentation describing >> how to configure this? > > Good morning, > > This is what I did to get the file system going: Excellent, very pleased to hear you got it working. I'll try and incorporate your feedback into the doc. > ----- > > yum install -y httpd gfs2-cluster gfs2-utils > chkconfig gfs2-cluster on > service gfs2-cluster start > > mkfs.gfs2 -p lock_dlm -j 2 -t siue-cs:iscsi /dev/sda1 > > cat <<-END | crm > configure primitive gfs ocf:heartbeat:Filesystem params device="/dev/sda1" > directory="/var/www/html" fstype="gfs2" op start interval="0" timeout="60s" > op stop interval="0" timeout="60s" > configure clone dual-gfs gfs > END > > ----- > > I think this sed command was also missing from the guide: > > sed -i '/^#/,/#<\/Location>/{s/^#//;s/Allow from > .example.com/Allow from 127.0.0.1/}' /etc/httpd/conf/httpd.conf What on earth does that do? :-) > > I've attached the full record of all the commands that I used to set up my > nodes to this email. ?It has, at the end, the final result of "crm configure > show". > > gb > From jinzishuai at gmail.com Fri Mar 11 16:28:05 2011 From: jinzishuai at gmail.com (Shi Jin) Date: Fri, 11 Mar 2011 09:28:05 -0700 Subject: [Linux-cluster] What is the proper procedure to reboot a node in acluster? In-Reply-To: <607D6181D9919041BE792D70EF2AEC48017568F1@LIMENS.sivsa.int> References: <607D6181D9919041BE792D70EF2AEC48017568F1@LIMENS.sivsa.int> Message-ID: Thank you all. The problem I have is that I don't seem to be able to get out of the cluster gracefully, even if I stop the services manually in the right order. For example, I joined the cluster manually by starting cman, clvmd and gfs2 in the order and everything is working just fine. Then I wanted to reboot. This time, I want to do it manually so I went to stop the services in order. [root at test2 ~]# service gfs2 stop Unmounting GFS2 filesystem (/vrstorm): [ OK ] [root at test2 ~]# service clvmd stop Signaling clvmd to exit [ OK ] Waiting for clvmd to exit: [FAILED] clvmd failed to exit [FAILED] Somehow clvmd cannot be stopped. I still have the process running root 2646 0.0 0.5 194476 45016 ? SLsl 02:18 0:00 clvmd -T30 How do I stop clvmd gracefully? I am running RHEL-6. 
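For reference, one thing that can keep clvmd from exiting is a clustered volume group that still has logical volumes active on the node. A rough sketch of that check (the volume group name is a placeholder, and whether it applies depends on the setup):

vgs -o vg_name,vg_attr          # a 'c' in the 6th attribute character marks a clustered VG
lvs -o lv_name,vg_name,lv_attr

vgchange -aln vg_shared         # deactivate the clustered VG locally first
service clvmd stop
dlm_tool ls                     # the clvmd lockspace should be gone once it exits cleanly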
[root at test2 ~]# uname -a Linux test2 2.6.32-71.18.2.el6.x86_64 #1 SMP Wed Mar 2 14:17:40 EST 2011 x86_64 x86_64 x86_64 GNU/Linux [root at test2 ~]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 6.0 (Santiago) Thank you very much. Shi On Thu, Mar 10, 2011 at 1:41 PM, Alvaro Jose Fernandez < alvaro.fernandez at sivsa.com> wrote: > Hi, > > > > Given fencing is properly configured, I think the default boot/sshutdown > RHCS scripts should work. I too use two_node (but no clvmd) in RHEL5.5 with > latest updates to cman and rgmanager, and a shutdown -r works well (and a > shutdown -h too). The other node cluster daemon should log this as a node > shutdown in /var/log/messages, and it should adjust quorum, and not trigger > a fencing action over the other node. > > > > If one halts and poweroff via shutdown -h one of the two nodes, and then > reboots (via shutdown -r) the surviving node, the surviving node will fence > the other. We have power switch fencing, and it should simply suceed (making > a power off then a power on on the other node's outlets). Once this fencing > suceeds, the boot sequence continues and the node assumes quorum. > > > > If later the other node is powered on, it should join the cluster without > problems. > > > > alvaro, > > > > Hi there, > > > > I've setup a two-node cluster with cman, clvmd and gfs2. I don't use qdisk > but had > > > > > > I would like to know what is the proper procedure to reboot a node in the > two-node cluster (maybe this applies for all size?) when both nodes are > functioning fine but I just want to reboot one for some reason (for example, > upgrade kernel). Is there a preferred/better way to reboot the machine > rather than just running the "reboot" command as root. I have been doing the > "reboot" command so far and it sometimes creates problems for us, including > making the other node to fail. > > > > Thank you very much. > Shi > -- > Shi Jin, Ph.D. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Shi Jin, Ph.D. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jinzishuai at gmail.com Fri Mar 11 16:42:00 2011 From: jinzishuai at gmail.com (Shi Jin) Date: Fri, 11 Mar 2011 09:42:00 -0700 Subject: [Linux-cluster] What is the proper procedure to reboot a node in acluster? In-Reply-To: References: <607D6181D9919041BE792D70EF2AEC48017568F1@LIMENS.sivsa.int> Message-ID: To follow up, I couldn't manually leave by dlm_tool [root at test2 log]# dlm_tool leave clvmd Leaving lockspace "clvmd" dlm_open_lockspace clvmd error (nil) 2 [root at test2 log]# dlm_tool ls dlm lockspaces name clvmd id 0x4104eefa flags 0x00000002 leave change member 2 joined 1 remove 0 failed 0 seq 1,1 members 1 2 Thanks. Shi On Fri, Mar 11, 2011 at 9:28 AM, Shi Jin wrote: > Thank you all. > The problem I have is that I don't seem to be able to get out of the > cluster gracefully, even if I stop the services manually in the right order. > For example, I joined the cluster manually by starting cman, clvmd and gfs2 > in the order and everything is working just fine. > > Then I wanted to reboot. This time, I want to do it manually so I went to > stop the services in order. > [root at test2 ~]# service gfs2 stop > Unmounting GFS2 filesystem (/vrstorm): [ OK ] > [root at test2 ~]# service clvmd stop > Signaling clvmd to exit [ OK ] > Waiting for clvmd to exit: [FAILED] > clvmd failed to exit [FAILED] > > Somehow clvmd cannot be stopped. 
I still have the process running > root 2646 0.0 0.5 194476 45016 ? SLsl 02:18 0:00 clvmd -T30 > > How do I stop clvmd gracefully? I am running RHEL-6. > [root at test2 ~]# uname -a > Linux test2 2.6.32-71.18.2.el6.x86_64 #1 SMP Wed Mar 2 14:17:40 EST 2011 > x86_64 x86_64 x86_64 GNU/Linux > [root at test2 ~]# cat /etc/redhat-release > Red Hat Enterprise Linux Server release 6.0 (Santiago) > > > Thank you very much. > > Shi > > > > On Thu, Mar 10, 2011 at 1:41 PM, Alvaro Jose Fernandez < > alvaro.fernandez at sivsa.com> wrote: > >> Hi, >> >> >> >> Given fencing is properly configured, I think the default boot/sshutdown >> RHCS scripts should work. I too use two_node (but no clvmd) in RHEL5.5 with >> latest updates to cman and rgmanager, and a shutdown -r works well (and a >> shutdown -h too). The other node cluster daemon should log this as a node >> shutdown in /var/log/messages, and it should adjust quorum, and not trigger >> a fencing action over the other node. >> >> >> >> If one halts and poweroff via shutdown -h one of the two nodes, and then >> reboots (via shutdown -r) the surviving node, the surviving node will fence >> the other. We have power switch fencing, and it should simply suceed (making >> a power off then a power on on the other node's outlets). Once this fencing >> suceeds, the boot sequence continues and the node assumes quorum. >> >> >> >> If later the other node is powered on, it should join the cluster without >> problems. >> >> >> >> alvaro, >> >> >> >> Hi there, >> >> >> >> I've setup a two-node cluster with cman, clvmd and gfs2. I don't use qdisk >> but had >> >> >> >> >> >> I would like to know what is the proper procedure to reboot a node in the >> two-node cluster (maybe this applies for all size?) when both nodes are >> functioning fine but I just want to reboot one for some reason (for example, >> upgrade kernel). Is there a preferred/better way to reboot the machine >> rather than just running the "reboot" command as root. I have been doing the >> "reboot" command so far and it sometimes creates problems for us, including >> making the other node to fail. >> >> >> >> Thank you very much. >> Shi >> -- >> Shi Jin, Ph.D. >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Shi Jin, Ph.D. > > -- Shi Jin, Ph.D. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajb2 at mssl.ucl.ac.uk Fri Mar 11 23:57:41 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Fri, 11 Mar 2011 23:57:41 +0000 Subject: [Linux-cluster] clvmd hangs on startup In-Reply-To: <20110308171153.GB272@bsdera.pcbi.upenn.edu> References: <20110302215050.GD10674@bsdera.pcbi.upenn.edu> <64D0546C5EBBD147B75DE133D798665F0855C290@hugo.eprize.local> <20110303165056.GF10674@bsdera.pcbi.upenn.edu> <20110308171153.GB272@bsdera.pcbi.upenn.edu> Message-ID: <4D7AB6F5.6070107@mssl.ucl.ac.uk> On 08/03/11 17:11, Valeriu Mutu wrote: > Hi, > > I think the problem is solved. I was using a 9000bytes MTU on the Xen virtual machines' iSCSI interface. Switching back to 1500bytes MTU caused the clvmd to start working. As long as everything on the network is 9000bytes then you should be ok. RH's linux implementation doesn't seem to allow path MTU discovery on local network. From ajb2 at mssl.ucl.ac.uk Sat Mar 12 00:06:47 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Sat, 12 Mar 2011 00:06:47 +0000 Subject: [Linux-cluster] which is better gfs2 and ocfs2? 
In-Reply-To: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> References: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> Message-ID: <4D7AB917.80103@mssl.ucl.ac.uk> On 09/03/11 14:13, yue wrote: > which is better gfs2 and ocfs2? > i want to share fc-san, do you know which is better? "that depends" - it is highly dependent on the type of disk activity you are performing. There are various reviews of both FSes circulating. Personal observation: GFS and GFS2 currently have utterly rotten performance for activities involving many small files, such as NFS exporting /home via NFS sync mounts. They also fails miserably if there are a lot of files in a single directory (more than 5-700, with things getting unusable beyond about 1500 files) I have not used OCFS2 in production environments, so I cannot comment on its performance in these scenarios. From ajb2 at mssl.ucl.ac.uk Sat Mar 12 00:15:16 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Sat, 12 Mar 2011 00:15:16 +0000 Subject: [Linux-cluster] What is the proper procedure to reboot a node in a cluster? In-Reply-To: References: Message-ID: <4D7ABB14.2090201@mssl.ucl.ac.uk> The only reliable way I have found (rhel4 and 5) is this: 1: Migrate all services off the node. 2: Unmount as many GFS disks as possible. 3: Power cycle the node. The other nodes will recover quickly. "cman leave (remove) (force)" sometimes works but often doesn't. From jeff.sturm at eprize.com Sat Mar 12 17:46:15 2011 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Sat, 12 Mar 2011 12:46:15 -0500 Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <4D7AB917.80103@mssl.ucl.ac.uk> References: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> <4D7AB917.80103@mssl.ucl.ac.uk> Message-ID: <64D0546C5EBBD147B75DE133D798665F0855C3A5@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Alan Brown > Sent: Friday, March 11, 2011 7:07 PM > > Personal observation: GFS and GFS2 currently have utterly rotten performance for > activities involving many small files, such as NFS exporting /home via NFS sync > mounts. They also fails miserably if there are a lot of files in a single directory (more > than 5-700, with things getting unusable beyond about 1500 files) While I certainly agree there are common scenarios in which GFS performs slowly (backup by rsync is one), your characterization of GFS performance within large directories isn't completely fair. Here's a test I just ran on a cluster node, immediately after rebooting, joining the cluster and mounting a GFS filesystem: [root at cluster1 76]# time ls 00076985.ts 28d80a9c.ts 52b778d2.ts 7f50762b.ts a9c5f908.ts d39d0032.ts 00917c3e.ts 28de643b.ts 532d3fd7.ts 7f5dea46.ts a9e0328b.ts d3bcc9fb.ts ... 289d2764.ts 527b6f37.ts 7f3e5c9a.ts a989df77.ts d36c57fc.ts 28c3aa38.ts 52ab865f.ts 7f3e9278.ts a9aa3dba.ts d392d793.ts real 0m0.034s user 0m0.008s sys 0m0.004s [root at cluster1 76]# ls | wc -l 1970 The key is that only a few locks are needed to list the directory: [root at cluster1 76]# gfs_tool counters /tb2 locks 32 locks held 25 Running "ls -l" on the same directory takes a bit longer (by a factor of about 20): [root at cluster1 76]# time ls -l total 1970 -rw-r----- 1 root root 42 Mar 2 12:01 00076985.ts -rw-r----- 1 root root 42 Mar 2 12:01 00917c3e.ts -rw-r----- 1 root root 42 Mar 2 12:01 00b60c66.ts ... 
-rw-r----- 1 root root 42 Mar 2 12:01 ffc02edd.ts -rw-r----- 1 root root 42 Mar 2 12:01 ffefd00a.ts -rw-r----- 1 root root 42 Mar 2 12:01 fff80ff6.ts real 0m0.641s user 0m0.032s sys 0m0.032s presumably because it has to acquire quite a few additional locks: [root at cluster1 76]# gfs_tool counters /tb2 locks 3972 locks held 3965 For better or worse, "ls -l" (or equivalently, the aliased "ls --color=tty" for Red Hat users) is a very common operation for interactive users, and such users often have an immediate negative reaction to using GFS as a consequence. In my personal opinion: - Decades of work on Linux have optimized local filesystem performance and system call performance to the point that system call overhead is often treated as negligible for most applications. Running "ls -l" within a large directory is a slow, expensive operations on any system, but if it "feels" fast enough (in terms of wall clock time, not compute cycles) there's little incentive to optimize it further. I find this is true of software applications as well. It's shocking to me how many unnecessary system calls our own applications make, often as a result of libraries such as glibc. - Cluster filesystems require a lot of network communication to maintain perfect consistency. The network protocols used by (e.g.) DLM to maintain this consistency are probably slower than the methods of maintaining memory cache consistency on a SMP system by several orders of magnitude. It follows that assumptions about stat() performance on a local filesystem do not necessarily hold on a clustered filesystem, and application performance can suffer as a result. - Overcoming this may involve significant changes to the Linux system call interface (assuming there won't be a hardware solution anytime soon). For example, relying on the traditional stat() interface for file metadata limits us to one file per system call. In the case of a clustered filesystem, stat() often triggers a synchronous network round-trip via the locking protocol. A theoretical stat() interface that supports looking up multiple files at once would be an improvement, but is relatively difficult to implement because it would entail changing the system kernel, libraries, and application software. - Ethernet is a terrible medium for a distributed locking protocol. Ethernet is well suited for applications needing high bandwidth that are not particularly sensitive to latency. DLM doesn't need lots of bandwidth, but is very sensitive to latency. There exists better hardware for this (e.g. http://www.dolphinics.com/products/pemb-sci-d352.html) than Ethernet, but alas Ethernet is ubiquitous and little work has been done in the cluster community to support alternative hardware as far as I am aware. As an example, while running a "du" command on my GFS mount point, I observed the Ethernet traffic peak: 12:20:33 PM IFACE rxpck/s txpck/s rxbyt/s txbyt/s rxcmp/s txcmp/s rxmcst/s 12:20:38 PM eth0 3517.60 3520.60 545194.80 631191.20 0.00 0.00 0.00 So a few thousand packets per second is the best this cluster node could muster. Average packet sizes are less than 200 bytes each way. I'm sure I could bring in my network experts and improve these results somewhat, maybe with hardware that supports TCP offloading, but you'd never improve this by more than perhaps an order of magnitude because you're hitting the limits of what Ethernet hardware can do. 
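A quick way to see the one-system-call-per-file behaviour described above (a sketch only; the directory path is a placeholder and the exact syscall name can vary by platform and coreutils version):

strace -c -e trace=lstat ls -l /path/to/large/gfs/dir > /dev/null

The call count in strace's summary roughly tracks the number of directory entries, and on GFS each of those stats can mean another cluster lock, which is where the synchronous DLM round-trips come from.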
In summary, the state of the art in Linux clustered filesystems is unlikely to change much until we change the way we write software applications to optimize system call usage, or redesign the system call interface to take better advantage of distributed locking protocols, or start using new hardware that provides for distributed shared memory much more efficiently than Ethernet is capable of. Until any of those things happen, many users are bound to be unimpressed with GFS and similar clustered filesystems, relegating these to remain a niche technology. -Jeff From ajb2 at mssl.ucl.ac.uk Sat Mar 12 21:21:45 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Sat, 12 Mar 2011 21:21:45 +0000 Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <64D0546C5EBBD147B75DE133D798665F0855C3A5@hugo.eprize.local> References: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> <4D7AB917.80103@mssl.ucl.ac.uk> <64D0546C5EBBD147B75DE133D798665F0855C3A5@hugo.eprize.local> Message-ID: <4D7BE3E9.2050101@mssl.ucl.ac.uk> On 12/03/11 17:46, Jeff Sturm wrote: > [root at cluster1 76]# ls | wc -l > 1970 > > The key is that only a few locks are needed to list the directory: > You assume NFS clients are simply using "ls" > Running "ls -l" on the same directory takes a bit longer (by a factor of > about 20): > Or more. Try it with 256, 512, 1024 and 4096 files in the directory Then try it with 16k files, 32k, 64k and 128k Yes, users do have directories this large. > For better or worse, "ls -l" (or equivalently, the aliased "ls > --color=tty" for Red Hat users) is a very common operation for > interactive users, and such users often have an immediate negative > reaction to using GFS as a consequence. Those users are paying for GFS installations. They have every right to criticize its shockingly poor performance for these operations, especially when it adversely impacts their ability to get work done. In addition the same problem appears every time a backup is run - even incrementals need to stat each file in order to find out what's changed. Having a 2million file filesystem take 28 hours to run an incremental vs 10 minutes for the same thing on ext3/4 doesn't go down at all well. What you've said is right, but also comes across to the average academic as condescending - which is a fast way of further alienating them. As far as most users are concerned, a computer is a black box. You put files in, you get files out. If it's shockingly slow it's _not_ their problem, it's the problem of whoever installed it - and it doesn't help that GFS has been sold as production-ready when it's only useful in a limited range of filesystem activities. AB From ajb2 at mssl.ucl.ac.uk Sat Mar 12 22:45:25 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Sat, 12 Mar 2011 22:45:25 +0000 Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <64D0546C5EBBD147B75DE133D798665F0855C3A5@hugo.eprize.local> References: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> <4D7AB917.80103@mssl.ucl.ac.uk> <64D0546C5EBBD147B75DE133D798665F0855C3A5@hugo.eprize.local> Message-ID: <4D7BF785.6070006@mssl.ucl.ac.uk> I missed somthing: On 12/03/11 17:46, Jeff Sturm wrote: > As an example, while running a "du" command on my GFS mount point, I > observed the Ethernet traffic peak: > > 12:20:33 PM IFACE rxpck/s txpck/s rxbyt/s txbyt/s > rxcmp/s txcmp/s rxmcst/s > 12:20:38 PM eth0 3517.60 3520.60 545194.80 631191.20 > 0.00 0.00 0.00 Mount the GFS filesystem on one node only, lock_dlm and repeat all the tests. 
Observe the network traffic. The latencies aren't in the Ethernet layer (at the moment) AB From yvette at dbtgroup.com Sat Mar 12 23:00:41 2011 From: yvette at dbtgroup.com (yvette hirth) Date: Sat, 12 Mar 2011 23:00:41 +0000 Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <4D7BE3E9.2050101@mssl.ucl.ac.uk> References: <4f996c7c.1356a.12e9af733aa.Coremail.ooolinux@163.com> <4D7AB917.80103@mssl.ucl.ac.uk> <64D0546C5EBBD147B75DE133D798665F0855C3A5@hugo.eprize.local> <4D7BE3E9.2050101@mssl.ucl.ac.uk> Message-ID: <4D7BFB19.7020301@dbtgroup.com> Alan Brown wrote: > Those users are paying for GFS installations. oh? i've got the full cluster suite running here, from CentOS. i don't remember receiving a bill... > In addition the same problem appears every time a backup is run - even > incrementals need to stat each file in order to find out what's changed. > Having a 2million file filesystem take 28 hours to run an incremental vs > 10 minutes for the same thing on ext3/4 doesn't go down at all well. if you have 2million files on one filesystem, methinks that GFS et al are doing the best that they can. perhaps GFS is not the real issue... we had issues with GFS; we flattened the big directories, and now things run much smoother. slower than extX, and much slower than XFS, but since we can backup two machines to the same filesystem concurrently, we're not complaining... > What you've said is right, but also comes across to the average academic > as condescending - which is a fast way of further alienating them. "There is no offense where none is taken." --old Vulcan sayinig > As far as most users are concerned, a computer is a black box. You put > files in, you get files out. If it's shockingly slow it's _not_ their > problem, it's the problem of whoever installed it - and it doesn't help > that GFS has been sold as production-ready when it's only useful in a > limited range of filesystem activities. while we have found that GFS is indeed production ready, one doesn't use a moving van to participate in the Indy 500. caveat emptor. yvette From rpeterso at redhat.com Sat Mar 12 23:13:05 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Sat, 12 Mar 2011 18:13:05 -0500 (EST) Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <4D7BE3E9.2050101@mssl.ucl.ac.uk> Message-ID: <368243113.417601.1299971585048.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | Those users are paying for GFS installations. They have every right to | criticize its shockingly poor performance for these operations, | especially when it adversely impacts their ability to get work done. Hi, Agreed. We're abundantly aware of the performance problems, and we're not ignoring them. People, please bear in mind that Red Hat is also working diligently to improve all aspects of gfs2 performance, and we've made great strides. Cases in point: (1) We recently found and fixed a problem that caused the dlm to pass locking traffic much slower than possible. (2) We recently increased the speed and accuracy of fsck.gfs2 quite a bit. (3) We also recently developed a patch that improves GFS2's management of cluster locks by making hold times self-tuning. This makes gfs2 perform much faster in many situations. (4) We've recently developed another performance patch that sped up clustered deletes (unlinks) as much as 25%. (5) We recently identified and fixed a performance problem related to writing large files that sped things up considerably. 
These patches are in various stages of development, and most or all have already been posted to the public cluster-devel mailing list, of various records in bugzilla, which means they're making their way to a kernel (or user-space) near you. Our work continues; we're improving it every day and have more performance improvements planned. I don't know about ocfs2, but there's a whole team of people at Red Hat plus the open source community at large working to improve gfs2. Regards, Bob Peterson Red Hat File Systems From ajb2 at mssl.ucl.ac.uk Sun Mar 13 04:48:17 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Sun, 13 Mar 2011 04:48:17 +0000 Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <368243113.417601.1299971585048.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <368243113.417601.1299971585048.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <4D7C4C91.801@mssl.ucl.ac.uk> On 12/03/11 23:13, Bob Peterson wrote: > Agreed. We're abundantly aware of the performance problems, > and we're not ignoring them. I know Bob, thanks. > (1) We recently found and fixed a problem that caused the > dlm to pass locking traffic much slower than possible. Is this rolled into 2.6.18-238.5.1.el5 ? > (2) We recently increased the speed and accuracy of fsck.gfs2 > quite a bit. Noted and appreciated. I had cause to use them a few days ago. > (3) We also recently developed a patch that improves GFS2's > management of cluster locks by making hold times self-tuning. > This makes gfs2 perform much faster in many situations. Great > (4) We've recently developed another performance patch that > sped up clustered deletes (unlinks) as much as 25%. Good. This has been a real cow but at least for this kind of thing users simply tend to go for lunch and let it run. > (5) We recently identified and fixed a performance problem > related to writing large files that sped things up considerably. See question 1 :) Can I get hotfixes if possible? (el5.6 x64) AB From parvez.h.shaikh at gmail.com Sun Mar 13 07:19:42 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Sun, 13 Mar 2011 12:49:42 +0530 Subject: [Linux-cluster] Two node cluster - a potential problem of node fencing each other? Message-ID: Hi all, I have a question pertaining to two node cluster, I have RHEL 5.5 and cluster along with it which at least should have two nodes. In a situation where both nodes of the cluster are up, and have reliable connection to fencing device (e.g. power switch OR any other power fencing device) and heartbeat link between two nodes goes down. Each node finds another node is down (because heartbeat IP becomes unreachable) and tries to fence each other. Is this situation possible? If so, can two nodes possibly fence (in short shutdown or reboot) each other? Is there anyway out of this situation? Thanks Parvez -------------- next part -------------- An HTML attachment was scrubbed... URL: From cthulhucalling at gmail.com Sun Mar 13 07:49:02 2011 From: cthulhucalling at gmail.com (Ian Hayes) Date: Sat, 12 Mar 2011 23:49:02 -0800 Subject: [Linux-cluster] Two node cluster - a potential problem of node fencing each other? In-Reply-To: References: Message-ID: On Sat, Mar 12, 2011 at 11:19 PM, Parvez Shaikh wrote: > Hi all, > > I have a question pertaining to two node cluster, I have RHEL 5.5 and > cluster along with it which at least should have two nodes. > > In a situation where both nodes of the cluster are up, and have reliable > connection to fencing device (e.g. 
power switch OR any other power fencing > device) and heartbeat link between two nodes goes down. > > Each node finds another node is down (because heartbeat IP becomes > unreachable) and tries to fence each other. > > Is this situation possible? If so, can two nodes possibly fence (in short > shutdown or reboot) each other? Is there anyway out of this situation? > This is a fairly common problem called "split brain". The two nodes will go into a shootout, fencing each other. There are a few ways to prevent this, such as redundant network links and the use of quorum disks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ooolinux at 163.com Sun Mar 13 13:37:37 2011 From: ooolinux at 163.com (yue) Date: Sun, 13 Mar 2011 21:37:37 +0800 (CST) Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <4D7C4C91.801@mssl.ucl.ac.uk> References: <4D7C4C91.801@mssl.ucl.ac.uk> <368243113.417601.1299971585048.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: 1.i need gfs2 or ocfs2 to store xen-disk image file(20G--100G),it is big file. the underlying storage is fc-san. both of them have cluster sence.so they fit for me. if gfs2 is ready for product? anyone use gfs2 in product? stability is the most important thing. 2.i have try gfs2 and ocfs2 , iozone shows , gfs2 has a good throughput when record>=512k and file size > 4G. 3.my kernel is 2.6.32 and latest. At 2011-03-13 12:48:17?"Alan Brown" wrote: >On 12/03/11 23:13, Bob Peterson wrote: >> Agreed. We're abundantly aware of the performance problems, >> and we're not ignoring them. > >I know Bob, thanks. > >> (1) We recently found and fixed a problem that caused the >> dlm to pass locking traffic much slower than possible. > >Is this rolled into 2.6.18-238.5.1.el5 ? > >> (2) We recently increased the speed and accuracy of fsck.gfs2 >> quite a bit. > >Noted and appreciated. I had cause to use them a few days ago. > >> (3) We also recently developed a patch that improves GFS2's >> management of cluster locks by making hold times self-tuning. >> This makes gfs2 perform much faster in many situations. > >Great > >> (4) We've recently developed another performance patch that >> sped up clustered deletes (unlinks) as much as 25%. > >Good. This has been a real cow but at least for this kind of thing users >simply tend to go for lunch and let it run. > >> (5) We recently identified and fixed a performance problem >> related to writing large files that sped things up considerably. > See question 1 :) > >Can I get hotfixes if possible? (el5.6 x64) > >AB > > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From parvez.h.shaikh at gmail.com Sun Mar 13 13:57:46 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Sun, 13 Mar 2011 19:27:46 +0530 Subject: [Linux-cluster] Two node cluster - a potential problem of node fencing each other? In-Reply-To: References: Message-ID: redundant network link - i trust you were referring to ethernet bonding. On Sun, Mar 13, 2011 at 1:19 PM, Ian Hayes wrote: > On Sat, Mar 12, 2011 at 11:19 PM, Parvez Shaikh > wrote: > >> Hi all, >> >> I have a question pertaining to two node cluster, I have RHEL 5.5 and >> cluster along with it which at least should have two nodes. >> >> In a situation where both nodes of the cluster are up, and have reliable >> connection to fencing device (e.g. 
power switch OR any other power fencing >> device) and heartbeat link between two nodes goes down. >> >> Each node finds another node is down (because heartbeat IP becomes >> unreachable) and tries to fence each other. >> >> Is this situation possible? If so, can two nodes possibly fence (in short >> shutdown or reboot) each other? Is there anyway out of this situation? >> > > This is a fairly common problem called "split brain". The two nodes will go > into a shootout, fencing each other. There are a few ways to prevent this, > such as redundant network links and the use of quorum disks. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomas at sjolshagen.net Sun Mar 13 16:21:40 2011 From: thomas at sjolshagen.net (Thomas Sjolshagen) Date: Sun, 13 Mar 2011 12:21:40 -0400 Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: References: <4D7C4C91.801@mssl.ucl.ac.uk> <368243113.417601.1299971585048.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <257E2D73-391F-4FA0-84FE-8E5A97F8CD82@sjolshagen.net> I'm using gfs2 to host KVM vm image files for a pair of clustered hosts. Am using iscsi targets for the vm data devices however as they are hosting imap spools. No stability issues or performance problems I can readily or easily attribute to the gfs2 FS in my use case. // Thomas On Mar 13, 2011, at 9:37 AM, yue wrote: > 1.i need gfs2 or ocfs2 to store xen-disk image file(20G--100G),it is big file. the underlying storage is fc-san. both of them have cluster sence.so they fit for me. > if gfs2 is ready for product? anyone use gfs2 in product? stability is the most important thing. > 2.i have try gfs2 and ocfs2 , iozone shows , gfs2 has a good throughput when record>=512k and file size > 4G. > 3.my kernel is 2.6.32 and latest. > > > At 2011-03-13 12:48:17?"Alan Brown" wrote: > > >On 12/03/11 23:13, Bob Peterson wrote: > >> Agreed. We're abundantly aware of the performance problems, > >> and we're not ignoring them. > > > >I know Bob, thanks. > > > >> (1) We recently found and fixed a problem that caused the > >> dlm to pass locking traffic much slower than possible. > > > >Is this rolled into 2.6.18-238.5.1.el5 ? > > > >> (2) We recently increased the speed and accuracy of fsck.gfs2 > >> quite a bit. > > > >Noted and appreciated. I had cause to use them a few days ago. > > > >> (3) We also recently developed a patch that improves GFS2's > >> management of cluster locks by making hold times self-tuning. > >> This makes gfs2 perform much faster in many situations. > > > >Great > > > >> (4) We've recently developed another performance patch that > >> sped up clustered deletes (unlinks) as much as 25%. > > > >Good. This has been a real cow but at least for this kind of thing users > >simply tend to go for lunch and let it run. > > > >> (5) We recently identified and fixed a performance problem > >> related to writing large files that sped things up considerably. > > See question 1 :) > > > >Can I get hotfixes if possible? 
(el5.6 x64) > > > >AB > > > > > > > >-- > >Linux-cluster mailing list > >Linux-cluster at redhat.com > >https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ooolinux at 163.com Mon Mar 14 04:43:59 2011 From: ooolinux at 163.com (yue) Date: Mon, 14 Mar 2011 12:43:59 +0800 (CST) Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <257E2D73-391F-4FA0-84FE-8E5A97F8CD82@sjolshagen.net> References: <257E2D73-391F-4FA0-84FE-8E5A97F8CD82@sjolshagen.net> <4D7C4C91.801@mssl.ucl.ac.uk> <368243113.417601.1299971585048.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <363e5b.9647.12eb2ad844f.Coremail.ooolinux@163.com> 1.thanks, i have 20-100 nodes. anyone knows how citrix does ? At 2011-03-14 00:21:40?"Thomas Sjolshagen" wrote: >I'm using gfs2 to host KVM vm image files for a pair of clustered hosts. Am using iscsi targets for the vm data devices however as they are hosting imap spools. No stability issues or performance problems I can readily or easily attribute to the gfs2 FS in my use case. > >// Thomas > >On Mar 13, 2011, at 9:37 AM, yue wrote: > >> 1.i need gfs2 or ocfs2 to store xen-disk image file(20G--100G),it is big file. the underlying storage is fc-san. both of them have cluster sence.so they fit for me. >> if gfs2 is ready for product? anyone use gfs2 in product? stability is the most important thing. >> 2.i have try gfs2 and ocfs2 , iozone shows , gfs2 has a good throughput when record>=512k and file size > 4G. >> 3.my kernel is 2.6.32 and latest. >> >> >> At 2011-03-13 12:48:17?"Alan Brown" wrote: >> >> >On 12/03/11 23:13, Bob Peterson wrote: >> >> Agreed. We're abundantly aware of the performance problems, >> >> and we're not ignoring them. >> > >> >I know Bob, thanks. >> > >> >> (1) We recently found and fixed a problem that caused the >> >> dlm to pass locking traffic much slower than possible. >> > >> >Is this rolled into 2.6.18-238.5.1.el5 ? >> > >> >> (2) We recently increased the speed and accuracy of fsck.gfs2 >> >> quite a bit. >> > >> >Noted and appreciated. I had cause to use them a few days ago. >> > >> >> (3) We also recently developed a patch that improves GFS2's >> >> management of cluster locks by making hold times self-tuning. >> >> This makes gfs2 perform much faster in many situations. >> > >> >Great >> > >> >> (4) We've recently developed another performance patch that >> >> sped up clustered deletes (unlinks) as much as 25%. >> > >> >Good. This has been a real cow but at least for this kind of thing users >> >simply tend to go for lunch and let it run. >> > >> >> (5) We recently identified and fixed a performance problem >> >> related to writing large files that sped things up considerably. >> > See question 1 :) >> > >> >Can I get hotfixes if possible? (el5.6 x64) >> > >> >AB >> > >> > >> > >> >-- >> >Linux-cluster mailing list >> >Linux-cluster at redhat.com >> >https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From raju.rajsand at gmail.com Mon Mar 14 09:50:13 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Mon, 14 Mar 2011 15:20:13 +0530 Subject: [Linux-cluster] Two node cluster - a potential problem of node fencing each other? In-Reply-To: References: Message-ID: Greetings, On Sun, Mar 13, 2011 at 7:27 PM, Parvez Shaikh wrote: > redundant network link - i trust you were referring to ethernet bonding. > >> >> This is a fairly common problem called "split brain". The two nodes will >> go into a shootout, fencing each other. There are a few ways to prevent >> this, such as redundant network links and the use of quorum disks. >> >> No .it is not bonding It is another IP address accessible to each node of the cluster (perhaps the gateway?-- can anybody expand on this a bit) and Quorum disk is another LUN, say about 100mb per node, accessible to the cluster (IOW, external storage) Regards, Rajagopal From rpeterso at redhat.com Mon Mar 14 13:16:15 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 14 Mar 2011 09:16:15 -0400 (EDT) Subject: [Linux-cluster] which is better gfs2 and ocfs2? In-Reply-To: <4D7C4C91.801@mssl.ucl.ac.uk> Message-ID: <350344129.425432.1300108575392.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | > (1) We recently found and fixed a problem that caused the | > dlm to pass locking traffic much slower than possible. | | Is this rolled into 2.6.18-238.5.1.el5 ? Yes, it was added starting with 2.6.18-232 | > (5) We recently identified and fixed a performance problem | > related to writing large files that sped things up | > considerably. This one is still in patch form. Some of our customers are testing it in production now, but it hasn't made its way to any official kernels yet. | | Can I get hotfixes if possible? (el5.6 x64) | | AB If you're a Red Hat customer you should contact our support people. We don't have kernels built for other distros, but as I said, the patches are all posted in various places. The first place to look is the cluster-devel mailing list. The archives are here: https://www.redhat.com/archives/cluster-devel/ The clustered unlink patch is here: https://www.redhat.com/archives/cluster-devel/2011-February/msg00059.html The self-tuning glocks patch is here: https://www.redhat.com/archives/cluster-devel/2011-January/msg00079.html The "large file" slowdown patch only affects RHEL5, so the upstream code and RHEL6 don't have that problem. The rhel5 patch is attached to this bugzilla bug (not sure if it's public or private): https://bugzilla.redhat.com/show_bug.cgi?id=683155 And as for fsck.gfs2, I think the performance patches are planned for RHEL5.7. Regards, Bob Peterson Red Hat File Systems From iarlyy at gmail.com Mon Mar 14 13:50:29 2011 From: iarlyy at gmail.com (iarly selbir) Date: Mon, 14 Mar 2011 10:50:29 -0300 Subject: [Linux-cluster] time of left/join member Message-ID: I was checking my two node cluster, I noticed that one node is down, my question is how to find out when this node left the cluster? assuming that services already running on the other node I can't see why this node "maybe" was fenced and powered off. /var/log/messages was not clear enough to me, my only information I only got the time when machine was powered off in lastlog. Thanks in advance for any suggestion. - - iarlyy selbir :wq! -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From linux at alteeve.com Mon Mar 14 13:57:03 2011 From: linux at alteeve.com (Digimer) Date: Mon, 14 Mar 2011 09:57:03 -0400 Subject: [Linux-cluster] time of left/join member In-Reply-To: References: Message-ID: <4D7E1EAF.50205@alteeve.com> On 03/14/2011 09:50 AM, iarly selbir wrote: > I was checking my two node cluster, I noticed that one node is down, my > question is how to find out when this node left the cluster? assuming > that services already running on the other node I can't see why this > node "maybe" was fenced and powered off. > > /var/log/messages was not clear enough to me, my only information I only > got the time when machine was powered off in lastlog. > > Thanks in advance for any suggestion. The surviving node's /var/log/messages should contain mention of when it lost contact with the node and reformed the cluster. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From thiagoh at digirati.com.br Mon Mar 14 14:03:06 2011 From: thiagoh at digirati.com.br (Thiago Henrique) Date: Mon, 14 Mar 2011 11:03:06 -0300 Subject: [Linux-cluster] Two node cluster benchmark Message-ID: <1300111386.2148.35.camel@thiagohenrique06> Hello, I have a two node cluster configured like: Ubuntu 10.04 + CMAN + DRBD + GFS2 In a benchmark, I run simultaneously on both nodes, a script that make write operations in the filesystem until it fills. But when I run the benchmark, foo-node remains almost the whole time waiting for bar-node write to the file system. Is this normal? How could I optimize this? Cman config: ################################################################################ ################################################################################ GFS2 config: ################################################################################ mkfs.gfs2 -p lock_dlm -t MyCluster:MyFileSystem -j 2 /dev/drbd2 mount.gfs2 /dev/drbd2 /var/fs_tests/gfs2/ -o noatime ################################################################################ Thank you in advance Best regards -- Thiago Henrique From ajb2 at mssl.ucl.ac.uk Mon Mar 14 14:14:37 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Mon, 14 Mar 2011 14:14:37 +0000 Subject: [Linux-cluster] Resource groups Message-ID: <4D7E22CD.2060009@mssl.ucl.ac.uk> Bob: You say this in your best practice document: "Our performance testing lab has experimented with various resource group sizes and found a performance problem with anything bigger than 768MB. Until this is properly diagnosed, we recommend staying below 768MB." What are the details? Nearly all of our FSes are created with 2Gb RGs. From rpeterso at redhat.com Mon Mar 14 14:30:11 2011 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 14 Mar 2011 10:30:11 -0400 (EDT) Subject: [Linux-cluster] Resource groups In-Reply-To: <4D7E22CD.2060009@mssl.ucl.ac.uk> Message-ID: <412215982.427568.1300113011942.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- Original Message ----- | Bob: | | You say this in your best practice document: | | "Our performance testing lab has experimented with various resource | group sizes and found a performance problem with anything bigger than | 768MB. Until this is properly diagnosed, we recommend staying below | 768MB." | | What are the details? Nearly all of our FSes are created with 2Gb RGs. Hi, I'm afraid I don't have many more details. This was just a comment that one of our performance guys sent to me a while back. 
I haven't had a chance to investigate his claims or look into what's going on. I'll do some tests of my own, and if I can recreate a performance problem based on rgrp size, I'll open a bugzilla record and analyze what's going on. Regards, Bob Peterson Red Hat File Systems From jduston at ll.mit.edu Mon Mar 14 22:35:30 2011 From: jduston at ll.mit.edu (Jack Duston) Date: Mon, 14 Mar 2011 18:35:30 -0400 Subject: [Linux-cluster] GFS2 file system maintenance question. Message-ID: <4D7E9832.40000@ll.mit.edu> Hello folks, I am planning to create a 2 node cluster with a GFS2 CLVM SAN. The following Note in the RHEL6 GFS2 manual jumped out at me: Chapter 3. Managing GFS2 Note: Once you have created a GFS2 file system with the mkfs.gfs2 command, you cannot decrease the size of the file system. You can, however, increase the size of an existing file system with the gfs2_grow command, as described in Section 3.6, ?Growing a File System?. This seems to me to make a GFS2 LV un-maintainable. What concerns me is the issue of how to remove a LUN from the GFS2 LV. This will be a necessity *when* there are hardware problems with a storage unit, End of Life/obsolescence (a la XRaid), or upgrade (replace 1TB HDDS with 3 TB HDDs in the LUNs). Hardware does not last forever, and manufacturers do EOL products or go out of business. I had also hoped to upgrade the 1TB HDDs in our current LUNs with 3 TB HDDs next year. I planned to free up enough space on the GFS2 LV to migrate data off one LUN. I could then decrease the GFS2 file system size, remove the LUN from the LV, destroy the RAID LUN, replace 1TB HDDs with 3TB HDDs, rebuild the RAID LUN, add the new larger LUN to the LV, increase the GFS2 file system size, and repeat migrating data off the next LUN. If the above note is correct, it seems to only way to deal with a hardware issue, obsolescence/EOL, or upgrading components is to destroy the entire GFS2 file system, build a new GFS2 file system from scratch, and restore data from backups. This might not be too bad with a small SAN of 20TB, but our data will exceed 100TB and it would be good not to have to rebuild Rome in a day. Can anyone confirm that GFS2 file system cannot be decreased? If so, is there any plan to add this capability/fix this issue in a future release? Is there another/better way to remove a LUN from GFS2 than what I considered? Any info greatly appreciated. From ooolinux at 163.com Tue Mar 15 01:35:14 2011 From: ooolinux at 163.com (yue) Date: Tue, 15 Mar 2011 09:35:14 +0800 (CST) Subject: [Linux-cluster] GFS2 file system maintenance question. In-Reply-To: <4D7E9832.40000@ll.mit.edu> References: <4D7E9832.40000@ll.mit.edu> Message-ID: <21f7a2.18b2.12eb72713be.Coremail.ooolinux@163.com> 1. GFS2 is based on a 64-bit architecture, which can theoretically accommodate an 8 EB file system. However, the current supported maximum size of a GFS2 file system is 25 TB. If your system requires GFS2 file systems larger than 25 TB, contact your Red Hat service representative. At 2011-03-15 06:35:30?"Jack Duston" wrote: >Hello folks, > >I am planning to create a 2 node cluster with a GFS2 CLVM SAN. >The following Note in the RHEL6 GFS2 manual jumped out at me: > >Chapter 3. Managing GFS2 >Note: >Once you have created a GFS2 file system with the mkfs.gfs2 command, you >cannot decrease the size of the file system. You can, however, increase >the size of an existing file system with the gfs2_grow command, as >described in Section 3.6, ?Growing a File System?. 
> >This seems to me to make a GFS2 LV un-maintainable. > >What concerns me is the issue of how to remove a LUN from the GFS2 LV. >This will be a necessity *when* there are hardware problems with a >storage unit, End of Life/obsolescence (a la XRaid), or upgrade (replace >1TB HDDS with 3 TB HDDs in the LUNs). > >Hardware does not last forever, and manufacturers do EOL products or go >out of business. >I had also hoped to upgrade the 1TB HDDs in our current LUNs with 3 TB >HDDs next year. > >I planned to free up enough space on the GFS2 LV to migrate data off one >LUN. I could then decrease the GFS2 file system size, remove the LUN >from the LV, destroy the RAID LUN, replace 1TB HDDs with 3TB HDDs, >rebuild the RAID LUN, add the new larger LUN to the LV, increase the >GFS2 file system size, and repeat migrating data off the next LUN. > >If the above note is correct, it seems to only way to deal with a >hardware issue, obsolescence/EOL, or upgrading components is to destroy >the entire GFS2 file system, build a new GFS2 file system from scratch, >and restore data from backups. This might not be too bad with a small >SAN of 20TB, but our data will exceed 100TB and it would be good not to >have to rebuild Rome in a day. > >Can anyone confirm that GFS2 file system cannot be decreased? If so, is >there any plan to add this capability/fix this issue in a future >release? Is there another/better way to remove a LUN from GFS2 than what >I considered? > >Any info greatly appreciated. > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From bergman at merctech.com Tue Mar 15 04:11:41 2011 From: bergman at merctech.com (bergman at merctech.com) Date: Tue, 15 Mar 2011 00:11:41 -0400 Subject: [Linux-cluster] quorum device not getting a vote causes 2-node cluster to be inquorate Message-ID: <25829.1300162301@mirchi> I have been using a 2-node cluster with a quorum disk successfully for about 2 years. Beginning today, the cluster will not boot correctly. The RHCS services start, but fencing fails with: dlm: no local IP address has been set dlm: cannot start dlm lowcomms -107 This seems to be a symtpom of the fact that the cluster votes do not include votes from the quorum device: # clustat Cluster Status for example-infra @ Tue Mar 15 00:02:35 2011 Member Status: Inquorate Member Name ID Status ------ ---- ---- ------ example-infr2-admin.domain.com 1 Online, Local example-infr1-admin.domain.com 2 Offline /dev/mpath/quorum 0 Offline [root at example-infr2 ~]# cman_tool status Version: 6.2.0 Config Version: 239 Cluster Name: example-infra Cluster Id: 42813 Cluster Member: Yes Cluster Generation: 676844 Membership state: Cluster-Member Nodes: 1 Expected votes: 2 Total votes: 1 Quorum: 2 Activity blocked Active subsystems: 7 Flags: Ports Bound: 0 Node name: example-infr2-admin.domain.com Node ID: 1 Multicast addresses: 239.192.167.228 Node addresses: 192.168.110.3 The shared-SAN-disk quorum device is readable from each node. Testing with "mkqdisk -L" and "dd if=/dev/mpath/quorum of=/dev/quorum.dump" both succeed from each node. 
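(For cross-checking, a short sketch of commands that show whether qdiskd has actually registered its vote with cman; all of these are standard parts of the RHEL 5 cluster packages, and the output will of course differ per cluster:

cman_tool nodes
cman_tool status | egrep 'Expected votes|Total votes|Quorum'
mkqdisk -L
service qdiskd status
)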
When run in the foreground, "qdisk -d -f" gives messages that seem to indicate that it is successful: # qdiskd -d -f [22568] debug: Loading configuration information [22568] debug: Heuristic: '/bin/ping -c3 -W1 -t2 192.168.110.10' score=1 interval=2 tko=9 [22568] debug: 1 heuristics loaded [22568] debug: Quorum Daemon: 1 heuristics, 3 interval, 15 tko, 1 votes [22568] debug: Run Flags: 00000035 [22568] info: Quorum Daemon Initializing [22568] debug: I/O Size: 512 Page Size: 4096 [22569] info: Heuristic: '/bin/ping -c3 -W1 -t2 192.168.110.10' UP [22568] debug: Node 3 is UP [22568] info: Node 3 is the master [22568] info: Initial score 1/1 [22568] info: Initialization complete [22568] notice: Score sufficient for master operation (1/1; required=1); upgrading Any suggestions? Thanks, Mark ------------Versions----------------- Linux example-infr2.domain.com 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:52:25 lvm2-cluster-2.02.56-7.el5_5.4 cman-2.0.115-34.el5_5.4 system-config-cluster-1.0.57-3.el5_5.1 rgmanager-2.0.52-6.el5.centos.8 ----------excerpt from cluster.conf---------------- -------------------------------------------------------------------------- From fdinitto at redhat.com Tue Mar 15 05:22:13 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 15 Mar 2011 06:22:13 +0100 Subject: [Linux-cluster] quorum device not getting a vote causes 2-node cluster to be inquorate In-Reply-To: <25829.1300162301@mirchi> References: <25829.1300162301@mirchi> Message-ID: <4D7EF785.8070708@redhat.com> On 03/15/2011 05:11 AM, bergman at merctech.com wrote: > I have been using a 2-node cluster with a quorum disk successfully for > about 2 years. Beginning today, the cluster will not boot correctly. > > The RHCS services start, but fencing fails with: > > dlm: no local IP address has been set > dlm: cannot start dlm lowcomms -107 > > This seems to be a symtpom of the fact that the cluster votes do not include votes from the quorum > device: > > # clustat > Cluster Status for example-infra @ Tue Mar 15 00:02:35 2011 > Member Status: Inquorate > > Member Name ID Status > ------ ---- ---- ------ > example-infr2-admin.domain.com 1 Online, Local > example-infr1-admin.domain.com 2 Offline > /dev/mpath/quorum 0 Offline > > [root at example-infr2 ~]# cman_tool status > Version: 6.2.0 > Config Version: 239 > Cluster Name: example-infra > Cluster Id: 42813 > Cluster Member: Yes > Cluster Generation: 676844 > Membership state: Cluster-Member > Nodes: 1 > Expected votes: 2 > Total votes: 1 > Quorum: 2 Activity blocked > Active subsystems: 7 > Flags: > Ports Bound: 0 > Node name: example-infr2-admin.domain.com > Node ID: 1 > Multicast addresses: 239.192.167.228 > Node addresses: 192.168.110.3 You should check the output from cman_tool nodes. It appears that the nodes are not seeing each other at all. The first things I would check are iptables, node names resolves to the correct ip addresses, selinux and eventually if the switch in between the nodes support multicast. Fabio From ooolinux at 163.com Tue Mar 15 05:37:20 2011 From: ooolinux at 163.com (yue) Date: Tue, 15 Mar 2011 13:37:20 +0800 (CST) Subject: [Linux-cluster] is ocfs2 is limited 16T In-Reply-To: <4D73805F.8020308@srce.hr> References: <4D73805F.8020308@srce.hr> <56a50421.709d.12e89a4c7cb.Coremail.ooolinux@163.com> Message-ID: <2baf4fa.6fc9.12eb804b8c4.Coremail.ooolinux@163.com> how to rebuild ocfs2.ko what is needed to changed? 
Thanks. At 2011-03-06 20:38:55, "Jakov Sosic" wrote:
>On 03/06/2011 06:30 AM, yue wrote:
>> if there is a limit on ocfs2's volume? must it be less than 16T?
>
>For RHEL v5.x and derivatives, yes. But you can hack it and rebuild
>kernel modules without the limitation. You also need to patch the kernel
>sources and rebuild the kernel too.
>
>
>--
>Jakov Sosic
>www.srce.hr
>
>
From rmitchel at redhat.com Tue Mar 15 07:09:17 2011
From: rmitchel at redhat.com (Ryan Mitchell)
Date: Tue, 15 Mar 2011 17:09:17 +1000
Subject: [Linux-cluster] GFS2 file system maintenance question.
In-Reply-To: <4D7E9832.40000@ll.mit.edu>
References: <4D7E9832.40000@ll.mit.edu>
Message-ID: <4D7F109D.4000404@redhat.com>

On 03/15/2011 08:35 AM, Jack Duston wrote:
>
> I planned to free up enough space on the GFS2 LV to migrate data off
> one LUN. I could then decrease the GFS2 file system size, remove the
> LUN from the LV, destroy the RAID LUN, replace 1TB HDDs with 3TB HDDs,
> rebuild the RAID LUN, add the new larger LUN to the LV, increase the
> GFS2 file system size, and repeat migrating data off the next LUN.
>
Hi,

No, you will not be able to use that procedure to swap LUNs. If you have the ability to present the new LUNs before removing the old LUNs from the volume group, it would be possible to:
1) vgextend the volume group using the new LUN
2) pvmove the extents from the old LUN to the new LUN
3) vgreduce the old LUN to remove it from the volume group

This could be done one LUN at a time. It doesn't even require you to grow the filesystem (unless the new LUNs are larger than the old ones). This is common and I've seen it done many times. You could even use a temporary staging LUN to shuffle the data around.

If you do not have the capacity to add additional LUNs before removing the original LUNs, then you will face a difficult migration, possibly using backup/restore as you mentioned. The feature to reduce the filesystem has not been implemented; there is no code as yet to manage it. It isn't commonly required.

Regards,

Ryan Mitchell

From fedorischev at bsu.edu.ru Tue Mar 15 08:07:21 2011
From: fedorischev at bsu.edu.ru (Fedorischev I.N.)
Date: Tue, 15 Mar 2011 11:07:21 +0300
Subject: [Linux-cluster] gfs2 volume constantly growing
Message-ID: <1300176441.4161.19.camel@ui-tcc02.bsu.edu.ru>

Hello, subscribers!

Please advise me on the following problem. Disk usage on the GFS2 volume on our cluster servers keeps growing, while the total size of the files on the partition stays small. Here it is:

# df -h
/dev/sdb5             4,7G  3,9G  853M  83% /var/log/httpd

But

# du -ch /var/log/httpd/
70M     /var/log/httpd/

I rebooted the cluster systems to kill any suspicious processes, but nothing changed. Then I unmounted the partition on both cluster nodes and ran fsck.gfs2 on it. It found many file system errors, and after that everything fell back into place. But after a week the same thing happened again: the usage is growing once more. Is this a bug in the GFS2 implementation, or something else? We are using CentOS release 5.5 x86_64 on the servers, with the dag and epel repositories for software updates.

Thanks to all.
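(One common cause of df climbing while du stays small, on GFS2 as on any filesystem, is space held by files that were deleted while some process still has them open; GFS2 in particular cannot free an unlinked inode until every node has closed it. A quick check, using the mount point from the report above; gfs2_tool ships with gfs2-utils:

lsof +L1 /var/log/httpd        # open files with link count 0, i.e. deleted but still held open
gfs2_tool df /var/log/httpd    # GFS2's own view of block usage
)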
From laszlo.budai at gmail.com Tue Mar 15 10:02:28 2011 From: laszlo.budai at gmail.com (Budai Laszlo) Date: Tue, 15 Mar 2011 12:02:28 +0200 Subject: [Linux-cluster] Service location (colocation) In-Reply-To: <20110304170647.GC14803@redhat.com> References: <4D6A7709.6060108@gmail.com> <20110304170647.GC14803@redhat.com> Message-ID: <4D7F3934.3040209@gmail.com> Is pacemaker a supported alternative for rgmanager? Starting with which version of Red Hat Enterprise Linux? Thank you, Laszlo On 03/04/2011 07:06 PM, Lon Hohberger wrote: > On Sun, Feb 27, 2011 at 06:08:41PM +0200, Budai Laszlo wrote: >> Hi all, >> >> is there a way to define location dependencies among services? for >> instance how can I define that Service A should run on the same node as >> service B? Or the opposite: Service C should run on a different node >> than service D? >> > rgmanager doesn't have this feature built-in; you can define > 'collocated services' by simply creating one large service comprising > all of the resources for both services. > > You could probably trivially extend central_processing mode to do "anti > collocation" (i.e. run on another node). > > The 'follow_service.sl' script is an example of how to do part of > 'anti-collocation'. The way it works, it starts service A on a > different node from service B. If the node running service A fails, it > is started on the same node as service B, then service B is moved away > to another (empty, usually) node in the cluster. > > Alternatively, pacemaker supports this functionality. > From andrew at beekhof.net Tue Mar 15 11:50:32 2011 From: andrew at beekhof.net (Andrew Beekhof) Date: Tue, 15 Mar 2011 12:50:32 +0100 Subject: [Linux-cluster] Service location (colocation) In-Reply-To: <4D7F3934.3040209@gmail.com> References: <4D6A7709.6060108@gmail.com> <20110304170647.GC14803@redhat.com> <4D7F3934.3040209@gmail.com> Message-ID: On Tue, Mar 15, 2011 at 11:02 AM, Budai Laszlo wrote: > Is pacemaker a supported alternative for rgmanager? Not yet, although it is available as of 6.0 > Starting with which > version of Red Hat Enterprise Linux? > > Thank you, > Laszlo > > > On 03/04/2011 07:06 PM, Lon Hohberger wrote: >> On Sun, Feb 27, 2011 at 06:08:41PM +0200, Budai Laszlo wrote: >>> Hi all, >>> >>> is there a way to define location dependencies among services? for >>> instance how can I define that Service A should run on the same node as >>> service B? Or the opposite: Service C should run on a different node >>> than service D? >>> >> rgmanager doesn't have this feature built-in; you can define >> 'collocated services' by simply creating one large service comprising >> all of the resources for both services. >> >> You could probably trivially extend central_processing mode to do "anti >> collocation" (i.e. run on another node). >> >> The 'follow_service.sl' script is an example of how to do part of >> 'anti-collocation'. ? The way it works, it starts service A on a >> different node from service B. ?If the node running service A fails, it >> is started on the same node as service B, then service B is moved away >> to another (empty, usually) node in the cluster. >> >> Alternatively, pacemaker supports this functionality. 
>> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From jayesh.shinde at netcore.co.in Tue Mar 15 12:13:14 2011 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Tue, 15 Mar 2011 17:43:14 +0530 Subject: [Linux-cluster] Split-brain with DRBD active-active + RHCS Message-ID: <4D7F57DA.8060204@netcore.co.in> Hi All , I don't have SAN with me , so I want to build the 2 node DRBD active active for mysql & http resource ( i.e /dev/drbd2 & /dev/drbd3 in my case) with RHCS . I configured the require setup from http://sourceware.org/cluster/wiki/DRBD_Cookbook and from DRDB links. From last 1 week I am testing the same scenario in 2 XEN vms with kenel 2.6.18-128.el5xen , Every thing is working fine , like mysql and http services move from one server to other etc... But not working correctly when it get fence ( i.e when n/w fail on of the node). *I am facing the split-brain problem. * I search a lot in google and mailling list but don't found the proper correct solution and suggestion. For Fence testing I am doing following. ================================ 1) On node1 http service is running with /dev/drbd2 2) On node2 mysql service is runing with /dev/drbd3 At this movement the DRBD primary-primary status is working correctly. 3) Now On node2 When I stop n/w service manually by "service network stop" then within 3-5 sec. node2 get fence properly and mysql service get switch on node1 properly. 4) After fencing when node2 come up , then I am facing the DRBD split-brain issue with node1 and node2. My questions :-- ========== 1) Why DRBD Split brain is not coming when I reboot or shutdown or destroy the machine by xm command i.e "xm reboot " OR xm shutdown OR xm destroy 2) Why the DRBD split brain issue come at the time of fencing node only ? 3) Is the combination of DRBD active-active + RHCS is stable ? and workable solution Because one of the below mailling list I found it's workable solution http://www.gossamer-threads.com/lists/drbd/users/20467#20467 4) In fencing Is there any extra setting require for such combination ? 5) Do I need to use any custom fencing logic ? 6) Any one using such "*DRBD active-active + RHCS*" setup in Live without split brain issue ? Please guide and suggest on the same. Regards Jayesh Shinde -------------- next part -------------- An HTML attachment was scrubbed... URL: From jayesh.shinde at netcore.co.in Tue Mar 15 12:34:59 2011 From: jayesh.shinde at netcore.co.in (jayesh.shinde) Date: Tue, 15 Mar 2011 18:04:59 +0530 Subject: [Linux-cluster] Split-brain with DRBD active-active + RHCS In-Reply-To: <4D7F57DA.8060204@netcore.co.in> References: <4D7F57DA.8060204@netcore.co.in> Message-ID: <4D7F5CF3.4050805@netcore.co.in> Hi All , Just want mention one point. I am using Ext3 filesystem in below setup. Regards Jayesh Shinde On 03/15/2011 05:43 PM, jayesh.shinde wrote: > Hi All , > > I don't have SAN with me , so I want to build the 2 node DRBD active > active for mysql & http resource ( i.e /dev/drbd2 & /dev/drbd3 in my > case) with RHCS . > > I configured the require setup from > http://sourceware.org/cluster/wiki/DRBD_Cookbook and from DRDB links. > > From last 1 week I am testing the same scenario in 2 XEN vms with > kenel 2.6.18-128.el5xen , Every thing is working fine , like mysql > and http services move from one server to other etc... But not working > correctly when it get fence ( i.e when n/w fail on of the node). > > *I am facing the split-brain problem. 
* I search a lot in google and > mailling list but don't found the proper correct solution and suggestion. > > For Fence testing I am doing following. > ================================ > 1) On node1 http service is running with /dev/drbd2 > 2) On node2 mysql service is runing with /dev/drbd3 > At this movement the DRBD primary-primary status is working correctly. > 3) Now On node2 When I stop n/w service manually by "service network > stop" then within 3-5 sec. node2 get fence properly and mysql > service get switch on node1 properly. > 4) After fencing when node2 come up , then I am facing the DRBD > split-brain issue with node1 and node2. > > My questions :-- > ========== > 1) Why DRBD Split brain is not coming when I reboot or shutdown or > destroy the machine by xm command > i.e "xm reboot " OR xm shutdown > OR xm destroy > > 2) Why the DRBD split brain issue come at the time of fencing node only ? > > 3) Is the combination of DRBD active-active + RHCS is stable ? and > workable solution > Because one of the below mailling list I found it's workable solution > http://www.gossamer-threads.com/lists/drbd/users/20467#20467 > > 4) In fencing Is there any extra setting require for such combination ? > 5) Do I need to use any custom fencing logic ? > > 6) Any one using such "*DRBD active-active + RHCS*" setup in Live > without split brain issue ? > > Please guide and suggest on the same. > > Regards > Jayesh Shinde > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bazy84 at gmail.com Tue Mar 15 13:53:51 2011 From: bazy84 at gmail.com (Bazy) Date: Tue, 15 Mar 2011 15:53:51 +0200 Subject: [Linux-cluster] Split-brain with DRBD active-active + RHCS In-Reply-To: <4D7F5CF3.4050805@netcore.co.in> References: <4D7F57DA.8060204@netcore.co.in> <4D7F5CF3.4050805@netcore.co.in> Message-ID: Hello, Good question. I myself use the manual split brain recovery after one of the nodes fails. See http://www.drbd.org/users-guide/s-resolve-split-brain.html. If anyone can share how to resolve this without manual intervention it would be great. Cheers! On Tue, Mar 15, 2011 at 2:34 PM, jayesh.shinde wrote: > Hi All , > > Just want mention one point. I am using Ext3 filesystem in below setup. > > Regards > Jayesh Shinde > > > On 03/15/2011 05:43 PM, jayesh.shinde wrote: > > Hi All , > > I don't have SAN with me , so I want to build the 2 node DRBD active active > for mysql? & http? resource ( i.e /dev/drbd2 & /dev/drbd3 in my case) with > RHCS . > > I configured the require setup from > http://sourceware.org/cluster/wiki/DRBD_Cookbook and from DRDB links. > > From last 1 week I am testing the same scenario in 2 XEN vms with kenel > 2.6.18-128.el5xen , Every thing is working fine , like mysql and http > services move from one server to other etc...? But not working correctly > when it get fence ( i.e when n/w fail on of the node). > > I am facing the split-brain problem.? I search a lot in google and mailling > list but don't found the proper correct solution and suggestion. > > For Fence testing? I am doing following. > ================================ > 1) On node1 http service is running with /dev/drbd2 > 2) On node2 mysql service is runing with /dev/drbd3 > ??? At this movement the DRBD primary-primary status is working correctly. > 3) Now On node2 When I stop n/w service manually by "service network stop" > then within 3-5 sec.? node2 get fence properly and mysql service get switch > on node1 properly. 
> 4) After fencing when node2 come up , then I am facing the DRBD split-brain > issue with node1 and node2. > > My questions :-- > ========== > 1) Why DRBD Split brain is not coming when I reboot or shutdown or destroy > the machine by xm command > ??? i.e? "xm reboot "?? OR ??????? xm shutdown > OR????? xm destroy > > 2) Why the DRBD split brain issue come at the time of fencing node only ? > > 3) Is the combination of DRBD active-active + RHCS is stable ? and workable > solution > ??? Because one of the below mailling list I found it's workable solution > ??? http://www.gossamer-threads.com/lists/drbd/users/20467#20467 > > 4) In fencing Is there any extra setting require for such combination? ? > 5) Do I need to use any custom fencing logic ? > > 6) Any one using such "DRBD active-active + RHCS"? setup in Live without > split brain issue ? > > Please guide and suggest on the same. > > Regards > Jayesh Shinde > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From bergman at merctech.com Tue Mar 15 15:42:09 2011 From: bergman at merctech.com (bergman at merctech.com) Date: Tue, 15 Mar 2011 11:42:09 -0400 Subject: [Linux-cluster] quorum device not getting a vote causes 2-node cluster to be inquorate In-Reply-To: <4D7EF785.8070708@redhat.com> References: <25829.1300162301@mirchi> <4D7EF785.8070708@redhat.com> Message-ID: <20110315114209.73b9f0f2@mirchi> The pithy ruminations from "Fabio M. Di Nitto" on "Re: [Linux-cluster] quorum device not getting a vote causes 2-node cluster to be inquorate" were: => On 03/15/2011 05:11 AM, bergman at merctech.com wrote: => > I have been using a 2-node cluster with a quorum disk successfully for => > about 2 years. Beginning today, the cluster will not boot correctly. => > => > The RHCS services start, but fencing fails with: => > => > dlm: no local IP address has been set => > dlm: cannot start dlm lowcomms -107 => > => > This seems to be a symtpom of the fact that the cluster votes do not include votes from the quorum => > device: => > => > # clustat => > Cluster Status for example-infra @ Tue Mar 15 00:02:35 2011 => > Member Status: Inquorate => > => > Member Name ID Status => > ------ ---- ---- ------ => > example-infr2-admin.domain.com 1 Online, Local => > example-infr1-admin.domain.com 2 Offline => > /dev/mpath/quorum 0 Offline => > => > [root at example-infr2 ~]# cman_tool status => > Version: 6.2.0 => > Config Version: 239 => > Cluster Name: example-infra => > Cluster Id: 42813 => > Cluster Member: Yes => > Cluster Generation: 676844 => > Membership state: Cluster-Member => > Nodes: 1 => > Expected votes: 2 => > Total votes: 1 => > Quorum: 2 Activity blocked => > Active subsystems: 7 => > Flags: => > Ports Bound: 0 => > Node name: example-infr2-admin.domain.com => > Node ID: 1 => > Multicast addresses: 239.192.167.228 => > Node addresses: 192.168.110.3 => => You should check the output from cman_tool nodes. It appears that the => nodes are not seeing each other at all. That's correct...at the time I ran cman_tool and clustat, one node was down (deliberately, in an attempt to troubleshoot the issue, but this would also be the case in the event of a hardware failure). As I see it, the problem is not with the inter-node communication, but with the quorum device. Note that there is only one vote registered--there are no votes from the quorum device. The quorum device should provide sufficient votes to make the "cluster" quorate if only one node is running. 
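For comparison, a two-node-plus-qdisk vote layout is normally set up along these lines (a sketch with hypothetical node names and heuristic, not the actual cluster.conf from this cluster):

  <cman expected_votes="3" two_node="0"/>
  <clusternodes>
      <clusternode name="infr1-admin.domain.com" nodeid="1" votes="1"/>
      <clusternode name="infr2-admin.domain.com" nodeid="2" votes="1"/>
  </clusternodes>
  <quorumd device="/dev/mpath/quorum" votes="1"
           interval="1" tko="10" min_score="1">
      <heuristic program="ping -c1 -w1 192.168.110.1" score="1" interval="2"/>
  </quorumd>

One vote per node plus one from qdiskd gives expected_votes=3 and quorum=2, which is what should let a lone node that can still write to (and score on) the quorum device remain quorate.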
If I understand it correctly, this should also let the "cluster" start with a single node (as long as that node can write to the quorum device). If my understanding is wrong, then how can a 2-node cluster start if one node is down? => => The first things I would check are iptables, node names resolves to the => correct ip addresses, selinux and eventually if the switch in between => the nodes support multicast. SElinux is disabled (as it has been for the 2 years this cluster has been operational). There have been no switch changes. Node names & IPs resolve correctly. IPtables permits all communication between the "admin" address on the servers. => => Fabio => => -- => Linux-cluster mailing list => Linux-cluster at redhat.com => https://www.redhat.com/mailman/listinfo/linux-cluster => Thanks, Mark From m.watts at eris.qinetiq.com Tue Mar 15 16:11:45 2011 From: m.watts at eris.qinetiq.com (Mark Watts) Date: Tue, 15 Mar 2011 16:11:45 +0000 Subject: [Linux-cluster] Split-brain with DRBD active-active + RHCS In-Reply-To: <4D7F5CF3.4050805@netcore.co.in> References: <4D7F57DA.8060204@netcore.co.in> <4D7F5CF3.4050805@netcore.co.in> Message-ID: <4D7F8FC1.9030505@eris.qinetiq.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/15/2011 12:34 PM, jayesh.shinde wrote: > Hi All , > > Just want mention one point. I am using Ext3 filesystem in below setup. Dual Primary DRBD (Active-Active) and EXT3 are mutually exclusive. You should be using GFS(2) (or OCFS2) on a Dual-Primary setup. - -- Mark Watts BSc RHCE Senior Systems Engineer, MSS Secure Managed Hosting www.QinetiQ.com QinetiQ - Delivering customer-focused solutions GPG Key: http://www.linux-corner.info/mwatts.gpg -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iEYEARECAAYFAk1/j8AACgkQBn4EFUVUIO03DQCggILhV71sPjJ+VFXxtfjT+DPS 6n8An3loexkOIeaNg6IW/IhZ8wmI0saQ =F80k -----END PGP SIGNATURE----- From jduston at ll.mit.edu Tue Mar 15 17:14:49 2011 From: jduston at ll.mit.edu (Jack Duston) Date: Tue, 15 Mar 2011 13:14:49 -0400 Subject: [Linux-cluster] GFS2 file system maintenance question. In-Reply-To: <21f7a2.18b2.12eb72713be.Coremail.ooolinux@163.com> References: <4D7E9832.40000@ll.mit.edu> <21f7a2.18b2.12eb72713be.Coremail.ooolinux@163.com> Message-ID: <4D7F9E89.1020307@ll.mit.edu> Thanks Yue, but your information would seem dated if this site is correct: http://www.redhat.com/rhel/compare Even if 100TB is what's officially supported in RHEL6, it doesn't mean that larger file systems won't work. Most likely that is the largest amount of storage that Red Hat had available to test. Since this is a brand new setup, now is a great time to see if it works with the storage I have available. If it doesn't, then I haven't lost anything other than a little time, and I'll just chunk it up into 100TB Logical Volumes. However, since it would be better for our purposes, I would like to keep our data in one file system if possible. Regards, Jack On 03/14/2011 09:35 PM, yue wrote: > 1. > GFS2 is based on a 64-bit architecture, which can theoretically > accommodate an 8 EB file system. However, the current supported > maximum size of a GFS2 file system is 25 TB. If your system requires > GFS2 file systems larger than 25 TB, contact your Red Hat service > representative. > > > At 2011-03-15 06:35:30?"Jack Duston" wrote: > > >Hello folks, > > > >I am planning to create a 2 node cluster with a GFS2 CLVM SAN. 
> >The following Note in the RHEL6 GFS2 manual jumped out at me: > > > >Chapter 3. Managing GFS2 > >Note: > >Once you have created a GFS2 file system with the mkfs.gfs2 command, you > >cannot decrease the size of the file system. You can, however, increase > >the size of an existing file system with the gfs2_grow command, as > >described in Section 3.6, ?Growing a File System?. > > > >This seems to me to make a GFS2 LV un-maintainable. > > > >What concerns me is the issue of how to remove a LUN from the GFS2 LV. > >This will be a necessity *when* there are hardware problems with a > >storage unit, End of Life/obsolescence (a la XRaid), or upgrade (replace > >1TB HDDS with 3 TB HDDs in the LUNs). > > > >Hardware does not last forever, and manufacturers do EOL products or go > >out of business. > >I had also hoped to upgrade the 1TB HDDs in our current LUNs with 3 TB > >HDDs next year. > > > >I planned to free up enough space on the GFS2 LV to migrate data off one > >LUN. I could then decrease the GFS2 file system size, remove the LUN > >from the LV, destroy the RAID LUN, replace 1TB HDDs with 3TB HDDs, > >rebuild the RAID LUN, add the new larger LUN to the LV, increase the > >GFS2 file system size, and repeat migrating data off the next LUN. > > > >If the above note is correct, it seems to only way to deal with a > >hardware issue, obsolescence/EOL, or upgrading components is to destroy > >the entire GFS2 file system, build a new GFS2 file system from scratch, > >and restore data from backups. This might not be too bad with a small > >SAN of 20TB, but our data will exceed 100TB and it would be good not to > >have to rebuild Rome in a day. > > > >Can anyone confirm that GFS2 file system cannot be decreased? If so, is > >there any plan to add this capability/fix this issue in a future > >release? Is there another/better way to remove a LUN from GFS2 than what > >I considered? > > > >Any info greatly appreciated. > > > >-- > >Linux-cluster mailing list > >Linux-cluster at redhat.com > >https://www.redhat.com/mailman/listinfo/linux-cluster > > From bazy84 at gmail.com Tue Mar 15 17:58:45 2011 From: bazy84 at gmail.com (Bazy) Date: Tue, 15 Mar 2011 19:58:45 +0200 Subject: [Linux-cluster] Split-brain with DRBD active-active + RHCS In-Reply-To: <4D7F8FC1.9030505@eris.qinetiq.com> References: <4D7F57DA.8060204@netcore.co.in> <4D7F5CF3.4050805@netcore.co.in> <4D7F8FC1.9030505@eris.qinetiq.com> Message-ID: Hi Mark, Yes, clustered file system is mandatory. Even with gfs(2) DRBD will not recover by itself from a split brain. I think specific options are needed in drbd.conf "after-sb-0pri after-sb-1pri after-sb-2pri", but don't know what the exact ones are. Best regards! On Tue, Mar 15, 2011 at 6:11 PM, Mark Watts wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 03/15/2011 12:34 PM, jayesh.shinde wrote: >> Hi All , >> >> Just want mention one point. I am using Ext3 filesystem in below setup. > > Dual Primary DRBD (Active-Active) and EXT3 are mutually exclusive. > > You should be using GFS(2) (or OCFS2) on a Dual-Primary setup. 
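(For anyone following the thread: moving /dev/drbd2 from ext3 to GFS2 would look roughly like this. It is only a sketch -- the cluster name and mount point are made up, and mkfs destroys the existing ext3 data, so back it up first:

  mkfs.gfs2 -p lock_dlm -t mycluster:drbd2 -j 2 /dev/drbd2
  mount -t gfs2 /dev/drbd2 /data

The value after -t has to match the cluster name in cluster.conf, and -j needs one journal per node that will mount the filesystem.)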
> > - -- > Mark Watts BSc RHCE > Senior Systems Engineer, MSS Secure Managed Hosting > www.QinetiQ.com > QinetiQ - Delivering customer-focused solutions > GPG Key: http://www.linux-corner.info/mwatts.gpg > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk1/j8AACgkQBn4EFUVUIO03DQCggILhV71sPjJ+VFXxtfjT+DPS > 6n8An3loexkOIeaNg6IW/IhZ8wmI0saQ > =F80k > -----END PGP SIGNATURE----- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From jduston at ll.mit.edu Tue Mar 15 18:14:24 2011 From: jduston at ll.mit.edu (Jack Duston) Date: Tue, 15 Mar 2011 14:14:24 -0400 Subject: [Linux-cluster] GFS2 file system maintenance question. In-Reply-To: <4D7F109D.4000404@redhat.com> References: <4D7E9832.40000@ll.mit.edu> <4D7F109D.4000404@redhat.com> Message-ID: <4D7FAC80.9080804@ll.mit.edu> Thanks much Ryan, Its good to know this situation can be handled via the CLVM layer even though GFS2 doesn't provide a method directly. I just needed to know there was a way to deal with those situations. I will hang on to the least bad old RAID system, rather than surplussing it, to use as a temp LUN. I understand we are not a typical use case. We do re-purpose equipment for other experiments or tasks at times, and it would be great to be able without completely tearing down existing setups. Even though its not a common event, it sure would be nice to have the capability to move older data offline and reduce the file system in place if the situation arises. Other than just griping, I'd also like to say its greatly appreciated that you (Red Hat) created and have made GFS2 available. We have been using XSan2/StorNext, and its great to have such an alternative. Now that Apple has discontinued its Server support we are looking to move to Red Hat's GFS2 SAN solution. We looked into other cluster file systems like GlusterFS, but we need a real SAN and not a distributed file system for this use case. Thanks again, Jack On 03/15/2011 03:09 AM, Ryan Mitchell wrote: > On 03/15/2011 08:35 AM, Jack Duston wrote: >> I planned to free up enough space on the GFS2 LV to migrate data off >> one LUN. I could then decrease the GFS2 file system size, remove the >> LUN from the LV, destroy the RAID LUN, replace 1TB HDDs with 3TB HDDs, >> rebuild the RAID LUN, add the new larger LUN to the LV, increase the >> GFS2 file system size, and repeat migrating data off the next LUN. >> > Hi, > > No you will not be able to use that procedure to swap LUNs. If you have > the ability to present the new LUN's before removing the old LUN's from > the volume group, it would be possible to: > 1) vgextend the volume group using the new LUN > 2) pvmove the extents from the old LUN to the new LUN > 3) vgreduce the old LUN to remove it from the volume group > > This could be done 1 LUN at a time. It doesn't even require you to grow > the filesystem (unless the new LUN's are larger than the old ones). > This is common and I've seen it done many times. You could even use a > temporary staging LUN to shuffle the data around. > > If you do not have the capacity to add additional LUNs before removing > the original LUNs, then you will face a difficult migration, possibly > using backup/restore as you mentioned. > > The feature to reduce the filesystem has not been implemented; there is > no code as yet to manage it. It isn't commonly required. 
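(As a shell sketch of that LUN swap -- the VG and device names below are made up, and on a clustered (CLVM) volume group pvmove may need cmirrord or exclusive activation, so check your release notes and test on scratch storage first:

  pvcreate /dev/mapper/new_lun
  vgextend san_vg /dev/mapper/new_lun
  pvmove /dev/mapper/old_lun /dev/mapper/new_lun   # moves all extents; can take a long time
  vgreduce san_vg /dev/mapper/old_lun
  pvremove /dev/mapper/old_lun

Only if the replacement LUN is larger would you then follow up with lvextend and gfs2_grow to use the extra space.)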
> > Regards, > > Ryan Mitchell > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ajb2 at mssl.ucl.ac.uk Tue Mar 15 18:26:41 2011 From: ajb2 at mssl.ucl.ac.uk (Alan Brown) Date: Tue, 15 Mar 2011 18:26:41 +0000 Subject: [Linux-cluster] GFS2 file system maintenance question. In-Reply-To: <4D7F9E89.1020307@ll.mit.edu> References: <4D7E9832.40000@ll.mit.edu> <21f7a2.18b2.12eb72713be.Coremail.ooolinux@163.com> <4D7F9E89.1020307@ll.mit.edu> Message-ID: <4D7FAF61.4050801@mssl.ucl.ac.uk> Jack Duston wrote: > Thanks Yue, but your information would seem dated if this site is correct: > > http://www.redhat.com/rhel/compare > > Even if 100TB is what's officially supported in RHEL6, it doesn't mean > that larger file systems won't work. Anyone considering such large filesystems should consider the following questions. 1: How long is it going to take to back it up.? 2: How long will it take to restore?? Even LTO5 takes the best part of 12 hours to restore 1Tb... From linux at alteeve.com Tue Mar 15 19:03:06 2011 From: linux at alteeve.com (Digimer) Date: Tue, 15 Mar 2011 15:03:06 -0400 Subject: [Linux-cluster] Split-brain with DRBD active-active + RHCS In-Reply-To: References: <4D7F57DA.8060204@netcore.co.in> <4D7F5CF3.4050805@netcore.co.in> <4D7F8FC1.9030505@eris.qinetiq.com> Message-ID: <4D7FB7EA.8090604@alteeve.com> On 03/15/2011 01:58 PM, Bazy wrote: > Hi Mark, > > Yes, clustered file system is mandatory. Even with gfs(2) DRBD will > not recover by itself from a split brain. I think specific options are > needed in drbd.conf "after-sb-0pri after-sb-1pri after-sb-2pri", but > don't know what the exact ones are. > > Best regards! For the recovery; resource r0 { device /dev/drbd0; net { after-sb-0pri discard-zero-changes; after-sb-1pri discard-secondary; after-sb-2pri disconnect; } } To tie it into fenced via cman (to fence instead of split-brain), also add: resource r0 { device /dev/drbd0; disk { fencing resource-and-stonith; } handlers { outdate-peer "/sbin/obliterate"; } } You can download 'obliterate' from here: http://people.redhat.com/lhh/obliterate (found here: http://gfs.wikidev.net/DRBD_Cookbook) See also: http://www.drbd.org/users-guide/ch-rhcs.html -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From jduston at ll.mit.edu Tue Mar 15 20:55:27 2011 From: jduston at ll.mit.edu (Jack Duston) Date: Tue, 15 Mar 2011 16:55:27 -0400 Subject: [Linux-cluster] GFS2 file system maintenance question. In-Reply-To: <4D7FAF61.4050801@mssl.ucl.ac.uk> References: <4D7E9832.40000@ll.mit.edu> <21f7a2.18b2.12eb72713be.Coremail.ooolinux@163.com> <4D7F9E89.1020307@ll.mit.edu> <4D7FAF61.4050801@mssl.ucl.ac.uk> Message-ID: <4D7FD23F.1040200@ll.mit.edu> Hi Alan, These certainly are concerns, although straying a little off-topic. I am risk-averse and try to avoid or mitigate 'gotcha' issues. (Hence why I'm thinking about maintenance issues now). Unfortunately, having less data is not an option, so I need to try to implement the best solution available to handle that data. I'm not sure backing up, say 5 x 100TB filesystems would be much difference from backing up the same data on a single 1 x 500TB filesystem. Restoring a single 100TB filesystem would definitely be easier than a single 500TB filesystem, but the trade-off is that the data is split up across 5 filesystems. Tape backup has definite drawbacks when you start dealing with large data sets. 
Fortunately our data will not be changing often. We are backing up to external hard drives using trayless chassis, basically using 2TB HDDs as jumbo floppy drives. YMMV. We are presently running a 70TB XSan2/StorNext SAN. It has been rock-steady since created, about half a year now (XSan1, not so much). I hope creating a 100TB filesystem from one designed to scale to 8EB should really not be too much of a test for GFS2. That is only 1/80th its design capacity. I do not think the developers at Red Hat are any less capable than those at Quantum. (although it will certainly suck to uncover an edge case or bug triggered by >100TB filesystem!). Given the choice, I certainly would not be pushing the boundaries of what's officially supported. However, it does seem Red Hat has built a Monster Truck. Since I have cars that need crushing, lets see if it can crush some cars before just parking it in the drive. Cheers, Jack p.s. I'll start building more driveways if it can't, but lets at least try it first... On 03/15/2011 02:26 PM, Alan Brown wrote: > Jack Duston wrote: >> Thanks Yue, but your information would seem dated if this site is correct: >> >> http://www.redhat.com/rhel/compare >> >> Even if 100TB is what's officially supported in RHEL6, it doesn't mean >> that larger file systems won't work. > Anyone considering such large filesystems should consider the following > questions. > > 1: How long is it going to take to back it up.? > > 2: How long will it take to restore?? > > Even LTO5 takes the best part of 12 hours to restore 1Tb... > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ooolinux at 163.com Wed Mar 16 01:51:57 2011 From: ooolinux at 163.com (yue) Date: Wed, 16 Mar 2011 09:51:57 +0800 (CST) Subject: [Linux-cluster] which is max gfs2 filesystem size,25T or 100T? Message-ID: <5ee37b80.144c7.12ebc5cbc8e.Coremail.ooolinux@163.com> 1.the link is rhel5 and rhel6. but the article confuse me. http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Global_File_System_2/ch-overview-GFS2.html http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/ch-overview-GFS2.html 2.if it is to say the limitation is efficacious on fedora or centos? thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From parvez.h.shaikh at gmail.com Wed Mar 16 05:07:55 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Wed, 16 Mar 2011 10:37:55 +0530 Subject: [Linux-cluster] Clustat exit code for service status Message-ID: Hi all, Command clustat -s gives status of service. If service is started (i.e. running on some node), exit code of this command is 0, if however service is not running, its exit code is non-zero (found it to be 119). Is this right and going to be continued in subsequent cluster versions as well? Reason I am asking this, is if I can use this command in shell script to give status of service - clustat -s if [ $? -eq 0 ]; then echo "service is up" else echo "service is not up" Thanks Parvez -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Wed Mar 16 18:07:50 2011 From: linux at alteeve.com (Digimer) Date: Wed, 16 Mar 2011 14:07:50 -0400 Subject: [Linux-cluster] Tripp Lite switched PDU fence agent; exists? Message-ID: <4D80FC76.6070605@alteeve.com> Hi all, Does anyone know if the tripp lite (mn: PDUMH15ATNET, specifically) has an existing RHCS fence agent? 
Specifically for cluster 2 / EL5.5. If not, has anyone written one? Failing all that, I suppose I will write one. :) -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From bergman at merctech.com Wed Mar 16 18:59:51 2011 From: bergman at merctech.com (bergman at merctech.com) Date: Wed, 16 Mar 2011 14:59:51 -0400 Subject: [Linux-cluster] Tripp Lite switched PDU fence agent; exists? In-Reply-To: <4D80FC76.6070605@alteeve.com> References: <4D80FC76.6070605@alteeve.com> Message-ID: <20110316145951.75cc0849@mirchi> The pithy ruminations from Digimer on "[Linux-cluster] Tripp Lite switched PDU fence agent; exists?" were: => Hi all, => => Does anyone know if the tripp lite (mn: PDUMH15ATNET, specifically) => has an existing RHCS fence agent? Specifically for cluster 2 / EL5.5. If Yes. => not, has anyone written one? Failing all that, I suppose I will write => one. :) => Yes. I wrote an agent for that piece of hardware and offered the agent to the RHCS community in Nov 2008...there was no response at the time.[1] In March, 2009, I sent a copy of the agent script to Jan Friesse , Marek Grac , who were identified as the maintainers of all the fence agents. Since it apparently hasn't made it into the RHCS distribution, let me know if you want a copy. Finally, I'd like to warn people away from using the TrippLite PDU model PDUMH15ATNET as a fencing device. While it seems to have nice features, it has a design choice that is a serious problem with fencing--when a command is given to power down an outlet, there is a "random" delay (observed to be about 17 to 35 seconds) before that command is executed. This has been acknowledged by TrippLite support as a design choice, with no option or setting to override this behavior. Mark [1] http://www.redhat.com/archives/linux-cluster/2008-November/msg00215.html From linux at alteeve.com Wed Mar 16 19:57:45 2011 From: linux at alteeve.com (Digimer) Date: Wed, 16 Mar 2011 15:57:45 -0400 Subject: [Linux-cluster] Tripp Lite switched PDU fence agent; exists? In-Reply-To: <20110316145951.75cc0849@mirchi> References: <4D80FC76.6070605@alteeve.com> <20110316145951.75cc0849@mirchi> Message-ID: <4D811639.1070704@alteeve.com> On 03/16/2011 02:59 PM, bergman at merctech.com wrote: > The pithy ruminations from Digimer on "[Linux-cluster] Tripp Lite switched PDU fence agent; exists?" were: > > => Hi all, > => > => Does anyone know if the tripp lite (mn: PDUMH15ATNET, specifically) > => has an existing RHCS fence agent? Specifically for cluster 2 / EL5.5. If > > Yes. > > > => not, has anyone written one? Failing all that, I suppose I will write > => one. :) > => > > Yes. > > I wrote an agent for that piece of hardware and offered the agent to the RHCS community in Nov 2008...there was no response at the time.[1] > > In March, 2009, I sent a copy of the agent script to Jan Friesse , Marek Grac , who were identified as the maintainers of all the fence agents. > > Since it apparently hasn't made it into the RHCS distribution, let me know if you want a copy. > > > Finally, I'd like to warn people away from using the TrippLite PDU model > PDUMH15ATNET as a fencing device. While it seems to have nice features, it has > a design choice that is a serious problem with fencing--when a command is > given to power down an outlet, there is a "random" delay (observed to be > about 17 to 35 seconds) before that command is executed. 
This has been > acknowledged by TrippLite support as a design choice, with no option or setting > to override this behavior. > > Mark > > > [1] http://www.redhat.com/archives/linux-cluster/2008-November/msg00215.html Hi Mark, I came across your post in the archives, actually. :) I would like a copy of your agent, if you don't mind. I already maintain another fence agent, and would be happy to maintain this one, shy of someone more experience stepping up. As for the delay, that sounds annoying, but not insurmountable. I've got one of the switches on order already, as I wanted to see how they worked. I can fairly easily put in a 5-sec poll that checks the state until the node is cut or a timeout is hit. From the cluster's point of view, this is safe outside of delaying recovery. In my case though, I'll be sure to use it as the secondary fence device. I'll include such a warning/suggestion in the agent's man page as well. Cheers -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From fdinitto at redhat.com Wed Mar 16 20:24:38 2011 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 16 Mar 2011 21:24:38 +0100 Subject: [Linux-cluster] Tripp Lite switched PDU fence agent; exists? In-Reply-To: <20110316145951.75cc0849@mirchi> References: <4D80FC76.6070605@alteeve.com> <20110316145951.75cc0849@mirchi> Message-ID: <4D811C86.70202@redhat.com> On 03/16/2011 07:59 PM, bergman at merctech.com wrote: > The pithy ruminations from Digimer on "[Linux-cluster] Tripp Lite switched PDU fence agent; exists?" were: > > => Hi all, > => > => Does anyone know if the tripp lite (mn: PDUMH15ATNET, specifically) > => has an existing RHCS fence agent? Specifically for cluster 2 / EL5.5. If > > Yes. > > > => not, has anyone written one? Failing all that, I suppose I will write > => one. :) > => > > Yes. > > I wrote an agent for that piece of hardware and offered the agent to the RHCS community in Nov 2008...there was no response at the time.[1] > > In March, 2009, I sent a copy of the agent script to Jan Friesse , Marek Grac , who were identified as the maintainers of all the fence agents. > > Since it apparently hasn't made it into the RHCS distribution, let me know if you want a copy. > Hmm ok, this is pretty bad... i am sorry that it got missed and I take responsibility for it. Can you please send it to me/digimer? Digimer, you have git commit access to fence-agents.git. If the agent is GPLv2+ compliant, and it looks sane, please add it. Fabio From linux at alteeve.com Wed Mar 16 20:52:13 2011 From: linux at alteeve.com (Digimer) Date: Wed, 16 Mar 2011 16:52:13 -0400 Subject: [Linux-cluster] Tripp Lite switched PDU fence agent; exists? In-Reply-To: <4D811C86.70202@redhat.com> References: <4D80FC76.6070605@alteeve.com> <20110316145951.75cc0849@mirchi> <4D811C86.70202@redhat.com> Message-ID: <4D8122FD.2050001@alteeve.com> On 03/16/2011 04:24 PM, Fabio M. Di Nitto wrote: > On 03/16/2011 07:59 PM, bergman at merctech.com wrote: >> The pithy ruminations from Digimer on "[Linux-cluster] Tripp Lite switched PDU fence agent; exists?" were: >> >> => Hi all, >> => >> => Does anyone know if the tripp lite (mn: PDUMH15ATNET, specifically) >> => has an existing RHCS fence agent? Specifically for cluster 2 / EL5.5. If >> >> Yes. >> >> >> => not, has anyone written one? Failing all that, I suppose I will write >> => one. :) >> => >> >> Yes. 
>> >> I wrote an agent for that piece of hardware and offered the agent to the RHCS community in Nov 2008...there was no response at the time.[1] >> >> In March, 2009, I sent a copy of the agent script to Jan Friesse , Marek Grac , who were identified as the maintainers of all the fence agents. >> >> Since it apparently hasn't made it into the RHCS distribution, let me know if you want a copy. >> > > Hmm ok, this is pretty bad... i am sorry that it got missed and I take > responsibility for it. > > Can you please send it to me/digimer? > > Digimer, you have git commit access to fence-agents.git. > > If the agent is GPLv2+ compliant, and it looks sane, please add it. > > Fabio I've got a copy now. Let me wait until I get my hardware to test/tweak/document it, then I will look to push it into git. Thanks again, Mark! -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From parvez.h.shaikh at gmail.com Thu Mar 17 05:25:51 2011 From: parvez.h.shaikh at gmail.com (Parvez Shaikh) Date: Thu, 17 Mar 2011 10:55:51 +0530 Subject: [Linux-cluster] Node without fencing method, is it possible to failover from such a node? Message-ID: Hi all, I have a red hat cluster on IBM blade center with blades being my clusternodes and fence_bladecenter fencing agent. I have couple of resources - IP which activate or deactivate floating IP and script which start my server listening on this floating IP. This is a stateless server with no shared storage requirements or any shared resources which require me to use fancy fencing device. Everything was working fine, when I disable ethcard of heartbeat IP or of floating IP or pull powerplug or reboot/shutdown/halt one node, IP floats on another node and script start my server which happily listen on this IP. Life was good until I am now required to support cluster of nodes which are not hosted in bladecenter but any vanilla nodes. Now everything remains same but bladecenter fencing cant be used, and as per my understanding since I am using red hat cluster, it requires me to use some fence method, my first choice is to use power fencing and that only fencing suits my application needs. But is there any way (I know not the best and recommended but if I can live with it) to get away with fencing and let service failover in absence of fence devices configured for node? Thanks, Parvez -------------- next part -------------- An HTML attachment was scrubbed... URL: From ooolinux at 163.com Thu Mar 17 05:57:40 2011 From: ooolinux at 163.com (yue) Date: Thu, 17 Mar 2011 13:57:40 +0800 (CST) Subject: [Linux-cluster] do you have gfs2 code call-flow Message-ID: <424a4c1d.7604.12ec2640c9e.Coremail.ooolinux@163.com> do you have gfs2 code call-flow i wan to know how gfs2 is implemented ,on code level thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Thu Mar 17 09:30:50 2011 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 17 Mar 2011 09:30:50 +0000 Subject: [Linux-cluster] do you have gfs2 code call-flow In-Reply-To: <424a4c1d.7604.12ec2640c9e.Coremail.ooolinux@163.com> References: <424a4c1d.7604.12ec2640c9e.Coremail.ooolinux@163.com> Message-ID: <1300354250.2596.7.camel@dolmen> Hi, On Thu, 2011-03-17 at 13:57 +0800, yue wrote: > do you have gfs2 code call-flow > > i wan to know how gfs2 is implemented ,on code level > > thanks > > There is some documentation in the kernel source Documentation/filesystems directory. 
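(For example, from the top of a kernel source tree -- exact filenames may vary between kernel versions:

  ls Documentation/filesystems/gfs2*.txt

which in current kernels should list gfs2.txt, gfs2-glocks.txt and gfs2-uevents.txt.)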
Also if you look at the GFS2 Wikipedia page, you'll see some links to some documents which I have written that explain a bit about the internals. There is no one single overall document though I'm afraid, Steve. From ccd.stoy.ml at gmail.com Thu Mar 17 15:17:53 2011 From: ccd.stoy.ml at gmail.com (C.D.) Date: Thu, 17 Mar 2011 17:17:53 +0200 Subject: [Linux-cluster] GFS2 locking in a VM based cluster (KVM) Message-ID: Hello, sorry guys to resurrect an old thread, but I have to say I can confirm that, too. I have a libvirt setup with multipathed FC SAN devices and KVM guests running on top of it. The physical machine is HP 465c G7 (2 x 12 Core Magny-Cours with 96GB RAM). The host OS is Fedora 14. The guests are Scientific Linux 6. With gfs2 10GB shared LUN I can manage ~600k plocks/sec while both machines mounted the LUN. I started: ping_pong some_file 3 on one of the VMs and got those 600k plocks. Then I started ping_pong the_same_file 3 on the second machines and got around 360 plocks/sec (that is 360, not 360 000). No matter what I tried I couldn't optimize it. If I stop the ping_pong on one of the VMs the plocks wen't up to around 500-550 plocks/sec (again 550 not 550k). Stopping the process. Waiting a while and starting again on a single machine still got me around 600k plocks. This I could reproduce both with tcp and sctp and tried bunch of different settings. Then I decided to give ocfs2 a change. Compiling the module on SL6, and I suppose on RHEL6, is not the most straight forward taks, buth half an hour later I got the module compiled from the sources of the EL kernel. Stripped all debug symbols. Copied the ocfs2 kernel module dir to both VM machines. Did depmod -a, I set up the oracle fs on top of the same LUN. Used ping_pong the_same_file_i_used_in_the_first_test 3 on just one machine, while both VMs have mounted the LUN. 1600k plocks/sec (as in ~1 600 000 ). Started ping_pong on the second host. The plocks did not move at all. Still 1600k plocks/sec. Tested with the real life app. It worked very well, unlike gfs2, which was painfully slow with just 2 users. I created the ocfs2 with -T mail, I didn't do any tuning on it, either. I'm not trying to bash gfs2, actually I would definitely prefer it over ocfs2 anytime, however it seems it doesn't work well with VM for some reason. I have used both mtu 1500 and 9000 also, it just didn't make any diffence, no matter what I have tried.I haven't tested the same setup on top of two physical nodes, but I have the feeling it will work just as good as ocfs2 on the VMs. I didn't test with hugepages for the VMs, but I somehow doubt that would make much of a difference. I think this should be investigates by someone at RH possibly because they are the driving force behind both KVM, libvirt, the cluster soft and gfs2. -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Thu Mar 17 16:01:40 2011 From: linux at alteeve.com (Digimer) Date: Thu, 17 Mar 2011 12:01:40 -0400 Subject: [Linux-cluster] Node without fencing method, is it possible to failover from such a node? In-Reply-To: References: Message-ID: <4D823064.8040907@alteeve.com> On 03/17/2011 01:25 AM, Parvez Shaikh wrote: > Hi all, > > I have a red hat cluster on IBM blade center with blades being my > clusternodes and fence_bladecenter fencing agent. I have couple of > resources - IP which activate or deactivate floating IP and script which > start my server listening on this floating IP. 
This is a stateless > server with no shared storage requirements or any shared resources which > require me to use fancy fencing device. > > Everything was working fine, when I disable ethcard of heartbeat IP or > of floating IP or pull powerplug or reboot/shutdown/halt one node, IP > floats on another node and script start my server which happily listen > on this IP. Life was good until I am now required to support cluster of > nodes which are not hosted in bladecenter but any vanilla nodes. > > Now everything remains same but bladecenter fencing cant be used, and as > per my understanding since I am using red hat cluster, it requires me to > use some fence method, my first choice is to use power fencing and that > only fencing suits my application needs. > > But is there any way (I know not the best and recommended but if I can > live with it) to get away with fencing and let service failover in > absence of fence devices configured for node? > > Thanks, > Parvez Manual fencing is not supported: - http://sources.redhat.com/cluster/wiki/FAQ/Fencing#fence_manual2 Do you vanilla servers have IPMI (or equiv. like iLo, DRAC, etc)? If they do, than you can use fence_ipmilan. Failing that, then I'd strongly recommend investing is a switched PDU. All things considered, they are not that expensive. Plus, when a node is fenced/rebooted, there is a chance it will return to the cluster healthy. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From raju.rajsand at gmail.com Thu Mar 17 16:49:20 2011 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Thu, 17 Mar 2011 22:19:20 +0530 Subject: [Linux-cluster] Node without fencing method, is it possible to failover from such a node? In-Reply-To: <4D823064.8040907@alteeve.com> References: <4D823064.8040907@alteeve.com> Message-ID: Greetings, On 3/17/11, Digimer wrote: > On 03/17/2011 01:25 AM, Parvez Shaikh wrote: >> Hi all, >> >> Life was good until I am now required to support cluster of >> nodes which are not hosted in bladecenter but any vanilla nodes. Suggestions from somebody who stupidly yapped "I will support manual fencing" and burnt his finger (Who? Oh! that was me): 1. Don't commit support for manual fencing 2. Don't support manual fencing. If you are in India, APC Fence PDU is available for around 30-35K INR (about a year back or so). If someone is ready to invest say 500K INR for HA hardware such as two servers etc., they might as well add 35k. OTOH, if those nodes are rack mounted servers (Unlike entry level server which does not have management port), the cost of the Powerfence strip will be a different issue when it comes to justifying, etc. within a corporate/Enterprise environment. Too much paperwork, I agree. But It will give a more robust infrastructure which will help us in using various tools like Zabbix, Spacewalk, snmp (I think fence strips have some SNMP - please check) etc. in the future. Life will be good then. With warm regards, Rajagopal From ra at ra.is Thu Mar 17 20:29:34 2011 From: ra at ra.is (Richard Allen) Date: Thu, 17 Mar 2011 16:29:34 -0400 Subject: [Linux-cluster] DLM problem Message-ID: <4D826F2E.9030600@ra.is> I have a simple test cluster up and running (RHEL 6 HA) on three vmware guests. Each vmware guest has 3 vnic's. 
After booting a node, I often get a dead rgmanager: [root at syseng1-vm ~]# service rgmanager status rgmanager dead but pid file exists Cluster is otherwise OK [root at syseng1-vm ~]# clustat Cluster Status for RHEL6Test @ Thu Mar 17 16:10:38 2011 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ syseng1-vm 1 Online, Local syseng2-vm 2 Online syseng3-vm 3 Online There is a service running on node2 but clustat has no info on that. [root at syseng1-vm ~]# cman_tool status Version: 6.2.0 Config Version: 9 Cluster Name: RHEL6Test Cluster Id: 36258 Cluster Member: Yes Cluster Generation: 88 Membership state: Cluster-Member Nodes: 3 Expected votes: 3 Total votes: 3 Node votes: 1 Quorum: 2 Active subsystems: 1 Flags: Ports Bound: 0 Node name: syseng1-[CENSORED] Node ID: 1 Multicast addresses: 239.192.141.48 Node addresses: 10.10.16.11 The syslog has some info: Mar 17 15:47:55 syseng1-vm rgmanager[2463]: Quorum formed Mar 17 15:47:55 syseng1-vm kernel: dlm: no local IP address has been set Mar 17 15:47:55 syseng1-vm kernel: dlm: cannot start dlm lowcomms -107 The fix is always the same: [root at syseng1-vm ~]# service cman restart Stopping cluster: Leaving fence domain... [ OK ] Stopping gfs_controld... [ OK ] Stopping dlm_controld... [ OK ] Stopping fenced... [ OK ] Stopping cman... [ OK ] Waiting for corosync to shutdown: [ OK ] Unloading kernel modules... [ OK ] Unmounting configfs... [ OK ] Starting cluster: Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs... [ OK ] Starting cman... [ OK ] Waiting for quorum... [ OK ] Starting fenced... [ OK ] Starting dlm_controld... [ OK ] Starting gfs_controld... [ OK ] Unfencing self... [ OK ] Joining fence domain... [ OK ] [root at syseng1-vm ~]# service rgmanager restart Stopping Cluster Service Manager: [ OK ] Starting Cluster Service Manager: [ OK ] [root at syseng1-vm ~]# clustat Cluster Status for RHEL6Test @ Thu Mar 17 16:22:01 2011 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ syseng1-vm 1 Online, Local, rgmanager syseng2-vm 2 Online, rgmanager syseng3-vm 3 Online Service Name Owner (Last) State ------- ---- ----- ------ ----- service:TestDB syseng2-vm started Sometimes restarting rgmanager hangs and the node needs to be rebooted. my cluster.conf: