From darkblue2000 at gmail.com Wed Aug 1 01:29:54 2007 From: darkblue2000 at gmail.com (darkblue) Date: Wed, 1 Aug 2007 09:29:54 +0800 Subject: [Linux-cluster] dependency problem when install cman-kernel-2.6.9-50.2.src.rpm Message-ID: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com> When I installing cman-kernel, there is a dependency problem. [root at rh4-clus1 rhcs4]# rpm -iv cman-kernel-2.6.9-59.2.src.rpm [root at rh4-clus1 SPECS]# rpmbuild -ba --target=i686 cman-kernel.spec Building target platforms: i686 Building for target i686 error: Failed build dependencies: kernel-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 kernel-smp-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 kernel-hugemem-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 kernel-xenU-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 [root at rh4-clus1 SPECS]# uname -a Linux rh4-clus1.darkblue.com 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT 2007 i686 i686 i386 GNU/Linux I am curious that I am using the 2.6.9-55 kernel, why there is still a dependency problem? and How to fix it? -- He is nothing From orkcu at yahoo.com Wed Aug 1 02:32:06 2007 From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=) Date: Tue, 31 Jul 2007 19:32:06 -0700 (PDT) Subject: [Linux-cluster] LVS redundancy server and network type: DIRECT In-Reply-To: <46AF9B6F.7080300@lexum.umontreal.ca> Message-ID: <702274.88733.qm@web50603.mail.re2.yahoo.com> --- FM wrote: > Tx for the reply, > I re read the doc and my question remains :-) > ex : > from the RH documentation : > Create the ARP table entries for each virtual IP > address on each real > server (the real_ip is the IP the director uses to > communicate with the > real server; often this is the IP bound to eth0): are you clear about the fact that real_ip is the real IP of the real server? (the one that the LVS use to connect to the real server :-) ) > arptables -A IN -d -j DROP > arptables -A OUT -d -j mangle > --mangle-ip-s > you should do this in each real server, so real_ip is diferent for each real server none of those IPs are IP bounded to an specific LVS (master or slave), well vip is but is like a floating ip :-) > > If I create a redundancy server, and if the master > server goes down, the > backup server will create all the but > not the so I think you beleave that this "real_ip" is the IP owned by the LVS to comunicate with the real server, but it is not. The real_ip is the IP owned by the real server, which is used by the LVS to connect to the real server :-) maybe a graphic is very needed :-) just keep in mind what is purpose of the arptable commands: avoid at all means that the real server announce that it has the VIP as one of it address cu roger __________________________________________ RedHat Certified ( RHCE ) Cisco Certified ( CCNA & CCDA ) ____________________________________________________________________________________ Be a better Heartthrob. Get better relationship answers from someone who knows. Yahoo! Answers - Check it out. http://answers.yahoo.com/dir/?link=list&sid=396545433 From darkblue2000 at gmail.com Wed Aug 1 06:41:58 2007 From: darkblue2000 at gmail.com (darkblue) Date: Wed, 1 Aug 2007 14:41:58 +0800 Subject: [Linux-cluster] Which packages are the right combination for AS4U5? 
In-Reply-To: <2c8195ff0707301714v160b1590l1ab325bbd0a12cc2@mail.gmail.com> References: <2c8195ff0707292037s7441c7cv91946b6fc3e98fc9@mail.gmail.com> <46ADB6DA.7000603@fu-berlin.de> <2c8195ff0707300317m5ca565b8n6ea66465d72232f2@mail.gmail.com> <46ADCCDD.1010209@fu-berlin.de> <2c8195ff0707301714v160b1590l1ab325bbd0a12cc2@mail.gmail.com> Message-ID: <2c8195ff0707312341reac4308gc4691c976397d644@mail.gmail.com> hello, I had been tried to install rhcs, but failed. hmm, the error look like this: [root at rh4-clus1 SPECS]# rpmbuild -ba --target=i686 cman.spec Building target platforms: i686 Building for target i686 error: Failed build dependencies: cman-kernheaders >= 2.6.9 is needed by cman-1.0.17-0.i686 ccs-devel is needed by cman-1.0.17-0.i686 So, Is that mean I have to download and install ccs-devel first. but I can't find it on redhat's ftp. anybody know where to find it? 2007/7/31, darkblue : > thanks, thank you very much, you save my life. > I am gonna to install the src.rpm combination tonight. > > 2007/7/30, Sebastian Walter : > > If you are using RHEL in an production environment, I can only recommend > > you to use the original rhel packages, as the centos' ones are modified. > > Anyway, the versions of the rpm's should be the same. So this is the > > list of packages what is installed on my centos 4.5 system: > > > > (rhcs: rgmanager system-config-cluster ccsd magma magma-plugins cman > > cman-kernel-smp dlm dlm-kernel-smp fence gulm iddev) > > Installing: > > cman x86_64 1.0.17-0 csgfs 67 k > > cman-kernel-smp x86_64 2.6.9-50.2 csgfs 133 k > > dlm x86_64 1.0.3-1 csgfs 13 k > > dlm-kernel-smp x86_64 2.6.9-46.16 csgfs 132 k > > fence x86_64 1.32.45-1.0.1 csgfs 282 k > > gulm x86_64 1.0.10-0 csgfs 151 k > > iddev x86_64 2.0.0-4 csgfs 2.3 k > > magma x86_64 1.0.7-1 csgfs 37 k > > magma-plugins x86_64 1.0.12-0 csgfs 19 k > > rgmanager x86_64 1.9.68-1 csgfs 209 k > > system-config-cluster noarch 1.0.45-1.0 csgfs 122 k > > Installing for dependencies: > > ccs x86_64 1.0.10-0 csgfs 80 k > > perl-Net-Telnet noarch 3.03-3 csgfs 51 k > > seamonkey-nss x86_64 1.0.9-2.el4.centos update > > 872 k > > > > (gfs: GFS GFS-kernel-smp gnbd gnbd-kernel-smp lvm2-cluster > > GFS-kernheaders gnbd-kernheaders) > > GFS x86_64 6.1.14-0 csgfs 152 k > > GFS-kernel-smp x86_64 2.6.9-72.2 csgfs 214 k > > GFS-kernheaders x86_64 2.6.9-72.2 csgfs 20 k > > gnbd x86_64 1.0.9-1 csgfs 142 k > > gnbd-kernel-smp x86_64 2.6.9-10.20 csgfs 13 k > > gnbd-kernheaders x86_64 2.6.9-10.20 csgfs 4.1 k > > lvm2-cluster x86_64 2.02.21-7.el4 csgfs 199 k > > > > In this configuration, which comes from the yum rhcs repository, I had > > to downgrade to kernel kernel-smp-2.6.9-55.EL. Maybe you also want to > > install luci and ricci: > > > > yum install luci > > Installing: > > luci x86_64 0.9.1-8.el4.centos.1 csgfs > > > > yum install ricci > > Installing: > > ricci x86_64 0.9.1-8.el4.centos.1 csgfs 1.1 M > > Installing for dependencies: > > modcluster x86_64 0.9.1-8.el4.centos > > csgfs 317 k > > oddjob x86_64 0.26-1.1 base 57 k > > oddjob-libs x86_64 0.26-1.1 base 43 k > > > > That's it. Regards, > > Sebastian > > > > darkblue wrote: > > > thanks very much, I have been waiting this letter for the whole day. > > > May I using yum to install centos's packages on redhat as4u5, because > > > the OS of the production server is redhat as4u5. > > > > > > 2007/7/30, Sebastian Walter wrote: > > > > > >> Hi, > > >> > > >> maybe you want to orientate on the centos distribution. 
Easiest for you > > >> would be to somehow get yum working and then import the whole > > >> repository. If it's a new installation, I suggest you to switch to such > > >> a rhel-compatible distribution as centos or scientific linux anyway. > > >> > > >> http://mirror.centos.org/centos/4/csgfs/ > > >> > > >> Regards, > > >> Sebastian > > >> > > >> > > >> darkblue wrote: > > >> > > >>> Hello, > > >>> I am a newbie of cluster.I encounter a problem when installing cluster > > >>> suite on AS4U5, I download the following packages from > > >>> ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4AS/en/RHCS/SRPMS/ > > >>> > > >>> ccs-1.0.10-0.src.rpm > > >>> ccs-1.0.2-0.src.rpm > > >>> ccs-1.0.3-0.src.rpm > > >>> ccs-1.0.7-0.src.rpm > > >>> clustermon-0.9.1-8.src.rpm > > >>> cman-1.0.11-0.src.rpm > > >>> cman-1.0.17-0.src.rpm > > >>> cman-1.0.2-0.src.rpm > > >>> cman-1.0.4-0.src.rpm > > >>> cman-kernel-2.6.9-39.5.src.rpm > > >>> cman-kernel-2.6.9-39.8.src.rpm > > >>> cman-kernel-2.6.9-41.0.2.src.rpm > > >>> cman-kernel-2.6.9-41.0.src.rpm > > >>> cman-kernel-2.6.9-43.8.3.src.rpm > > >>> cman-kernel-2.6.9-43.8.5.src.rpm > > >>> cman-kernel-2.6.9-43.8.src.rpm > > >>> cman-kernel-2.6.9-45.14.src.rpm > > >>> cman-kernel-2.6.9-45.15.src.rpm > > >>> cman-kernel-2.6.9-45.2.src.rpm > > >>> cman-kernel-2.6.9-45.3.src.rpm > > >>> cman-kernel-2.6.9-45.4.src.rpm > > >>> cman-kernel-2.6.9-45.5.src.rpm > > >>> cman-kernel-2.6.9-45.8.src.rpm > > >>> cman-kernel-2.6.9-50.2.0.1.src.rpm > > >>> cman-kernel-2.6.9-50.2.src.rpm > > >>> conga-0.9.1-8.src.rpm > > >>> dlm-1.0.0-5.src.rpm > > >>> dlm-1.0.1-1.src.rpm > > >>> dlm-1.0.3-1.src.rpm > > >>> dlm-kernel-2.6.9-37.7.src.rpm > > >>> dlm-kernel-2.6.9-37.9.src.rpm > > >>> dlm-kernel-2.6.9-39.1.2.src.rpm > > >>> dlm-kernel-2.6.9-39.1.src.rpm > > >>> dlm-kernel-2.6.9-41.7.1.src.rpm > > >>> dlm-kernel-2.6.9-41.7.2.src.rpm > > >>> dlm-kernel-2.6.9-41.7.src.rpm > > >>> dlm-kernel-2.6.9-42.10.src.rpm > > >>> dlm-kernel-2.6.9-42.11.src.rpm > > >>> dlm-kernel-2.6.9-42.12.src.rpm > > >>> dlm-kernel-2.6.9-42.13.src.rpm > > >>> dlm-kernel-2.6.9-44.2.src.rpm > > >>> dlm-kernel-2.6.9-44.3.src.rpm > > >>> dlm-kernel-2.6.9-44.8.src.rpm > > >>> dlm-kernel-2.6.9-44.9.src.rpm > > >>> dlm-kernel-2.6.9-46.16.0.1.src.rpm > > >>> dlm-kernel-2.6.9-46.16.src.rpm > > >>> fence-1.32.10-0.src.rpm > > >>> fence-1.32.18-0.src.rpm > > >>> fence-1.32.25-1.src.rpm > > >>> fence-1.32.45-1.0.1.src.rpm > > >>> fence-1.32.45-1.src.rpm > > >>> fence-1.32.6-0.src.rpm > > >>> gulm-1.0.10-0.src.rpm > > >>> gulm-1.0.4-0.src.rpm > > >>> gulm-1.0.6-0.src.rpm > > >>> gulm-1.0.7-0.src.rpm > > >>> gulm-1.0.8-0.src.rpm > > >>> iddev-2.0.0-3.src.rpm > > >>> iddev-2.0.0-4.src.rpm > > >>> magma-1.0.1-4.src.rpm > > >>> magma-1.0.3-2.src.rpm > > >>> magma-1.0.4-0.src.rpm > > >>> magma-1.0.6-0.src.rpm > > >>> magma-1.0.7-1.src.rpm > > >>> magma-plugins-1.0.12-0.src.rpm > > >>> magma-plugins-1.0.2-0.src.rpm > > >>> magma-plugins-1.0.5-0.src.rpm > > >>> magma-plugins-1.0.6-0.src.rpm > > >>> magma-plugins-1.0.9-0.src.rpm > > >>> piranha-0.8.1-1.src.rpm > > >>> piranha-0.8.2-1.src.rpm > > >>> rgmanager-1.9.38-0.src.rpm > > >>> rgmanager-1.9.39-0.src.rpm > > >>> rgmanager-1.9.43-0.src.rpm > > >>> rgmanager-1.9.46-0.src.rpm > > >>> rgmanager-1.9.53-0.src.rpm > > >>> rgmanager-1.9.54-1.src.rpm > > >>> rgmanager-1.9.68-1.src.rpm > > >>> system-config-cluster-1.0.16-1.0.src.rpm > > >>> system-config-cluster-1.0.25-1.0.src.rpm > > >>> system-config-cluster-1.0.27-1.0.src.rpm > > >>> system-config-cluster-1.0.45-1.0.src.rpm > 
> >>> > > >>> but I want to know which packages are the right combination for AS4U5? > > >>> > > >>> > > >>> > > >> > > > > > > > > > > > > > > > > -- > He is nothing > -- He is nothing From benjamin.jakubowski at gmail.com Wed Aug 1 08:39:52 2007 From: benjamin.jakubowski at gmail.com (Benjamin Jakubowski) Date: Wed, 1 Aug 2007 10:39:52 +0200 Subject: [Linux-cluster] RHCS on RedHat As 4 u4 In-Reply-To: <46AF6B24.9000208@redhat.com> References: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com> <46AF65A3.1010806@redhat.com> <6c80b6370707310956j10208835lff5af4d0806bf17@mail.gmail.com> <46AF6B24.9000208@redhat.com> Message-ID: <6c80b6370708010139h6438d1dsd1cc5bc1dfba8dff@mail.gmail.com> OK thanks, it seems to become better but this is my simple cluster.conf My first node is : server1 when i start ccsd ok : Aug 1 10:35:59 sv157020 ccsd[20726]: Starting ccsd 1.0.10: Aug 1 10:35:59 sv157020 ccsd[20726]: Built: Mar 19 2007 17:44:26 Aug 1 10:35:59 sv157020 ccsd[20726]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. Aug 1 10:35:59 sv157020 ccsd: succeeded when i try to start cman : Aug 1 10:36:28 sv157020 ccsd[20726]: Unable to connect to cluster infrastructure after 30 seconds. Aug 1 10:36:33 sv157020 kernel: CMAN 2.6.9-45.2 (built Jul 13 2006 11:42:36) installed Aug 1 10:36:33 sv157020 kernel: NET: Registered protocol family 30 Aug 1 10:36:33 sv157020 ccsd[20726]: cluster.conf (cluster name = siclad_re7, version = 7) found. Aug 1 10:36:34 sv157020 kernel: CMAN: Waiting to join or form a Linux-cluster Aug 1 10:36:52 sv157020 sshd(pam_unix)[20747]: session opened for user root by root(uid=0) Aug 1 10:36:59 sv157020 ccsd[20726]: Unable to connect to cluster infrastructure after 60 seconds. Aug 1 10:37:06 sv157020 kernel: CMAN: forming a new cluster Aug 1 10:37:06 sv157020 kernel: CMAN: quorum regained, resuming activity Aug 1 10:37:06 sv157020 cman: startup succeeded when i try to start rgmanager : Aug 1 10:37:29 sv157020 ccsd[20726]: Unable to connect to cluster infrastructure after 90 seconds. Aug 1 10:37:36 sv157020 ccsd[20726]: Cluster is not quorate. Refusing connection. Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing connect: Connection refused Aug 1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified (-111). Aug 1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting something evil. Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing get: Invalid request descriptor Aug 1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified (-111). Aug 1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting something evil. Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing get: Invalid request descriptor Aug 1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified (-21). Aug 1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting something evil. Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing disconnect: Invalid request descriptor Aug 1 10:37:36 sv157020 clurgmgrd[20817]: Resource Group Manager Starting Aug 1 10:37:36 sv157020 clurgmgrd[20817]: Loading Service Data Aug 1 10:37:37 sv157020 ccsd[20726]: Cluster is not quorate. Refusing connection. Aug 1 10:37:37 sv157020 ccsd[20726]: Error while processing connect: Connection refused Aug 1 10:37:37 sv157020 clurgmgrd[20817]: #5: Couldn't connect to ccsd! Aug 1 10:37:37 sv157020 clurgmgrd[20817]: #8: Couldn't initialize services Aug 1 10:37:37 sv157020 rgmanager: D?(c)marrage de clurgmgrd failed Do u have any idea ? 
is so simple in RHEL 5, but i need RHEL 4U4 Thanks a lot 2007/7/31, Bryn M. Reeves : > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Benjamin Jakubowski wrote: > > The probleme is : > > i need to preserve the kernel version 2.6.9-42.ELsmp, to preserve a SAN > > compatibility and in RHN, there isn't have a cluster kernel module ? > > do u have any idea ? > > > > Thanks a lot > > Benjamin > > OK, if you need to match with a specific kernel release you'll need to > use the Cluster Suite packages that were released at the same time. > > For U4 (2.6.9-42.EL), I think those would be: > > cman-kernel-2.6.9-45.2 > cman-kernel-smp-2.6.9-45.2 > cman-kernel-hugemem-2.6.9-45.2 > dlm-kernel-hugemem-2.6.9-42.10 > dlm-kernel-2.6.9-42.10 > dlm-kernel-smp-2.6.9-42.10 > > If you also need the packages for GFS, those are: > > GFS-kernel-2.6.9-58.0 > GFS-kernel-smp-2.6.9-58.0 > GFS-kernel-hugemem-2.6.9-58.0 > gnbd-kernel-hugemem-2.6.9-9.41 > gnbd-kernel-2.6.9-9.41 > gnbd-kernel-smp-2.6.9-9.41 > > Those packages are still available on RHN - just go to the web > interface, hit search by package & they should appear there. > > Be aware though that there were some important kernel related bugfixes > applied after U4 - particularly for systems using multipath SAN storage. > > Kind regards, > Bryn. > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.7 (GNU/Linux) > Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org > > iD8DBQFGr2sk6YSQoMYUY94RAp+qAJ9yKqpXxmFXqQzRc//0RVFzavPdzgCgk+YE > YJFtx2iQm4zdYnKXQ/QmRHY= > =s4BM > -----END PGP SIGNATURE----- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- @+ Benj -------------- next part -------------- An HTML attachment was scrubbed... URL: From benjamin.jakubowski at gmail.com Wed Aug 1 08:43:21 2007 From: benjamin.jakubowski at gmail.com (Benjamin Jakubowski) Date: Wed, 1 Aug 2007 10:43:21 +0200 Subject: [Linux-cluster] Which packages are the right combination for AS4U5? In-Reply-To: <2c8195ff0707312341reac4308gc4691c976397d644@mail.gmail.com> References: <2c8195ff0707292037s7441c7cv91946b6fc3e98fc9@mail.gmail.com> <46ADB6DA.7000603@fu-berlin.de> <2c8195ff0707300317m5ca565b8n6ea66465d72232f2@mail.gmail.com> <46ADCCDD.1010209@fu-berlin.de> <2c8195ff0707301714v160b1590l1ab325bbd0a12cc2@mail.gmail.com> <2c8195ff0707312341reac4308gc4691c976397d644@mail.gmail.com> Message-ID: <6c80b6370708010143g4062dfedya86c96a65a94f886@mail.gmail.com> Hi, do you any access on RHN, cause there is an iso to install RedHatCluster on RedHet 4U5 : - Software Download - Select RHEL 4 version -and RedHat CLuster Suite -- @+ Benj -------------- next part -------------- An HTML attachment was scrubbed... URL: From sebastian.walter at fu-berlin.de Wed Aug 1 09:32:05 2007 From: sebastian.walter at fu-berlin.de (Sebastian Walter) Date: Wed, 01 Aug 2007 11:32:05 +0200 Subject: [Linux-cluster] dependency problem when install cman-kernel-2.6.9-50.2.src.rpm In-Reply-To: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com> References: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com> Message-ID: <46B05315.8080709@fu-berlin.de> You should be able to install these packages via RHN and up2date. darkblue wrote: > When I installing cman-kernel, there is a dependency problem. 
> [root at rh4-clus1 rhcs4]# rpm -iv cman-kernel-2.6.9-59.2.src.rpm > [root at rh4-clus1 SPECS]# rpmbuild -ba --target=i686 cman-kernel.spec > Building target platforms: i686 > Building for target i686 > error: Failed build dependencies: > kernel-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 > kernel-smp-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 > kernel-hugemem-devel = 2.6.9-55.EL is needed by > cman-kernel-2.6.9-50.2.i686 > kernel-xenU-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 > [root at rh4-clus1 SPECS]# uname -a > Linux rh4-clus1.darkblue.com 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT > 2007 i686 i686 i386 GNU/Linux > > I am curious that I am using the 2.6.9-55 kernel, why there is still a > dependency problem? and How to fix it? > From maciej.bogucki at artegence.com Wed Aug 1 08:59:16 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Wed, 01 Aug 2007 10:59:16 +0200 Subject: [Linux-cluster] dependency problem when install cman-kernel-2.6.9-50.2.src.rpm In-Reply-To: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com> References: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com> Message-ID: <46B04B64.7010506@artegence.com> darkblue napisa?(a): > When I installing cman-kernel, there is a dependency problem. > [root at rh4-clus1 rhcs4]# rpm -iv cman-kernel-2.6.9-59.2.src.rpm > [root at rh4-clus1 SPECS]# rpmbuild -ba --target=i686 cman-kernel.spec > Building target platforms: i686 > Building for target i686 > error: Failed build dependencies: > kernel-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 > kernel-smp-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 > kernel-hugemem-devel = 2.6.9-55.EL is needed by > cman-kernel-2.6.9-50.2.i686 > kernel-xenU-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686 > [root at rh4-clus1 SPECS]# uname -a > Linux rh4-clus1.darkblue.com 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT > 2007 i686 i686 i386 GNU/Linux > > I am curious that I am using the 2.6.9-55 kernel, why there is still a > dependency problem? and How to fix it? up2date -u kernel-devel kernel-smp-devel kernel-hugemem-devel kernel-xenU-devel From srigler at MarathonOil.com Wed Aug 1 11:57:45 2007 From: srigler at MarathonOil.com (Steve Rigler) Date: Wed, 01 Aug 2007 06:57:45 -0500 Subject: [Linux-cluster] LVS redundancy server and network type: DIRECT In-Reply-To: <46AF9B6F.7080300@lexum.umontreal.ca> References: <46AF5D36.9090304@lexum.umontreal.ca> <46AF7E49.6050201@redhat.com> <46AF9B6F.7080300@lexum.umontreal.ca> Message-ID: <1185969465.22317.8.camel@houuc8> On Tue, 2007-07-31 at 16:28 -0400, FM wrote: > Tx for the reply, > I re read the doc and my question remains :-) > ex : > from the RH documentation : > Create the ARP table entries for each virtual IP address on each real > server (the real_ip is the IP the director uses to communicate with the > real server; often this is the IP bound to eth0): > arptables -A IN -d -j DROP > arptables -A OUT -d -j mangle --mangle-ip-s > > > If I create a redundancy server, and if the master server goes down, the > backup server will create all the but not the so > all the real servers still have the arptables setting to modify the > source of the IP packet to look likes the master LVS server that is down > now. Another way you can do it is by adding iptables rules to you real servers like: -A PREROUTING -d -p tcp -m tcp --dport -j REDIRECT I didn't have much luck using arptables, but this worked well for me. 
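A minimal sketch of this iptables approach for an LVS direct-routing real server. The VIP and port below (192.168.0.100, port 80) are hypothetical stand-ins for the placeholders stripped from the message above, and the rule belongs in the nat table:

    # On each real server: deliver packets addressed to the VIP to the
    # local stack without configuring (or ARP-announcing) the VIP itself.
    iptables -t nat -A PREROUTING -d 192.168.0.100 -p tcp -m tcp --dport 80 -j REDIRECT

    # Optionally save the rule so it survives a reboot (RHEL initscript style):
    service iptables save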
-Steve From grimme at atix.de Wed Aug 1 12:37:04 2007 From: grimme at atix.de (Marc Grimme) Date: Wed, 1 Aug 2007 14:37:04 +0200 Subject: [Linux-cluster] Activating a clustered volumeggroup without running cluster Message-ID: <200708011437.04734.grimme@atix.de> Hello, is there a way to activate a clustered volumegroup without a running cluster. Base RHEL4U5?? vgchange -ay complains about skipping clustered vgs. Regards Marc. -- Gruss / Regards, Marc Grimme Phone: +49-89 452 3538-14 http://www.atix.de/ http://www.open-sharedroot.org/ ** ATIX Informationstechnologie und Consulting AG Einsteinstr. 10 85716 Unterschleissheim Deutschland/Germany Phone: +49-89 452 3538-0 Fax: +49-89 990 1766-0 Registergericht: Amtsgericht Muenchen Registernummer: HRB 168930 USt.-Id.: DE209485962 Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) Vorsitzender des Aufsichtsrats: Dr. Martin Buss From dgavin at davegavin.com Wed Aug 1 13:09:53 2007 From: dgavin at davegavin.com (Dave Gavin) Date: Wed, 1 Aug 2007 09:09:53 -0400 Subject: [Linux-cluster] Adding a new fencing script ? Message-ID: <20070801090953.28bbce51@setanta.asarlai> I have a couple of Server Tech devices controlling the power for the servers in my cluster and they don't seem to have a fence script. I modified a copy of the brocade script to work with the Server Tech device and the script is in /sbin with the permissions/ownership matching the other fence_* scripts. Can anyone point me at a how-to or a doc somewhere on adding this script to the drop-down in system-config-cluster ? Thanks Dave -- Being shot out of a cannon will always be better than being squeezed out of a tube. That is why God made fast motorcycles, Bubba.... "Song of the Sausage Creature" Hunter S. Thompson (RIP 02/20/2005) From jparsons at redhat.com Wed Aug 1 13:57:21 2007 From: jparsons at redhat.com (jim parsons) Date: Wed, 01 Aug 2007 09:57:21 -0400 Subject: [Linux-cluster] Adding a new fencing script ? In-Reply-To: <20070801090953.28bbce51@setanta.asarlai> References: <20070801090953.28bbce51@setanta.asarlai> Message-ID: <1185976641.3318.17.camel@localhost.localdomain> On Wed, 2007-08-01 at 09:09 -0400, Dave Gavin wrote: > I have a couple of Server Tech devices controlling the power for the servers in > my cluster and they don't seem to have a fence script. I modified a copy of the > brocade script to work with the Server Tech device and the script is in /sbin > with the permissions/ownership matching the other fence_* scripts. Can anyone > point me at a how-to or a doc somewhere on adding this script to the drop-down > in system-config-cluster ? Hi Dave, Adding a new fence form to s-c-c is kind of a daunting task...I'll explain below; but in the meanwhile, have you considered donating your fence agent under a gpl variant? It would be nice for cluster users to have ServerTech support... Anyhow, yes, you should be able to drop your agent into /sbin with similar permisions to other agents. You don't need s-c-c, of course, to use your agent...you can edit the cluster.conf file directly and add it there. Under the fencedevices section, include an entry for your agent and set the agent attribute to whatever you named it (agent='fence_dave'). Put shared attributes in the fencedevice section and node specific attrs under the clusternode->fence->method->device tag. 
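As a rough illustration of the layout described above, the cluster.conf fragments for a custom agent might look like this (the device name, address, credentials and outlet number are made up, and the exact attributes depend on which options the agent actually reads):

    <clusternode name="node1" votes="1">
        <fence>
            <method name="1">
                <device name="stech-pdu1" port="3"/>
            </method>
        </fence>
    </clusternode>
    ...
    <fencedevices>
        <fencedevice name="stech-pdu1" agent="fence_servertech"
                     ipaddr="10.0.0.50" login="admin" passwd="secret"/>
    </fencedevices>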
Then propagate the new file...first run ccs_tool update and then cman_tool version -r...man these comands for details - but dont forget to incerment the config_version attribute in the conf file before propagating a new one. Here is a rough outline of how to add it to s-c-c: First add the form fields to one of the windows in fence.glade. Just follow the conventions for naming along the lines of the other agent forms that you find there. In each of the three windows that contain fence forms, there is a device column and an instance column...both will need to be extended for your new agent. Next, you will need to edit the python file FenceHandler.py...This should be the only other file you need to touch. I would pick an existing agent and follow it through the file, noticing all of the places it is set...for example, for each fence device and fence instance, there is a populate method, a validate method, a clear form method, and a process_widgets method entry. Then there are a few hash maps to add to. There are comments in the file to assist in adding a new fence type. In summary, add forms to glade file, then edit FenceHandler.py -J From jparsons at redhat.com Wed Aug 1 14:09:24 2007 From: jparsons at redhat.com (jim parsons) Date: Wed, 01 Aug 2007 10:09:24 -0400 Subject: [Linux-cluster] RHCS on RedHat As 4 u4 In-Reply-To: <6c80b6370708010139h6438d1dsd1cc5bc1dfba8dff@mail.gmail.com> References: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com> <46AF65A3.1010806@redhat.com> <6c80b6370707310956j10208835lff5af4d0806bf17@mail.gmail.com> <46AF6B24.9000208@redhat.com> <6c80b6370708010139h6438d1dsd1cc5bc1dfba8dff@mail.gmail.com> Message-ID: <1185977365.3318.21.camel@localhost.localdomain> I am not sure here, but I believe you need a tag in your conf file under ...there are two locking types available in rhel4...perhaps you need to specify the one you wish to use? At any rate, it is a simple thing to add and try. -J On Wed, 2007-08-01 at 10:39 +0200, Benjamin Jakubowski wrote: > OK thanks, it seems to become better but this is my simple > cluster.conf > > My first node is : server1 > > > > post_join_delay="3"/> > > > > > > > > > > > when i start ccsd ok : > Aug 1 10:35:59 sv157020 ccsd[20726]: Starting ccsd 1.0.10: > Aug 1 10:35:59 sv157020 ccsd[20726]: Built: Mar 19 2007 17:44:26 > Aug 1 10:35:59 sv157020 ccsd[20726]: Copyright (C) Red Hat, Inc. > 2004 All rights reserved. > Aug 1 10:35:59 sv157020 ccsd: succeeded > > when i try to start cman : > Aug 1 10:36:28 sv157020 ccsd[20726]: Unable to connect to cluster > infrastructure after 30 seconds. > Aug 1 10:36:33 sv157020 kernel: CMAN 2.6.9-45.2 (built Jul 13 2006 > 11:42:36) installed > Aug 1 10:36:33 sv157020 kernel: NET: Registered protocol family 30 > Aug 1 10:36:33 sv157020 ccsd[20726]: cluster.conf (cluster name = > siclad_re7, version = 7) found. > Aug 1 10:36:34 sv157020 kernel: CMAN: Waiting to join or form a > Linux-cluster > Aug 1 10:36:52 sv157020 sshd(pam_unix)[20747]: session opened for > user root by root(uid=0) > Aug 1 10:36:59 sv157020 ccsd[20726]: Unable to connect to cluster > infrastructure after 60 seconds. > Aug 1 10:37:06 sv157020 kernel: CMAN: forming a new cluster > Aug 1 10:37:06 sv157020 kernel: CMAN: quorum regained, resuming > activity > Aug 1 10:37:06 sv157020 cman: startup succeeded > > when i try to start rgmanager : > Aug 1 10:37:29 sv157020 ccsd[20726]: Unable to connect to cluster > infrastructure after 90 seconds. > Aug 1 10:37:36 sv157020 ccsd[20726]: Cluster is not quorate. > Refusing connection. 
> Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing connect: > Connection refused > Aug 1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified > (-111). > Aug 1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting > something evil. > Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing get: > Invalid request descriptor > Aug 1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified > (-111). > Aug 1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting > something evil. > Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing get: > Invalid request descriptor > Aug 1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified > (-21). > Aug 1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting > something evil. > Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing > disconnect: Invalid request descriptor > Aug 1 10:37:36 sv157020 clurgmgrd[20817]: Resource Group > Manager Starting > Aug 1 10:37:36 sv157020 clurgmgrd[20817]: Loading Service Data > Aug 1 10:37:37 sv157020 ccsd[20726]: Cluster is not quorate. > Refusing connection. > Aug 1 10:37:37 sv157020 ccsd[20726]: Error while processing connect: > Connection refused > Aug 1 10:37:37 sv157020 clurgmgrd[20817]: #5: Couldn't connect > to ccsd! > Aug 1 10:37:37 sv157020 clurgmgrd[20817]: #8: Couldn't > initialize services > Aug 1 10:37:37 sv157020 rgmanager: D??marrage de clurgmgrd failed > > Do u have any idea ? > > is so simple in RHEL 5, but i need RHEL 4U4 > > Thanks a lot > > > > 2007/7/31, Bryn M. Reeves : > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Benjamin Jakubowski wrote: > > The probleme is : > > i need to preserve the kernel version 2.6.9-42.ELsmp, to > preserve a SAN > > compatibility and in RHN, there isn't have a cluster kernel > module ? > > do u have any idea ? > > > > Thanks a lot > > Benjamin > > OK, if you need to match with a specific kernel release you'll > need to > use the Cluster Suite packages that were released at the same > time. > > For U4 (2.6.9-42.EL), I think those would be: > > cman-kernel-2.6.9-45.2 > cman-kernel-smp-2.6.9-45.2 > cman-kernel-hugemem-2.6.9-45.2 > dlm-kernel-hugemem-2.6.9-42.10 > dlm-kernel-2.6.9-42.10 > dlm-kernel-smp-2.6.9-42.10 > > If you also need the packages for GFS, those are: > > GFS-kernel-2.6.9-58.0 > GFS-kernel-smp-2.6.9-58.0 > GFS-kernel-hugemem-2.6.9-58.0 > gnbd-kernel-hugemem-2.6.9-9.41 > gnbd-kernel-2.6.9-9.41 > gnbd-kernel-smp-2.6.9-9.41 > > Those packages are still available on RHN - just go to the > web > interface, hit search by package & they should appear there. > > Be aware though that there were some important kernel related > bugfixes > applied after U4 - particularly for systems using multipath > SAN storage. > > Kind regards, > Bryn. > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.7 (GNU/Linux) > Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org > > iD8DBQFGr2sk6YSQoMYUY94RAp+qAJ9yKqpXxmFXqQzRc//0RVFzavPdzgCgk > +YE > YJFtx2iQm4zdYnKXQ/QmRHY= > =s4BM > -----END PGP SIGNATURE----- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > @+ > Benj > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From dgavin at davegavin.com Wed Aug 1 16:12:15 2007 From: dgavin at davegavin.com (Dave Gavin) Date: Wed, 1 Aug 2007 12:12:15 -0400 (EDT) Subject: [Linux-cluster] Adding a new fencing script ? 
In-Reply-To: <1185976641.3318.17.camel@localhost.localdomain> References: <20070801090953.28bbce51@setanta.asarlai> <1185976641.3318.17.camel@localhost.localdomain> Message-ID: <52485.157.130.62.182.1185984735.squirrel@dgavin.no-ip.com> On Wed, August 1, 2007 9:57 am, jim parsons wrote: > On Wed, 2007-08-01 at 09:09 -0400, Dave Gavin wrote: >> I have a couple of Server Tech devices controlling the power for the >> servers in >> my cluster and they don't seem to have a fence script. I modified a copy >> of the >> brocade script to work with the Server Tech device and the script is in >> /sbin >> with the permissions/ownership matching the other fence_* scripts. Can >> anyone >> point me at a how-to or a doc somewhere on adding this script to the >> drop-down >> in system-config-cluster ? > Hi Dave, > > Adding a new fence form to s-c-c is kind of a daunting task...I'll > explain below; but in the meanwhile, have you considered donating your > fence agent under a gpl variant? It would be nice for cluster users to > have ServerTech support... > > Anyhow, yes, you should be able to drop your agent into /sbin with > similar permisions to other agents. > > You don't need s-c-c, of course, to use your agent...you can edit the > cluster.conf file directly and add it there. Under the fencedevices > section, include an entry for your agent and set the agent attribute to > whatever you named it (agent='fence_dave'). > > Put shared attributes in the fencedevice section and node specific attrs > under the clusternode->fence->method->device tag. Then propagate the new > file...first run ccs_tool update and then cman_tool version -r...man > these comands for details - but dont forget to incerment the > config_version attribute in the conf file before propagating a new one. > > Here is a rough outline of how to add it to s-c-c: > First add the form fields to one of the windows in fence.glade. Just > follow the conventions for naming along the lines of the other agent > forms that you find there. In each of the three windows that contain > fence forms, there is a device column and an instance column...both will > need to be extended for your new agent. > > Next, you will need to edit the python file FenceHandler.py...This > should be the only other file you need to touch. I would pick an > existing agent and follow it through the file, noticing all of the > places it is set...for example, for each fence device and fence > instance, there is a populate method, a validate method, a clear form > method, and a process_widgets method entry. Then there are a few hash > maps to add to. There are comments in the file to assist in adding a new > fence type. > > In summary, add forms to glade file, then edit FenceHandler.py > > -J > HI Jim, Yeow! Daunting pretty much captures it 8-) I tried hacking the files and got a bit dizzy figuring out which labelN/entryN/tableN was available - I actually got through changing fence.glade and then FenceHandler.py just blew me away.... I guess I'll give up on the gui and just use the command line tools to update and propagate the cluster configuration from now on. Playing it safe, I added two brocade fence devices to the config using s-c-c and then manually edited that cluster.conf, changing the fence_brocade to fence_servertech (the other options are OK for what I need). I was then able to propagate this to the other node OK. Started up fenced and no smoke so far.... I have to go to our co-locate to do the serious testing, so that'll be tomorrow. 
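For reference, the propagation sequence described above, as shell commands (the version number passed to cman_tool is hypothetical; it just has to match the config_version you incremented in the file):

    # after editing /etc/cluster/cluster.conf and bumping config_version
    ccs_tool update /etc/cluster/cluster.conf
    cman_tool version -r 8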
I'd be happy to pass my script on as GPL (it's just a modified version of fence_brocade - someone else did the heavy lifting), how would I go about that ? Thanks very much for the quick and detailed answer, Dave -- Being shot out of a cannon will always be better than being squeezed out of a tube. That is why God made fast motorcycles, Bubba.... "Song of the Sausage Creature" Hunter S. Thompson (RIP 02/20/2005) From benjamin.jakubowski at gmail.com Wed Aug 1 18:38:38 2007 From: benjamin.jakubowski at gmail.com (Benjamin Jakubowski) Date: Wed, 1 Aug 2007 20:38:38 +0200 Subject: [Linux-cluster] RHCS on RedHat As 4 u4 In-Reply-To: <1185977365.3318.21.camel@localhost.localdomain> References: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com> <46AF65A3.1010806@redhat.com> <6c80b6370707310956j10208835lff5af4d0806bf17@mail.gmail.com> <46AF6B24.9000208@redhat.com> <6c80b6370708010139h6438d1dsd1cc5bc1dfba8dff@mail.gmail.com> <1185977365.3318.21.camel@localhost.localdomain> Message-ID: <6c80b6370708011138r17f04e14v5ad7dfe99e0b2b66@mail.gmail.com> it's the same pb, withc balise really strange ... 2007/8/1, jim parsons : > > I am not sure here, but I believe you need a tag in your conf > file under ...there are two locking types available in > rhel4...perhaps you need to specify the one you wish to use? At any > rate, it is a simple thing to add and try. > > -J > > On Wed, 2007-08-01 at 10:39 +0200, Benjamin Jakubowski wrote: > > OK thanks, it seems to become better but this is my simple > > cluster.conf > > > > My first node is : server1 > > > > > > > > > post_join_delay="3"/> > > > > > > > > > > > > > > > > > > > > > > when i start ccsd ok : > > Aug 1 10:35:59 sv157020 ccsd[20726]: Starting ccsd 1.0.10: > > Aug 1 10:35:59 sv157020 ccsd[20726]: Built: Mar 19 2007 17:44:26 > > Aug 1 10:35:59 sv157020 ccsd[20726]: Copyright (C) Red Hat, Inc. > > 2004 All rights reserved. > > Aug 1 10:35:59 sv157020 ccsd: succeeded > > > > when i try to start cman : > > Aug 1 10:36:28 sv157020 ccsd[20726]: Unable to connect to cluster > > infrastructure after 30 seconds. > > Aug 1 10:36:33 sv157020 kernel: CMAN 2.6.9-45.2 (built Jul 13 2006 > > 11:42:36) installed > > Aug 1 10:36:33 sv157020 kernel: NET: Registered protocol family 30 > > Aug 1 10:36:33 sv157020 ccsd[20726]: cluster.conf (cluster name = > > siclad_re7, version = 7) found. > > Aug 1 10:36:34 sv157020 kernel: CMAN: Waiting to join or form a > > Linux-cluster > > Aug 1 10:36:52 sv157020 sshd(pam_unix)[20747]: session opened for > > user root by root(uid=0) > > Aug 1 10:36:59 sv157020 ccsd[20726]: Unable to connect to cluster > > infrastructure after 60 seconds. > > Aug 1 10:37:06 sv157020 kernel: CMAN: forming a new cluster > > Aug 1 10:37:06 sv157020 kernel: CMAN: quorum regained, resuming > > activity > > Aug 1 10:37:06 sv157020 cman: startup succeeded > > > > when i try to start rgmanager : > > Aug 1 10:37:29 sv157020 ccsd[20726]: Unable to connect to cluster > > infrastructure after 90 seconds. > > Aug 1 10:37:36 sv157020 ccsd[20726]: Cluster is not quorate. > > Refusing connection. > > Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing connect: > > Connection refused > > Aug 1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified > > (-111). > > Aug 1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting > > something evil. > > Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing get: > > Invalid request descriptor > > Aug 1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified > > (-111). 
> > Aug 1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting > > something evil. > > Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing get: > > Invalid request descriptor > > Aug 1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified > > (-21). > > Aug 1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting > > something evil. > > Aug 1 10:37:36 sv157020 ccsd[20726]: Error while processing > > disconnect: Invalid request descriptor > > Aug 1 10:37:36 sv157020 clurgmgrd[20817]: Resource Group > > Manager Starting > > Aug 1 10:37:36 sv157020 clurgmgrd[20817]: Loading Service Data > > Aug 1 10:37:37 sv157020 ccsd[20726]: Cluster is not quorate. > > Refusing connection. > > Aug 1 10:37:37 sv157020 ccsd[20726]: Error while processing connect: > > Connection refused > > Aug 1 10:37:37 sv157020 clurgmgrd[20817]: #5: Couldn't connect > > to ccsd! > > Aug 1 10:37:37 sv157020 clurgmgrd[20817]: #8: Couldn't > > initialize services > > Aug 1 10:37:37 sv157020 rgmanager: D?(c)marrage de clurgmgrd failed > > > > Do u have any idea ? > > > > is so simple in RHEL 5, but i need RHEL 4U4 > > > > Thanks a lot > > > > > > > > 2007/7/31, Bryn M. Reeves : > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > > > Benjamin Jakubowski wrote: > > > The probleme is : > > > i need to preserve the kernel version 2.6.9-42.ELsmp, to > > preserve a SAN > > > compatibility and in RHN, there isn't have a cluster kernel > > module ? > > > do u have any idea ? > > > > > > Thanks a lot > > > Benjamin > > > > OK, if you need to match with a specific kernel release you'll > > need to > > use the Cluster Suite packages that were released at the same > > time. > > > > For U4 (2.6.9-42.EL), I think those would be: > > > > cman-kernel-2.6.9-45.2 > > cman-kernel-smp-2.6.9-45.2 > > cman-kernel-hugemem-2.6.9-45.2 > > dlm-kernel-hugemem-2.6.9-42.10 > > dlm-kernel-2.6.9-42.10 > > dlm-kernel-smp-2.6.9-42.10 > > > > If you also need the packages for GFS, those are: > > > > GFS-kernel-2.6.9-58.0 > > GFS-kernel-smp-2.6.9-58.0 > > GFS-kernel-hugemem-2.6.9-58.0 > > gnbd-kernel-hugemem-2.6.9-9.41 > > gnbd-kernel-2.6.9-9.41 > > gnbd-kernel-smp-2.6.9-9.41 > > > > Those packages are still available on RHN - just go to the > > web > > interface, hit search by package & they should appear there. > > > > Be aware though that there were some important kernel related > > bugfixes > > applied after U4 - particularly for systems using multipath > > SAN storage. > > > > Kind regards, > > Bryn. > > -----BEGIN PGP SIGNATURE----- > > Version: GnuPG v1.4.7 (GNU/Linux) > > Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org > > > > iD8DBQFGr2sk6YSQoMYUY94RAp+qAJ9yKqpXxmFXqQzRc//0RVFzavPdzgCgk > > +YE > > YJFtx2iQm4zdYnKXQ/QmRHY= > > =s4BM > > -----END PGP SIGNATURE----- > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > -- > > @+ > > Benj > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- @+ Benj -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris at cmiware.com Thu Aug 2 00:55:22 2007 From: chris at cmiware.com (Chris Harms) Date: Wed, 01 Aug 2007 19:55:22 -0500 Subject: [Linux-cluster] cluster suite crashing Message-ID: <46B12B7A.2070402@cmiware.com> I am again attempting a 2-node cluster (two_node=1). This time we have power fencing, creating a cluster config from scratch. Unplug network cables on Node A. Node B still plugged in. (Expected B to fence A.) Node B does not attempt fencing, claims to have lost quorum (???). ( Plug Node A back in. Node A fences Node B On reboot, Node B reboots itself right after fencing Node A. [repeat] clurgmgrd[3630]: *Watchdog: Daemon died, rebooting *Various things appear directly ahead of this in the log. Most of the time it was a service script that was failing a stop operation. Correcting it did not resolve the issue: [/var/log/messages on Node B] clurgmgrd[5669]: Resource Group Manager Starting [selinux warnings] clurgmgrd[5667]: Watchdog: Daemon died, rebooting... kernel: md: stopping all md devices. fenced[4617]: fence "[Node A]" success [reboot] [some pertinent lines from cluster.conf - they are identical on each node] Meanwhile, Node A comes up and fences B when it gets a chance. I'm really at a loss on what to do. We are running the RHEL 5 rpms from RHN. Googling the error message yields some results on crashes in RGManager which were allegedly fixed in version 4. I have seen some other squirrelly behavior out of RGManager at various points, but reboots seemed to fix those so I figured proper fencing might render them moot. Any advice is welcome. Thanks, Chris From jacquesb at fnb.co.za Thu Aug 2 08:56:03 2007 From: jacquesb at fnb.co.za (Jacques Botha) Date: Thu, 02 Aug 2007 10:56:03 +0200 Subject: [Linux-cluster] RHEL5.1 beta and qdisk Message-ID: <1186044963.6514.4.camel@f2821966> Okay I have a cluster, everything is setup correctly. I start qdisk, and it likes the heuristics, but the moment it upgrades the cluster votes, everything to do with clustering just stops. The machine is still responsive, you can talk to it over the network, but cman_tool status, cman_tool nodes, clustat, all just sit and blink at the prompt. I can stop qdisk immediately afterwards, but it doesn't change the state, everything stays broken. -- Jacques Botha South Africa +27-11-889-4142 To read FirstRand Bank's Disclaimer for this email click on the following address or copy into your Internet browser: https://www.fnb.co.za/disclaimer.html If you are unable to access the Disclaimer, send a blank e-mail to firstrandbankdisclaimer at fnb.co.za and we will send you a copy of the Disclaimer. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From doc at mts.com.ua Thu Aug 2 09:19:31 2007 From: doc at mts.com.ua (Eugene Melnichuk) Date: Thu, 02 Aug 2007 12:19:31 +0300 Subject: [Linux-cluster] RHEL5.1 beta and qdisk In-Reply-To: <1186044963.6514.4.camel@f2821966> References: <1186044963.6514.4.camel@f2821966> Message-ID: <46B1A1A3.5060507@mts.com.ua> I confirm this bug, and already asked this question on this list (with subject "Hang on start fence_tool join with qdisk") but problem still here... -- Eugene Melnichuk Leading Engineer email: doc at mts.com.ua mob: +380503304043 pbx: +380501105731 BU MTS Ukraine 49/2 Pobedy ave., room 4.26, 03680, Kyiv, Ukraine Jacques Botha ?????: > Okay > > I have a cluster, everything is setup correctly. 
> > I start qdisk, and it likes the heuristics, but the moment it upgrades > the cluster votes, everything to do with clustering just stops. > The machine is still responsive, you can talk to it over the network, > but cman_tool status, cman_tool nodes, clustat, all just sit and blink > at the prompt. > > I can stop qdisk immediately afterwards, but it doesn't change the > state, everything stays broken. > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Thu Aug 2 15:48:19 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 11:48:19 -0400 Subject: [Linux-cluster] fs.sh? In-Reply-To: <20070731145441.GE21896@helsinki.fi> References: <20070629132556.GK29854@helsinki.fi> <20070706163152.GH1681@redhat.com> <20070706182930.GE5981@redhat.com> <20070706183151.GF5981@redhat.com> <20070706183658.GA24692@helsinki.fi> <20070710221922.GG18076@redhat.com> <20070731121438.GA21896@helsinki.fi> <20070731134121.GH4955@redhat.com> <20070731145441.GE21896@helsinki.fi> Message-ID: <20070802154818.GB26367@redhat.com> On Tue, Jul 31, 2007 at 05:54:41PM +0300, Janne Peltonen wrote: > On Tue, Jul 31, 2007 at 09:41:21AM -0400, Lon Hohberger wrote: > > On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote: > > > On Tue, Jul 10, 2007 at 06:19:22PM -0400, Lon Hohberger wrote: > > > > > > > > http://people.redhat.com/lhh/rhel5-test > > > > > > > > You'll need at least the updated cman package. The -2.1lhh build of > > > > rgmanager is the one I just built today; the others are a bit older. > > > > > > Well, I installed the new versions of the cman and rgmanager packages I > > > found there, but to no avail: I still get 1500 invocations of fs.sh per > > > second. > > > > I put a log message in fs.sh: > > > > Jul 31 09:27:29 bart clurgmgrd: [4395]: /usr/share/cluster/fs.sh > > TEST > > > > It comes up once every several (10-20) seconds like it's supposed to. > > I did the same, with the same results. It seems to me that the clurgmgrd > process isn't calling the complete script any more times than it's > supposed to. What I'm seeing are the execs of fs.sh, that is, it > includes each () and `` and so on. Each fs.sh invocation seems to create > quite an amount of subshells. > > I'm sorry for having misled you. And this all means, there isn't > probably much reason to read the cluster.conf and rg_test rules output - > I'll attach them anyway. Yeah, it does hit a lot of subshells. Awks, seds, and the like. Some pattern substitution and matching can be done in pure bash. That's quite an impressive cluster.conf... I'm going to look at it some. -- Lon From lhh at redhat.com Thu Aug 2 15:51:55 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 11:51:55 -0400 Subject: [Linux-cluster] fs.sh? 
In-Reply-To: <20070731171123.GH21896@helsinki.fi> References: <20070629132556.GK29854@helsinki.fi> <20070706163152.GH1681@redhat.com> <20070706182930.GE5981@redhat.com> <20070706183151.GF5981@redhat.com> <20070706183658.GA24692@helsinki.fi> <20070710221922.GG18076@redhat.com> <20070731121438.GA21896@helsinki.fi> <20070731134121.GH4955@redhat.com> <20070731145441.GE21896@helsinki.fi> <20070731171123.GH21896@helsinki.fi> Message-ID: <20070802155155.GC26367@redhat.com> On Tue, Jul 31, 2007 at 08:11:23PM +0300, Janne Peltonen wrote: > On Tue, Jul 31, 2007 at 05:54:41PM +0300, Janne Peltonen wrote: > > On Tue, Jul 31, 2007 at 09:41:21AM -0400, Lon Hohberger wrote: > > > On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote: > > > > On Tue, Jul 10, 2007 at 06:19:22PM -0400, Lon Hohberger wrote: > > > > > > > > > > http://people.redhat.com/lhh/rhel5-test > > > > > > > > > > You'll need at least the updated cman package. The -2.1lhh build of > > > > > rgmanager is the one I just built today; the others are a bit older. > > > > > > > > Well, I installed the new versions of the cman and rgmanager packages I > > > > found there, but to no avail: I still get 1500 invocations of fs.sh per > > > > second. > > > > > > I put a log message in fs.sh: > > > > > > Jul 31 09:27:29 bart clurgmgrd: [4395]: /usr/share/cluster/fs.sh > > > TEST > > > > > > It comes up once every several (10-20) seconds like it's supposed to. > > > > I did the same, with the same results. It seems to me that the clurgmgrd > > process isn't calling the complete script any more times than it's > > supposed to. What I'm seeing are the execs of fs.sh, that is, it > > includes each () and `` and so on. Each fs.sh invocation seems to create > > quite an amount of subshells. > > > > I'm sorry for having misled you. And this all means, there isn't > > probably much reason to read the cluster.conf and rg_test rules output - > > I'll attach them anyway. > > After running the new rgmanager packages for abt four hours without any > of the load fluctuation I'd experienced before (with a more-or-less > four-hour interval, system load first increases slowly until it reaches > a high level - dependent on overall system load - and then swiftly > decreases to near zero, to start increasing again. This fluctuation > peaks at about 5.0 in a system with no users at all, but many services. > If there are many users and the user peak coincides with the base peak, > the system experiences a shortish load peak of abt 100.0, after which it > recovers and the basic load fluctuation becomes visible again). Then the > load averages started increasing again, to something 10.0ish, so - > frustrated - I edited /usr/share/cluster/fs.sh and put an exit 0 to the > switch-case "status|monitor" on $1. Well. Load averages promptly fell > back to under 0.5, disk usage% fell by 30 %-units, and overall system > responsiveness increased considerably. > > So I'll be running my cluster without fs status checks for now. I hope > someone'll work out what's wrong with fs.sh soon... ;) There are a number of things we can do - can you file a bugzilla about this, now that we know what's going on? (and that it's not internal rgmanager difficulties, just inefficient scripting)? -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. 
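To make the "inefficient scripting" point concrete: every backtick or $() in a status check forks a subshell, and piping through awk or sed adds more processes, so a status pass over dozens of filesystems multiplies into hundreds of forks. A contrived before/after in plain bash, not taken from fs.sh itself:

    # forks a subshell plus awk on every call
    dev=$(echo "$line" | awk '{print $1}')

    # pure bash, no fork: read the first field directly
    read -r dev rest <<< "$line"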
From lhh at redhat.com Thu Aug 2 15:54:33 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 11:54:33 -0400 Subject: [Linux-cluster] clustat on GULM In-Reply-To: <6596a7c70707310722l24a10059l9309229ba7c534fc@mail.gmail.com> References: <6596a7c70707310722l24a10059l9309229ba7c534fc@mail.gmail.com> Message-ID: <20070802155432.GD26367@redhat.com> On Tue, Jul 31, 2007 at 10:22:33AM -0400, siman hew wrote: > I found clustat report wrong information about rgmanager when a cluster is > GULM. > I setup a 3-node cluster on RHEL4U5, with GULM. There are one failover > domain, one resource and one service, just for testing. > I found clustat always report rgmanger is running after cluster is started. > With the configuration with DLM, clustat report correctly. > node4 is lock server, and all nodes report the same inaccurate information. GULM doesn't have a way to provide sub-groups really, except for users of lockspaces. Can you file a bugzilla about this? I wonder if we can get the information for "what nodes are in this lockspace" from GULM. If it's possible, we could probably fix it. -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Aug 2 15:55:40 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 11:55:40 -0400 Subject: [Linux-cluster] Odd cluster problems In-Reply-To: <46AF59DC.9000906@utmem.edu> References: <46AF59DC.9000906@utmem.edu> Message-ID: <20070802155540.GE26367@redhat.com> On Tue, Jul 31, 2007 at 10:48:44AM -0500, Jay Leafey wrote: > I've got a 3-node cluster running CentOS 4.5 and I cannot communicate > with the resource group manager. When I use the clustat command I get a > timeout: > > >[root at rapier ~]# clustat > >Timed out waiting for a response from Resource Group Manager > >Member Status: Quorate > > > > Member Name Status > > ------ ---- ------ > > rapier.utmem.edu Online, Local, rgmanager > > thorax.utmem.edu Offline > > cyclops.utmem.edu Online, rgmanager > >Fence Domain: "default" 2 2 recover 4 - > >[1 2] Until fencing completes, rgmanager won't respond. fence_ack_manual needs to be run. > > > > > > > >User: "usrm::manager" 10 10 recover 2 - > >[1 2] > > -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Aug 2 15:56:19 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 11:56:19 -0400 Subject: [Linux-cluster] VF: Abort: Invalid header in reply from member #1 In-Reply-To: References: Message-ID: <20070802155619.GF26367@redhat.com> On Tue, Jul 31, 2007 at 12:52:38PM -0300, Filipe Miranda wrote: > Hello everybody, > > I am using RedHatCluster Suite for RHEL3 and I am experiencing the following > errors in the cluster's log file: Which version do you have installed? -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Aug 2 15:58:00 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 11:58:00 -0400 Subject: [Linux-cluster] cluster suite crashing In-Reply-To: <46B12B7A.2070402@cmiware.com> References: <46B12B7A.2070402@cmiware.com> Message-ID: <20070802155800.GG26367@redhat.com> On Wed, Aug 01, 2007 at 07:55:22PM -0500, Chris Harms wrote: > I am again attempting a 2-node cluster (two_node=1). This time we have > power fencing, creating a cluster config from scratch. > > Unplug network cables on Node A. Node B still plugged in. (Expected B to > fence A.) > Node B does not attempt fencing, claims to have lost quorum (???). ( > Plug Node A back in. 
> Node A fences Node B > > On reboot, Node B reboots itself right after fencing Node A. > [repeat] > clurgmgrd[3630]: *Watchdog: Daemon died, rebooting That's a bug; what version of rgmanger do you have installed? -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Aug 2 15:59:11 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 11:59:11 -0400 Subject: [Linux-cluster] RHEL5.1 beta and qdisk In-Reply-To: <1186044963.6514.4.camel@f2821966> References: <1186044963.6514.4.camel@f2821966> Message-ID: <20070802155911.GH26367@redhat.com> On Thu, Aug 02, 2007 at 10:56:03AM +0200, Jacques Botha wrote: > Okay > > I have a cluster, everything is setup correctly. > > I start qdisk, and it likes the heuristics, but the moment it upgrades > the cluster votes, everything to do with clustering just stops. > The machine is still responsive, you can talk to it over the network, > but cman_tool status, cman_tool nodes, clustat, all just sit and blink > at the prompt. > > I can stop qdisk immediately afterwards, but it doesn't change the > state, everything stays broken. Fencing being attempted? cman_tool shouldn't *ever* hang as a result of qdiskd. -- Lon Hohberger - Software Engineer - Red Hat, Inc. From chris at cmiware.com Thu Aug 2 16:08:51 2007 From: chris at cmiware.com (Chris Harms) Date: Thu, 02 Aug 2007 11:08:51 -0500 Subject: [Linux-cluster] cluster suite crashing In-Reply-To: <20070802155800.GG26367@redhat.com> References: <46B12B7A.2070402@cmiware.com> <20070802155800.GG26367@redhat.com> Message-ID: <46B20193.8050205@cmiware.com> rgmanager-2.0.24-1.el5 I'm not sure if this is useful or not, but I had just rebooted Node B when we pulled the cables on Node A. It is possible not all of the services / inter-node communication had completed. Thanks, Chris Lon Hohberger wrote: > On Wed, Aug 01, 2007 at 07:55:22PM -0500, Chris Harms wrote: > >> I am again attempting a 2-node cluster (two_node=1). This time we have >> power fencing, creating a cluster config from scratch. >> >> Unplug network cables on Node A. Node B still plugged in. (Expected B to >> fence A.) >> Node B does not attempt fencing, claims to have lost quorum (???). ( >> Plug Node A back in. >> Node A fences Node B >> >> On reboot, Node B reboots itself right after fencing Node A. >> [repeat] >> > > >> clurgmgrd[3630]: *Watchdog: Daemon died, rebooting >> > > That's a bug; what version of rgmanger do you have installed? > > From jleafey at utmem.edu Thu Aug 2 19:00:13 2007 From: jleafey at utmem.edu (Jay Leafey) Date: Thu, 02 Aug 2007 14:00:13 -0500 Subject: [Linux-cluster] Odd cluster problems In-Reply-To: <20070802155540.GE26367@redhat.com> References: <46AF59DC.9000906@utmem.edu> <20070802155540.GE26367@redhat.com> Message-ID: <46B229BD.1020009@utmem.edu> Lon Hohberger wrote: > On Tue, Jul 31, 2007 at 10:48:44AM -0500, Jay Leafey wrote: >> I've got a 3-node cluster running CentOS 4.5 and I cannot communicate >> with the resource group manager. When I use the clustat command I get a >> timeout: >> >>> [root at rapier ~]# clustat >>> Timed out waiting for a response from Resource Group Manager >>> Member Status: Quorate >>> >>> Member Name Status >>> ------ ---- ------ >>> rapier.utmem.edu Online, Local, rgmanager >>> thorax.utmem.edu Offline >>> cyclops.utmem.edu Online, rgmanager > >>> Fence Domain: "default" 2 2 recover 4 - >>> [1 2] > > Until fencing completes, rgmanager won't respond. > > fence_ack_manual needs to be run. 
> >>> >>> >>> User: "usrm::manager" 10 10 recover 2 - >>> [1 2] >>> > Your reply was a bit confusing at first, but looking deeper showed you were right on the mark. The systems (using HP ILO fencing) were unable to communicate with each other very well or with the ILO ports at all. Turns out some of the ports they were configured on had been moved to a different VLAN, so the network was split between the ILOs and the host ports. Configuring the ports properly seems to have resolved the issue, everything is working fine now. I guess I just need to keep the rubber hose handy for "discussions" with the network guys! (grin!) Thanks! -- Jay Leafey - University of Tennessee E-Mail: jleafey at utmem.edu Phone: 901-448-6534 FAX: 901-448-8199 -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5158 bytes Desc: S/MIME Cryptographic Signature URL: From orkcu at yahoo.com Thu Aug 2 20:07:12 2007 From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=) Date: Thu, 2 Aug 2007 13:07:12 -0700 (PDT) Subject: [Linux-cluster] How to add start-up options to pulse ?? Message-ID: <41175.41349.qm@web50610.mail.re2.yahoo.com> Hi is there anyway to start pulse with one of its options , especificaly --forceactive, without manualy modification of the init.d/pulse start-stop script? I guess if I modify this script with the next update to piranha I will lose all the changes, am I correct? I was thinking of having just like other init.d script have: a startup config file in /etc/sysconfig/pulse but to implement this I must modify pulse start-stop script so ..... should I ask for an enhance to the package in bugzilla? I do not have Redhat cluster support for rhel4 (which is the one I am working with) but I do have support for rhel5-server-AP, so maybe a ticket might help? thanks roger __________________________________________ RedHat Certified ( RHCE ) Cisco Certified ( CCNA & CCDA ) ____________________________________________________________________________________ Shape Yahoo! in your own image. Join our Network Research Panel today! http://surveylink.yahoo.com/gmrs/yahoo_panel_invite.asp?a=7 From lhh at redhat.com Thu Aug 2 20:19:17 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 16:19:17 -0400 Subject: [Linux-cluster] Odd cluster problems In-Reply-To: <46B229BD.1020009@utmem.edu> References: <46AF59DC.9000906@utmem.edu> <20070802155540.GE26367@redhat.com> <46B229BD.1020009@utmem.edu> Message-ID: <20070802201917.GL26367@redhat.com> On Thu, Aug 02, 2007 at 02:00:13PM -0500, Jay Leafey wrote: > Lon Hohberger wrote: > >On Tue, Jul 31, 2007 at 10:48:44AM -0500, Jay Leafey wrote: > >>I've got a 3-node cluster running CentOS 4.5 and I cannot communicate > >>with the resource group manager. When I use the clustat command I get a > >>timeout: > >> > >>>[root at rapier ~]# clustat > >>>Timed out waiting for a response from Resource Group Manager > >>>Member Status: Quorate > >>> > >>> Member Name Status > >>> ------ ---- ------ > >>> rapier.utmem.edu Online, Local, rgmanager > >>> thorax.utmem.edu Offline > >>> cyclops.utmem.edu Online, rgmanager > > > >>>Fence Domain: "default" 2 2 recover 4 - > >>>[1 2] > > > >Until fencing completes, rgmanager won't respond. > > > >fence_ack_manual needs to be run. > > > >>> > >>> > >>>User: "usrm::manager" 10 10 recover 2 - > >>>[1 2] > >>> > > > > Your reply was a bit confusing at first, but looking deeper showed you > were right on the mark. 
The systems (using HP ILO fencing) were unable > to communicate with each other very well or with the ILO ports at all. > Turns out some of the ports they were configured on had been moved to a > different VLAN, so the network was split between the ILOs and the host > ports. Sorry, I just assumed you were using manual fencing as opposed to iLO, since that's the 90+/- % case of why fencing was stuck in the 'recover' state. I guess we all know what happens when you assume... :) Or maybe, when I assume? -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Aug 2 20:23:15 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 16:23:15 -0400 Subject: [Linux-cluster] cluster suite crashing In-Reply-To: <46B20193.8050205@cmiware.com> References: <46B12B7A.2070402@cmiware.com> <20070802155800.GG26367@redhat.com> <46B20193.8050205@cmiware.com> Message-ID: <20070802202314.GM26367@redhat.com> On Thu, Aug 02, 2007 at 11:08:51AM -0500, Chris Harms wrote: > rgmanager-2.0.24-1.el5 > > I'm not sure if this is useful or not, but I had just rebooted Node B > when we pulled the cables on Node A. It is possible not all of the > services / inter-node communication had completed. Could you pull from CVS (RHEL5 or 51 branches)? The current code has a couple of crash bugs fixed. Note that if you store: DAEMON_COREFILE_LIMIT="unlimited" RGMGR_OPTS="-w" ... in /etc/sysconfig/cluster, rgmanager will generate a core file in the root directory. Attaching the core to the bug report will help determine whether it's something already fixed in CVS. But seriously, if you see 'daemon died, rebooting' it's either user error (you did a 'kill -9' of only one rgmanager pid) or a bug (crash). -- Lon -- Lon Hohberger - Software Engineer - Red Hat, Inc. From lhh at redhat.com Thu Aug 2 20:25:02 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 2 Aug 2007 16:25:02 -0400 Subject: [Linux-cluster] How to add start-up options to pulse ?? In-Reply-To: <41175.41349.qm@web50610.mail.re2.yahoo.com> References: <41175.41349.qm@web50610.mail.re2.yahoo.com> Message-ID: <20070802202502.GN26367@redhat.com> On Thu, Aug 02, 2007 at 01:07:12PM -0700, Roger Pe?a wrote: > Hi is there anyway to start pulse with one of its > options , especificaly --forceactive, without manualy > modification of the init.d/pulse start-stop script? > > I guess if I modify this script with the next update > to piranha I will lose all the changes, am I correct? > > I was thinking of having just like other init.d script > have: a startup config file in /etc/sysconfig/pulse > but to implement this I must modify pulse start-stop > script so ..... It should leave your pulse init script and create pulse.rpmnew, actually. > should I ask for an enhance to the package in > bugzilla? Yes. Most packages should have /etc/sysconfig/ which gets sourced on startup for specifically this purpose. -- Lon Hohberger - Software Engineer - Red Hat, Inc. From orkcu at yahoo.com Thu Aug 2 21:27:28 2007 From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=) Date: Thu, 2 Aug 2007 14:27:28 -0700 (PDT) Subject: [Linux-cluster] How to add start-up options to pulse ?? In-Reply-To: <20070802202502.GN26367@redhat.com> Message-ID: <450271.81221.qm@web50604.mail.re2.yahoo.com> --- Lon Hohberger wrote: > On Thu, Aug 02, 2007 at 01:07:12PM -0700, Roger Pe?a > wrote: > > Hi is there anyway to start pulse with one of its > > options , especificaly --forceactive, without > manualy > > modification of the init.d/pulse start-stop > script? 
> > > > I guess if I modify this script with the next > update > > to piranha I will lose all the changes, am I > correct? > > > > I was thinking of having just like other init.d > script > > have: a startup config file in > /etc/sysconfig/pulse > > but to implement this I must modify pulse > start-stop > > script so ..... > > It should leave your pulse init script and create > pulse.rpmnew, > actually. I download the src.rpm I saw the spec file :-) you are right ;-) (In the past, I had the experience with other rpm that overwrite those scripts :-( ) > > > should I ask for an enhance to the package in > > bugzilla? > > Yes. Most packages should have > /etc/sysconfig/ which gets > sourced on startup for specifically this purpose. where should I get the cvs of piranha? http://sources.redhat.com/cgi-bin/cvsweb.cgi/?sortby=file&hideattic=1&logsort=date&f=u&hidenonreadable=1&cvsroot=piranha do not show any code I ask because I am interested in one of your patch that is already in CVS tree, the one attached to: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238498 thanks roger __________________________________________ RedHat Certified ( RHCE ) Cisco Certified ( CCNA & CCDA ) ____________________________________________________________________________________ Luggage? GPS? Comic books? Check out fitting gifts for grads at Yahoo! Search http://search.yahoo.com/search?fr=oni_on_mail&p=graduation+gifts&cs=bz From chris at cmiware.com Thu Aug 2 23:23:55 2007 From: chris at cmiware.com (Chris Harms) Date: Thu, 02 Aug 2007 18:23:55 -0500 Subject: [Linux-cluster] cluster suite crashing In-Reply-To: <20070802202314.GM26367@redhat.com> References: <46B12B7A.2070402@cmiware.com> <20070802155800.GG26367@redhat.com> <46B20193.8050205@cmiware.com> <20070802202314.GM26367@redhat.com> Message-ID: <46B2678B.7000702@cmiware.com> I grabbed the RHEL5 branch out of CVS, but compilation fails with make[2]: Entering directory `/usr/src/cluster-cvs/cluster/dlm/lib' gcc -Wall -g -I. -O2 -D_REENTRANT -c -o libdlm.o libdlm.c libdlm.c: In function ?set_version_v5?: libdlm.c:324: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:325: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:326: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c: In function ?set_version_v6?: libdlm.c:335: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:336: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:337: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c: In function ?detect_kernel_version?: libdlm.c:443: error: storage size of ?v? isn?t known libdlm.c:446: error: invalid application of ?sizeof? to incomplete type ?struct dlm_device_version? libdlm.c:448: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:449: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:450: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:452: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:453: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:454: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:443: warning: unused variable ?v? libdlm.c: In function ?do_dlm_dispatch?: libdlm.c:590: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c: In function ?ls_lock_v6?: libdlm.c:835: error: ?struct dlm_lock_params? has no member named ?xid? 
libdlm.c:837: error: ?struct dlm_lock_params? has no member named ?timeout? libdlm.c: In function ?ls_lock?: libdlm.c:892: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c: In function ?dlm_ls_lockx?: libdlm.c:916: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c: In function ?dlm_ls_unlock?: libdlm.c:1067: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c: In function ?dlm_ls_deadlock_cancel?: libdlm.c:1099: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:1115: error: ?DLM_USER_DEADLOCK? undeclared (first use in this function) libdlm.c:1115: error: (Each undeclared identifier is reported only once libdlm.c:1115: error: for each function it appears in.) libdlm.c: In function ?dlm_ls_purge?: libdlm.c:1134: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:1145: error: ?DLM_USER_PURGE? undeclared (first use in this function) libdlm.c:1146: error: ?union ? has no member named ?purge? libdlm.c:1147: error: ?union ? has no member named ?purge? libdlm.c: In function ?create_lockspace?: libdlm.c:1311: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c: In function ?release_lockspace?: libdlm.c:1415: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c: In function ?dlm_kernel_version?: libdlm.c:1501: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:1502: error: invalid use of undefined type ?struct dlm_device_version? libdlm.c:1503: error: invalid use of undefined type ?struct dlm_device_version? make[2]: *** [libdlm.o] Error 1 make[2]: Leaving directory `/usr/src/cluster-cvs/cluster/dlm/lib' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/cluster-cvs/cluster/dlm' make: *** [all] Error 2 I guess it doesn't like the officially supported RHEL kernel (2.6.18-8.1.8). We also are trying to get the 5.1 Beta rpms going with no success. So far a kernel panic on 5.1 kernel (2.6.18-36) Lon Hohberger wrote: > On Thu, Aug 02, 2007 at 11:08:51AM -0500, Chris Harms wrote: > >> rgmanager-2.0.24-1.el5 >> >> I'm not sure if this is useful or not, but I had just rebooted Node B >> when we pulled the cables on Node A. It is possible not all of the >> services / inter-node communication had completed. >> > > Could you pull from CVS (RHEL5 or 51 branches)? The current code has a > couple of crash bugs fixed. > > Note that if you store: > > DAEMON_COREFILE_LIMIT="unlimited" > RGMGR_OPTS="-w" > > ... in /etc/sysconfig/cluster, rgmanager will generate a core file in > the root directory. Attaching the core to the bug report will help > determine whether it's something already fixed in CVS. > > But seriously, if you see 'daemon died, rebooting' it's either user > error (you did a 'kill -9' of only one rgmanager pid) or a bug (crash). > > -- Lon > > From orkcu at yahoo.com Fri Aug 3 02:14:44 2007 From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=) Date: Thu, 2 Aug 2007 19:14:44 -0700 (PDT) Subject: [Linux-cluster] How to add start-up options to pulse ?? In-Reply-To: <450271.81221.qm@web50604.mail.re2.yahoo.com> Message-ID: <526805.31332.qm@web50612.mail.re2.yahoo.com> --- Roger Pe?a wrote: > > --- Lon Hohberger wrote: > > > On Thu, Aug 02, 2007 at 01:07:12PM -0700, Roger > Pe?a > > wrote: > > > Hi is there anyway to start pulse with one of > its > > > options , especificaly --forceactive, without > > manualy > > > modification of the init.d/pulse start-stop > > script? 
> > > > > > I guess if I modify this script with the next > > update > > > to piranha I will lose all the changes, am I > > correct? > > > > > > I was thinking of having just like other init.d > > script > > > have: a startup config file in > > /etc/sysconfig/pulse > > > but to implement this I must modify pulse > > start-stop > > > script so ..... > > > > It should leave your pulse init script and create > > pulse.rpmnew, > > actually. > > I download the src.rpm I saw the spec file :-) > you are right ;-) > (In the past, I had the experience with other rpm > that > overwrite those scripts :-( ) I speak to quickly, piranha 0.8.2 (RHEL4) do not overwrite /etc/init.d/* but 0.8.4 (RHEL5) actually do it (line 83 of piranha.spec) there was a change in the piranha.spec file between the two versions looking into the way piranha is package, I think a patch to implement an startup config file in /etc/sysconfig/ is a litle more complex than what I think at the first place because: 1- create the /etc/sysconfig/pulse prototype file in the piranha package 2- modify the Makefile of the package to install this file in the proper place with the proper name 3- repackage the tar.gz 4- modify the spec.file to include the new file in the "file" section 5- repackage the rpm not dificult but "lot" of work and not so important patch so, I guess, this will not be a priority for redhat ... will you appreciate a patch for all this ? I can do it, in fact I will, again rhel5 version thanks roger __________________________________________ RedHat Certified ( RHCE ) Cisco Certified ( CCNA & CCDA ) ____________________________________________________________________________________ Got a little couch potato? Check out fun summer activities for kids. http://search.yahoo.com/search?fr=oni_on_mail&p=summer+activities+for+kids&cs=bz From jacquesb at fnb.co.za Fri Aug 3 07:02:10 2007 From: jacquesb at fnb.co.za (Jacques Botha) Date: Fri, 03 Aug 2007 09:02:10 +0200 Subject: [Linux-cluster] RHEL5.1 beta and qdisk In-Reply-To: <20070802155911.GH26367@redhat.com> References: <1186044963.6514.4.camel@f2821966> <20070802155911.GH26367@redhat.com> Message-ID: <1186124530.5993.2.camel@f2821966> No fencing being attempted, it just sits there, and the last thing in the log is qdisk upgrading the status of the cluser: Aug 2 10:36:21 fnbgw01 qdiskd[3519]: Quorum Partition: /dev/sdb1 Label: fnbgw_qdisk Aug 2 10:36:21 fnbgw01 qdiskd[3520]: Quorum Daemon Initializing Aug 2 10:36:24 fnbgw01 qdiskd[3520]: Heuristic: 'ping 172.20.28.193 -c3 -t1' UP Aug 2 10:36:24 fnbgw01 qdiskd[3520]: Heuristic: 'ping 172.20.28.195 -c3 -t1' UP Aug 2 10:36:24 fnbgw01 qdiskd[3520]: Heuristic: 'ping 172.20.28.196 -c3 -t1' UP Aug 2 10:36:24 fnbgw01 qdiskd[3520]: Heuristic: 'ping 172.20.28.197 -c3 -t1' UP Aug 2 10:36:24 fnbgw01 qdiskd[3520]: Heuristic: 'ping 172.20.28.198 -c3 -t1' UP Aug 2 10:36:31 fnbgw01 qdiskd[3520]: Initial score 6/6 Aug 2 10:36:31 fnbgw01 qdiskd[3520]: Initialization complete Aug 2 10:36:31 fnbgw01 openais[3469]: [CMAN ] quorum device registered Aug 2 10:36:31 fnbgw01 qdiskd[3520]: Score sufficient for master operation (6/6; required=3); upgrading Aug 2 10:36:45 fnbgw01 qdiskd[3520]: Assuming master role And after that you can't get any joy. Jacques On Thu, 2007-08-02 at 11:59 -0400, Lon Hohberger wrote: > On Thu, Aug 02, 2007 at 10:56:03AM +0200, Jacques Botha wrote: > > Okay > > > > I have a cluster, everything is setup correctly. 
> > > > I start qdisk, and it likes the heuristics, but the moment it upgrades > > the cluster votes, everything to do with clustering just stops. > > The machine is still responsive, you can talk to it over the network, > > but cman_tool status, cman_tool nodes, clustat, all just sit and blink > > at the prompt. > > > > I can stop qdisk immediately afterwards, but it doesn't change the > > state, everything stays broken. > > Fencing being attempted? > > cman_tool shouldn't *ever* hang as a result of qdiskd. > To read FirstRand Bank's Disclaimer for this email click on the following address or copy into your Internet browser: https://www.fnb.co.za/disclaimer.html If you are unable to access the Disclaimer, send a blank e-mail to firstrandbankdisclaimer at fnb.co.za and we will send you a copy of the Disclaimer. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From janne.peltonen at helsinki.fi Fri Aug 3 13:09:58 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Fri, 3 Aug 2007 16:09:58 +0300 Subject: [Linux-cluster] fs.sh? In-Reply-To: <20070802154818.GB26367@redhat.com> References: <20070629132556.GK29854@helsinki.fi> <20070706163152.GH1681@redhat.com> <20070706182930.GE5981@redhat.com> <20070706183151.GF5981@redhat.com> <20070706183658.GA24692@helsinki.fi> <20070710221922.GG18076@redhat.com> <20070731121438.GA21896@helsinki.fi> <20070731134121.GH4955@redhat.com> <20070731145441.GE21896@helsinki.fi> <20070802154818.GB26367@redhat.com> Message-ID: <20070803130957.GL21896@helsinki.fi> On Thu, Aug 02, 2007 at 11:48:19AM -0400, Lon Hohberger wrote: > That's quite an impressive cluster.conf... I'm going to look at it > some. One thing that makes it longer than strictly necessary is the fact that each service has its own prioritized failover domain. I could, of course, just have one failover domain for each possible permutation of nodes pcn1-hb..pcn4-hb... on the other hand, I feel safer to edit the failover domain specification on a live system than to edit the service specification (to change the failover domain) - I don't know whether rgmanager would want to restart a service if I changed its failover domain (would it?), but I know it doesn't restart a service if I edit the failover domain specification... --Janne -- Janne Peltonen From jos at xos.nl Fri Aug 3 14:56:19 2007 From: jos at xos.nl (Jos Vos) Date: Fri, 03 Aug 2007 16:56:19 +0200 Subject: [Linux-cluster] Priority in failover domains sometimes not honoured Message-ID: <200708031456.l73EuJM22161@xos037.xos.nl> Hi, I have the following failover domains: But often a service for, say, "prio_host2" stays running on host1, (or vice versa), while both hosts are running fine. In a "stable" situation, that service should be moved to host2, AFAIK (and I have seen this happen in some situations). This is on RHEL 5.0. Is this a bug or...? 
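(The <failoverdomains> block above did not survive the list archive. Purely as an illustration of the kind of ordered domains being described, not a reconstruction of the real config, the XML would look roughly like this; domain and node names are placeholders:)

# illustration only; written to a scratch file so nothing is applied
cat > /tmp/failoverdomains.example <<'EOF'
<failoverdomains>
  <failoverdomain name="prio_host1" ordered="1" restricted="1">
    <failoverdomainnode name="host1" priority="1"/>
    <failoverdomainnode name="host2" priority="2"/>
  </failoverdomain>
  <failoverdomain name="prio_host2" ordered="1" restricted="1">
    <failoverdomainnode name="host2" priority="1"/>
    <failoverdomainnode name="host1" priority="2"/>
  </failoverdomain>
</failoverdomains>
EOF

With ordered="1" rgmanager is supposed to prefer the priority-1 node whenever it is online, which is the behaviour in question here.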
Thanks, -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From simanhew at gmail.com Fri Aug 3 18:50:45 2007 From: simanhew at gmail.com (siman hew) Date: Fri, 3 Aug 2007 14:50:45 -0400 Subject: [Linux-cluster] clustat on GULM In-Reply-To: <20070802155432.GD26367@redhat.com> References: <6596a7c70707310722l24a10059l9309229ba7c534fc@mail.gmail.com> <20070802155432.GD26367@redhat.com> Message-ID: <6596a7c70708031150xe002365y8b04e0ded11b7edc@mail.gmail.com> Entered a bug.Bugzilla Bug 250811: clustat report wrong information about rgmanager on a GULM clusterHope it will be solved soon, which can save some work of mine, since I can not trust/rely on clustat right now. Thanks, Siman On 8/2/07, Lon Hohberger wrote: > > On Tue, Jul 31, 2007 at 10:22:33AM -0400, siman hew wrote: > > I found clustat report wrong information about rgmanager when a cluster > is > > GULM. > > I setup a 3-node cluster on RHEL4U5, with GULM. There are one failover > > domain, one resource and one service, just for testing. > > I found clustat always report rgmanger is running after cluster is > started. > > With the configuration with DLM, clustat report correctly. > > node4 is lock server, and all nodes report the same inaccurate > information. > > GULM doesn't have a way to provide sub-groups really, except for users > of lockspaces. > > Can you file a bugzilla about this? I wonder if we can get the > information for "what nodes are in this lockspace" from GULM. If it's > possible, we could probably fix it. > > -- > Lon Hohberger - Software Engineer - Red Hat, Inc. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karl at klxsystems.net Fri Aug 3 19:29:34 2007 From: karl at klxsystems.net (karl at klxsystems.net) Date: Fri, 3 Aug 2007 12:29:34 -0700 (PDT) Subject: [Linux-cluster] fencing with IBM RAS II on Centos 5 cman start Message-ID: <47888.66.93.167.218.1186169374.squirrel@www.klxsystems.net> Hi, I am attempting to use an IBM RSA II (Remote Supervisor Adapter) and my fencing device, as it is listed in the dropdown menu of supported devices when you create and configure a fencing device. Is there a specific log I can tail to find out what's going on with my fencing device and why it won't start? Also, if any of you have familiarity with this device, how has it run for you as a fence? Anyhow, just interested in knocking out what appears to be a minor error and not seeing any immediate reference to what logging mechanism I should be consulting to troubelshoot GFS-related items. To be clear, this is happening when I try to start cman. ----snip----- [root at ftp3 RPMS]# rpm -i ibmusbasm64-1.37_rhel4-2.x86_64.rpm Found IBM Remote Supervisor Adaptor II. Removing previous Start/Kill ibmasm run levels. Starting IBM RSA II daemon Calling install_initd ibmasm [root at ftp3 RPMS]# /etc/rc.d/init.d/cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... 
failed -karlski [FAILED] From jparsons at redhat.com Fri Aug 3 19:36:08 2007 From: jparsons at redhat.com (James Parsons) Date: Fri, 03 Aug 2007 15:36:08 -0400 Subject: [Linux-cluster] fencing with IBM RAS II on Centos 5 cman start In-Reply-To: <47888.66.93.167.218.1186169374.squirrel@www.klxsystems.net> References: <47888.66.93.167.218.1186169374.squirrel@www.klxsystems.net> Message-ID: <46B383A8.7090601@redhat.com> karl at klxsystems.net wrote: >Hi, > >I am attempting to use an IBM RSA II (Remote Supervisor Adapter) and my >fencing device, as it is listed in the dropdown menu of supported devices >when you create and configure a fencing device. > >Is there a specific log I can tail to find out what's going on with my >fencing device and why it won't start? > No, but you could try running the agent from the command line and seeing what happens... /sbin/fence_rsa -a hostname_of_rsa_card -l login -p password -v (for verbose). You can see the fence_rsa man page for more info. fence_rsa has been a goo performer...I seem to recall a bug about not working with custom command prompts, but that has been fixed about everywhere I believe. >[root at ftp3 RPMS]# /etc/rc.d/init.d/cman start >Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... done > Starting daemons... done > Starting fencing... failed > Can you post your conf file? -j > > > From miksir at maker.ru Sat Aug 4 10:02:40 2007 From: miksir at maker.ru (Dmitriy MiksIr) Date: Sat, 04 Aug 2007 14:02:40 +0400 Subject: [Linux-cluster] rsync via GFS Message-ID: Hi! I want to sync nodes by run rsync from GFS shared storage to node's filesystem. But preparing of rsync filelist is very slow (due "stat" trouble?). Is any suggestions to sync nodes or tune rsync? From chris at cmiware.com Sun Aug 5 19:33:43 2007 From: chris at cmiware.com (Chris Harms) Date: Sun, 05 Aug 2007 14:33:43 -0500 Subject: [Linux-cluster] rgmanager 5.1 Beta Segfault Message-ID: <46B62617.8000104@cmiware.com> 5.1 Beta RPMs of cluster suite have so far resolved our dual fencing and self-reboot issues. The cluster was stopped and sat overnight, and when I started it via Conga today, one node came up fine, but the other logged the following: kernel: clurgmgrd[32233]: segfault at 0000000000000048 rip 0000000000419804 rsp 0000000040a00030 error 4 rgmanager-2.0.28-1.el5 cman-2.0.70-1.el5 Per Lon's instructions, I setup some configuration options to have it write a Core file. Stopping and restarting the cluster via Conga did not reproduce the segfault. Please advise on any additional information that would be helpful and where I should send the core file if necessary. Thanks, Chris From maciej.bogucki at artegence.com Mon Aug 6 09:05:49 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Mon, 06 Aug 2007 11:05:49 +0200 Subject: [Linux-cluster] rsync via GFS In-Reply-To: References: Message-ID: <46B6E46D.3090200@artegence.com> Dmitriy MiksIr napisa?(a): > Hi! I want to sync nodes by run rsync from GFS shared storage to node's > filesystem. But preparing of rsync filelist is very slow (due "stat" > trouble?). Is any suggestions to sync nodes or tune rsync? Hello, a) Disable quota, it will increase performance at least 2-3 times: gfs_tool settune /gfs quota_account 0 b) Rsync have to create a lock for each file so You could try to increase /proc/cluster/lock_dlm/drop_count c) Mount GFS filesystem with noatime flag. 
d) Create direcotries with small number of files - stat() will be faster Best Regards Maciej Bogucki From rainer at ultra-secure.de Mon Aug 6 10:37:13 2007 From: rainer at ultra-secure.de (Rainer Duffner) Date: Mon, 06 Aug 2007 12:37:13 +0200 Subject: [Linux-cluster] rsync via GFS In-Reply-To: <46B6E46D.3090200@artegence.com> References: <46B6E46D.3090200@artegence.com> Message-ID: <46B6F9D9.6080905@ultra-secure.de> Maciej Bogucki wrote: > Dmitriy MiksIr napisa?(a): > >> Hi! I want to sync nodes by run rsync from GFS shared storage to node's >> filesystem. But preparing of rsync filelist is very slow (due "stat" >> trouble?). Is any suggestions to sync nodes or tune rsync? >> > > Hello, > > a) Disable quota, it will increase performance at least 2-3 times: > gfs_tool settune /gfs quota_account 0 > b) Rsync have to create a lock for each file so You could try to > increase /proc/cluster/lock_dlm/drop_count > c) Mount GFS filesystem with noatime flag. > d) Create direcotries with small number of files - stat() will be faster > > e) make snapshot, only mount snapshot on to-sync-node (and with lock_nolock). http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html Disadvantage: only works for one node at a time. f) use NFS.... ;-) cheers, Rainer From miksir at maker.ru Mon Aug 6 10:39:39 2007 From: miksir at maker.ru (Dmitriy MiksIr) Date: Mon, 06 Aug 2007 14:39:39 +0400 Subject: [Linux-cluster] Re: rsync via GFS In-Reply-To: <46B6E46D.3090200@artegence.com> References: <46B6E46D.3090200@artegence.com> Message-ID: Great hint's for tune, thanks. But it's help not much - too many files (it's full tree of system with /usr /var etc). May be try to use some filesystem-in-file (put to gfs shared storage file and init some filesystem inside)? Is anyone try something like this? Maciej Bogucki ?????: > Dmitriy MiksIr napisa?(a): >> Hi! I want to sync nodes by run rsync from GFS shared storage to node's >> filesystem. But preparing of rsync filelist is very slow (due "stat" >> trouble?). Is any suggestions to sync nodes or tune rsync? > > Hello, > > a) Disable quota, it will increase performance at least 2-3 times: > gfs_tool settune /gfs quota_account 0 > b) Rsync have to create a lock for each file so You could try to > increase /proc/cluster/lock_dlm/drop_count > c) Mount GFS filesystem with noatime flag. > d) Create direcotries with small number of files - stat() will be faster > > Best Regards > Maciej Bogucki > From miksir at maker.ru Mon Aug 6 10:47:24 2007 From: miksir at maker.ru (Dmitriy MiksIr) Date: Mon, 06 Aug 2007 14:47:24 +0400 Subject: [Linux-cluster] Re: rsync via GFS In-Reply-To: <46B6F9D9.6080905@ultra-secure.de> References: <46B6E46D.3090200@artegence.com> <46B6F9D9.6080905@ultra-secure.de> Message-ID: Rainer Duffner ?????: >> > > e) make snapshot, only mount snapshot on to-sync-node (and with > lock_nolock). > > http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html > > Disadvantage: only works for one node at a time. > Can you give me some more links about snapshots? =) > f) use NFS.... 
;-) > > > cheers, > Rainer > From rainer at ultra-secure.de Mon Aug 6 11:29:28 2007 From: rainer at ultra-secure.de (Rainer Duffner) Date: Mon, 06 Aug 2007 13:29:28 +0200 Subject: [Linux-cluster] Re: rsync via GFS In-Reply-To: References: <46B6E46D.3090200@artegence.com> <46B6F9D9.6080905@ultra-secure.de> Message-ID: <46B70618.60006@ultra-secure.de> Dmitriy MiksIr wrote: > Rainer Duffner ?????: >>> >> >> e) make snapshot, only mount snapshot on to-sync-node (and with >> lock_nolock). >> >> http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html >> >> Disadvantage: only works for one node at a time. >> > > Can you give me some more links about snapshots? =) > Oh... Ahem... I forgot to say: this was meant to use the snapshot-facilities of your storage. I was assuming, you use some kind of SAN. What storage do you use? I don't see anything being mentioned in the original post. cheers, Rainer From miksir at maker.ru Mon Aug 6 11:45:34 2007 From: miksir at maker.ru (Dmitriy MiksIr) Date: Mon, 06 Aug 2007 15:45:34 +0400 Subject: [Linux-cluster] Re: rsync via GFS In-Reply-To: <46B70618.60006@ultra-secure.de> References: <46B6E46D.3090200@artegence.com> <46B6F9D9.6080905@ultra-secure.de> <46B70618.60006@ultra-secure.de> Message-ID: Rainer Duffner ?????: > Dmitriy MiksIr wrote: >> Rainer Duffner ?????: >>>> >>> e) make snapshot, only mount snapshot on to-sync-node (and with >>> lock_nolock). >>> >>> http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html >>> >>> Disadvantage: only works for one node at a time. >>> >> Can you give me some more links about snapshots? =) >> > > Oh... > Ahem... > I forgot to say: this was meant to use the snapshot-facilities of your > storage. > I was assuming, you use some kind of SAN. > What storage do you use? I don't see anything being mentioned in the > original post. Linux from kernel.org 2.6.19.2 FC storage (HP MSA1500) LVM2 Cluster package 1.04 > > > > > cheers, > Rainer > From rainer at ultra-secure.de Mon Aug 6 12:52:38 2007 From: rainer at ultra-secure.de (Rainer Duffner) Date: Mon, 06 Aug 2007 14:52:38 +0200 Subject: [Linux-cluster] Re: rsync via GFS In-Reply-To: References: <46B6E46D.3090200@artegence.com> <46B6F9D9.6080905@ultra-secure.de> <46B70618.60006@ultra-secure.de> Message-ID: <46B71996.30500@ultra-secure.de> Dmitriy MiksIr wrote: > Rainer Duffner ?????: >> Dmitriy MiksIr wrote: >>> Rainer Duffner ?????: >>>>> >>>> e) make snapshot, only mount snapshot on to-sync-node (and with >>>> lock_nolock). >>>> >>>> http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html >>>> >>>> >>>> Disadvantage: only works for one node at a time. >>>> >>> Can you give me some more links about snapshots? =) >>> >> >> Oh... >> Ahem... >> I forgot to say: this was meant to use the snapshot-facilities of your >> storage. >> I was assuming, you use some kind of SAN. >> What storage do you use? I don't see anything being mentioned in the >> original post. > > Linux from kernel.org 2.6.19.2 > FC storage (HP MSA1500) Does that do hardware-snapshots? 
cheers, Rainer From miksir at maker.ru Mon Aug 6 13:18:08 2007 From: miksir at maker.ru (Dmitriy MiksIr) Date: Mon, 06 Aug 2007 17:18:08 +0400 Subject: [Linux-cluster] Re: rsync via GFS In-Reply-To: <46B71996.30500@ultra-secure.de> References: <46B6E46D.3090200@artegence.com> <46B6F9D9.6080905@ultra-secure.de> <46B70618.60006@ultra-secure.de> <46B71996.30500@ultra-secure.de> Message-ID: In my opinion, no =( I think, two ways for me - create special partition for sync on storage or in existing gfs partition create file and map it as block device with direct io access... Rainer Duffner ?????: > > > Does that do hardware-snapshots? From sys.mailing at gmail.com Mon Aug 6 13:31:16 2007 From: sys.mailing at gmail.com (Bjorn Oglefjorn) Date: Mon, 6 Aug 2007 09:31:16 -0400 Subject: [Linux-cluster] nofailback for failover domains? Message-ID: <926ab61b0708060631h5aa3ef0fu2fe183e80df752f6@mail.gmail.com> I found that a 'nofailback' option was added for the section of the conf. I can't find any reference to 'nofailback' in any RHCS doc I can find. I'm guessing it should look like this: ... Can someone confirm? I will attempt to confirm this myself and will report back when I know for sure, but it seems to behave as I would expect. Thanks, BO -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomas at mainloop.se Mon Aug 6 14:58:05 2007 From: thomas at mainloop.se (Thomas Althoff) Date: Mon, 6 Aug 2007 16:58:05 +0200 Subject: [Linux-cluster] rsync via GFS In-Reply-To: <46B6E46D.3090200@artegence.com> References: <46B6E46D.3090200@artegence.com> Message-ID: <6788CB42B92B3449B633BEBC173D9BB7884A07@spoke.intranet.mainloop.net> > b) Rsync have to create a lock for each file so You could try to increase /proc/cluster/lock_dlm/drop_count How ? I'm running RHEL5, with GFS (not GFS2) and lock_dlm. I don't see /proc/cluster on my servers. -Thomas From jmaddox at stetson.edu Mon Aug 6 17:17:26 2007 From: jmaddox at stetson.edu (John Maddox) Date: Mon, 6 Aug 2007 13:17:26 -0400 Subject: [Linux-cluster] Capturing ricci errors Message-ID: Hello - still new to RH Cluster - trying to set up a small, 2 node cluster and frequently get "A ricci error has occurred ... etc you will be redirected" - but I never get to see what the error was. Is ricci logging these errors anywhere? I don't see them in '/var/log/messages', nor in luci's log. Thanks in advance. John -------------- next part -------------- An HTML attachment was scrubbed... URL: From mgrac at redhat.com Tue Aug 7 13:39:26 2007 From: mgrac at redhat.com (Marek 'marx' Grac) Date: Tue, 07 Aug 2007 15:39:26 +0200 Subject: [Linux-cluster] How to add start-up options to pulse ?? In-Reply-To: <526805.31332.qm@web50612.mail.re2.yahoo.com> References: <526805.31332.qm@web50612.mail.re2.yahoo.com> Message-ID: <46B8760E.7030404@redhat.com> Hi, ? wrote: > not dificult but "lot" of work and not so important > patch so, I guess, this will not be a priority for > redhat ... > > will you appreciate a patch for all this ? I can do > it, in fact I will, again rhel5 version > I will appreciate it and you will find it in next update :) marx, -- Marek Grac Red Hat Czech s.r.o. From orkcu at yahoo.com Tue Aug 7 16:06:02 2007 From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=) Date: Tue, 7 Aug 2007 09:06:02 -0700 (PDT) Subject: [Linux-cluster] How to add start-up options to pulse ?? In-Reply-To: <46B8760E.7030404@redhat.com> Message-ID: <38991.27701.qm@web50607.mail.re2.yahoo.com> --- Marek 'marx' Grac wrote: > Hi, > > ??? 
wrote: > > not dificult but "lot" of work and not so > important > > patch so, I guess, this will not be a priority for > > redhat ... > > > > will you appreciate a patch for all this ? I can > do > > it, in fact I will, again rhel5 version > > > I will appreciate it and you will find it in next > update :) > ok, here is the RFE bugzilla entry: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250888 although very simple patch, not big deal ... thanks roger __________________________________________ RedHat Certified ( RHCE ) Cisco Certified ( CCNA & CCDA ) ____________________________________________________________________________________ Sick sense of humor? Visit Yahoo! TV's Comedy with an Edge to see what's on, when. http://tv.yahoo.com/collections/222 From storm at elemental.it Tue Aug 7 18:15:46 2007 From: storm at elemental.it (St0rM) Date: Tue, 07 Aug 2007 20:15:46 +0200 Subject: [Linux-cluster] Using mysql cluster with GFS on RHEL 4 In-Reply-To: <2CC2091A-BBAC-4E9C-98BC-DE3EAA82646B@engineyard.com> References: <433093DF7AD7444DA65EFAFE3987879C2454BA@jellyfish.highlyscyld.com> <2CC2091A-BBAC-4E9C-98BC-DE3EAA82646B@engineyard.com> Message-ID: <46B8B6D2.5060509@elemental.it> Excuse me if i drag this dead (mail) body from the water of november 2006... >> So with myisam tables I can do active/active on the same database >> with shared data? Or is it the inram database that is shared-nothing? > That is correct, with MyISAM tables, you can have active/active on GFS > storage. ... And if I use InnoDB? What happen if I have two separate servers connecter with a SCSI storage, using GFS, having the two MySQL server using a datadir on a shared mounted partition residing on the storage ? -- St0rM -----BEGIN GEEK CODE BLOCK----- Version: 3.1 GIT d-() s:+>: a- C++(++++) UL++++$ P+ L++++$ E- W+++$ N- o+ K w--() !O !M>+ !V PS+ PE Y+(++) PGP>+ t+ 5?>+ X++ R++ tv-- b+ DI+++ D+ G+ e* h--- r++ y+++ ------END GEEK CODE BLOCK------ "There are only 10 types of people in the world: Those who understand binary, and those who don't" From nattaponv at hotmail.com Wed Aug 8 06:00:55 2007 From: nattaponv at hotmail.com (nattapon viroonsri) Date: Wed, 08 Aug 2007 06:00:55 +0000 Subject: [Linux-cluster] Missed too many heartbeats Message-ID: OS: RHEL4 Update 4 Kernel: 2.6.9-42.ELsmp Cluster: RhCS4 Update4, RHGFS4 U4(GFS-6.1.6-1) Multipath: EMCpower.LINUX-4.5.1-022 Storage: Fibre channel with EMC CX-320 Fence Device: DELL DRAC5 Service: Postfix, Courier-imap nodeA.example.com: 192.168.0.20 nodeB.example.com: 192.168.0.60 Drac5(nodeA): 192.168.0.121 Drac5(nodeB); 192.168.0.161 I have 2 node using gfs cluster and powerpath connect through fibre to EMC-CX-320 Storage. both node use drac5 as fence device Heartbeat traffice use same interface as normal traffic(Mail,imap/pop3) Problem is only NodeB alway fenced NodeA with reason "Missed too many heartbeats" After NodeA was rebooted system can join cluster again and working fine until nodeB start fence again, May be 4-5 hour or 6-7 hour later. This happen in random manner 2-3 time per day Memory,Cpu,i/o look good and Traffice not peak during problem have occured (from sar, and mrtg) no drop, no collision from ifconfig command In logfile show same messages every time nodeB start fenced NodeA I try to extend heartbeat interval by change "deadnode_timeout" from 21 to 61 but doesn't help Have anyway to solve this problem or enable more debuging ? Do i have to dedicate network card to separte heartbeat and normal traffic ? 
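(A sketch of the usual way to take cman heartbeats off the service interface on RHEL4: give each node a second hostname bound to a dedicated NIC or VLAN and use those names as the clusternode names in cluster.conf. The addresses and names below are examples, and the last line simply re-applies the deadnode_timeout value you already tested, assuming the tunable lives in its usual /proc location:)

# private heartbeat addresses (examples); use these names in
# the <clusternode name="..."> entries instead of the public ones
cat >> /etc/hosts <<'EOF'
10.0.0.20  nodea-hb
10.0.0.60  nodeb-hb
EOF
# cman_tool status shows the name/address cman is currently using
cman_tool status
# keep the larger timeout across reboots
echo 'echo 61 > /proc/cluster/config/cman/deadnode_timeout' >> /etc/rc.d/rc.local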
###### /var/log/message Aug 7 21:50:06 nodeB kernel: CMAN: removing node nodeA.example.com from the cluster : Missed too many heartbeats Aug 7 21:50:06 nodeB fenced[20770]: nodeA.example.com not a cluster member after 0 sec post_fail_delay Aug 7 21:50:06 nodeB fenced[20770]: fencing node "nodeA.example.com" Aug 7 21:50:15 nodeB fenced[20770]: fence "nodeA.example.com" success Aug 7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Trying to acquire journal lock... Aug 7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Looking at journal... Aug 7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Done Aug 7 21:53:36 nodeB kernel: CMAN: node nodeA.example.com rejoining ###### /etc/cluster/cluster.conf ################ ##################################################### Regards, Nattapon _________________________________________________________________ FREE pop-up blocking with the new MSN Toolbar - get it now! http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/ From maciej.bogucki at artegence.com Thu Aug 9 13:49:33 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Thu, 09 Aug 2007 15:49:33 +0200 Subject: [Linux-cluster] Using mysql cluster with GFS on RHEL 4 In-Reply-To: <46B8B6D2.5060509@elemental.it> References: <433093DF7AD7444DA65EFAFE3987879C2454BA@jellyfish.highlyscyld.com> <2CC2091A-BBAC-4E9C-98BC-DE3EAA82646B@engineyard.com> <46B8B6D2.5060509@elemental.it> Message-ID: <46BB1B6D.6020008@artegence.com> St0rM napisa?(a): > Excuse me if i drag this dead (mail) body from the water of november 2006... > >>> So with myisam tables I can do active/active on the same database >>> with shared data? Or is it the inram database that is shared-nothing? > >> That is correct, with MyISAM tables, you can have active/active on GFS >> storage. Did somebody test it in production? > ... And if I use InnoDB? > > What happen if I have two separate servers connecter with a SCSI > storage, using GFS, having the two MySQL server using a datadir on a > shared mounted partition residing on the storage ? It will not work, because InnoDB isn't impemented to be clustered storage engine. Please check http://www.mysql.com/products/cluster/ with NDB engine stored in RAM or wait for stable MySQL 5.1 which from version 5.1.6[1] supports "Cluster Disk Data Tables" [1] - http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-disk-data.html Best Regards Maciej Bogucki From lanthier.stephanie at uqam.ca Thu Aug 9 18:13:37 2007 From: lanthier.stephanie at uqam.ca (=?UTF-8?B?TGFudGhpZXIsIFN0w6lwaGFuaWU=?=) Date: Thu, 9 Aug 2007 14:13:37 -0400 Subject: [Linux-cluster] Add a fence device of type SUN ILOM Message-ID: Dear list members, I have in production a RHCS cluster composed of three RHEL4u5 nodes that use GFS. Initially, I first put no fence device on the nodes. I just defined a manual fence device without associating it to the nodes. As the GFS file system is not accessible when I'm rebooting one of the three nodes, I'm realizing the importance of fence devices. I just defined manual fence devices for the three nodes, but I read that manual fence device is not a good idea for production environment. My machines are SUN Fire X4100. I see that we can define a fence device of type HP ILO. I would like to know if I can use the HP ILO form in system-config-cluster tool to enter and use a SUN ILOM as fence device? If so, do you have any points I should pay attention for when I will define them? 
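(One avenue worth testing before anything else: the X4100 service processor speaks IPMI, so the generic fence_ipmilan agent may work without touching the ILO form at all. The address, login and password below are placeholders; try it by hand first, and note that the second command really power-cycles the target:)

# confirm the ILOM answers IPMI over LAN at all (lanplus = IPMI 2.0)
ipmitool -I lanplus -H 192.168.1.50 -U root -P secret chassis power status
# then test the fence agent itself with the same credentials
/sbin/fence_ipmilan -a 192.168.1.50 -l root -p secret -o reboot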
I recall that I'm working on a production environment and I'm scary to put things worst that they already are. Thank you very much __________________ St?phanie Lanthier Analyste de l'informatique Universit? du Qu?bec ? Montr?al Service de l'informatique et des t?l?communications lanthier.stephanie at uqam.ca T?l?phone : 514-987-3000 poste 6106 Bureau : PK-M535 -------------- next part -------------- An HTML attachment was scrubbed... URL: From brad at bradandkim.net Thu Aug 9 18:36:00 2007 From: brad at bradandkim.net (brad at bradandkim.net) Date: Thu, 9 Aug 2007 13:36:00 -0500 (CDT) Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: References: Message-ID: <52540.129.237.174.144.1186684560.squirrel@webmail.bradandkim.net> > Dear list members, > > I have in production a RHCS cluster composed of three RHEL4u5 nodes that > use GFS. Initially, I first put no fence device on the nodes. I just > defined a manual fence device without associating it to the nodes. > > As the GFS file system is not accessible when I'm rebooting one of the > three nodes, I'm realizing the importance of fence devices. > > I just defined manual fence devices for the three nodes, but I read that > manual fence device is not a good idea for production environment. > > My machines are SUN Fire X4100. I see that we can define a fence device of > type HP ILO. I would like to know if I can use the HP ILO form in > system-config-cluster tool to enter and use a SUN ILOM as fence device? > > If so, do you have any points I should pay attention for when I will > define them? I recall that I'm working on a production environment and I'm > scary to put things worst that they already are. > > Thank you very much > > __________________ > > St??phanie Lanthier I run a mix of SUNFire X4100's and X4600's and am currently testing a cluster setup with them. Though I have not fully tested it yet, I am planning on trying ipmi_lan as the fence device since the cards support IPMI. I can let you know how it works out. Thanks, Brad Crotchett brad at bradandkim.net http://www.bradandkim.net From jparsons at redhat.com Thu Aug 9 20:01:32 2007 From: jparsons at redhat.com (jim parsons) Date: Thu, 09 Aug 2007 16:01:32 -0400 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: References: Message-ID: <1186689692.3002.12.camel@localhost.localdomain> On Thu, 2007-08-09 at 14:13 -0400, Lanthier, St?phanie wrote: > ? > Dear list members, > > I have in production a RHCS cluster composed of three RHEL4u5 nodes > that use GFS. Initially, I first put no fence device on the nodes. I > just defined a manual fence device without associating it to the > nodes. > > As the GFS file system is not accessible when I'm rebooting one of the > three nodes, I'm realizing the importance of fence devices. > > I just defined manual fence devices for the three nodes, but I read > that manual fence device is not a good idea for production > environment. You need to run the fence_ack_manual script after fencing...it is really a pain, and DEF not anything to use for production. > > My machines are SUN Fire X4100. I see that we can define a fence > device of type HP ILO. I would like to know if I can use the HP ILO > form in system-config-cluster tool to enter and use a SUN ILOM as > fence device? Know, please, that system-config-cluster is just a front-end editor for the /etc/cluster/cluster.conf file. 
It takes your fence form values and inserts the values in the proper format in the file and then calls the methods to update the cluster with the new file. I do not know the params needed for the SUN ILOM, but I doubt very much that the fence_ilo agent would do the correct thing. It would be easy to find out, though. Man fence_ilo and see the params needed, and then run the agent from the command line (/sbin/fence_ilo -a System.ILOM.To.Reboot.Now -l login -p passwd ...and see what happens...I kind of doubt it will work :/ How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM? If so, there might be a way...by hacking on another agent. Adding an agent is really not too big of a deal, if you are handy with scripting. > -J > From dist-list at LEXUM.UMontreal.CA Thu Aug 9 22:03:58 2007 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Thu, 09 Aug 2007 18:03:58 -0400 Subject: [Linux-cluster] add journal to GFS Message-ID: <46BB8F4E.7070203@lexum.umontreal.ca> Hello, I have to add more journals to add a new nodes. SO What I did : create a new LUN add it to lvm usign : lvcreate vgextend lvextend after that I use gfs_grow now the GFS is 150 GB bigger BUT gfs_jadd say that did not have space to add journals What did I do wrong ? Tx From mathieu.avila at seanodes.com Fri Aug 10 06:59:26 2007 From: mathieu.avila at seanodes.com (Mathieu Avila) Date: Fri, 10 Aug 2007 08:59:26 +0200 Subject: [Linux-cluster] add journal to GFS In-Reply-To: <46BB8F4E.7070203@lexum.umontreal.ca> References: <46BB8F4E.7070203@lexum.umontreal.ca> Message-ID: <20070810085926.1a4a13d1@mathieu.toulouse> Le Thu, 09 Aug 2007 18:03:58 -0400, FM a ?crit : > Hello, > I have to add more journals to add a new nodes. > SO What I did : > create a new LUN > add it to lvm usign : > lvcreate > vgextend > lvextend > > after that I use gfs_grow > now the GFS is 150 GB bigger BUT > gfs_jadd say that did not have space to add journals > > You have used the new space for normal data or meta-data, so that it isn't available anymore. "gfs_jadd" doesn't use the free space of a file system, it needs to be fed up with new space, just like gfs_grow works. As your file system cannot be shrinked, the only solution i see is to add space one more time (much less, 150G is very much unless you want to 100+ nodes), and run gfs_jadd again. Cluster team, please correct me if i'm wrong. -- Mathieu From maciej.bogucki at artegence.com Fri Aug 10 07:32:51 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Fri, 10 Aug 2007 09:32:51 +0200 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: References: Message-ID: <46BC14A3.70303@artegence.com> > As the GFS file system is not accessible when I'm rebooting one of the > three nodes, I'm realizing the importance of fence devices. It is strange to me, because if You have rc scripts and quorum properly configured, and when You perform reboot of one node Your GFS filesystem should be accesible all time. Best Regards Maciej Bogucki From maciej.bogucki at artegence.com Fri Aug 10 09:30:16 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Fri, 10 Aug 2007 11:30:16 +0200 Subject: [Linux-cluster] RH4U5 and SCSI-3 Persistent reservation Message-ID: <46BC3028.2090507@artegence.com> Hello, I want to run fencing based on SCSI-3 persisten reservation protocol. I have two servers with GFS filesystem and I want to write to all of them at one time. 
But when I start scsi_reserve on the second node I get: ---cut--- Aug 10 11:26:08 host2 scsi_reserve: register of device /dev/sdb1 succeeded Aug 10 11:26:08 host2 kernel: scsi0 (0,0,1) : reservation conflict Aug 10 11:26:08 host2 kernel: 492 [RAIDarray.mpp]Array_Module_0:1:0:1 IO FAILURE. vcmnd SN 35975 pdev H0:C0:T0:L1 0x00/0x00/0x00 0x00000018 mpp_status:3 Aug 10 11:26:08 host2 kernel: scsi2 (0,0,1) : reservation conflict Aug 10 11:26:08 host2 kernel: SCSI error : <2 0 0 1> return code = 0x18 ---cut--- So I think that RH4U5 only supports SCSI-2 persistent reservations. But here [1] we can read that RH4U5 support for SCSI-3 persistent group reservations. SCSI-3 is a group reservation: every node has a key on a dedicated area on the disk and when a node has to leave, another node will just kick off its key. So it is what I'm looking for. [1] - http://www.desktoplinux.com/news/NS3524659857.html Best Regards Maciej Bogucki From beres.laszlo at sys-admin.hu Fri Aug 10 10:01:32 2007 From: beres.laszlo at sys-admin.hu (BERES Laszlo) Date: Fri, 10 Aug 2007 12:01:32 +0200 Subject: [Linux-cluster] Updating fence scripts Message-ID: <46BC377C.9060109@sys-admin.hu> Hello, just a silly question: if I update fence package, do I have to restart fenced? Does it affect GFS? I've never done this before, just updated the whole cluster. Thanks, -- B?RES L?szl? RHCE, RHCX senior IT engineer, trainer From maciej.bogucki at artegence.com Fri Aug 10 10:23:51 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Fri, 10 Aug 2007 12:23:51 +0200 Subject: [Linux-cluster] Updating fence scripts In-Reply-To: <46BC377C.9060109@sys-admin.hu> References: <46BC377C.9060109@sys-admin.hu> Message-ID: <46BC3CB7.5010909@artegence.com> > just a silly question: if I update fence package, do I have to restart > fenced? Does it affect GFS? I've never done this before, just updated > the whole cluster. Hello, For production clusters: 1. do tests in testing environment 2. install in production 3. do tests in production(at night with sheduled downtime if needed). If You doesn't have test environment then I suggest You to do tests in production to check if everything is working as You expected. But, if there is no change in fence_xxx script You don't need to do fence testing. Best Regards Maciej Bogucki From bernard.chew at muvee.com Fri Aug 10 11:56:04 2007 From: bernard.chew at muvee.com (Bernard Chew) Date: Fri, 10 Aug 2007 19:56:04 +0800 Subject: [Linux-cluster] Using cmirror Message-ID: <1186746964.16863.8.camel@ws-berd.sg.muvee.net> Hi, I read that cmirror provides user-level utilities for managing cluster mirroring but could not find much documentation on it. Can anyone point me to any documentation / guide around? Regards, Bernard Chew IT Operations From beres.laszlo at sys-admin.hu Fri Aug 10 11:58:39 2007 From: beres.laszlo at sys-admin.hu (BERES Laszlo) Date: Fri, 10 Aug 2007 13:58:39 +0200 Subject: [Linux-cluster] Updating fence scripts In-Reply-To: <46BC3CB7.5010909@artegence.com> References: <46BC377C.9060109@sys-admin.hu> <46BC3CB7.5010909@artegence.com> Message-ID: <46BC52EF.7020903@sys-admin.hu> Maciej Bogucki wrote: > If You doesn't have test environment then I suggest You to do tests in > production to check if everything is working as You expected. Thanks, but my question was somehow explicit about fencing :) > But, if there is no change in fence_xxx script You don't need to do > fence testing. There is a change in fence_ilo script, I have to upgrade it. -- B?RES L?szl? 
RHCE, RHCX senior IT engineer, trainer From maciej.bogucki at artegence.com Fri Aug 10 12:05:03 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Fri, 10 Aug 2007 14:05:03 +0200 Subject: [Linux-cluster] Updating fence scripts In-Reply-To: <46BC52EF.7020903@sys-admin.hu> References: <46BC377C.9060109@sys-admin.hu> <46BC3CB7.5010909@artegence.com> <46BC52EF.7020903@sys-admin.hu> Message-ID: <46BC546F.4040901@artegence.com> >> But, if there is no change in fence_xxx script You don't need to do >> fence testing. > > There is a change in fence_ilo script, I have to upgrade it. Hello, If there are changes in fence_ilo I suggest You to perform testing. Best Regards Maciej Bogucki From beres.laszlo at sys-admin.hu Fri Aug 10 12:07:12 2007 From: beres.laszlo at sys-admin.hu (BERES Laszlo) Date: Fri, 10 Aug 2007 14:07:12 +0200 Subject: [Linux-cluster] Updating fence scripts In-Reply-To: <46BC546F.4040901@artegence.com> References: <46BC377C.9060109@sys-admin.hu> <46BC3CB7.5010909@artegence.com> <46BC52EF.7020903@sys-admin.hu> <46BC546F.4040901@artegence.com> Message-ID: <46BC54F0.7000804@sys-admin.hu> Maciej Bogucki wrote: > If there are changes in fence_ilo I suggest You to perform testing. Unfortunately we don't have test systems, only a productive one. -- B?RES L?szl? RHCE, RHCX senior IT engineer, trainer From maciej.bogucki at artegence.com Fri Aug 10 12:31:33 2007 From: maciej.bogucki at artegence.com (Maciej Bogucki) Date: Fri, 10 Aug 2007 14:31:33 +0200 Subject: [Linux-cluster] Updating fence scripts In-Reply-To: <46BC54F0.7000804@sys-admin.hu> References: <46BC377C.9060109@sys-admin.hu> <46BC3CB7.5010909@artegence.com> <46BC52EF.7020903@sys-admin.hu> <46BC546F.4040901@artegence.com> <46BC54F0.7000804@sys-admin.hu> Message-ID: <46BC5AA5.4010700@artegence.com> >> If there are changes in fence_ilo I suggest You to perform testing. > > Unfortunately we don't have test systems, only a productive one. Hello, So I suggest You to plan scheduled downtime at night and test fencing in production. Best Regards Maciej Bogucki From dist-list at LEXUM.UMontreal.CA Fri Aug 10 13:58:13 2007 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Fri, 10 Aug 2007 09:58:13 -0400 Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error Message-ID: <46BC6EF5.2010500@lexum.umontreal.ca> Hello, All servers are RHEL 4.5 SAN is HP EVA 4000 we are using linux qla modules and multipathd cluster server have only one FC Card In the dmesg of servers connected to GFS we have a lot of : SCSI error : <0 0 1 1> return code = 0x20000 end_request: I/O error, dev sdd, sector 37807111 The cluster seems to work fine but I'd like to know if we can avoid this error. 
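(The 0x20000 in that return code is the host byte, meaning the HBA reported the command as bus-busy rather than a media error. A quick check, with the log path as on a stock RHEL4 install, is to see whether multipathd fails and reinstates a path each time it shows up:)

egrep "SCSI error|I/O error|multipathd" /var/log/messages | tail -50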
here is a multipathd -ll output : [root at como ~]# multipath -ll mpath1 (3600508b4001051e40000900000310000) [size=500 GB][features="1 queue_if_no_path"][hwhandler="0"] \_ round-robin 0 [prio=50][active] \_ 0:0:0:1 sda 8:0 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 0:0:1:1 sdd 8:48 [active][ready] mpath3 (3600508b4001051e400009000009e0000) [size=150 GB][features="1 queue_if_no_path"][hwhandler="0"] \_ round-robin 0 [prio=50][active] \_ 0:0:1:2 sde 8:64 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 0:0:0:2 sdb 8:16 [active][ready] and the device in the multipath.conf devices { device { vendor "HP " product "HSV200 " path_grouping_policy group_by_prio getuid_callout "/sbin/scsi_id -g -u -s /block/%n" path_checker tur path_selector "round-robin 0" prio_callout "/sbin/mpath_prio_alua %d" failback immediate no_path_retry 60 } } From simone.gotti at email.it Fri Aug 10 14:14:22 2007 From: simone.gotti at email.it (Simone Gotti) Date: Fri, 10 Aug 2007 16:14:22 +0200 Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error In-Reply-To: <46BC6EF5.2010500@lexum.umontreal.ca> References: <46BC6EF5.2010500@lexum.umontreal.ca> Message-ID: <1186755262.6117.11.camel@localhost> Hi, I saw various machines with Qlogic HBAs having this issue (error code 0x20000 is DID_BUS_BUSY), in my case when using device mapper multipath, the path getting the error was failed by dm-multipath and then reactived because the path checker reported it was up (as it was transient error). It looks like a wrong qla2xxx behavior as reported in this knowledge base: http://kbase.redhat.com/faq/FAQ_46_9001.shtm and also in bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231319 where there's a proposed fix for RHEL4 U6. I tested the workaround proposed in the kbase in a test environment where unfortunately this issue wasn't present and I simulated it forcing an HBA lip with sysfs but with this test the problem didn't disappeared. Maybe your issue is the same. Bye! On Fri, 2007-08-10 at 09:58 -0400, FM wrote: > Hello, > All servers are RHEL 4.5 > SAN is HP EVA 4000 > we are using linux qla modules and multipathd > cluster server have only one FC Card > > > In the dmesg of servers connected to GFS we have a lot of : > SCSI error : <0 0 1 1> return code = 0x20000 > end_request: I/O error, dev sdd, sector 37807111 > > The cluster seems to work fine but I'd like to know if we can avoid this > error. 
> > here is a multipathd -ll output : > > [root at como ~]# multipath -ll > mpath1 (3600508b4001051e40000900000310000) > [size=500 GB][features="1 queue_if_no_path"][hwhandler="0"] > \_ round-robin 0 [prio=50][active] > \_ 0:0:0:1 sda 8:0 [active][ready] > \_ round-robin 0 [prio=10][enabled] > \_ 0:0:1:1 sdd 8:48 [active][ready] > > mpath3 (3600508b4001051e400009000009e0000) > [size=150 GB][features="1 queue_if_no_path"][hwhandler="0"] > \_ round-robin 0 [prio=50][active] > \_ 0:0:1:2 sde 8:64 [active][ready] > \_ round-robin 0 [prio=10][enabled] > \_ 0:0:0:2 sdb 8:16 [active][ready] > > > > and the device in the multipath.conf > > devices { > device { > vendor "HP " > product "HSV200 " > path_grouping_policy group_by_prio > getuid_callout "/sbin/scsi_id -g -u -s /block/%n" > path_checker tur > path_selector "round-robin 0" > prio_callout "/sbin/mpath_prio_alua %d" > failback immediate > no_path_retry 60 > } > } > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Simone Gotti -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From jbrassow at redhat.com Fri Aug 10 15:01:29 2007 From: jbrassow at redhat.com (Jonathan Brassow) Date: Fri, 10 Aug 2007 10:01:29 -0500 Subject: [Linux-cluster] Using cmirror In-Reply-To: <1186746964.16863.8.camel@ws-berd.sg.muvee.net> References: <1186746964.16863.8.camel@ws-berd.sg.muvee.net> Message-ID: <8F3D9B0A-3ABE-4636-A27F-B1120DBC85F2@redhat.com> If you've set up a cluster and are using LVM, it will work the same way as single machine mirroring. http://www.redhat.com/docs/manuals/csgfs/browse/4.5/ SAC_Cluster_Logical_Volume_Manager/mirror_create.html brassow On Aug 10, 2007, at 6:56 AM, Bernard Chew wrote: > Hi, > > I read that cmirror provides user-level utilities for managing cluster > mirroring but could not find much documentation on it. Can anyone > point > me to any documentation / guide around? > > Regards, > Bernard Chew > IT Operations > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From Quentin.Arce at Sun.COM Fri Aug 10 15:12:48 2007 From: Quentin.Arce at Sun.COM (Quentin Arce) Date: Fri, 10 Aug 2007 08:12:48 -0700 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <1186689692.3002.12.camel@localhost.localdomain> References: <1186689692.3002.12.camel@localhost.localdomain> Message-ID: <46BC8070.3070400@Sun.Com> jim parsons wrote: > On Thu, 2007-08-09 at 14:13 -0400, Lanthier, St?phanie wrote: > >> ? >> Dear list members, >> >> I have in production a RHCS cluster composed of three RHEL4u5 nodes >> that use GFS. Initially, I first put no fence device on the nodes. I >> just defined a manual fence device without associating it to the >> nodes. >> >> As the GFS file system is not accessible when I'm rebooting one of the >> three nodes, I'm realizing the importance of fence devices. >> >> I just defined manual fence devices for the three nodes, but I read >> that manual fence device is not a good idea for production >> environment. >> > You need to run the fence_ack_manual script after fencing...it is really > a pain, and DEF not anything to use for production. > >> >> My machines are SUN Fire X4100. I see that we can define a fence >> device of type HP ILO. 
I would like to know if I can use the HP ILO >> form in system-config-cluster tool to enter and use a SUN ILOM as >> fence device? >> > Know, please, that system-config-cluster is just a front-end editor for > the /etc/cluster/cluster.conf file. It takes your fence form values and > inserts the values in the proper format in the file and then calls the > methods to update the cluster with the new file. > > I do not know the params needed for the SUN ILOM, but I doubt very much > that the fence_ilo agent would do the correct thing. It would be easy to > find out, though. Man fence_ilo and see the params needed, and then run > the agent from the command line (/sbin/fence_ilo -a > System.ILOM.To.Reboot.Now -l login -p passwd > ...and see what happens...I kind of doubt it will work :/ > How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM? > If so, there might be a way...by hacking on another agent. > So, I'm a lurker on this list as I no longer have a cluster up... but I work on ILOM and I would love to see this work. This isn't official support, I'm a developer not a customer support person. So, it's more on my time. If there is anything I can do... Please let me know. Questions on this problem, regarding what ILOM can / can't do, how to check state of the server via ILOM, etc. Thanks, Quentin > Adding an agent is really not too big of a deal, if you are handy with > scripting. > >> >> > -J > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From Quentin.Arce at Sun.COM Fri Aug 10 15:28:23 2007 From: Quentin.Arce at Sun.COM (Quentin Arce) Date: Fri, 10 Aug 2007 08:28:23 -0700 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <1186689692.3002.12.camel@localhost.localdomain> References: <1186689692.3002.12.camel@localhost.localdomain> Message-ID: <46BC8417.6010201@Sun.Com> jim parsons wrote: > On Thu, 2007-08-09 at 14:13 -0400, Lanthier, St?phanie wrote: > >> ? >> Dear list members, >> >> I have in production a RHCS cluster composed of three RHEL4u5 nodes >> that use GFS. Initially, I first put no fence device on the nodes. I >> just defined a manual fence device without associating it to the >> nodes. >> >> As the GFS file system is not accessible when I'm rebooting one of the >> three nodes, I'm realizing the importance of fence devices. >> >> I just defined manual fence devices for the three nodes, but I read >> that manual fence device is not a good idea for production >> environment. >> > You need to run the fence_ack_manual script after fencing...it is really > a pain, and DEF not anything to use for production. > >> >> My machines are SUN Fire X4100. I see that we can define a fence >> device of type HP ILO. I would like to know if I can use the HP ILO >> form in system-config-cluster tool to enter and use a SUN ILOM as >> fence device? >> > Know, please, that system-config-cluster is just a front-end editor for > the /etc/cluster/cluster.conf file. It takes your fence form values and > inserts the values in the proper format in the file and then calls the > methods to update the cluster with the new file. > > I do not know the params needed for the SUN ILOM, but I doubt very much > that the fence_ilo agent would do the correct thing. It would be easy to > find out, though. 
Man fence_ilo and see the params needed, and then run > the agent from the command line (/sbin/fence_ilo -a > System.ILOM.To.Reboot.Now -l login -p passwd > ...and see what happens...I kind of doubt it will work :/ > Oh, forgot to answer.... > How does ILOM work? telnet or ssh? ssh, no telnet. > Is there an snmp interface to ILOM? > Yes and IPMI You just need the mib file for snmp. It should be on the public download site. If it's not I'll find out where it's published. The mib files you need are, SUN-ILOM-CONTROL-MIB.mib and SUN-PLATFORM-MIB.mib see: http://www.sun.com/products-n-solutions/hardware/docs/html/820-0280-12/snmp_using.html#50491426_94813 > If so, there might be a way...by hacking on another agent. > > Adding an agent is really not too big of a deal, if you are handy with > scripting. > >> >> > -J > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From bernard.chew at muvee.com Fri Aug 10 15:49:47 2007 From: bernard.chew at muvee.com (Bernard Chew) Date: Fri, 10 Aug 2007 23:49:47 +0800 Subject: [Linux-cluster] Using cmirror References: <1186746964.16863.8.camel@ws-berd.sg.muvee.net> <8F3D9B0A-3ABE-4636-A27F-B1120DBC85F2@redhat.com> Message-ID: <229C73600EB0E54DA818AB599482BCE951EB01@shadowfax.sg.muvee.net> Hi, Can this work with GFS where I have 2 iscsi disks (ie. /dev/sda & /dev/sdb) from 2 different iscsi-target servers and I create a mirrored GFS volume? Regards, Bernard Chew -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Jonathan Brassow Sent: Fri 8/10/2007 11:01 PM To: linux clustering Subject: Re: [Linux-cluster] Using cmirror If you've set up a cluster and are using LVM, it will work the same way as single machine mirroring. http://www.redhat.com/docs/manuals/csgfs/browse/4.5/ SAC_Cluster_Logical_Volume_Manager/mirror_create.html brassow On Aug 10, 2007, at 6:56 AM, Bernard Chew wrote: > Hi, > > I read that cmirror provides user-level utilities for managing cluster > mirroring but could not find much documentation on it. Can anyone > point > me to any documentation / guide around? > > Regards, > Bernard Chew > IT Operations > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3110 bytes Desc: not available URL: From james at cloud9.co.uk Fri Aug 10 16:11:55 2007 From: james at cloud9.co.uk (James Fidell) Date: Fri, 10 Aug 2007 17:11:55 +0100 Subject: [Linux-cluster] Using cmirror In-Reply-To: <229C73600EB0E54DA818AB599482BCE951EB01@shadowfax.sg.muvee.net> References: <1186746964.16863.8.camel@ws-berd.sg.muvee.net> <8F3D9B0A-3ABE-4636-A27F-B1120DBC85F2@redhat.com> <229C73600EB0E54DA818AB599482BCE951EB01@shadowfax.sg.muvee.net> Message-ID: <46BC8E4B.1070803@cloud9.co.uk> Bernard Chew wrote: > Hi, > > Can this work with GFS where I have 2 iscsi disks (ie. /dev/sda & /dev/sdb) from 2 different iscsi-target servers and I create a mirrored GFS volume? I had no problems creating such a setup. Where I did have problems was when one of the iscsi disks "went away". At that point the iscsi layer appeared to hang and lvm locked up :( (You need three separate PVs to create a mirrored LVM volume, btw). 
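For anyone trying the layout Bernard describes, a minimal sketch of a clustered mirrored volume with GFS on top might look like the following. Everything here is illustrative: the device names /dev/sda, /dev/sdb and /dev/sdc, the volume group name vg_gfs and the cluster name mycluster are placeholders, and the cmirror packages/service are assumed to be running on every node.

    # /dev/sda and /dev/sdb are the two iSCSI disks (the mirror legs),
    # /dev/sdc is the small third disk used for the mirror log
    pvcreate /dev/sda /dev/sdb /dev/sdc
    vgcreate -c y vg_gfs /dev/sda /dev/sdb /dev/sdc   # -c y marks the VG as clustered
    lvcreate -m 1 -L 100G -n mirrorlv vg_gfs          # one mirror copy plus a log device
    gfs_mkfs -p lock_dlm -t mycluster:mirrorgfs -j 2 /dev/vg_gfs/mirrorlv
    mount -t gfs /dev/vg_gfs/mirrorlv /mnt/gfs

As noted above, this does not by itself protect against the iSCSI layer hanging when one target disappears; the mirror only helps once the failed path is cleanly reported to LVM.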
James From jparsons at redhat.com Fri Aug 10 16:13:21 2007 From: jparsons at redhat.com (James Parsons) Date: Fri, 10 Aug 2007 12:13:21 -0400 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <46BC8070.3070400@Sun.Com> References: <1186689692.3002.12.camel@localhost.localdomain> <46BC8070.3070400@Sun.Com> Message-ID: <46BC8EA1.90609@redhat.com> Quentin Arce wrote: > >> >> >>> >>> My machines are SUN Fire X4100. I see that we can define a fence >>> device of type HP ILO. I would like to know if I can use the HP ILO >>> form in system-config-cluster tool to enter and use a SUN ILOM as >>> fence device? >>> >> >> >> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM? >> If so, there might be a way...by hacking on another agent. >> > > So, I'm a lurker on this list as I no longer have a cluster up... but > I work on ILOM and I would love to see this work. This isn't official > support, I'm a developer not a customer support person. So, it's more > on my time. If there is anything I can do... Please let me know. > Questions on this problem, regarding what ILOM can / can't do, how to > check state of the server via ILOM, etc. Quentin! That is very kind of you. If you help with the ILOM protocol, I'll help with the agent/script. This thread could form a document on how to write an arbitrary fence agent for use with rhcs. Where is documentation available? Generally, three things are needed from a baseboard management device in order to use it for fencing: 1) A way to shut the system down, 2) a way to power the system up, and 3) a way to check if it is up or down. What means can a script use to communicate with the ILOM card? Are there big delta's in the protocol between different ILOM versions? I look forward to hearing from you. -J From jparsons at redhat.com Fri Aug 10 16:22:55 2007 From: jparsons at redhat.com (James Parsons) Date: Fri, 10 Aug 2007 12:22:55 -0400 Subject: [Linux-cluster] Updating fence scripts In-Reply-To: <46BC52EF.7020903@sys-admin.hu> References: <46BC377C.9060109@sys-admin.hu> <46BC3CB7.5010909@artegence.com> <46BC52EF.7020903@sys-admin.hu> Message-ID: <46BC90DF.2070000@redhat.com> BERES Laszlo wrote: >Maciej Bogucki wrote: > > > >>If You doesn't have test environment then I suggest You to do tests in >>production to check if everything is working as You expected. >> >> > >Thanks, but my question was somehow explicit about fencing :) > > > >>But, if there is no change in fence_xxx script You don't need to do >>fence testing. >> >> > >There is a change in fence_ilo script, I have to upgrade it. > > > If you are referring to the 5.1 beta build, or the 4.5 asynchronous update, the change to fence_ilo is a minor fix to support ilo2...if you are running ilo currently with no problems, it is very highly extremely unlikely ;), that you will encounter a problem. Still, as the man says, you should test to be certain...as failed fencing when you need it is a VERY bad thing. Here is what changed, btw... 
--------------------------------------------------------------------------------- --- cluster/fence/agents/ilo/fence_ilo.pl 2007/04/09 15:22:39 1.3.2.3.2.1 +++ cluster/fence/agents/ilo/fence_ilo.pl 2007/07/17 18:38:59 1.3.2.3.2.2 @@ -279,10 +279,13 @@ foreach my $line (@response) { + if ($line =~ /FIRMWARE_VERSION\s*=\s*\"(.*)\"/) { + $firmware_rev = $1; + } if ($line =~ /MANAGEMENT_PROCESSOR\s*=\s*\"(.*)\"/) { if ($1 eq "iLO2") { $ilo_vers = 2; - print "power_status: reporting iLO2\n" if ($verbose); + print "power_status: reporting iLO2 $firmware_rev\n" if ($verbose); } } @@ -358,7 +361,11 @@ # HOLD_PWR_BUTTON is used to power the machine off, and # PRESS_PWR_BUTTON is used to power the machine on; # when the power is off, HOLD_PWR_BUTTON has no effect. - sendsock $socket, "\n"; + if ($firmware_rev > 1.29) { + sendsock $socket, "\n"; + } else { + sendsock $socket, "\n"; + } } # As of firmware version 1.71 (RIBCL 2.21) The SET_HOST_POWER command # is no longer available. HOLD_PWR_BTN and PRESS_PWR_BTN are used @@ -515,6 +522,7 @@ $action = "reboot"; $ribcl_vers = undef; # undef = autodetect $ilo_vers = 1; +$firmware_rev = 0; From brad at bradandkim.net Fri Aug 10 16:33:32 2007 From: brad at bradandkim.net (brad at bradandkim.net) Date: Fri, 10 Aug 2007 11:33:32 -0500 (CDT) Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <46BC8EA1.90609@redhat.com> References: <1186689692.3002.12.camel@localhost.localdomain> <46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com> Message-ID: <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net> > Quentin Arce wrote: > >> >>> >>> >>>> >>>> My machines are SUN Fire X4100. I see that we can define a fence >>>> device of type HP ILO. I would like to know if I can use the HP ILO >>>> form in system-config-cluster tool to enter and use a SUN ILOM as >>>> fence device? >>>> >>> >>> >>> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM? >>> If so, there might be a way...by hacking on another agent. >>> >> >> So, I'm a lurker on this list as I no longer have a cluster up... but >> I work on ILOM and I would love to see this work. This isn't official >> support, I'm a developer not a customer support person. So, it's more >> on my time. If there is anything I can do... Please let me know. >> Questions on this problem, regarding what ILOM can / can't do, how to >> check state of the server via ILOM, etc. > > Quentin! That is very kind of you. If you help with the ILOM protocol, > I'll help with the agent/script. This thread could form a document on > how to write an arbitrary fence agent for use with rhcs. > > Where is documentation available? Generally, three things are needed > from a baseboard management device in order to use it for fencing: 1) A > way to shut the system down, 2) a way to power the system up, and 3) a > way to check if it is up or down. > > What means can a script use to communicate with the ILOM card? Are there > big delta's in the protocol between different ILOM versions? > > I look forward to hearing from you. > > -J I am interested in seeing this thread play out as well since I have 26 SUN servers I am beginning to cluster. My question is why use SNMP over IPMI v2.0. I can do the above three things with: /usr/bin/ipmitool -U -P -H chassis power off /usr/bin/ipmitool -U -P -H chassis power on /usr/bin/ipmitool -U -P -H chassis power status I don't need any MIB's for this either. It seems to me this might be an easier solution than snmp, but I may be missing something. 
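To illustrate how those three ipmitool calls could be wrapped into a single helper, here is a rough sketch. It is not the shipped fence_ipmilan agent; the script name, argument order and credentials are invented for the example.

    #!/bin/bash
    # ilom_power.sh -- illustrative wrapper around ipmitool for a Sun ILOM SP
    # usage: ilom_power.sh <sp-address> <user> <password> <on|off|status>
    SP=$1 USER=$2 PASS=$3 ACTION=$4
    case "$ACTION" in
        on|off|status)
            # -I lanplus selects the IPMI v2.0 (lanplus) interface
            exec /usr/bin/ipmitool -I lanplus -H "$SP" -U "$USER" -P "$PASS" \
                chassis power "$ACTION"
            ;;
        *)
            echo "usage: $0 sp-address user password on|off|status" >&2
            exit 1
            ;;
    esac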
Thanks, Brad Crotchett brad at bradandkim.net http://www.bradandkim.net From Quentin.Arce at Sun.COM Fri Aug 10 16:35:12 2007 From: Quentin.Arce at Sun.COM (Quentin Arce) Date: Fri, 10 Aug 2007 09:35:12 -0700 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <46BC8EA1.90609@redhat.com> References: <1186689692.3002.12.camel@localhost.localdomain> <46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com> Message-ID: <46BC93C0.4070401@Sun.Com> James Parsons wrote: > Quentin Arce wrote: > >> >>> >>> >>>> >>>> My machines are SUN Fire X4100. I see that we can define a fence >>>> device of type HP ILO. I would like to know if I can use the HP ILO >>>> form in system-config-cluster tool to enter and use a SUN ILOM as >>>> fence device? >>>> >>> >>> >>> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM? >>> If so, there might be a way...by hacking on another agent. >>> >> >> So, I'm a lurker on this list as I no longer have a cluster up... but >> I work on ILOM and I would love to see this work. This isn't >> official support, I'm a developer not a customer support person. So, >> it's more on my time. If there is anything I can do... Please let me >> know. Questions on this problem, regarding what ILOM can / can't do, >> how to check state of the server via ILOM, etc. > > Quentin! That is very kind of you. If you help with the ILOM protocol, > I'll help with the agent/script. This thread could form a document on > how to write an arbitrary fence agent for use with rhcs. > > Where is documentation available? Generally, three things are needed > from a baseboard management device in order to use it for fencing: 1) > A way to shut the system down, 2) a way to power the system up, and 3) > a way to check if it is up or down. > In that case. I'm not very familiar with the other LOM cards out there. I have been told that IPMI is standard on most all cards from most vendors for the past few years. If this is true then perhaps the simplest method is to use ipmitool to get/set these states. SNMP works fine also but requires the user to turn on r/w v3 snmp access. If you can confirm that IPMI is standard then it's the simplest as the commands to get / set many readings / options are standard via the IPMI spec. Thanks, Quentin > What means can a script use to communicate with the ILOM card? Are > there big delta's in the protocol between different ILOM versions? > > I look forward to hearing from you. > > -J > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From Quentin.Arce at Sun.COM Fri Aug 10 16:37:29 2007 From: Quentin.Arce at Sun.COM (Quentin Arce) Date: Fri, 10 Aug 2007 09:37:29 -0700 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net> References: <1186689692.3002.12.camel@localhost.localdomain> <46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com> <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net> Message-ID: <46BC9449.5090408@Sun.Com> brad at bradandkim.net wrote: >> Quentin Arce wrote: >> >> >>>> >>>>> My machines are SUN Fire X4100. I see that we can define a fence >>>>> device of type HP ILO. I would like to know if I can use the HP ILO >>>>> form in system-config-cluster tool to enter and use a SUN ILOM as >>>>> fence device? >>>>> >>>>> >>>> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM? >>>> If so, there might be a way...by hacking on another agent. 
>>>> >>>> >>> So, I'm a lurker on this list as I no longer have a cluster up... but >>> I work on ILOM and I would love to see this work. This isn't official >>> support, I'm a developer not a customer support person. So, it's more >>> on my time. If there is anything I can do... Please let me know. >>> Questions on this problem, regarding what ILOM can / can't do, how to >>> check state of the server via ILOM, etc. >>> >> Quentin! That is very kind of you. If you help with the ILOM protocol, >> I'll help with the agent/script. This thread could form a document on >> how to write an arbitrary fence agent for use with rhcs. >> >> Where is documentation available? Generally, three things are needed >> from a baseboard management device in order to use it for fencing: 1) A >> way to shut the system down, 2) a way to power the system up, and 3) a >> way to check if it is up or down. >> >> What means can a script use to communicate with the ILOM card? Are there >> big delta's in the protocol between different ILOM versions? >> >> I look forward to hearing from you. >> >> -J >> > > I am interested in seeing this thread play out as well since I have 26 SUN > servers I am beginning to cluster. My question is why use SNMP over IPMI > v2.0. I can do the above three things with: > > /usr/bin/ipmitool -U -P -H chassis power off > /usr/bin/ipmitool -U -P -H chassis power on > /usr/bin/ipmitool -U -P -H chassis power status > > I don't need any MIB's for this either. It seems to me this might be an > easier solution than snmp, but I may be missing something. > > No, I think you have it all covered. :-) If other vendors support IPMI then the script should just use it. > Thanks, > > Brad Crotchett > brad at bradandkim.net > http://www.bradandkim.net > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From Quentin.Arce at Sun.COM Fri Aug 10 16:38:34 2007 From: Quentin.Arce at Sun.COM (Quentin Arce) Date: Fri, 10 Aug 2007 09:38:34 -0700 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net> References: <1186689692.3002.12.camel@localhost.localdomain> <46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com> <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net> Message-ID: <46BC948A.3010907@Sun.Com> brad at bradandkim.net wrote: >> Quentin Arce wrote: >> >> >>>> >>>>> My machines are SUN Fire X4100. I see that we can define a fence >>>>> device of type HP ILO. I would like to know if I can use the HP ILO >>>>> form in system-config-cluster tool to enter and use a SUN ILOM as >>>>> fence device? >>>>> >>>>> >>>> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM? >>>> If so, there might be a way...by hacking on another agent. >>>> >>>> >>> So, I'm a lurker on this list as I no longer have a cluster up... but >>> I work on ILOM and I would love to see this work. This isn't official >>> support, I'm a developer not a customer support person. So, it's more >>> on my time. If there is anything I can do... Please let me know. >>> Questions on this problem, regarding what ILOM can / can't do, how to >>> check state of the server via ILOM, etc. >>> >> Quentin! That is very kind of you. If you help with the ILOM protocol, >> I'll help with the agent/script. This thread could form a document on >> how to write an arbitrary fence agent for use with rhcs. >> >> Where is documentation available? 
Generally, three things are needed >> from a baseboard management device in order to use it for fencing: 1) A >> way to shut the system down, 2) a way to power the system up, and 3) a >> way to check if it is up or down. >> >> What means can a script use to communicate with the ILOM card? Are there >> big delta's in the protocol between different ILOM versions? >> >> I look forward to hearing from you. >> >> -J >> > > I am interested in seeing this thread play out as well since I have 26 SUN > servers I am beginning to cluster. My question is why use SNMP over IPMI > v2.0. I can do the above three things with: > > /usr/bin/ipmitool -U -P -H chassis power off > /usr/bin/ipmitool -U -P -H chassis power on > /usr/bin/ipmitool -U -P -H chassis power status > > I don't need any MIB's for this either. It seems to me this might be an > easier solution than snmp, but I may be missing something. > > Oh make sure you are using lanplus mode for this. > Thanks, > > Brad Crotchett > brad at bradandkim.net > http://www.bradandkim.net > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From dist-list at LEXUM.UMontreal.CA Fri Aug 10 17:35:07 2007 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Fri, 10 Aug 2007 13:35:07 -0400 Subject: [Linux-cluster] lvextend : Error locking ... Volume group for uuid not found Message-ID: <46BCA1CB.2010802@lexum.umontreal.ca> Ut is not really a good friday for me :) I am trying to extend le logical volume on cluster nodes : lvextend -t -l +1279 /dev/SAN-group1/home Test mode: Metadata will NOT be updated. Extending logical volume home to 654.98 GB Error locking on node catanzaro.dmz.lexum.pri: Volume group for uuid not found: Q8Wmg3qy2FFuCDUuIiI5zFyzVHKzvb53LJgndbQeYPeUzkiDcSxGmZ5a3IjLntOM Failed to suspend home I do not understansd where this uuid comes from. my vg uuid : --- Volume group --- VG Name SAN-group1 System ID Format lvm2 Metadata Areas 3 Metadata Sequence No 5 VG Access read/write VG Status resizable MAX LV 0 Cur LV 1 Open LV 1 Max PV 0 Cur PV 3 Act PV 3 VG Size 654.98 GB PE Size 4.00 MB Total PE 167676 Alloc PE / Size 166397 / 649.99 GB Free PE / Size 1279 / 5.00 GB VG UUID Q8Wmg3-qy2F-FuCD-UuIi-I5zF-yzVH-Kzvb53 From brad at bradandkim.net Fri Aug 10 17:38:11 2007 From: brad at bradandkim.net (brad at bradandkim.net) Date: Fri, 10 Aug 2007 12:38:11 -0500 (CDT) Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <46BC948A.3010907@Sun.Com> References: <1186689692.3002.12.camel@localhost.localdomain> <46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com> <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net> <46BC948A.3010907@Sun.Com> Message-ID: <43480.129.237.174.144.1186767491.squirrel@webmail.bradandkim.net> > brad at bradandkim.net wrote: >>> Quentin Arce wrote: >>> >>> >>>>> >>>>>> My machines are SUN Fire X4100. I see that we can define a fence >>>>>> device of type HP ILO. I would like to know if I can use the HP ILO >>>>>> form in system-config-cluster tool to enter and use a SUN ILOM as >>>>>> fence device? >>>>>> >>>>>> >>>>> How does ILOM work? telnet or ssh? Is there an snmp interface to >>>>> ILOM? >>>>> If so, there might be a way...by hacking on another agent. >>>>> >>>>> >>>> So, I'm a lurker on this list as I no longer have a cluster up... but >>>> I work on ILOM and I would love to see this work. This isn't official >>>> support, I'm a developer not a customer support person. So, it's more >>>> on my time. 
If there is anything I can do... Please let me know. >>>> Questions on this problem, regarding what ILOM can / can't do, how to >>>> check state of the server via ILOM, etc. >>>> >>> Quentin! That is very kind of you. If you help with the ILOM protocol, >>> I'll help with the agent/script. This thread could form a document on >>> how to write an arbitrary fence agent for use with rhcs. >>> >>> Where is documentation available? Generally, three things are needed >>> from a baseboard management device in order to use it for fencing: 1) A >>> way to shut the system down, 2) a way to power the system up, and 3) a >>> way to check if it is up or down. >>> >>> What means can a script use to communicate with the ILOM card? Are >>> there >>> big delta's in the protocol between different ILOM versions? >>> >>> I look forward to hearing from you. >>> >>> -J >>> >> >> I am interested in seeing this thread play out as well since I have 26 >> SUN >> servers I am beginning to cluster. My question is why use SNMP over >> IPMI >> v2.0. I can do the above three things with: >> >> /usr/bin/ipmitool -U -P -H chassis power off >> /usr/bin/ipmitool -U -P -H chassis power on >> /usr/bin/ipmitool -U -P -H chassis power >> status >> >> I don't need any MIB's for this either. It seems to me this might be an >> easier solution than snmp, but I may be missing something. >> >> > > Oh make sure you are using lanplus mode for this. > Will do, and thanks. Brad Crotchett brad at bradandkim.net http://www.bradandkim.net From dist-list at LEXUM.UMontreal.CA Fri Aug 10 17:39:38 2007 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Fri, 10 Aug 2007 13:39:38 -0400 Subject: [Linux-cluster] add journal to GFS In-Reply-To: <20070810085926.1a4a13d1@mathieu.toulouse> References: <46BB8F4E.7070203@lexum.umontreal.ca> <20070810085926.1a4a13d1@mathieu.toulouse> Message-ID: <46BCA2DA.5050606@lexum.umontreal.ca> TX I will try that ... with another LUN :) Mathieu Avila wrote: > Le Thu, 09 Aug 2007 18:03:58 -0400, > FM a ?crit : > >> Hello, >> I have to add more journals to add a new nodes. >> SO What I did : >> create a new LUN >> add it to lvm usign : >> lvcreate >> vgextend >> lvextend >> >> after that I use gfs_grow >> now the GFS is 150 GB bigger BUT >> gfs_jadd say that did not have space to add journals >> >> > > You have used the new space for normal data or meta-data, so that it > isn't available anymore. "gfs_jadd" doesn't use the free space of a > file system, it needs to be fed up with new space, just like gfs_grow > works. > As your file system cannot be shrinked, the only solution i see is to > add space one more time (much less, 150G is very much unless you want > to 100+ nodes), and run gfs_jadd again. > > Cluster team, please correct me if i'm wrong. > > -- > Mathieu > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From dist-list at LEXUM.UMontreal.CA Fri Aug 10 18:22:57 2007 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Fri, 10 Aug 2007 14:22:57 -0400 Subject: [Linux-cluster] lvextend : Error locking ... Volume group for uuid not found FIXED In-Reply-To: <46BCA1CB.2010802@lexum.umontreal.ca> References: <46BCA1CB.2010802@lexum.umontreal.ca> Message-ID: <46BCAD01.2050805@lexum.umontreal.ca> service clvmd restart did the trick FM wrote: > Ut is not really a good friday for me :) > > I am trying to extend le logical volume on cluster nodes : > > lvextend -t -l +1279 /dev/SAN-group1/home > Test mode: Metadata will NOT be updated. 
> Extending logical volume home to 654.98 GB > Error locking on node catanzaro.dmz.lexum.pri: Volume group for uuid > not found: Q8Wmg3qy2FFuCDUuIiI5zFyzVHKzvb53LJgndbQeYPeUzkiDcSxGmZ5a3IjLntOM > Failed to suspend home > > I do not understansd where this uuid comes from. > my vg uuid : > > > --- Volume group --- > VG Name SAN-group1 > System ID > Format lvm2 > Metadata Areas 3 > Metadata Sequence No 5 > VG Access read/write > VG Status resizable > MAX LV 0 > Cur LV 1 > Open LV 1 > Max PV 0 > Cur PV 3 > Act PV 3 > VG Size 654.98 GB > PE Size 4.00 MB > Total PE 167676 > Alloc PE / Size 166397 / 649.99 GB > Free PE / Size 1279 / 5.00 GB > VG UUID Q8Wmg3-qy2F-FuCD-UuIi-I5zF-yzVH-Kzvb53 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jparsons at redhat.com Fri Aug 10 18:59:10 2007 From: jparsons at redhat.com (James Parsons) Date: Fri, 10 Aug 2007 14:59:10 -0400 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <43480.129.237.174.144.1186767491.squirrel@webmail.bradandkim.net> References: <1186689692.3002.12.camel@localhost.localdomain> <46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com> <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net> <46BC948A.3010907@Sun.Com> <43480.129.237.174.144.1186767491.squirrel@webmail.bradandkim.net> Message-ID: <46BCB57E.7000809@redhat.com> brad at bradandkim.net wrote: >>brad at bradandkim.net wrote: >> >> >>>>Quentin Arce wrote: >>>> >>>> >>>> >>>> >>>>>>>My machines are SUN Fire X4100. I see that we can define a fence >>>>>>>device of type HP ILO. I would like to know if I can use the HP ILO >>>>>>>form in system-config-cluster tool to enter and use a SUN ILOM as >>>>>>>fence device? >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to >>>>>>ILOM? >>>>>>If so, there might be a way...by hacking on another agent. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>So, I'm a lurker on this list as I no longer have a cluster up... but >>>>>I work on ILOM and I would love to see this work. This isn't official >>>>>support, I'm a developer not a customer support person. So, it's more >>>>>on my time. If there is anything I can do... Please let me know. >>>>>Questions on this problem, regarding what ILOM can / can't do, how to >>>>>check state of the server via ILOM, etc. >>>>> >>>>> >>>>> >>>>Quentin! That is very kind of you. If you help with the ILOM protocol, >>>>I'll help with the agent/script. This thread could form a document on >>>>how to write an arbitrary fence agent for use with rhcs. >>>> >>>>Where is documentation available? Generally, three things are needed >>>>from a baseboard management device in order to use it for fencing: 1) A >>>>way to shut the system down, 2) a way to power the system up, and 3) a >>>>way to check if it is up or down. >>>> >>>>What means can a script use to communicate with the ILOM card? Are >>>>there >>>>big delta's in the protocol between different ILOM versions? >>>> >>>>I look forward to hearing from you. >>>> >>>>-J >>>> >>>> >>>> >>>I am interested in seeing this thread play out as well since I have 26 >>>SUN >>>servers I am beginning to cluster. My question is why use SNMP over >>>IPMI >>>v2.0. I can do the above three things with: >>> >>>/usr/bin/ipmitool -U -P -H chassis power off >>>/usr/bin/ipmitool -U -P -H chassis power on >>>/usr/bin/ipmitool -U -P -H chassis power >>>status >>> >>>I don't need any MIB's for this either. 
It seems to me this might be an >>>easier solution than snmp, but I may be missing something. >>> >>> >>> >>> >>Oh make sure you are using lanplus mode for this. >> >> >> > >Will do, and thanks. > > That is a nice solution. There is a fence_ipmilan agent in the red hat cluster distibution...how are you invoking the above for fencing? To check if the rh agent works, here is the command line you would use (it installs into /sbin...): /sbin/fence_ipmilan -a -l -p -P -o [off,on,reboot,status] There is a man page for fence_ipmilan that details some extra params. Well, I guess that solves the issue...if anyone would use an snmp-based ILOM agent, we could talk about how to construct that...otherwise, so much for my idea of this thread being instructions for creating arbitrary agents! ;) -J From brad at bradandkim.net Fri Aug 10 19:21:31 2007 From: brad at bradandkim.net (brad at bradandkim.net) Date: Fri, 10 Aug 2007 14:21:31 -0500 (CDT) Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <46BCB57E.7000809@redhat.com> References: <1186689692.3002.12.camel@localhost.localdomain> <46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com> <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net> <46BC948A.3010907@Sun.Com> <43480.129.237.174.144.1186767491.squirrel@webmail.bradandkim.net> <46BCB57E.7000809@redhat.com> Message-ID: <56032.129.237.174.144.1186773691.squirrel@webmail.bradandkim.net> > brad at bradandkim.net wrote: > >>>brad at bradandkim.net wrote: >>> >>> >>>>>Quentin Arce wrote: >>>>> >>>>> >>>>> >>>>> >>>>>>>>My machines are SUN Fire X4100. I see that we can define a fence >>>>>>>>device of type HP ILO. I would like to know if I can use the HP ILO >>>>>>>>form in system-config-cluster tool to enter and use a SUN ILOM as >>>>>>>>fence device? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to >>>>>>>ILOM? >>>>>>>If so, there might be a way...by hacking on another agent. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>So, I'm a lurker on this list as I no longer have a cluster up... but >>>>>>I work on ILOM and I would love to see this work. This isn't >>>>>> official >>>>>>support, I'm a developer not a customer support person. So, it's >>>>>> more >>>>>>on my time. If there is anything I can do... Please let me know. >>>>>>Questions on this problem, regarding what ILOM can / can't do, how to >>>>>>check state of the server via ILOM, etc. >>>>>> >>>>>> >>>>>> >>>>>Quentin! That is very kind of you. If you help with the ILOM protocol, >>>>>I'll help with the agent/script. This thread could form a document on >>>>>how to write an arbitrary fence agent for use with rhcs. >>>>> >>>>>Where is documentation available? Generally, three things are needed >>>>>from a baseboard management device in order to use it for fencing: 1) >>>>> A >>>>>way to shut the system down, 2) a way to power the system up, and 3) a >>>>>way to check if it is up or down. >>>>> >>>>>What means can a script use to communicate with the ILOM card? Are >>>>>there >>>>>big delta's in the protocol between different ILOM versions? >>>>> >>>>>I look forward to hearing from you. >>>>> >>>>>-J >>>>> >>>>> >>>>> >>>>I am interested in seeing this thread play out as well since I have 26 >>>>SUN >>>>servers I am beginning to cluster. My question is why use SNMP over >>>>IPMI >>>>v2.0. 
I can do the above three things with: >>>> >>>>/usr/bin/ipmitool -U -P -H chassis power >>>> off >>>>/usr/bin/ipmitool -U -P -H chassis power on >>>>/usr/bin/ipmitool -U -P -H chassis power >>>>status >>>> >>>>I don't need any MIB's for this either. It seems to me this might be >>>> an >>>>easier solution than snmp, but I may be missing something. >>>> >>>> >>>> >>>> >>>Oh make sure you are using lanplus mode for this. >>> >>> >>> >> >>Will do, and thanks. >> >> > That is a nice solution. There is a fence_ipmilan agent in the red hat > cluster distibution...how are you invoking the above for fencing? To > check if the rh agent works, here is the command line you would use (it > installs into /sbin...): > > /sbin/fence_ipmilan -a -l -p -P -o > [off,on,reboot,status] > > There is a man page for fence_ipmilan that details some extra params. > > Well, I guess that solves the issue...if anyone would use an snmp-based > ILOM agent, we could talk about how to construct that...otherwise, so > much for my idea of this thread being instructions for creating > arbitrary agents! ;) > > -J > I just tested it and it seems to work perfectly. Sorry for bringing the thread to a premature end :) Brad Crotchett brad at bradandkim.net http://www.bradandkim.net From adel at opennet.ae Fri Aug 10 20:14:22 2007 From: adel at opennet.ae (Adel Ben Zarrouk) Date: Sat, 11 Aug 2007 00:14:22 +0400 Subject: [Linux-cluster] Oracle E-Business Suite and GFS In-Reply-To: <46BCAD01.2050805@lexum.umontreal.ca> References: <46BCA1CB.2010802@lexum.umontreal.ca> <46BCAD01.2050805@lexum.umontreal.ca> Message-ID: <200708110014.22709.adel@opennet.ae> Hi, One of our customer planning to setup Oracle EBuiness Suite and we are thinking to propose GFS 6.1 instead of OCFS2. My questions here: -Oracle EBS certified with GFS6.1 and RHEL -If there is any customer has done this before -Any benchmark available or comparison between GFS 6.1 and OCFS2. Looking forwar for your feedback Regards --Adel From doseyg at r-networks.net Sat Aug 11 02:11:45 2007 From: doseyg at r-networks.net (Glen Dosey) Date: Fri, 10 Aug 2007 22:11:45 -0400 Subject: [Linux-cluster] add journal to GFS In-Reply-To: <46BCA2DA.5050606@lexum.umontreal.ca> References: <46BB8F4E.7070203@lexum.umontreal.ca> <20070810085926.1a4a13d1@mathieu.toulouse> <46BCA2DA.5050606@lexum.umontreal.ca> Message-ID: <1186798305.8784.7.camel@eclipse.office.r-networks.net> I realize it's not normally a big deal, but if you don't want to have too many luns and pvs floating around you should be able to add another LUN of say 180GB, followed by a pvcreate, vgextend, pvmove, vgreduce, lvextend and gfs_jadd. You'll have 30GB of unallocated disk space available for the journal as well as 150GB for the gfs filesystem previously created on the prior (now removed) LUN. On Fri, 2007-08-10 at 13:39 -0400, FM wrote: > TX I will try that ... with another LUN :) > > > Mathieu Avila wrote: > > Le Thu, 09 Aug 2007 18:03:58 -0400, > > FM a ?crit : > > > >> Hello, > >> I have to add more journals to add a new nodes. > >> SO What I did : > >> create a new LUN > >> add it to lvm usign : > >> lvcreate > >> vgextend > >> lvextend > >> > >> after that I use gfs_grow > >> now the GFS is 150 GB bigger BUT > >> gfs_jadd say that did not have space to add journals > >> > >> > > > > You have used the new space for normal data or meta-data, so that it > > isn't available anymore. "gfs_jadd" doesn't use the free space of a > > file system, it needs to be fed up with new space, just like gfs_grow > > works. 
> > As your file system cannot be shrinked, the only solution i see is to > > add space one more time (much less, 150G is very much unless you want > > to 100+ nodes), and run gfs_jadd again. > > > > Cluster team, please correct me if i'm wrong. > > > > -- > > Mathieu > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From bernard.chew at muvee.com Sat Aug 11 07:59:15 2007 From: bernard.chew at muvee.com (Bernard Chew) Date: Sat, 11 Aug 2007 15:59:15 +0800 Subject: [Linux-cluster] Using cmirror In-Reply-To: <46BC8E4B.1070803@cloud9.co.uk> References: <1186746964.16863.8.camel@ws-berd.sg.muvee.net> <8F3D9B0A-3ABE-4636-A27F-B1120DBC85F2@redhat.com><229C73600EB0E54DA818AB599482BCE951EB01@shadowfax.sg.muvee.net> <46BC8E4B.1070803@cloud9.co.uk> Message-ID: <229C73600EB0E54DA818AB599482BCE9019C6984@shadowfax.sg.muvee.net> Hi James and Brassow, Thank you for the replies on using cmirror. I'll try the configuration (below) on my test servers, and post the findings here. Regards, Bernard Chew -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Fidell Sent: Saturday, August 11, 2007 12:12 AM To: linux clustering Subject: Re: [Linux-cluster] Using cmirror Bernard Chew wrote: > Hi, > > Can this work with GFS where I have 2 iscsi disks (ie. /dev/sda & /dev/sdb) from 2 different iscsi-target servers and I create a mirrored GFS volume? I had no problems creating such a setup. Where I did have problems was when one of the iscsi disks "went away". At that point the iscsi layer appeared to hang and lvm locked up :( (You need three separate PVs to create a mirrored LVM volume, btw). James -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From mbrookov at mines.edu Sun Aug 12 16:55:32 2007 From: mbrookov at mines.edu (Matthew B. Brookover) Date: Sun, 12 Aug 2007 10:55:32 -0600 Subject: [Linux-cluster] flock behavior different between GFS and EXT3 Message-ID: <1186937732.4915.49.camel@mickey.mattbrookover.com> I am attempting to move a program using an EXT3 file system to a GFS file system. The program uses flock to serialize access between processes. On an EXT3 file system I can get an exclusive lock on a file, make some change to the file, then get a shared lock without loosing the lock. On GFS when the program tries to demote from the exclusive lock to a shared lock, the lock is freed allowing another process to step in and take the lock. Is there a way to get flock on GFS to behave the way it does on the EXT3 file system? I have attached sample C source code and here are instructions to demonstrate this issue. My cluster is running GFS 6.1, RHEL 4 update 5 with all of the patches. Compile both programs: [mbrookov at imagine locktest]$ cc -o flock_EX_SH flock_EX_SH.c [mbrookov at imagine locktest]$ cc -o flockwritelock flockwritelock.c [mbrookov at imagine locktest]$ EXT3 test: Start up xterm twice and cd to the directory where you compiled the 2 programs. On my system, /tmp is an EXT3 file system. In the first xterm, run 'flock_EX_SH /tmp/bar' and hit return. In the second xterm, run 'flockwritelock /tmp/bar' and hit return. The flockwritelock process will block waiting for an exclusive lock on the file /tmp/bar. 
One the first xterm, hit return, the flock_EX_SH process will attempt to demote the exclusive lock to a shared lock and display a prompt. The flockwritelock process on the second xterm will stay blocked. In the first xterm, hit return again, the flock_EX_SH process will free the lock, close the file and exit. The flockwritelock process will then receive the exclusive lock on /tmp/bar and display a prompt. Hit return in the second xterm to get flockwritelock to close and exit. Output on first xterm: [mbrookov at imagine locktest]$ ./flock_EX_SH /tmp/bar Have exclusive lock, hit return to free write lock on /tmp/bar and exit Attempt to demote lock on /tmp/bar to shared lock Have shared lock, hit return to free lock on /tmp/bar and exit [mbrookov at imagine locktest]$ Output on second xterm: [mbrookov at imagine locktest]$ ./flockwritelock /tmp/bar Have write lock, hit return to free write lock on /tmp/bar and exit [mbrookov at imagine locktest]$ GFS test: Start up xterm twice and cd to the directory where you compiled the 2 programs. On my system, the locktest directory is on a GFS file system. In the first xterm, run 'flock_EX_SH bar' and hit return. In the second xterm, run 'flockwritelock bar' and hit return. The flockwritelock process will block waiting for an exclusive lock on the file bar. On the first xterm, hit return, the flock_EX_SH process will attempt to demote the exclusive lock on bar to a shared lock but will fail because the system call to flock frees the lock allowing the flockwritelock process to get an exclusive lock. The flock_EX_SH process will exit. Hit return on the second xterm, flockwritelock will close bar and exit. Output on first xterm: [mbrookov at imagine locktest]$ ./flock_EX_SH bar Have exclusive lock, hit return to free write lock on bar and exit Attempt to demote lock on bar to shared lock Could not demote to shared lock on file bar, Resource temporarily unavailable [mbrookov at imagine locktest]$ Output on second xterm: [mbrookov at imagine locktest]$ ./flockwritelock bar Have write lock, hit return to free write lock on bar and exit [mbrookov at imagine locktest]$ The results for flock on GFS are the same if you run the two programs on the same node or on 2 different nodes. The locks (shared, exclusive, blocking, non blocking) also work correctly on both file systems. The problem is the case where GFS will free the exclusive lock and return an error instead of demote the exclusive lock to a shared lock. The program depends on the EXT3 flock behavior -- the exclusive lock can be demoted to a shared lock without the possibility that another process that is blocked waiting for an exclusive lock receiving the lock. Thank you Matt mbrookov at mines.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: flock_EX_SH.c Type: text/x-csrc Size: 1291 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: flockwritelock.c Type: text/x-csrc Size: 1073 bytes Desc: not available URL: From bsheets at singlefin.net Sun Aug 12 23:23:43 2007 From: bsheets at singlefin.net (Brian Sheets) Date: Sun, 12 Aug 2007 23:23:43 +0000 (UTC) Subject: [Linux-cluster] fence_apc 7930s Message-ID: <1891144504.370791186961023635.JavaMail.root@v-mailhost2.mxpath.net> Did anyone get this working and allow system names for the port tags? Mine are all labled as [empty] I've not tried switching it back to the default tag. 
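One way to narrow the port-name question down is to run the agent by hand against a single outlet before wiring it into cluster.conf. A rough example follows; the address, credentials and outlet number are placeholders, and the exact option spelling should be checked against man fence_apc for your version.

    /sbin/fence_apc -a 192.168.1.50 -l apc -p apc -n 3 -o Off -v
    /sbin/fence_apc -a 192.168.1.50 -l apc -p apc -n 3 -o On -v

If that works with the default "Outlet N" tags but fails after the outlets are renamed, the renaming is the likely culprit.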
Thanks Brian >Ok... I screwed up. I figured it would be my mistake in the end. > >I renamed the outlet to the hostname of the system a while ago. I change >the name back to the default "Outlet #" and fence_apc worked the first try. >I always use the port number and not the name, but maybe I broke something >by changing the port name on the apc devices. >I have a 2 node cluster now with fencing across two apc 7930s using >redundant power supplies. >Thanks everyone for the help, >Eric m From Alain.Moulle at bull.net Mon Aug 13 06:23:00 2007 From: Alain.Moulle at bull.net (Alain Moulle) Date: Mon, 13 Aug 2007 08:23:00 +0200 Subject: [Linux-cluster] CS4 U5 / RHEL4 U4 ? Message-ID: <46BFF8C4.90409@bull.net> Hi Is there any incompatibility to re-build the CS4 U5 for RHEL4 U4 ? (Just for the benefit of all patches) Thanks Alain From janne.peltonen at helsinki.fi Mon Aug 13 09:07:04 2007 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Mon, 13 Aug 2007 12:07:04 +0300 Subject: [Linux-cluster] Load peaks - caused by the cluster? Message-ID: <20070813090703.GR17564@helsinki.fi> Hi! Remember the fs.sh status checks mayhem I reported a while ago? Now, there was the ghost-like load flux, but the system getting stuck wasn't (only) because of the excess number of execs - it was, plain and simple, memory starvation. *sigh* Anyway, now that I (or, to be exact, my servers) have enough memory, I noticed that the problem with the inexplicable load flux hasn't gone anywhere. With a more-or-less regular 11-hour interval, there is a four-hour long peak in the load, shaped like an elf's pointy hat. (In an otherwise idle system, the height of the peak is abt 6.0. If there is load caused by something "real", the peak is on top of the other load - it looks as if it just linearly adds up.) I'm seriously beginning to consider the possibility that there are elfs in my kernel, since I can't see the peaks anywhere else than the loads: CPU usage, number of processes, IP/TCP/UDP traffic, IO load, paging activity - nothing reflects the load peaks. I had a look at the process accounting statistics during a peak and during no peak, but couldn't see any difference. One suggestion my colleague had was that the peaks might be caused by the cluster somehow changing the 'lead' - somewhere inside the kernel, in such a low level that it can't be noticed elsewhere than in the load. That was because there is a difference of phase in the peaks. It didn't sound very credible to me, but I'll ask anyway: could there be something like that going on? On the other hand, on the one node in the cluster that doesn't have rgmanager running (it's in the cluster so that there wouldn't be an even number of nodes), I'm not seeing these elfs. And I have an another cluster that had the elf-hats before I added an exit 0 into their fs.sh scripts. But they don't have the elf-hats anymore. The difference between these two clusters is that the cluster with elfs has a lot more active cluster services than the one without. That is, the cluster with elfs has a lot more, say, ip.sh execs than the one without. I wonder if these, when over a certain limit, could have an effect on the load similar to the excess fs.fh execs had? Next, I think I'm going to put an exit 0 to the status checks of ip.sh (and see if the elfs go away). Then I'm going to start wondering if the cluster'd notice our server room falling apart... ;) Any suggestions? At this point, I'm not any more even certain whether the problem lies within the cluster. 
On the other hand, since I see no difference at the process level during peak and no-peak time, the difference must (as far as I understand) be inside kernel. So it can't be my application. So it must be the cluster, mustn't it? Thanks. --Janne -- Janne Peltonen From cluster at defuturo.co.uk Mon Aug 13 09:53:16 2007 From: cluster at defuturo.co.uk (Robert Clark) Date: Mon, 13 Aug 2007 10:53:16 +0100 Subject: [Linux-cluster] Assertion failed in do_flock (bz198302) Message-ID: <1186998796.2650.6.camel@rutabaga.defuturo.co.uk> I've been seeing the same assertions as in bz198302, so I've tried out the debug patch there and it looks like they are being triggered by an EAGAIN from flock_lock_file_wait. Is this an expected return code? Robert From Arne.Brieseneck at vodafone.com Mon Aug 13 12:01:39 2007 From: Arne.Brieseneck at vodafone.com (Brieseneck, Arne, VF-Group) Date: Mon, 13 Aug 2007 14:01:39 +0200 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <56032.129.237.174.144.1186773691.squirrel@webmail.bradandkim.net> Message-ID: Hi Brad, If it works perfect I'd like to use your configuration for my own SUN X4100 systems. Can you please send your configuration files? Thanks a lot Arne -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of brad at bradandkim.net Sent: Freitag, 10. August 2007 21:22 To: linux clustering Subject: Re: [Linux-cluster] Add a fence device of type SUN ILOM > brad at bradandkim.net wrote: > >>>brad at bradandkim.net wrote: >>> >>> >>>>>Quentin Arce wrote: >>>>> >>>>> >>>>> >>>>> >>>>>>>>My machines are SUN Fire X4100. I see that we can define a fence >>>>>>>>device of type HP ILO. I would like to know if I can use the HP >>>>>>>>ILO form in system-config-cluster tool to enter and use a SUN >>>>>>>>ILOM as fence device? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to >>>>>>>ILOM? >>>>>>>If so, there might be a way...by hacking on another agent. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>So, I'm a lurker on this list as I no longer have a cluster up... >>>>>>but I work on ILOM and I would love to see this work. This isn't >>>>>>official support, I'm a developer not a customer support person. >>>>>>So, it's more on my time. If there is anything I can do... >>>>>>Please let me know. >>>>>>Questions on this problem, regarding what ILOM can / can't do, how >>>>>>to check state of the server via ILOM, etc. >>>>>> >>>>>> >>>>>> >>>>>Quentin! That is very kind of you. If you help with the ILOM >>>>>protocol, I'll help with the agent/script. This thread could form a >>>>>document on how to write an arbitrary fence agent for use with rhcs. >>>>> >>>>>Where is documentation available? Generally, three things are >>>>>needed from a baseboard management device in order to use it for >>>>>fencing: 1) A way to shut the system down, 2) a way to power the >>>>>system up, and 3) a way to check if it is up or down. >>>>> >>>>>What means can a script use to communicate with the ILOM card? Are >>>>>there big delta's in the protocol between different ILOM versions? >>>>> >>>>>I look forward to hearing from you. >>>>> >>>>>-J >>>>> >>>>> >>>>> >>>>I am interested in seeing this thread play out as well since I have >>>>26 SUN servers I am beginning to cluster. My question is why use >>>>SNMP over IPMI v2.0. 
I can do the above three things with: >>>> >>>>/usr/bin/ipmitool -U -P -H chassis power >>>>off /usr/bin/ipmitool -U -P -H chassis >>>>power on /usr/bin/ipmitool -U -P -H >>>>chassis power status >>>> >>>>I don't need any MIB's for this either. It seems to me this might >>>>be an easier solution than snmp, but I may be missing something. >>>> >>>> >>>> >>>> >>>Oh make sure you are using lanplus mode for this. >>> >>> >>> >> >>Will do, and thanks. >> >> > That is a nice solution. There is a fence_ipmilan agent in the red hat > cluster distibution...how are you invoking the above for fencing? To > check if the rh agent works, here is the command line you would use > (it installs into /sbin...): > > /sbin/fence_ipmilan -a -l -p -P -o > [off,on,reboot,status] > > There is a man page for fence_ipmilan that details some extra params. > > Well, I guess that solves the issue...if anyone would use an > snmp-based ILOM agent, we could talk about how to construct > that...otherwise, so much for my idea of this thread being > instructions for creating arbitrary agents! ;) > > -J > I just tested it and it seems to work perfectly. Sorry for bringing the thread to a premature end :) Brad Crotchett brad at bradandkim.net http://www.bradandkim.net -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lanthier.stephanie at uqam.ca Mon Aug 13 14:02:26 2007 From: lanthier.stephanie at uqam.ca (=?iso-8859-1?Q?Lanthier=2C_St=E9phanie?=) Date: Mon, 13 Aug 2007 10:02:26 -0400 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <20070810151307.A882973687@hormel.redhat.com> Message-ID: Message: 6 Date: Fri, 10 Aug 2007 09:32:51 +0200 From: Maciej Bogucki Subject: Re: [Linux-cluster] Add a fence device of type SUN ILOM To: linux clustering Message-ID: <46BC14A3.70303 at artegence.com> Content-Type: text/plain; charset=UTF-8 > As the GFS file system is not accessible when I'm rebooting one of the > three nodes, I'm realizing the importance of fence devices. It is strange to me, because if You have rc scripts and quorum properly configured, and when You perform reboot of one node Your GFS filesystem should be accesible all time. Best Regards Maciej Bogucki Dear Maciej, To answer your question, I read this about fence behavior on http://www.centos.org/docs/4/4.5/SAC_Cluster_Suite_Overview/s2-fencing-overview-CSO.html "When the cluster manager determines that a node has failed, it communicates to other cluster-infrastructure components that the node has failed. The fencing program (either fenced or GULM), when notified of the failure, fences the failed node. Other cluster-infrastructure components determine what actions to take - that is, they perform any recovery that needs to done. For example, DLM and GFS (in a cluster configured with CMAN/DLM), when notified of a node failure, suspend activity until they detect that the fencing program has completed fencing the failed node. Upon confirmation that the failed node is fenced, DLM and GFS perform recovery. DLM releases locks of the failed node; GFS recovers the journal of the failed node." As I had no fence running, I understood that DLM and GFS were in suspend, waiting to know about the fence completion. 
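(If it helps to see that state directly: on a CS4/RHEL4 cluster the fence domain and DLM lock spaces can be inspected while a node is down. The command and proc path below are the usual ones for that release, so confirm them on your own systems:

    cman_tool services            # fence domain / DLM / GFS groups and their state
    cat /proc/cluster/services    # same information straight from the kernel

A group sitting in a recover or wait state there generally means a fence operation is still pending, which matches the behaviour described above.)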
Best regards __________________ Stephanie Lanthier Analyste de l'informatique Universite du Quebec a Montreal Service de l'informatique et des telecommunications lanthier.stephanie at uqam.ca Telephone : 514-987-3000 poste 6106 Bureau : PK-M535 From jos at xos.nl Mon Aug 13 15:50:42 2007 From: jos at xos.nl (Jos Vos) Date: Mon, 13 Aug 2007 17:50:42 +0200 Subject: [Linux-cluster] IPv6 cluster addresses are "tentative" (for two seconds) Message-ID: <200708131550.l7DFogb07702@xos037.xos.nl> Hi, When using IPv6 addresses in the cluster configuration, I see that these are labaled "tentative" ("ip addr list" output) in the first two seconds when the service script runs. This appears to prohibit programs from binding to these addresses, so I need to add a sleep (or something more sophisticated, like a loop that looks when this address is not "tentative" anymore) in my cluster service script: then it seems to work fine. Is this the only solution or are there more sophisticated (and better) solutions possible? Does the same delay (w.r.t. availability) also apply to the normal IPv6 network config scripts ("ifup-ipv6") or is this problem specific to the cluster suite (if yes, should the cluster suite be adapted)? B.t.w., this is on RHEL 5.0. Thanks, -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From pbruna at it-linux.cl Mon Aug 13 15:57:00 2007 From: pbruna at it-linux.cl (Patricio A. Bruna) Date: Mon, 13 Aug 2007 11:57:00 -0400 (CLT) Subject: [Linux-cluster] GFS Problem Message-ID: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl> Hi, I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1) filesystems. Im having a problem when restarting one of the nodes, the other node can not longer access the GFS partitions, so i must reboot both. Althoug, yesterday i resize a GFS partition, lvmextend and then gfs_grow, and the other node gaves I/O error and dismounts all the GFS partitions. Do you have any ideas why this is happening? PD: Im attaching the cluster.conf. -------------- next part -------------- An HTML attachment was scrubbed... URL: From lanthier.stephanie at uqam.ca Mon Aug 13 15:43:01 2007 From: lanthier.stephanie at uqam.ca (=?iso-8859-1?Q?Lanthier=2C_St=E9phanie?=) Date: Mon, 13 Aug 2007 11:43:01 -0400 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <20070811021159.95BD473107@hormel.redhat.com> Message-ID: Dear list members, If I understand well, here are the steps I have to follow to configure a fence device that will use SUN ILOM interface : 1. Ensure that OpenIPMI and OpenIPMI-tools packages are installed on the cluster nodes. 2. With system-config-cluster tool, add a new fence device of type "IPMI Lan". Fill the form with the ILOM IP address, the name of the administrator user and his password. Then associate the fence device with the cluster node. 3. Repeat the step above for the SUN ILOM interface of each node. 4. Send the new configuration to the cluster 5. That's all and everything will be handle correctly by the cluster. Am I ok? 
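(For reference, the cluster.conf fragment those steps end up producing looks roughly like the sketch below. The names, address and credentials are placeholders, not a tested configuration; lanplus="1" corresponds to the -P flag of fence_ipmilan:

    <fencedevices>
        <fencedevice agent="fence_ipmilan" name="node1-ilom"
                     ipaddr="10.0.0.101" login="root" passwd="secret" lanplus="1"/>
    </fencedevices>

    <clusternode name="node1">
        <fence>
            <method name="1">
                <device name="node1-ilom"/>
            </method>
        </fence>
    </clusternode>

One fencedevice entry per node, each pointing at that node's own ILOM address; other clusternode attributes are omitted here.)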
Best regards __________________ Stephanie Lanthier Analyste de l'informatique Universite du Quebec a Montreal Service de l'informatique et des telecommunications lanthier.stephanie at uqam.ca T?l?phone : 514-987-3000 poste 6106 Bureau : PK-M535 ------------------------------ Message: 13 Date: Fri, 10 Aug 2007 14:21:31 -0500 (CDT) From: brad at bradandkim.net Subject: Re: [Linux-cluster] Add a fence device of type SUN ILOM To: "linux clustering" Message-ID: <56032.129.237.174.144.1186773691.squirrel at webmail.bradandkim.net> Content-Type: text/plain;charset=iso-8859-1 > brad at bradandkim.net wrote: > >>>brad at bradandkim.net wrote: >>> >>> >>>>>Quentin Arce wrote: >>>>> >>>>> >>>>> >>>>> >>>>>>>>My machines are SUN Fire X4100. I see that we can define a fence >>>>>>>>device of type HP ILO. I would like to know if I can use the HP ILO >>>>>>>>form in system-config-cluster tool to enter and use a SUN ILOM as >>>>>>>>fence device? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to >>>>>>>ILOM? >>>>>>>If so, there might be a way...by hacking on another agent. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>So, I'm a lurker on this list as I no longer have a cluster up... but >>>>>>I work on ILOM and I would love to see this work. This isn't >>>>>> official >>>>>>support, I'm a developer not a customer support person. So, it's >>>>>> more >>>>>>on my time. If there is anything I can do... Please let me know. >>>>>>Questions on this problem, regarding what ILOM can / can't do, how to >>>>>>check state of the server via ILOM, etc. >>>>>> >>>>>> >>>>>> >>>>>Quentin! That is very kind of you. If you help with the ILOM protocol, >>>>>I'll help with the agent/script. This thread could form a document on >>>>>how to write an arbitrary fence agent for use with rhcs. >>>>> >>>>>Where is documentation available? Generally, three things are needed >>>>>from a baseboard management device in order to use it for fencing: 1) >>>>> A >>>>>way to shut the system down, 2) a way to power the system up, and 3) a >>>>>way to check if it is up or down. >>>>> >>>>>What means can a script use to communicate with the ILOM card? Are >>>>>there >>>>>big delta's in the protocol between different ILOM versions? >>>>> >>>>>I look forward to hearing from you. >>>>> >>>>>-J >>>>> >>>>> >>>>> >>>>I am interested in seeing this thread play out as well since I have 26 >>>>SUN >>>>servers I am beginning to cluster. My question is why use SNMP over >>>>IPMI >>>>v2.0. I can do the above three things with: >>>> >>>>/usr/bin/ipmitool -U -P -H chassis power >>>> off >>>>/usr/bin/ipmitool -U -P -H chassis power on >>>>/usr/bin/ipmitool -U -P -H chassis power >>>>status >>>> >>>>I don't need any MIB's for this either. It seems to me this might be >>>> an >>>>easier solution than snmp, but I may be missing something. >>>> >>>> >>>> >>>> >>>Oh make sure you are using lanplus mode for this. >>> >>> >>> >> >>Will do, and thanks. >> >> > That is a nice solution. There is a fence_ipmilan agent in the red hat > cluster distibution...how are you invoking the above for fencing? To > check if the rh agent works, here is the command line you would use (it > installs into /sbin...): > > /sbin/fence_ipmilan -a -l -p -P -o > [off,on,reboot,status] > > There is a man page for fence_ipmilan that details some extra params. 
> > Well, I guess that solves the issue...if anyone would use an snmp-based > ILOM agent, we could talk about how to construct that...otherwise, so > much for my idea of this thread being instructions for creating > arbitrary agents! ;) > > -J > I just tested it and it seems to work perfectly. Sorry for bringing the thread to a premature end :) Brad Crotchett brad at bradandkim.net http://www.bradandkim.net From pbruna at it-linux.cl Mon Aug 13 16:03:05 2007 From: pbruna at it-linux.cl (Patricio A. Bruna) Date: Mon, 13 Aug 2007 12:03:05 -0400 (CLT) Subject: [Linux-cluster] GFS Problem In-Reply-To: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl> Message-ID: <22952137.35051187020985554.JavaMail.root@lisa.itlinux.cl> Sory, here goes cluster.conf ----- Mensaje Original ----- De: "Patricio A. Bruna" Para: linux-cluster at redhat.com Enviados: lunes 13 de agosto de 2007 11H57 (GMT-0400) America/Santiago Asunto: [Linux-cluster] GFS Problem Hi, I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1) filesystems. Im having a problem when restarting one of the nodes, the other node can not longer access the GFS partitions, so i must reboot both. Althoug, yesterday i resize a GFS partition, lvmextend and then gfs_grow, and the other node gaves I/O error and dismounts all the GFS partitions. Do you have any ideas why this is happening? PD: Im attaching the cluster.conf. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 1290 bytes Desc: not available URL: From jos at xos.nl Mon Aug 13 16:06:20 2007 From: jos at xos.nl (Jos Vos) Date: Mon, 13 Aug 2007 18:06:20 +0200 Subject: [Linux-cluster] GFS Problem In-Reply-To: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>; from pbruna@it-linux.cl on Mon, Aug 13, 2007 at 11:57:00AM -0400 References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl> Message-ID: <20070813180620.A7707@xos037.xos.nl> On Mon, Aug 13, 2007 at 11:57:00AM -0400, Patricio A. Bruna wrote: > I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1) filesystems. > Im having a problem when restarting one of the nodes, the other node can not longer access the GFS partitions, so i must reboot both. > Althoug, yesterday i resize a GFS partition, lvmextend and then gfs_grow, and the other node gaves I/O error and dismounts all the GFS partitions. > > Do you have any ideas why this is happening? This sounds like a problem at SCSI-level. Do you see SCSI errors in /var/log/messages? What kind of shared storage access are you using? > PD: Im attaching the cluster.conf. Not found. -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From brad at bradandkim.net Mon Aug 13 16:12:50 2007 From: brad at bradandkim.net (brad at bradandkim.net) Date: Mon, 13 Aug 2007 11:12:50 -0500 (CDT) Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: References: Message-ID: <53101.129.237.174.144.1187021570.squirrel@webmail.bradandkim.net> > Hi Brad, > > If it works perfect I'd like to use your configuration for my own SUN > X4100 systems. Can you please send your configuration files? > > Thanks a lot > Arne > I don't have the cluster completely configured yet. I was simply testing the command line script '/sbin/fence_ipmilan -a -l -p -P -o [off,on,reboot,status]' mentioned by James. 
I will be working on generating a more complete cluster config throughout this week. Basically I just have 3 nodes, each fenced with a fence_ipmilan agent, and one gfs filesystem defined. I have noticed that if I use system-config-cluster to define the fence_ipmilan agents it sets an attribute lanplus="" which I was told to use by Quentin. However, the next time I run system-config-cluster it cannot parse the cluster.conf file and errors out. I have removed the lanplus portion for now and it seems to be ok, but I am going to look into that more. Good luck! Brad Crotchett brad at bradandkim.net http://www.bradandkim.net From jos at xos.nl Mon Aug 13 16:20:44 2007 From: jos at xos.nl (Jos Vos) Date: Mon, 13 Aug 2007 18:20:44 +0200 Subject: [Linux-cluster] GFS Problem In-Reply-To: <22952137.35051187020985554.JavaMail.root@lisa.itlinux.cl>; from pbruna@it-linux.cl on Mon, Aug 13, 2007 at 12:03:05PM -0400 References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl> <22952137.35051187020985554.JavaMail.root@lisa.itlinux.cl> Message-ID: <20070813182044.B7707@xos037.xos.nl> On Mon, Aug 13, 2007 at 12:03:05PM -0400, Patricio A. Bruna wrote: > I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1) filesystems. > Im having a problem when restarting one of the nodes, the other node can not longer access the GFS partitions, so i must reboot both. > Althoug, yesterday i resize a GFS partition, lvmextend and then gfs_grow, and the other node gaves I/O error and dismounts all the GFS partitions. Hmm... several questions arise now: - Did you create the GFS filesystems with the correct locking protocol (lock_dlm)? - Do you use clvmd? Did you mark your VG's to be "clustered" and do you have "locking_style = 3" in /etc/lvm/lvm.conf? > PD: Im attaching the cluster.conf. Where are the GFS filesystems in cluster.conf? -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From brad at bradandkim.net Mon Aug 13 16:25:23 2007 From: brad at bradandkim.net (brad at bradandkim.net) Date: Mon, 13 Aug 2007 11:25:23 -0500 (CDT) Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: References: Message-ID: <58270.129.237.174.144.1187022323.squirrel@webmail.bradandkim.net> > > Dear list members, > > If I understand well, here are the steps I have to follow to configure a > fence device that will use SUN ILOM interface : > > 1. Ensure that OpenIPMI and OpenIPMI-tools packages are installed on the > cluster nodes. > 2. With system-config-cluster tool, add a new fence device of type "IPMI > Lan". Fill the form with the ILOM IP address, the name of the > administrator user and his password. Then associate the fence device with > the cluster node. > 3. Repeat the step above for the SUN ILOM interface of each node. > 4. Send the new configuration to the cluster > 5. That's all and everything will be handle correctly by the cluster. > > Am I ok? 
> > Best regards > __________________ > > Stephanie Lanthier > > Analyste de l'informatique > Universite du Quebec a Montreal > Service de l'informatique et des telecommunications > lanthier.stephanie at uqam.ca > T?l?phone : 514-987-3000 poste 6106 > Bureau : PK-M535 > > > > > > > ------------------------------ > > Message: 13 > Date: Fri, 10 Aug 2007 14:21:31 -0500 (CDT) > From: brad at bradandkim.net > Subject: Re: [Linux-cluster] Add a fence device of type SUN ILOM > To: "linux clustering" > Message-ID: > <56032.129.237.174.144.1186773691.squirrel at webmail.bradandkim.net> > Content-Type: text/plain;charset=iso-8859-1 > > >> brad at bradandkim.net wrote: >> >>>>brad at bradandkim.net wrote: >>>> >>>> >>>>>>Quentin Arce wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>>>My machines are SUN Fire X4100. I see that we can define a fence >>>>>>>>>device of type HP ILO. I would like to know if I can use the HP >>>>>>>>> ILO >>>>>>>>>form in system-config-cluster tool to enter and use a SUN ILOM as >>>>>>>>>fence device? >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to >>>>>>>>ILOM? >>>>>>>>If so, there might be a way...by hacking on another agent. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>So, I'm a lurker on this list as I no longer have a cluster up... >>>>>>> but >>>>>>>I work on ILOM and I would love to see this work. This isn't >>>>>>> official >>>>>>>support, I'm a developer not a customer support person. So, it's >>>>>>> more >>>>>>>on my time. If there is anything I can do... Please let me know. >>>>>>>Questions on this problem, regarding what ILOM can / can't do, how >>>>>>> to >>>>>>>check state of the server via ILOM, etc. >>>>>>> >>>>>>> >>>>>>> >>>>>>Quentin! That is very kind of you. If you help with the ILOM >>>>>> protocol, >>>>>>I'll help with the agent/script. This thread could form a document on >>>>>>how to write an arbitrary fence agent for use with rhcs. >>>>>> >>>>>>Where is documentation available? Generally, three things are needed >>>>>>from a baseboard management device in order to use it for fencing: 1) >>>>>> A >>>>>>way to shut the system down, 2) a way to power the system up, and 3) >>>>>> a >>>>>>way to check if it is up or down. >>>>>> >>>>>>What means can a script use to communicate with the ILOM card? Are >>>>>>there >>>>>>big delta's in the protocol between different ILOM versions? >>>>>> >>>>>>I look forward to hearing from you. >>>>>> >>>>>>-J >>>>>> >>>>>> >>>>>> >>>>>I am interested in seeing this thread play out as well since I have 26 >>>>>SUN >>>>>servers I am beginning to cluster. My question is why use SNMP over >>>>>IPMI >>>>>v2.0. I can do the above three things with: >>>>> >>>>>/usr/bin/ipmitool -U -P -H chassis power >>>>> off >>>>>/usr/bin/ipmitool -U -P -H chassis power >>>>> on >>>>>/usr/bin/ipmitool -U -P -H chassis power >>>>>status >>>>> >>>>>I don't need any MIB's for this either. It seems to me this might be >>>>> an >>>>>easier solution than snmp, but I may be missing something. >>>>> >>>>> >>>>> >>>>> >>>>Oh make sure you are using lanplus mode for this. >>>> >>>> >>>> >>> >>>Will do, and thanks. >>> >>> >> That is a nice solution. There is a fence_ipmilan agent in the red hat >> cluster distibution...how are you invoking the above for fencing? 
To >> check if the rh agent works, here is the command line you would use (it >> installs into /sbin...): >> >> /sbin/fence_ipmilan -a -l -p -P -o >> [off,on,reboot,status] >> >> There is a man page for fence_ipmilan that details some extra params. >> >> Well, I guess that solves the issue...if anyone would use an snmp-based >> ILOM agent, we could talk about how to construct that...otherwise, so >> much for my idea of this thread being instructions for creating >> arbitrary agents! ;) >> >> -J >> > > I just tested it and it seems to work perfectly. Sorry for bringing the > thread to a premature end :) > > Brad Crotchett > brad at bradandkim.net > http://www.bradandkim.net > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > I believe that will do it. I just tested shutting down the network interfaces on one node and fenced successfully used the ipmilan fencing agent to reboot the node. Brad Crotchett brad at bradandkim.net http://www.bradandkim.net From jparsons at redhat.com Mon Aug 13 16:35:35 2007 From: jparsons at redhat.com (James Parsons) Date: Mon, 13 Aug 2007 12:35:35 -0400 Subject: [Linux-cluster] Add a fence device of type SUN ILOM In-Reply-To: <53101.129.237.174.144.1187021570.squirrel@webmail.bradandkim.net> References: <53101.129.237.174.144.1187021570.squirrel@webmail.bradandkim.net> Message-ID: <46C08857.1070907@redhat.com> brad at bradandkim.net wrote: >>Hi Brad, >> >>If it works perfect I'd like to use your configuration for my own SUN >>X4100 systems. Can you please send your configuration files? >> >>Thanks a lot >>Arne >> >> >> > >I don't have the cluster completely configured yet. I was simply testing >the command line script '/sbin/fence_ipmilan -a -l -p > -P -o [off,on,reboot,status]' mentioned by James. I will be >working on generating a more complete cluster config throughout this week. > Basically I just have 3 nodes, each fenced with a fence_ipmilan agent, >and one gfs filesystem defined. > >I have noticed that if I use system-config-cluster to define the >fence_ipmilan agents it sets an attribute > >lanplus="" > Hmmm...that is the same as saying 'lanplus="0"'...IOT, it unsets it. If you get a parse error when it is in, then the relaxng schema file is not up to date...you can ignore the warning. There should be a 'lanplus' checkbox added to cfg for ipmi fencing, if you wish to use it. -J From pbruna at it-linux.cl Mon Aug 13 16:43:54 2007 From: pbruna at it-linux.cl (Patricio A. Bruna) Date: Mon, 13 Aug 2007 12:43:54 -0400 (CLT) Subject: [Linux-cluster] GFS Problem In-Reply-To: <20070813182044.B7707@xos037.xos.nl> Message-ID: <20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl> Jos, Im answering one by one > - Did you create the GFS filesystems with the correct locking > protocol (lock_dlm)? Yes i did, like the documentations says: gfs_mkfs -p lock_dlm -t cluster_eseia:portales -j 4 /dev/CLUSTERLVM/portales I used 4 journal, cause we are hoping to add 2 more nodes soon. > - Do you use clvmd? Did you mark your VG's to be "clustered" and > do you have "locking_style = 3" in /etc/lvm/lvm.conf? No i did not. By the way i only have locking_type in the file, i suppose is what you meant. Righ now is: locking_type = 1 > > PD: Im attaching the cluster.conf. > > Where are the GFS filesystems in cluster.conf? They arent, because they are supposed to start right away the server boot, and the services i have: IP and Perlbal, do not use the GFS filesystems. 
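(As an aside, when the GFS mounts are handled at boot instead of by rgmanager, the usual RHEL 5 arrangement is an fstab entry plus the stock init scripts. A sketch only -- the mount point below is an example:

    # /etc/fstab
    /dev/CLUSTERLVM/portales   /portales   gfs   defaults   0 0

    chkconfig cman on
    chkconfig clvmd on    # needed once the VG is marked clustered
    chkconfig gfs on      # mounts the gfs entries from fstab after the cluster is up

The init script ordering makes the gfs mounts happen only after cman and clvmd are running, which is what makes boot-time mounting safe.)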
In my logs files i saw this: ################################## newton fenced[2394]: fencing node "davinci" newton fenced[2394]: fence "davinci" failed newton fenced[2394]: fencing node "davinci" newton fenced[2394]: fence "davinci" failed #################################### and after runnig the gfs_grow command in davinci: #################################### Aug 11 04:06:10 newton kernel: attempt to access beyond end of device Aug 11 04:06:10 newton kernel: dm-4: rw=0, want=270678704, limit=251658240 Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: fatal: I/O error Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: block = 33834837 Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: function = gfs_dreread Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: file = /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.c, line = 576 Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: time = 1186819570 Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: about to withdraw from the cluster Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: telling LM to withdraw Aug 11 04:06:11 newton kernel: dlm: drop message 11 from 1 for unknown lockspace 655362 Aug 11 04:06:11 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: withdrawn Aug 11 04:06:11 newton kernel: [] gfs_lm_withdraw+0x76/0x82 [gfs] Aug 11 04:06:11 newton kernel: [] gfs_io_error_bh_i+0x2c/0x31 [gfs] Aug 11 04:06:11 newton kernel: [] gfs_dreread+0x9f/0xbf [gfs] Aug 11 04:06:11 newton kernel: [] gfs_dread+0x20/0x36 [gfs] Aug 11 04:06:11 newton kernel: [] get_leaf+0x17/0x88 [gfs] Aug 11 04:06:11 newton kernel: [] gfs_dir_read+0x13f/0x68f [gfs] Aug 11 04:06:11 newton kernel: [] wait_for_completion+0x18/0x8d Aug 11 04:06:11 newton kernel: [] complete+0x2b/0x3d Aug 11 04:06:11 newton kernel: [] gfs_readdir+0xea/0x29e [gfs] Aug 11 04:06:11 newton kernel: [] filldir_reg_func+0x0/0x13b [gfs] Aug 11 04:06:11 newton kernel: [] filldir64+0x0/0xc5 Aug 11 04:06:11 newton kernel: [] anon_vma_prepare+0x11/0xa5 Aug 11 04:06:11 newton kernel: [] filldir64+0x0/0xc5 Aug 11 04:06:11 newton kernel: [] vfs_readdir+0x63/0x8d Aug 11 04:06:11 newton kernel: [] filldir64+0x0/0xc5 Aug 11 04:06:11 newton kernel: [] sys_getdents64+0x63/0xa5 Aug 11 04:06:11 newton kernel: [] syscall_call+0x7/0xb Aug 11 04:06:11 newton kernel: ======================= ############################################################################### From rpeterso at redhat.com Mon Aug 13 16:43:07 2007 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 13 Aug 2007 11:43:07 -0500 Subject: [Linux-cluster] GFS Problem In-Reply-To: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl> References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl> Message-ID: <1187023387.19616.11.camel@technetium.msp.redhat.com> On Mon, 2007-08-13 at 11:57 -0400, Patricio A. Bruna wrote: > Hi, > I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1) > filesystems. > Im having a problem when restarting one of the nodes, the other node > can not longer access the GFS partitions, so i must reboot both. > Althoug, yesterday i resize a GFS partition, lvmextend and then > gfs_grow, and the other node gaves I/O error and dismounts all the GFS > partitions. > > Do you have any ideas why this is happening? > > PD: Im attaching the cluster.conf. Hi, Check to make sure the clustered bit is on for the vg. 
See: http://sources.redhat.com/cluster/faq.html#clvmd_clustered Regards, Bob Peterson From jos at xos.nl Mon Aug 13 16:53:18 2007 From: jos at xos.nl (Jos Vos) Date: Mon, 13 Aug 2007 18:53:18 +0200 Subject: [Linux-cluster] GFS Problem In-Reply-To: <20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl>; from pbruna@it-linux.cl on Mon, Aug 13, 2007 at 12:43:54PM -0400 References: <20070813182044.B7707@xos037.xos.nl> <20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl> Message-ID: <20070813185318.C7707@xos037.xos.nl> On Mon, Aug 13, 2007 at 12:43:54PM -0400, Patricio A. Bruna wrote: > > - Did you create the GFS filesystems with the correct locking > > protocol (lock_dlm)? > Yes i did, like the documentations says: > gfs_mkfs -p lock_dlm -t cluster_eseia:portales -j 4 /dev/CLUSTERLVM/portales OK. > > - Do you use clvmd? Did you mark your VG's to be "clustered" and > > do you have "locking_style = 3" in /etc/lvm/lvm.conf? > > No i did not. By the way i only have locking_type in the file, i suppose is what you meant. Righ now is: > locking_type = 1 This should be "3". Furthermore, apply this command to each of your clustererd volume groups: vgchange --clustered y /dev/vg... Then do a "vgscan". > They arent, because they are supposed to start right away the server boot, and the services i have: IP and Perlbal, do not use the GFS filesystems. But the cluster services, including clvmd, have to be started before the GFS filesystems are used. Better make it another service, that *only* has the GFS filesystems as resources, and that uses its own failover domain (one for each node), so that mounting the volumes are taken care of by the cluster services. > Aug 11 04:06:10 newton kernel: attempt to access beyond end of device > Aug 11 04:06:10 newton kernel: dm-4: rw=0, want=270678704, limit=251658240 > [...] I think this is because the VG's are not marked to be clustered and thus the other node is not aware of the resizing. -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From jos at xos.nl Mon Aug 13 16:57:41 2007 From: jos at xos.nl (Jos Vos) Date: Mon, 13 Aug 2007 18:57:41 +0200 Subject: [Linux-cluster] GFS Problem In-Reply-To: <1187023387.19616.11.camel@technetium.msp.redhat.com>; from rpeterso@redhat.com on Mon, Aug 13, 2007 at 11:43:07AM -0500 References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl> <1187023387.19616.11.camel@technetium.msp.redhat.com> Message-ID: <20070813185741.D7707@xos037.xos.nl> On Mon, Aug 13, 2007 at 11:43:07AM -0500, Bob Peterson wrote: > Check to make sure the clustered bit is on for the vg. > See: http://sources.redhat.com/cluster/faq.html#clvmd_clustered B.t.w. Bob, are you aware of the fact that this is *not* documented in the RHEL5 guides (GFS and LVM), as far as I can see, and that the vgchange manual page does not describe this option (although "vgchange --help" shows the option)? It cost me quite some time to find this, as it is mentioned in the guides that you should set the "clustered" bit, but you can't easily find (except in the URL you just gave) how to do that. -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From pbruna at it-linux.cl Mon Aug 13 17:05:05 2007 From: pbruna at it-linux.cl (Patricio A. 
Bruna) Date: Mon, 13 Aug 2007 13:05:05 -0400 (CLT) Subject: [Linux-cluster] GFS Problem In-Reply-To: <20070813185318.C7707@xos037.xos.nl> Message-ID: <32724846.35361187024705316.JavaMail.root@lisa.itlinux.cl> > This should be "3". What happens with the locals VGs that are no part of the cluster? >Furthermore, apply this command to each of your > clustererd volume groups: > > vgchange --clustered y /dev/vg... > > Then do a "vgscan". > Is safe to run those command with a node down and the order in production? Thanks From berthiaume_wayne at emc.com Mon Aug 13 17:09:37 2007 From: berthiaume_wayne at emc.com (berthiaume_wayne at emc.com) Date: Mon, 13 Aug 2007 13:09:37 -0400 Subject: [Linux-cluster] create GFS file system on imported iSCSI disk In-Reply-To: <229C73600EB0E54DA818AB599482BCE901921895@shadowfax.sg.muvee.net> References: <229C73600EB0E54DA818AB599482BCE901921863@shadowfax.sg.muvee.net><46ADFBD3.7060200@redhat.com> <229C73600EB0E54DA818AB599482BCE901921895@shadowfax.sg.muvee.net> Message-ID: /etc/fstab will require the _netdev flag or else the filesystem will not get mounted during boot. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bernard Chew Sent: Monday, July 30, 2007 10:02 PM To: linux clustering Subject: RE: [Linux-cluster] create GFS file system on imported iSCSI disk Thanks Bryn! - Bernard -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bryn M. Reeves Sent: Monday, July 30, 2007 10:55 PM To: linux clustering Subject: Re: [Linux-cluster] create GFS file system on imported iSCSI disk -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Bernard Chew wrote: > May I know if we need to set up logical volumes on the imported iSCSI > disk before creating GFS or we can immediately run "gfs_mkfs -p lock_dlm > -t alpha_cluster:gfs01 -j 8 /dev/sdb" (where /dev/sdb refers to the > imported iSCSI disk) on one node and all nodes to "mount -t gfs /dev/sdb > /test"? > There aren't any special steps needed for iSCSI over any other kind of shared storage (once you've configured the initiators & the devices are visible to the OS). Creating volume groups and using logical volumes for GFS is a good idea if you are likely to want to resize your devices at a later time but is not strictly necessary. Other than that, the steps you detailed should work fine. Kind regards, Bryn. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org iD8DBQFGrfvT6YSQoMYUY94RAgGFAJ96WnspeUgKHKiBwHRh71aluGcoUgCfXvrB fg2wFdqf96s6kciF0ypfzB0= =cTyA -----END PGP SIGNATURE----- -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From jos at xos.nl Mon Aug 13 17:10:19 2007 From: jos at xos.nl (Jos Vos) Date: Mon, 13 Aug 2007 19:10:19 +0200 Subject: [Linux-cluster] GFS Problem In-Reply-To: <32724846.35361187024705316.JavaMail.root@lisa.itlinux.cl>; from pbruna@it-linux.cl on Mon, Aug 13, 2007 at 01:05:05PM -0400 References: <20070813185318.C7707@xos037.xos.nl> <32724846.35361187024705316.JavaMail.root@lisa.itlinux.cl> Message-ID: <20070813191019.E7707@xos037.xos.nl> On Mon, Aug 13, 2007 at 01:05:05PM -0400, Patricio A. Bruna wrote: > What happens with the locals VGs that are no part of the cluster? 
AFAIK this works ok (I have those too). > > vgchange --clustered y /dev/vg... > > > > Then do a "vgscan". > > Is safe to run those command with a node down and the order in production? Yes, it just changes a bit on the physical storage. -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From rpeterso at redhat.com Mon Aug 13 17:09:06 2007 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 13 Aug 2007 12:09:06 -0500 Subject: [Linux-cluster] GFS Problem In-Reply-To: <20070813185741.D7707@xos037.xos.nl> References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl> <1187023387.19616.11.camel@technetium.msp.redhat.com> <20070813185741.D7707@xos037.xos.nl> Message-ID: <1187024946.19616.14.camel@technetium.msp.redhat.com> On Mon, 2007-08-13 at 18:57 +0200, Jos Vos wrote: > On Mon, Aug 13, 2007 at 11:43:07AM -0500, Bob Peterson wrote: > > > Check to make sure the clustered bit is on for the vg. > > See: http://sources.redhat.com/cluster/faq.html#clvmd_clustered > > B.t.w. Bob, are you aware of the fact that this is *not* documented > in the RHEL5 guides (GFS and LVM), as far as I can see, and that > the vgchange manual page does not describe this option (although > "vgchange --help" shows the option)? > > It cost me quite some time to find this, as it is mentioned in > the guides that you should set the "clustered" bit, but you can't > easily find (except in the URL you just gave) how to do that. Hi Jos, No, I wasn't aware of that fact. I've passed this on to our documentation folks and hopefully they'll correct the manual. Regards, Bob Peterson From berthiaume_wayne at emc.com Mon Aug 13 17:21:51 2007 From: berthiaume_wayne at emc.com (berthiaume_wayne at emc.com) Date: Mon, 13 Aug 2007 13:21:51 -0400 Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error In-Reply-To: <46BC6EF5.2010500@lexum.umontreal.ca> References: <46BC6EF5.2010500@lexum.umontreal.ca> Message-ID: These are DID_BUS_BUSY errors being reported by the QLogic driver. I would check your SAN for congestion, increase the cache in the EVA, or change the queue depth in your qla2xxx driver. Regards, Wayne. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of FM Sent: Friday, August 10, 2007 9:58 AM To: Redhat Cluster Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error Hello, All servers are RHEL 4.5 SAN is HP EVA 4000 we are using linux qla modules and multipathd cluster server have only one FC Card In the dmesg of servers connected to GFS we have a lot of : SCSI error : <0 0 1 1> return code = 0x20000 end_request: I/O error, dev sdd, sector 37807111 The cluster seems to work fine but I'd like to know if we can avoid this error. 
here is a multipathd -ll output : [root at como ~]# multipath -ll mpath1 (3600508b4001051e40000900000310000) [size=500 GB][features="1 queue_if_no_path"][hwhandler="0"] \_ round-robin 0 [prio=50][active] \_ 0:0:0:1 sda 8:0 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 0:0:1:1 sdd 8:48 [active][ready] mpath3 (3600508b4001051e400009000009e0000) [size=150 GB][features="1 queue_if_no_path"][hwhandler="0"] \_ round-robin 0 [prio=50][active] \_ 0:0:1:2 sde 8:64 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 0:0:0:2 sdb 8:16 [active][ready] and the device in the multipath.conf devices { device { vendor "HP " product "HSV200 " path_grouping_policy group_by_prio getuid_callout "/sbin/scsi_id -g -u -s /block/%n" path_checker tur path_selector "round-robin 0" prio_callout "/sbin/mpath_prio_alua %d" failback immediate no_path_retry 60 } } -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From jos at xos.nl Mon Aug 13 17:44:29 2007 From: jos at xos.nl (Jos Vos) Date: Mon, 13 Aug 2007 19:44:29 +0200 Subject: [Linux-cluster] "Address already in use" messages at first startup Message-ID: <200708131744.l7DHiTx09045@xos037.xos.nl> Hi, I always seem to get at the first service start after a reboot: Aug 13 19:29:16 host1 in.rdiscd[5832]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use Aug 13 19:29:16 host1 in.rdiscd[5832]: Failed joining addresses Aug 13 19:29:17 host1 in.rdiscd[5884]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use Aug 13 19:29:17 host1 in.rdiscd[5884]: Failed joining addresses I have one IPv4 and one IPv6 address associated to the service, so I guess there is one message for each address. The address assignments seem to work fine for the rest. I see "rdisc -fs" running (after a fresh reboot), but the "rdisc" service is disabled (set to "off" with chkconfig). What is starting this "rdisc" daemon and do I need it? Thanks, -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From mhanafi at csc.com Mon Aug 13 18:33:54 2007 From: mhanafi at csc.com (Mahmoud Hanafi) Date: Mon, 13 Aug 2007 14:33:54 -0400 Subject: [Linux-cluster] Kernel Panic GFS2 and NFS Message-ID: I am getting the following kernel panic when exporting GFS2 via NFS. When the client mounts and does a ls the server panics. Any one else seen this issue? Any ideas? 
dlm: connecting to 6 dlm: got connection from 6 NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory NFSD: starting 90-second grace period Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP: [] :gfs2:gfs2_glock_dq+0x15/0xa5 PGD 22d792067 PUD 22d77a067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /fs/gfs2/nfs_cluster:stripe4_512k/lock_module/recover_done CPU 0 Modules linked in: nfsd exportfs lockd nfs_acl qla2xxx lock_dlm gfs2 dlm configfs iptable_filter ip_tables autofs4 hidp rfcomm l2cap bluetooth sunrpc ip6t_REJECT xt_ tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_round_robin dm_multipath video sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parpo rt joydev sr_mod sg pcspkr ib_mthca ib_mad ib_core shpchp bnx2 ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage scsi_transport_fc megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 29865, comm: nfsd Not tainted 2.6.18-8.el5 #1 RIP: 0010:[] [] :gfs2:gfs2_glock_dq+0x15/0xa5 RSP: 0018:ffff8101f51c96e0 EFLAGS: 00010246 RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8101f51c9cd0 RDX: ffff8101f57f2080 RSI: ffff8101f51c97b0 RDI: ffff8101f51c9720 RBP: ffff8101f51c9720 R08: ffff8101f504f014 R09: ffff8101f51c99b8 R10: ffff8101f57ad008 R11: ffffffff8811d23c R12: ffff8101f81fe780 R13: ffff8101f51c97b0 R14: 0000000000000001 R15: ffff8101f504f000 FS: 00002aaaab0146f0(0000) GS:ffffffff8038a000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000088 CR3: 000000022d613000 CR4: 00000000000006e0 Process nfsd (pid: 29865, threadinfo ffff8101f51c8000, task ffff8101f57f2080) Stack: 0000000000000000 ffff8101f51c9720 ffff8101f51c99b0 ffff8101f81fe780 ffff8101f51c97b0 ffffffff8811112c 0000000000000000 ffffffff8811d2e3 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Call Trace: [] :gfs2:gfs2_glock_dq_uninit+0x9/0x12 [] :gfs2:gfs2_getattr+0xa7/0xb4 [] vfs_getattr+0x2d/0xa9 [] :nfsd:encode_post_op_attr+0x3f/0x213 [] :nfsd:encode_entry+0x232/0x53e [] zone_statistics+0x3e/0x6d [] enqueue_task+0x41/0x56 [] __activate_task+0x27/0x39 [] try_to_wake_up+0x407/0x418 [] __wake_up_common+0x3e/0x68 [] :nfsd:nfs3svc_encode_entry_plus+0xb/0x10 [] :gfs2:filldir_func+0x22/0x86 [] :gfs2:do_filldir_main+0x126/0x16d [] :gfs2:filldir_func+0x0/0x86 [] :gfs2:gfs2_dirent_gather+0x0/0x24 [] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10 [] :gfs2:gfs2_dir_read+0x416/0x479 [] :gfs2:filldir_func+0x0/0x86 [] :gfs2:gfs2_trans_end+0x14e/0x16b [] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10 [] :gfs2:gfs2_readdir+0x98/0xbe [] :gfs2:gfs2_glock_nq_atime+0x14e/0x292 [] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10 [] vfs_readdir+0x77/0xa9 [] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10 [] :nfsd:nfsd_readdir+0x6d/0xc5 [] :nfsd:nfsd3_proc_readdirplus+0xf8/0x224 [] :nfsd:nfsd_dispatch+0xd7/0x198 [] :sunrpc:svc_process+0x43c/0x6fa [] __down_read+0x12/0x92 [] :nfsd:nfsd+0x0/0x327 [] :nfsd:nfsd+0x1b3/0x327 [] child_rip+0xa/0x11 [] :nfsd:nfsd+0x0/0x327 [] :nfsd:nfsd+0x0/0x327 [] child_rip+0x0/0x11 Code: 4c 8b ab 88 00 00 00 74 0a 31 f6 48 89 df e8 40 fa ff ff 4c RIP [] :gfs2:gfs2_glock_dq+0x15/0xa5 RSP CR2: 0000000000000088 <0>Kernel panic - not syncing: Fatal exception Mahmoud Hanafi Sr. System Administrator CSC HPC COE Bld. 
676 2435 Fifth Street WPAFB, Ohio 45433 (937) 255-1536 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- This is a PRIVATE message. If you are not the intended recipient, please delete without copying and kindly advise us by e-mail of the mistake in delivery. NOTE: Regardless of content, this e-mail shall not operate to bind CSC to any order or other contract unless pursuant to explicit written agreement or government initiative expressly permitting the use of e-mail for such purpose. -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From wcheng at redhat.com Mon Aug 13 19:02:51 2007 From: wcheng at redhat.com (Wendy Cheng) Date: Mon, 13 Aug 2007 15:02:51 -0400 Subject: [Linux-cluster] Kernel Panic GFS2 and NFS In-Reply-To: References: Message-ID: <46C0AADB.1070209@redhat.com> Mahmoud Hanafi wrote: > > I am getting the following kernel panic when exporting GFS2 via NFS. > When the client mounts and does a ls the server panics. Any one else > seen this issue? Any ideas? We had this issue no time ago. Your kernel version is way too old... Wendy > > dlm: connecting to 6 > dlm: got connection from 6 > NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory > NFSD: starting 90-second grace period > Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP: > [] :gfs2:gfs2_glock_dq+0x15/0xa5 > PGD 22d792067 PUD 22d77a067 PMD 0 > Oops: 0000 [1] SMP > last sysfs file: > /fs/gfs2/nfs_cluster:stripe4_512k/lock_module/recover_done > CPU 0 > Modules linked in: nfsd exportfs lockd nfs_acl qla2xxx lock_dlm gfs2 > dlm configfs iptable_filter ip_tables autofs4 hidp rfcomm l2cap > bluetooth sunrpc ip6t_REJECT xt_ > tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_round_robin > dm_multipath video sbs i2c_ec i2c_core button battery asus_acpi > acpi_memhotplug ac parport_pc lp parpo > rt joydev sr_mod sg pcspkr ib_mthca ib_mad ib_core shpchp bnx2 ide_cd > cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage > scsi_transport_fc megaraid_sas > sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd > Pid: 29865, comm: nfsd Not tainted 2.6.18-8.el5 #1 > RIP: 0010:[] [] > :gfs2:gfs2_glock_dq+0x15/0xa5 > RSP: 0018:ffff8101f51c96e0 EFLAGS: 00010246 > RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8101f51c9cd0 > RDX: ffff8101f57f2080 RSI: ffff8101f51c97b0 RDI: ffff8101f51c9720 > RBP: ffff8101f51c9720 R08: ffff8101f504f014 R09: ffff8101f51c99b8 > R10: ffff8101f57ad008 R11: ffffffff8811d23c R12: ffff8101f81fe780 > R13: ffff8101f51c97b0 R14: 0000000000000001 R15: ffff8101f504f000 > FS: 00002aaaab0146f0(0000) GS:ffffffff8038a000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 0000000000000088 CR3: 000000022d613000 CR4: 00000000000006e0 > Process nfsd (pid: 29865, threadinfo ffff8101f51c8000, task > ffff8101f57f2080) > Stack: 0000000000000000 ffff8101f51c9720 ffff8101f51c99b0 > ffff8101f81fe780 > ffff8101f51c97b0 ffffffff8811112c 0000000000000000 ffffffff8811d2e3 > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > Call Trace: > [] :gfs2:gfs2_glock_dq_uninit+0x9/0x12 > [] :gfs2:gfs2_getattr+0xa7/0xb4 > [] vfs_getattr+0x2d/0xa9 > [] :nfsd:encode_post_op_attr+0x3f/0x213 > [] 
:nfsd:encode_entry+0x232/0x53e > [] zone_statistics+0x3e/0x6d > [] enqueue_task+0x41/0x56 > [] __activate_task+0x27/0x39 > [] try_to_wake_up+0x407/0x418 > [] __wake_up_common+0x3e/0x68 > [] :nfsd:nfs3svc_encode_entry_plus+0xb/0x10 > [] :gfs2:filldir_func+0x22/0x86 > [] :gfs2:do_filldir_main+0x126/0x16d > [] :gfs2:filldir_func+0x0/0x86 > [] :gfs2:gfs2_dirent_gather+0x0/0x24 > [] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10 > [] :gfs2:gfs2_dir_read+0x416/0x479 > [] :gfs2:filldir_func+0x0/0x86 > [] :gfs2:gfs2_trans_end+0x14e/0x16b > [] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10 > [] :gfs2:gfs2_readdir+0x98/0xbe > [] :gfs2:gfs2_glock_nq_atime+0x14e/0x292 > [] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10 > [] vfs_readdir+0x77/0xa9 > [] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10 > [] :nfsd:nfsd_readdir+0x6d/0xc5 > [] :nfsd:nfsd3_proc_readdirplus+0xf8/0x224 > [] :nfsd:nfsd_dispatch+0xd7/0x198 > [] :sunrpc:svc_process+0x43c/0x6fa > [] __down_read+0x12/0x92 > [] :nfsd:nfsd+0x0/0x327 > [] :nfsd:nfsd+0x1b3/0x327 > [] child_rip+0xa/0x11 > [] :nfsd:nfsd+0x0/0x327 > [] :nfsd:nfsd+0x0/0x327 > [] child_rip+0x0/0x11 > > > Code: 4c 8b ab 88 00 00 00 74 0a 31 f6 48 89 df e8 40 fa ff ff 4c > RIP [] :gfs2:gfs2_glock_dq+0x15/0xa5 > RSP > CR2: 0000000000000088 > <0>Kernel panic - not syncing: Fatal exception > > > > > > Mahmoud Hanafi > Sr. System Administrator > CSC HPC COE > Bld. 676 > 2435 Fifth Street > WPAFB, Ohio 45433 > (937) 255-1536 > > > > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > This is a PRIVATE message. If you are not the intended recipient, > please delete without copying and kindly advise us by e-mail of the > mistake in delivery. NOTE: Regardless of content, this e-mail shall > not operate to bind CSC to any order or other contract unless pursuant > to explicit written agreement or government initiative expressly > permitting the use of e-mail for such purpose. > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From wcheng at redhat.com Mon Aug 13 19:06:46 2007 From: wcheng at redhat.com (Wendy Cheng) Date: Mon, 13 Aug 2007 15:06:46 -0400 Subject: [Linux-cluster] Kernel Panic GFS2 and NFS In-Reply-To: <46C0AADB.1070209@redhat.com> References: <46C0AADB.1070209@redhat.com> Message-ID: <46C0ABC6.6060200@redhat.com> Wendy Cheng wrote: > Mahmoud Hanafi wrote: >> >> I am getting the following kernel panic when exporting GFS2 via NFS. >> When the client mounts and does a ls the server panics. Any one else >> seen this issue? Any ideas? > > We had this issue no time ago. Your kernel version is way too old... > Wendy Sorry ... too many things to juggle at the same time .. s/no time/long time/ ... The 2.6.18.37.el5 runs reasonably well... Wendy From Randy.Brown at noaa.gov Mon Aug 13 19:51:24 2007 From: Randy.Brown at noaa.gov (Randy Brown) Date: Mon, 13 Aug 2007 15:51:24 -0400 Subject: [Linux-cluster] Setting up HA cluster as NAS head for storage Message-ID: <46C0B63C.4080305@noaa.gov> I am trying to configure two matching servers in a high availability cluster to work as a NAS head for NFS mounts from our ISCSI based network storage. 
Has anyone done this or is anyone doing this? I am struggling with getting the NFS exports configured so machines outside the cluster can mount these filesystems. I believe I have the two servers failing over correctly. That is, the NFS service I created is properly failing over if one of the machines is unavailable. Any help would be greatly appreciated. I have read the "Configuring and Managing a Redhat Cluster" document as well as the "Redhat Cluster Suite Overview" document, but they don't quite cover what I'm trying to do. Any suggested resources? Thank you, Randy -------------- next part -------------- A non-text attachment was scrubbed... Name: randy.brown.vcf Type: text/x-vcard Size: 348 bytes Desc: not available URL: From Christopher.Barry at qlogic.com Mon Aug 13 19:57:45 2007 From: Christopher.Barry at qlogic.com (Christopher Barry) Date: Mon, 13 Aug 2007 15:57:45 -0400 Subject: [Linux-cluster] Setting up HA cluster as NAS head for storage In-Reply-To: <46C0B63C.4080305@noaa.gov> References: <46C0B63C.4080305@noaa.gov> Message-ID: <1187035065.5231.21.camel@localhost> On Mon, 2007-08-13 at 15:51 -0400, Randy Brown wrote: > I am trying to configure two matching servers in a high availability > cluster to work as a NAS head for NFS mounts from our ISCSI based > network storage. Has anyone done this or is anyone doing this? I am > struggling with getting the NFS exports configured so machines outside > the cluster can mount these filesystems. I believe I have the two > servers failing over correctly. That is, the NFS service I created is > properly failing over if one of the machines is unavailable. Any help > would be greatly appreciated. I have read the "Configuring and Managing > a Redhat Cluster" document as well as the "Redhat Cluster Suite > Overview" document, but they don't quite cover what I'm trying to do. > Any suggested resources? > > Thank you, > > Randy > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Take a look here: "http://www.linux-ha.org/HaNFS" -- Regards, -C Christopher Barry Systems Engineer, Principal Qlogic Corporation 780 5th Avenue Suite 140 King of Prussia, PA 19406 From pbruna at it-linux.cl Mon Aug 13 22:14:25 2007 From: pbruna at it-linux.cl (Patricio A. Bruna) Date: Mon, 13 Aug 2007 18:14:25 -0400 (CLT) Subject: [Linux-cluster] Cluster Fence Problem Message-ID: <4760290.36191187043265985.JavaMail.root@lisa.itlinux.cl> Hi, I have configured a cluster, at least i though so, but when i start system-config-cluster an error messages appear that says there is no members joined to the cluster. This is over RHEL 5. In both logs i get: Node1: fenced[2381]: davinci not a cluster member after 3 sec post_join_delay Aug 13 13:41:26 newton fenced[2381]: fencing node "davinci" Aug 13 13:41:26 newton fenced[2381]: fence "davinci" failed Aug 13 13:41:31 newton fenced[2381]: fencing node "davinci" Aug 13 13:41:31 newton fenced[2381]: fence "davinci" failed Node2: Aug 13 13:47:00 davinci fenced[2539]: newton not a cluster member after 3 sec post_join_delay Aug 13 13:47:00 davinci fenced[2539]: fencing node "newton" Aug 13 13:47:00 davinci fenced[2539]: fence "newton" failed Aug 13 13:47:05 davinci fenced[2539]: fencing node "newton" Aug 13 13:47:05 davinci fenced[2539]: fence "newton" failed Any help will be appreciated. PD: im attaching both logs and cluster.conf -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages.davinci Type: application/octet-stream Size: 63505 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: messages.newton Type: application/octet-stream Size: 147371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 1290 bytes Desc: not available URL: From chris at cmiware.com Mon Aug 13 22:25:22 2007 From: chris at cmiware.com (Chris Harms) Date: Mon, 13 Aug 2007 17:25:22 -0500 Subject: [Linux-cluster] Cluster Fence Problem In-Reply-To: <4760290.36191187043265985.JavaMail.root@lisa.itlinux.cl> References: <4760290.36191187043265985.JavaMail.root@lisa.itlinux.cl> Message-ID: <46C0DA52.7040008@cmiware.com> We have set our post_join_delay to 15 minutes (900 seconds) to be safe. We had no end of trouble with fencing issues on a 2-node cluster, and only allowing 3 seconds between the first node in a cluster starting and the 2nd noding coming online before it gets fenced is fairly ridiculous. Patricio A. Bruna wrote: > Hi, > I have configured a cluster, at least i though so, but when i start > system-config-cluster an error messages appear that says there is no > members joined to the cluster. > This is over RHEL 5. In both logs i get: > > Node1: > fenced[2381]: davinci not a cluster member after 3 sec post_join_delay > Aug 13 13:41:26 newton fenced[2381]: fencing node "davinci" > Aug 13 13:41:26 newton fenced[2381]: fence "davinci" failed > Aug 13 13:41:31 newton fenced[2381]: fencing node "davinci" > Aug 13 13:41:31 newton fenced[2381]: fence "davinci" failed > > Node2: > Aug 13 13:47:00 davinci fenced[2539]: newton not a cluster member > after 3 sec post_join_delay > Aug 13 13:47:00 davinci fenced[2539]: fencing node "newton" > Aug 13 13:47:00 davinci fenced[2539]: fence "newton" failed > Aug 13 13:47:05 davinci fenced[2539]: fencing node "newton" > Aug 13 13:47:05 davinci fenced[2539]: fence "newton" failed > > Any help will be appreciated. > > PD: im attaching both logs and cluster.conf > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From mhanafi at csc.com Mon Aug 13 23:34:31 2007 From: mhanafi at csc.com (Mahmoud Hanafi) Date: Mon, 13 Aug 2007 19:34:31 -0400 Subject: [Linux-cluster] locking question Message-ID: Which is better lock_dlm or GLUM? and why? Thanks, Mahmoud Hanafi Sr. System Administrator CSC HPC COE Bld. 676 2435 Fifth Street WPAFB, Ohio 45433 (937) 255-1536 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- This is a PRIVATE message. If you are not the intended recipient, please delete without copying and kindly advise us by e-mail of the mistake in delivery. NOTE: Regardless of content, this e-mail shall not operate to bind CSC to any order or other contract unless pursuant to explicit written agreement or government initiative expressly permitting the use of e-mail for such purpose. -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jparsons at redhat.com Tue Aug 14 06:25:08 2007 From: jparsons at redhat.com (jim parsons) Date: Tue, 14 Aug 2007 02:25:08 -0400 Subject: [Linux-cluster] Cluster Fence Problem In-Reply-To: <46C0DA52.7040008@cmiware.com> References: <4760290.36191187043265985.JavaMail.root@lisa.itlinux.cl> <46C0DA52.7040008@cmiware.com> Message-ID: <1187072708.3014.6.camel@localhost.localdomain> On Mon, 2007-08-13 at 17:25 -0500, Chris Harms wrote: > We have set our post_join_delay to 15 minutes (900 seconds) to be safe. > We had no end of trouble with fencing issues on a 2-node cluster, and > only allowing 3 seconds between the first node in a cluster starting and > the 2nd noding coming online before it gets fenced is fairly ridiculous. > Also, you need to associate the nodes with the fencedevices. This is missing in the conf file. If using s-c-cluster, select a node, click 'manage fencing for this node', and make a default level, then, add fence to the level by choosing the correct fencedevice (newton_ilo for newton, etc.) from the little dropdowm menu. -j > > Patricio A. Bruna wrote: > > Hi, > > I have configured a cluster, at least i though so, but when i start > > system-config-cluster an error messages appear that says there is no > > members joined to the cluster. > > This is over RHEL 5. In both logs i get: > > > > Node1: > > fenced[2381]: davinci not a cluster member after 3 sec post_join_delay > > Aug 13 13:41:26 newton fenced[2381]: fencing node "davinci" > > Aug 13 13:41:26 newton fenced[2381]: fence "davinci" failed > > Aug 13 13:41:31 newton fenced[2381]: fencing node "davinci" > > Aug 13 13:41:31 newton fenced[2381]: fence "davinci" failed > > > > Node2: > > Aug 13 13:47:00 davinci fenced[2539]: newton not a cluster member > > after 3 sec post_join_delay > > Aug 13 13:47:00 davinci fenced[2539]: fencing node "newton" > > Aug 13 13:47:00 davinci fenced[2539]: fence "newton" failed > > Aug 13 13:47:05 davinci fenced[2539]: fencing node "newton" > > Aug 13 13:47:05 davinci fenced[2539]: fence "newton" failed > > > > Any help will be appreciated. > > > > PD: im attaching both logs and cluster.conf > > > > > > ------------------------------------------------------------------------ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From sebastian.walter at fu-berlin.de Tue Aug 14 08:01:46 2007 From: sebastian.walter at fu-berlin.de (Sebastian Walter) Date: Tue, 14 Aug 2007 10:01:46 +0200 Subject: [Linux-cluster] GFS Problem In-Reply-To: <20070813185318.C7707@xos037.xos.nl> References: <20070813182044.B7707@xos037.xos.nl> <20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl> <20070813185318.C7707@xos037.xos.nl> Message-ID: <46C1616A.5020904@fu-berlin.de> Hello Jos, Jos Vos wrote: > But the cluster services, including clvmd, have to be started before > the GFS filesystems are used. Better make it another service, that > *only* has the GFS filesystems as resources, and that uses its own > failover domain (one for each node), so that mounting the volumes are > taken care of by the cluster services. What you describe is exactly the setup I want, but wasn't able to configure it yet. I was setting up a single gfs service including the shared gfs resource of the mount point, but it starts only on one node at once. 
From sebastian.walter at fu-berlin.de Tue Aug 14 08:01:46 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 14 Aug 2007 10:01:46 +0200
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <20070813185318.C7707@xos037.xos.nl>
References: <20070813182044.B7707@xos037.xos.nl> <20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl> <20070813185318.C7707@xos037.xos.nl>
Message-ID: <46C1616A.5020904@fu-berlin.de>

Hello Jos,

Jos Vos wrote:
> But the cluster services, including clvmd, have to be started before
> the GFS filesystems are used. Better make it another service, that
> *only* has the GFS filesystems as resources, and that uses its own
> failover domain (one for each node), so that mounting the volumes is
> taken care of by the cluster services.

What you describe is exactly the setup I want, but I wasn't able to
configure it yet. I was setting up a single gfs service including the
shared gfs resource of the mount point, but it starts only on one node
at a time. Do I have to create one service for each node, each running
in its own failover domain? This sounds as if it could work indeed. And
do I additionally need an lvm resource in that setup (I seem to remember
that the clvmd service was not chkconfig'ed "on" by default)?

Thanks for any help!

Regards,
Sebastian

From jos at xos.nl Tue Aug 14 09:18:56 2007
From: jos at xos.nl (Jos Vos)
Date: Tue, 14 Aug 2007 11:18:56 +0200
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <46C1616A.5020904@fu-berlin.de>; from sebastian.walter@fu-berlin.de on Tue, Aug 14, 2007 at 10:01:46AM +0200
References: <20070813182044.B7707@xos037.xos.nl> <20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl> <20070813185318.C7707@xos037.xos.nl> <46C1616A.5020904@fu-berlin.de>
Message-ID: <20070814111856.A16445@xos037.xos.nl>

On Tue, Aug 14, 2007 at 10:01:46AM +0200, Sebastian Walter wrote:

> What you describe is exactly the setup I want, but I wasn't able to
> configure it yet. I was setting up a single gfs service including the
> shared gfs resource of the mount point, but it starts only on one node
> at a time. Do I have to create one service for each node, each running
> in its own failover domain? [...]

Yes, for all nodes on which you want them to be mounted you need a
separate service with its own one-node (exclusive) failover domain.

> [...] This sounds as if it could work indeed. And do I additionally
> need an lvm resource in that setup (I seem to remember that the clvmd
> service was not chkconfig'ed "on" by default)?

Just do "chkconfig clvmd on", that should work.

--
--    Jos Vos
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204
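A rough cluster.conf sketch of the layout Jos describes, with one
restricted, single-node failover domain per node and a GFS mount service
pinned to each; the node names, device, and mount point are hypothetical,
and attribute details may differ slightly between RHEL4 and RHEL5
releases:

  <rm>
    <failoverdomains>
      <failoverdomain name="only-node1" restricted="1" ordered="0">
        <failoverdomainnode name="node1" priority="1"/>
      </failoverdomain>
      <failoverdomain name="only-node2" restricted="1" ordered="0">
        <failoverdomainnode name="node2" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <clusterfs name="gfsdata" fstype="gfs" device="/dev/vg00/gfsdata"
                 mountpoint="/data" force_unmount="0"/>
    </resources>
    <!-- one service per node, each confined to its own domain -->
    <service name="gfs-node1" domain="only-node1" autostart="1">
      <clusterfs ref="gfsdata"/>
    </service>
    <service name="gfs-node2" domain="only-node2" autostart="1">
      <clusterfs ref="gfsdata"/>
    </service>
  </rm>

On each node, "chkconfig clvmd on" (as Jos notes) makes sure clustered
LVM comes up at boot before the mounts are attempted.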
From rpeterso at redhat.com Tue Aug 14 13:38:22 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Tue, 14 Aug 2007 08:38:22 -0500
Subject: [Linux-cluster] locking question
In-Reply-To:
References:
Message-ID: <1187098702.19616.43.camel@technetium.msp.redhat.com>

On Mon, 2007-08-13 at 19:34 -0400, Mahmoud Hanafi wrote:
>
> Which is better, lock_dlm or GULM? And why?
>
> Thanks,
> Mahmoud Hanafi
> Sr. System Administrator

Hi Mahmoud,

Here's my take on it:

http://sources.redhat.com/cluster/faq.html#dlm_which

Regards,

Bob Peterson

From eric at bootseg.com Tue Aug 14 13:49:26 2007
From: eric at bootseg.com (Eric Kerin)
Date: Tue, 14 Aug 2007 09:49:26 -0400
Subject: [Linux-cluster] fence_apc 7930s
In-Reply-To: <1891144504.370791186961023635.JavaMail.root@v-mailhost2.mxpath.net>
References: <1891144504.370791186961023635.JavaMail.root@v-mailhost2.mxpath.net>
Message-ID: <46C1B2E6.6070100@bootseg.com>

Brian Sheets wrote:
> Did anyone get this working and allow system names for the port tags?
> Mine are all labeled as [empty]. I've not tried switching it back to
> the default tag.
>
I have named ports on mine, and I use the fence_apc_snmp agent to
control them. It works very well for me, and has for quite some time.

Eric Kerin
eric at bootseg.com

From sebastian.walter at fu-berlin.de Tue Aug 14 15:15:20 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 14 Aug 2007 17:15:20 +0200
Subject: [Linux-cluster] Buffer I/O error on device diapered_dm-0
Message-ID: <46C1C708.5050502@fu-berlin.de>

Dear list,

Lately, when accessing the GFS volume under heavy load, I get strange
errors in /var/log/messages, forcing clurgmgrd to restart the gfs
service:

Aug 14 16:45:54 host kernel: Buffer I/O error on device diapered_dm-0, logical block 249820

I'm using the qla2xxx kernel modules on QLogic 2432 FC HBAs. Any advice?

Thanks for any help!

Regards,
Sebastian

From chris at cmiware.com Tue Aug 14 15:19:07 2007
From: chris at cmiware.com (Chris Harms)
Date: Tue, 14 Aug 2007 10:19:07 -0500
Subject: [Linux-cluster] modclusterd memory leak
Message-ID: <46C1C7EB.8090300@cmiware.com>

We installed the 5.1 Beta RPMs of the cluster suite and have left our
cluster running unfettered for over a week. It now appears modclusterd
has a slow memory leak. It's consuming 1.5% (and climbing) of our 16GB
of RAM, up from 1.3% yesterday.

I would be happy to do some tests and send along the results. Please
advise.

Thanks,
Chris
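For a suspected slow leak like this, a memory trend over time is more
useful than single snapshots. A minimal sketch that samples modclusterd's
resident and virtual size once an hour (the interval and log path are
arbitrary choices):

  # append a timestamped RSS/VSZ sample (in KB) for modclusterd every hour
  while true; do
      echo "$(date '+%F %T') $(ps -C modclusterd -o rss=,vsz=)" >> /var/log/modclusterd-mem.log
      sleep 3600
  done

A log like that, attached to a Bugzilla report against the 5.1 Beta
packages, makes the growth rate easy to see.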
From rhurst at bidmc.harvard.edu Tue Aug 14 18:46:39 2007
From: rhurst at bidmc.harvard.edu (Robert Hurst)
Date: Tue, 14 Aug 2007 14:46:39 -0400
Subject: [Linux-cluster] rgmanager ceases to send syslog messages
Message-ID: <1187117199.14904.12.camel@xw9300.bidmc.harvard.edu>

Odd, a member node's rgmanager (clurgmgrd) stopped sending syslog
messages, in particular the 'status' messages for a service it was
running. This causes us a problem, as we monitor syslog messages from a
centralized server to keep track of which services are running on which
node. Is there a signal or event that can trigger clurgmgrd to restart
its monitoring and logging of its running service?

The last instances of it running and showing 'WATSON status' follow.
Note, I realize there was an issue with this particular cluster.conf
change, but those changes had nothing to do with the WATSON service, and
all other nodes are still sending their 'service status' syslog
messages. Why would 'WATSON status' just stop?

Aug 6 14:38:35 db5 clurgmgrd: [16354]: Executing /etc/init.d/WATSON status
Aug 6 14:39:05 db5 clurgmgrd: [16354]: Executing /etc/init.d/WATSON status
Aug 6 14:39:20 db5 ccsd[13802]: Update of cluster.conf complete (version 187 -> 188).
Aug 6 14:39:25 db5 clurgmgrd[16354]: Reconfiguring
Aug 6 14:39:25 db5 clurgmgrd[16354]: Loading Service Data
Aug 6 14:39:25 db5 clurgmgrd[16354]: Error storing ip: Duplicate
Aug 6 14:39:26 db5 clurgmgrd[16354]: Unique attribute collision. type=clusterfs attr=device value=/dev/VGCCC1/lvol0
Aug 6 14:39:26 db5 clurgmgrd[16354]: Error storing clusterfs resource
Aug 6 14:39:26 db5 clurgmgrd[16354]: Unique attribute collision. type=clusterfs attr=device value=/dev/VGCCC1/lvol1
Aug 6 14:39:26 db5 clurgmgrd[16354]: Error storing clusterfs resource
Aug 6 14:39:26 db5 clurgmgrd[16354]: Stopping changed resources.
Aug 6 14:39:26 db5 clurgmgrd[16354]: Restarting changed resources.
Aug 6 14:39:26 db5 clurgmgrd[16354]: Starting changed resources.
Aug 6 14:39:26 db5 clurgmgrd: [16354]: Executing /etc/init.d/syslogger stop
Aug 6 14:39:27 db5 clurgmgrd: [16354]: Executing /etc/init.d/luci stop
Aug 6 14:39:27 db5 clurgmgrd: [16354]: Executing /etc/init.d/webmin stop
Aug 6 14:39:27 db5 clurgmgrd: [16354]: Executing /etc/init.d/nagios stop

I continue to get messages from clurgmgrd, but only through Magma Event
changes, i.e.:

Aug 7 16:09:03 db5 clurgmgrd[16354]: Magma Event: Membership Change
Aug 7 16:09:03 db5 clurgmgrd[16354]: State change: db1 UP
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From adas at redhat.com Tue Aug 14 19:37:07 2007
From: adas at redhat.com (Abhijith Das)
Date: Tue, 14 Aug 2007 14:37:07 -0500
Subject: [Linux-cluster] Assertion failed in do_flock (bz198302)
In-Reply-To: <1186998796.2650.6.camel@rutabaga.defuturo.co.uk>
References: <1186998796.2650.6.camel@rutabaga.defuturo.co.uk>
Message-ID: <46C20463.5030206@redhat.com>

Robert Clark wrote:

> I've been seeing the same assertions as in bz198302, so I've tried out
> the debug patch there and it looks like they are being triggered by an
> EAGAIN from flock_lock_file_wait. Is this an expected return code?
>
> Robert
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Robert,

No, this is not an expected return code as far as I can tell from the
code. Are you running NFS on GFS? If you can reproduce this problem
reliably, can you add that information to the bz? Also the output of
cat /proc/locks and 'gfs_tool lockdump' would be useful.

Thanks,
--Abhi

From brad at bradandkim.net Tue Aug 14 20:41:37 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Tue, 14 Aug 2007 15:41:37 -0500 (CDT)
Subject: [Linux-cluster] Postgresql service on RHCS
Message-ID: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>

I am attempting to set up an active/passive failover environment for
postgresql-8.2. I created a failover domain with 2 nodes, one of them
preferred, and added a postgresql-8 resource and an IP address resource.
Now when I set up a service I run into problems. The idea is to have
both the postgresql-8 resource and the IP address resource within the
service, so that it will move the floating IP to the other node in the
domain and start postgres on it. The IP portion seems to work fine, but
postgresql always fails. I get these error messages:

Aug 14 15:40:47 cdb6 clurgmgrd[4106]: Starting disabled service 10.10.1.221
Aug 14 15:40:47 cdb6 clurgmgrd: [4106]: Adding IPv4 address 10.10.1.236 to bond0
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: Starting Service postgres-8:cdb6
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: Starting Service postgres-8:cdb6 > Failed
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: start on postgres-8:cdb6 returned 1 (generic error)
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: #68: Failed to start 10.10.1.221; return value: 1
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: Stopping service 10.10.1.221
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: Stopping Service postgres-8:cdb6
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: Checking Existence Of File /var/run/cluster/postgres-8/postgres-8:cdb6.pid [postgres-8:cdb6] > Failed - File Doesn't Exist
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: Stopping Service postgres-8:cdb6 > Failed
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: stop on postgres-8:cdb6 returned 1 (generic error)
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: #12: RG 10.10.1.221 failed to stop; intervention required
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: Service 10.10.1.221 is failed
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: #13: Service 10.10.1.221 failed to stop cleanly

Here are the relevant portions of cluster.conf: