From darkblue2000 at gmail.com  Wed Aug  1 01:29:54 2007
From: darkblue2000 at gmail.com (darkblue)
Date: Wed, 1 Aug 2007 09:29:54 +0800
Subject: [Linux-cluster] dependency problem when install
	cman-kernel-2.6.9-50.2.src.rpm
Message-ID: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com>

When I installing cman-kernel, there is a dependency problem.
[root at rh4-clus1 rhcs4]# rpm -iv cman-kernel-2.6.9-59.2.src.rpm
[root at rh4-clus1 SPECS]# rpmbuild -ba --target=i686 cman-kernel.spec
Building target platforms: i686
Building for target i686
error: Failed build dependencies:
        kernel-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686
        kernel-smp-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686
        kernel-hugemem-devel = 2.6.9-55.EL is needed by
cman-kernel-2.6.9-50.2.i686
        kernel-xenU-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686
[root at rh4-clus1 SPECS]# uname -a
Linux rh4-clus1.darkblue.com 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT
2007 i686 i686 i386 GNU/Linux

I am curious that I am using the 2.6.9-55 kernel, why there is still a
dependency problem? and How to fix it?
-- 
He is nothing



From orkcu at yahoo.com  Wed Aug  1 02:32:06 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Tue, 31 Jul 2007 19:32:06 -0700 (PDT)
Subject: [Linux-cluster] LVS redundancy server and network type: DIRECT
In-Reply-To: <46AF9B6F.7080300@lexum.umontreal.ca>
Message-ID: <702274.88733.qm@web50603.mail.re2.yahoo.com>
--- FM <dist-list at LEXUM.UMontreal.CA> wrote:

> Tx for the reply,
> I re read the doc and my  question remains :-)
> ex :
> from the RH documentation :
> Create the ARP table entries for each virtual IP
> address on each real
> server (the real_ip is the IP the director uses to
> communicate with the
> real server; often this is the IP bound to eth0):

are you clear about the fact that real_ip is the real
IP of the real server? (the one that the LVS use to
connect to the real server :-) )

> arptables -A IN -d <virtual_ip> -j DROP
> arptables -A OUT -d <virtual_ip> -j mangle
> --mangle-ip-s <real_ip>
> 

you should do this in each real server, so real_ip is
diferent for each real server

none of those IPs are IP bounded to an specific LVS
(master or slave), well vip is but is like a floating
ip :-)


> 
> If I create a redundancy server, and if the master
> server goes down, the
> backup server will create all the <virtual_ip> but
> not the <real_ip> so

I think you beleave that this "real_ip" is the IP
owned by the LVS to comunicate with the real server,
but it is not. The real_ip is the IP owned by the real
server, which is used by the LVS to connect to the
real server :-)
maybe a graphic is very needed :-)
just keep in mind what is purpose of the arptable
commands: avoid at all means that the real server
announce that it has the VIP as one of it address

cu
roger


__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


       
____________________________________________________________________________________
Be a better Heartthrob. Get better relationship answers from someone who knows. Yahoo! Answers - Check it out. 
http://answers.yahoo.com/dir/?link=list&sid=396545433



From darkblue2000 at gmail.com  Wed Aug  1 06:41:58 2007
From: darkblue2000 at gmail.com (darkblue)
Date: Wed, 1 Aug 2007 14:41:58 +0800
Subject: [Linux-cluster] Which packages are the right combination for
	AS4U5?
In-Reply-To: <2c8195ff0707301714v160b1590l1ab325bbd0a12cc2@mail.gmail.com>
References: <2c8195ff0707292037s7441c7cv91946b6fc3e98fc9@mail.gmail.com>
	<46ADB6DA.7000603@fu-berlin.de>
	<2c8195ff0707300317m5ca565b8n6ea66465d72232f2@mail.gmail.com>
	<46ADCCDD.1010209@fu-berlin.de>
	<2c8195ff0707301714v160b1590l1ab325bbd0a12cc2@mail.gmail.com>
Message-ID: <2c8195ff0707312341reac4308gc4691c976397d644@mail.gmail.com>

hello, I had been tried to install rhcs, but failed.
hmm, the error look like this:
[root at rh4-clus1 SPECS]# rpmbuild -ba --target=i686 cman.spec
Building target platforms: i686
Building for target i686
error: Failed build dependencies:
        cman-kernheaders >= 2.6.9 is needed by cman-1.0.17-0.i686
        ccs-devel is needed by cman-1.0.17-0.i686

So, Is that mean I have to download and install ccs-devel first.
but I can't find it on redhat's ftp.
anybody know where to find it?
2007/7/31, darkblue <darkblue2000 at gmail.com>:
> thanks, thank you very much, you save my life.
> I am gonna to install the src.rpm combination tonight.
>
> 2007/7/30, Sebastian Walter <sebastian.walter at fu-berlin.de>:
> > If you are using RHEL in an production environment, I can only recommend
> > you to use the original rhel packages, as the centos' ones are modified.
> > Anyway, the versions of the rpm's should be the same. So this is the
> > list of packages what is installed on my centos 4.5 system:
> >
> > (rhcs: rgmanager system-config-cluster ccsd magma magma-plugins cman
> > cman-kernel-smp dlm dlm-kernel-smp fence gulm iddev)
> > Installing:
> > cman                    x86_64     1.0.17-0         csgfs              67 k
> > cman-kernel-smp         x86_64     2.6.9-50.2       csgfs             133 k
> > dlm                     x86_64     1.0.3-1          csgfs              13 k
> > dlm-kernel-smp          x86_64     2.6.9-46.16      csgfs             132 k
> > fence                   x86_64     1.32.45-1.0.1    csgfs             282 k
> > gulm                    x86_64     1.0.10-0         csgfs             151 k
> > iddev                   x86_64     2.0.0-4          csgfs             2.3 k
> > magma                   x86_64     1.0.7-1          csgfs              37 k
> > magma-plugins           x86_64     1.0.12-0         csgfs              19 k
> > rgmanager               x86_64     1.9.68-1         csgfs             209 k
> > system-config-cluster   noarch     1.0.45-1.0       csgfs             122 k
> > Installing for dependencies:
> > ccs                     x86_64     1.0.10-0         csgfs              80 k
> > perl-Net-Telnet         noarch     3.03-3           csgfs              51 k
> > seamonkey-nss           x86_64     1.0.9-2.el4.centos  update
> > 872 k
> >
> > (gfs: GFS GFS-kernel-smp gnbd gnbd-kernel-smp lvm2-cluster
> > GFS-kernheaders gnbd-kernheaders)
> > GFS                     x86_64     6.1.14-0         csgfs             152 k
> > GFS-kernel-smp          x86_64     2.6.9-72.2       csgfs             214 k
> > GFS-kernheaders         x86_64     2.6.9-72.2       csgfs              20 k
> > gnbd                    x86_64     1.0.9-1          csgfs             142 k
> > gnbd-kernel-smp         x86_64     2.6.9-10.20      csgfs              13 k
> > gnbd-kernheaders        x86_64     2.6.9-10.20      csgfs             4.1 k
> > lvm2-cluster            x86_64     2.02.21-7.el4    csgfs             199 k
> >
> > In this configuration, which comes from the yum rhcs repository, I had
> > to downgrade to kernel kernel-smp-2.6.9-55.EL. Maybe you also want to
> > install luci and ricci:
> >
> > yum install luci
> > Installing:
> >  luci                    x86_64     0.9.1-8.el4.centos.1  csgfs
> >
> > yum install ricci
> > Installing:
> >  ricci                   x86_64     0.9.1-8.el4.centos.1  csgfs 1.1 M
> > Installing for dependencies:
> >  modcluster              x86_64     0.9.1-8.el4.centos
> > csgfs             317 k
> >  oddjob                  x86_64     0.26-1.1         base               57 k
> >  oddjob-libs             x86_64     0.26-1.1         base               43 k
> >
> > That's it. Regards,
> > Sebastian
> >
> > darkblue wrote:
> > > thanks very much, I have been waiting this letter for the whole day.
> > > May I using yum to install centos's packages on redhat as4u5, because
> > > the OS of the production server is redhat as4u5.
> > >
> > > 2007/7/30, Sebastian Walter wrote:
> > >
> > >> Hi,
> > >>
> > >> maybe you want to orientate on the centos distribution. Easiest for you
> > >> would be to somehow get yum working and then import the whole
> > >> repository. If it's a new installation, I suggest you to switch to such
> > >> a rhel-compatible distribution as centos or scientific linux anyway.
> > >>
> > >> http://mirror.centos.org/centos/4/csgfs/
> > >>
> > >> Regards,
> > >> Sebastian
> > >>
> > >>
> > >> darkblue wrote:
> > >>
> > >>> Hello,
> > >>> I am a newbie of cluster.I encounter a problem when installing cluster
> > >>> suite on AS4U5, I download the following packages from
> > >>> ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/4AS/en/RHCS/SRPMS/
> > >>>
> > >>> ccs-1.0.10-0.src.rpm
> > >>> ccs-1.0.2-0.src.rpm
> > >>> ccs-1.0.3-0.src.rpm
> > >>> ccs-1.0.7-0.src.rpm
> > >>> clustermon-0.9.1-8.src.rpm
> > >>> cman-1.0.11-0.src.rpm
> > >>> cman-1.0.17-0.src.rpm
> > >>> cman-1.0.2-0.src.rpm
> > >>> cman-1.0.4-0.src.rpm
> > >>> cman-kernel-2.6.9-39.5.src.rpm
> > >>> cman-kernel-2.6.9-39.8.src.rpm
> > >>> cman-kernel-2.6.9-41.0.2.src.rpm
> > >>> cman-kernel-2.6.9-41.0.src.rpm
> > >>> cman-kernel-2.6.9-43.8.3.src.rpm
> > >>> cman-kernel-2.6.9-43.8.5.src.rpm
> > >>> cman-kernel-2.6.9-43.8.src.rpm
> > >>> cman-kernel-2.6.9-45.14.src.rpm
> > >>> cman-kernel-2.6.9-45.15.src.rpm
> > >>> cman-kernel-2.6.9-45.2.src.rpm
> > >>> cman-kernel-2.6.9-45.3.src.rpm
> > >>> cman-kernel-2.6.9-45.4.src.rpm
> > >>> cman-kernel-2.6.9-45.5.src.rpm
> > >>> cman-kernel-2.6.9-45.8.src.rpm
> > >>> cman-kernel-2.6.9-50.2.0.1.src.rpm
> > >>> cman-kernel-2.6.9-50.2.src.rpm
> > >>> conga-0.9.1-8.src.rpm
> > >>> dlm-1.0.0-5.src.rpm
> > >>> dlm-1.0.1-1.src.rpm
> > >>> dlm-1.0.3-1.src.rpm
> > >>> dlm-kernel-2.6.9-37.7.src.rpm
> > >>> dlm-kernel-2.6.9-37.9.src.rpm
> > >>> dlm-kernel-2.6.9-39.1.2.src.rpm
> > >>> dlm-kernel-2.6.9-39.1.src.rpm
> > >>> dlm-kernel-2.6.9-41.7.1.src.rpm
> > >>> dlm-kernel-2.6.9-41.7.2.src.rpm
> > >>> dlm-kernel-2.6.9-41.7.src.rpm
> > >>> dlm-kernel-2.6.9-42.10.src.rpm
> > >>> dlm-kernel-2.6.9-42.11.src.rpm
> > >>> dlm-kernel-2.6.9-42.12.src.rpm
> > >>> dlm-kernel-2.6.9-42.13.src.rpm
> > >>> dlm-kernel-2.6.9-44.2.src.rpm
> > >>> dlm-kernel-2.6.9-44.3.src.rpm
> > >>> dlm-kernel-2.6.9-44.8.src.rpm
> > >>> dlm-kernel-2.6.9-44.9.src.rpm
> > >>> dlm-kernel-2.6.9-46.16.0.1.src.rpm
> > >>> dlm-kernel-2.6.9-46.16.src.rpm
> > >>> fence-1.32.10-0.src.rpm
> > >>> fence-1.32.18-0.src.rpm
> > >>> fence-1.32.25-1.src.rpm
> > >>> fence-1.32.45-1.0.1.src.rpm
> > >>> fence-1.32.45-1.src.rpm
> > >>> fence-1.32.6-0.src.rpm
> > >>> gulm-1.0.10-0.src.rpm
> > >>> gulm-1.0.4-0.src.rpm
> > >>> gulm-1.0.6-0.src.rpm
> > >>> gulm-1.0.7-0.src.rpm
> > >>> gulm-1.0.8-0.src.rpm
> > >>> iddev-2.0.0-3.src.rpm
> > >>> iddev-2.0.0-4.src.rpm
> > >>> magma-1.0.1-4.src.rpm
> > >>> magma-1.0.3-2.src.rpm
> > >>> magma-1.0.4-0.src.rpm
> > >>> magma-1.0.6-0.src.rpm
> > >>> magma-1.0.7-1.src.rpm
> > >>> magma-plugins-1.0.12-0.src.rpm
> > >>> magma-plugins-1.0.2-0.src.rpm
> > >>> magma-plugins-1.0.5-0.src.rpm
> > >>> magma-plugins-1.0.6-0.src.rpm
> > >>> magma-plugins-1.0.9-0.src.rpm
> > >>> piranha-0.8.1-1.src.rpm
> > >>> piranha-0.8.2-1.src.rpm
> > >>> rgmanager-1.9.38-0.src.rpm
> > >>> rgmanager-1.9.39-0.src.rpm
> > >>> rgmanager-1.9.43-0.src.rpm
> > >>> rgmanager-1.9.46-0.src.rpm
> > >>> rgmanager-1.9.53-0.src.rpm
> > >>> rgmanager-1.9.54-1.src.rpm
> > >>> rgmanager-1.9.68-1.src.rpm
> > >>> system-config-cluster-1.0.16-1.0.src.rpm
> > >>> system-config-cluster-1.0.25-1.0.src.rpm
> > >>> system-config-cluster-1.0.27-1.0.src.rpm
> > >>> system-config-cluster-1.0.45-1.0.src.rpm
> > >>>
> > >>> but I want to know which packages are the right combination for AS4U5?
> > >>>
> > >>>
> > >>>
> > >>
> > >
> > >
> > >
> >
> >
>
>
> --
> He is nothing
>


-- 
He is nothing



From benjamin.jakubowski at gmail.com  Wed Aug  1 08:39:52 2007
From: benjamin.jakubowski at gmail.com (Benjamin Jakubowski)
Date: Wed, 1 Aug 2007 10:39:52 +0200
Subject: [Linux-cluster] RHCS on RedHat As 4 u4
In-Reply-To: <46AF6B24.9000208@redhat.com>
References: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com>
	<46AF65A3.1010806@redhat.com>
	<6c80b6370707310956j10208835lff5af4d0806bf17@mail.gmail.com>
	<46AF6B24.9000208@redhat.com>
Message-ID: <6c80b6370708010139h6438d1dsd1cc5bc1dfba8dff@mail.gmail.com>

OK thanks, it seems to become better but this is my simple cluster.conf

My first node is : server1

<?xml version="1.0" ?>
<cluster alias="sisclad_re7" config_version="7" name="siclad_re7">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="3"/>
        <clusternodes>
                <clusternode name="server1"/>
        </clusternodes>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

when i start ccsd ok :
Aug  1 10:35:59 sv157020 ccsd[20726]: Starting ccsd 1.0.10:
Aug  1 10:35:59 sv157020 ccsd[20726]:  Built: Mar 19 2007 17:44:26
Aug  1 10:35:59 sv157020 ccsd[20726]:  Copyright (C) Red Hat, Inc.  2004
All rights reserved.
Aug  1 10:35:59 sv157020 ccsd:  succeeded

when i try to start cman :
Aug  1 10:36:28 sv157020 ccsd[20726]: Unable to connect to cluster
infrastructure after 30 seconds.
Aug  1 10:36:33 sv157020 kernel: CMAN 2.6.9-45.2 (built Jul 13 2006
11:42:36) installed
Aug  1 10:36:33 sv157020 kernel: NET: Registered protocol family 30
Aug  1 10:36:33 sv157020 ccsd[20726]: cluster.conf (cluster name =
siclad_re7, version = 7) found.
Aug  1 10:36:34 sv157020 kernel: CMAN: Waiting to join or form a
Linux-cluster
Aug  1 10:36:52 sv157020 sshd(pam_unix)[20747]: session opened for user root
by root(uid=0)
Aug  1 10:36:59 sv157020 ccsd[20726]: Unable to connect to cluster
infrastructure after 60 seconds.
Aug  1 10:37:06 sv157020 kernel: CMAN: forming a new cluster
Aug  1 10:37:06 sv157020 kernel: CMAN: quorum regained, resuming activity
Aug  1 10:37:06 sv157020 cman: startup succeeded

when i try to start rgmanager :
Aug  1 10:37:29 sv157020 ccsd[20726]: Unable to connect to cluster
infrastructure after 90 seconds.
Aug  1 10:37:36 sv157020 ccsd[20726]: Cluster is not quorate.  Refusing
connection.
Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing connect:
Connection refused
Aug  1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified (-111).
Aug  1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting something
evil.
Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing get: Invalid
request descriptor
Aug  1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified (-111).
Aug  1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting something
evil.
Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing get: Invalid
request descriptor
Aug  1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified (-21).
Aug  1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting something
evil.
Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing disconnect:
Invalid request descriptor
Aug  1 10:37:36 sv157020 clurgmgrd[20817]: <notice> Resource Group Manager
Starting
Aug  1 10:37:36 sv157020 clurgmgrd[20817]: <info> Loading Service Data
Aug  1 10:37:37 sv157020 ccsd[20726]: Cluster is not quorate.  Refusing
connection.
Aug  1 10:37:37 sv157020 ccsd[20726]: Error while processing connect:
Connection refused
Aug  1 10:37:37 sv157020 clurgmgrd[20817]: <crit> #5: Couldn't connect to
ccsd!
Aug  1 10:37:37 sv157020 clurgmgrd[20817]: <crit> #8: Couldn't initialize
services
Aug  1 10:37:37 sv157020 rgmanager: D?(c)marrage de clurgmgrd failed

Do u have any idea ?

is so simple in RHEL 5, but i need RHEL 4U4

Thanks a lot



2007/7/31, Bryn M. Reeves <breeves at redhat.com>:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Benjamin Jakubowski wrote:
> > The probleme is :
> > i need to preserve the kernel version 2.6.9-42.ELsmp, to preserve a SAN
> > compatibility and in RHN, there isn't have a cluster kernel module ?
> > do u have any idea ?
> >
> > Thanks a lot
> > Benjamin
>
> OK, if you need to match with a specific kernel release you'll need to
> use the Cluster Suite packages that were released at the same time.
>
> For U4 (2.6.9-42.EL), I think those would be:
>
> cman-kernel-2.6.9-45.2
> cman-kernel-smp-2.6.9-45.2
> cman-kernel-hugemem-2.6.9-45.2
> dlm-kernel-hugemem-2.6.9-42.10
> dlm-kernel-2.6.9-42.10
> dlm-kernel-smp-2.6.9-42.10
>
> If you also need the packages for GFS, those are:
>
> GFS-kernel-2.6.9-58.0
> GFS-kernel-smp-2.6.9-58.0
> GFS-kernel-hugemem-2.6.9-58.0
> gnbd-kernel-hugemem-2.6.9-9.41
> gnbd-kernel-2.6.9-9.41
> gnbd-kernel-smp-2.6.9-9.41
>
> Those packages are still available on RHN - just go to the web
> interface, hit search by package & they should appear there.
>
> Be aware though that there were some important kernel related bugfixes
> applied after U4 - particularly for systems using multipath SAN storage.
>
> Kind regards,
> Bryn.
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (GNU/Linux)
> Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org
>
> iD8DBQFGr2sk6YSQoMYUY94RAp+qAJ9yKqpXxmFXqQzRc//0RVFzavPdzgCgk+YE
> YJFtx2iQm4zdYnKXQ/QmRHY=
> =s4BM
> -----END PGP SIGNATURE-----
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
@+
Benj
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070801/7d26a05b/attachment.htm>

From benjamin.jakubowski at gmail.com  Wed Aug  1 08:43:21 2007
From: benjamin.jakubowski at gmail.com (Benjamin Jakubowski)
Date: Wed, 1 Aug 2007 10:43:21 +0200
Subject: [Linux-cluster] Which packages are the right combination for
	AS4U5?
In-Reply-To: <2c8195ff0707312341reac4308gc4691c976397d644@mail.gmail.com>
References: <2c8195ff0707292037s7441c7cv91946b6fc3e98fc9@mail.gmail.com>
	<46ADB6DA.7000603@fu-berlin.de>
	<2c8195ff0707300317m5ca565b8n6ea66465d72232f2@mail.gmail.com>
	<46ADCCDD.1010209@fu-berlin.de>
	<2c8195ff0707301714v160b1590l1ab325bbd0a12cc2@mail.gmail.com>
	<2c8195ff0707312341reac4308gc4691c976397d644@mail.gmail.com>
Message-ID: <6c80b6370708010143g4062dfedya86c96a65a94f886@mail.gmail.com>

Hi,
do you any access on RHN, cause there is an iso to install RedHatCluster on
RedHet 4U5 :
- Software Download
    - Select RHEL 4 version
     -and RedHat CLuster Suite





-- 
@+
Benj
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070801/9f72b9b8/attachment.htm>

From sebastian.walter at fu-berlin.de  Wed Aug  1 09:32:05 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Wed, 01 Aug 2007 11:32:05 +0200
Subject: [Linux-cluster] dependency problem when
	install	cman-kernel-2.6.9-50.2.src.rpm
In-Reply-To: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com>
References: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com>
Message-ID: <46B05315.8080709@fu-berlin.de>

You should be able to install these packages via RHN and up2date.


darkblue wrote:
> When I installing cman-kernel, there is a dependency problem.
> [root at rh4-clus1 rhcs4]# rpm -iv cman-kernel-2.6.9-59.2.src.rpm
> [root at rh4-clus1 SPECS]# rpmbuild -ba --target=i686 cman-kernel.spec
> Building target platforms: i686
> Building for target i686
> error: Failed build dependencies:
>         kernel-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686
>         kernel-smp-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686
>         kernel-hugemem-devel = 2.6.9-55.EL is needed by
> cman-kernel-2.6.9-50.2.i686
>         kernel-xenU-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686
> [root at rh4-clus1 SPECS]# uname -a
> Linux rh4-clus1.darkblue.com 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT
> 2007 i686 i686 i386 GNU/Linux
>
> I am curious that I am using the 2.6.9-55 kernel, why there is still a
> dependency problem? and How to fix it?
>   



From maciej.bogucki at artegence.com  Wed Aug  1 08:59:16 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Wed, 01 Aug 2007 10:59:16 +0200
Subject: [Linux-cluster] dependency problem when
	install	cman-kernel-2.6.9-50.2.src.rpm
In-Reply-To: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com>
References: <2c8195ff0707311829m4b6d54fel38364af8eedf4632@mail.gmail.com>
Message-ID: <46B04B64.7010506@artegence.com>

darkblue napisa?(a):
> When I installing cman-kernel, there is a dependency problem.
> [root at rh4-clus1 rhcs4]# rpm -iv cman-kernel-2.6.9-59.2.src.rpm
> [root at rh4-clus1 SPECS]# rpmbuild -ba --target=i686 cman-kernel.spec
> Building target platforms: i686
> Building for target i686
> error: Failed build dependencies:
>         kernel-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686
>         kernel-smp-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686
>         kernel-hugemem-devel = 2.6.9-55.EL is needed by
> cman-kernel-2.6.9-50.2.i686
>         kernel-xenU-devel = 2.6.9-55.EL is needed by cman-kernel-2.6.9-50.2.i686
> [root at rh4-clus1 SPECS]# uname -a
> Linux rh4-clus1.darkblue.com 2.6.9-55.EL #1 Fri Apr 20 16:35:59 EDT
> 2007 i686 i686 i386 GNU/Linux
> 
> I am curious that I am using the 2.6.9-55 kernel, why there is still a
> dependency problem? and How to fix it?


up2date -u kernel-devel kernel-smp-devel kernel-hugemem-devel
kernel-xenU-devel



From srigler at MarathonOil.com  Wed Aug  1 11:57:45 2007
From: srigler at MarathonOil.com (Steve Rigler)
Date: Wed, 01 Aug 2007 06:57:45 -0500
Subject: [Linux-cluster] LVS redundancy server and network type: DIRECT
In-Reply-To: <46AF9B6F.7080300@lexum.umontreal.ca>
References: <46AF5D36.9090304@lexum.umontreal.ca>
	<46AF7E49.6050201@redhat.com>  <46AF9B6F.7080300@lexum.umontreal.ca>
Message-ID: <1185969465.22317.8.camel@houuc8>

On Tue, 2007-07-31 at 16:28 -0400, FM wrote:
> Tx for the reply,
> I re read the doc and my  question remains :-)
> ex :
> from the RH documentation :
> Create the ARP table entries for each virtual IP address on each real
> server (the real_ip is the IP the director uses to communicate with the
> real server; often this is the IP bound to eth0):
> arptables -A IN -d <virtual_ip> -j DROP
> arptables -A OUT -d <virtual_ip> -j mangle --mangle-ip-s <real_ip>
> 
> 
> If I create a redundancy server, and if the master server goes down, the
> backup server will create all the <virtual_ip> but not the <real_ip> so
> all the real servers still have the arptables setting to modify the
> source of the IP packet to look likes the master LVS server that is down
> now.

Another way you can do it is by adding iptables rules to you real
servers like:

-A PREROUTING -d <vip> -p tcp -m tcp --dport <dport> -j REDIRECT

I didn't have much luck using arptables, but this worked well for me.

-Steve



From grimme at atix.de  Wed Aug  1 12:37:04 2007
From: grimme at atix.de (Marc Grimme)
Date: Wed, 1 Aug 2007 14:37:04 +0200
Subject: [Linux-cluster] Activating a clustered volumeggroup without running
	cluster
Message-ID: <200708011437.04734.grimme@atix.de>

Hello,
is there a way to activate a clustered volumegroup without a running cluster.
Base RHEL4U5??
vgchange -ay 
complains about skipping clustered vgs.
Regards Marc.
-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss



From dgavin at davegavin.com  Wed Aug  1 13:09:53 2007
From: dgavin at davegavin.com (Dave Gavin)
Date: Wed, 1 Aug 2007 09:09:53 -0400
Subject: [Linux-cluster] Adding a new fencing script ?
Message-ID: <20070801090953.28bbce51@setanta.asarlai>


 I have a couple of Server Tech devices controlling the power for the servers in
my cluster and they don't seem to have a fence script. I modified a copy of the
brocade script to work with the Server Tech device and the script is in /sbin
with the permissions/ownership matching the other fence_* scripts. Can anyone
point me at a how-to or a doc somewhere on adding this script to the drop-down
in system-config-cluster ?

Thanks
 Dave

-- 
Being shot out of a cannon will always be better than being squeezed
out of a tube.  That is why God made fast motorcycles, Bubba....
  "Song of the Sausage Creature" Hunter S. Thompson (RIP 02/20/2005)



From jparsons at redhat.com  Wed Aug  1 13:57:21 2007
From: jparsons at redhat.com (jim parsons)
Date: Wed, 01 Aug 2007 09:57:21 -0400
Subject: [Linux-cluster] Adding a new fencing script ?
In-Reply-To: <20070801090953.28bbce51@setanta.asarlai>
References: <20070801090953.28bbce51@setanta.asarlai>
Message-ID: <1185976641.3318.17.camel@localhost.localdomain>

On Wed, 2007-08-01 at 09:09 -0400, Dave Gavin wrote:
>  I have a couple of Server Tech devices controlling the power for the servers in
> my cluster and they don't seem to have a fence script. I modified a copy of the
> brocade script to work with the Server Tech device and the script is in /sbin
> with the permissions/ownership matching the other fence_* scripts. Can anyone
> point me at a how-to or a doc somewhere on adding this script to the drop-down
> in system-config-cluster ?
Hi Dave,

Adding a new fence form to s-c-c is kind of a daunting task...I'll
explain below; but in the meanwhile, have you considered donating your
fence agent under a gpl variant? It would be nice for cluster users to
have ServerTech support...

Anyhow, yes, you should be able to drop your agent into /sbin with
similar permisions to other agents.

You don't need s-c-c, of course, to use your agent...you can edit the
cluster.conf file directly and add it there. Under the fencedevices
section, include an entry for your agent and set the agent attribute to
whatever you named it (agent='fence_dave').

Put shared attributes in the fencedevice section and node specific attrs
under the clusternode->fence->method->device tag. Then propagate the new
file...first run ccs_tool update and then cman_tool version -r...man
these comands for details - but dont forget to incerment the
config_version attribute in the conf file before propagating a new one.

Here is a rough outline of how to add it to s-c-c:
First add the form fields to one of the windows in fence.glade. Just
follow the conventions for naming along the lines of the other agent
forms that you find there. In each of the three windows that contain
fence forms, there is a device column and an instance column...both will
need to be extended for your new agent.

Next, you will need to edit the python file FenceHandler.py...This
should be the only other file you need to touch. I would pick an
existing agent and follow it through the file, noticing all of the
places it is set...for example, for each fence device and fence
instance, there is a populate method, a validate method, a clear form
method, and a process_widgets method entry. Then there are a few hash
maps to add to. There are comments in the file to assist in adding a new
fence type.

In summary, add forms to glade file, then edit FenceHandler.py

-J



From jparsons at redhat.com  Wed Aug  1 14:09:24 2007
From: jparsons at redhat.com (jim parsons)
Date: Wed, 01 Aug 2007 10:09:24 -0400
Subject: [Linux-cluster] RHCS on RedHat As 4 u4
In-Reply-To: <6c80b6370708010139h6438d1dsd1cc5bc1dfba8dff@mail.gmail.com>
References: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com>
	<46AF65A3.1010806@redhat.com>
	<6c80b6370707310956j10208835lff5af4d0806bf17@mail.gmail.com>
	<46AF6B24.9000208@redhat.com>
	<6c80b6370708010139h6438d1dsd1cc5bc1dfba8dff@mail.gmail.com>
Message-ID: <1185977365.3318.21.camel@localhost.localdomain>

I am not sure here, but I believe you need a <cman/> tag in your conf
file under <cluster>...there are two locking types available in
rhel4...perhaps you need to specify the one you wish to use? At any
rate, it is a simple thing to add and try.

-J

On Wed, 2007-08-01 at 10:39 +0200, Benjamin Jakubowski wrote:
> OK thanks, it seems to become better but this is my simple
> cluster.conf
> 
> My first node is : server1
> 
> <?xml version="1.0" ?>
> <cluster alias="sisclad_re7" config_version="7" name="siclad_re7"> 
>         <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="3"/>
>         <clusternodes>
>                 <clusternode name="server1"/>
>         </clusternodes> 
>         <fencedevices/>
>         <rm>
>                 <failoverdomains/>
>                 <resources/>
>         </rm>
> </cluster>
> 
> when i start ccsd ok :
> Aug  1 10:35:59 sv157020 ccsd[20726]: Starting ccsd 1.0.10:
> Aug  1 10:35:59 sv157020 ccsd[20726]:  Built: Mar 19 2007 17:44:26
> Aug  1 10:35:59 sv157020 ccsd[20726]:  Copyright (C) Red Hat, Inc.
> 2004  All rights reserved.
> Aug  1 10:35:59 sv157020 ccsd:  succeeded
> 
> when i try to start cman :
> Aug  1 10:36:28 sv157020 ccsd[20726]: Unable to connect to cluster
> infrastructure after 30 seconds.
> Aug  1 10:36:33 sv157020 kernel: CMAN 2.6.9-45.2 (built Jul 13 2006
> 11:42:36) installed 
> Aug  1 10:36:33 sv157020 kernel: NET: Registered protocol family 30
> Aug  1 10:36:33 sv157020 ccsd[20726]: cluster.conf (cluster name =
> siclad_re7, version = 7) found.
> Aug  1 10:36:34 sv157020 kernel: CMAN: Waiting to join or form a
> Linux-cluster 
> Aug  1 10:36:52 sv157020 sshd(pam_unix)[20747]: session opened for
> user root by root(uid=0)
> Aug  1 10:36:59 sv157020 ccsd[20726]: Unable to connect to cluster
> infrastructure after 60 seconds.
> Aug  1 10:37:06 sv157020 kernel: CMAN: forming a new cluster 
> Aug  1 10:37:06 sv157020 kernel: CMAN: quorum regained, resuming
> activity
> Aug  1 10:37:06 sv157020 cman: startup succeeded
> 
> when i try to start rgmanager :
> Aug  1 10:37:29 sv157020 ccsd[20726]: Unable to connect to cluster
> infrastructure after 90 seconds. 
> Aug  1 10:37:36 sv157020 ccsd[20726]: Cluster is not quorate.
> Refusing connection.
> Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing connect:
> Connection refused
> Aug  1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified
> (-111). 
> Aug  1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting
> something evil.
> Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing get:
> Invalid request descriptor
> Aug  1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified
> (-111). 
> Aug  1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting
> something evil.
> Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing get:
> Invalid request descriptor
> Aug  1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified
> (-21). 
> Aug  1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting
> something evil.
> Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing
> disconnect: Invalid request descriptor
> Aug  1 10:37:36 sv157020 clurgmgrd[20817]: <notice> Resource Group
> Manager Starting 
> Aug  1 10:37:36 sv157020 clurgmgrd[20817]: <info> Loading Service Data
> Aug  1 10:37:37 sv157020 ccsd[20726]: Cluster is not quorate.
> Refusing connection.
> Aug  1 10:37:37 sv157020 ccsd[20726]: Error while processing connect:
> Connection refused 
> Aug  1 10:37:37 sv157020 clurgmgrd[20817]: <crit> #5: Couldn't connect
> to ccsd!
> Aug  1 10:37:37 sv157020 clurgmgrd[20817]: <crit> #8: Couldn't
> initialize services
> Aug  1 10:37:37 sv157020 rgmanager: D??marrage de clurgmgrd failed 
> 
> Do u have any idea ?
> 
> is so simple in RHEL 5, but i need RHEL 4U4
> 
> Thanks a lot
> 
> 
> 
> 2007/7/31, Bryn M. Reeves <breeves at redhat.com>:
>         -----BEGIN PGP SIGNED MESSAGE-----
>         Hash: SHA1
>         
>         Benjamin Jakubowski wrote: 
>         > The probleme is :
>         > i need to preserve the kernel version 2.6.9-42.ELsmp, to
>         preserve a SAN
>         > compatibility and in RHN, there isn't have a cluster kernel
>         module ?
>         > do u have any idea ?
>         > 
>         > Thanks a lot
>         > Benjamin
>         
>         OK, if you need to match with a specific kernel release you'll
>         need to
>         use the Cluster Suite packages that were released at the same
>         time.
>         
>         For U4 (2.6.9-42.EL), I think those would be: 
>         
>         cman-kernel-2.6.9-45.2
>         cman-kernel-smp-2.6.9-45.2
>         cman-kernel-hugemem-2.6.9-45.2
>         dlm-kernel-hugemem-2.6.9-42.10
>         dlm-kernel-2.6.9-42.10
>         dlm-kernel-smp-2.6.9-42.10
>         
>         If you also need the packages for GFS, those are: 
>         
>         GFS-kernel-2.6.9-58.0
>         GFS-kernel-smp-2.6.9-58.0
>         GFS-kernel-hugemem-2.6.9-58.0
>         gnbd-kernel-hugemem-2.6.9-9.41
>         gnbd-kernel-2.6.9-9.41
>         gnbd-kernel-smp-2.6.9-9.41
>         
>         Those packages are still available on RHN - just go to the
>         web 
>         interface, hit search by package & they should appear there.
>         
>         Be aware though that there were some important kernel related
>         bugfixes
>         applied after U4 - particularly for systems using multipath
>         SAN storage. 
>         
>         Kind regards,
>         Bryn.
>         -----BEGIN PGP SIGNATURE-----
>         Version: GnuPG v1.4.7 (GNU/Linux)
>         Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org
>         
>         iD8DBQFGr2sk6YSQoMYUY94RAp+qAJ9yKqpXxmFXqQzRc//0RVFzavPdzgCgk
>         +YE 
>         YJFtx2iQm4zdYnKXQ/QmRHY=
>         =s4BM
>         -----END PGP SIGNATURE-----
>         
>         --
>         Linux-cluster mailing list
>         Linux-cluster at redhat.com
>         https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> 
> -- 
> @+
> Benj 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From dgavin at davegavin.com  Wed Aug  1 16:12:15 2007
From: dgavin at davegavin.com (Dave Gavin)
Date: Wed, 1 Aug 2007 12:12:15 -0400 (EDT)
Subject: [Linux-cluster] Adding a new fencing script ?
In-Reply-To: <1185976641.3318.17.camel@localhost.localdomain>
References: <20070801090953.28bbce51@setanta.asarlai>
	<1185976641.3318.17.camel@localhost.localdomain>
Message-ID: <52485.157.130.62.182.1185984735.squirrel@dgavin.no-ip.com>

On Wed, August 1, 2007 9:57 am, jim parsons wrote:
> On Wed, 2007-08-01 at 09:09 -0400, Dave Gavin wrote:
>>  I have a couple of Server Tech devices controlling the power for the
>> servers in
>> my cluster and they don't seem to have a fence script. I modified a copy
>> of the
>> brocade script to work with the Server Tech device and the script is in
>> /sbin
>> with the permissions/ownership matching the other fence_* scripts. Can
>> anyone
>> point me at a how-to or a doc somewhere on adding this script to the
>> drop-down
>> in system-config-cluster ?
> Hi Dave,
>
> Adding a new fence form to s-c-c is kind of a daunting task...I'll
> explain below; but in the meanwhile, have you considered donating your
> fence agent under a gpl variant? It would be nice for cluster users to
> have ServerTech support...
>
> Anyhow, yes, you should be able to drop your agent into /sbin with
> similar permisions to other agents.
>
> You don't need s-c-c, of course, to use your agent...you can edit the
> cluster.conf file directly and add it there. Under the fencedevices
> section, include an entry for your agent and set the agent attribute to
> whatever you named it (agent='fence_dave').
>
> Put shared attributes in the fencedevice section and node specific attrs
> under the clusternode->fence->method->device tag. Then propagate the new
> file...first run ccs_tool update and then cman_tool version -r...man
> these comands for details - but dont forget to incerment the
> config_version attribute in the conf file before propagating a new one.
>
> Here is a rough outline of how to add it to s-c-c:
> First add the form fields to one of the windows in fence.glade. Just
> follow the conventions for naming along the lines of the other agent
> forms that you find there. In each of the three windows that contain
> fence forms, there is a device column and an instance column...both will
> need to be extended for your new agent.
>
> Next, you will need to edit the python file FenceHandler.py...This
> should be the only other file you need to touch. I would pick an
> existing agent and follow it through the file, noticing all of the
> places it is set...for example, for each fence device and fence
> instance, there is a populate method, a validate method, a clear form
> method, and a process_widgets method entry. Then there are a few hash
> maps to add to. There are comments in the file to assist in adding a new
> fence type.
>
> In summary, add forms to glade file, then edit FenceHandler.py
>
> -J
>

HI Jim,

   Yeow! Daunting pretty much captures it 8-)

   I tried hacking the files and got a bit dizzy figuring out which
labelN/entryN/tableN was available - I actually got through changing  
fence.glade and then FenceHandler.py just blew me away.... I guess I'll
give up on the gui and just use the command line tools to update and
propagate the cluster configuration from now on.

  Playing it safe, I added two brocade fence devices to the config using
s-c-c and then manually edited that cluster.conf, changing the
fence_brocade to fence_servertech (the other options are OK for what I
need). I was then able to propagate this to the other node OK. Started
up fenced and no smoke so far.... I have to go to our co-locate to do
the serious testing, so that'll be tomorrow.

  I'd be happy to pass my script on as GPL (it's just a modified version
of fence_brocade - someone else did the heavy lifting), how would I go
about that ?


 Thanks very much for the quick and detailed answer,
 Dave


-- 
Being shot out of a cannon will always be better than being squeezed
out of a tube.  That is why God made fast motorcycles, Bubba....
  "Song of the Sausage Creature" Hunter S. Thompson (RIP 02/20/2005)



From benjamin.jakubowski at gmail.com  Wed Aug  1 18:38:38 2007
From: benjamin.jakubowski at gmail.com (Benjamin Jakubowski)
Date: Wed, 1 Aug 2007 20:38:38 +0200
Subject: [Linux-cluster] RHCS on RedHat As 4 u4
In-Reply-To: <1185977365.3318.21.camel@localhost.localdomain>
References: <6c80b6370707310937te6a3476x3dc682a971083622@mail.gmail.com>
	<46AF65A3.1010806@redhat.com>
	<6c80b6370707310956j10208835lff5af4d0806bf17@mail.gmail.com>
	<46AF6B24.9000208@redhat.com>
	<6c80b6370708010139h6438d1dsd1cc5bc1dfba8dff@mail.gmail.com>
	<1185977365.3318.21.camel@localhost.localdomain>
Message-ID: <6c80b6370708011138r17f04e14v5ad7dfe99e0b2b66@mail.gmail.com>

it's the same pb, withc balise <cman/>
really strange ...

2007/8/1, jim parsons <jparsons at redhat.com>:
>
> I am not sure here, but I believe you need a <cman/> tag in your conf
> file under <cluster>...there are two locking types available in
> rhel4...perhaps you need to specify the one you wish to use? At any
> rate, it is a simple thing to add and try.
>
> -J
>
> On Wed, 2007-08-01 at 10:39 +0200, Benjamin Jakubowski wrote:
> > OK thanks, it seems to become better but this is my simple
> > cluster.conf
> >
> > My first node is : server1
> >
> > <?xml version="1.0" ?>
> > <cluster alias="sisclad_re7" config_version="7" name="siclad_re7">
> >         <fence_daemon clean_start="0" post_fail_delay="0"
> > post_join_delay="3"/>
> >         <clusternodes>
> >                 <clusternode name="server1"/>
> >         </clusternodes>
> >         <fencedevices/>
> >         <rm>
> >                 <failoverdomains/>
> >                 <resources/>
> >         </rm>
> > </cluster>
> >
> > when i start ccsd ok :
> > Aug  1 10:35:59 sv157020 ccsd[20726]: Starting ccsd 1.0.10:
> > Aug  1 10:35:59 sv157020 ccsd[20726]:  Built: Mar 19 2007 17:44:26
> > Aug  1 10:35:59 sv157020 ccsd[20726]:  Copyright (C) Red Hat, Inc.
> > 2004  All rights reserved.
> > Aug  1 10:35:59 sv157020 ccsd:  succeeded
> >
> > when i try to start cman :
> > Aug  1 10:36:28 sv157020 ccsd[20726]: Unable to connect to cluster
> > infrastructure after 30 seconds.
> > Aug  1 10:36:33 sv157020 kernel: CMAN 2.6.9-45.2 (built Jul 13 2006
> > 11:42:36) installed
> > Aug  1 10:36:33 sv157020 kernel: NET: Registered protocol family 30
> > Aug  1 10:36:33 sv157020 ccsd[20726]: cluster.conf (cluster name =
> > siclad_re7, version = 7) found.
> > Aug  1 10:36:34 sv157020 kernel: CMAN: Waiting to join or form a
> > Linux-cluster
> > Aug  1 10:36:52 sv157020 sshd(pam_unix)[20747]: session opened for
> > user root by root(uid=0)
> > Aug  1 10:36:59 sv157020 ccsd[20726]: Unable to connect to cluster
> > infrastructure after 60 seconds.
> > Aug  1 10:37:06 sv157020 kernel: CMAN: forming a new cluster
> > Aug  1 10:37:06 sv157020 kernel: CMAN: quorum regained, resuming
> > activity
> > Aug  1 10:37:06 sv157020 cman: startup succeeded
> >
> > when i try to start rgmanager :
> > Aug  1 10:37:29 sv157020 ccsd[20726]: Unable to connect to cluster
> > infrastructure after 90 seconds.
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Cluster is not quorate.
> > Refusing connection.
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing connect:
> > Connection refused
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified
> > (-111).
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting
> > something evil.
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing get:
> > Invalid request descriptor
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified
> > (-111).
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting
> > something evil.
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing get:
> > Invalid request descriptor
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Invalid descriptor specified
> > (-21).
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Someone may be attempting
> > something evil.
> > Aug  1 10:37:36 sv157020 ccsd[20726]: Error while processing
> > disconnect: Invalid request descriptor
> > Aug  1 10:37:36 sv157020 clurgmgrd[20817]: <notice> Resource Group
> > Manager Starting
> > Aug  1 10:37:36 sv157020 clurgmgrd[20817]: <info> Loading Service Data
> > Aug  1 10:37:37 sv157020 ccsd[20726]: Cluster is not quorate.
> > Refusing connection.
> > Aug  1 10:37:37 sv157020 ccsd[20726]: Error while processing connect:
> > Connection refused
> > Aug  1 10:37:37 sv157020 clurgmgrd[20817]: <crit> #5: Couldn't connect
> > to ccsd!
> > Aug  1 10:37:37 sv157020 clurgmgrd[20817]: <crit> #8: Couldn't
> > initialize services
> > Aug  1 10:37:37 sv157020 rgmanager: D?(c)marrage de clurgmgrd failed
> >
> > Do u have any idea ?
> >
> > is so simple in RHEL 5, but i need RHEL 4U4
> >
> > Thanks a lot
> >
> >
> >
> > 2007/7/31, Bryn M. Reeves <breeves at redhat.com>:
> >         -----BEGIN PGP SIGNED MESSAGE-----
> >         Hash: SHA1
> >
> >         Benjamin Jakubowski wrote:
> >         > The probleme is :
> >         > i need to preserve the kernel version 2.6.9-42.ELsmp, to
> >         preserve a SAN
> >         > compatibility and in RHN, there isn't have a cluster kernel
> >         module ?
> >         > do u have any idea ?
> >         >
> >         > Thanks a lot
> >         > Benjamin
> >
> >         OK, if you need to match with a specific kernel release you'll
> >         need to
> >         use the Cluster Suite packages that were released at the same
> >         time.
> >
> >         For U4 (2.6.9-42.EL), I think those would be:
> >
> >         cman-kernel-2.6.9-45.2
> >         cman-kernel-smp-2.6.9-45.2
> >         cman-kernel-hugemem-2.6.9-45.2
> >         dlm-kernel-hugemem-2.6.9-42.10
> >         dlm-kernel-2.6.9-42.10
> >         dlm-kernel-smp-2.6.9-42.10
> >
> >         If you also need the packages for GFS, those are:
> >
> >         GFS-kernel-2.6.9-58.0
> >         GFS-kernel-smp-2.6.9-58.0
> >         GFS-kernel-hugemem-2.6.9-58.0
> >         gnbd-kernel-hugemem-2.6.9-9.41
> >         gnbd-kernel-2.6.9-9.41
> >         gnbd-kernel-smp-2.6.9-9.41
> >
> >         Those packages are still available on RHN - just go to the
> >         web
> >         interface, hit search by package & they should appear there.
> >
> >         Be aware though that there were some important kernel related
> >         bugfixes
> >         applied after U4 - particularly for systems using multipath
> >         SAN storage.
> >
> >         Kind regards,
> >         Bryn.
> >         -----BEGIN PGP SIGNATURE-----
> >         Version: GnuPG v1.4.7 (GNU/Linux)
> >         Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org
> >
> >         iD8DBQFGr2sk6YSQoMYUY94RAp+qAJ9yKqpXxmFXqQzRc//0RVFzavPdzgCgk
> >         +YE
> >         YJFtx2iQm4zdYnKXQ/QmRHY=
> >         =s4BM
> >         -----END PGP SIGNATURE-----
> >
> >         --
> >         Linux-cluster mailing list
> >         Linux-cluster at redhat.com
> >         https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> >
> > --
> > @+
> > Benj
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
@+
Benj
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070801/b43eef91/attachment.htm>

From chris at cmiware.com  Thu Aug  2 00:55:22 2007
From: chris at cmiware.com (Chris Harms)
Date: Wed, 01 Aug 2007 19:55:22 -0500
Subject: [Linux-cluster] cluster suite crashing
Message-ID: <46B12B7A.2070402@cmiware.com>

I am again attempting a 2-node cluster (two_node=1).  This time we have 
power fencing, creating a cluster config from scratch.

Unplug network cables on Node A. Node B still plugged in. (Expected B to 
fence A.)
Node B does not attempt fencing, claims to have lost quorum (???). (
Plug Node A back in.
Node A fences Node B

On reboot, Node B reboots itself right after fencing Node A.
[repeat]

clurgmgrd[3630]: <crit> *Watchdog: Daemon died, rebooting

*Various things appear directly ahead of this in the log.  Most of the 
time it was a service script that was failing a stop operation.  
Correcting it did not resolve the issue:

[/var/log/messages on Node B]
clurgmgrd[5669]: <notice> Resource Group Manager Starting
[selinux warnings]
clurgmgrd[5667]: <crit> Watchdog: Daemon died, rebooting...
kernel: md: stopping all md devices.
fenced[4617]: fence "[Node A]" success
[reboot]

[some pertinent lines from cluster.conf - they are identical on each node]
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="12"/>
    <cman expected_votes="1" two_node="1"/>

Meanwhile, Node A comes up and fences B when it gets a chance.

I'm really at a loss on what to do.  We are running the RHEL 5 rpms from 
RHN.  Googling the error message yields some results on crashes in 
RGManager which were allegedly fixed in version 4.  I have seen some 
other squirrelly behavior out of RGManager at various points, but 
reboots seemed to fix those so I figured proper fencing might render 
them moot.

Any advice is welcome.

Thanks,
Chris



From jacquesb at fnb.co.za  Thu Aug  2 08:56:03 2007
From: jacquesb at fnb.co.za (Jacques Botha)
Date: Thu, 02 Aug 2007 10:56:03 +0200
Subject: [Linux-cluster] RHEL5.1 beta and qdisk
Message-ID: <1186044963.6514.4.camel@f2821966>

Okay

I have a cluster, everything is setup correctly.

I start qdisk, and it likes the heuristics, but the moment it upgrades
the cluster votes, everything to do with clustering just stops.
The machine is still responsive, you can talk to it over the network,
but cman_tool status, cman_tool nodes, clustat,  all just sit and blink
at the prompt.

I can stop qdisk immediately afterwards, but it doesn't change the
state, everything stays broken.


-- 
Jacques Botha
South Africa
+27-11-889-4142

To read FirstRand Bank's Disclaimer for this email click on the following address or copy into your Internet browser: 
https://www.fnb.co.za/disclaimer.html 

If you are unable to access the Disclaimer, send a blank e-mail to
firstrandbankdisclaimer at fnb.co.za and we will send you a copy of the Disclaimer.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070802/e6332717/attachment.sig>

From doc at mts.com.ua  Thu Aug  2 09:19:31 2007
From: doc at mts.com.ua (Eugene Melnichuk)
Date: Thu, 02 Aug 2007 12:19:31 +0300
Subject: [Linux-cluster] RHEL5.1 beta and qdisk
In-Reply-To: <1186044963.6514.4.camel@f2821966>
References: <1186044963.6514.4.camel@f2821966>
Message-ID: <46B1A1A3.5060507@mts.com.ua>

I confirm this bug,  and already asked this question on this list (with 
subject "Hang on start fence_tool join with qdisk") but problem still 
here...

--
Eugene Melnichuk
Leading Engineer
email: doc at mts.com.ua <mailto:doc at umc.ua>
mob: +380503304043
pbx: +380501105731
BU MTS Ukraine
49/2 Pobedy ave., room 4.26, 03680, Kyiv, Ukraine


Jacques Botha ?????:
> Okay
>
> I have a cluster, everything is setup correctly.
>
> I start qdisk, and it likes the heuristics, but the moment it upgrades
> the cluster votes, everything to do with clustering just stops.
> The machine is still responsive, you can talk to it over the network,
> but cman_tool status, cman_tool nodes, clustat,  all just sit and blink
> at the prompt.
>
> I can stop qdisk immediately afterwards, but it doesn't change the
> state, everything stays broken.
>
>
>   
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070802/8bff78b1/attachment.htm>

From lhh at redhat.com  Thu Aug  2 15:48:19 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 11:48:19 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070731145441.GE21896@helsinki.fi>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
	<20070731134121.GH4955@redhat.com>
	<20070731145441.GE21896@helsinki.fi>
Message-ID: <20070802154818.GB26367@redhat.com>

On Tue, Jul 31, 2007 at 05:54:41PM +0300, Janne Peltonen wrote:
> On Tue, Jul 31, 2007 at 09:41:21AM -0400, Lon Hohberger wrote:
> > On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote:
> > > On Tue, Jul 10, 2007 at 06:19:22PM -0400, Lon Hohberger wrote:
> > > > 
> > > > http://people.redhat.com/lhh/rhel5-test
> > > > 
> > > > You'll need at least the updated cman package.  The -2.1lhh build of
> > > > rgmanager is the one I just built today; the others are a bit older.
> > > 
> > > Well, I installed the new versions of the cman and rgmanager packages I
> > > found there, but to no avail: I still get 1500 invocations of fs.sh per
> > > second.
> > 
> > I put a log message in fs.sh:
> > 
> > Jul 31 09:27:29 bart clurgmgrd: [4395]: <err> /usr/share/cluster/fs.sh
> > TEST 
> > 
> > It comes up once every several (10-20) seconds like it's supposed to. 
> 
> I did the same, with the same results. It seems to me that the clurgmgrd
> process isn't calling the complete script any more times than it's
> supposed to.  What I'm seeing are the execs of fs.sh, that is, it
> includes each () and `` and so on. Each fs.sh invocation seems to create
> quite an amount of subshells.
> 
> I'm sorry for having misled you. And this all means, there isn't
> probably much reason to read the cluster.conf and rg_test rules output -
> I'll attach them anyway.

Yeah, it does hit a lot of subshells.  Awks, seds, and the like.  Some
pattern substitution and matching can be done in pure bash.

That's quite an impressive cluster.conf...  I'm going to look at it
some.

-- Lon




From lhh at redhat.com  Thu Aug  2 15:51:55 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 11:51:55 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070731171123.GH21896@helsinki.fi>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
	<20070731134121.GH4955@redhat.com>
	<20070731145441.GE21896@helsinki.fi>
	<20070731171123.GH21896@helsinki.fi>
Message-ID: <20070802155155.GC26367@redhat.com>

On Tue, Jul 31, 2007 at 08:11:23PM +0300, Janne Peltonen wrote:
> On Tue, Jul 31, 2007 at 05:54:41PM +0300, Janne Peltonen wrote:
> > On Tue, Jul 31, 2007 at 09:41:21AM -0400, Lon Hohberger wrote:
> > > On Tue, Jul 31, 2007 at 03:14:38PM +0300, Janne Peltonen wrote:
> > > > On Tue, Jul 10, 2007 at 06:19:22PM -0400, Lon Hohberger wrote:
> > > > > 
> > > > > http://people.redhat.com/lhh/rhel5-test
> > > > > 
> > > > > You'll need at least the updated cman package.  The -2.1lhh build of
> > > > > rgmanager is the one I just built today; the others are a bit older.
> > > > 
> > > > Well, I installed the new versions of the cman and rgmanager packages I
> > > > found there, but to no avail: I still get 1500 invocations of fs.sh per
> > > > second.
> > > 
> > > I put a log message in fs.sh:
> > > 
> > > Jul 31 09:27:29 bart clurgmgrd: [4395]: <err> /usr/share/cluster/fs.sh
> > > TEST 
> > > 
> > > It comes up once every several (10-20) seconds like it's supposed to. 
> > 
> > I did the same, with the same results. It seems to me that the clurgmgrd
> > process isn't calling the complete script any more times than it's
> > supposed to.  What I'm seeing are the execs of fs.sh, that is, it
> > includes each () and `` and so on. Each fs.sh invocation seems to create
> > quite an amount of subshells.
> > 
> > I'm sorry for having misled you. And this all means, there isn't
> > probably much reason to read the cluster.conf and rg_test rules output -
> > I'll attach them anyway.
> 
> After running the new rgmanager packages for abt four hours without any
> of the load fluctuation I'd experienced before (with a more-or-less
> four-hour interval, system load first increases slowly until it reaches
> a high level - dependent on overall system load - and then swiftly
> decreases to near zero, to start increasing again. This fluctuation
> peaks at about 5.0 in a system with no users at all, but many services.
> If there are many users and the user peak coincides with the base peak,
> the system experiences a shortish load peak of abt 100.0, after which it
> recovers and the basic load fluctuation becomes visible again). Then the
> load averages started increasing again, to something 10.0ish, so -
> frustrated - I edited /usr/share/cluster/fs.sh and put an exit 0 to the
> switch-case "status|monitor" on $1. Well. Load averages promptly fell
> back to under 0.5, disk usage% fell by 30 %-units, and overall system
> responsiveness increased considerably.
> 
> So I'll be running my cluster without fs status checks for now. I hope
> someone'll work out what's wrong with fs.sh soon... ;)

There are a number of things we can do - can you file a bugzilla about
this, now that we know what's going on? (and that it's not internal
rgmanager difficulties, just inefficient scripting)?

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Aug  2 15:54:33 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 11:54:33 -0400
Subject: [Linux-cluster] clustat on GULM
In-Reply-To: <6596a7c70707310722l24a10059l9309229ba7c534fc@mail.gmail.com>
References: <6596a7c70707310722l24a10059l9309229ba7c534fc@mail.gmail.com>
Message-ID: <20070802155432.GD26367@redhat.com>

On Tue, Jul 31, 2007 at 10:22:33AM -0400, siman hew wrote:
> I found clustat report wrong information about rgmanager when a cluster is
> GULM.
> I setup a 3-node cluster on RHEL4U5, with GULM.  There are one failover
> domain, one resource and one service, just for testing.
> I found clustat always report rgmanger is running after cluster is started.
> With the configuration with DLM, clustat report correctly.
> node4 is lock server, and all nodes report the same inaccurate information.

GULM doesn't have a way to provide sub-groups really, except for users
of lockspaces.

Can you file a bugzilla about this?  I wonder if we can get the
information for "what nodes are in this lockspace" from GULM.  If it's
possible, we could probably fix it.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Aug  2 15:55:40 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 11:55:40 -0400
Subject: [Linux-cluster] Odd cluster problems
In-Reply-To: <46AF59DC.9000906@utmem.edu>
References: <46AF59DC.9000906@utmem.edu>
Message-ID: <20070802155540.GE26367@redhat.com>

On Tue, Jul 31, 2007 at 10:48:44AM -0500, Jay Leafey wrote:
> I've got a 3-node cluster running CentOS 4.5 and I cannot communicate 
> with the resource group manager.  When I use the clustat command I get a 
> timeout:
> 
> >[root at rapier ~]# clustat
> >Timed out waiting for a response from Resource Group Manager
> >Member Status: Quorate
> >
> >  Member Name                              Status
> >  ------ ----                              ------
> >  rapier.utmem.edu                         Online, Local, rgmanager
> >  thorax.utmem.edu                         Offline
> >  cyclops.utmem.edu                        Online, rgmanager

> >Fence Domain:    "default"                           2   2 recover 4 -
> >[1 2]

Until fencing completes, rgmanager won't respond.

fence_ack_manual needs to be run.

> >
> ><SNIP>
> >
> >User:            "usrm::manager"                    10  10 recover 2 -
> >[1 2]
> >

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Aug  2 15:56:19 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 11:56:19 -0400
Subject: [Linux-cluster] VF: Abort: Invalid header in reply from member #1
In-Reply-To: <a6d13c780707310852g685bd07cv98942edc46e35c25@mail.gmail.com>
References: <a6d13c780707310852g685bd07cv98942edc46e35c25@mail.gmail.com>
Message-ID: <20070802155619.GF26367@redhat.com>

On Tue, Jul 31, 2007 at 12:52:38PM -0300, Filipe Miranda wrote:
> Hello everybody,
> 
> I am using RedHatCluster Suite for RHEL3 and I am experiencing the following
> errors in the cluster's log file:

Which version do you have installed?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Aug  2 15:58:00 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 11:58:00 -0400
Subject: [Linux-cluster] cluster suite crashing
In-Reply-To: <46B12B7A.2070402@cmiware.com>
References: <46B12B7A.2070402@cmiware.com>
Message-ID: <20070802155800.GG26367@redhat.com>

On Wed, Aug 01, 2007 at 07:55:22PM -0500, Chris Harms wrote:
> I am again attempting a 2-node cluster (two_node=1).  This time we have 
> power fencing, creating a cluster config from scratch.
> 
> Unplug network cables on Node A. Node B still plugged in. (Expected B to 
> fence A.)
> Node B does not attempt fencing, claims to have lost quorum (???). (
> Plug Node A back in.
> Node A fences Node B
> 
> On reboot, Node B reboots itself right after fencing Node A.
> [repeat]

> clurgmgrd[3630]: <crit> *Watchdog: Daemon died, rebooting

That's a bug; what version of rgmanger do you have installed?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Aug  2 15:59:11 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 11:59:11 -0400
Subject: [Linux-cluster] RHEL5.1 beta and qdisk
In-Reply-To: <1186044963.6514.4.camel@f2821966>
References: <1186044963.6514.4.camel@f2821966>
Message-ID: <20070802155911.GH26367@redhat.com>

On Thu, Aug 02, 2007 at 10:56:03AM +0200, Jacques Botha wrote:
> Okay
> 
> I have a cluster, everything is setup correctly.
> 
> I start qdisk, and it likes the heuristics, but the moment it upgrades
> the cluster votes, everything to do with clustering just stops.
> The machine is still responsive, you can talk to it over the network,
> but cman_tool status, cman_tool nodes, clustat,  all just sit and blink
> at the prompt.
> 
> I can stop qdisk immediately afterwards, but it doesn't change the
> state, everything stays broken.

Fencing being attempted?

cman_tool shouldn't *ever* hang as a result of qdiskd.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From chris at cmiware.com  Thu Aug  2 16:08:51 2007
From: chris at cmiware.com (Chris Harms)
Date: Thu, 02 Aug 2007 11:08:51 -0500
Subject: [Linux-cluster] cluster suite crashing
In-Reply-To: <20070802155800.GG26367@redhat.com>
References: <46B12B7A.2070402@cmiware.com> <20070802155800.GG26367@redhat.com>
Message-ID: <46B20193.8050205@cmiware.com>

rgmanager-2.0.24-1.el5

I'm not sure if this is useful or not, but I had just rebooted Node B 
when we pulled the cables on Node A.  It is possible not all of the 
services / inter-node communication had completed.

Thanks,
Chris

Lon Hohberger wrote:
> On Wed, Aug 01, 2007 at 07:55:22PM -0500, Chris Harms wrote:
>   
>> I am again attempting a 2-node cluster (two_node=1).  This time we have 
>> power fencing, creating a cluster config from scratch.
>>
>> Unplug network cables on Node A. Node B still plugged in. (Expected B to 
>> fence A.)
>> Node B does not attempt fencing, claims to have lost quorum (???). (
>> Plug Node A back in.
>> Node A fences Node B
>>
>> On reboot, Node B reboots itself right after fencing Node A.
>> [repeat]
>>     
>
>   
>> clurgmgrd[3630]: <crit> *Watchdog: Daemon died, rebooting
>>     
>
> That's a bug; what version of rgmanger do you have installed?
>
>   



From jleafey at utmem.edu  Thu Aug  2 19:00:13 2007
From: jleafey at utmem.edu (Jay Leafey)
Date: Thu, 02 Aug 2007 14:00:13 -0500
Subject: [Linux-cluster] Odd cluster problems
In-Reply-To: <20070802155540.GE26367@redhat.com>
References: <46AF59DC.9000906@utmem.edu> <20070802155540.GE26367@redhat.com>
Message-ID: <46B229BD.1020009@utmem.edu>

Lon Hohberger wrote:
> On Tue, Jul 31, 2007 at 10:48:44AM -0500, Jay Leafey wrote:
>> I've got a 3-node cluster running CentOS 4.5 and I cannot communicate 
>> with the resource group manager.  When I use the clustat command I get a 
>> timeout:
>>
>>> [root at rapier ~]# clustat
>>> Timed out waiting for a response from Resource Group Manager
>>> Member Status: Quorate
>>>
>>>  Member Name                              Status
>>>  ------ ----                              ------
>>>  rapier.utmem.edu                         Online, Local, rgmanager
>>>  thorax.utmem.edu                         Offline
>>>  cyclops.utmem.edu                        Online, rgmanager
> 
>>> Fence Domain:    "default"                           2   2 recover 4 -
>>> [1 2]
> 
> Until fencing completes, rgmanager won't respond.
> 
> fence_ack_manual needs to be run.
> 
>>> <SNIP>
>>>
>>> User:            "usrm::manager"                    10  10 recover 2 -
>>> [1 2]
>>>
> 

Your reply was a bit confusing at first, but looking deeper showed you 
were right on the mark.  The systems (using HP ILO fencing) were unable 
to communicate with each other very well or with the ILO ports at all. 
Turns out some of the ports they were configured on had been moved to a 
different VLAN, so the network was split between the ILOs and the host 
ports.

Configuring the ports properly seems to have resolved the issue, 
everything is working fine now.  I guess I just need to keep the rubber 
hose handy for "discussions" with the network guys! (grin!)

Thanks!
-- 
Jay Leafey - University of Tennessee
E-Mail:  jleafey at utmem.edu  Phone:  901-448-6534  FAX:  901-448-8199

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 5158 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070802/edce09f2/attachment.bin>

From orkcu at yahoo.com  Thu Aug  2 20:07:12 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Thu, 2 Aug 2007 13:07:12 -0700 (PDT)
Subject: [Linux-cluster] How to add start-up options to pulse ??
Message-ID: <41175.41349.qm@web50610.mail.re2.yahoo.com>

Hi is there anyway to start pulse with one of its
options , especificaly --forceactive, without manualy
modification of the init.d/pulse start-stop script?

I guess if I modify this script with the next update
to piranha I will lose all the changes, am I correct?

I was thinking of having just like other init.d script
have: a startup config file in /etc/sysconfig/pulse
but to implement this I must modify pulse start-stop
script so .....

should I ask for an enhance to the package in
bugzilla?
I do not have Redhat cluster support for rhel4 (which
is the one I am working with) but I do have support
for rhel5-server-AP, so maybe a ticket might help?


thanks
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


      ____________________________________________________________________________________
Shape Yahoo! in your own image.  Join our Network Research Panel today!   http://surveylink.yahoo.com/gmrs/yahoo_panel_invite.asp?a=7 




From lhh at redhat.com  Thu Aug  2 20:19:17 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 16:19:17 -0400
Subject: [Linux-cluster] Odd cluster problems
In-Reply-To: <46B229BD.1020009@utmem.edu>
References: <46AF59DC.9000906@utmem.edu> <20070802155540.GE26367@redhat.com>
	<46B229BD.1020009@utmem.edu>
Message-ID: <20070802201917.GL26367@redhat.com>

On Thu, Aug 02, 2007 at 02:00:13PM -0500, Jay Leafey wrote:
> Lon Hohberger wrote:
> >On Tue, Jul 31, 2007 at 10:48:44AM -0500, Jay Leafey wrote:
> >>I've got a 3-node cluster running CentOS 4.5 and I cannot communicate 
> >>with the resource group manager.  When I use the clustat command I get a 
> >>timeout:
> >>
> >>>[root at rapier ~]# clustat
> >>>Timed out waiting for a response from Resource Group Manager
> >>>Member Status: Quorate
> >>>
> >>> Member Name                              Status
> >>> ------ ----                              ------
> >>> rapier.utmem.edu                         Online, Local, rgmanager
> >>> thorax.utmem.edu                         Offline
> >>> cyclops.utmem.edu                        Online, rgmanager
> >
> >>>Fence Domain:    "default"                           2   2 recover 4 -
> >>>[1 2]
> >
> >Until fencing completes, rgmanager won't respond.
> >
> >fence_ack_manual needs to be run.
> >
> >>><SNIP>
> >>>
> >>>User:            "usrm::manager"                    10  10 recover 2 -
> >>>[1 2]
> >>>
> >
> 
> Your reply was a bit confusing at first, but looking deeper showed you 
> were right on the mark.  The systems (using HP ILO fencing) were unable 
> to communicate with each other very well or with the ILO ports at all. 
> Turns out some of the ports they were configured on had been moved to a 
> different VLAN, so the network was split between the ILOs and the host 
> ports.

Sorry, I just assumed you were using manual fencing as opposed to iLO,
since that's the 90+/- % case of why fencing was stuck in the 'recover'
state.

I guess we all know what happens when you assume... :)  Or maybe, when I
assume?

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Aug  2 20:23:15 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 16:23:15 -0400
Subject: [Linux-cluster] cluster suite crashing
In-Reply-To: <46B20193.8050205@cmiware.com>
References: <46B12B7A.2070402@cmiware.com> <20070802155800.GG26367@redhat.com>
	<46B20193.8050205@cmiware.com>
Message-ID: <20070802202314.GM26367@redhat.com>

On Thu, Aug 02, 2007 at 11:08:51AM -0500, Chris Harms wrote:
> rgmanager-2.0.24-1.el5
> 
> I'm not sure if this is useful or not, but I had just rebooted Node B 
> when we pulled the cables on Node A.  It is possible not all of the 
> services / inter-node communication had completed.

Could you pull from CVS (RHEL5 or 51 branches)?  The current code has a
couple of crash bugs fixed.

Note that if you store:

DAEMON_COREFILE_LIMIT="unlimited"
RGMGR_OPTS="-w"

... in /etc/sysconfig/cluster, rgmanager will generate a core file in
the root directory.  Attaching the core to the bug report will help
determine whether it's something already fixed in CVS.

But seriously, if you see 'daemon died, rebooting' it's either user
error (you did a 'kill -9' of only one rgmanager pid) or a bug (crash).

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Aug  2 20:25:02 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 2 Aug 2007 16:25:02 -0400
Subject: [Linux-cluster] How to add start-up options to pulse ??
In-Reply-To: <41175.41349.qm@web50610.mail.re2.yahoo.com>
References: <41175.41349.qm@web50610.mail.re2.yahoo.com>
Message-ID: <20070802202502.GN26367@redhat.com>

On Thu, Aug 02, 2007 at 01:07:12PM -0700, Roger Pe?a wrote:
> Hi is there anyway to start pulse with one of its
> options , especificaly --forceactive, without manualy
> modification of the init.d/pulse start-stop script?
> 
> I guess if I modify this script with the next update
> to piranha I will lose all the changes, am I correct?
> 
> I was thinking of having just like other init.d script
> have: a startup config file in /etc/sysconfig/pulse
> but to implement this I must modify pulse start-stop
> script so .....

It should leave your pulse init script and create pulse.rpmnew,
actually.

> should I ask for an enhance to the package in
> bugzilla?

Yes.  Most packages should have /etc/sysconfig/<something> which gets
sourced on startup for specifically this purpose.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From orkcu at yahoo.com  Thu Aug  2 21:27:28 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Thu, 2 Aug 2007 14:27:28 -0700 (PDT)
Subject: [Linux-cluster] How to add start-up options to pulse ??
In-Reply-To: <20070802202502.GN26367@redhat.com>
Message-ID: <450271.81221.qm@web50604.mail.re2.yahoo.com>


--- Lon Hohberger <lhh at redhat.com> wrote:

> On Thu, Aug 02, 2007 at 01:07:12PM -0700, Roger Pe?a
> wrote:
> > Hi is there anyway to start pulse with one of its
> > options , especificaly --forceactive, without
> manualy
> > modification of the init.d/pulse start-stop
> script?
> > 
> > I guess if I modify this script with the next
> update
> > to piranha I will lose all the changes, am I
> correct?
> > 
> > I was thinking of having just like other init.d
> script
> > have: a startup config file in
> /etc/sysconfig/pulse
> > but to implement this I must modify pulse
> start-stop
> > script so .....
> 
> It should leave your pulse init script and create
> pulse.rpmnew,
> actually.

I download the src.rpm I saw the spec file :-)
you are right ;-)
(In the past, I had the experience with other rpm that
overwrite those scripts :-( )



> 
> > should I ask for an enhance to the package in
> > bugzilla?
> 
> Yes.  Most packages should have
> /etc/sysconfig/<something> which gets
> sourced on startup for specifically this purpose.


where should I get the cvs of piranha?
http://sources.redhat.com/cgi-bin/cvsweb.cgi/?sortby=file&hideattic=1&logsort=date&f=u&hidenonreadable=1&cvsroot=piranha
do not show any code

I ask because I am interested in one of your patch
that is already in CVS tree, the one attached to:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238498

thanks
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


      ____________________________________________________________________________________
Luggage? GPS? Comic books? 
Check out fitting gifts for grads at Yahoo! Search
http://search.yahoo.com/search?fr=oni_on_mail&p=graduation+gifts&cs=bz



From chris at cmiware.com  Thu Aug  2 23:23:55 2007
From: chris at cmiware.com (Chris Harms)
Date: Thu, 02 Aug 2007 18:23:55 -0500
Subject: [Linux-cluster] cluster suite crashing
In-Reply-To: <20070802202314.GM26367@redhat.com>
References: <46B12B7A.2070402@cmiware.com>
	<20070802155800.GG26367@redhat.com>	<46B20193.8050205@cmiware.com>
	<20070802202314.GM26367@redhat.com>
Message-ID: <46B2678B.7000702@cmiware.com>

I grabbed the RHEL5 branch out of CVS, but compilation fails with

make[2]: Entering directory `/usr/src/cluster-cvs/cluster/dlm/lib'
gcc -Wall  -g -I. -O2  -D_REENTRANT -c -o libdlm.o libdlm.c
libdlm.c: In function ?set_version_v5?:
libdlm.c:324: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:325: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:326: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c: In function ?set_version_v6?:
libdlm.c:335: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:336: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:337: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c: In function ?detect_kernel_version?:
libdlm.c:443: error: storage size of ?v? isn?t known
libdlm.c:446: error: invalid application of ?sizeof? to incomplete type 
?struct dlm_device_version?
libdlm.c:448: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:449: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:450: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:452: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:453: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:454: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:443: warning: unused variable ?v?
libdlm.c: In function ?do_dlm_dispatch?:
libdlm.c:590: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c: In function ?ls_lock_v6?:
libdlm.c:835: error: ?struct dlm_lock_params? has no member named ?xid?
libdlm.c:837: error: ?struct dlm_lock_params? has no member named ?timeout?
libdlm.c: In function ?ls_lock?:
libdlm.c:892: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c: In function ?dlm_ls_lockx?:
libdlm.c:916: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c: In function ?dlm_ls_unlock?:
libdlm.c:1067: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c: In function ?dlm_ls_deadlock_cancel?:
libdlm.c:1099: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:1115: error: ?DLM_USER_DEADLOCK? undeclared (first use in this 
function)
libdlm.c:1115: error: (Each undeclared identifier is reported only once
libdlm.c:1115: error: for each function it appears in.)
libdlm.c: In function ?dlm_ls_purge?:
libdlm.c:1134: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:1145: error: ?DLM_USER_PURGE? undeclared (first use in this 
function)
libdlm.c:1146: error: ?union <anonymous>? has no member named ?purge?
libdlm.c:1147: error: ?union <anonymous>? has no member named ?purge?
libdlm.c: In function ?create_lockspace?:
libdlm.c:1311: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c: In function ?release_lockspace?:
libdlm.c:1415: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c: In function ?dlm_kernel_version?:
libdlm.c:1501: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:1502: error: invalid use of undefined type ?struct 
dlm_device_version?
libdlm.c:1503: error: invalid use of undefined type ?struct 
dlm_device_version?
make[2]: *** [libdlm.o] Error 1
make[2]: Leaving directory `/usr/src/cluster-cvs/cluster/dlm/lib'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/usr/src/cluster-cvs/cluster/dlm'
make: *** [all] Error 2

I guess it doesn't like the officially supported RHEL kernel (2.6.18-8.1.8).

We also are trying to get the 5.1 Beta rpms going with no success.  So 
far a kernel panic on 5.1 kernel (2.6.18-36)



Lon Hohberger wrote:
> On Thu, Aug 02, 2007 at 11:08:51AM -0500, Chris Harms wrote:
>   
>> rgmanager-2.0.24-1.el5
>>
>> I'm not sure if this is useful or not, but I had just rebooted Node B 
>> when we pulled the cables on Node A.  It is possible not all of the 
>> services / inter-node communication had completed.
>>     
>
> Could you pull from CVS (RHEL5 or 51 branches)?  The current code has a
> couple of crash bugs fixed.
>
> Note that if you store:
>
> DAEMON_COREFILE_LIMIT="unlimited"
> RGMGR_OPTS="-w"
>
> ... in /etc/sysconfig/cluster, rgmanager will generate a core file in
> the root directory.  Attaching the core to the bug report will help
> determine whether it's something already fixed in CVS.
>
> But seriously, if you see 'daemon died, rebooting' it's either user
> error (you did a 'kill -9' of only one rgmanager pid) or a bug (crash).
>
> -- Lon
>
>   



From orkcu at yahoo.com  Fri Aug  3 02:14:44 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Thu, 2 Aug 2007 19:14:44 -0700 (PDT)
Subject: [Linux-cluster] How to add start-up options to pulse ??
In-Reply-To: <450271.81221.qm@web50604.mail.re2.yahoo.com>
Message-ID: <526805.31332.qm@web50612.mail.re2.yahoo.com>


--- Roger Pe?a <orkcu at yahoo.com> wrote:

> 
> --- Lon Hohberger <lhh at redhat.com> wrote:
> 
> > On Thu, Aug 02, 2007 at 01:07:12PM -0700, Roger
> Pe?a
> > wrote:
> > > Hi is there anyway to start pulse with one of
> its
> > > options , especificaly --forceactive, without
> > manualy
> > > modification of the init.d/pulse start-stop
> > script?
> > > 
> > > I guess if I modify this script with the next
> > update
> > > to piranha I will lose all the changes, am I
> > correct?
> > > 
> > > I was thinking of having just like other init.d
> > script
> > > have: a startup config file in
> > /etc/sysconfig/pulse
> > > but to implement this I must modify pulse
> > start-stop
> > > script so .....
> > 
> > It should leave your pulse init script and create
> > pulse.rpmnew,
> > actually.
> 
> I download the src.rpm I saw the spec file :-)
> you are right ;-)
> (In the past, I had the experience with other rpm
> that
> overwrite those scripts :-( )
I speak to quickly, piranha 0.8.2 (RHEL4) do not
overwrite /etc/init.d/* but 0.8.4 (RHEL5) actually do
it (line 83 of piranha.spec)

there was a change in the piranha.spec file between
the two versions

looking into the way piranha is package, I think a
patch to implement an startup config file in
/etc/sysconfig/ is a litle more complex than what I
think at the first place because:
1- create the /etc/sysconfig/pulse prototype file in
the piranha package
2- modify the Makefile of the package to install this
file in the proper place with the proper name
3- repackage the tar.gz
4- modify the spec.file to include the new file in the
"file" section
5- repackage the rpm

not dificult but "lot" of work and not so important
patch so, I guess, this will not be a priority for
redhat ...

will you appreciate a patch for all this ? I can do
it, in fact I will, again rhel5 version

thanks
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


       
____________________________________________________________________________________
Got a little couch potato? 
Check out fun summer activities for kids.
http://search.yahoo.com/search?fr=oni_on_mail&p=summer+activities+for+kids&cs=bz 



From jacquesb at fnb.co.za  Fri Aug  3 07:02:10 2007
From: jacquesb at fnb.co.za (Jacques Botha)
Date: Fri, 03 Aug 2007 09:02:10 +0200
Subject: [Linux-cluster] RHEL5.1 beta and qdisk
In-Reply-To: <20070802155911.GH26367@redhat.com>
References: <1186044963.6514.4.camel@f2821966>
	<20070802155911.GH26367@redhat.com>
Message-ID: <1186124530.5993.2.camel@f2821966>

No fencing being attempted, it just sits there, and the last thing in
the log is qdisk upgrading the status of the cluser:


Aug  2 10:36:21 fnbgw01 qdiskd[3519]: <info> Quorum Partition: /dev/sdb1
Label: fnbgw_qdisk 
Aug  2 10:36:21 fnbgw01 qdiskd[3520]: <info> Quorum Daemon Initializing 
Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
172.20.28.193 -c3 -t1' UP 
Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
172.20.28.195 -c3 -t1' UP 
Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
172.20.28.196 -c3 -t1' UP 
Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
172.20.28.197 -c3 -t1' UP 
Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
172.20.28.198 -c3 -t1' UP 
Aug  2 10:36:31 fnbgw01 qdiskd[3520]: <info> Initial score 6/6 
Aug  2 10:36:31 fnbgw01 qdiskd[3520]: <info> Initialization complete 
Aug  2 10:36:31 fnbgw01 openais[3469]: [CMAN ] quorum device registered 
Aug  2 10:36:31 fnbgw01 qdiskd[3520]: <notice> Score sufficient for
master operation (6/6; required=3); upgrading 
Aug  2 10:36:45 fnbgw01 qdiskd[3520]: <info> Assuming master role 


And after that you can't get any joy.
Jacques

On Thu, 2007-08-02 at 11:59 -0400, Lon Hohberger wrote:
> On Thu, Aug 02, 2007 at 10:56:03AM +0200, Jacques Botha wrote:
> > Okay
> > 
> > I have a cluster, everything is setup correctly.
> > 
> > I start qdisk, and it likes the heuristics, but the moment it upgrades
> > the cluster votes, everything to do with clustering just stops.
> > The machine is still responsive, you can talk to it over the network,
> > but cman_tool status, cman_tool nodes, clustat,  all just sit and blink
> > at the prompt.
> > 
> > I can stop qdisk immediately afterwards, but it doesn't change the
> > state, everything stays broken.
> 
> Fencing being attempted?
> 
> cman_tool shouldn't *ever* hang as a result of qdiskd.
> 

To read FirstRand Bank's Disclaimer for this email click on the following address or copy into your Internet browser: 
https://www.fnb.co.za/disclaimer.html 

If you are unable to access the Disclaimer, send a blank e-mail to
firstrandbankdisclaimer at fnb.co.za and we will send you a copy of the Disclaimer.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070803/2fa3b6f0/attachment.sig>

From janne.peltonen at helsinki.fi  Fri Aug  3 13:09:58 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Fri, 3 Aug 2007 16:09:58 +0300
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070802154818.GB26367@redhat.com>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
	<20070731134121.GH4955@redhat.com>
	<20070731145441.GE21896@helsinki.fi>
	<20070802154818.GB26367@redhat.com>
Message-ID: <20070803130957.GL21896@helsinki.fi>

On Thu, Aug 02, 2007 at 11:48:19AM -0400, Lon Hohberger wrote:
> That's quite an impressive cluster.conf...  I'm going to look at it
> some.

One thing that makes it longer than strictly necessary is the fact that
each service has its own prioritized failover domain. I could, of
course, just have one failover domain for each possible permutation of
nodes pcn1-hb..pcn4-hb... on the other hand, I feel safer to edit the
failover domain specification on a live system than to edit the service
specification (to change the failover domain) - I don't know whether
rgmanager would want to restart a service if I changed its failover
domain (would it?), but I know it doesn't restart a service if I edit
the failover domain specification...


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From jos at xos.nl  Fri Aug  3 14:56:19 2007
From: jos at xos.nl (Jos Vos)
Date: Fri, 03 Aug 2007 16:56:19 +0200
Subject: [Linux-cluster] Priority in failover domains sometimes not honoured
Message-ID: <200708031456.l73EuJM22161@xos037.xos.nl>

Hi,

I have the following failover domains:

      <failoverdomains>
              <failoverdomain name="prio_host1" ordered="1" restricted="1">
                      <failoverdomainnode name="host1" priority="1"/>
                      <failoverdomainnode name="host2" priority="2"/>
              </failoverdomain>
              <failoverdomain name="prio_host2" ordered="1" restricted="1">
                      <failoverdomainnode name="host1" priority="2"/>
                      <failoverdomainnode name="host2" priority="1"/>
              </failoverdomain>
      </failoverdomains>

But often a service for, say, "prio_host2" stays running on host1,
(or vice versa), while both hosts are running fine.  In a "stable"
situation, that service should be moved to host2, AFAIK (and I
have seen this happen in some situations).

This is on RHEL 5.0.

Is this a bug or...?

Thanks,

--
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From simanhew at gmail.com  Fri Aug  3 18:50:45 2007
From: simanhew at gmail.com (siman hew)
Date: Fri, 3 Aug 2007 14:50:45 -0400
Subject: [Linux-cluster] clustat on GULM
In-Reply-To: <20070802155432.GD26367@redhat.com>
References: <6596a7c70707310722l24a10059l9309229ba7c534fc@mail.gmail.com>
	<20070802155432.GD26367@redhat.com>
Message-ID: <6596a7c70708031150xe002365y8b04e0ded11b7edc@mail.gmail.com>

Entered a bug.Bugzilla Bug
250811<https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250811>:
clustat report wrong information about rgmanager on a GULM clusterHope it
will be solved soon, which can save some work of mine, since I can not
trust/rely on clustat right now.

Thanks,

Siman

On 8/2/07, Lon Hohberger <lhh at redhat.com> wrote:
>
> On Tue, Jul 31, 2007 at 10:22:33AM -0400, siman hew wrote:
> > I found clustat report wrong information about rgmanager when a cluster
> is
> > GULM.
> > I setup a 3-node cluster on RHEL4U5, with GULM.  There are one failover
> > domain, one resource and one service, just for testing.
> > I found clustat always report rgmanger is running after cluster is
> started.
> > With the configuration with DLM, clustat report correctly.
> > node4 is lock server, and all nodes report the same inaccurate
> information.
>
> GULM doesn't have a way to provide sub-groups really, except for users
> of lockspaces.
>
> Can you file a bugzilla about this?  I wonder if we can get the
> information for "what nodes are in this lockspace" from GULM.  If it's
> possible, we could probably fix it.
>
> --
> Lon Hohberger - Software Engineer - Red Hat, Inc.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070803/debef294/attachment.htm>

From karl at klxsystems.net  Fri Aug  3 19:29:34 2007
From: karl at klxsystems.net (karl at klxsystems.net)
Date: Fri, 3 Aug 2007 12:29:34 -0700 (PDT)
Subject: [Linux-cluster] fencing with IBM RAS II on Centos 5 cman start
Message-ID: <47888.66.93.167.218.1186169374.squirrel@www.klxsystems.net>

Hi,

I am attempting to use an IBM RSA II (Remote Supervisor Adapter) and my
fencing device, as it is listed in the dropdown menu of supported devices
when you create and configure a fencing device.

Is there a specific log I can tail to find out what's going on with my
fencing device and why it won't start?

Also, if any of you have familiarity with this device, how has it run for
you as a fence?  Anyhow, just interested in knocking out what appears to
be a minor error and not seeing any immediate reference to what logging
mechanism I should be consulting to troubelshoot GFS-related items.

To be clear, this is happening when I try to start cman.

----snip-----

[root at ftp3 RPMS]# rpm -i ibmusbasm64-1.37_rhel4-2.x86_64.rpm
  Found IBM Remote Supervisor Adaptor II.
  Removing previous Start/Kill ibmasm run levels.
  Starting IBM RSA II daemon
  Calling install_initd ibmasm
[root at ftp3 RPMS]# /etc/rc.d/init.d/cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... failed

-karlski                                                           [FAILED]



From jparsons at redhat.com  Fri Aug  3 19:36:08 2007
From: jparsons at redhat.com (James Parsons)
Date: Fri, 03 Aug 2007 15:36:08 -0400
Subject: [Linux-cluster] fencing with IBM RAS II on Centos 5 cman start
In-Reply-To: <47888.66.93.167.218.1186169374.squirrel@www.klxsystems.net>
References: <47888.66.93.167.218.1186169374.squirrel@www.klxsystems.net>
Message-ID: <46B383A8.7090601@redhat.com>

karl at klxsystems.net wrote:

>Hi,
>
>I am attempting to use an IBM RSA II (Remote Supervisor Adapter) and my
>fencing device, as it is listed in the dropdown menu of supported devices
>when you create and configure a fencing device.
>
>Is there a specific log I can tail to find out what's going on with my
>fencing device and why it won't start?
>
No, but you could try running the agent from the command line and seeing 
what happens...
/sbin/fence_rsa -a hostname_of_rsa_card -l login -p password -v (for 
verbose). You can see the fence_rsa man page for more info.

fence_rsa has been a goo performer...I seem to recall a bug about not 
working with custom command prompts, but that has been fixed about 
everywhere I believe.

>[root at ftp3 RPMS]# /etc/rc.d/init.d/cman start
>Starting cluster:
>   Loading modules... done
>   Mounting configfs... done
>   Starting ccsd... done
>   Starting cman... done
>   Starting daemons... done
>   Starting fencing... failed
>
Can you post your conf file?

-j

>
>  
>




From miksir at maker.ru  Sat Aug  4 10:02:40 2007
From: miksir at maker.ru (Dmitriy MiksIr)
Date: Sat, 04 Aug 2007 14:02:40 +0400
Subject: [Linux-cluster] rsync via GFS
Message-ID: <f91is0$5bo$1@sea.gmane.org>

Hi! I want to sync nodes by run rsync from GFS shared storage to node's 
filesystem. But preparing of rsync filelist is very slow (due "stat" 
trouble?). Is any suggestions to sync nodes or tune rsync?



From chris at cmiware.com  Sun Aug  5 19:33:43 2007
From: chris at cmiware.com (Chris Harms)
Date: Sun, 05 Aug 2007 14:33:43 -0500
Subject: [Linux-cluster] rgmanager 5.1 Beta Segfault
Message-ID: <46B62617.8000104@cmiware.com>

5.1 Beta RPMs of cluster suite have so far resolved our dual fencing and 
self-reboot issues.  The cluster was stopped and sat overnight, and when 
I started it via Conga today, one node came up fine, but the other 
logged the following:

kernel: clurgmgrd[32233]: segfault at 0000000000000048 rip 
0000000000419804 rsp 0000000040a00030 error 4

rgmanager-2.0.28-1.el5
cman-2.0.70-1.el5

Per Lon's instructions, I setup some configuration options to have it 
write a Core file.  Stopping and restarting the cluster via Conga did 
not reproduce the segfault.

Please advise on any additional information that would be helpful and 
where I should send the core file if necessary.

Thanks,
Chris



From maciej.bogucki at artegence.com  Mon Aug  6 09:05:49 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Mon, 06 Aug 2007 11:05:49 +0200
Subject: [Linux-cluster] rsync via GFS
In-Reply-To: <f91is0$5bo$1@sea.gmane.org>
References: <f91is0$5bo$1@sea.gmane.org>
Message-ID: <46B6E46D.3090200@artegence.com>

Dmitriy MiksIr napisa?(a):
> Hi! I want to sync nodes by run rsync from GFS shared storage to node's
> filesystem. But preparing of rsync filelist is very slow (due "stat"
> trouble?). Is any suggestions to sync nodes or tune rsync?

Hello,

a) Disable quota, it will increase performance at least 2-3 times:
gfs_tool settune /gfs quota_account 0
b) Rsync have to create a lock for each file so You could try to
increase /proc/cluster/lock_dlm/drop_count
c) Mount GFS filesystem with noatime flag.
d) Create direcotries with small number of files - stat() will be faster

Best Regards
Maciej Bogucki



From rainer at ultra-secure.de  Mon Aug  6 10:37:13 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Mon, 06 Aug 2007 12:37:13 +0200
Subject: [Linux-cluster] rsync via GFS
In-Reply-To: <46B6E46D.3090200@artegence.com>
References: <f91is0$5bo$1@sea.gmane.org> <46B6E46D.3090200@artegence.com>
Message-ID: <46B6F9D9.6080905@ultra-secure.de>

Maciej Bogucki wrote:
> Dmitriy MiksIr napisa?(a):
>   
>> Hi! I want to sync nodes by run rsync from GFS shared storage to node's
>> filesystem. But preparing of rsync filelist is very slow (due "stat"
>> trouble?). Is any suggestions to sync nodes or tune rsync?
>>     
>
> Hello,
>
> a) Disable quota, it will increase performance at least 2-3 times:
> gfs_tool settune /gfs quota_account 0
> b) Rsync have to create a lock for each file so You could try to
> increase /proc/cluster/lock_dlm/drop_count
> c) Mount GFS filesystem with noatime flag.
> d) Create direcotries with small number of files - stat() will be faster
>
>   

e) make snapshot, only mount snapshot on to-sync-node (and with
lock_nolock).

http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html

Disadvantage: only works for one node at a time.

f) use NFS.... ;-)


cheers,
Rainer



From miksir at maker.ru  Mon Aug  6 10:39:39 2007
From: miksir at maker.ru (Dmitriy MiksIr)
Date: Mon, 06 Aug 2007 14:39:39 +0400
Subject: [Linux-cluster] Re: rsync via GFS
In-Reply-To: <46B6E46D.3090200@artegence.com>
References: <f91is0$5bo$1@sea.gmane.org> <46B6E46D.3090200@artegence.com>
Message-ID: <f96tpa$u0c$1@sea.gmane.org>

Great hint's for tune, thanks. But it's help not much - too many files 
(it's full tree of system with /usr /var etc). May be try to use some 
filesystem-in-file (put to gfs shared storage file and init some 
filesystem inside)? Is anyone try something like this?

Maciej Bogucki ?????:
> Dmitriy MiksIr napisa?(a):
>> Hi! I want to sync nodes by run rsync from GFS shared storage to node's
>> filesystem. But preparing of rsync filelist is very slow (due "stat"
>> trouble?). Is any suggestions to sync nodes or tune rsync?
> 
> Hello,
> 
> a) Disable quota, it will increase performance at least 2-3 times:
> gfs_tool settune /gfs quota_account 0
> b) Rsync have to create a lock for each file so You could try to
> increase /proc/cluster/lock_dlm/drop_count
> c) Mount GFS filesystem with noatime flag.
> d) Create direcotries with small number of files - stat() will be faster
> 
> Best Regards
> Maciej Bogucki
> 



From miksir at maker.ru  Mon Aug  6 10:47:24 2007
From: miksir at maker.ru (Dmitriy MiksIr)
Date: Mon, 06 Aug 2007 14:47:24 +0400
Subject: [Linux-cluster] Re: rsync via GFS
In-Reply-To: <46B6F9D9.6080905@ultra-secure.de>
References: <f91is0$5bo$1@sea.gmane.org> <46B6E46D.3090200@artegence.com>
	<46B6F9D9.6080905@ultra-secure.de>
Message-ID: <f96u7r$v3u$1@sea.gmane.org>

Rainer Duffner ?????:
>>   
> 
> e) make snapshot, only mount snapshot on to-sync-node (and with
> lock_nolock).
> 
> http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html
> 
> Disadvantage: only works for one node at a time.
> 

Can you give me some more links about snapshots? =)

> f) use NFS.... ;-)
> 
> 
> cheers,
> Rainer
> 



From rainer at ultra-secure.de  Mon Aug  6 11:29:28 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Mon, 06 Aug 2007 13:29:28 +0200
Subject: [Linux-cluster] Re: rsync via GFS
In-Reply-To: <f96u7r$v3u$1@sea.gmane.org>
References: <f91is0$5bo$1@sea.gmane.org>
	<46B6E46D.3090200@artegence.com>	<46B6F9D9.6080905@ultra-secure.de>
	<f96u7r$v3u$1@sea.gmane.org>
Message-ID: <46B70618.60006@ultra-secure.de>

Dmitriy MiksIr wrote:
> Rainer Duffner ?????:
>>>   
>>
>> e) make snapshot, only mount snapshot on to-sync-node (and with
>> lock_nolock).
>>
>> http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html
>>
>> Disadvantage: only works for one node at a time.
>>
>
> Can you give me some more links about snapshots? =)
>

Oh...
Ahem...
I forgot to say: this was meant to use the snapshot-facilities of your
storage.
I was assuming, you use some kind of SAN.
What storage do you use? I don't see anything being mentioned in the
original post.




cheers,
Rainer



From miksir at maker.ru  Mon Aug  6 11:45:34 2007
From: miksir at maker.ru (Dmitriy MiksIr)
Date: Mon, 06 Aug 2007 15:45:34 +0400
Subject: [Linux-cluster] Re: rsync via GFS
In-Reply-To: <46B70618.60006@ultra-secure.de>
References: <f91is0$5bo$1@sea.gmane.org>	<46B6E46D.3090200@artegence.com>	<46B6F9D9.6080905@ultra-secure.de>	<f96u7r$v3u$1@sea.gmane.org>
	<46B70618.60006@ultra-secure.de>
Message-ID: <f971ks$a45$1@sea.gmane.org>

Rainer Duffner ?????:
> Dmitriy MiksIr wrote:
>> Rainer Duffner ?????:
>>>>   
>>> e) make snapshot, only mount snapshot on to-sync-node (and with
>>> lock_nolock).
>>>
>>> http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html
>>>
>>> Disadvantage: only works for one node at a time.
>>>
>> Can you give me some more links about snapshots? =)
>>
> 
> Oh...
> Ahem...
> I forgot to say: this was meant to use the snapshot-facilities of your
> storage.
> I was assuming, you use some kind of SAN.
> What storage do you use? I don't see anything being mentioned in the
> original post.

  Linux from kernel.org 2.6.19.2
  FC storage (HP MSA1500)
  LVM2
  Cluster package 1.04


> 
> 
> 
> 
> cheers,
> Rainer
> 



From rainer at ultra-secure.de  Mon Aug  6 12:52:38 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Mon, 06 Aug 2007 14:52:38 +0200
Subject: [Linux-cluster] Re: rsync via GFS
In-Reply-To: <f971ks$a45$1@sea.gmane.org>
References: <f91is0$5bo$1@sea.gmane.org>	<46B6E46D.3090200@artegence.com>	<46B6F9D9.6080905@ultra-secure.de>	<f96u7r$v3u$1@sea.gmane.org>	<46B70618.60006@ultra-secure.de>
	<f971ks$a45$1@sea.gmane.org>
Message-ID: <46B71996.30500@ultra-secure.de>

Dmitriy MiksIr wrote:
> Rainer Duffner ?????:
>> Dmitriy MiksIr wrote:
>>> Rainer Duffner ?????:
>>>>>   
>>>> e) make snapshot, only mount snapshot on to-sync-node (and with
>>>> lock_nolock).
>>>>
>>>> http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-lock-nolock.html
>>>>
>>>>
>>>> Disadvantage: only works for one node at a time.
>>>>
>>> Can you give me some more links about snapshots? =)
>>>
>>
>> Oh...
>> Ahem...
>> I forgot to say: this was meant to use the snapshot-facilities of your
>> storage.
>> I was assuming, you use some kind of SAN.
>> What storage do you use? I don't see anything being mentioned in the
>> original post.
>
>  Linux from kernel.org 2.6.19.2
>  FC storage (HP MSA1500)


Does that do hardware-snapshots?


cheers,
Rainer



From miksir at maker.ru  Mon Aug  6 13:18:08 2007
From: miksir at maker.ru (Dmitriy MiksIr)
Date: Mon, 06 Aug 2007 17:18:08 +0400
Subject: [Linux-cluster] Re: rsync via GFS
In-Reply-To: <46B71996.30500@ultra-secure.de>
References: <f91is0$5bo$1@sea.gmane.org>	<46B6E46D.3090200@artegence.com>	<46B6F9D9.6080905@ultra-secure.de>	<f96u7r$v3u$1@sea.gmane.org>	<46B70618.60006@ultra-secure.de>	<f971ks$a45$1@sea.gmane.org>
	<46B71996.30500@ultra-secure.de>
Message-ID: <f9772f$r94$1@sea.gmane.org>


   In my opinion, no =(
   I think, two ways for me - create special partition for sync on 
storage or in existing gfs partition create file and map it as block 
device with direct io access...


Rainer Duffner ?????:
> 
> 
> Does that do hardware-snapshots?



From sys.mailing at gmail.com  Mon Aug  6 13:31:16 2007
From: sys.mailing at gmail.com (Bjorn Oglefjorn)
Date: Mon, 6 Aug 2007 09:31:16 -0400
Subject: [Linux-cluster] nofailback for failover domains?
Message-ID: <926ab61b0708060631h5aa3ef0fu2fe183e80df752f6@mail.gmail.com>

I found that a 'nofailback' option was added for the <failoverdomains>
section of the conf.  I can't find any reference to 'nofailback' in any RHCS
doc I can find.  I'm guessing it should look like this:

    <failoverdomain name="test_failover_domain" ordered="1" restricted="1"
nofailback="1">
        ...
    </failoverdomain>

Can someone confirm?  I will attempt to confirm this myself and will report
back when I know for sure, but it seems to behave as I would expect.

Thanks,
BO
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070806/baa3a697/attachment.htm>

From thomas at mainloop.se  Mon Aug  6 14:58:05 2007
From: thomas at mainloop.se (Thomas Althoff)
Date: Mon, 6 Aug 2007 16:58:05 +0200
Subject: [Linux-cluster] rsync via GFS
In-Reply-To: <46B6E46D.3090200@artegence.com>
References: <f91is0$5bo$1@sea.gmane.org> <46B6E46D.3090200@artegence.com>
Message-ID: <6788CB42B92B3449B633BEBC173D9BB7884A07@spoke.intranet.mainloop.net>

> b) Rsync have to create a lock for each file so You could try to
increase /proc/cluster/lock_dlm/drop_count

How ?

I'm running RHEL5, with GFS (not GFS2) and lock_dlm.
I don't see /proc/cluster on my servers.

-Thomas



From jmaddox at stetson.edu  Mon Aug  6 17:17:26 2007
From: jmaddox at stetson.edu (John Maddox)
Date: Mon, 6 Aug 2007 13:17:26 -0400
Subject: [Linux-cluster] Capturing ricci errors
Message-ID: <D68FFD5C2DF2174E9ACF9A4CA6458D810A0816CE@alpha2.ad.stetson.edu>

Hello - still new to RH Cluster - trying to set up a small, 2 node
cluster and frequently get "A ricci error has occurred ... etc you will
be redirected" - but I never get to see what the error was.  Is ricci
logging these errors anywhere?  I don't see them in '/var/log/messages',
nor in luci's log.  

 

Thanks in advance.

John

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070806/10585a2a/attachment.htm>

From mgrac at redhat.com  Tue Aug  7 13:39:26 2007
From: mgrac at redhat.com (Marek 'marx' Grac)
Date: Tue, 07 Aug 2007 15:39:26 +0200
Subject: [Linux-cluster] How to add start-up options to pulse ??
In-Reply-To: <526805.31332.qm@web50612.mail.re2.yahoo.com>
References: <526805.31332.qm@web50612.mail.re2.yahoo.com>
Message-ID: <46B8760E.7030404@redhat.com>

Hi,

? wrote:
> not dificult but "lot" of work and not so important
> patch so, I guess, this will not be a priority for
> redhat ...
>
> will you appreciate a patch for all this ? I can do
> it, in fact I will, again rhel5 version
>   
I will appreciate it and you will find it in next update :)

marx,

-- 
Marek Grac
Red Hat Czech s.r.o.



From orkcu at yahoo.com  Tue Aug  7 16:06:02 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Tue, 7 Aug 2007 09:06:02 -0700 (PDT)
Subject: [Linux-cluster] How to add start-up options to pulse ??
In-Reply-To: <46B8760E.7030404@redhat.com>
Message-ID: <38991.27701.qm@web50607.mail.re2.yahoo.com>


--- Marek 'marx' Grac <mgrac at redhat.com> wrote:

> Hi,
> 
> ??? wrote:
> > not dificult but "lot" of work and not so
> important
> > patch so, I guess, this will not be a priority for
> > redhat ...
> >
> > will you appreciate a patch for all this ? I can
> do
> > it, in fact I will, again rhel5 version
> >   
> I will appreciate it and you will find it in next
> update :)
> 
ok, here is the RFE bugzilla entry:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250888

although very simple patch, not big deal ...

thanks
roger


__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


       
____________________________________________________________________________________
Sick sense of humor? Visit Yahoo! TV's 
Comedy with an Edge to see what's on, when. 
http://tv.yahoo.com/collections/222



From storm at elemental.it  Tue Aug  7 18:15:46 2007
From: storm at elemental.it (St0rM)
Date: Tue, 07 Aug 2007 20:15:46 +0200
Subject: [Linux-cluster] Using mysql cluster with GFS on RHEL 4
In-Reply-To: <2CC2091A-BBAC-4E9C-98BC-DE3EAA82646B@engineyard.com>
References: <433093DF7AD7444DA65EFAFE3987879C2454BA@jellyfish.highlyscyld.com>
	<2CC2091A-BBAC-4E9C-98BC-DE3EAA82646B@engineyard.com>
Message-ID: <46B8B6D2.5060509@elemental.it>

Excuse me if i drag this dead (mail) body from the water of november 2006...

>> So with myisam tables I can do active/active on the same database
>> with shared data? Or is it the inram database that is shared-nothing?

> That is correct, with MyISAM tables, you can have active/active on GFS
> storage.

... And if I use InnoDB?

What happen if I have two separate servers connecter with a SCSI
storage, using GFS, having the two MySQL server using a datadir on a
shared mounted partition residing on the storage ?



-- 
St0rM

-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GIT d-() s:+>: a- C++(++++) UL++++$ P+ L++++$ E- W+++$ N- o+ K w--() !O
!M>+ !V PS+ PE Y+(++) PGP>+ t+ 5?>+ X++ R++ tv-- b+ DI+++ D+ G+ e* h---
r++ y+++
------END GEEK CODE BLOCK------

"There are only 10 types of people in the world:
   Those who understand binary, and those who don't"



From nattaponv at hotmail.com  Wed Aug  8 06:00:55 2007
From: nattaponv at hotmail.com (nattapon viroonsri)
Date: Wed, 08 Aug 2007 06:00:55 +0000
Subject: [Linux-cluster] Missed too many heartbeats
Message-ID: <BAY123-F109836702F02812B8E2E77A6E70@phx.gbl>


OS: RHEL4 Update 4
Kernel: 2.6.9-42.ELsmp
Cluster: RhCS4 Update4, RHGFS4 U4(GFS-6.1.6-1)
Multipath: EMCpower.LINUX-4.5.1-022
Storage: Fibre channel with EMC CX-320
Fence Device: DELL DRAC5
Service: Postfix, Courier-imap

nodeA.example.com: 192.168.0.20
nodeB.example.com: 192.168.0.60

Drac5(nodeA): 192.168.0.121
Drac5(nodeB); 192.168.0.161


I have 2 node using gfs cluster and powerpath connect through fibre to 
EMC-CX-320 Storage.
both node use drac5 as fence device
Heartbeat traffice use same interface as normal traffic(Mail,imap/pop3)

Problem is only NodeB alway fenced NodeA with reason  "Missed too many 
heartbeats"

After NodeA was rebooted system can join cluster again and working fine 
until  nodeB start fence again,  May be

4-5 hour or 6-7 hour later.

This happen in random manner  2-3 time per day
Memory,Cpu,i/o look good and  Traffice not peak  during problem have occured 
(from sar, and mrtg)
no drop, no collision from ifconfig command

In logfile show same messages every time nodeB start fenced NodeA
I try to extend heartbeat interval by  change  "deadnode_timeout" from 21 to 
61 but doesn't help

Have anyway  to solve this problem or enable more debuging ?
Do i have to dedicate network card to separte heartbeat and normal traffic ?



###### /var/log/message
Aug  7 21:50:06 nodeB kernel: CMAN: removing node nodeA.example.com from the 
cluster : Missed too many

heartbeats
Aug  7 21:50:06 nodeB fenced[20770]: nodeA.example.com not a cluster member 
after 0 sec post_fail_delay
Aug  7 21:50:06 nodeB fenced[20770]: fencing node "nodeA.example.com"
Aug  7 21:50:15 nodeB fenced[20770]: fence "nodeA.example.com" success
Aug  7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: 
Trying to acquire journal lock...
Aug  7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: 
Looking at journal...
Aug  7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Done
Aug  7 21:53:36 nodeB kernel: CMAN: node nodeA.example.com rejoining


###### /etc/cluster/cluster.conf ################

<?xml version="1.0" ?>
<cluster config_version="7" name="bkkair_cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="15"/>
        <clusternodes>
                <clusternode name="nodeA.example.com" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" 
name="DRAC-nodeA"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="nodeB.example.com" votes="1">
                        <fence>
                                <method name="1">
                                        <device modulename="" 
name="DRAC-nodeB"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
         <cman deadnode_timeout="61"/>
        <fencedevices>
                <fencedevice agent="fence_drac" ipaddr="192.168.0.121" 
login="root" name="DRAC-nodeA"

passwd="supervis"/>
                <fencedevice agent="fence_drac" ipaddr="192.168.0.161" 
login="root" name="DRAC-nodeB"

passwd="supervis"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>


#####################################################


Regards,
Nattapon

_________________________________________________________________
FREE pop-up blocking with the new MSN Toolbar - get it now! 
http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/



From maciej.bogucki at artegence.com  Thu Aug  9 13:49:33 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Thu, 09 Aug 2007 15:49:33 +0200
Subject: [Linux-cluster] Using mysql cluster with GFS on RHEL 4
In-Reply-To: <46B8B6D2.5060509@elemental.it>
References: <433093DF7AD7444DA65EFAFE3987879C2454BA@jellyfish.highlyscyld.com>	<2CC2091A-BBAC-4E9C-98BC-DE3EAA82646B@engineyard.com>
	<46B8B6D2.5060509@elemental.it>
Message-ID: <46BB1B6D.6020008@artegence.com>

St0rM napisa?(a):
> Excuse me if i drag this dead (mail) body from the water of november 2006...
> 
>>> So with myisam tables I can do active/active on the same database
>>> with shared data? Or is it the inram database that is shared-nothing?
> 
>> That is correct, with MyISAM tables, you can have active/active on GFS
>> storage.

Did somebody test it in production?


> ... And if I use InnoDB?
> 
> What happen if I have two separate servers connecter with a SCSI
> storage, using GFS, having the two MySQL server using a datadir on a
> shared mounted partition residing on the storage ?

It will not work, because InnoDB isn't impemented to be clustered
storage engine.
Please check http://www.mysql.com/products/cluster/ with NDB engine
stored in RAM or wait for stable MySQL 5.1 which from version 5.1.6[1]
supports "Cluster Disk Data Tables"

[1] - http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-disk-data.html

Best Regards
Maciej Bogucki



From lanthier.stephanie at uqam.ca  Thu Aug  9 18:13:37 2007
From: lanthier.stephanie at uqam.ca (=?UTF-8?B?TGFudGhpZXIsIFN0w6lwaGFuaWU=?=)
Date: Thu, 9 Aug 2007 14:13:37 -0400
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
Message-ID: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>

Dear list members,
 
I have in production a RHCS cluster composed of three RHEL4u5 nodes that use GFS. Initially, I first put no fence device on the nodes. I just defined a manual fence device without associating it to the nodes. 
 
As the GFS file system is not accessible when I'm rebooting one of the three nodes, I'm realizing the importance of fence devices.
 
I just defined manual fence devices for the three nodes, but I read that manual fence device is not a good idea for production environment.
 
My machines are SUN Fire X4100. I see that we can define a fence device of type HP ILO. I would like to know if I can use the HP ILO form in system-config-cluster tool to enter and use a SUN ILOM as fence device?
 
If so, do you have any points I should pay attention for when I will define them? I recall that I'm working on a production environment and I'm scary to put things worst that they already are.
 
Thank you very much

__________________

St?phanie Lanthier

Analyste de l'informatique
Universit? du Qu?bec ? Montr?al
Service de l'informatique et des t?l?communications
lanthier.stephanie at uqam.ca
T?l?phone : 514-987-3000 poste 6106
Bureau : PK-M535


 

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070809/425f4889/attachment.htm>

From brad at bradandkim.net  Thu Aug  9 18:36:00 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Thu, 9 Aug 2007 13:36:00 -0500 (CDT)
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
Message-ID: <52540.129.237.174.144.1186684560.squirrel@webmail.bradandkim.net>


> Dear list members,
>
> I have in production a RHCS cluster composed of three RHEL4u5 nodes that
> use GFS. Initially, I first put no fence device on the nodes. I just
> defined a manual fence device without associating it to the nodes.
>
> As the GFS file system is not accessible when I'm rebooting one of the
> three nodes, I'm realizing the importance of fence devices.
>
> I just defined manual fence devices for the three nodes, but I read that
> manual fence device is not a good idea for production environment.
>
> My machines are SUN Fire X4100. I see that we can define a fence device of
> type HP ILO. I would like to know if I can use the HP ILO form in
> system-config-cluster tool to enter and use a SUN ILOM as fence device?
>
> If so, do you have any points I should pay attention for when I will
> define them? I recall that I'm working on a production environment and I'm
> scary to put things worst that they already are.
>
> Thank you very much
>
> __________________
>
> St??phanie Lanthier


I run a mix of SUNFire X4100's and X4600's and am currently testing a
cluster setup with them.  Though I have not fully tested it yet, I am
planning on trying ipmi_lan as the fence device since the cards support
IPMI.  I can let you know how it works out.

Thanks,

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From jparsons at redhat.com  Thu Aug  9 20:01:32 2007
From: jparsons at redhat.com (jim parsons)
Date: Thu, 09 Aug 2007 16:01:32 -0400
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
Message-ID: <1186689692.3002.12.camel@localhost.localdomain>

On Thu, 2007-08-09 at 14:13 -0400, Lanthier, St?phanie wrote:
> ? 
> Dear list members,
>  
> I have in production a RHCS cluster composed of three RHEL4u5 nodes
> that use GFS. Initially, I first put no fence device on the nodes. I
> just defined a manual fence device without associating it to the
> nodes. 
>  
> As the GFS file system is not accessible when I'm rebooting one of the
> three nodes, I'm realizing the importance of fence devices.
>  
> I just defined manual fence devices for the three nodes, but I read
> that manual fence device is not a good idea for production
> environment.
You need to run the fence_ack_manual script after fencing...it is really
a pain, and DEF not anything to use for production.
>  
> My machines are SUN Fire X4100. I see that we can define a fence
> device of type HP ILO. I would like to know if I can use the HP ILO
> form in system-config-cluster tool to enter and use a SUN ILOM as
> fence device?
Know, please, that system-config-cluster is just a front-end editor for
the /etc/cluster/cluster.conf file. It takes your fence form values and
inserts the values in the proper format in the file and then calls the
methods to update the cluster with the new file.

I do not know the params needed for the SUN ILOM, but I doubt very much
that the fence_ilo agent would do the correct thing. It would be easy to
find out, though. Man fence_ilo and see the params needed, and then run
the agent from the command line (/sbin/fence_ilo -a
System.ILOM.To.Reboot.Now -l login -p passwd 
...and see what happens...I kind of doubt it will work :/
How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM?
If so, there might be a way...by hacking on another agent.

Adding an agent is really not too big of a deal, if you are handy with
scripting.
>  
-J
> 



From dist-list at LEXUM.UMontreal.CA  Thu Aug  9 22:03:58 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Thu, 09 Aug 2007 18:03:58 -0400
Subject: [Linux-cluster] add journal to GFS 
Message-ID: <46BB8F4E.7070203@lexum.umontreal.ca>

Hello,
I have to add more journals to add a new nodes.
SO What I did :
create a new LUN
add it to lvm usign :
lvcreate
vgextend
lvextend

after that I use gfs_grow
now the GFS is 150 GB bigger  BUT
gfs_jadd say that did not have space to add journals



What did I do wrong ?

Tx



From mathieu.avila at seanodes.com  Fri Aug 10 06:59:26 2007
From: mathieu.avila at seanodes.com (Mathieu Avila)
Date: Fri, 10 Aug 2007 08:59:26 +0200
Subject: [Linux-cluster] add journal to GFS
In-Reply-To: <46BB8F4E.7070203@lexum.umontreal.ca>
References: <46BB8F4E.7070203@lexum.umontreal.ca>
Message-ID: <20070810085926.1a4a13d1@mathieu.toulouse>

Le Thu, 09 Aug 2007 18:03:58 -0400,
FM <dist-list at LEXUM.UMontreal.CA> a ?crit :

> Hello,
> I have to add more journals to add a new nodes.
> SO What I did :
> create a new LUN
> add it to lvm usign :
> lvcreate
> vgextend
> lvextend
> 
> after that I use gfs_grow
> now the GFS is 150 GB bigger  BUT
> gfs_jadd say that did not have space to add journals
> 
> 

You have used the new space for normal data or meta-data, so that it
isn't available anymore. "gfs_jadd" doesn't use the free space of a
file system, it needs to be fed up with new space, just like gfs_grow
works.
As your file system cannot be shrinked, the only solution i see is to
add space one more time (much less, 150G is very much unless you want
to 100+ nodes), and run gfs_jadd again.

Cluster team, please correct me if i'm wrong.

--
Mathieu





From maciej.bogucki at artegence.com  Fri Aug 10 07:32:51 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 10 Aug 2007 09:32:51 +0200
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
Message-ID: <46BC14A3.70303@artegence.com>

> As the GFS file system is not accessible when I'm rebooting one of the
> three nodes, I'm realizing the importance of fence devices.
It is strange to me, because if You have rc scripts and quorum properly
configured, and when You perform reboot of one node Your GFS filesystem
should be accesible all time.

Best Regards
Maciej Bogucki



From maciej.bogucki at artegence.com  Fri Aug 10 09:30:16 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 10 Aug 2007 11:30:16 +0200
Subject: [Linux-cluster] RH4U5 and SCSI-3 Persistent reservation 
Message-ID: <46BC3028.2090507@artegence.com>

Hello,

I want to run fencing based on SCSI-3 persisten reservation protocol. I
have two servers with GFS filesystem and I want to write to all of them
at one time. But when I start scsi_reserve on the second node I get:

---cut---
Aug 10 11:26:08 host2 scsi_reserve: register of device /dev/sdb1 succeeded
Aug 10 11:26:08 host2 kernel: scsi0 (0,0,1) : reservation conflict
Aug 10 11:26:08 host2 kernel: 492 [RAIDarray.mpp]Array_Module_0:1:0:1 IO
FAILURE. vcmnd SN 35975 pdev H0:C0:T0:L1 0x00/0x00/0x00 0x00000018
mpp_status:3
Aug 10 11:26:08 host2 kernel: scsi2 (0,0,1) : reservation conflict
Aug 10 11:26:08 host2 kernel: SCSI error : <2 0 0 1> return code = 0x18
---cut---

So I think that RH4U5 only supports SCSI-2 persistent reservations.

But here [1] we can read that RH4U5 support for SCSI-3 persistent group
reservations. SCSI-3 is a group reservation: every node has a key on a
dedicated area on the disk and when a node has to leave, another node
will just kick off its key. So it is what I'm looking for.

[1] - http://www.desktoplinux.com/news/NS3524659857.html

Best Regards
Maciej Bogucki



From beres.laszlo at sys-admin.hu  Fri Aug 10 10:01:32 2007
From: beres.laszlo at sys-admin.hu (BERES Laszlo)
Date: Fri, 10 Aug 2007 12:01:32 +0200
Subject: [Linux-cluster] Updating fence scripts
Message-ID: <46BC377C.9060109@sys-admin.hu>

Hello,

just a silly question: if I update fence package, do I have to restart
fenced? Does it affect GFS? I've never done this before, just updated
the whole cluster.

Thanks,

-- 
B?RES L?szl?	 RHCE, RHCX
senior IT engineer, trainer



From maciej.bogucki at artegence.com  Fri Aug 10 10:23:51 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 10 Aug 2007 12:23:51 +0200
Subject: [Linux-cluster] Updating fence scripts
In-Reply-To: <46BC377C.9060109@sys-admin.hu>
References: <46BC377C.9060109@sys-admin.hu>
Message-ID: <46BC3CB7.5010909@artegence.com>

> just a silly question: if I update fence package, do I have to restart
> fenced? Does it affect GFS? I've never done this before, just updated
> the whole cluster.

Hello,

For production clusters:
1. do tests in testing environment
2. install in production
3. do tests in production(at night with sheduled downtime if needed).

If You doesn't have test environment then I suggest You to do tests in
production to check if everything is working as You expected.
But, if there is no change in fence_xxx script You don't need to do
fence testing.

Best Regards
Maciej Bogucki



From bernard.chew at muvee.com  Fri Aug 10 11:56:04 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Fri, 10 Aug 2007 19:56:04 +0800
Subject: [Linux-cluster] Using cmirror
Message-ID: <1186746964.16863.8.camel@ws-berd.sg.muvee.net>

Hi,

I read that cmirror provides user-level utilities for managing cluster
mirroring but could not find much documentation on it. Can anyone point
me to any documentation / guide around?

Regards,
Bernard Chew
IT Operations



From beres.laszlo at sys-admin.hu  Fri Aug 10 11:58:39 2007
From: beres.laszlo at sys-admin.hu (BERES Laszlo)
Date: Fri, 10 Aug 2007 13:58:39 +0200
Subject: [Linux-cluster] Updating fence scripts
In-Reply-To: <46BC3CB7.5010909@artegence.com>
References: <46BC377C.9060109@sys-admin.hu> <46BC3CB7.5010909@artegence.com>
Message-ID: <46BC52EF.7020903@sys-admin.hu>

Maciej Bogucki wrote:

> If You doesn't have test environment then I suggest You to do tests in
> production to check if everything is working as You expected.

Thanks, but my question was somehow explicit about fencing :)

> But, if there is no change in fence_xxx script You don't need to do
> fence testing.

There is a change in fence_ilo script, I have to upgrade it.

-- 
B?RES L?szl?	 RHCE, RHCX
senior IT engineer, trainer



From maciej.bogucki at artegence.com  Fri Aug 10 12:05:03 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 10 Aug 2007 14:05:03 +0200
Subject: [Linux-cluster] Updating fence scripts
In-Reply-To: <46BC52EF.7020903@sys-admin.hu>
References: <46BC377C.9060109@sys-admin.hu> <46BC3CB7.5010909@artegence.com>
	<46BC52EF.7020903@sys-admin.hu>
Message-ID: <46BC546F.4040901@artegence.com>

>> But, if there is no change in fence_xxx script You don't need to do
>> fence testing.
> 
> There is a change in fence_ilo script, I have to upgrade it.

Hello,

If there are changes in fence_ilo I suggest You to perform testing.

Best Regards
Maciej Bogucki



From beres.laszlo at sys-admin.hu  Fri Aug 10 12:07:12 2007
From: beres.laszlo at sys-admin.hu (BERES Laszlo)
Date: Fri, 10 Aug 2007 14:07:12 +0200
Subject: [Linux-cluster] Updating fence scripts
In-Reply-To: <46BC546F.4040901@artegence.com>
References: <46BC377C.9060109@sys-admin.hu>
	<46BC3CB7.5010909@artegence.com>	<46BC52EF.7020903@sys-admin.hu>
	<46BC546F.4040901@artegence.com>
Message-ID: <46BC54F0.7000804@sys-admin.hu>

Maciej Bogucki wrote:

> If there are changes in fence_ilo I suggest You to perform testing.

Unfortunately we don't have test systems, only a productive one.

-- 
B?RES L?szl?	 RHCE, RHCX
senior IT engineer, trainer



From maciej.bogucki at artegence.com  Fri Aug 10 12:31:33 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 10 Aug 2007 14:31:33 +0200
Subject: [Linux-cluster] Updating fence scripts
In-Reply-To: <46BC54F0.7000804@sys-admin.hu>
References: <46BC377C.9060109@sys-admin.hu>	<46BC3CB7.5010909@artegence.com>	<46BC52EF.7020903@sys-admin.hu>	<46BC546F.4040901@artegence.com>
	<46BC54F0.7000804@sys-admin.hu>
Message-ID: <46BC5AA5.4010700@artegence.com>

>> If there are changes in fence_ilo I suggest You to perform testing.
> 
> Unfortunately we don't have test systems, only a productive one.

Hello,

So I suggest You to plan scheduled downtime at night and test fencing in
production.

Best Regards
Maciej Bogucki



From dist-list at LEXUM.UMontreal.CA  Fri Aug 10 13:58:13 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Fri, 10 Aug 2007 09:58:13 -0400
Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error 
Message-ID: <46BC6EF5.2010500@lexum.umontreal.ca>

Hello,
All servers are RHEL 4.5
SAN is HP EVA 4000
we are using linux qla modules and multipathd
cluster server have only one FC Card


In the dmesg of servers connected to GFS we have a lot of :
SCSI error : <0 0 1 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 37807111

The cluster seems to work fine but I'd like to know if we can avoid this
error.

here is a multipathd -ll output :

[root at como ~]# multipath -ll
mpath1 (3600508b4001051e40000900000310000)
[size=500 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=50][active]
 \_ 0:0:0:1 sda 8:0  [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 0:0:1:1 sdd 8:48 [active][ready]

mpath3 (3600508b4001051e400009000009e0000)
[size=150 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=50][active]
 \_ 0:0:1:2 sde 8:64 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 0:0:0:2 sdb 8:16 [active][ready]



and the device in the multipath.conf

devices {
        device {
                vendor                  "HP "
                product                 "HSV200 "
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                path_checker            tur
                path_selector           "round-robin 0"
                prio_callout            "/sbin/mpath_prio_alua %d"
                failback                immediate
                no_path_retry           60
        }
}



From simone.gotti at email.it  Fri Aug 10 14:14:22 2007
From: simone.gotti at email.it (Simone Gotti)
Date: Fri, 10 Aug 2007 16:14:22 +0200
Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error
In-Reply-To: <46BC6EF5.2010500@lexum.umontreal.ca>
References: <46BC6EF5.2010500@lexum.umontreal.ca>
Message-ID: <1186755262.6117.11.camel@localhost>

Hi,

I saw various machines with Qlogic HBAs having this issue (error code
0x20000 is DID_BUS_BUSY), in my case when using device mapper multipath,
the path getting the error was failed by dm-multipath and then reactived
because the path checker reported it was up (as it was transient error).

It looks like a wrong qla2xxx behavior as reported in this knowledge
base:  http://kbase.redhat.com/faq/FAQ_46_9001.shtm
and also in bug
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231319
where there's a proposed fix for RHEL4 U6.

I tested the workaround proposed in the kbase in a test environment
where unfortunately this issue wasn't present and I simulated it forcing
an HBA lip with sysfs but with this test the problem didn't disappeared.

Maybe your issue is the same.

Bye!

On Fri, 2007-08-10 at 09:58 -0400, FM wrote:
> Hello,
> All servers are RHEL 4.5
> SAN is HP EVA 4000
> we are using linux qla modules and multipathd
> cluster server have only one FC Card
> 
> 
> In the dmesg of servers connected to GFS we have a lot of :
> SCSI error : <0 0 1 1> return code = 0x20000
> end_request: I/O error, dev sdd, sector 37807111
> 
> The cluster seems to work fine but I'd like to know if we can avoid this
> error.
> 
> here is a multipathd -ll output :
> 
> [root at como ~]# multipath -ll
> mpath1 (3600508b4001051e40000900000310000)
> [size=500 GB][features="1 queue_if_no_path"][hwhandler="0"]
> \_ round-robin 0 [prio=50][active]
>  \_ 0:0:0:1 sda 8:0  [active][ready]
> \_ round-robin 0 [prio=10][enabled]
>  \_ 0:0:1:1 sdd 8:48 [active][ready]
> 
> mpath3 (3600508b4001051e400009000009e0000)
> [size=150 GB][features="1 queue_if_no_path"][hwhandler="0"]
> \_ round-robin 0 [prio=50][active]
>  \_ 0:0:1:2 sde 8:64 [active][ready]
> \_ round-robin 0 [prio=10][enabled]
>  \_ 0:0:0:2 sdb 8:16 [active][ready]
> 
> 
> 
> and the device in the multipath.conf
> 
> devices {
>         device {
>                 vendor                  "HP "
>                 product                 "HSV200 "
>                 path_grouping_policy    group_by_prio
>                 getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
>                 path_checker            tur
>                 path_selector           "round-robin 0"
>                 prio_callout            "/sbin/mpath_prio_alua %d"
>                 failback                immediate
>                 no_path_retry           60
>         }
> }
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
-- 
Simone Gotti
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070810/cdc6b317/attachment.sig>

From jbrassow at redhat.com  Fri Aug 10 15:01:29 2007
From: jbrassow at redhat.com (Jonathan Brassow)
Date: Fri, 10 Aug 2007 10:01:29 -0500
Subject: [Linux-cluster] Using cmirror
In-Reply-To: <1186746964.16863.8.camel@ws-berd.sg.muvee.net>
References: <1186746964.16863.8.camel@ws-berd.sg.muvee.net>
Message-ID: <8F3D9B0A-3ABE-4636-A27F-B1120DBC85F2@redhat.com>

If you've set up a cluster and are using LVM, it will work the same  
way as single machine mirroring.

http://www.redhat.com/docs/manuals/csgfs/browse/4.5/ 
SAC_Cluster_Logical_Volume_Manager/mirror_create.html

  brassow

On Aug 10, 2007, at 6:56 AM, Bernard Chew wrote:

> Hi,
>
> I read that cmirror provides user-level utilities for managing cluster
> mirroring but could not find much documentation on it. Can anyone  
> point
> me to any documentation / guide around?
>
> Regards,
> Bernard Chew
> IT Operations
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From Quentin.Arce at Sun.COM  Fri Aug 10 15:12:48 2007
From: Quentin.Arce at Sun.COM (Quentin Arce)
Date: Fri, 10 Aug 2007 08:12:48 -0700
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <1186689692.3002.12.camel@localhost.localdomain>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
	<1186689692.3002.12.camel@localhost.localdomain>
Message-ID: <46BC8070.3070400@Sun.Com>

jim parsons wrote:
> On Thu, 2007-08-09 at 14:13 -0400, Lanthier, St?phanie wrote:
>   
>> ? 
>> Dear list members,
>>  
>> I have in production a RHCS cluster composed of three RHEL4u5 nodes
>> that use GFS. Initially, I first put no fence device on the nodes. I
>> just defined a manual fence device without associating it to the
>> nodes. 
>>  
>> As the GFS file system is not accessible when I'm rebooting one of the
>> three nodes, I'm realizing the importance of fence devices.
>>  
>> I just defined manual fence devices for the three nodes, but I read
>> that manual fence device is not a good idea for production
>> environment.
>>     
> You need to run the fence_ack_manual script after fencing...it is really
> a pain, and DEF not anything to use for production.
>   
>>  
>> My machines are SUN Fire X4100. I see that we can define a fence
>> device of type HP ILO. I would like to know if I can use the HP ILO
>> form in system-config-cluster tool to enter and use a SUN ILOM as
>> fence device?
>>     
> Know, please, that system-config-cluster is just a front-end editor for
> the /etc/cluster/cluster.conf file. It takes your fence form values and
> inserts the values in the proper format in the file and then calls the
> methods to update the cluster with the new file.
>
> I do not know the params needed for the SUN ILOM, but I doubt very much
> that the fence_ilo agent would do the correct thing. It would be easy to
> find out, though. Man fence_ilo and see the params needed, and then run
> the agent from the command line (/sbin/fence_ilo -a
> System.ILOM.To.Reboot.Now -l login -p passwd 
> ...and see what happens...I kind of doubt it will work :/
> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM?
> If so, there might be a way...by hacking on another agent.
>   
So, I'm a lurker on this list as I no longer have a cluster up... but I 
work on ILOM and I would love to see this work.  This isn't official 
support, I'm a developer not a customer support person.  So, it's more 
on my time.  If there is anything I can do... Please let me know.  
Questions on this problem, regarding what ILOM can / can't do, how to 
check state of the server via ILOM, etc.

Thanks,

Quentin
> Adding an agent is really not too big of a deal, if you are handy with
> scripting.
>   
>>  
>>     
> -J
>   
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From Quentin.Arce at Sun.COM  Fri Aug 10 15:28:23 2007
From: Quentin.Arce at Sun.COM (Quentin Arce)
Date: Fri, 10 Aug 2007 08:28:23 -0700
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <1186689692.3002.12.camel@localhost.localdomain>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
	<1186689692.3002.12.camel@localhost.localdomain>
Message-ID: <46BC8417.6010201@Sun.Com>

jim parsons wrote:
> On Thu, 2007-08-09 at 14:13 -0400, Lanthier, St?phanie wrote:
>   
>> ? 
>> Dear list members,
>>  
>> I have in production a RHCS cluster composed of three RHEL4u5 nodes
>> that use GFS. Initially, I first put no fence device on the nodes. I
>> just defined a manual fence device without associating it to the
>> nodes. 
>>  
>> As the GFS file system is not accessible when I'm rebooting one of the
>> three nodes, I'm realizing the importance of fence devices.
>>  
>> I just defined manual fence devices for the three nodes, but I read
>> that manual fence device is not a good idea for production
>> environment.
>>     
> You need to run the fence_ack_manual script after fencing...it is really
> a pain, and DEF not anything to use for production.
>   
>>  
>> My machines are SUN Fire X4100. I see that we can define a fence
>> device of type HP ILO. I would like to know if I can use the HP ILO
>> form in system-config-cluster tool to enter and use a SUN ILOM as
>> fence device?
>>     
> Know, please, that system-config-cluster is just a front-end editor for
> the /etc/cluster/cluster.conf file. It takes your fence form values and
> inserts the values in the proper format in the file and then calls the
> methods to update the cluster with the new file.
>
> I do not know the params needed for the SUN ILOM, but I doubt very much
> that the fence_ilo agent would do the correct thing. It would be easy to
> find out, though. Man fence_ilo and see the params needed, and then run
> the agent from the command line (/sbin/fence_ilo -a
> System.ILOM.To.Reboot.Now -l login -p passwd 
> ...and see what happens...I kind of doubt it will work :/
>   
Oh, forgot to answer....
> How does ILOM work? telnet or ssh?
ssh, no telnet.
>  Is there an snmp interface to ILOM?
>   
Yes and IPMI

You just need the mib file for snmp.  It should be on the public 
download site.  If it's not I'll find out where it's published.  The mib 
files you need are, SUN-ILOM-CONTROL-MIB.mib and SUN-PLATFORM-MIB.mib
see:
http://www.sun.com/products-n-solutions/hardware/docs/html/820-0280-12/snmp_using.html#50491426_94813



> If so, there might be a way...by hacking on another agent.
>
> Adding an agent is really not too big of a deal, if you are handy with
> scripting.
>   
>>  
>>     
> -J
>   
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From bernard.chew at muvee.com  Fri Aug 10 15:49:47 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Fri, 10 Aug 2007 23:49:47 +0800
Subject: [Linux-cluster] Using cmirror
References: <1186746964.16863.8.camel@ws-berd.sg.muvee.net>
	<8F3D9B0A-3ABE-4636-A27F-B1120DBC85F2@redhat.com>
Message-ID: <229C73600EB0E54DA818AB599482BCE951EB01@shadowfax.sg.muvee.net>

Hi,

Can this work with GFS where I have 2 iscsi disks (ie. /dev/sda & /dev/sdb) from 2 different iscsi-target servers and I create a mirrored GFS volume?

Regards,
Bernard Chew


-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of Jonathan Brassow
Sent: Fri 8/10/2007 11:01 PM
To: linux clustering
Subject: Re: [Linux-cluster] Using cmirror
 
If you've set up a cluster and are using LVM, it will work the same  
way as single machine mirroring.

http://www.redhat.com/docs/manuals/csgfs/browse/4.5/ 
SAC_Cluster_Logical_Volume_Manager/mirror_create.html

  brassow

On Aug 10, 2007, at 6:56 AM, Bernard Chew wrote:

> Hi,
>
> I read that cmirror provides user-level utilities for managing cluster
> mirroring but could not find much documentation on it. Can anyone  
> point
> me to any documentation / guide around?
>
> Regards,
> Bernard Chew
> IT Operations
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 3110 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070810/5f5fde16/attachment.bin>

From james at cloud9.co.uk  Fri Aug 10 16:11:55 2007
From: james at cloud9.co.uk (James Fidell)
Date: Fri, 10 Aug 2007 17:11:55 +0100
Subject: [Linux-cluster] Using cmirror
In-Reply-To: <229C73600EB0E54DA818AB599482BCE951EB01@shadowfax.sg.muvee.net>
References: <1186746964.16863.8.camel@ws-berd.sg.muvee.net>	<8F3D9B0A-3ABE-4636-A27F-B1120DBC85F2@redhat.com>
	<229C73600EB0E54DA818AB599482BCE951EB01@shadowfax.sg.muvee.net>
Message-ID: <46BC8E4B.1070803@cloud9.co.uk>

Bernard Chew wrote:
> Hi,
> 
> Can this work with GFS where I have 2 iscsi disks (ie. /dev/sda & /dev/sdb) from 2 different iscsi-target servers and I create a mirrored GFS volume?

I had no problems creating such a setup.

Where I did have problems was when one of the iscsi disks "went away".
At that point the iscsi layer appeared to hang and lvm locked up :(

(You need three separate PVs to create a mirrored LVM volume, btw).

James



From jparsons at redhat.com  Fri Aug 10 16:13:21 2007
From: jparsons at redhat.com (James Parsons)
Date: Fri, 10 Aug 2007 12:13:21 -0400
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <46BC8070.3070400@Sun.Com>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>	<1186689692.3002.12.camel@localhost.localdomain>
	<46BC8070.3070400@Sun.Com>
Message-ID: <46BC8EA1.90609@redhat.com>

Quentin Arce wrote:

>
>>  
>>
>>>  
>>> My machines are SUN Fire X4100. I see that we can define a fence
>>> device of type HP ILO. I would like to know if I can use the HP ILO
>>> form in system-config-cluster tool to enter and use a SUN ILOM as
>>> fence device?
>>>     
>>
>>
>> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM?
>> If so, there might be a way...by hacking on another agent.
>>   
>
> So, I'm a lurker on this list as I no longer have a cluster up... but 
> I work on ILOM and I would love to see this work.  This isn't official 
> support, I'm a developer not a customer support person.  So, it's more 
> on my time.  If there is anything I can do... Please let me know.  
> Questions on this problem, regarding what ILOM can / can't do, how to 
> check state of the server via ILOM, etc.

Quentin! That is very kind of you. If you help with the ILOM protocol, 
I'll help with the agent/script. This thread could form a document on 
how to write an arbitrary fence agent for use with rhcs.

Where is documentation available? Generally, three things are needed 
from a baseboard management device in order to use it for fencing: 1) A 
way to shut the system down, 2) a way to power the system up, and 3) a 
way to check if it is up or down.

What means can a script use to communicate with the ILOM card? Are there 
big delta's in the protocol between different ILOM versions?

I look forward to hearing from you.

-J




From jparsons at redhat.com  Fri Aug 10 16:22:55 2007
From: jparsons at redhat.com (James Parsons)
Date: Fri, 10 Aug 2007 12:22:55 -0400
Subject: [Linux-cluster] Updating fence scripts
In-Reply-To: <46BC52EF.7020903@sys-admin.hu>
References: <46BC377C.9060109@sys-admin.hu> <46BC3CB7.5010909@artegence.com>
	<46BC52EF.7020903@sys-admin.hu>
Message-ID: <46BC90DF.2070000@redhat.com>

BERES Laszlo wrote:

>Maciej Bogucki wrote:
>
>  
>
>>If You doesn't have test environment then I suggest You to do tests in
>>production to check if everything is working as You expected.
>>    
>>
>
>Thanks, but my question was somehow explicit about fencing :)
>
>  
>
>>But, if there is no change in fence_xxx script You don't need to do
>>fence testing.
>>    
>>
>
>There is a change in fence_ilo script, I have to upgrade it.
>
>  
>
If you are referring to the 5.1 beta build, or the 4.5 asynchronous 
update, the change to fence_ilo is a minor fix to support ilo2...if you 
are running ilo currently with no problems, it is very highly extremely 
unlikely ;), that you will encounter a problem. Still, as the man says, 
you should test to be certain...as failed fencing when you need it is a 
VERY bad thing.

Here is what changed, btw...
---------------------------------------------------------------------------------

--- cluster/fence/agents/ilo/fence_ilo.pl	2007/04/09 15:22:39	1.3.2.3.2.1
+++ cluster/fence/agents/ilo/fence_ilo.pl	2007/07/17 18:38:59	1.3.2.3.2.2
@@ -279,10 +279,13 @@
 
 	foreach my $line (@response)
 	{
+		if ($line =~ /FIRMWARE_VERSION\s*=\s*\"(.*)\"/) {
+			$firmware_rev = $1;
+		}
 		if ($line =~ /MANAGEMENT_PROCESSOR\s*=\s*\"(.*)\"/) {
 			if ($1 eq "iLO2") {
 				$ilo_vers = 2;
-				print "power_status: reporting iLO2\n" if ($verbose);
+				print "power_status: reporting iLO2 $firmware_rev\n" if ($verbose);
 			}
 		}
 
@@ -358,7 +361,11 @@
 		# HOLD_PWR_BUTTON is used to power the machine off, and
 		# PRESS_PWR_BUTTON is used to power the machine on;
 		# when the power is off, HOLD_PWR_BUTTON has no effect.
-		sendsock $socket, "<HOLD_PWR_BTN/>\n";
+		if ($firmware_rev > 1.29) {
+			sendsock $socket, "<HOLD_PWR_BTN TOGGLE=\"Yes\" />\n";
+		} else {
+			sendsock $socket, "<HOLD_PWR_BTN/>\n";
+		}
 	}
 	# As of firmware version 1.71 (RIBCL 2.21) The SET_HOST_POWER command
 	# is no longer available.  HOLD_PWR_BTN and PRESS_PWR_BTN are used 
@@ -515,6 +522,7 @@
 $action = "reboot";
 $ribcl_vers = undef; # undef = autodetect
 $ilo_vers = 1;
+$firmware_rev = 0;





From brad at bradandkim.net  Fri Aug 10 16:33:32 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Fri, 10 Aug 2007 11:33:32 -0500 (CDT)
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <46BC8EA1.90609@redhat.com>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
	<1186689692.3002.12.camel@localhost.localdomain>
	<46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com>
Message-ID: <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net>


> Quentin Arce wrote:
>
>>
>>>
>>>
>>>>
>>>> My machines are SUN Fire X4100. I see that we can define a fence
>>>> device of type HP ILO. I would like to know if I can use the HP ILO
>>>> form in system-config-cluster tool to enter and use a SUN ILOM as
>>>> fence device?
>>>>
>>>
>>>
>>> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM?
>>> If so, there might be a way...by hacking on another agent.
>>>
>>
>> So, I'm a lurker on this list as I no longer have a cluster up... but
>> I work on ILOM and I would love to see this work.  This isn't official
>> support, I'm a developer not a customer support person.  So, it's more
>> on my time.  If there is anything I can do... Please let me know.
>> Questions on this problem, regarding what ILOM can / can't do, how to
>> check state of the server via ILOM, etc.
>
> Quentin! That is very kind of you. If you help with the ILOM protocol,
> I'll help with the agent/script. This thread could form a document on
> how to write an arbitrary fence agent for use with rhcs.
>
> Where is documentation available? Generally, three things are needed
> from a baseboard management device in order to use it for fencing: 1) A
> way to shut the system down, 2) a way to power the system up, and 3) a
> way to check if it is up or down.
>
> What means can a script use to communicate with the ILOM card? Are there
> big delta's in the protocol between different ILOM versions?
>
> I look forward to hearing from you.
>
> -J

I am interested in seeing this thread play out as well since I have 26 SUN
servers I am beginning to cluster.  My question is why use SNMP over IPMI
v2.0.  I can do the above three things with:

/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power off
/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power on
/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power status

I don't need any MIB's for this either.  It seems to me this might be an
easier solution than snmp, but I may be missing something.

Thanks,

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From Quentin.Arce at Sun.COM  Fri Aug 10 16:35:12 2007
From: Quentin.Arce at Sun.COM (Quentin Arce)
Date: Fri, 10 Aug 2007 09:35:12 -0700
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <46BC8EA1.90609@redhat.com>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
	<1186689692.3002.12.camel@localhost.localdomain>
	<46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com>
Message-ID: <46BC93C0.4070401@Sun.Com>

James Parsons wrote:
> Quentin Arce wrote:
>
>>
>>>  
>>>
>>>>  
>>>> My machines are SUN Fire X4100. I see that we can define a fence
>>>> device of type HP ILO. I would like to know if I can use the HP ILO
>>>> form in system-config-cluster tool to enter and use a SUN ILOM as
>>>> fence device?
>>>>     
>>>
>>>
>>> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM?
>>> If so, there might be a way...by hacking on another agent.
>>>   
>>
>> So, I'm a lurker on this list as I no longer have a cluster up... but 
>> I work on ILOM and I would love to see this work.  This isn't 
>> official support, I'm a developer not a customer support person.  So, 
>> it's more on my time.  If there is anything I can do... Please let me 
>> know.  Questions on this problem, regarding what ILOM can / can't do, 
>> how to check state of the server via ILOM, etc.
>
> Quentin! That is very kind of you. If you help with the ILOM protocol, 
> I'll help with the agent/script. This thread could form a document on 
> how to write an arbitrary fence agent for use with rhcs.
>
> Where is documentation available? Generally, three things are needed 
> from a baseboard management device in order to use it for fencing: 1) 
> A way to shut the system down, 2) a way to power the system up, and 3) 
> a way to check if it is up or down.
>
In that case.  I'm not very familiar with the other LOM cards out 
there.  I have been told that IPMI is standard on most all cards from 
most vendors for the past few years.  If this is true then perhaps the 
simplest method is to use ipmitool to get/set these states.  SNMP works 
fine also but requires the user to turn on r/w v3 snmp access.  If you 
can confirm that IPMI is standard then it's the simplest as the commands 
to get / set many readings / options are standard via the IPMI spec.

Thanks,

Quentin
> What means can a script use to communicate with the ILOM card? Are 
> there big delta's in the protocol between different ILOM versions?
>
> I look forward to hearing from you.
>
> -J
>
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From Quentin.Arce at Sun.COM  Fri Aug 10 16:37:29 2007
From: Quentin.Arce at Sun.COM (Quentin Arce)
Date: Fri, 10 Aug 2007 09:37:29 -0700
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
	<1186689692.3002.12.camel@localhost.localdomain>
	<46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com>
	<58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net>
Message-ID: <46BC9449.5090408@Sun.Com>

brad at bradandkim.net wrote:
>> Quentin Arce wrote:
>>
>>     
>>>>         
>>>>> My machines are SUN Fire X4100. I see that we can define a fence
>>>>> device of type HP ILO. I would like to know if I can use the HP ILO
>>>>> form in system-config-cluster tool to enter and use a SUN ILOM as
>>>>> fence device?
>>>>>
>>>>>           
>>>> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM?
>>>> If so, there might be a way...by hacking on another agent.
>>>>
>>>>         
>>> So, I'm a lurker on this list as I no longer have a cluster up... but
>>> I work on ILOM and I would love to see this work.  This isn't official
>>> support, I'm a developer not a customer support person.  So, it's more
>>> on my time.  If there is anything I can do... Please let me know.
>>> Questions on this problem, regarding what ILOM can / can't do, how to
>>> check state of the server via ILOM, etc.
>>>       
>> Quentin! That is very kind of you. If you help with the ILOM protocol,
>> I'll help with the agent/script. This thread could form a document on
>> how to write an arbitrary fence agent for use with rhcs.
>>
>> Where is documentation available? Generally, three things are needed
>> from a baseboard management device in order to use it for fencing: 1) A
>> way to shut the system down, 2) a way to power the system up, and 3) a
>> way to check if it is up or down.
>>
>> What means can a script use to communicate with the ILOM card? Are there
>> big delta's in the protocol between different ILOM versions?
>>
>> I look forward to hearing from you.
>>
>> -J
>>     
>
> I am interested in seeing this thread play out as well since I have 26 SUN
> servers I am beginning to cluster.  My question is why use SNMP over IPMI
> v2.0.  I can do the above three things with:
>
> /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power off
> /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power on
> /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power status
>
> I don't need any MIB's for this either.  It seems to me this might be an
> easier solution than snmp, but I may be missing something.
>
>   

No, I think you have it all covered.  :-)

If other vendors support IPMI then the script should just use it.

> Thanks,
>
> Brad Crotchett
> brad at bradandkim.net
> http://www.bradandkim.net
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From Quentin.Arce at Sun.COM  Fri Aug 10 16:38:34 2007
From: Quentin.Arce at Sun.COM (Quentin Arce)
Date: Fri, 10 Aug 2007 09:38:34 -0700
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
	<1186689692.3002.12.camel@localhost.localdomain>
	<46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com>
	<58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net>
Message-ID: <46BC948A.3010907@Sun.Com>

brad at bradandkim.net wrote:
>> Quentin Arce wrote:
>>
>>     
>>>>         
>>>>> My machines are SUN Fire X4100. I see that we can define a fence
>>>>> device of type HP ILO. I would like to know if I can use the HP ILO
>>>>> form in system-config-cluster tool to enter and use a SUN ILOM as
>>>>> fence device?
>>>>>
>>>>>           
>>>> How does ILOM work? telnet or ssh? Is there an snmp interface to ILOM?
>>>> If so, there might be a way...by hacking on another agent.
>>>>
>>>>         
>>> So, I'm a lurker on this list as I no longer have a cluster up... but
>>> I work on ILOM and I would love to see this work.  This isn't official
>>> support, I'm a developer not a customer support person.  So, it's more
>>> on my time.  If there is anything I can do... Please let me know.
>>> Questions on this problem, regarding what ILOM can / can't do, how to
>>> check state of the server via ILOM, etc.
>>>       
>> Quentin! That is very kind of you. If you help with the ILOM protocol,
>> I'll help with the agent/script. This thread could form a document on
>> how to write an arbitrary fence agent for use with rhcs.
>>
>> Where is documentation available? Generally, three things are needed
>> from a baseboard management device in order to use it for fencing: 1) A
>> way to shut the system down, 2) a way to power the system up, and 3) a
>> way to check if it is up or down.
>>
>> What means can a script use to communicate with the ILOM card? Are there
>> big delta's in the protocol between different ILOM versions?
>>
>> I look forward to hearing from you.
>>
>> -J
>>     
>
> I am interested in seeing this thread play out as well since I have 26 SUN
> servers I am beginning to cluster.  My question is why use SNMP over IPMI
> v2.0.  I can do the above three things with:
>
> /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power off
> /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power on
> /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power status
>
> I don't need any MIB's for this either.  It seems to me this might be an
> easier solution than snmp, but I may be missing something.
>
>   

Oh make sure you are using lanplus mode for this.


> Thanks,
>
> Brad Crotchett
> brad at bradandkim.net
> http://www.bradandkim.net
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From dist-list at LEXUM.UMontreal.CA  Fri Aug 10 17:35:07 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Fri, 10 Aug 2007 13:35:07 -0400
Subject: [Linux-cluster] lvextend : Error locking ... Volume group for uuid
	not found
Message-ID: <46BCA1CB.2010802@lexum.umontreal.ca>

Ut is not really a good friday for me :)

I am trying to extend le logical volume on cluster nodes :

lvextend -t -l +1279 /dev/SAN-group1/home
  Test mode: Metadata will NOT be updated.
  Extending logical volume home to 654.98 GB
  Error locking on node catanzaro.dmz.lexum.pri: Volume group for uuid
not found: Q8Wmg3qy2FFuCDUuIiI5zFyzVHKzvb53LJgndbQeYPeUzkiDcSxGmZ5a3IjLntOM
  Failed to suspend home

I do not understansd where this uuid comes from.
my vg uuid :


  --- Volume group ---
  VG Name               SAN-group1
  System ID
  Format                lvm2
  Metadata Areas        3
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                3
  Act PV                3
  VG Size               654.98 GB
  PE Size               4.00 MB
  Total PE              167676
  Alloc PE / Size       166397 / 649.99 GB
  Free  PE / Size       1279 / 5.00 GB
  VG UUID               Q8Wmg3-qy2F-FuCD-UuIi-I5zF-yzVH-Kzvb53



From brad at bradandkim.net  Fri Aug 10 17:38:11 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Fri, 10 Aug 2007 12:38:11 -0500 (CDT)
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <46BC948A.3010907@Sun.Com>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
	<1186689692.3002.12.camel@localhost.localdomain>
	<46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com>
	<58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net>
	<46BC948A.3010907@Sun.Com>
Message-ID: <43480.129.237.174.144.1186767491.squirrel@webmail.bradandkim.net>


> brad at bradandkim.net wrote:
>>> Quentin Arce wrote:
>>>
>>>
>>>>>
>>>>>> My machines are SUN Fire X4100. I see that we can define a fence
>>>>>> device of type HP ILO. I would like to know if I can use the HP ILO
>>>>>> form in system-config-cluster tool to enter and use a SUN ILOM as
>>>>>> fence device?
>>>>>>
>>>>>>
>>>>> How does ILOM work? telnet or ssh? Is there an snmp interface to
>>>>> ILOM?
>>>>> If so, there might be a way...by hacking on another agent.
>>>>>
>>>>>
>>>> So, I'm a lurker on this list as I no longer have a cluster up... but
>>>> I work on ILOM and I would love to see this work.  This isn't official
>>>> support, I'm a developer not a customer support person.  So, it's more
>>>> on my time.  If there is anything I can do... Please let me know.
>>>> Questions on this problem, regarding what ILOM can / can't do, how to
>>>> check state of the server via ILOM, etc.
>>>>
>>> Quentin! That is very kind of you. If you help with the ILOM protocol,
>>> I'll help with the agent/script. This thread could form a document on
>>> how to write an arbitrary fence agent for use with rhcs.
>>>
>>> Where is documentation available? Generally, three things are needed
>>> from a baseboard management device in order to use it for fencing: 1) A
>>> way to shut the system down, 2) a way to power the system up, and 3) a
>>> way to check if it is up or down.
>>>
>>> What means can a script use to communicate with the ILOM card? Are
>>> there
>>> big delta's in the protocol between different ILOM versions?
>>>
>>> I look forward to hearing from you.
>>>
>>> -J
>>>
>>
>> I am interested in seeing this thread play out as well since I have 26
>> SUN
>> servers I am beginning to cluster.  My question is why use SNMP over
>> IPMI
>> v2.0.  I can do the above three things with:
>>
>> /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power off
>> /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power on
>> /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power
>> status
>>
>> I don't need any MIB's for this either.  It seems to me this might be an
>> easier solution than snmp, but I may be missing something.
>>
>>
>
> Oh make sure you are using lanplus mode for this.
>

Will do, and thanks.

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From dist-list at LEXUM.UMontreal.CA  Fri Aug 10 17:39:38 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Fri, 10 Aug 2007 13:39:38 -0400
Subject: [Linux-cluster] add journal to GFS
In-Reply-To: <20070810085926.1a4a13d1@mathieu.toulouse>
References: <46BB8F4E.7070203@lexum.umontreal.ca>
	<20070810085926.1a4a13d1@mathieu.toulouse>
Message-ID: <46BCA2DA.5050606@lexum.umontreal.ca>

TX I will try that ... with another LUN :)


Mathieu Avila wrote:
> Le Thu, 09 Aug 2007 18:03:58 -0400,
> FM <dist-list at LEXUM.UMontreal.CA> a ?crit :
> 
>> Hello,
>> I have to add more journals to add a new nodes.
>> SO What I did :
>> create a new LUN
>> add it to lvm usign :
>> lvcreate
>> vgextend
>> lvextend
>>
>> after that I use gfs_grow
>> now the GFS is 150 GB bigger  BUT
>> gfs_jadd say that did not have space to add journals
>>
>>
> 
> You have used the new space for normal data or meta-data, so that it
> isn't available anymore. "gfs_jadd" doesn't use the free space of a
> file system, it needs to be fed up with new space, just like gfs_grow
> works.
> As your file system cannot be shrinked, the only solution i see is to
> add space one more time (much less, 150G is very much unless you want
> to 100+ nodes), and run gfs_jadd again.
> 
> Cluster team, please correct me if i'm wrong.
> 
> --
> Mathieu
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From dist-list at LEXUM.UMontreal.CA  Fri Aug 10 18:22:57 2007
From: dist-list at LEXUM.UMontreal.CA (FM)
Date: Fri, 10 Aug 2007 14:22:57 -0400
Subject: [Linux-cluster] lvextend : Error locking ... Volume group for
	uuid	not found FIXED
In-Reply-To: <46BCA1CB.2010802@lexum.umontreal.ca>
References: <46BCA1CB.2010802@lexum.umontreal.ca>
Message-ID: <46BCAD01.2050805@lexum.umontreal.ca>

service clvmd restart did the trick


FM wrote:
> Ut is not really a good friday for me :)
> 
> I am trying to extend le logical volume on cluster nodes :
> 
> lvextend -t -l +1279 /dev/SAN-group1/home
>   Test mode: Metadata will NOT be updated.
>   Extending logical volume home to 654.98 GB
>   Error locking on node catanzaro.dmz.lexum.pri: Volume group for uuid
> not found: Q8Wmg3qy2FFuCDUuIiI5zFyzVHKzvb53LJgndbQeYPeUzkiDcSxGmZ5a3IjLntOM
>   Failed to suspend home
> 
> I do not understansd where this uuid comes from.
> my vg uuid :
> 
> 
>   --- Volume group ---
>   VG Name               SAN-group1
>   System ID
>   Format                lvm2
>   Metadata Areas        3
>   Metadata Sequence No  5
>   VG Access             read/write
>   VG Status             resizable
>   MAX LV                0
>   Cur LV                1
>   Open LV               1
>   Max PV                0
>   Cur PV                3
>   Act PV                3
>   VG Size               654.98 GB
>   PE Size               4.00 MB
>   Total PE              167676
>   Alloc PE / Size       166397 / 649.99 GB
>   Free  PE / Size       1279 / 5.00 GB
>   VG UUID               Q8Wmg3-qy2F-FuCD-UuIi-I5zF-yzVH-Kzvb53
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From jparsons at redhat.com  Fri Aug 10 18:59:10 2007
From: jparsons at redhat.com (James Parsons)
Date: Fri, 10 Aug 2007 14:59:10 -0400
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <43480.129.237.174.144.1186767491.squirrel@webmail.bradandkim.net>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>	<1186689692.3002.12.camel@localhost.localdomain>	<46BC8070.3070400@Sun.Com>
	<46BC8EA1.90609@redhat.com>	<58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net>	<46BC948A.3010907@Sun.Com>
	<43480.129.237.174.144.1186767491.squirrel@webmail.bradandkim.net>
Message-ID: <46BCB57E.7000809@redhat.com>

brad at bradandkim.net wrote:

>>brad at bradandkim.net wrote:
>>    
>>
>>>>Quentin Arce wrote:
>>>>
>>>>
>>>>        
>>>>
>>>>>>>My machines are SUN Fire X4100. I see that we can define a fence
>>>>>>>device of type HP ILO. I would like to know if I can use the HP ILO
>>>>>>>form in system-config-cluster tool to enter and use a SUN ILOM as
>>>>>>>fence device?
>>>>>>>
>>>>>>>
>>>>>>>              
>>>>>>>
>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to
>>>>>>ILOM?
>>>>>>If so, there might be a way...by hacking on another agent.
>>>>>>
>>>>>>
>>>>>>            
>>>>>>
>>>>>So, I'm a lurker on this list as I no longer have a cluster up... but
>>>>>I work on ILOM and I would love to see this work.  This isn't official
>>>>>support, I'm a developer not a customer support person.  So, it's more
>>>>>on my time.  If there is anything I can do... Please let me know.
>>>>>Questions on this problem, regarding what ILOM can / can't do, how to
>>>>>check state of the server via ILOM, etc.
>>>>>
>>>>>          
>>>>>
>>>>Quentin! That is very kind of you. If you help with the ILOM protocol,
>>>>I'll help with the agent/script. This thread could form a document on
>>>>how to write an arbitrary fence agent for use with rhcs.
>>>>
>>>>Where is documentation available? Generally, three things are needed
>>>>from a baseboard management device in order to use it for fencing: 1) A
>>>>way to shut the system down, 2) a way to power the system up, and 3) a
>>>>way to check if it is up or down.
>>>>
>>>>What means can a script use to communicate with the ILOM card? Are
>>>>there
>>>>big delta's in the protocol between different ILOM versions?
>>>>
>>>>I look forward to hearing from you.
>>>>
>>>>-J
>>>>
>>>>        
>>>>
>>>I am interested in seeing this thread play out as well since I have 26
>>>SUN
>>>servers I am beginning to cluster.  My question is why use SNMP over
>>>IPMI
>>>v2.0.  I can do the above three things with:
>>>
>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power off
>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power on
>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power
>>>status
>>>
>>>I don't need any MIB's for this either.  It seems to me this might be an
>>>easier solution than snmp, but I may be missing something.
>>>
>>>
>>>      
>>>
>>Oh make sure you are using lanplus mode for this.
>>
>>    
>>
>
>Will do, and thanks.
>  
>
That is a nice solution. There is a fence_ipmilan agent in the red hat 
cluster distibution...how are you invoking the above for fencing?  To 
check if the rh agent works, here is the command line you would use (it 
installs into /sbin...):

/sbin/fence_ipmilan -a <ilom IP> -l <user> -p <password> -P -o 
[off,on,reboot,status]

There is a man page for fence_ipmilan that details some extra params.

Well, I guess that solves the issue...if anyone would use an snmp-based 
ILOM agent, we could talk about how to construct that...otherwise, so 
much for my idea of this thread being instructions for creating 
arbitrary agents! ;)

-J



From brad at bradandkim.net  Fri Aug 10 19:21:31 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Fri, 10 Aug 2007 14:21:31 -0500 (CDT)
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <46BCB57E.7000809@redhat.com>
References: <B3C24EA955FF0C4EA14658997CD3E25E13CFC9AF@CAHIER.gst.uqam.ca>
	<1186689692.3002.12.camel@localhost.localdomain>
	<46BC8070.3070400@Sun.Com> <46BC8EA1.90609@redhat.com>
	<58515.129.237.174.144.1186763612.squirrel@webmail.bradandkim.net>
	<46BC948A.3010907@Sun.Com>
	<43480.129.237.174.144.1186767491.squirrel@webmail.bradandkim.net>
	<46BCB57E.7000809@redhat.com>
Message-ID: <56032.129.237.174.144.1186773691.squirrel@webmail.bradandkim.net>


> brad at bradandkim.net wrote:
>
>>>brad at bradandkim.net wrote:
>>>
>>>
>>>>>Quentin Arce wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>>My machines are SUN Fire X4100. I see that we can define a fence
>>>>>>>>device of type HP ILO. I would like to know if I can use the HP ILO
>>>>>>>>form in system-config-cluster tool to enter and use a SUN ILOM as
>>>>>>>>fence device?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to
>>>>>>>ILOM?
>>>>>>>If so, there might be a way...by hacking on another agent.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>So, I'm a lurker on this list as I no longer have a cluster up... but
>>>>>>I work on ILOM and I would love to see this work.  This isn't
>>>>>> official
>>>>>>support, I'm a developer not a customer support person.  So, it's
>>>>>> more
>>>>>>on my time.  If there is anything I can do... Please let me know.
>>>>>>Questions on this problem, regarding what ILOM can / can't do, how to
>>>>>>check state of the server via ILOM, etc.
>>>>>>
>>>>>>
>>>>>>
>>>>>Quentin! That is very kind of you. If you help with the ILOM protocol,
>>>>>I'll help with the agent/script. This thread could form a document on
>>>>>how to write an arbitrary fence agent for use with rhcs.
>>>>>
>>>>>Where is documentation available? Generally, three things are needed
>>>>>from a baseboard management device in order to use it for fencing: 1)
>>>>> A
>>>>>way to shut the system down, 2) a way to power the system up, and 3) a
>>>>>way to check if it is up or down.
>>>>>
>>>>>What means can a script use to communicate with the ILOM card? Are
>>>>>there
>>>>>big delta's in the protocol between different ILOM versions?
>>>>>
>>>>>I look forward to hearing from you.
>>>>>
>>>>>-J
>>>>>
>>>>>
>>>>>
>>>>I am interested in seeing this thread play out as well since I have 26
>>>>SUN
>>>>servers I am beginning to cluster.  My question is why use SNMP over
>>>>IPMI
>>>>v2.0.  I can do the above three things with:
>>>>
>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power
>>>> off
>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power on
>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power
>>>>status
>>>>
>>>>I don't need any MIB's for this either.  It seems to me this might be
>>>> an
>>>>easier solution than snmp, but I may be missing something.
>>>>
>>>>
>>>>
>>>>
>>>Oh make sure you are using lanplus mode for this.
>>>
>>>
>>>
>>
>>Will do, and thanks.
>>
>>
> That is a nice solution. There is a fence_ipmilan agent in the red hat
> cluster distibution...how are you invoking the above for fencing?  To
> check if the rh agent works, here is the command line you would use (it
> installs into /sbin...):
>
> /sbin/fence_ipmilan -a <ilom IP> -l <user> -p <password> -P -o
> [off,on,reboot,status]
>
> There is a man page for fence_ipmilan that details some extra params.
>
> Well, I guess that solves the issue...if anyone would use an snmp-based
> ILOM agent, we could talk about how to construct that...otherwise, so
> much for my idea of this thread being instructions for creating
> arbitrary agents! ;)
>
> -J
>

I just tested it and it seems to work perfectly.  Sorry for bringing the
thread to a premature end  :)

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From adel at opennet.ae  Fri Aug 10 20:14:22 2007
From: adel at opennet.ae (Adel Ben Zarrouk)
Date: Sat, 11 Aug 2007 00:14:22 +0400
Subject: [Linux-cluster] Oracle E-Business Suite and GFS
In-Reply-To: <46BCAD01.2050805@lexum.umontreal.ca>
References: <46BCA1CB.2010802@lexum.umontreal.ca>
	<46BCAD01.2050805@lexum.umontreal.ca>
Message-ID: <200708110014.22709.adel@opennet.ae>

Hi,

One of our customer planning to setup Oracle EBuiness Suite and we are 
thinking to propose GFS 6.1 instead of OCFS2.

My questions here:

-Oracle EBS certified with GFS6.1 and RHEL
-If there is any customer has done this before
-Any benchmark available or comparison between GFS 6.1 and OCFS2.

Looking forwar for your feedback

Regards
 
 --Adel



From doseyg at r-networks.net  Sat Aug 11 02:11:45 2007
From: doseyg at r-networks.net (Glen Dosey)
Date: Fri, 10 Aug 2007 22:11:45 -0400
Subject: [Linux-cluster] add journal to GFS
In-Reply-To: <46BCA2DA.5050606@lexum.umontreal.ca>
References: <46BB8F4E.7070203@lexum.umontreal.ca>
	<20070810085926.1a4a13d1@mathieu.toulouse>
	<46BCA2DA.5050606@lexum.umontreal.ca>
Message-ID: <1186798305.8784.7.camel@eclipse.office.r-networks.net>

I realize it's not normally a big deal, but if you don't want to have
too many luns and pvs floating around you should be able to add another
LUN of say 180GB, followed by a pvcreate, vgextend, pvmove, vgreduce,
lvextend and gfs_jadd. You'll have 30GB of unallocated disk space
available for the journal as well as 150GB for the gfs filesystem
previously created on the prior (now removed) LUN.


On Fri, 2007-08-10 at 13:39 -0400, FM wrote:
> TX I will try that ... with another LUN :)
> 
> 
> Mathieu Avila wrote:
> > Le Thu, 09 Aug 2007 18:03:58 -0400,
> > FM <dist-list at LEXUM.UMontreal.CA> a ?crit :
> > 
> >> Hello,
> >> I have to add more journals to add a new nodes.
> >> SO What I did :
> >> create a new LUN
> >> add it to lvm usign :
> >> lvcreate
> >> vgextend
> >> lvextend
> >>
> >> after that I use gfs_grow
> >> now the GFS is 150 GB bigger  BUT
> >> gfs_jadd say that did not have space to add journals
> >>
> >>
> > 
> > You have used the new space for normal data or meta-data, so that it
> > isn't available anymore. "gfs_jadd" doesn't use the free space of a
> > file system, it needs to be fed up with new space, just like gfs_grow
> > works.
> > As your file system cannot be shrinked, the only solution i see is to
> > add space one more time (much less, 150G is very much unless you want
> > to 100+ nodes), and run gfs_jadd again.
> > 
> > Cluster team, please correct me if i'm wrong.
> > 
> > --
> > Mathieu
> > 
> > 
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From bernard.chew at muvee.com  Sat Aug 11 07:59:15 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Sat, 11 Aug 2007 15:59:15 +0800
Subject: [Linux-cluster] Using cmirror
In-Reply-To: <46BC8E4B.1070803@cloud9.co.uk>
References: <1186746964.16863.8.camel@ws-berd.sg.muvee.net>	<8F3D9B0A-3ABE-4636-A27F-B1120DBC85F2@redhat.com><229C73600EB0E54DA818AB599482BCE951EB01@shadowfax.sg.muvee.net>
	<46BC8E4B.1070803@cloud9.co.uk>
Message-ID: <229C73600EB0E54DA818AB599482BCE9019C6984@shadowfax.sg.muvee.net>

Hi James and Brassow,

Thank you for the replies on using cmirror. I'll try the configuration
(below) on my test servers, and post the findings here.

Regards,
Bernard Chew

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Fidell
Sent: Saturday, August 11, 2007 12:12 AM
To: linux clustering
Subject: Re: [Linux-cluster] Using cmirror

Bernard Chew wrote:
> Hi,
> 
> Can this work with GFS where I have 2 iscsi disks (ie. /dev/sda &
/dev/sdb) from 2 different iscsi-target servers and I create a mirrored
GFS volume?

I had no problems creating such a setup.

Where I did have problems was when one of the iscsi disks "went away".
At that point the iscsi layer appeared to hang and lvm locked up :(

(You need three separate PVs to create a mirrored LVM volume, btw).

James

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From mbrookov at mines.edu  Sun Aug 12 16:55:32 2007
From: mbrookov at mines.edu (Matthew B. Brookover)
Date: Sun, 12 Aug 2007 10:55:32 -0600
Subject: [Linux-cluster] flock behavior different between GFS and EXT3
Message-ID: <1186937732.4915.49.camel@mickey.mattbrookover.com>

I am attempting to move a program using an EXT3 file system to a GFS
file system.  The program uses flock to serialize access between
processes.  On an EXT3 file system I can get an exclusive lock on a
file, make some change to the file, then get a shared lock without
loosing the lock.  On GFS when the program tries to demote from the
exclusive lock to a shared lock, the lock is freed allowing another
process to step in and take the lock.

Is there a way to get flock on GFS to behave the way it does on the EXT3
file system?

I have attached sample C source code and here are instructions to
demonstrate this issue.

My cluster is running GFS 6.1, RHEL 4 update 5 with all of the patches.

Compile both programs: 

[mbrookov at imagine locktest]$ cc -o flock_EX_SH flock_EX_SH.c 
[mbrookov at imagine locktest]$ cc -o flockwritelock flockwritelock.c
[mbrookov at imagine locktest]$ 

EXT3 test:

Start up xterm twice and cd to the directory where you compiled the 2
programs.  On my system, /tmp is an EXT3 file system.

In the first xterm, run 'flock_EX_SH /tmp/bar'  and hit return.  In the
second xterm, run 'flockwritelock /tmp/bar' and hit return.  The
flockwritelock process will block waiting for an exclusive lock on the
file /tmp/bar.

One the first xterm, hit return, the flock_EX_SH process will attempt to
demote the exclusive lock to a shared lock and display a prompt.  The
flockwritelock process on the second xterm will stay blocked.

In the first xterm, hit return again, the flock_EX_SH process will free
the lock, close the file and exit.  The flockwritelock process will then
receive the exclusive lock on /tmp/bar and display a prompt.  Hit return
in the second xterm to get flockwritelock to close and exit.

Output on first xterm: 

[mbrookov at imagine locktest]$ ./flock_EX_SH /tmp/bar
Have exclusive lock, hit return to free write lock on /tmp/bar and exit

Attempt to demote lock on /tmp/bar to shared lock
Have shared lock, hit return to free lock on /tmp/bar and exit

[mbrookov at imagine locktest]$ 

Output on second xterm: 

[mbrookov at imagine locktest]$ ./flockwritelock /tmp/bar
Have write lock, hit return to free write lock on /tmp/bar and exit

[mbrookov at imagine locktest]$




GFS test:

Start up xterm twice and cd to the directory where you compiled the 2
programs.  On my system, the locktest directory is on a GFS file system.

In the first xterm, run 'flock_EX_SH bar'  and hit return.  In the
second xterm, run 'flockwritelock bar' and hit return.  The
flockwritelock process will block waiting for an exclusive lock on the
file bar.

On the first xterm, hit return, the flock_EX_SH process will attempt to
demote the exclusive lock on bar to a shared lock but will fail because
the system call to flock frees the lock allowing the flockwritelock
process to get an exclusive lock.  The flock_EX_SH process will exit.

Hit return on the second xterm, flockwritelock will close bar and exit.

Output on first xterm: 

[mbrookov at imagine locktest]$ ./flock_EX_SH bar
Have exclusive lock, hit return to free write lock on bar and exit

Attempt to demote lock on bar to shared lock
Could not demote to shared lock on file bar, Resource temporarily unavailable
[mbrookov at imagine locktest]$ 

Output on second xterm: 

[mbrookov at imagine locktest]$ ./flockwritelock bar
Have write lock, hit return to free write lock on bar and exit

[mbrookov at imagine locktest]$ 

The results for flock on GFS are the same if you run the two programs on
the same node or on 2 different nodes.  The locks (shared, exclusive,
blocking, non blocking) also work correctly on both file systems.  The
problem is the case where GFS will free the exclusive lock and return an
error instead of demote the exclusive lock to a shared lock.

The program depends on the EXT3 flock behavior -- the exclusive lock can
be demoted to a shared lock without the possibility that another process
that is blocked waiting for an exclusive lock receiving the lock.

Thank you

Matt
mbrookov at mines.edu


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070812/f7b17921/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: flock_EX_SH.c
Type: text/x-csrc
Size: 1291 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070812/f7b17921/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: flockwritelock.c
Type: text/x-csrc
Size: 1073 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070812/f7b17921/attachment-0001.bin>

From bsheets at singlefin.net  Sun Aug 12 23:23:43 2007
From: bsheets at singlefin.net (Brian Sheets)
Date: Sun, 12 Aug 2007 23:23:43 +0000 (UTC)
Subject: [Linux-cluster] fence_apc 7930s
Message-ID: <1891144504.370791186961023635.JavaMail.root@v-mailhost2.mxpath.net>

Did anyone get this working and allow system names for the port tags?
Mine are all labled as [empty] I've not tried switching it back to the default tag.

Thanks

Brian

>Ok... I screwed up.  I figured it would be my mistake in the end.
>
>I renamed the outlet to the hostname of the system a while ago.  I change
>the name back to the default "Outlet #" and fence_apc worked the first try.
>I always use the port number and not the name, but maybe I broke something
>by changing the port name on the apc devices.

>I have a 2 node cluster now with fencing across two apc 7930s using
>redundant power supplies.

>Thanks everyone for the help,

>Eric m



From Alain.Moulle at bull.net  Mon Aug 13 06:23:00 2007
From: Alain.Moulle at bull.net (Alain Moulle)
Date: Mon, 13 Aug 2007 08:23:00 +0200
Subject: [Linux-cluster] CS4 U5 / RHEL4 U4 ?
Message-ID: <46BFF8C4.90409@bull.net>

Hi

Is there any incompatibility to re-build the CS4 U5 for RHEL4 U4 ?

(Just for the benefit of all patches)

Thanks
Alain





From janne.peltonen at helsinki.fi  Mon Aug 13 09:07:04 2007
From: janne.peltonen at helsinki.fi (Janne Peltonen)
Date: Mon, 13 Aug 2007 12:07:04 +0300
Subject: [Linux-cluster] Load peaks - caused by the cluster?
Message-ID: <20070813090703.GR17564@helsinki.fi>

Hi!

Remember the fs.sh status checks mayhem I reported a while ago? Now,
there was the ghost-like load flux, but the system getting stuck wasn't
(only) because of the excess number of execs - it was, plain and simple,
memory starvation. *sigh*

Anyway, now that I (or, to be exact, my servers) have enough memory, I
noticed that the problem with the inexplicable load flux hasn't gone
anywhere. With a more-or-less regular 11-hour interval, there is a
four-hour long peak in the load, shaped like an elf's pointy hat. (In an
otherwise idle system, the height of the peak is abt 6.0. If there is
load caused by something "real", the peak is on top of the other load -
it looks as if it just linearly adds up.) I'm seriously beginning to
consider the possibility that there are elfs in my kernel, since I can't
see the peaks anywhere else than the loads: CPU usage, number of
processes, IP/TCP/UDP traffic, IO load, paging activity - nothing
reflects the load peaks. I had a look at the process accounting
statistics during a peak and during no peak, but couldn't see any
difference.

One suggestion my colleague had was that the peaks might be caused by
the cluster somehow changing the 'lead' - somewhere inside the kernel,
in such a low level that it can't be noticed elsewhere than in the load.
That was because there is a difference of phase in the peaks. It didn't
sound very credible to me, but I'll ask anyway: could there be something
like that going on?

On the other hand, on the one node in the cluster that doesn't have
rgmanager running (it's in the cluster so that there wouldn't be an even
number of nodes), I'm not seeing these elfs. And I have an another
cluster that had the elf-hats before I added an exit 0 into their fs.sh
scripts. But they don't have the elf-hats anymore. The difference
between these two clusters is that the cluster with elfs has a lot more
active cluster services than the one without. That is, the cluster with
elfs has a lot more, say, ip.sh execs than the one without. I wonder if
these, when over a certain limit, could have an effect on the load
similar to the excess fs.fh execs had?

Next, I think I'm going to put an exit 0 to the status checks of ip.sh
(and see if the elfs go away). Then I'm going to start wondering if the
cluster'd notice our server room falling apart... ;)

Any suggestions? At this point, I'm not any more even certain whether
the problem lies within the cluster. On the other hand, since I see no
difference at the process level during peak and no-peak time, the
difference must (as far as I understand) be inside kernel. So it can't
be my application. So it must be the cluster, mustn't it?

Thanks.


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi>



From cluster at defuturo.co.uk  Mon Aug 13 09:53:16 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Mon, 13 Aug 2007 10:53:16 +0100
Subject: [Linux-cluster] Assertion failed in do_flock (bz198302)
Message-ID: <1186998796.2650.6.camel@rutabaga.defuturo.co.uk>

  I've been seeing the same assertions as in bz198302, so I've tried out
the debug patch there and it looks like they are being triggered by an
EAGAIN from flock_lock_file_wait. Is this an expected return code?

	Robert



From Arne.Brieseneck at vodafone.com  Mon Aug 13 12:01:39 2007
From: Arne.Brieseneck at vodafone.com (Brieseneck, Arne, VF-Group)
Date: Mon, 13 Aug 2007 14:01:39 +0200
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <56032.129.237.174.144.1186773691.squirrel@webmail.bradandkim.net>
Message-ID: <E67F1468BF7A4C418D874810215A377EC2B231@EITO-MBX01.internal.vodafone.com>

Hi Brad,

If it works perfect I'd like to use your configuration for my own SUN
X4100 systems. Can you please send your configuration files?

Thanks a lot 
Arne

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of
brad at bradandkim.net
Sent: Freitag, 10. August 2007 21:22
To: linux clustering
Subject: Re: [Linux-cluster] Add a fence device of type SUN ILOM


> brad at bradandkim.net wrote:
>
>>>brad at bradandkim.net wrote:
>>>
>>>
>>>>>Quentin Arce wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>>My machines are SUN Fire X4100. I see that we can define a fence

>>>>>>>>device of type HP ILO. I would like to know if I can use the HP 
>>>>>>>>ILO form in system-config-cluster tool to enter and use a SUN 
>>>>>>>>ILOM as fence device?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to 
>>>>>>>ILOM?
>>>>>>>If so, there might be a way...by hacking on another agent.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>So, I'm a lurker on this list as I no longer have a cluster up... 
>>>>>>but I work on ILOM and I would love to see this work.  This isn't

>>>>>>official support, I'm a developer not a customer support person.  
>>>>>>So, it's  more on my time.  If there is anything I can do... 
>>>>>>Please let me know.
>>>>>>Questions on this problem, regarding what ILOM can / can't do, how

>>>>>>to check state of the server via ILOM, etc.
>>>>>>
>>>>>>
>>>>>>
>>>>>Quentin! That is very kind of you. If you help with the ILOM 
>>>>>protocol, I'll help with the agent/script. This thread could form a

>>>>>document on how to write an arbitrary fence agent for use with
rhcs.
>>>>>
>>>>>Where is documentation available? Generally, three things are 
>>>>>needed from a baseboard management device in order to use it for 
>>>>>fencing: 1)  A way to shut the system down, 2) a way to power the 
>>>>>system up, and 3) a way to check if it is up or down.
>>>>>
>>>>>What means can a script use to communicate with the ILOM card? Are 
>>>>>there big delta's in the protocol between different ILOM versions?
>>>>>
>>>>>I look forward to hearing from you.
>>>>>
>>>>>-J
>>>>>
>>>>>
>>>>>
>>>>I am interested in seeing this thread play out as well since I have 
>>>>26 SUN servers I am beginning to cluster.  My question is why use 
>>>>SNMP over IPMI v2.0.  I can do the above three things with:
>>>>
>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power

>>>>off /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis 
>>>>power on /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> 
>>>>chassis power status
>>>>
>>>>I don't need any MIB's for this either.  It seems to me this might 
>>>>be  an easier solution than snmp, but I may be missing something.
>>>>
>>>>
>>>>
>>>>
>>>Oh make sure you are using lanplus mode for this.
>>>
>>>
>>>
>>
>>Will do, and thanks.
>>
>>
> That is a nice solution. There is a fence_ipmilan agent in the red hat

> cluster distibution...how are you invoking the above for fencing?  To 
> check if the rh agent works, here is the command line you would use 
> (it installs into /sbin...):
>
> /sbin/fence_ipmilan -a <ilom IP> -l <user> -p <password> -P -o 
> [off,on,reboot,status]
>
> There is a man page for fence_ipmilan that details some extra params.
>
> Well, I guess that solves the issue...if anyone would use an 
> snmp-based ILOM agent, we could talk about how to construct 
> that...otherwise, so much for my idea of this thread being 
> instructions for creating arbitrary agents! ;)
>
> -J
>

I just tested it and it seems to work perfectly.  Sorry for bringing the
thread to a premature end  :)

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From lanthier.stephanie at uqam.ca  Mon Aug 13 14:02:26 2007
From: lanthier.stephanie at uqam.ca (=?iso-8859-1?Q?Lanthier=2C_St=E9phanie?=)
Date: Mon, 13 Aug 2007 10:02:26 -0400
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <20070810151307.A882973687@hormel.redhat.com>
Message-ID: <B3C24EA955FF0C4EA14658997CD3E25E13E94F9F@CAHIER.gst.uqam.ca>


Message: 6
Date: Fri, 10 Aug 2007 09:32:51 +0200
From: Maciej Bogucki <maciej.bogucki at artegence.com>
Subject: Re: [Linux-cluster] Add a fence device of type SUN ILOM
To: linux clustering <linux-cluster at redhat.com>
Message-ID: <46BC14A3.70303 at artegence.com>
Content-Type: text/plain; charset=UTF-8

> As the GFS file system is not accessible when I'm rebooting one of the 
> three nodes, I'm realizing the importance of fence devices.
It is strange to me, because if You have rc scripts and quorum properly configured, and when You perform reboot of one node Your GFS filesystem should be accesible all time.

Best Regards
Maciej Bogucki

Dear Maciej,

To answer your question, I read this about fence behavior on http://www.centos.org/docs/4/4.5/SAC_Cluster_Suite_Overview/s2-fencing-overview-CSO.html

"When the cluster manager determines that a node has failed, it communicates to other cluster-infrastructure components that the node has failed. The fencing program (either fenced or GULM), when notified of the failure, fences the failed node. Other cluster-infrastructure components determine what actions to take - that is, they perform any recovery that needs to done. For example, DLM and GFS (in a cluster configured with CMAN/DLM), when notified of a node failure, suspend activity until they detect that the fencing program has completed fencing the failed node. Upon confirmation that the failed node is fenced, DLM and GFS perform recovery. DLM releases locks of the failed node; GFS recovers the journal of the failed node."

As I had no fence running, I understood that DLM and GFS were in suspend, waiting to know about the fence completion.

Best regards

__________________

Stephanie Lanthier

Analyste de l'informatique
Universite du Quebec a Montreal
Service de l'informatique et des telecommunications
lanthier.stephanie at uqam.ca
Telephone : 514-987-3000 poste 6106
Bureau : PK-M535

 






From jos at xos.nl  Mon Aug 13 15:50:42 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 13 Aug 2007 17:50:42 +0200
Subject: [Linux-cluster] IPv6 cluster addresses are "tentative" (for two
	seconds)
Message-ID: <200708131550.l7DFogb07702@xos037.xos.nl>

Hi,

When using IPv6 addresses in the cluster configuration, I see that
these are labaled "tentative" ("ip addr list" output) in the first
two seconds when the service script runs.

This appears to prohibit programs from binding to these addresses,
so I need to add a sleep (or something more sophisticated, like a
loop that looks when this address is not "tentative" anymore) in
my cluster service script: then it seems to work fine.

Is this the only solution or are there more sophisticated (and better)
solutions possible?

Does the same delay (w.r.t. availability) also apply to the normal
IPv6 network config scripts ("ifup-ipv6") or is this problem specific
to the cluster suite (if yes, should the cluster suite be adapted)?

B.t.w., this is on RHEL 5.0.

Thanks,

--
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From pbruna at it-linux.cl  Mon Aug 13 15:57:00 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Mon, 13 Aug 2007 11:57:00 -0400 (CLT)
Subject: [Linux-cluster] GFS Problem
Message-ID: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>

Hi, 
I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1) filesystems. 
Im having a problem when restarting one of the nodes, the other node can not longer access the GFS partitions, so i must reboot both. 
Althoug, yesterday i resize a GFS partition, lvmextend and then gfs_grow, and the other node gaves I/O error and dismounts all the GFS partitions. 

Do you have any ideas why this is happening? 

PD: Im attaching the cluster.conf. 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/25e20e72/attachment.htm>

From lanthier.stephanie at uqam.ca  Mon Aug 13 15:43:01 2007
From: lanthier.stephanie at uqam.ca (=?iso-8859-1?Q?Lanthier=2C_St=E9phanie?=)
Date: Mon, 13 Aug 2007 11:43:01 -0400
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <20070811021159.95BD473107@hormel.redhat.com>
Message-ID: <B3C24EA955FF0C4EA14658997CD3E25E13E951C2@CAHIER.gst.uqam.ca>


Dear list members, 

If I understand well, here are the steps I have to follow to configure a fence device that will use SUN ILOM interface :

1. Ensure that OpenIPMI and OpenIPMI-tools packages are installed on the cluster nodes.
2. With system-config-cluster tool, add a new fence device of type "IPMI Lan". Fill the form with the ILOM IP address, the name of the administrator user and his password. Then associate the fence device with the cluster node.
3. Repeat the step above for the SUN ILOM interface of each node.
4. Send the new configuration to the cluster
5. That's all and everything will be handle correctly by the cluster.

Am I ok? 

Best regards
__________________

Stephanie Lanthier

Analyste de l'informatique
Universite du Quebec a Montreal
Service de l'informatique et des telecommunications
lanthier.stephanie at uqam.ca
T?l?phone : 514-987-3000 poste 6106
Bureau : PK-M535

 




------------------------------

Message: 13
Date: Fri, 10 Aug 2007 14:21:31 -0500 (CDT)
From: brad at bradandkim.net
Subject: Re: [Linux-cluster] Add a fence device of type SUN ILOM
To: "linux clustering" <linux-cluster at redhat.com>
Message-ID:
	<56032.129.237.174.144.1186773691.squirrel at webmail.bradandkim.net>
Content-Type: text/plain;charset=iso-8859-1


> brad at bradandkim.net wrote:
>
>>>brad at bradandkim.net wrote:
>>>
>>>
>>>>>Quentin Arce wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>>My machines are SUN Fire X4100. I see that we can define a fence
>>>>>>>>device of type HP ILO. I would like to know if I can use the HP ILO
>>>>>>>>form in system-config-cluster tool to enter and use a SUN ILOM as
>>>>>>>>fence device?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to
>>>>>>>ILOM?
>>>>>>>If so, there might be a way...by hacking on another agent.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>So, I'm a lurker on this list as I no longer have a cluster up... but
>>>>>>I work on ILOM and I would love to see this work.  This isn't
>>>>>> official
>>>>>>support, I'm a developer not a customer support person.  So, it's
>>>>>> more
>>>>>>on my time.  If there is anything I can do... Please let me know.
>>>>>>Questions on this problem, regarding what ILOM can / can't do, how to
>>>>>>check state of the server via ILOM, etc.
>>>>>>
>>>>>>
>>>>>>
>>>>>Quentin! That is very kind of you. If you help with the ILOM protocol,
>>>>>I'll help with the agent/script. This thread could form a document on
>>>>>how to write an arbitrary fence agent for use with rhcs.
>>>>>
>>>>>Where is documentation available? Generally, three things are needed
>>>>>from a baseboard management device in order to use it for fencing: 1)
>>>>> A
>>>>>way to shut the system down, 2) a way to power the system up, and 3) a
>>>>>way to check if it is up or down.
>>>>>
>>>>>What means can a script use to communicate with the ILOM card? Are
>>>>>there
>>>>>big delta's in the protocol between different ILOM versions?
>>>>>
>>>>>I look forward to hearing from you.
>>>>>
>>>>>-J
>>>>>
>>>>>
>>>>>
>>>>I am interested in seeing this thread play out as well since I have 26
>>>>SUN
>>>>servers I am beginning to cluster.  My question is why use SNMP over
>>>>IPMI
>>>>v2.0.  I can do the above three things with:
>>>>
>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power
>>>> off
>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power on
>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power
>>>>status
>>>>
>>>>I don't need any MIB's for this either.  It seems to me this might be
>>>> an
>>>>easier solution than snmp, but I may be missing something.
>>>>
>>>>
>>>>
>>>>
>>>Oh make sure you are using lanplus mode for this.
>>>
>>>
>>>
>>
>>Will do, and thanks.
>>
>>
> That is a nice solution. There is a fence_ipmilan agent in the red hat
> cluster distibution...how are you invoking the above for fencing?  To
> check if the rh agent works, here is the command line you would use (it
> installs into /sbin...):
>
> /sbin/fence_ipmilan -a <ilom IP> -l <user> -p <password> -P -o
> [off,on,reboot,status]
>
> There is a man page for fence_ipmilan that details some extra params.
>
> Well, I guess that solves the issue...if anyone would use an snmp-based
> ILOM agent, we could talk about how to construct that...otherwise, so
> much for my idea of this thread being instructions for creating
> arbitrary agents! ;)
>
> -J
>

I just tested it and it seems to work perfectly.  Sorry for bringing the
thread to a premature end  :)

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net




From pbruna at it-linux.cl  Mon Aug 13 16:03:05 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Mon, 13 Aug 2007 12:03:05 -0400 (CLT)
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>
Message-ID: <22952137.35051187020985554.JavaMail.root@lisa.itlinux.cl>

Sory, here goes cluster.conf 
----- Mensaje Original ----- 
De: "Patricio A. Bruna" <pbruna at it-linux.cl> 
Para: linux-cluster at redhat.com 
Enviados: lunes 13 de agosto de 2007 11H57 (GMT-0400) America/Santiago 
Asunto: [Linux-cluster] GFS Problem 


Hi, 
I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1) filesystems. 
Im having a problem when restarting one of the nodes, the other node can not longer access the GFS partitions, so i must reboot both. 
Althoug, yesterday i resize a GFS partition, lvmextend and then gfs_grow, and the other node gaves I/O error and dismounts all the GFS partitions. 

Do you have any ideas why this is happening? 

PD: Im attaching the cluster.conf. 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/afc1b5ff/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/octet-stream
Size: 1290 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/afc1b5ff/attachment.obj>

From jos at xos.nl  Mon Aug 13 16:06:20 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 13 Aug 2007 18:06:20 +0200
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>;
	from pbruna@it-linux.cl on Mon, Aug 13, 2007 at 11:57:00AM -0400
References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>
Message-ID: <20070813180620.A7707@xos037.xos.nl>

On Mon, Aug 13, 2007 at 11:57:00AM -0400, Patricio A. Bruna wrote:

> I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1) filesystems. 
> Im having a problem when restarting one of the nodes, the other node can not longer access the GFS partitions, so i must reboot both. 
> Althoug, yesterday i resize a GFS partition, lvmextend and then gfs_grow, and the other node gaves I/O error and dismounts all the GFS partitions. 
> 
> Do you have any ideas why this is happening? 

This sounds like a problem at SCSI-level.  Do you see SCSI errors in
/var/log/messages?  What kind of shared storage access are you using?

> PD: Im attaching the cluster.conf. 

Not found.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From brad at bradandkim.net  Mon Aug 13 16:12:50 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Mon, 13 Aug 2007 11:12:50 -0500 (CDT)
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <E67F1468BF7A4C418D874810215A377EC2B231@EITO-MBX01.internal.vodafone.c
	om>
References: <E67F1468BF7A4C418D874810215A377EC2B231@EITO-MBX01.internal.vodafone.com>
Message-ID: <53101.129.237.174.144.1187021570.squirrel@webmail.bradandkim.net>


> Hi Brad,
>
> If it works perfect I'd like to use your configuration for my own SUN
> X4100 systems. Can you please send your configuration files?
>
> Thanks a lot
> Arne
>

I don't have the cluster completely configured yet.  I was simply testing
the command line script '/sbin/fence_ipmilan -a <ilom IP> -l <user> -p
<password> -P -o [off,on,reboot,status]' mentioned by James.  I will be
working on generating a more complete cluster config throughout this week.
 Basically I just have 3 nodes, each fenced with a fence_ipmilan agent,
and one gfs filesystem defined.

I have noticed that if I use system-config-cluster to define the
fence_ipmilan agents it sets an attribute

lanplus=""

which I was told to use by Quentin.  However, the next time I run
system-config-cluster it cannot parse the cluster.conf file and errors
out.  I have removed the lanplus portion for now and it seems to be ok,
but I am going to look into that more.

Good luck!

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From jos at xos.nl  Mon Aug 13 16:20:44 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 13 Aug 2007 18:20:44 +0200
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <22952137.35051187020985554.JavaMail.root@lisa.itlinux.cl>;
	from pbruna@it-linux.cl on Mon, Aug 13, 2007 at 12:03:05PM -0400
References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>
	<22952137.35051187020985554.JavaMail.root@lisa.itlinux.cl>
Message-ID: <20070813182044.B7707@xos037.xos.nl>

On Mon, Aug 13, 2007 at 12:03:05PM -0400, Patricio A. Bruna wrote:

> I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1) filesystems. 
> Im having a problem when restarting one of the nodes, the other node can not longer access the GFS partitions, so i must reboot both. 
> Althoug, yesterday i resize a GFS partition, lvmextend and then gfs_grow, and the other node gaves I/O error and dismounts all the GFS partitions. 

Hmm... several questions arise now:

-  Did you create the GFS filesystems with the correct locking
   protocol (lock_dlm)?

-  Do you use clvmd?  Did you mark your VG's to be "clustered" and
   do you have "locking_style = 3" in /etc/lvm/lvm.conf?

> PD: Im attaching the cluster.conf. 

Where are the GFS filesystems in cluster.conf?

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From brad at bradandkim.net  Mon Aug 13 16:25:23 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Mon, 13 Aug 2007 11:25:23 -0500 (CDT)
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <B3C24EA955FF0C4EA14658997CD3E25E13E951C2@CAHIER.gst.uqam.ca>
References: <B3C24EA955FF0C4EA14658997CD3E25E13E951C2@CAHIER.gst.uqam.ca>
Message-ID: <58270.129.237.174.144.1187022323.squirrel@webmail.bradandkim.net>


>
> Dear list members,
>
> If I understand well, here are the steps I have to follow to configure a
> fence device that will use SUN ILOM interface :
>
> 1. Ensure that OpenIPMI and OpenIPMI-tools packages are installed on the
> cluster nodes.
> 2. With system-config-cluster tool, add a new fence device of type "IPMI
> Lan". Fill the form with the ILOM IP address, the name of the
> administrator user and his password. Then associate the fence device with
> the cluster node.
> 3. Repeat the step above for the SUN ILOM interface of each node.
> 4. Send the new configuration to the cluster
> 5. That's all and everything will be handle correctly by the cluster.
>
> Am I ok?
>
> Best regards
> __________________
>
> Stephanie Lanthier
>
> Analyste de l'informatique
> Universite du Quebec a Montreal
> Service de l'informatique et des telecommunications
> lanthier.stephanie at uqam.ca
> T?l?phone : 514-987-3000 poste 6106
> Bureau : PK-M535
>
>
>
>
>
>
> ------------------------------
>
> Message: 13
> Date: Fri, 10 Aug 2007 14:21:31 -0500 (CDT)
> From: brad at bradandkim.net
> Subject: Re: [Linux-cluster] Add a fence device of type SUN ILOM
> To: "linux clustering" <linux-cluster at redhat.com>
> Message-ID:
> 	<56032.129.237.174.144.1186773691.squirrel at webmail.bradandkim.net>
> Content-Type: text/plain;charset=iso-8859-1
>
>
>> brad at bradandkim.net wrote:
>>
>>>>brad at bradandkim.net wrote:
>>>>
>>>>
>>>>>>Quentin Arce wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>>>My machines are SUN Fire X4100. I see that we can define a fence
>>>>>>>>>device of type HP ILO. I would like to know if I can use the HP
>>>>>>>>> ILO
>>>>>>>>>form in system-config-cluster tool to enter and use a SUN ILOM as
>>>>>>>>>fence device?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to
>>>>>>>>ILOM?
>>>>>>>>If so, there might be a way...by hacking on another agent.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>So, I'm a lurker on this list as I no longer have a cluster up...
>>>>>>> but
>>>>>>>I work on ILOM and I would love to see this work.  This isn't
>>>>>>> official
>>>>>>>support, I'm a developer not a customer support person.  So, it's
>>>>>>> more
>>>>>>>on my time.  If there is anything I can do... Please let me know.
>>>>>>>Questions on this problem, regarding what ILOM can / can't do, how
>>>>>>> to
>>>>>>>check state of the server via ILOM, etc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>Quentin! That is very kind of you. If you help with the ILOM
>>>>>> protocol,
>>>>>>I'll help with the agent/script. This thread could form a document on
>>>>>>how to write an arbitrary fence agent for use with rhcs.
>>>>>>
>>>>>>Where is documentation available? Generally, three things are needed
>>>>>>from a baseboard management device in order to use it for fencing: 1)
>>>>>> A
>>>>>>way to shut the system down, 2) a way to power the system up, and 3)
>>>>>> a
>>>>>>way to check if it is up or down.
>>>>>>
>>>>>>What means can a script use to communicate with the ILOM card? Are
>>>>>>there
>>>>>>big delta's in the protocol between different ILOM versions?
>>>>>>
>>>>>>I look forward to hearing from you.
>>>>>>
>>>>>>-J
>>>>>>
>>>>>>
>>>>>>
>>>>>I am interested in seeing this thread play out as well since I have 26
>>>>>SUN
>>>>>servers I am beginning to cluster.  My question is why use SNMP over
>>>>>IPMI
>>>>>v2.0.  I can do the above three things with:
>>>>>
>>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power
>>>>> off
>>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power
>>>>> on
>>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis power
>>>>>status
>>>>>
>>>>>I don't need any MIB's for this either.  It seems to me this might be
>>>>> an
>>>>>easier solution than snmp, but I may be missing something.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>Oh make sure you are using lanplus mode for this.
>>>>
>>>>
>>>>
>>>
>>>Will do, and thanks.
>>>
>>>
>> That is a nice solution. There is a fence_ipmilan agent in the red hat
>> cluster distibution...how are you invoking the above for fencing?  To
>> check if the rh agent works, here is the command line you would use (it
>> installs into /sbin...):
>>
>> /sbin/fence_ipmilan -a <ilom IP> -l <user> -p <password> -P -o
>> [off,on,reboot,status]
>>
>> There is a man page for fence_ipmilan that details some extra params.
>>
>> Well, I guess that solves the issue...if anyone would use an snmp-based
>> ILOM agent, we could talk about how to construct that...otherwise, so
>> much for my idea of this thread being instructions for creating
>> arbitrary agents! ;)
>>
>> -J
>>
>
> I just tested it and it seems to work perfectly.  Sorry for bringing the
> thread to a premature end  :)
>
> Brad Crotchett
> brad at bradandkim.net
> http://www.bradandkim.net
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

I believe that will do it.  I just tested shutting down the network
interfaces on one node and fenced successfully used the ipmilan fencing
agent to reboot the node.

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From jparsons at redhat.com  Mon Aug 13 16:35:35 2007
From: jparsons at redhat.com (James Parsons)
Date: Mon, 13 Aug 2007 12:35:35 -0400
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <53101.129.237.174.144.1187021570.squirrel@webmail.bradandkim.net>
References: <E67F1468BF7A4C418D874810215A377EC2B231@EITO-MBX01.internal.vodafone.com>
	<53101.129.237.174.144.1187021570.squirrel@webmail.bradandkim.net>
Message-ID: <46C08857.1070907@redhat.com>

brad at bradandkim.net wrote:

>>Hi Brad,
>>
>>If it works perfect I'd like to use your configuration for my own SUN
>>X4100 systems. Can you please send your configuration files?
>>
>>Thanks a lot
>>Arne
>>
>>    
>>
>
>I don't have the cluster completely configured yet.  I was simply testing
>the command line script '/sbin/fence_ipmilan -a <ilom IP> -l <user> -p
><password> -P -o [off,on,reboot,status]' mentioned by James.  I will be
>working on generating a more complete cluster config throughout this week.
> Basically I just have 3 nodes, each fenced with a fence_ipmilan agent,
>and one gfs filesystem defined.
>
>I have noticed that if I use system-config-cluster to define the
>fence_ipmilan agents it sets an attribute
>
>lanplus=""
>
Hmmm...that is the same as saying 'lanplus="0"'...IOT, it unsets it.
If you get a parse error when it is in, then the relaxng schema file is 
not up to date...you can ignore the warning.

There should be a 'lanplus' checkbox added to cfg for ipmi fencing, if 
you wish to use it.

-J



From pbruna at it-linux.cl  Mon Aug 13 16:43:54 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Mon, 13 Aug 2007 12:43:54 -0400 (CLT)
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <20070813182044.B7707@xos037.xos.nl>
Message-ID: <20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl>

Jos,
Im answering one by one

> -  Did you create the GFS filesystems with the correct locking
>    protocol (lock_dlm)?
Yes i did, like the documentations says: 
gfs_mkfs -p lock_dlm -t cluster_eseia:portales -j 4 /dev/CLUSTERLVM/portales

I used 4 journal, cause we are hoping to add 2 more nodes soon.

> -  Do you use clvmd?  Did you mark your VG's to be "clustered" and
>    do you have "locking_style = 3" in /etc/lvm/lvm.conf?

No i did not. By the way i only have locking_type in the file, i suppose is what you meant. Righ now is:
locking_type = 1

> > PD: Im attaching the cluster.conf. 
> 
> Where are the GFS filesystems in cluster.conf?
They arent, because they are supposed to start right away the server boot, and the services i have: IP and Perlbal, do not use the GFS filesystems.


In my logs files i saw this:

##################################
newton fenced[2394]: fencing node "davinci"
newton fenced[2394]: fence "davinci" failed
newton fenced[2394]: fencing node "davinci"
newton fenced[2394]: fence "davinci" failed
####################################

and after runnig the gfs_grow command in davinci:

####################################
Aug 11 04:06:10 newton kernel: attempt to access beyond end of device
Aug 11 04:06:10 newton kernel: dm-4: rw=0, want=270678704, limit=251658240
Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: fatal: I/O error
Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0:   block = 33834837
Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0:   function = gfs_dreread
Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.c, line = 576
Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0:   time = 1186819570
Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: about to withdraw from the cluster
Aug 11 04:06:10 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: telling LM to withdraw
Aug 11 04:06:11 newton kernel: dlm: drop message 11 from 1 for unknown lockspace 655362
Aug 11 04:06:11 newton kernel: GFS: fsid=cluster_eseia:seiaprod-dig.0: withdrawn
Aug 11 04:06:11 newton kernel:  [<f8d6d8ba>] gfs_lm_withdraw+0x76/0x82 [gfs]
Aug 11 04:06:11 newton kernel:  [<f8d81786>] gfs_io_error_bh_i+0x2c/0x31 [gfs]
Aug 11 04:06:11 newton kernel:  [<f8d5babd>] gfs_dreread+0x9f/0xbf [gfs]
Aug 11 04:06:11 newton kernel:  [<f8d5bafd>] gfs_dread+0x20/0x36 [gfs]
Aug 11 04:06:11 newton kernel:  [<f8d5d32b>] get_leaf+0x17/0x88 [gfs]
Aug 11 04:06:11 newton kernel:  [<f8d5de86>] gfs_dir_read+0x13f/0x68f [gfs]
Aug 11 04:06:11 newton kernel:  [<c05fadc5>] wait_for_completion+0x18/0x8d
Aug 11 04:06:11 newton kernel:  [<c041d186>] complete+0x2b/0x3d
Aug 11 04:06:11 newton kernel:  [<f8d750f0>] gfs_readdir+0xea/0x29e [gfs]
Aug 11 04:06:11 newton kernel:  [<f8d73de0>] filldir_reg_func+0x0/0x13b [gfs]
Aug 11 04:06:11 newton kernel:  [<c0479fc4>] filldir64+0x0/0xc5
Aug 11 04:06:11 newton kernel:  [<c045fdc9>] anon_vma_prepare+0x11/0xa5
Aug 11 04:06:11 newton kernel:  [<c0479fc4>] filldir64+0x0/0xc5
Aug 11 04:06:11 newton kernel:  [<c047a1a5>] vfs_readdir+0x63/0x8d
Aug 11 04:06:11 newton kernel:  [<c0479fc4>] filldir64+0x0/0xc5
Aug 11 04:06:11 newton kernel:  [<c047a232>] sys_getdents64+0x63/0xa5
Aug 11 04:06:11 newton kernel:  [<c0403eff>] syscall_call+0x7/0xb
Aug 11 04:06:11 newton kernel:  =======================
###############################################################################




From rpeterso at redhat.com  Mon Aug 13 16:43:07 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 13 Aug 2007 11:43:07 -0500
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>
References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>
Message-ID: <1187023387.19616.11.camel@technetium.msp.redhat.com>

On Mon, 2007-08-13 at 11:57 -0400, Patricio A. Bruna wrote:
> Hi,
> I've configured a RHEL 5 cluster of 2 nodes, using GFS(v.1)
> filesystems.
> Im having a problem when restarting one of the nodes, the other node
> can not longer access the GFS partitions, so i must reboot both.
> Althoug, yesterday i resize a GFS partition, lvmextend and then
> gfs_grow, and the other node gaves I/O error and dismounts all the GFS
> partitions.
> 
> Do you have any ideas why this is happening?
> 
> PD: Im attaching the cluster.conf.

Hi,

Check to make sure the clustered bit is on for the vg.
See: http://sources.redhat.com/cluster/faq.html#clvmd_clustered

Regards,

Bob Peterson




From jos at xos.nl  Mon Aug 13 16:53:18 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 13 Aug 2007 18:53:18 +0200
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl>;
	from pbruna@it-linux.cl on Mon, Aug 13, 2007 at 12:43:54PM -0400
References: <20070813182044.B7707@xos037.xos.nl>
	<20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl>
Message-ID: <20070813185318.C7707@xos037.xos.nl>

On Mon, Aug 13, 2007 at 12:43:54PM -0400, Patricio A. Bruna wrote:

> > -  Did you create the GFS filesystems with the correct locking
> >    protocol (lock_dlm)?
> Yes i did, like the documentations says: 
> gfs_mkfs -p lock_dlm -t cluster_eseia:portales -j 4 /dev/CLUSTERLVM/portales

OK.

> > -  Do you use clvmd?  Did you mark your VG's to be "clustered" and
> >    do you have "locking_style = 3" in /etc/lvm/lvm.conf?
> 
> No i did not. By the way i only have locking_type in the file, i suppose is what you meant. Righ now is:
> locking_type = 1

This should be "3".  Furthermore, apply this command to each of your
clustererd volume groups:

   vgchange --clustered y /dev/vg...

Then do a "vgscan".

> They arent, because they are supposed to start right away the server boot, and the services i have: IP and Perlbal, do not use the GFS filesystems.

But the cluster services, including clvmd, have to be started before
the GFS filesystems are used.  Better make it another service, that
*only* has the GFS filesystems as resources, and that uses its own
failover domain (one for each node), so that mounting the volumes are
taken care of by the cluster services.

> Aug 11 04:06:10 newton kernel: attempt to access beyond end of device
> Aug 11 04:06:10 newton kernel: dm-4: rw=0, want=270678704, limit=251658240
> [...]

I think this is because the VG's are not marked to be clustered and thus
the other node is not aware of the resizing.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From jos at xos.nl  Mon Aug 13 16:57:41 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 13 Aug 2007 18:57:41 +0200
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <1187023387.19616.11.camel@technetium.msp.redhat.com>;
	from rpeterso@redhat.com on Mon, Aug 13, 2007 at 11:43:07AM -0500
References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>
	<1187023387.19616.11.camel@technetium.msp.redhat.com>
Message-ID: <20070813185741.D7707@xos037.xos.nl>

On Mon, Aug 13, 2007 at 11:43:07AM -0500, Bob Peterson wrote:

> Check to make sure the clustered bit is on for the vg.
> See: http://sources.redhat.com/cluster/faq.html#clvmd_clustered

B.t.w. Bob, are you aware of the fact that this is *not* documented
in the RHEL5 guides (GFS and LVM), as far as I can see, and that
the vgchange manual page does not describe this option (although
"vgchange --help" shows the option)?

It cost me quite some time to find this, as it is mentioned in
the guides that you should set the "clustered" bit, but you can't
easily find (except in the URL you just gave) how to do that.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From pbruna at it-linux.cl  Mon Aug 13 17:05:05 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Mon, 13 Aug 2007 13:05:05 -0400 (CLT)
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <20070813185318.C7707@xos037.xos.nl>
Message-ID: <32724846.35361187024705316.JavaMail.root@lisa.itlinux.cl>

> This should be "3".

What happens with the locals VGs that are no part of the cluster?

>Furthermore, apply this command to each of your
> clustererd volume groups:
> 
>    vgchange --clustered y /dev/vg...
> 
> Then do a "vgscan".
> 

Is safe to run those command with a node down and the order in production?

Thanks





From berthiaume_wayne at emc.com  Mon Aug 13 17:09:37 2007
From: berthiaume_wayne at emc.com (berthiaume_wayne at emc.com)
Date: Mon, 13 Aug 2007 13:09:37 -0400
Subject: [Linux-cluster] create GFS file system on imported iSCSI disk
In-Reply-To: <229C73600EB0E54DA818AB599482BCE901921895@shadowfax.sg.muvee.net>
References: <229C73600EB0E54DA818AB599482BCE901921863@shadowfax.sg.muvee.net><46ADFBD3.7060200@redhat.com>
	<229C73600EB0E54DA818AB599482BCE901921895@shadowfax.sg.muvee.net>
Message-ID: <D364D39DAD21D444BAE2C70919B62809069D8F0C@CORPUSMX40A.corp.emc.com>

/etc/fstab will require the _netdev flag or else the filesystem will not
get mounted during boot.  

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bernard Chew
Sent: Monday, July 30, 2007 10:02 PM
To: linux clustering
Subject: RE: [Linux-cluster] create GFS file system on imported iSCSI
disk

Thanks Bryn!

- Bernard

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bryn M. Reeves
Sent: Monday, July 30, 2007 10:55 PM
To: linux clustering
Subject: Re: [Linux-cluster] create GFS file system on imported iSCSI
disk

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Bernard Chew wrote:
> May I know if we need to set up logical volumes on the imported iSCSI
> disk before creating GFS or we can immediately run "gfs_mkfs -p
lock_dlm
> -t alpha_cluster:gfs01 -j 8 /dev/sdb" (where /dev/sdb refers to the
> imported iSCSI disk) on one node and all nodes to "mount -t gfs
/dev/sdb
> /test"?
> 

There aren't any special steps needed for iSCSI over any other kind of
shared storage (once you've configured the initiators & the devices are
visible to the OS).

Creating volume groups and using logical volumes for GFS is a good idea
if you are likely to want to resize your devices at a later time but is
not strictly necessary.

Other than that, the steps you detailed should work fine.

Kind regards,
Bryn.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGrfvT6YSQoMYUY94RAgGFAJ96WnspeUgKHKiBwHRh71aluGcoUgCfXvrB
fg2wFdqf96s6kciF0ypfzB0=
=cTyA
-----END PGP SIGNATURE-----

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From jos at xos.nl  Mon Aug 13 17:10:19 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 13 Aug 2007 19:10:19 +0200
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <32724846.35361187024705316.JavaMail.root@lisa.itlinux.cl>;
	from pbruna@it-linux.cl on Mon, Aug 13, 2007 at 01:05:05PM -0400
References: <20070813185318.C7707@xos037.xos.nl>
	<32724846.35361187024705316.JavaMail.root@lisa.itlinux.cl>
Message-ID: <20070813191019.E7707@xos037.xos.nl>

On Mon, Aug 13, 2007 at 01:05:05PM -0400, Patricio A. Bruna wrote:

> What happens with the locals VGs that are no part of the cluster?

AFAIK this works ok (I have those too).

> >    vgchange --clustered y /dev/vg...
> > 
> > Then do a "vgscan".
> 
> Is safe to run those command with a node down and the order in production?

Yes, it just changes a bit on the physical storage.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From rpeterso at redhat.com  Mon Aug 13 17:09:06 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 13 Aug 2007 12:09:06 -0500
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <20070813185741.D7707@xos037.xos.nl>
References: <16166123.35021187020620217.JavaMail.root@lisa.itlinux.cl>
	<1187023387.19616.11.camel@technetium.msp.redhat.com>
	<20070813185741.D7707@xos037.xos.nl>
Message-ID: <1187024946.19616.14.camel@technetium.msp.redhat.com>

On Mon, 2007-08-13 at 18:57 +0200, Jos Vos wrote:
> On Mon, Aug 13, 2007 at 11:43:07AM -0500, Bob Peterson wrote:
> 
> > Check to make sure the clustered bit is on for the vg.
> > See: http://sources.redhat.com/cluster/faq.html#clvmd_clustered
> 
> B.t.w. Bob, are you aware of the fact that this is *not* documented
> in the RHEL5 guides (GFS and LVM), as far as I can see, and that
> the vgchange manual page does not describe this option (although
> "vgchange --help" shows the option)?
> 
> It cost me quite some time to find this, as it is mentioned in
> the guides that you should set the "clustered" bit, but you can't
> easily find (except in the URL you just gave) how to do that.

Hi Jos,

No, I wasn't aware of that fact.  I've passed this on to our
documentation folks and hopefully they'll correct the manual.

Regards,

Bob Peterson




From berthiaume_wayne at emc.com  Mon Aug 13 17:21:51 2007
From: berthiaume_wayne at emc.com (berthiaume_wayne at emc.com)
Date: Mon, 13 Aug 2007 13:21:51 -0400
Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error 
In-Reply-To: <46BC6EF5.2010500@lexum.umontreal.ca>
References: <46BC6EF5.2010500@lexum.umontreal.ca>
Message-ID: <D364D39DAD21D444BAE2C70919B62809069D8F37@CORPUSMX40A.corp.emc.com>

These are DID_BUS_BUSY errors being reported by the QLogic driver. I
would check your SAN for congestion, increase the cache in the EVA, or
change the queue depth in your qla2xxx driver.

Regards,
Wayne. 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of FM
Sent: Friday, August 10, 2007 9:58 AM
To: Redhat Cluster
Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error 

Hello,
All servers are RHEL 4.5
SAN is HP EVA 4000
we are using linux qla modules and multipathd
cluster server have only one FC Card


In the dmesg of servers connected to GFS we have a lot of :
SCSI error : <0 0 1 1> return code = 0x20000
end_request: I/O error, dev sdd, sector 37807111

The cluster seems to work fine but I'd like to know if we can avoid this
error.

here is a multipathd -ll output :

[root at como ~]# multipath -ll
mpath1 (3600508b4001051e40000900000310000)
[size=500 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=50][active]
 \_ 0:0:0:1 sda 8:0  [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 0:0:1:1 sdd 8:48 [active][ready]

mpath3 (3600508b4001051e400009000009e0000)
[size=150 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [prio=50][active]
 \_ 0:0:1:2 sde 8:64 [active][ready]
\_ round-robin 0 [prio=10][enabled]
 \_ 0:0:0:2 sdb 8:16 [active][ready]



and the device in the multipath.conf

devices {
        device {
                vendor                  "HP "
                product                 "HSV200 "
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s
/block/%n"
                path_checker            tur
                path_selector           "round-robin 0"
                prio_callout            "/sbin/mpath_prio_alua %d"
                failback                immediate
                no_path_retry           60
        }
}

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster




From jos at xos.nl  Mon Aug 13 17:44:29 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 13 Aug 2007 19:44:29 +0200
Subject: [Linux-cluster] "Address already in use" messages at first startup
Message-ID: <200708131744.l7DHiTx09045@xos037.xos.nl>

Hi,

I always seem to get at the first service start after a reboot:

Aug 13 19:29:16 host1 in.rdiscd[5832]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
Aug 13 19:29:16 host1 in.rdiscd[5832]: Failed joining addresses
Aug 13 19:29:17 host1 in.rdiscd[5884]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
Aug 13 19:29:17 host1 in.rdiscd[5884]: Failed joining addresses

I have one IPv4 and one IPv6 address associated to the service, so I guess
there is one message for each address.  The address assignments seem to
work fine for the rest.

I see "rdisc -fs" running (after a fresh reboot), but the "rdisc" service
is disabled (set to "off" with chkconfig).

What is starting this "rdisc" daemon and do I need it?

Thanks,

--
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From mhanafi at csc.com  Mon Aug 13 18:33:54 2007
From: mhanafi at csc.com (Mahmoud Hanafi)
Date: Mon, 13 Aug 2007 14:33:54 -0400
Subject: [Linux-cluster] Kernel Panic GFS2 and NFS
Message-ID: <OF0609C12D.20386FEA-ON85257336.0065B2C0-85257336.0065FAFE@csc.com>

I am getting the following kernel panic when exporting GFS2 via NFS. When 
the client mounts and does a ls the server panics. Any one else seen this 
issue? Any ideas?

dlm: connecting to 6
dlm: got connection from 6
NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP: 
 [<ffffffff88111069>] :gfs2:gfs2_glock_dq+0x15/0xa5
PGD 22d792067 PUD 22d77a067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: 
/fs/gfs2/nfs_cluster:stripe4_512k/lock_module/recover_done
CPU 0 
Modules linked in: nfsd exportfs lockd nfs_acl qla2xxx lock_dlm gfs2 dlm 
configfs iptable_filter ip_tables autofs4 hidp rfcomm l2cap bluetooth 
sunrpc ip6t_REJECT xt_
tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_round_robin 
dm_multipath video sbs i2c_ec i2c_core button battery asus_acpi 
acpi_memhotplug ac parport_pc lp parpo
rt joydev sr_mod sg pcspkr ib_mthca ib_mad ib_core shpchp bnx2 ide_cd 
cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage 
scsi_transport_fc megaraid_sas
 sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 29865, comm: nfsd Not tainted 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff88111069>]  [<ffffffff88111069>] 
:gfs2:gfs2_glock_dq+0x15/0xa5
RSP: 0018:ffff8101f51c96e0  EFLAGS: 00010246
RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8101f51c9cd0
RDX: ffff8101f57f2080 RSI: ffff8101f51c97b0 RDI: ffff8101f51c9720
RBP: ffff8101f51c9720 R08: ffff8101f504f014 R09: ffff8101f51c99b8
R10: ffff8101f57ad008 R11: ffffffff8811d23c R12: ffff8101f81fe780
R13: ffff8101f51c97b0 R14: 0000000000000001 R15: ffff8101f504f000
FS:  00002aaaab0146f0(0000) GS:ffffffff8038a000(0000) 
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000088 CR3: 000000022d613000 CR4: 00000000000006e0
Process nfsd (pid: 29865, threadinfo ffff8101f51c8000, task 
ffff8101f57f2080)
Stack:  0000000000000000 ffff8101f51c9720 ffff8101f51c99b0 
ffff8101f81fe780
 ffff8101f51c97b0 ffffffff8811112c 0000000000000000 ffffffff8811d2e3
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8811112c>] :gfs2:gfs2_glock_dq_uninit+0x9/0x12
 [<ffffffff8811d2e3>] :gfs2:gfs2_getattr+0xa7/0xb4
 [<ffffffff8000ddca>] vfs_getattr+0x2d/0xa9
 [<ffffffff885fb4bf>] :nfsd:encode_post_op_attr+0x3f/0x213
 [<ffffffff885fbc57>] :nfsd:encode_entry+0x232/0x53e
 [<ffffffff800bea3c>] zone_statistics+0x3e/0x6d
 [<ffffffff80086480>] enqueue_task+0x41/0x56
 [<ffffffff800864bc>] __activate_task+0x27/0x39
 [<ffffffff80044d16>] try_to_wake_up+0x407/0x418
 [<ffffffff800850dd>] __wake_up_common+0x3e/0x68
 [<ffffffff885fbf6e>] :nfsd:nfs3svc_encode_entry_plus+0xb/0x10
 [<ffffffff8811adf1>] :gfs2:filldir_func+0x22/0x86
 [<ffffffff8810bb85>] :gfs2:do_filldir_main+0x126/0x16d
 [<ffffffff8811adcf>] :gfs2:filldir_func+0x0/0x86
 [<ffffffff8810a454>] :gfs2:gfs2_dirent_gather+0x0/0x24
 [<ffffffff885fbf63>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff8810c0ab>] :gfs2:gfs2_dir_read+0x416/0x479
 [<ffffffff8811adcf>] :gfs2:filldir_func+0x0/0x86
 [<ffffffff88126ba7>] :gfs2:gfs2_trans_end+0x14e/0x16b
 [<ffffffff885fbf63>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff8811b559>] :gfs2:gfs2_readdir+0x98/0xbe
 [<ffffffff88112c5e>] :gfs2:gfs2_glock_nq_atime+0x14e/0x292
 [<ffffffff885fbf63>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff8003498f>] vfs_readdir+0x77/0xa9
 [<ffffffff885fbf63>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff885f3e12>] :nfsd:nfsd_readdir+0x6d/0xc5
 [<ffffffff885fade0>] :nfsd:nfsd3_proc_readdirplus+0xf8/0x224
 [<ffffffff885f00e9>] :nfsd:nfsd_dispatch+0xd7/0x198
 [<ffffffff884474b5>] :sunrpc:svc_process+0x43c/0x6fa
 [<ffffffff8006214d>] __down_read+0x12/0x92
 [<ffffffff885f0471>] :nfsd:nfsd+0x0/0x327
 [<ffffffff885f0624>] :nfsd:nfsd+0x1b3/0x327
 [<ffffffff8005bfe5>] child_rip+0xa/0x11
 [<ffffffff885f0471>] :nfsd:nfsd+0x0/0x327
 [<ffffffff885f0471>] :nfsd:nfsd+0x0/0x327
 [<ffffffff8005bfdb>] child_rip+0x0/0x11


Code: 4c 8b ab 88 00 00 00 74 0a 31 f6 48 89 df e8 40 fa ff ff 4c 
RIP  [<ffffffff88111069>] :gfs2:gfs2_glock_dq+0x15/0xa5
 RSP <ffff8101f51c96e0>
CR2: 0000000000000088
 <0>Kernel panic - not syncing: Fatal exception





Mahmoud Hanafi
Sr. System Administrator
CSC HPC COE
Bld. 676
2435 Fifth Street
WPAFB, Ohio 45433
(937) 255-1536



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/e46176f2/attachment.htm>

From wcheng at redhat.com  Mon Aug 13 19:02:51 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Mon, 13 Aug 2007 15:02:51 -0400
Subject: [Linux-cluster] Kernel Panic GFS2 and NFS
In-Reply-To: <OF0609C12D.20386FEA-ON85257336.0065B2C0-85257336.0065FAFE@csc.com>
References: <OF0609C12D.20386FEA-ON85257336.0065B2C0-85257336.0065FAFE@csc.com>
Message-ID: <46C0AADB.1070209@redhat.com>

Mahmoud Hanafi wrote:
>
> I am getting the following kernel panic when exporting GFS2 via NFS. 
> When the client mounts and does a ls the server panics. Any one else 
> seen this issue? Any ideas?

We had this issue no time ago. Your kernel version is way too old... Wendy

>
> dlm: connecting to 6
> dlm: got connection from 6
> NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> NFSD: starting 90-second grace period
> Unable to handle kernel NULL pointer dereference at 0000000000000088 RIP:
> [<ffffffff88111069>] :gfs2:gfs2_glock_dq+0x15/0xa5
> PGD 22d792067 PUD 22d77a067 PMD 0
> Oops: 0000 [1] SMP
> last sysfs file: 
> /fs/gfs2/nfs_cluster:stripe4_512k/lock_module/recover_done
> CPU 0
> Modules linked in: nfsd exportfs lockd nfs_acl qla2xxx lock_dlm gfs2 
> dlm configfs iptable_filter ip_tables autofs4 hidp rfcomm l2cap 
> bluetooth sunrpc ip6t_REJECT xt_
> tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_round_robin 
> dm_multipath video sbs i2c_ec i2c_core button battery asus_acpi 
> acpi_memhotplug ac parport_pc lp parpo
> rt joydev sr_mod sg pcspkr ib_mthca ib_mad ib_core shpchp bnx2 ide_cd 
> cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod usb_storage 
> scsi_transport_fc megaraid_sas
> sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> Pid: 29865, comm: nfsd Not tainted 2.6.18-8.el5 #1
> RIP: 0010:[<ffffffff88111069>]  [<ffffffff88111069>] 
> :gfs2:gfs2_glock_dq+0x15/0xa5
> RSP: 0018:ffff8101f51c96e0  EFLAGS: 00010246
> RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8101f51c9cd0
> RDX: ffff8101f57f2080 RSI: ffff8101f51c97b0 RDI: ffff8101f51c9720
> RBP: ffff8101f51c9720 R08: ffff8101f504f014 R09: ffff8101f51c99b8
> R10: ffff8101f57ad008 R11: ffffffff8811d23c R12: ffff8101f81fe780
> R13: ffff8101f51c97b0 R14: 0000000000000001 R15: ffff8101f504f000
> FS:  00002aaaab0146f0(0000) GS:ffffffff8038a000(0000) 
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000088 CR3: 000000022d613000 CR4: 00000000000006e0
> Process nfsd (pid: 29865, threadinfo ffff8101f51c8000, task 
> ffff8101f57f2080)
> Stack:  0000000000000000 ffff8101f51c9720 ffff8101f51c99b0 
> ffff8101f81fe780
> ffff8101f51c97b0 ffffffff8811112c 0000000000000000 ffffffff8811d2e3
> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> Call Trace:
> [<ffffffff8811112c>] :gfs2:gfs2_glock_dq_uninit+0x9/0x12
> [<ffffffff8811d2e3>] :gfs2:gfs2_getattr+0xa7/0xb4
> [<ffffffff8000ddca>] vfs_getattr+0x2d/0xa9
> [<ffffffff885fb4bf>] :nfsd:encode_post_op_attr+0x3f/0x213
> [<ffffffff885fbc57>] :nfsd:encode_entry+0x232/0x53e
> [<ffffffff800bea3c>] zone_statistics+0x3e/0x6d
> [<ffffffff80086480>] enqueue_task+0x41/0x56
> [<ffffffff800864bc>] __activate_task+0x27/0x39
> [<ffffffff80044d16>] try_to_wake_up+0x407/0x418
> [<ffffffff800850dd>] __wake_up_common+0x3e/0x68
> [<ffffffff885fbf6e>] :nfsd:nfs3svc_encode_entry_plus+0xb/0x10
> [<ffffffff8811adf1>] :gfs2:filldir_func+0x22/0x86
> [<ffffffff8810bb85>] :gfs2:do_filldir_main+0x126/0x16d
> [<ffffffff8811adcf>] :gfs2:filldir_func+0x0/0x86
> [<ffffffff8810a454>] :gfs2:gfs2_dirent_gather+0x0/0x24
> [<ffffffff885fbf63>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
> [<ffffffff8810c0ab>] :gfs2:gfs2_dir_read+0x416/0x479
> [<ffffffff8811adcf>] :gfs2:filldir_func+0x0/0x86
> [<ffffffff88126ba7>] :gfs2:gfs2_trans_end+0x14e/0x16b
> [<ffffffff885fbf63>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
> [<ffffffff8811b559>] :gfs2:gfs2_readdir+0x98/0xbe
> [<ffffffff88112c5e>] :gfs2:gfs2_glock_nq_atime+0x14e/0x292
> [<ffffffff885fbf63>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
> [<ffffffff8003498f>] vfs_readdir+0x77/0xa9
> [<ffffffff885fbf63>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
> [<ffffffff885f3e12>] :nfsd:nfsd_readdir+0x6d/0xc5
> [<ffffffff885fade0>] :nfsd:nfsd3_proc_readdirplus+0xf8/0x224
> [<ffffffff885f00e9>] :nfsd:nfsd_dispatch+0xd7/0x198
> [<ffffffff884474b5>] :sunrpc:svc_process+0x43c/0x6fa
> [<ffffffff8006214d>] __down_read+0x12/0x92
> [<ffffffff885f0471>] :nfsd:nfsd+0x0/0x327
> [<ffffffff885f0624>] :nfsd:nfsd+0x1b3/0x327
> [<ffffffff8005bfe5>] child_rip+0xa/0x11
> [<ffffffff885f0471>] :nfsd:nfsd+0x0/0x327
> [<ffffffff885f0471>] :nfsd:nfsd+0x0/0x327
> [<ffffffff8005bfdb>] child_rip+0x0/0x11
>
>
> Code: 4c 8b ab 88 00 00 00 74 0a 31 f6 48 89 df e8 40 fa ff ff 4c
> RIP  [<ffffffff88111069>] :gfs2:gfs2_glock_dq+0x15/0xa5
> RSP <ffff8101f51c96e0>
> CR2: 0000000000000088
> <0>Kernel panic - not syncing: Fatal exception
>
>
>
>
>
> Mahmoud Hanafi
> Sr. System Administrator
> CSC HPC COE
> Bld. 676
> 2435 Fifth Street
> WPAFB, Ohio 45433
> (937) 255-1536
>
>
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> This is a PRIVATE message. If you are not the intended recipient, 
> please delete without copying and kindly advise us by e-mail of the 
> mistake in delivery. NOTE: Regardless of content, this e-mail shall 
> not operate to bind CSC to any order or other contract unless pursuant 
> to explicit written agreement or government initiative expressly 
> permitting the use of e-mail for such purpose.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From wcheng at redhat.com  Mon Aug 13 19:06:46 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Mon, 13 Aug 2007 15:06:46 -0400
Subject: [Linux-cluster] Kernel Panic GFS2 and NFS
In-Reply-To: <46C0AADB.1070209@redhat.com>
References: <OF0609C12D.20386FEA-ON85257336.0065B2C0-85257336.0065FAFE@csc.com>
	<46C0AADB.1070209@redhat.com>
Message-ID: <46C0ABC6.6060200@redhat.com>

Wendy Cheng wrote:
> Mahmoud Hanafi wrote:
>>
>> I am getting the following kernel panic when exporting GFS2 via NFS. 
>> When the client mounts and does a ls the server panics. Any one else 
>> seen this issue? Any ideas?
>
> We had this issue no time ago. Your kernel version is way too old... 
> Wendy

Sorry ... too many things to juggle at the same time ..  s/no time/long 
time/ ... The 2.6.18.37.el5 runs reasonably well... Wendy



From Randy.Brown at noaa.gov  Mon Aug 13 19:51:24 2007
From: Randy.Brown at noaa.gov (Randy Brown)
Date: Mon, 13 Aug 2007 15:51:24 -0400
Subject: [Linux-cluster] Setting up HA cluster as NAS head for storage
Message-ID: <46C0B63C.4080305@noaa.gov>

I am trying to configure two matching servers in a high availability 
cluster to work as a NAS head for NFS mounts from our ISCSI based 
network storage.  Has anyone done this or is anyone doing this?  I am 
struggling with getting the NFS exports configured so machines outside 
the cluster can mount these filesystems.  I believe I have the two 
servers failing over correctly.  That is, the NFS service I created is 
properly failing over if one of the machines is unavailable.  Any help 
would be greatly appreciated.  I have read the "Configuring and Managing 
a Redhat Cluster" document as well as the "Redhat Cluster Suite 
Overview" document, but they don't quite cover what I'm trying to do.  
Any suggested resources?

Thank you,

Randy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: randy.brown.vcf
Type: text/x-vcard
Size: 348 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/82c6ad74/attachment.vcf>

From Christopher.Barry at qlogic.com  Mon Aug 13 19:57:45 2007
From: Christopher.Barry at qlogic.com (Christopher Barry)
Date: Mon, 13 Aug 2007 15:57:45 -0400
Subject: [Linux-cluster] Setting up HA cluster as NAS head for storage
In-Reply-To: <46C0B63C.4080305@noaa.gov>
References: <46C0B63C.4080305@noaa.gov>
Message-ID: <1187035065.5231.21.camel@localhost>

On Mon, 2007-08-13 at 15:51 -0400, Randy Brown wrote:
> I am trying to configure two matching servers in a high availability 
> cluster to work as a NAS head for NFS mounts from our ISCSI based 
> network storage.  Has anyone done this or is anyone doing this?  I am 
> struggling with getting the NFS exports configured so machines outside 
> the cluster can mount these filesystems.  I believe I have the two 
> servers failing over correctly.  That is, the NFS service I created is 
> properly failing over if one of the machines is unavailable.  Any help 
> would be greatly appreciated.  I have read the "Configuring and Managing 
> a Redhat Cluster" document as well as the "Redhat Cluster Suite 
> Overview" document, but they don't quite cover what I'm trying to do.  
> Any suggested resources?
> 
> Thank you,
> 
> Randy
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


Take a look here:
"http://www.linux-ha.org/HaNFS"


-- 
Regards,
-C

Christopher Barry
Systems Engineer, Principal
Qlogic Corporation
780 5th Avenue
Suite 140
King of Prussia, PA 19406



From pbruna at it-linux.cl  Mon Aug 13 22:14:25 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Mon, 13 Aug 2007 18:14:25 -0400 (CLT)
Subject: [Linux-cluster] Cluster Fence Problem
Message-ID: <4760290.36191187043265985.JavaMail.root@lisa.itlinux.cl>

Hi, 
I have configured a cluster, at least i though so, but when i start system-config-cluster an error messages appear that says there is no members joined to the cluster. 
This is over RHEL 5. In both logs i get: 

Node1: 
fenced[2381]: davinci not a cluster member after 3 sec post_join_delay 
Aug 13 13:41:26 newton fenced[2381]: fencing node "davinci" 
Aug 13 13:41:26 newton fenced[2381]: fence "davinci" failed 
Aug 13 13:41:31 newton fenced[2381]: fencing node "davinci" 
Aug 13 13:41:31 newton fenced[2381]: fence "davinci" failed 

Node2: 
Aug 13 13:47:00 davinci fenced[2539]: newton not a cluster member after 3 sec post_join_delay 
Aug 13 13:47:00 davinci fenced[2539]: fencing node "newton" 
Aug 13 13:47:00 davinci fenced[2539]: fence "newton" failed 
Aug 13 13:47:05 davinci fenced[2539]: fencing node "newton" 
Aug 13 13:47:05 davinci fenced[2539]: fence "newton" failed 

Any help will be appreciated. 

PD: im attaching both logs and cluster.conf 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/56888468/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.davinci
Type: application/octet-stream
Size: 63505 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/56888468/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.newton
Type: application/octet-stream
Size: 147371 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/56888468/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/octet-stream
Size: 1290 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/56888468/attachment-0002.obj>

From chris at cmiware.com  Mon Aug 13 22:25:22 2007
From: chris at cmiware.com (Chris Harms)
Date: Mon, 13 Aug 2007 17:25:22 -0500
Subject: [Linux-cluster] Cluster Fence Problem
In-Reply-To: <4760290.36191187043265985.JavaMail.root@lisa.itlinux.cl>
References: <4760290.36191187043265985.JavaMail.root@lisa.itlinux.cl>
Message-ID: <46C0DA52.7040008@cmiware.com>

We have set our post_join_delay to 15 minutes (900 seconds) to be safe.  
We had no end of trouble with fencing issues on a 2-node cluster, and 
only allowing 3 seconds between the first node in a cluster starting and 
the 2nd noding coming online before it gets fenced is fairly ridiculous.


Patricio A. Bruna wrote:
> Hi,
> I have configured a cluster, at least i though so, but when i start 
> system-config-cluster  an error messages appear that says there is no 
> members joined to the cluster.
> This is over RHEL 5. In both logs i get:
>
> Node1:
> fenced[2381]: davinci not a cluster member after 3 sec post_join_delay
> Aug 13 13:41:26 newton fenced[2381]: fencing node "davinci"
> Aug 13 13:41:26 newton fenced[2381]: fence "davinci" failed
> Aug 13 13:41:31 newton fenced[2381]: fencing node "davinci"
> Aug 13 13:41:31 newton fenced[2381]: fence "davinci" failed
>
> Node2:
> Aug 13 13:47:00 davinci fenced[2539]: newton not a cluster member 
> after 3 sec post_join_delay
> Aug 13 13:47:00 davinci fenced[2539]: fencing node "newton"
> Aug 13 13:47:00 davinci fenced[2539]: fence "newton" failed
> Aug 13 13:47:05 davinci fenced[2539]: fencing node "newton"
> Aug 13 13:47:05 davinci fenced[2539]: fence "newton" failed
>
> Any help will be appreciated.
>
> PD: im attaching both logs and cluster.conf
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From mhanafi at csc.com  Mon Aug 13 23:34:31 2007
From: mhanafi at csc.com (Mahmoud Hanafi)
Date: Mon, 13 Aug 2007 19:34:31 -0400
Subject: [Linux-cluster] locking question
Message-ID: <OF7EBA1D8F.76672071-ON85257336.008173ED-85257336.00818580@csc.com>

Which is better lock_dlm or GLUM? and why?


Thanks,
Mahmoud Hanafi
Sr. System Administrator
CSC HPC COE
Bld. 676
2435 Fifth Street
WPAFB, Ohio 45433
(937) 255-1536



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070813/2ef6177a/attachment.htm>

From jparsons at redhat.com  Tue Aug 14 06:25:08 2007
From: jparsons at redhat.com (jim parsons)
Date: Tue, 14 Aug 2007 02:25:08 -0400
Subject: [Linux-cluster] Cluster Fence Problem
In-Reply-To: <46C0DA52.7040008@cmiware.com>
References: <4760290.36191187043265985.JavaMail.root@lisa.itlinux.cl>
	<46C0DA52.7040008@cmiware.com>
Message-ID: <1187072708.3014.6.camel@localhost.localdomain>

On Mon, 2007-08-13 at 17:25 -0500, Chris Harms wrote:
> We have set our post_join_delay to 15 minutes (900 seconds) to be safe.  
> We had no end of trouble with fencing issues on a 2-node cluster, and 
> only allowing 3 seconds between the first node in a cluster starting and 
> the 2nd noding coming online before it gets fenced is fairly ridiculous.
> 
Also, you need to associate the nodes with the fencedevices. This is
missing in the conf file. If using s-c-cluster, select a node, click
'manage fencing for this node', and make a default level, then, add
fence to the level by choosing the correct fencedevice (newton_ilo for
newton, etc.) from the little dropdowm menu. 

-j
> 
> Patricio A. Bruna wrote:
> > Hi,
> > I have configured a cluster, at least i though so, but when i start 
> > system-config-cluster  an error messages appear that says there is no 
> > members joined to the cluster.
> > This is over RHEL 5. In both logs i get:
> >
> > Node1:
> > fenced[2381]: davinci not a cluster member after 3 sec post_join_delay
> > Aug 13 13:41:26 newton fenced[2381]: fencing node "davinci"
> > Aug 13 13:41:26 newton fenced[2381]: fence "davinci" failed
> > Aug 13 13:41:31 newton fenced[2381]: fencing node "davinci"
> > Aug 13 13:41:31 newton fenced[2381]: fence "davinci" failed
> >
> > Node2:
> > Aug 13 13:47:00 davinci fenced[2539]: newton not a cluster member 
> > after 3 sec post_join_delay
> > Aug 13 13:47:00 davinci fenced[2539]: fencing node "newton"
> > Aug 13 13:47:00 davinci fenced[2539]: fence "newton" failed
> > Aug 13 13:47:05 davinci fenced[2539]: fencing node "newton"
> > Aug 13 13:47:05 davinci fenced[2539]: fence "newton" failed
> >
> > Any help will be appreciated.
> >
> > PD: im attaching both logs and cluster.conf
> >
> >
> > ------------------------------------------------------------------------
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From sebastian.walter at fu-berlin.de  Tue Aug 14 08:01:46 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 14 Aug 2007 10:01:46 +0200
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <20070813185318.C7707@xos037.xos.nl>
References: <20070813182044.B7707@xos037.xos.nl>	<20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl>
	<20070813185318.C7707@xos037.xos.nl>
Message-ID: <46C1616A.5020904@fu-berlin.de>

Hello Jos,

Jos Vos wrote:
> But the cluster services, including clvmd, have to be started before
> the GFS filesystems are used.  Better make it another service, that
> *only* has the GFS filesystems as resources, and that uses its own
> failover domain (one for each node), so that mounting the volumes are
> taken care of by the cluster services.
What you describe is exactly the setup I want, but wasn't able to
configure it yet. I was setting up a single gfs service including the
shared gfs resource of the mount point, but it starts only on one node
at once. Do I have to create one service for each node running on its
self failover domain? This sounds if it could work indeed. And do I need
additionally a lvm resource in that setup (I seem to remember that the
clvmd service was not chkconfig'ed "on" by default)?

Thanks for any help!

regards,
Sebastian



From jos at xos.nl  Tue Aug 14 09:18:56 2007
From: jos at xos.nl (Jos Vos)
Date: Tue, 14 Aug 2007 11:18:56 +0200
Subject: [Linux-cluster] GFS Problem
In-Reply-To: <46C1616A.5020904@fu-berlin.de>;
	from sebastian.walter@fu-berlin.de on Tue, Aug 14, 2007 at
	10:01:46AM +0200
References: <20070813182044.B7707@xos037.xos.nl>
	<20683328.35321187023434188.JavaMail.root@lisa.itlinux.cl>
	<20070813185318.C7707@xos037.xos.nl>
	<46C1616A.5020904@fu-berlin.de>
Message-ID: <20070814111856.A16445@xos037.xos.nl>

On Tue, Aug 14, 2007 at 10:01:46AM +0200, Sebastian Walter wrote:

> What you describe is exactly the setup I want, but wasn't able to
> configure it yet. I was setting up a single gfs service including the
> shared gfs resource of the mount point, but it starts only on one node
> at once. Do I have to create one service for each node running on its
> self failover domain? [...]

Yes, for all nodes you want them to be mounted you need a seperate
service with its own one-node (exclusive) failover domain.

> [...] This sounds if it could work indeed. And do I need
> additionally a lvm resource in that setup (I seem to remember that the
> clvmd service was not chkconfig'ed "on" by default)?

Just do "chkconfig clvmd on", that should work.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From rpeterso at redhat.com  Tue Aug 14 13:38:22 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Tue, 14 Aug 2007 08:38:22 -0500
Subject: [Linux-cluster] locking question
In-Reply-To: <OF7EBA1D8F.76672071-ON85257336.008173ED-85257336.00818580@csc.com>
References: <OF7EBA1D8F.76672071-ON85257336.008173ED-85257336.00818580@csc.com>
Message-ID: <1187098702.19616.43.camel@technetium.msp.redhat.com>

On Mon, 2007-08-13 at 19:34 -0400, Mahmoud Hanafi wrote:
> 
> Which is better lock_dlmor GLUM? and why? 
> 
> 
> Thanks, 
> Mahmoud Hanafi
> Sr. System Administrator

Hi Mahmoud,

Here's my take on it:

http://sources.redhat.com/cluster/faq.html#dlm_which

Regards,

Bob Peterson




From eric at bootseg.com  Tue Aug 14 13:49:26 2007
From: eric at bootseg.com (Eric Kerin)
Date: Tue, 14 Aug 2007 09:49:26 -0400
Subject: [Linux-cluster] fence_apc 7930s
In-Reply-To: <1891144504.370791186961023635.JavaMail.root@v-mailhost2.mxpath.net>
References: <1891144504.370791186961023635.JavaMail.root@v-mailhost2.mxpath.net>
Message-ID: <46C1B2E6.6070100@bootseg.com>

Brian Sheets wrote:
> Did anyone get this working and allow system names for the port tags?
> Mine are all labled as [empty] I've not tried switching it back to the default tag.
>   
I have named ports on mine, and I use the fence_apc_snmp agent to 
control them.  It works very well for me, and has for quite some time.

Eric Kerin
eric at bootseg.com




From sebastian.walter at fu-berlin.de  Tue Aug 14 15:15:20 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 14 Aug 2007 17:15:20 +0200
Subject: [Linux-cluster] Buffer I/O error on device diapered_dm-0
Message-ID: <46C1C708.5050502@fu-berlin.de>

Dear list,

lately during accessing the gfs volume with heavy load, I get strange
errors in /var/log/messages, forcing clurmgr to  restart the gfs service:

Aug 14 16:45:54 host kernel: Buffer I/O error on device diapered_dm-0,
logical block 249820

I'm using qla2xxx kernel modules  on QLogic 2432 FC HBA's.

Any advice? Thanks for any help!

Regards, sebastian



From chris at cmiware.com  Tue Aug 14 15:19:07 2007
From: chris at cmiware.com (Chris Harms)
Date: Tue, 14 Aug 2007 10:19:07 -0500
Subject: [Linux-cluster] modclusterd memory leak
Message-ID: <46C1C7EB.8090300@cmiware.com>

We installed the 5.1 Beta RPMs of the cluster suite and have left our 
cluster running unfettered for over a week. It now appears modclusterd 
has a slow memory leak.  Its consuming 1.5% and climbing of our 16GB of 
RAM which is up from 1.3% yesterday.  I would be happy to do some tests 
and send along the results.  Please advise.

Thanks,
Chris



From rhurst at bidmc.harvard.edu  Tue Aug 14 18:46:39 2007
From: rhurst at bidmc.harvard.edu (Robert Hurst)
Date: Tue, 14 Aug 2007 14:46:39 -0400
Subject: [Linux-cluster] rgmanager ceases to send syslog messages
Message-ID: <1187117199.14904.12.camel@xw9300.bidmc.harvard.edu>

Odd, a member node's rgmanager (clurgmgrd) stopped sending syslog
messages, in particular, a 'status' message of a service it was running.
This causes us a problem, as we monitor syslog messages from a
centralized server to update us of services running by nodename.

Is there a signal or event that can trigger clurgmgrd to restart its
monitoring and logging of its running service?

The last instances of it running and showing 'WATSON status' follow.
Note, I realize there was an issue with this particular cluster.conf
change, but those changes had nothing to do with the WATSON service, and
all other nodes are still sending their 'service status' syslog
messages.  Why would 'WATSON status' just stop?

Aug  6 14:38:35 db5 clurgmgrd: [16354]: <info>
Executing /etc/init.d/WATSON status 
Aug  6 14:39:05 db5 clurgmgrd: [16354]: <info>
Executing /etc/init.d/WATSON status
Aug  6 14:39:20 db5 ccsd[13802]: Update of cluster.conf complete
(version 187 -> 188).
Aug  6 14:39:25 db5 clurgmgrd[16354]: <notice> Reconfiguring 
Aug  6 14:39:25 db5 clurgmgrd[16354]: <info> Loading Service Data 
Aug  6 14:39:25 db5 clurgmgrd[16354]: <err> Error storing ip: Duplicate 
Aug  6 14:39:26 db5 clurgmgrd[16354]: <err> Unique attribute collision.
type=clusterfs attr=device value=/dev/VGCCC1/lvol0 
Aug  6 14:39:26 db5 clurgmgrd[16354]: <err> Error storing clusterfs
resource 
Aug  6 14:39:26 db5 clurgmgrd[16354]: <err> Unique attribute collision.
type=clusterfs attr=device value=/dev/VGCCC1/lvol1 
Aug  6 14:39:26 db5 clurgmgrd[16354]: <err> Error storing clusterfs
resource 
Aug  6 14:39:26 db5 clurgmgrd[16354]: <info> Stopping changed
resources. 
Aug  6 14:39:26 db5 clurgmgrd[16354]: <info> Restarting changed
resources. 
Aug  6 14:39:26 db5 clurgmgrd[16354]: <info> Starting changed
resources. 
Aug  6 14:39:26 db5 clurgmgrd: [16354]: <info>
Executing /etc/init.d/syslogger stop
Aug  6 14:39:27 db5 clurgmgrd: [16354]: <info>
Executing /etc/init.d/luci stop 
Aug  6 14:39:27 db5 clurgmgrd: [16354]: <info>
Executing /etc/init.d/webmin stop
Aug  6 14:39:27 db5 clurgmgrd: [16354]: <info>
Executing /etc/init.d/nagios stop

I continue to get messages from clurgmgrd, but only through Magma Event
changes, i.e.:

Aug  7 16:09:03 db5 clurgmgrd[16354]: <info> Magma Event: Membership
Change 
Aug  7 16:09:03 db5 clurgmgrd[16354]: <info> State change: db1 UP

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070814/db04c830/attachment.htm>

From adas at redhat.com  Tue Aug 14 19:37:07 2007
From: adas at redhat.com (Abhijith Das)
Date: Tue, 14 Aug 2007 14:37:07 -0500
Subject: [Linux-cluster] Assertion failed in do_flock (bz198302)
In-Reply-To: <1186998796.2650.6.camel@rutabaga.defuturo.co.uk>
References: <1186998796.2650.6.camel@rutabaga.defuturo.co.uk>
Message-ID: <46C20463.5030206@redhat.com>

Robert Clark wrote:

>  I've been seeing the same assertions as in bz198302, so I've tried out
>the debug patch there and it looks like they are being triggered by an
>EAGAIN from flock_lock_file_wait. Is this an expected return code?
>
>	Robert
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>  
>
Robert,
No. This is not an expected return code as far as I can tell from the
code. Are you running NFS on GFS? If you can reproduce this problem
reliably, can you add that information to the bz? Also the output of cat
/proc/locks and 'gfs_tool lockdump' would be useful.
Thanks,
--Abhi



From brad at bradandkim.net  Tue Aug 14 20:41:37 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Tue, 14 Aug 2007 15:41:37 -0500 (CDT)
Subject: [Linux-cluster] Postgresql service on RHCS
Message-ID: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>

I am attempting to set up an active/passive failover environment for
postgresql-8.2.  I created a failover domain with 2 nodes, one of them
preferred, added a postgresql-8 resource and an IP address resource.  Now
when I set up a service I run into problems.  The idea is to have both the
postgresql-8 resource and the IP address resource within the service so
that it will move the floating IP to the other node in the domain and
start postgres on it.   The IP portion seems to work fine but postgresql
always fails.  I get these error messages:

Aug 14 15:40:47 cdb6 clurgmgrd[4106]: <notice> Starting disabled service
10.10.1.221
Aug 14 15:40:47 cdb6 clurgmgrd: [4106]: <info> Adding IPv4 address
10.10.1.236 to bond0
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Starting Service
postgres-8:cdb6
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Starting Service
postgres-8:cdb6 > Failed
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> start on postgres-8:cdb6
returned 1 (generic error)
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <warning> #68: Failed to start
10.10.1.221; return value: 1
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Stopping service 10.10.1.221
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Stopping Service
postgres-8:cdb6
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Checking Existence Of File
/var/run/cluster/postgres-8/postgres-8:cdb6.pid [postgres-8:cdb6] > Failed
- File Doesn't Exist
Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Stopping Service
postgres-8:cdb6 > Failed
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> stop on postgres-8:cdb6
returned 1 (generic error)
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #12: RG 10.10.1.221 failed to
stop; intervention required
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Service 10.10.1.221 is failed
Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #13: Service 10.10.1.221
failed to stop cleanly


Here are the relavent portions of cluster.conf:

               <failoverdomains>
                        <failoverdomain name="cdb6_pgsql" ordered="1"
restricted="1">
                                <failoverdomainnode name="cdb6"
priority="1"/>
                                <failoverdomainnode name="cdb5"
priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <clusterfs device="/dev/mapper/gfsvol1"
force_unmount="0" fsid="36155" fstype="gfs"
mountpoint="/mnt/gfs" name="gfs1" options=""/>
                        <postgres-8
config_file="/var/lib/pgsql/data/postgresql.conf"
name="cdb6" postmaster_options=""
postmaster_user="postgres" shutdown_wait=""/>
                        <ip address="10.10.1.236" monitor_link="1"/>
                        <script file="/etc/rc.d/init.d/postgresql"
name="postgres"/>
                </resources>
                <service autostart="0" domain="cdb6_pgsql"
name="10.10.1.221" recovery="relocate">
                        <ip ref="10.10.1.236"/>
                        <postgres-8 ref="cdb6"/>
                </service>

I looked into the init script failing to return zero issue and I don't
think that is it.  My postgresql startup script doesn't even show that
there was an attempt to start it.  When setting up a postgresql-8 resource
I assume that 'config file' is the config file for postgres or is it the
startup script '/etc/rc.d/init.d/postgresql'?  Neither works for me.

Any tips would be greatly appreciated.

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From chris at cmiware.com  Tue Aug 14 20:49:52 2007
From: chris at cmiware.com (Chris Harms)
Date: Tue, 14 Aug 2007 15:49:52 -0500
Subject: [Linux-cluster] Postgresql service on RHCS
In-Reply-To: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>
References: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>
Message-ID: <46C21570.9050406@cmiware.com>

I used a generic script resource instead of the supplied Postgres-8.  It 
didn't seem to work, but then we needed 2 of them and not 1.

If you're using the init script, try using it as a script resource instead.

brad at bradandkim.net wrote:
> I am attempting to set up an active/passive failover environment for
> postgresql-8.2.  I created a failover domain with 2 nodes, one of them
> preferred, added a postgresql-8 resource and an IP address resource.  Now
> when I set up a service I run into problems.  The idea is to have both the
> postgresql-8 resource and the IP address resource within the service so
> that it will move the floating IP to the other node in the domain and
> start postgres on it.   The IP portion seems to work fine but postgresql
> always fails.  I get these error messages:
>
> Aug 14 15:40:47 cdb6 clurgmgrd[4106]: <notice> Starting disabled service
> 10.10.1.221
> Aug 14 15:40:47 cdb6 clurgmgrd: [4106]: <info> Adding IPv4 address
> 10.10.1.236 to bond0
> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Starting Service
> postgres-8:cdb6
> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Starting Service
> postgres-8:cdb6 > Failed
> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> start on postgres-8:cdb6
> returned 1 (generic error)
> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <warning> #68: Failed to start
> 10.10.1.221; return value: 1
> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Stopping service 10.10.1.221
> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Stopping Service
> postgres-8:cdb6
> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Checking Existence Of File
> /var/run/cluster/postgres-8/postgres-8:cdb6.pid [postgres-8:cdb6] > Failed
> - File Doesn't Exist
> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Stopping Service
> postgres-8:cdb6 > Failed
> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> stop on postgres-8:cdb6
> returned 1 (generic error)
> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #12: RG 10.10.1.221 failed to
> stop; intervention required
> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Service 10.10.1.221 is failed
> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #13: Service 10.10.1.221
> failed to stop cleanly
>
>
> Here are the relavent portions of cluster.conf:
>
>                <failoverdomains>
>                         <failoverdomain name="cdb6_pgsql" ordered="1"
> restricted="1">
>                                 <failoverdomainnode name="cdb6"
> priority="1"/>
>                                 <failoverdomainnode name="cdb5"
> priority="2"/>
>                         </failoverdomain>
>                 </failoverdomains>
>                 <resources>
>                         <clusterfs device="/dev/mapper/gfsvol1"
> force_unmount="0" fsid="36155" fstype="gfs"
> mountpoint="/mnt/gfs" name="gfs1" options=""/>
>                         <postgres-8
> config_file="/var/lib/pgsql/data/postgresql.conf"
> name="cdb6" postmaster_options=""
> postmaster_user="postgres" shutdown_wait=""/>
>                         <ip address="10.10.1.236" monitor_link="1"/>
>                         <script file="/etc/rc.d/init.d/postgresql"
> name="postgres"/>
>                 </resources>
>                 <service autostart="0" domain="cdb6_pgsql"
> name="10.10.1.221" recovery="relocate">
>                         <ip ref="10.10.1.236"/>
>                         <postgres-8 ref="cdb6"/>
>                 </service>
>
> I looked into the init script failing to return zero issue and I don't
> think that is it.  My postgresql startup script doesn't even show that
> there was an attempt to start it.  When setting up a postgresql-8 resource
> I assume that 'config file' is the config file for postgres or is it the
> startup script '/etc/rc.d/init.d/postgresql'?  Neither works for me.
>
> Any tips would be greatly appreciated.
>
> Brad Crotchett
> brad at bradandkim.net
> http://www.bradandkim.net
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From brad at bradandkim.net  Tue Aug 14 21:17:15 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Tue, 14 Aug 2007 16:17:15 -0500 (CDT)
Subject: [Linux-cluster] Postgresql service on RHCS
In-Reply-To: <46C21570.9050406@cmiware.com>
References: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>
	<46C21570.9050406@cmiware.com>
Message-ID: <54842.129.237.174.117.1187126235.squirrel@webmail.bradandkim.net>


> I used a generic script resource instead of the supplied Postgres-8.  It
> didn't seem to work, but then we needed 2 of them and not 1.
>
> If you're using the init script, try using it as a script resource
> instead.
>
> brad at bradandkim.net wrote:
>> I am attempting to set up an active/passive failover environment for
>> postgresql-8.2.  I created a failover domain with 2 nodes, one of them
>> preferred, added a postgresql-8 resource and an IP address resource.
>> Now
>> when I set up a service I run into problems.  The idea is to have both
>> the
>> postgresql-8 resource and the IP address resource within the service so
>> that it will move the floating IP to the other node in the domain and
>> start postgres on it.   The IP portion seems to work fine but postgresql
>> always fails.  I get these error messages:
>>
>> Aug 14 15:40:47 cdb6 clurgmgrd[4106]: <notice> Starting disabled service
>> 10.10.1.221
>> Aug 14 15:40:47 cdb6 clurgmgrd: [4106]: <info> Adding IPv4 address
>> 10.10.1.236 to bond0
>> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Starting Service
>> postgres-8:cdb6
>> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Starting Service
>> postgres-8:cdb6 > Failed
>> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> start on postgres-8:cdb6
>> returned 1 (generic error)
>> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <warning> #68: Failed to start
>> 10.10.1.221; return value: 1
>> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Stopping service
>> 10.10.1.221
>> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Stopping Service
>> postgres-8:cdb6
>> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Checking Existence Of File
>> /var/run/cluster/postgres-8/postgres-8:cdb6.pid [postgres-8:cdb6] >
>> Failed
>> - File Doesn't Exist
>> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Stopping Service
>> postgres-8:cdb6 > Failed
>> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> stop on postgres-8:cdb6
>> returned 1 (generic error)
>> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #12: RG 10.10.1.221 failed
>> to
>> stop; intervention required
>> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Service 10.10.1.221 is
>> failed
>> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #13: Service 10.10.1.221
>> failed to stop cleanly
>>
>>
>> Here are the relavent portions of cluster.conf:
>>
>>                <failoverdomains>
>>                         <failoverdomain name="cdb6_pgsql" ordered="1"
>> restricted="1">
>>                                 <failoverdomainnode name="cdb6"
>> priority="1"/>
>>                                 <failoverdomainnode name="cdb5"
>> priority="2"/>
>>                         </failoverdomain>
>>                 </failoverdomains>
>>                 <resources>
>>                         <clusterfs device="/dev/mapper/gfsvol1"
>> force_unmount="0" fsid="36155" fstype="gfs"
>> mountpoint="/mnt/gfs" name="gfs1" options=""/>
>>                         <postgres-8
>> config_file="/var/lib/pgsql/data/postgresql.conf"
>> name="cdb6" postmaster_options=""
>> postmaster_user="postgres" shutdown_wait=""/>
>>                         <ip address="10.10.1.236" monitor_link="1"/>
>>                         <script file="/etc/rc.d/init.d/postgresql"
>> name="postgres"/>
>>                 </resources>
>>                 <service autostart="0" domain="cdb6_pgsql"
>> name="10.10.1.221" recovery="relocate">
>>                         <ip ref="10.10.1.236"/>
>>                         <postgres-8 ref="cdb6"/>
>>                 </service>
>>
>> I looked into the init script failing to return zero issue and I don't
>> think that is it.  My postgresql startup script doesn't even show that
>> there was an attempt to start it.  When setting up a postgresql-8
>> resource
>> I assume that 'config file' is the config file for postgres or is it the
>> startup script '/etc/rc.d/init.d/postgresql'?  Neither works for me.
>>
>> Any tips would be greatly appreciated.
>>
>> Brad Crotchett
>

I tried that as well and it still fails.  Does rgmanager already know to
use start|stop|restart with the init script?

Thanks,

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From tameikareed at satx.rr.com  Wed Aug 15 00:50:02 2007
From: tameikareed at satx.rr.com (tameikareed)
Date: Tue, 14 Aug 2007 19:50:02 -0500
Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error
In-Reply-To: <D364D39DAD21D444BAE2C70919B62809069D8F37@CORPUSMX40A.corp.emc.com>
References: <46BC6EF5.2010500@lexum.umontreal.ca>
	<D364D39DAD21D444BAE2C70919B62809069D8F37@CORPUSMX40A.corp.emc.com>
Message-ID: <46C24DBA.6000404@satx.rr.com>

Have you tried moving the data to another lun and deleting the lun with 
errors and recreating that lun? I know that involves downtime unless you 
have some type of failover implemented.

berthiaume_wayne at emc.com wrote:
> These are DID_BUS_BUSY errors being reported by the QLogic driver. I
> would check your SAN for congestion, increase the cache in the EVA, or
> change the queue depth in your qla2xxx driver.
>
> Regards,
> Wayne. 
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of FM
> Sent: Friday, August 10, 2007 9:58 AM
> To: Redhat Cluster
> Subject: [Linux-cluster] SAN + multipathd + GFS : SCSI error 
>
> Hello,
> All servers are RHEL 4.5
> SAN is HP EVA 4000
> we are using linux qla modules and multipathd
> cluster server have only one FC Card
>
>
> In the dmesg of servers connected to GFS we have a lot of :
> SCSI error : <0 0 1 1> return code = 0x20000
> end_request: I/O error, dev sdd, sector 37807111
>
> The cluster seems to work fine but I'd like to know if we can avoid this
> error.
>
> here is a multipathd -ll output :
>
> [root at como ~]# multipath -ll
> mpath1 (3600508b4001051e40000900000310000)
> [size=500 GB][features="1 queue_if_no_path"][hwhandler="0"]
> \_ round-robin 0 [prio=50][active]
>  \_ 0:0:0:1 sda 8:0  [active][ready]
> \_ round-robin 0 [prio=10][enabled]
>  \_ 0:0:1:1 sdd 8:48 [active][ready]
>
> mpath3 (3600508b4001051e400009000009e0000)
> [size=150 GB][features="1 queue_if_no_path"][hwhandler="0"]
> \_ round-robin 0 [prio=50][active]
>  \_ 0:0:1:2 sde 8:64 [active][ready]
> \_ round-robin 0 [prio=10][enabled]
>  \_ 0:0:0:2 sdb 8:16 [active][ready]
>
>
>
> and the device in the multipath.conf
>
> devices {
>         device {
>                 vendor                  "HP "
>                 product                 "HSV200 "
>                 path_grouping_policy    group_by_prio
>                 getuid_callout          "/sbin/scsi_id -g -u -s
> /block/%n"
>                 path_checker            tur
>                 path_selector           "round-robin 0"
>                 prio_callout            "/sbin/mpath_prio_alua %d"
>                 failback                immediate
>                 no_path_retry           60
>         }
> }
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>   



From mhanafi at csc.com  Wed Aug 15 02:54:49 2007
From: mhanafi at csc.com (Mahmoud Hanafi)
Date: Tue, 14 Aug 2007 22:54:49 -0400
Subject: [Linux-cluster] gfs2 doc and performance question
Message-ID: <OF1AFC40D6.54F9CCA4-ON85257338.000FAFE1-85257338.001002B3@csc.com>

I been trying to find documentation on gfs2. but it seem that there isn't 
much. 

Is there any detail documentation on gfs2?
Can one set the block size when creating gfs2 filesystem?
Has there been any performance benchmark of gfs2?


Thanks,
Mahmoud Hanafi
Sr. System Administrator
CSC HPC COE
Bld. 676
2435 Fifth Street
WPAFB, Ohio 45433
(937) 255-1536



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070814/d242c069/attachment.htm>

From cluster at defuturo.co.uk  Wed Aug 15 10:27:00 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Wed, 15 Aug 2007 11:27:00 +0100
Subject: [Linux-cluster] Assertion failed in do_flock (bz198302)
In-Reply-To: <46C20463.5030206@redhat.com>
References: <1186998796.2650.6.camel@rutabaga.defuturo.co.uk>
	<46C20463.5030206@redhat.com>
Message-ID: <1187173620.4213.8.camel@rutabaga.defuturo.co.uk>

  Hi Abhi,

On Tue, 2007-08-14 at 14:37 -0500, Abhijith Das wrote:

> Robert Clark wrote:
> 
> >  I've been seeing the same assertions as in bz198302, so I've tried out
> >the debug patch there and it looks like they are being triggered by an
> >EAGAIN from flock_lock_file_wait. Is this an expected return code?

> No. This is not an expected return code as far as I can tell from the
> code. Are you running NFS on GFS? If you can reproduce this problem
> reliably, can you add that information to the bz? Also the output of cat
> /proc/locks and 'gfs_tool lockdump' would be useful.

  No, I'm not running NFS. My filesystem is used as a home directory for
users running GNOME desktops, so there are some pretty complex programs
accessing it.

  After some experimentation, I've found a reliable way to reproduce the
assertion involving opening a file twice and attempting to lock it on
both filehandles - more details in bugzilla.

	Thanks,

		Robert



From mm at yuhu.biz  Wed Aug 15 10:59:22 2007
From: mm at yuhu.biz (Marian Marinov)
Date: Wed, 15 Aug 2007 13:59:22 +0300
Subject: [Linux-cluster] Postgresql service on RHCS
In-Reply-To: <54842.129.237.174.117.1187126235.squirrel@webmail.bradandkim.net>
References: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>
	<46C21570.9050406@cmiware.com>
	<54842.129.237.174.117.1187126235.squirrel@webmail.bradandkim.net>
Message-ID: <200708151359.22202.mm@yuhu.biz>

Have you tried to use PGCluster (http://pgcluster.projects.postgresql.org/) or 
Postgres-R (http://www.postgres-r.org/about/about).

You can achieve the same result without having a cluster filesystem.

Best regards
  M.Marinov

On Wednesday 15 August 2007 00:17:15 brad at bradandkim.net wrote:
> > I used a generic script resource instead of the supplied Postgres-8.  It
> > didn't seem to work, but then we needed 2 of them and not 1.
> >
> > If you're using the init script, try using it as a script resource
> > instead.
> >
> > brad at bradandkim.net wrote:
> >> I am attempting to set up an active/passive failover environment for
> >> postgresql-8.2.  I created a failover domain with 2 nodes, one of them
> >> preferred, added a postgresql-8 resource and an IP address resource.
> >> Now
> >> when I set up a service I run into problems.  The idea is to have both
> >> the
> >> postgresql-8 resource and the IP address resource within the service so
> >> that it will move the floating IP to the other node in the domain and
> >> start postgres on it.   The IP portion seems to work fine but postgresql
> >> always fails.  I get these error messages:
> >>
> >> Aug 14 15:40:47 cdb6 clurgmgrd[4106]: <notice> Starting disabled service
> >> 10.10.1.221
> >> Aug 14 15:40:47 cdb6 clurgmgrd: [4106]: <info> Adding IPv4 address
> >> 10.10.1.236 to bond0
> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Starting Service
> >> postgres-8:cdb6
> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Starting Service
> >> postgres-8:cdb6 > Failed
> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> start on postgres-8:cdb6
> >> returned 1 (generic error)
> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <warning> #68: Failed to start
> >> 10.10.1.221; return value: 1
> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Stopping service
> >> 10.10.1.221
> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Stopping Service
> >> postgres-8:cdb6
> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Checking Existence Of File
> >> /var/run/cluster/postgres-8/postgres-8:cdb6.pid [postgres-8:cdb6] >
> >> Failed
> >> - File Doesn't Exist
> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Stopping Service
> >> postgres-8:cdb6 > Failed
> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> stop on postgres-8:cdb6
> >> returned 1 (generic error)
> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #12: RG 10.10.1.221 failed
> >> to
> >> stop; intervention required
> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Service 10.10.1.221 is
> >> failed
> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #13: Service 10.10.1.221
> >> failed to stop cleanly
> >>
> >>
> >> Here are the relavent portions of cluster.conf:
> >>
> >>                <failoverdomains>
> >>                         <failoverdomain name="cdb6_pgsql" ordered="1"
> >> restricted="1">
> >>                                 <failoverdomainnode name="cdb6"
> >> priority="1"/>
> >>                                 <failoverdomainnode name="cdb5"
> >> priority="2"/>
> >>                         </failoverdomain>
> >>                 </failoverdomains>
> >>                 <resources>
> >>                         <clusterfs device="/dev/mapper/gfsvol1"
> >> force_unmount="0" fsid="36155" fstype="gfs"
> >> mountpoint="/mnt/gfs" name="gfs1" options=""/>
> >>                         <postgres-8
> >> config_file="/var/lib/pgsql/data/postgresql.conf"
> >> name="cdb6" postmaster_options=""
> >> postmaster_user="postgres" shutdown_wait=""/>
> >>                         <ip address="10.10.1.236" monitor_link="1"/>
> >>                         <script file="/etc/rc.d/init.d/postgresql"
> >> name="postgres"/>
> >>                 </resources>
> >>                 <service autostart="0" domain="cdb6_pgsql"
> >> name="10.10.1.221" recovery="relocate">
> >>                         <ip ref="10.10.1.236"/>
> >>                         <postgres-8 ref="cdb6"/>
> >>                 </service>
> >>
> >> I looked into the init script failing to return zero issue and I don't
> >> think that is it.  My postgresql startup script doesn't even show that
> >> there was an attempt to start it.  When setting up a postgresql-8
> >> resource
> >> I assume that 'config file' is the config file for postgres or is it the
> >> startup script '/etc/rc.d/init.d/postgresql'?  Neither works for me.
> >>
> >> Any tips would be greatly appreciated.
> >>
> >> Brad Crotchett
>
> I tried that as well and it still fails.  Does rgmanager already know to
> use start|stop|restart with the init script?
>
> Thanks,
>
> Brad Crotchett
> brad at bradandkim.net
> http://www.bradandkim.net
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From devrim at gunduz.org  Wed Aug 15 11:04:27 2007
From: devrim at gunduz.org (=?iso-8859-9?Q?Devrim_G=DCND=DCZ?=)
Date: Wed, 15 Aug 2007 04:04:27 -0700 (PDT)
Subject: [Linux-cluster] Postgresql service on RHCS
In-Reply-To: <200708151359.22202.mm@yuhu.biz>
References: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>
	<46C21570.9050406@cmiware.com>
	<54842.129.237.174.117.1187126235.squirrel@webmail.bradandkim.net>
	<200708151359.22202.mm@yuhu.biz>
Message-ID: <Pine.LNX.4.64.0708150403390.10386@community1.commandprompt.com>


Hi,

On Wed, 15 Aug 2007, Marian Marinov wrote:

> Have you tried to use PGCluster (http://pgcluster.projects.postgresql.org/) or

It is not a shared-disk system.

> Postgres-R (http://www.postgres-r.org/about/about).

This project has not released any public code yet.

Regards,
--
Devrim G?ND?Z
devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
                       http://www.gunduz.org



From dnlombar at ichips.intel.com  Wed Aug 15 13:43:18 2007
From: dnlombar at ichips.intel.com (Lombard, David N)
Date: Wed, 15 Aug 2007 06:43:18 -0700
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070802154818.GB26367@redhat.com>
References: <20070629132556.GK29854@helsinki.fi>
	<20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
	<20070731134121.GH4955@redhat.com>
	<20070731145441.GE21896@helsinki.fi>
	<20070802154818.GB26367@redhat.com>
Message-ID: <20070815134318.GA3922@nlxdcldnl2.cl.intel.com>

On Thu, Aug 02, 2007 at 11:48:19AM -0400, Lon Hohberger wrote:
> On Tue, Jul 31, 2007 at 05:54:41PM +0300, Janne Peltonen wrote:
> > 
> > I did the same, with the same results. It seems to me that the clurgmgrd
> > process isn't calling the complete script any more times than it's
> > supposed to.  What I'm seeing are the execs of fs.sh, that is, it
> > includes each () and `` and so on. Each fs.sh invocation seems to create
> > quite an amount of subshells.
> > 
> > I'm sorry for having misled you. And this all means, there isn't
> > probably much reason to read the cluster.conf and rg_test rules output -
> > I'll attach them anyway.
> 
> Yeah, it does hit a lot of subshells.  Awks, seds, and the like.  Some
> pattern substitution and matching can be done in pure bash.

Perhaps a *lot* more than you likely think...

Sadly, this is the common case.  Probably 99.99% of all shell scripts
that I see are still Bourne scripts or very close to it.  The bash shell
is SO MUCH more powerful than Bourne, but it gets used in such a clumsy
and trivial manner that wastes far too many processes and far too many
cycles.

And yes, I have submitted patches when the abuse was truly egregious,
but such updates are universally ignored as there is no actual error to
point to.

Sigh...

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.



From brad at bradandkim.net  Wed Aug 15 14:31:30 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Wed, 15 Aug 2007 09:31:30 -0500 (CDT)
Subject: [Linux-cluster] Postgresql service on RHCS
In-Reply-To: <200708151359.22202.mm@yuhu.biz>
References: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>
	<46C21570.9050406@cmiware.com>
	<54842.129.237.174.117.1187126235.squirrel@webmail.bradandkim.net>
	<200708151359.22202.mm@yuhu.biz>
Message-ID: <35673.129.237.174.117.1187188290.squirrel@webmail.bradandkim.net>


> Have you tried to use PGCluster
> (http://pgcluster.projects.postgresql.org/) or
> Postgres-R (http://www.postgres-r.org/about/about).
>
> You can achieve the same result without having a cluster filesystem.
>
> Best regards
>   M.Marinov
>
> On Wednesday 15 August 2007 00:17:15 brad at bradandkim.net wrote:
>> > I used a generic script resource instead of the supplied Postgres-8.
>> It
>> > didn't seem to work, but then we needed 2 of them and not 1.
>> >
>> > If you're using the init script, try using it as a script resource
>> > instead.
>> >
>> > brad at bradandkim.net wrote:
>> >> I am attempting to set up an active/passive failover environment for
>> >> postgresql-8.2.  I created a failover domain with 2 nodes, one of
>> them
>> >> preferred, added a postgresql-8 resource and an IP address resource.
>> >> Now
>> >> when I set up a service I run into problems.  The idea is to have
>> both
>> >> the
>> >> postgresql-8 resource and the IP address resource within the service
>> so
>> >> that it will move the floating IP to the other node in the domain and
>> >> start postgres on it.   The IP portion seems to work fine but
>> postgresql
>> >> always fails.  I get these error messages:
>> >>
>> >> Aug 14 15:40:47 cdb6 clurgmgrd[4106]: <notice> Starting disabled
>> service
>> >> 10.10.1.221
>> >> Aug 14 15:40:47 cdb6 clurgmgrd: [4106]: <info> Adding IPv4 address
>> >> 10.10.1.236 to bond0
>> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Starting Service
>> >> postgres-8:cdb6
>> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Starting Service
>> >> postgres-8:cdb6 > Failed
>> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> start on
>> postgres-8:cdb6
>> >> returned 1 (generic error)
>> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <warning> #68: Failed to start
>> >> 10.10.1.221; return value: 1
>> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Stopping service
>> >> 10.10.1.221
>> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <info> Stopping Service
>> >> postgres-8:cdb6
>> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Checking Existence Of
>> File
>> >> /var/run/cluster/postgres-8/postgres-8:cdb6.pid [postgres-8:cdb6] >
>> >> Failed
>> >> - File Doesn't Exist
>> >> Aug 14 15:40:48 cdb6 clurgmgrd: [4106]: <err> Stopping Service
>> >> postgres-8:cdb6 > Failed
>> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> stop on
>> postgres-8:cdb6
>> >> returned 1 (generic error)
>> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #12: RG 10.10.1.221
>> failed
>> >> to
>> >> stop; intervention required
>> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <notice> Service 10.10.1.221 is
>> >> failed
>> >> Aug 14 15:40:48 cdb6 clurgmgrd[4106]: <crit> #13: Service 10.10.1.221
>> >> failed to stop cleanly
>> >>
>> >>
>> >> Here are the relavent portions of cluster.conf:
>> >>
>> >>                <failoverdomains>
>> >>                         <failoverdomain name="cdb6_pgsql" ordered="1"
>> >> restricted="1">
>> >>                                 <failoverdomainnode name="cdb6"
>> >> priority="1"/>
>> >>                                 <failoverdomainnode name="cdb5"
>> >> priority="2"/>
>> >>                         </failoverdomain>
>> >>                 </failoverdomains>
>> >>                 <resources>
>> >>                         <clusterfs device="/dev/mapper/gfsvol1"
>> >> force_unmount="0" fsid="36155" fstype="gfs"
>> >> mountpoint="/mnt/gfs" name="gfs1" options=""/>
>> >>                         <postgres-8
>> >> config_file="/var/lib/pgsql/data/postgresql.conf"
>> >> name="cdb6" postmaster_options=""
>> >> postmaster_user="postgres" shutdown_wait=""/>
>> >>                         <ip address="10.10.1.236" monitor_link="1"/>
>> >>                         <script file="/etc/rc.d/init.d/postgresql"
>> >> name="postgres"/>
>> >>                 </resources>
>> >>                 <service autostart="0" domain="cdb6_pgsql"
>> >> name="10.10.1.221" recovery="relocate">
>> >>                         <ip ref="10.10.1.236"/>
>> >>                         <postgres-8 ref="cdb6"/>
>> >>                 </service>
>> >>
>> >> I looked into the init script failing to return zero issue and I
>> don't
>> >> think that is it.  My postgresql startup script doesn't even show
>> that
>> >> there was an attempt to start it.  When setting up a postgresql-8
>> >> resource
>> >> I assume that 'config file' is the config file for postgres or is it
>> the
>> >> startup script '/etc/rc.d/init.d/postgresql'?  Neither works for me.
>> >>
>> >> Any tips would be greatly appreciated.
>> >>
>> >> Brad Crotchett
>>
>> I tried that as well and it still fails.  Does rgmanager already know to
>> use start|stop|restart with the init script?
>>
>> Thanks,
>>
>> Brad Crotchett
>> brad at bradandkim.net
>> http://www.bradandkim.net

Thanks for the suggestions, but I have looked at all of the
replication/availability solutions for postgres and determined this is the
way I want to go.

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From brad at bradandkim.net  Wed Aug 15 14:34:47 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Wed, 15 Aug 2007 09:34:47 -0500 (CDT)
Subject: [Linux-cluster] Postgresql service on RHCS
In-Reply-To: <Pine.LNX.4.64.0708150403390.10386@community1.commandprompt.com>
References: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>
	<46C21570.9050406@cmiware.com>
	<54842.129.237.174.117.1187126235.squirrel@webmail.bradandkim.net>
	<200708151359.22202.mm@yuhu.biz>
	<Pine.LNX.4.64.0708150403390.10386@community1.commandprompt.com>
Message-ID: <38273.129.237.174.117.1187188487.squirrel@webmail.bradandkim.net>


>
> Hi,
>
> On Wed, 15 Aug 2007, Marian Marinov wrote:
>
>> Have you tried to use PGCluster
>> (http://pgcluster.projects.postgresql.org/) or
>
> It is not a shared-disk system.
>
>> Postgres-R (http://www.postgres-r.org/about/about).
>
> This project has not released any public code yet.
>
> Regards,
> --
> Devrim G?ND?Z
> devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
>                        http://www.gunduz.org

I saw that you have a PDF about this kind of setup.  Have you setup a
failover service for Postgres on Cluster Suite?

Thanks,

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From brad at bradandkim.net  Wed Aug 15 14:58:43 2007
From: brad at bradandkim.net (brad at bradandkim.net)
Date: Wed, 15 Aug 2007 09:58:43 -0500 (CDT)
Subject: [Linux-cluster] Postgresql service on RHCS RESOLVED
In-Reply-To: <35673.129.237.174.117.1187188290.squirrel@webmail.bradandkim.net>
References: <43600.129.237.174.117.1187124097.squirrel@webmail.bradandkim.net>
	<46C21570.9050406@cmiware.com>
	<54842.129.237.174.117.1187126235.squirrel@webmail.bradandkim.net>
	<200708151359.22202.mm@yuhu.biz>
	<35673.129.237.174.117.1187188290.squirrel@webmail.bradandkim.net>
Message-ID: <56663.129.237.174.117.1187189923.squirrel@webmail.bradandkim.net>



> Thanks for the suggestions, but I have looked at all of the
> replication/availability solutions for postgres and determined this is the
> way I want to go.
>
> Brad Crotchett

I have this working now.  It seems like I was caught in a loop where
neither of the nodes could successfully shut down the service since the
service wasn't running anywhere.  A fresh look at it this morning,
disabling and then re-enabling the service and it now starts up fine,
fences the failed primary, fails over to secondary, and then fails back to
primary when it comes back.  Everything looks good.  Now I just need to
give it a whirl with a shared partition from the SAN.

Thanks,

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net



From rmccabe at redhat.com  Wed Aug 15 15:59:56 2007
From: rmccabe at redhat.com (Ryan McCabe)
Date: Wed, 15 Aug 2007 11:59:56 -0400
Subject: [Linux-cluster] modclusterd memory leak
In-Reply-To: <46C1C7EB.8090300@cmiware.com>
References: <46C1C7EB.8090300@cmiware.com>
Message-ID: <20070815155956.GA11255@redhat.com>

On Tue, Aug 14, 2007 at 10:19:07AM -0500, Chris Harms wrote:
> We installed the 5.1 Beta RPMs of the cluster suite and have left our 
> cluster running unfettered for over a week. It now appears modclusterd 
> has a slow memory leak.  Its consuming 1.5% and climbing of our 16GB of 
> RAM which is up from 1.3% yesterday.  I would be happy to do some tests 
> and send along the results.  Please advise.

Hi,

Thanks for the report. I've created a bugzilla ticket to track this
here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=252357


Ryan



From Alexandre.Racine at mhicc.org  Wed Aug 15 16:31:45 2007
From: Alexandre.Racine at mhicc.org (Alexandre Racine)
Date: Wed, 15 Aug 2007 12:31:45 -0400
Subject: [Linux-cluster] gfs2 doc and performance question
References: <OF1AFC40D6.54F9CCA4-ON85257338.000FAFE1-85257338.001002B3@csc.com>
Message-ID: <C43CF0825BF59D4FBC1F6A2AF45EB88D3C3419@cumulonimbus.RG.local>

I don't know the answer to all your questions, but the blocksize is the first option in the man page of gfs_mkfs

# man gfs_mkfs

DESCRIPTION
       gfs_mkfs is used to create a Global File System.


OPTIONS
       -b BlockSize
              Set  the  filesystem block size to BlockSize (must be a power of two).  The minimum block size
              is 512.  The FS block size cannot exceed the machine's memory page size.  On the  most  archi-
              tectures  (i386, x86_64, s390, s390x), the memory page size is 4096 bytes.  On other architec-
              tures it may be bigger.  The default block size is 4096 bytes.  In  general,  GFS  filesystems
              should not deviate from the default value.
[...]



Alexandre Racine
Projets sp?ciaux
514-461-1300 poste 3304
alexandre.racine at mhicc.org



-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of Mahmoud Hanafi
Sent: Tue 2007-08-14 22:54
To: linux clustering
Subject: [Linux-cluster] gfs2 doc and performance question
 
I been trying to find documentation on gfs2. but it seem that there isn't 
much. 

Is there any detail documentation on gfs2?
Can one set the block size when creating gfs2 filesystem?
Has there been any performance benchmark of gfs2?


Thanks,
Mahmoud Hanafi
Sr. System Administrator
CSC HPC COE
Bld. 676
2435 Fifth Street
WPAFB, Ohio 45433
(937) 255-1536



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 3706 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070815/bbd00c2c/attachment.bin>

From mhanafi at csc.com  Wed Aug 15 16:46:40 2007
From: mhanafi at csc.com (Mahmoud Hanafi)
Date: Wed, 15 Aug 2007 12:46:40 -0400
Subject: [Linux-cluster] gfs2 doc and performance question
In-Reply-To: <C43CF0825BF59D4FBC1F6A2AF45EB88D3C3419@cumulonimbus.RG.local>
Message-ID: <OFB379ABB9.2150E74E-ON85257338.005BF89B-85257338.005C2B8F@csc.com>

The man page you point out is for gfs1. The man page for mkfs.gfs2 doesn't 
have the "-b" option for blocksize. That goes to my point about 
documentation.

-Mahmoud

Mahmoud Hanafi
Sr. System Administrator
CSC HPC COE
Bld. 676
2435 Fifth Street
WPAFB, Ohio 45433
(937) 255-1536



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------




"Alexandre Racine" <Alexandre.Racine at mhicc.org> 
Sent by: linux-cluster-bounces at redhat.com
08/15/2007 12:31 PM
Please respond to
linux clustering <linux-cluster at redhat.com>


To
"linux clustering" <linux-cluster at redhat.com>
cc

Subject
RE: [Linux-cluster] gfs2 doc and performance question






I don't know the answer to all your questions, but the blocksize is the 
first option in the man page of gfs_mkfs

# man gfs_mkfs

DESCRIPTION
       gfs_mkfs is used to create a Global File System.


OPTIONS
       -b BlockSize
              Set  the  filesystem block size to BlockSize (must be a 
power of two).  The minimum block size
              is 512.  The FS block size cannot exceed the machine's 
memory page size.  On the  most  archi-
              tectures  (i386, x86_64, s390, s390x), the memory page size 
is 4096 bytes.  On other architec-
              tures it may be bigger.  The default block size is 4096 
bytes.  In  general,  GFS  filesystems
              should not deviate from the default value.
[...]



Alexandre Racine
Projets sp?ciaux
514-461-1300 poste 3304
alexandre.racine at mhicc.org



-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of Mahmoud Hanafi
Sent: Tue 2007-08-14 22:54
To: linux clustering
Subject: [Linux-cluster] gfs2 doc and performance question
 
I been trying to find documentation on gfs2. but it seem that there isn't 
much. 

Is there any detail documentation on gfs2?
Can one set the block size when creating gfs2 filesystem?
Has there been any performance benchmark of gfs2?


Thanks,
Mahmoud Hanafi
Sr. System Administrator
CSC HPC COE
Bld. 676
2435 Fifth Street
WPAFB, Ohio 45433
(937) 255-1536



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 

e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070815/75aab9d3/attachment.htm>

From Alexandre.Racine at mhicc.org  Wed Aug 15 16:52:19 2007
From: Alexandre.Racine at mhicc.org (Alexandre Racine)
Date: Wed, 15 Aug 2007 12:52:19 -0400
Subject: [Linux-cluster] dlm Invalid module format
Message-ID: <C43CF0825BF59D4FBC1F6A2AF45EB88D3C341B@cumulonimbus.RG.local>

Hi people,

I am having difficulties to startup gfs with Gentoo.
I am currently running 2.6.20-gentoo-r8 with "GFS2 file system support". I have try with or without "GFS2 "nolock" locking module" and "GFS2 DLM locking module" but it is always the same result.

When starting dlm, I get this :

+ /etc/init.d/dlm start
 * Loading dlm module ...
 * Loading dlm kernel module ...
FATAL: Error inserting dlm (/lib/modules/2.6.20-gentoo-r8/kernel/cluster/dlm.ko): Invalid module format
 * Failed to load dlm kernel module                                                        [ !! ]

Looking from dmesg gives this error : "dlm: exports duplicate symbol dlm_new_lockspace (owned by kernel)"

Does someone have any ideas what is happening here? Thanks.




Alexandre Racine
Projets sp?ciaux
514-461-1300 poste 3304
alexandre.racine at mhicc.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070815/64ff2dcb/attachment.htm>

From rpeterso at redhat.com  Wed Aug 15 17:14:12 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Wed, 15 Aug 2007 12:14:12 -0500
Subject: [Linux-cluster] gfs2 doc and performance question
In-Reply-To: <OFB379ABB9.2150E74E-ON85257338.005BF89B-85257338.005C2B8F@csc.com>
References: <OFB379ABB9.2150E74E-ON85257338.005BF89B-85257338.005C2B8F@csc.com>
Message-ID: <1187198052.27982.3.camel@technetium.msp.redhat.com>

On Wed, 2007-08-15 at 12:46 -0400, Mahmoud Hanafi wrote:
> 
> The man page you point out is for gfs1. The man page for mkfs.gfs2
> doesn't have the "-b" option for blocksize. That goes to my point
> about documentation. 
> 
> -Mahmoud 
> 
> Mahmoud Hanafi

Hi Mahmoud,

The latest mkfs.gfs2 in cvs has the -b option again, but
I'm not sure which packages that change made it in to.
I might be suggest a package if you tell me what platform,
otherwise you can build it from the source.
We temporarily disabled it while we were working on
alternate block sizes.  

If you're going to use alternate block sizes, you should 
make sure the gfs2 kernel module you're using will support
it properly.  Are you compiling the gfs2 kernel code from
the nmw git tree?  Or are you just using an installed package?

Regards,

Bob Peterson
Red Hat Cluster Suite




From mhanafi at csc.com  Thu Aug 16 02:16:57 2007
From: mhanafi at csc.com (Mahmoud Hanafi)
Date: Wed, 15 Aug 2007 22:16:57 -0400
Subject: [Linux-cluster] gfs2 doc and performance question
In-Reply-To: <1187198052.27982.3.camel@technetium.msp.redhat.com>
Message-ID: <OF4E43DB2C.9B27B16C-ON85257339.000C4930-85257339.000C8B8D@csc.com>

I am using the current kernel.org kernel with gfs2 (2.6.22.2). 

What about documentation? for example is there any doc describing 
gfs2_tool tuning option? 

Mahmoud Hanafi
Sr. System Administrator
CSC HPC COE
Bld. 676
2435 Fifth Street
WPAFB, Ohio 45433
(937) 255-1536



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is a PRIVATE message. If you are not the intended recipient, please 
delete without copying and kindly advise us by e-mail of the mistake in 
delivery. NOTE: Regardless of content, this e-mail shall not operate to 
bind CSC to any order or other contract unless pursuant to explicit 
written agreement or government initiative expressly permitting the use of 
e-mail for such purpose.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------




Bob Peterson <rpeterso at redhat.com> 
Sent by: linux-cluster-bounces at redhat.com
08/15/2007 01:14 PM
Please respond to
rpeterso at redhat.com; Please respond to
linux clustering <linux-cluster at redhat.com>


To
linux clustering <linux-cluster at redhat.com>
cc
linux-cluster-bounces at redhat.com
Subject
RE: [Linux-cluster] gfs2 doc and performance question






On Wed, 2007-08-15 at 12:46 -0400, Mahmoud Hanafi wrote:
> 
> The man page you point out is for gfs1. The man page for mkfs.gfs2
> doesn't have the "-b" option for blocksize. That goes to my point
> about documentation. 
> 
> -Mahmoud 
> 
> Mahmoud Hanafi

Hi Mahmoud,

The latest mkfs.gfs2 in cvs has the -b option again, but
I'm not sure which packages that change made it in to.
I might be suggest a package if you tell me what platform,
otherwise you can build it from the source.
We temporarily disabled it while we were working on
alternate block sizes. 

If you're going to use alternate block sizes, you should 
make sure the gfs2 kernel module you're using will support
it properly.  Are you compiling the gfs2 kernel code from
the nmw git tree?  Or are you just using an installed package?

Regards,

Bob Peterson
Red Hat Cluster Suite


--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070815/7e80e005/attachment.htm>

From kjoyeux at jouy.inra.fr  Thu Aug 16 07:15:58 2007
From: kjoyeux at jouy.inra.fr (kieran JOYEUX)
Date: Thu, 16 Aug 2007 09:15:58 +0200
Subject: [Linux-cluster] NFS failover problem
Message-ID: <46C3F9AE.5060702@jouy.inra.fr>

Hi guys,

I am implementing a two node cluster sharing via NFS, their local 
storage to one client.
At the moment, i am simulating a failover during a copy from the NFS 
server to the local client disk.

The first time i got a NFS file handle error. I tried to use a 
Filesystem ID (fsid) on the mount parameter of the client but now here 
is my issue:

[root@**** ****]# time cp 1Go.t* /usr
cp: reading `1Go.tyt': Input/output error


My cluster.conf :
<?xml version="1.0"?>
<cluster alias="mig_nfs" config_version="128" name="mig_nfs">
      <fence_daemon post_fail_delay="0" post_join_delay="3"/>
      <clusternodes>
              <clusternode name="ha1" votes="1">
                      <fence>
                              <method name="1">
                                      <device name="barriere" 
nodename="ha1"/>
                              </method>
                      </fence>
              </clusternode>
              <clusternode name="ha2" votes="1">
                      <fence>
                              <method name="1">
                                      <device name="barriere" 
nodename="ha2"/>
                              </method>
                      </fence>
              </clusternode>
      </clusternodes>
      <cman expected_votes="1" two_node="1"/>
      <fencedevices>
              <fencedevice agent="fence_manual" name="barriere"/>
      </fencedevices>
      <rm>
              <failoverdomains>
                      <failoverdomain name="mig_fod" ordered="1" 
restricted="0">
                              <failoverdomainnode name="ha1" priority="2"/>
                              <failoverdomainnode name="ha2" priority="1"/>
                      </failoverdomain>
              </failoverdomains>
              <resources>
                      <ip address="138.102.22.33" monitor_link="1"/>
                      <nfsexport name="/usr/local/genome"/>
                      <nfsclient name="mig" options="ro,fsid=20" 
path="/usr/local/genome" target="138.102.22.0/255.255.192.0"/>
                      <nfsclient name="mig213" options="fsid=213,ro" 
path="/usr/local/genome" target="138.102.22.213"/>
                      <nfsclient name="mig217" options="ro,fsid=217" 
path="/usr/local/genome" target="138.102.22.217"/>
              </resources>
              <service autostart="1" domain="mig_fod" name="nfs">
                      <ip ref="138.102.22.33"/>
                      <nfsexport ref="/usr/local/genome"/>
                      <nfsclient ref="mig"/>
              </service>
      </rm>
</cluster>


If you have any ideas or remarks, i would love to hear them. Thanks a lot.

Kieran



From maciej.bogucki at artegence.com  Thu Aug 16 14:16:53 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Thu, 16 Aug 2007 16:16:53 +0200
Subject: [Linux-cluster] DLM in recover state - node can't connect to cluster
Message-ID: <46C45C55.9040806@artegence.com>

Hello,

I have five node cluster. Node05 failed(kernel panic), and fencing
failed. When I rebooted failed node05, it can't connect to cluster and
filesystem is locked, because it is in the recover state. I need to
reboot all nodes to recover cluster.

On node05 I get "fenced: startup failed"

Here is the output form another node in cluster:

---cut---
[root at node03 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run
U-1,10,1
[2 3 5 4]

DLM Lock Space:  "clvmd"                             2   3 run
U-1,10,1
[2 3 5 4]

DLM Lock Space:  "repository"                        3   4 recover 2 -
[2 3 5 4]

GFS Mount Group: "repository"                        4   5 recover 0 -
[2 3 5 4]

[root at node03 ~]#
---cut---

What does mean "U-1,10,1"?

Here is some information form cluster.conf

---cut---
<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
<cman expected_votes="3" deadnode_timeout="120" hello_timer="10"/>
---cut---

I don't have the latest cman, fence, dlm, and kernel, so maybe it is a
problem?

cman-1.0.11-0
fence-1.32.25-1
dlm-1.0.1-1
kernel-smp-2.6.9-42.0.3.EL

Best Regards
Maciej Bogucki



From maciej.bogucki at artegence.com  Thu Aug 16 14:45:30 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Thu, 16 Aug 2007 16:45:30 +0200
Subject: [Linux-cluster] DLM in recover state - node can't connect to
	cluster
In-Reply-To: <46C45C55.9040806@artegence.com>
References: <46C45C55.9040806@artegence.com>
Message-ID: <46C4630A.4070007@artegence.com>

Maciej Bogucki napisa?(a):
> Hello,
> 
> I have five node cluster. Node05 failed(kernel panic), and fencing
> failed. When I rebooted failed node05, it can't connect to cluster and
> filesystem is locked, because it is in the recover state. I need to
> reboot all nodes to recover cluster.
> 
> On node05 I get "fenced: startup failed"
> 
> Here is the output form another node in cluster:
> 
> ---cut---
> [root at node03 ~]# cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           1   2 run
> U-1,10,1
> [2 3 5 4]
> 
> DLM Lock Space:  "clvmd"                             2   3 run
> U-1,10,1
> [2 3 5 4]
> 
> DLM Lock Space:  "repository"                        3   4 recover 2 -
> [2 3 5 4]
> 
> GFS Mount Group: "repository"                        4   5 recover 0 -
> [2 3 5 4]
> 
> [root at node03 ~]#
> ---cut---
> 
> What does mean "U-1,10,1"?
> 
> Here is some information form cluster.conf
> 
> ---cut---
> <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
> <cman expected_votes="3" deadnode_timeout="120" hello_timer="10"/>
> ---cut---
> 
> I don't have the latest cman, fence, dlm, and kernel, so maybe it is a
> problem?
> 
> cman-1.0.11-0
> fence-1.32.25-1
> dlm-1.0.1-1
> kernel-smp-2.6.9-42.0.3.EL
> 

I have found it in the logs also

Aug 16 14:13:44 node03 kernel: dlm: repository: restbl_rsb_update_recv
rsb not found 67098
Aug 16 14:14:07 node03 kernel: dlm: repository: restbl_rsb_update_recv
rsb not found 72602
Aug 16 14:14:23 node03 kernel: dlm: repository: restbl_rsb_update_recv
rsb not found 64752
Aug 16 14:14:23 node03 kernel: dlm: repository: restbl_rsb_update_recv
rsb not found 67108
Aug 16 14:14:23 node03 kernel: dlm: repository: restbl_rsb_update_recv
rsb not found 69654
Aug 16 14:14:23 node03 kernel: dlm: repository: restbl_rsb_update_recv
rsb not found 69781
Aug 16 14:14:23 node03 kernel: dlm: repository: restbl_rsb_update_recv
rsb not found 87705

What does it mean?

Best Regards
Maciej Bogucki



From Paul.McDowell at celera.com  Thu Aug 16 15:35:08 2007
From: Paul.McDowell at celera.com (Paul n McDowell)
Date: Thu, 16 Aug 2007 11:35:08 -0400
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46C3F9AE.5060702@jouy.inra.fr>
Message-ID: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>

Kieran,

I'm currently experiencing a similar problem with an HA NFS server that I 
just built on RHEL4 with GFS. 

I have two different linux clients, one running RHEL3 U8, the other 
running RHEL4 U5 (same as the HA NFS servers)

If I use the same standard mount options on both clients (e.g. mount 
SERVER:/exportfs    /mountpoint -t nfs -o rw,noatime ) then everything 
works fine until I perform a failover.  At that point the RHEL 3 client is 
OK but the RHEL 4 client can no longer stat the filesystem (df hangs).  If 
I move the service back the hung df command completes.  I don't see an I/O 
error per say but any copies to and from that mountpoint are inactive 
until I relocate the service back.

I tried other versions of Unix and found that all of them could stat the 
file system after failover except the RHEL4 U5 version.  The only way 
round this I've found so far is to use the udp protocol instead of tcp 
with version 3 nfs.

So my mount commands look something more like this:

# mount SERVER:/exportfs /mountpoint  -t nfs -o rw,noatime,udp,nfsvers=3

I dont know if you can tolerate udp in your environment but it might be 
worth playing around with.

Regards,

Paul








kieran JOYEUX <kjoyeux at jouy.inra.fr> 
Sent by: linux-cluster-bounces at redhat.com
08/16/2007 03:15 AM
Please respond to
linux clustering <linux-cluster at redhat.com>


To
Linux-cluster at redhat.com
cc

Subject
[Linux-cluster] NFS failover problem






Hi guys,

I am implementing a two node cluster sharing via NFS, their local 
storage to one client.
At the moment, i am simulating a failover during a copy from the NFS 
server to the local client disk.

The first time i got a NFS file handle error. I tried to use a 
Filesystem ID (fsid) on the mount parameter of the client but now here 
is my issue:

[root@**** ****]# time cp 1Go.t* /usr
cp: reading `1Go.tyt': Input/output error


My cluster.conf :
<?xml version="1.0"?>
<cluster alias="mig_nfs" config_version="128" name="mig_nfs">
      <fence_daemon post_fail_delay="0" post_join_delay="3"/>
      <clusternodes>
              <clusternode name="ha1" votes="1">
                      <fence>
                              <method name="1">
                                      <device name="barriere" 
nodename="ha1"/>
                              </method>
                      </fence>
              </clusternode>
              <clusternode name="ha2" votes="1">
                      <fence>
                              <method name="1">
                                      <device name="barriere" 
nodename="ha2"/>
                              </method>
                      </fence>
              </clusternode>
      </clusternodes>
      <cman expected_votes="1" two_node="1"/>
      <fencedevices>
              <fencedevice agent="fence_manual" name="barriere"/>
      </fencedevices>
      <rm>
              <failoverdomains>
                      <failoverdomain name="mig_fod" ordered="1" 
restricted="0">
                              <failoverdomainnode name="ha1" 
priority="2"/>
                              <failoverdomainnode name="ha2" 
priority="1"/>
                      </failoverdomain>
              </failoverdomains>
              <resources>
                      <ip address="138.102.22.33" monitor_link="1"/>
                      <nfsexport name="/usr/local/genome"/>
                      <nfsclient name="mig" options="ro,fsid=20" 
path="/usr/local/genome" target="138.102.22.0/255.255.192.0"/>
                      <nfsclient name="mig213" options="fsid=213,ro" 
path="/usr/local/genome" target="138.102.22.213"/>
                      <nfsclient name="mig217" options="ro,fsid=217" 
path="/usr/local/genome" target="138.102.22.217"/>
              </resources>
              <service autostart="1" domain="mig_fod" name="nfs">
                      <ip ref="138.102.22.33"/>
                      <nfsexport ref="/usr/local/genome"/>
                      <nfsclient ref="mig"/>
              </service>
      </rm>
</cluster>


If you have any ideas or remarks, i would love to hear them. Thanks a lot.

Kieran

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070816/989e54bc/attachment.htm>

From maciej.bogucki at artegence.com  Thu Aug 16 15:55:46 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Thu, 16 Aug 2007 17:55:46 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>
Message-ID: <46C47382.2090504@artegence.com>

> If I use the same standard mount options on both clients (e.g. mount
> SERVER:/exportfs    /mountpoint -t nfs -o rw,noatime ) then everything
> works fine until I perform a failover.  At that point the RHEL 3 client
> is OK but the RHEL 4 client can no longer stat the filesystem (df
> hangs).  If I move the service back the hung df command completes.  I
> don't see an I/O error per say but any copies to and from that
> mountpoint are inactive until I relocate the service back.

Hello,


I suppose that You have problem with NFS lock manager [1 point 4e] or
with NFS statd[1 point 4f].

[1] - http://www.linux-ha.org/DRBD/NFS

Best Regards
Maciej Bogucki



From Paul.McDowell at celera.com  Thu Aug 16 16:32:10 2007
From: Paul.McDowell at celera.com (Paul n McDowell)
Date: Thu, 16 Aug 2007 12:32:10 -0400
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46C47382.2090504@artegence.com>
Message-ID: <OF34CDD63E.36EED658-ON85257339.00580BFA-85257339.005AD5D8@applera.com>

Hi Maciej,

I'm not sure about these recommendations.  I agree with Peter Kruse's 
comments about creating a symbolic link for /var/lib/nfs [1 point 4e].  I 
have multiple NFS services in my cluster each serving a different file 
system so what would I be soft linking /var/lib/nfs to?

And for the NFS statd recommendation [1 point 4f], again I have several 
NFS services running on my cluster, each with their own IP alias, so what 
should I be setting STATD_HOSTNAME to in /etc/sysconfig/network ?

I never had to make either of these changes in my previous NFS server 
running on a RHEL 3 cluster which worked perfectly.

Thanks.

Paul





Maciej Bogucki <maciej.bogucki at artegence.com> 
Sent by: linux-cluster-bounces at redhat.com
08/16/2007 11:55 AM
Please respond to
linux clustering <linux-cluster at redhat.com>


To
linux clustering <linux-cluster at redhat.com>
cc

Subject
Re: [Linux-cluster] NFS failover problem






> If I use the same standard mount options on both clients (e.g. mount
> SERVER:/exportfs    /mountpoint -t nfs -o rw,noatime ) then everything
> works fine until I perform a failover.  At that point the RHEL 3 client
> is OK but the RHEL 4 client can no longer stat the filesystem (df
> hangs).  If I move the service back the hung df command completes.  I
> don't see an I/O error per say but any copies to and from that
> mountpoint are inactive until I relocate the service back.

Hello,


I suppose that You have problem with NFS lock manager [1 point 4e] or
with NFS statd[1 point 4f].

[1] - http://www.linux-ha.org/DRBD/NFS

Best Regards
Maciej Bogucki

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070816/037a32de/attachment.htm>

From hlawatschek at atix.de  Fri Aug 17 07:29:05 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Fri, 17 Aug 2007 09:29:05 +0200
Subject: [Linux-cluster] fencing: external vs watchdog
Message-ID: <200708170929.05517.hlawatschek@atix.de>

Hi,

I'd like to discuss and collect information about the two diffrent fencing 
approaches.

external fencing: The failed cluster node is disconnected from the storage 
device by onother node in the cluster. After a failure detection all cluster 
activities are suspended until the IO fencing of the failed node has been 
completed successfully.

watchdog fencing: A failed cluster node has to recognize the failure by itself 
and will be shut down by a kind of internal watchdog feature.

Now, I see that theoretically the external fencing method (when configured 
correctly) is the betterer approach because of the exactly defined state 
during a fencing and recovery operation.

But the question is: What are real world examples of failures when the 
watchdog fencing would fail and cause data corruption on the storage device ?
I'd like to collect some real world examples and also theoretical approaches.

All comments welcome !

Mark
-- 
Gruss / Regards,

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG



From genis.pujol at miquel.es  Fri Aug 17 08:02:28 2007
From: genis.pujol at miquel.es (Genis Pujol Hamelink)
Date: Fri, 17 Aug 2007 10:02:28 +0200
Subject: [Linux-cluster] a couple of questions about setting up a 2 node
	cluster on VMWare ESX
Message-ID: <6811C4D60177864588C3156A4B56889D012C79@EXCHANGE01.miquel.es>

Hello everybody,
 
I'm setting up a 2 node cluster on VMWare ESX (2 virtual machines running RHEL 3). I have a couple of questions:
 
*Is this configuration supported? is it documented ?
*What is the best approach to configure the fencing mechanism?
 
regards,
 
Gen?s.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070817/4d889c76/attachment.htm>

From maciej.bogucki at artegence.com  Fri Aug 17 10:21:12 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 17 Aug 2007 12:21:12 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <OF34CDD63E.36EED658-ON85257339.00580BFA-85257339.005AD5D8@applera.com>
References: <OF34CDD63E.36EED658-ON85257339.00580BFA-85257339.005AD5D8@applera.com>
Message-ID: <46C57698.9050803@artegence.com>

Paul n McDowell napisa?(a):
> 
> Hi Maciej,
> 
> I'm not sure about these recommendations.  I agree with Peter Kruse's
> comments about creating a symbolic link for /var/lib/nfs [1 point 4e].
>  I have multiple NFS services in my cluster each serving a different
> file system so what would I be soft linking /var/lib/nfs to?
> 
It doesn't matter how many NFS services You have. The state information
for the NFS server is in /var/lib/nfs, and You have to put it on shared
storage.

> And for the NFS statd recommendation [1 point 4f], again I have several
> NFS services running on my cluster, each with their own IP alias, so
> what should I be setting STATD_HOSTNAME to in /etc/sysconfig/network ?

Takene from "man rpc.statd"

---cut---
       -n, --name name
              specify a name for rpc.statd to use as the local hostname.
By default, rpc.statd will call gethostname(2) to get the local
hostname. Specifying  a  local  hostname
              may be useful for machines with more than one interfaces.
----cut---
	
> 
> I never had to make either of these changes in my previous NFS server
> running on a RHEL 3 cluster which worked perfectly.

First please to make sure that it doesn't help You, then seek different
problem. /var/lib/nfs not on stared storage is common problem on Fedora,
SUSE and Red Hat distribution.

Best Regards
Maciej Bogucki



From maciej.bogucki at artegence.com  Fri Aug 17 11:03:03 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Fri, 17 Aug 2007 13:03:03 +0200
Subject: [Linux-cluster] fencing: external vs watchdog
In-Reply-To: <200708170929.05517.hlawatschek@atix.de>
References: <200708170929.05517.hlawatschek@atix.de>
Message-ID: <46C58067.7080704@artegence.com>

Mark Hlawatschek napisa?(a):
> Hi,
> 
> I'd like to discuss and collect information about the two diffrent fencing 
> approaches.
> 
> external fencing: The failed cluster node is disconnected from the storage 
> device by onother node in the cluster. After a failure detection all cluster 
> activities are suspended until the IO fencing of the failed node has been 
> completed successfully.
> 
> watchdog fencing: A failed cluster node has to recognize the failure by itself 
> and will be shut down by a kind of internal watchdog feature.
> 
> Now, I see that theoretically the external fencing method (when configured 
> correctly) is the betterer approach because of the exactly defined state 
> during a fencing and recovery operation.
> 
> But the question is: What are real world examples of failures when the 
> watchdog fencing would fail and cause data corruption on the storage device ?
> I'd like to collect some real world examples and also theoretical approaches.
> 
> All comments welcome !

Hello,

Watchdog fencing isn't good for at least two reasons:
1. Watchodog is a piece of code which run in user space, so You don't
have 100% guarantee that it will run correctly.
2. Watchdog fencing can't protect You against split-brain situations,
where the consequences could be corruption of You data. Here comes
external fencing.

There is another point of view about Linux Clusters and other Commercial
Clusters(fe. Sun Cluster). Linux Cluster resist in user-space so You
don't have guarantee that local fencing will run ok, and You need
exteral fencing to resolve this main problem. Sun Cluster resist in
kernel-space, so when one node lost quorum it do "kernel panic" and You
have 100% guarantee that it will success.
For me network fencing(IPMI,DRAC,...) isn't good, because You have to
connect via network and it could fail, and so on. The best fencing
mechanism is fence_scsi, which is an I/O fencing agent. I can be used
with the SCSI devices that support persistent reservations (SPC-2 or
greater). In more cases You have shares storages taht support SPC-2 or
SPC-3.

Best Regards
Maciej Bogucki



From nattaponv at hotmail.com  Fri Aug 17 11:26:57 2007
From: nattaponv at hotmail.com (nattapon viroonsri)
Date: Fri, 17 Aug 2007 11:26:57 +0000
Subject: [Linux-cluster] Is Ntp syncronize effect Gfs 
Message-ID: <BAY123-F25DA91477DD3269EFEAFDDA6D80@phx.gbl>



Is ntp syncronize between node in gfs cluster nescessary ?

And,  Is Time difference between node in cluster can cause of "Miss too many 
heartbeat" in logfile ?




Regards,
Nattapon

_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today it's FREE! 
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/



From Alexandre.Racine at mhicc.org  Fri Aug 17 15:53:17 2007
From: Alexandre.Racine at mhicc.org (Alexandre Racine)
Date: Fri, 17 Aug 2007 11:53:17 -0400
Subject: [Linux-cluster] dlm Invalid module format [SOLVED]
References: <C43CF0825BF59D4FBC1F6A2AF45EB88D3C341B@cumulonimbus.RG.local>
Message-ID: <C43CF0825BF59D4FBC1F6A2AF45EB88D3C341E@cumulonimbus.RG.local>

In the good spirit of openess I'll answer my own question.

The problem was that there is another place in the kernel in the file system menu where there is a DLM. You have to remove that one too for dlm-kernel to load.

"Recompile and recompile until it is perfect!" :)



Alexandre Racine
Projets sp?ciaux
514-461-1300 poste 3304
alexandre.racine at mhicc.org



-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of Alexandre Racine
Sent: Wed 2007-08-15 12:52
To: linux-cluster at redhat.com
Subject: [Linux-cluster] dlm Invalid module format
 
Hi people,

I am having difficulties to startup gfs with Gentoo.
I am currently running 2.6.20-gentoo-r8 with "GFS2 file system support". I have try with or without "GFS2 "nolock" locking module" and "GFS2 DLM locking module" but it is always the same result.

When starting dlm, I get this :

+ /etc/init.d/dlm start
 * Loading dlm module ...
 * Loading dlm kernel module ...
FATAL: Error inserting dlm (/lib/modules/2.6.20-gentoo-r8/kernel/cluster/dlm.ko): Invalid module format
 * Failed to load dlm kernel module                                                        [ !! ]

Looking from dmesg gives this error : "dlm: exports duplicate symbol dlm_new_lockspace (owned by kernel)"

Does someone have any ideas what is happening here? Thanks.




Alexandre Racine
Projets sp?ciaux
514-461-1300 poste 3304
alexandre.racine at mhicc.org


-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 3355 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070817/abbd680a/attachment.bin>

From JESmall at directv.com  Fri Aug 17 15:57:53 2007
From: JESmall at directv.com (Small, Jon E)
Date: Fri, 17 Aug 2007 08:57:53 -0700
Subject: [Linux-cluster] Problem with latest fence_apc
Message-ID: <40CEA9CB6E0E8444B59DD4F4E848F71061AF79@CORVUS.LA.FRD.DIRECTV.COM>

I installed the latest RPM (fence-1.32.45-1.0.2.i686.rpm) that includes
the fence_apc fix for APC firmware 3.x and it still does not work in our
environment. The Status action works, but Off & On do not work.

 

Here is the environment information:

-          RedHat 4 Update #5

-          APC model #AP7968

-          APC Firmware 3.3.3

 

Attached is the copy of /tmp/apclog from an execution of an Off action:

./fence_apc -a 10.13.2.146 -l apc -o Off -n 3 -p apc -v

 

 

Incidentally, I attempted to use the fence_apc_snmp as well, but it
fails with the following statement:

 

./fence_apc_snmp -a 10.13.2.146 -c private -n 3 -o Status -v

called with {'udpport': '161', 'option': 'status', 'ipaddr':
'10.13.2.146', 'community': 'private', 'switch': '', 'port': '3'}

invalid status outletStatusOff

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070817/ef45c770/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: apclog
Type: application/octet-stream
Size: 61947 bytes
Desc: apclog
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070817/ef45c770/attachment.obj>

From lhh at redhat.com  Fri Aug 17 16:02:53 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:02:53 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070803130957.GL21896@helsinki.fi>
References: <20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
	<20070731134121.GH4955@redhat.com>
	<20070731145441.GE21896@helsinki.fi>
	<20070802154818.GB26367@redhat.com>
	<20070803130957.GL21896@helsinki.fi>
Message-ID: <20070817160253.GL5853@redhat.com>

On Fri, Aug 03, 2007 at 04:09:58PM +0300, Janne Peltonen wrote:
> On Thu, Aug 02, 2007 at 11:48:19AM -0400, Lon Hohberger wrote:
> > That's quite an impressive cluster.conf...  I'm going to look at it
> > some.
> 
> One thing that makes it longer than strictly necessary is the fact that
> each service has its own prioritized failover domain. I could, of
> course, just have one failover domain for each possible permutation of
> nodes pcn1-hb..pcn4-hb... on the other hand, I feel safer to edit the
> failover domain specification on a live system than to edit the service
> specification (to change the failover domain) - I don't know whether
> rgmanager would want to restart a service if I changed its failover
> domain (would it?), but I know it doesn't restart a service if I edit
> the failover domain specification...

It will restart the service currently; there's an open bz against rhel4
to fix it (not strictly failover domains - but generally any attr. of
the service except its name should not cause a restart; it's fixed in
5.1).


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 16:06:33 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:06:33 -0400
Subject: [Linux-cluster] fs.sh?
In-Reply-To: <20070815134318.GA3922@nlxdcldnl2.cl.intel.com>
References: <20070706163152.GH1681@redhat.com>
	<20070706182930.GE5981@redhat.com>
	<20070706183151.GF5981@redhat.com>
	<20070706183658.GA24692@helsinki.fi>
	<20070710221922.GG18076@redhat.com>
	<20070731121438.GA21896@helsinki.fi>
	<20070731134121.GH4955@redhat.com>
	<20070731145441.GE21896@helsinki.fi>
	<20070802154818.GB26367@redhat.com>
	<20070815134318.GA3922@nlxdcldnl2.cl.intel.com>
Message-ID: <20070817160632.GM5853@redhat.com>

On Wed, Aug 15, 2007 at 06:43:18AM -0700, Lombard, David N wrote:
> On Thu, Aug 02, 2007 at 11:48:19AM -0400, Lon Hohberger wrote:
> > On Tue, Jul 31, 2007 at 05:54:41PM +0300, Janne Peltonen wrote:
> > Yeah, it does hit a lot of subshells.  Awks, seds, and the like.  Some
> > pattern substitution and matching can be done in pure bash.
> 
> Perhaps a *lot* more than you likely think...
> 
> Sadly, this is the common case.  Probably 99.99% of all shell scripts
> that I see are still Bourne scripts or very close to it.  The bash shell
> is SO MUCH more powerful than Bourne, but it gets used in such a clumsy
> and trivial manner that wastes far too many processes and far too many
> cycles.
> 
> And yes, I have submitted patches when the abuse was truly egregious,
> but such updates are universally ignored as there is no actual error to
> point to.

Did you have any for this in particular?

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 16:07:31 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:07:31 -0400
Subject: [Linux-cluster] clustat on GULM
In-Reply-To: <6596a7c70708031150xe002365y8b04e0ded11b7edc@mail.gmail.com>
References: <6596a7c70707310722l24a10059l9309229ba7c534fc@mail.gmail.com>
	<20070802155432.GD26367@redhat.com>
	<6596a7c70708031150xe002365y8b04e0ded11b7edc@mail.gmail.com>
Message-ID: <20070817160731.GN5853@redhat.com>

On Fri, Aug 03, 2007 at 02:50:45PM -0400, siman hew wrote:
> Entered a bug.Bugzilla Bug
> 250811<https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250811>:
> clustat report wrong information about rgmanager on a GULM clusterHope it
> will be solved soon, which can save some work of mine, since I can not
> trust/rely on clustat right now.

Possibly.  It's not exactly an easy thing to layer pseudo-group support
on top of gulm...

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 16:10:31 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:10:31 -0400
Subject: [Linux-cluster] RHEL5.1 beta and qdisk
In-Reply-To: <1186124530.5993.2.camel@f2821966>
References: <1186044963.6514.4.camel@f2821966>
	<20070802155911.GH26367@redhat.com>
	<1186124530.5993.2.camel@f2821966>
Message-ID: <20070817161031.GO5853@redhat.com>

On Fri, Aug 03, 2007 at 09:02:10AM +0200, Jacques Botha wrote:
> No fencing being attempted, it just sits there, and the last thing in
> the log is qdisk upgrading the status of the cluser:
> 
> 
> Aug  2 10:36:21 fnbgw01 qdiskd[3519]: <info> Quorum Partition: /dev/sdb1
> Label: fnbgw_qdisk 
> Aug  2 10:36:21 fnbgw01 qdiskd[3520]: <info> Quorum Daemon Initializing 
> Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
> 172.20.28.193 -c3 -t1' UP 
> Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
> 172.20.28.195 -c3 -t1' UP 
> Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
> 172.20.28.196 -c3 -t1' UP 
> Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
> 172.20.28.197 -c3 -t1' UP 
> Aug  2 10:36:24 fnbgw01 qdiskd[3520]: <info> Heuristic: 'ping
> 172.20.28.198 -c3 -t1' UP 
> Aug  2 10:36:31 fnbgw01 qdiskd[3520]: <info> Initial score 6/6 
> Aug  2 10:36:31 fnbgw01 qdiskd[3520]: <info> Initialization complete 
> Aug  2 10:36:31 fnbgw01 openais[3469]: [CMAN ] quorum device registered 
> Aug  2 10:36:31 fnbgw01 qdiskd[3520]: <notice> Score sufficient for
> master operation (6/6; required=3); upgrading 
> Aug  2 10:36:45 fnbgw01 qdiskd[3520]: <info> Assuming master role 
> 
> And after that you can't get any joy.

Does it happen w/o qdisk?

Note that with qdisk in a 2-node cluster, you want expected_votes = 3,
two_node = 0, one vote per node and 1 vote for qdisk. 

In a 2-node cluster w/o qdisk, you want expected_votes = 1, two_node =
1.  Fencing is required whenever two_node = 1, even if it's manual
fencing.


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 16:12:36 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:12:36 -0400
Subject: [Linux-cluster] Priority in failover domains sometimes not
	honoured
In-Reply-To: <200708031456.l73EuJM22161@xos037.xos.nl>
References: <200708031456.l73EuJM22161@xos037.xos.nl>
Message-ID: <20070817161236.GP5853@redhat.com>

On Fri, Aug 03, 2007 at 04:56:19PM +0200, Jos Vos wrote:
> Hi,
> 
> I have the following failover domains:
> 
>       <failoverdomains>
>               <failoverdomain name="prio_host1" ordered="1" restricted="1">
>                       <failoverdomainnode name="host1" priority="1"/>
>                       <failoverdomainnode name="host2" priority="2"/>
>               </failoverdomain>
>               <failoverdomain name="prio_host2" ordered="1" restricted="1">
>                       <failoverdomainnode name="host1" priority="2"/>
>                       <failoverdomainnode name="host2" priority="1"/>
>               </failoverdomain>
>       </failoverdomains>
> 
> But often a service for, say, "prio_host2" stays running on host1,
> (or vice versa), while both hosts are running fine.  In a "stable"
> situation, that service should be moved to host2, AFAIK (and I
> have seen this happen in some situations).
> 
> This is on RHEL 5.0.
> 
> Is this a bug or...?

It sounds like one.  I think I have a fix for it in 5.1 beta as a side
effect of another bug.  I added a smoothing delay for membership
transitions.

Janne was experiencing the same issue, IIRC.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 16:13:32 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:13:32 -0400
Subject: [Linux-cluster] rgmanager 5.1 Beta Segfault
In-Reply-To: <46B62617.8000104@cmiware.com>
References: <46B62617.8000104@cmiware.com>
Message-ID: <20070817161332.GQ5853@redhat.com>

On Sun, Aug 05, 2007 at 02:33:43PM -0500, Chris Harms wrote:
> 5.1 Beta RPMs of cluster suite have so far resolved our dual fencing and 
> self-reboot issues.  The cluster was stopped and sat overnight, and when 
> I started it via Conga today, one node came up fine, but the other 
> logged the following:
> 
> kernel: clurgmgrd[32233]: segfault at 0000000000000048 rip 
> 0000000000419804 rsp 0000000040a00030 error 4
> 
> rgmanager-2.0.28-1.el5
> cman-2.0.70-1.el5
> 

Keep trying on this one; we'll keep trying on our end too.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 16:14:46 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:14:46 -0400
Subject: [Linux-cluster] nofailback for failover domains?
In-Reply-To: <926ab61b0708060631h5aa3ef0fu2fe183e80df752f6@mail.gmail.com>
References: <926ab61b0708060631h5aa3ef0fu2fe183e80df752f6@mail.gmail.com>
Message-ID: <20070817161446.GR5853@redhat.com>

On Mon, Aug 06, 2007 at 09:31:16AM -0400, Bjorn Oglefjorn wrote:
> I found that a 'nofailback' option was added for the <failoverdomains>
> section of the conf.  I can't find any reference to 'nofailback' in any RHCS
> doc I can find.  I'm guessing it should look like this:
> 
>     <failoverdomain name="test_failover_domain" ordered="1" restricted="1"
> nofailback="1">
>         ...
>     </failoverdomain>
> 
> Can someone confirm?  I will attempt to confirm this myself and will report
> back when I know for sure, but it seems to behave as I would expect.

Josef Bacik wrote it; IIRC it works only in ordered failover domains,
and doesn't prevent relocation to a node in the domain if it's running
on a node outside the domain.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 16:16:52 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:16:52 -0400
Subject: [Linux-cluster] CS4 U5 / RHEL4 U4 ?
In-Reply-To: <46BFF8C4.90409@bull.net>
References: <46BFF8C4.90409@bull.net>
Message-ID: <20070817161652.GS5853@redhat.com>

On Mon, Aug 13, 2007 at 08:23:00AM +0200, Alain Moulle wrote:
> Hi
> 
> Is there any incompatibility to re-build the CS4 U5 for RHEL4 U4 ?

Not sure; with most of the userland bits, my guess would be no.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 16:20:44 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:20:44 -0400
Subject: [Linux-cluster] Load peaks - caused by the cluster?
In-Reply-To: <20070813090703.GR17564@helsinki.fi>
References: <20070813090703.GR17564@helsinki.fi>
Message-ID: <20070817162044.GT5853@redhat.com>

On Mon, Aug 13, 2007 at 12:07:04PM +0300, Janne Peltonen wrote:
> 
> Next, I think I'm going to put an exit 0 to the status checks of ip.sh
> (and see if the elfs go away). Then I'm going to start wondering if the
> cluster'd notice our server room falling apart... ;)

There are certainly plenty of exec()s in ip.sh.

> Any suggestions? At this point, I'm not any more even certain whether
> the problem lies within the cluster. On the other hand, since I see no
> difference at the process level during peak and no-peak time, the
> difference must (as far as I understand) be inside kernel. So it can't
> be my application. So it must be the cluster, mustn't it?


One other thing you could do is spread out the check intervals for
resources.  For example:

  <service name="test">
   <fs name="foo" device="..." ...>
      <action name="status" depth="*" interval="3600" />
   </fs>
  </service>

If you do find that it's ip.sh, note that in your bz about fs.sh
actually.

I don't understand why it happens in 11-hour cycles though.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 16:22:12 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:22:12 -0400
Subject: [Linux-cluster] rgmanager ceases to send syslog messages
In-Reply-To: <1187117199.14904.12.camel@xw9300.bidmc.harvard.edu>
References: <1187117199.14904.12.camel@xw9300.bidmc.harvard.edu>
Message-ID: <20070817162212.GU5853@redhat.com>

On Tue, Aug 14, 2007 at 02:46:39PM -0400, Robert Hurst wrote:
> Odd, a member node's rgmanager (clurgmgrd) stopped sending syslog
> messages, in particular, a 'status' message of a service it was running.
> This causes us a problem, as we monitor syslog messages from a
> centralized server to update us of services running by nodename.
> 
> I continue to get messages from clurgmgrd, but only through Magma Event
> changes, i.e.:
> 
> Aug  7 16:09:03 db5 clurgmgrd[16354]: <info> Magma Event: Membership
> Change 
> Aug  7 16:09:03 db5 clurgmgrd[16354]: <info> State change: db1 UP
> 

What release?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From JacquesB at fnb.co.za  Fri Aug 17 16:23:35 2007
From: JacquesB at fnb.co.za (Botha, Jacques (FNB))
Date: Fri, 17 Aug 2007 18:23:35 +0200
Subject: [Linux-cluster] RHEL5.1 beta and qdisk
References: <1186044963.6514.4.camel@f2821966><20070802155911.GH26367@redhat.com><1186124530.5993.2.camel@f2821966>
	<20070817161031.GO5853@redhat.com>
Message-ID: <92BE9C0BC718D840B8D9498A585E94E00DA358@fnbrbgmx01.fnb.co.za>


> Does it happen w/o qdisk?

Nope, works beautifully.  Just when you start q-disk that it goes funny.

>Note that with qdisk in a 2-node cluster, you want expected_votes = 3,
>two_node = 0, one vote per node and 1 vote for qdisk. 

>In a 2-node cluster w/o qdisk, you want expected_votes = 1, two_node =
>1.  Fencing is required whenever two_node = 1, even if it's manual
>fencing.

This is a 4 node cluster.


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


To read FirstRand Bank's Disclaimer for this email click on the following address or copy into your Internet browser: 
https://www.fnb.co.za/disclaimer.html 

If you are unable to access the Disclaimer, send a blank e-mail to
firstrandbankdisclaimer at fnb.co.za and we will send you a copy of the Disclaimer.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070817/a2be32d0/attachment.htm>

From lhh at redhat.com  Fri Aug 17 16:33:34 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 12:33:34 -0400
Subject: [Linux-cluster] fencing: external vs watchdog
In-Reply-To: <200708170929.05517.hlawatschek@atix.de>
References: <200708170929.05517.hlawatschek@atix.de>
Message-ID: <20070817163334.GV5853@redhat.com>

On Fri, Aug 17, 2007 at 09:29:05AM +0200, Mark Hlawatschek wrote:
> Hi,
> 
> I'd like to discuss and collect information about the two diffrent fencing 
> approaches.
> 
> external fencing: The failed cluster node is disconnected from the storage 
> device by onother node in the cluster. After a failure detection all cluster 
> activities are suspended until the IO fencing of the failed node has been 
> completed successfully.
> 
> watchdog fencing: A failed cluster node has to recognize the failure by itself 
> and will be shut down by a kind of internal watchdog feature.
> 
> Now, I see that theoretically the external fencing method (when configured 
> correctly) is the betterer approach because of the exactly defined state 
> during a fencing and recovery operation.

> But the question is: What are real world examples of failures when the 
> watchdog fencing would fail and cause data corruption on the storage device ?
> I'd like to collect some real world examples and also theoretical approaches.

Hardware watchdog failure.

Kernel watchdog thread failure.

Wrong watchog time (timer fires after failover attempt when node isn't
really hung).

With linux-cluster, historically, we have required verification from a
3rd party (IE the fence device) - and not assumed anything.  That is,
fencing devices are assumed to be set up correctly - and therefore, when
we ask the fence device if this port (node, whatever) is off, it can
say "Yes" or "No".

Fencing goes like this:

  do {
    sleep(5);
    fence_node();
  } while (!node_is_fenced() && !node_rejoined());

With a watchdog timer, there's no fencing and no verification.

Red Hat Cluster Suite 3 was the last instance of linux-cluster / RHCS
which supported use of watchdog timers (based on assumptions), but it
wasn't considered fencing. (In RHEL 2.1 you added an actual "watchdog"
agent to the cluster config, which was a no-op, how useful...)

Data integrity after a failure was disclaimed if watchdogs were used.

I can't think of any real-world problems that resulted, but I forget
lots of things ;)

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From pbruna at it-linux.cl  Fri Aug 17 16:41:17 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Fri, 17 Aug 2007 12:41:17 -0400 (CLT)
Subject: [Linux-cluster] Multicast
Message-ID: <15398601.41701187368877032.JavaMail.root@lisa.itlinux.cl>

Hi, 
Does cluster suite use Multicast in Red Hat 5? 
If is so, how can i test my Switchs allow multicast traffic? 

Thanks 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070817/111ce104/attachment.htm>

From lhh at redhat.com  Fri Aug 17 17:34:22 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 13:34:22 -0400
Subject: [Linux-cluster] fencing: external vs watchdog
In-Reply-To: <46C58067.7080704@artegence.com>
References: <200708170929.05517.hlawatschek@atix.de>
	<46C58067.7080704@artegence.com>
Message-ID: <20070817173422.GW5853@redhat.com>

On Fri, Aug 17, 2007 at 01:03:03PM +0200, Maciej Bogucki wrote:
> 1. Watchodog is a piece of code which run in user space, so You don't
> have 100% guarantee that it will run correctly.

Some clarification:

Traditionally, a watchdog is a piece of hardware which a userland
daemon writes to periodically.  Failure to write to the piece of
hardware after a set time causes a system reset (the app holding the
watchdog open crashing is one obvious way to cause this to happen).

The Linux kernel also has a software watchdog (called softdog) which
operates in the kernel using the same API it exposes for hardware
watchdogs.

The watchdog daemon (Debian, RHEL5.1, etc.) is one implementation of the
userland part of code which is well-known and often confused with being
a watchdog timer itself.  It monitors administrator-defined resources
and touches the watchdog timer device periodically if things are "ok"
and stops if things go bad (stopping causes the WD to fire).

The point here is that it doesn't matter if the userspace code fails,
blows up, or otherwise - the *failure* mode for a watchdog timer is to
reset the system.


> 2. Watchdog fencing can't protect You against split-brain situations,
> where the consequences could be corruption of You data. Here comes
> external fencing.

You can (at least, mostly) solve this if you have alternative mechanisms
for cluster communications (ex: a quorum disk on a SAN and/or using
external tie-breakers/ping-nodes/whatever).

However - more inline with your point - it's not simple, and it relies
on a lot of assumptions.


> There is another point of view about Linux Clusters and other Commercial
> Clusters(fe. Sun Cluster). Linux Cluster resist in user-space so You
> don't have guarantee that local fencing will run ok, and You need
> exteral fencing to resolve this main problem. Sun Cluster resist in
> kernel-space, so when one node lost quorum it do "kernel panic" and You
> have 100% guarantee that it will success.

FWIW, the hardware watchdog timer is outside of the operating system
entirely.  The entire kernel could hang/crash and the watchdog would
still fire.

Most of the reason for fencing (at all) is the notion of a live-hang of
an indefinite period of time - where a node just stops for a few seconds
due to a kernel bug or for some other reason.  If the whole kernel stops
for a few seconds, the node won't know it's no longer in the quorum, or
calling panic() could be delayed.

There are kernel hangcheck timers, but as I understand it, they're racy:
You can not guarantee that the hang-check will complete before an
outstanding I/O is flushed to disk.  I could be wrong here.


> For me network fencing(IPMI,DRAC,...) isn't good, because You have to
> connect via network and it could fail, and so on. The best fencing
> mechanism is fence_scsi, which is an I/O fencing agent. I can be used
> with the SCSI devices that support persistent reservations (SPC-2 or
> greater). In more cases You have shares storages taht support SPC-2 or
> SPC-3.

Yup.  You can also use FC zoning, in addition to fence_scsi if you
want.

The biggest thing about not using watchdog timers as 'fencing' is
that it's complex and difficult to do correctly/reliably, especially in
the two-node case.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 20:32:30 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 16:32:30 -0400
Subject: [Linux-cluster] RHEL5.1 beta and qdisk
In-Reply-To: <92BE9C0BC718D840B8D9498A585E94E00DA358@fnbrbgmx01.fnb.co.za>
References: <20070817161031.GO5853@redhat.com>
	<92BE9C0BC718D840B8D9498A585E94E00DA358@fnbrbgmx01.fnb.co.za>
Message-ID: <20070817203230.GY5853@redhat.com>

On Fri, Aug 17, 2007 at 06:23:35PM +0200, Botha, Jacques (FNB) wrote:
> 
> > Does it happen w/o qdisk?
> 
> Nope, works beautifully.  Just when you start q-disk that it goes funny.
> 
> >Note that with qdisk in a 2-node cluster, you want expected_votes = 3,
> >two_node = 0, one vote per node and 1 vote for qdisk. 
> 
> >In a 2-node cluster w/o qdisk, you want expected_votes = 1, two_node =
> >1.  Fencing is required whenever two_node = 1, even if it's manual
> >fencing.
> 
> This is a 4 node cluster.

so -

expected_votes = 7
1 vote per node
3 votes for qdisk

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Aug 17 20:38:50 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 17 Aug 2007 16:38:50 -0400
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46C3F9AE.5060702@jouy.inra.fr>
References: <46C3F9AE.5060702@jouy.inra.fr>
Message-ID: <20070817203850.GZ5853@redhat.com>

On Thu, Aug 16, 2007 at 09:15:58AM +0200, kieran JOYEUX wrote:
> Hi guys,
> 
> I am implementing a two node cluster sharing via NFS, their local 
> storage to one client.
> At the moment, i am simulating a failover during a copy from the NFS 
> server to the local client disk.


>              <resources>
>                      <ip address="138.102.22.33" monitor_link="1"/>
>                      <nfsexport name="/usr/local/genome"/>
>                      <nfsclient name="mig" options="ro,fsid=20" 
> path="/usr/local/genome" target="138.102.22.0/255.255.192.0"/>
>                      <nfsclient name="mig213" options="fsid=213,ro" 
> path="/usr/local/genome" target="138.102.22.213"/>
>                      <nfsclient name="mig217" options="ro,fsid=217" 
> path="/usr/local/genome" target="138.102.22.217"/>
>              </resources>
>              <service autostart="1" domain="mig_fod" name="nfs">
>                      <ip ref="138.102.22.33"/>
>                      <nfsexport ref="/usr/local/genome"/>
>                      <nfsclient ref="mig"/>
>              </service>

Try something like:

	<resources>
		<ip address="138.102.22.33" monitor_link="1"/>
		<nfsexport name="/usr/local/genome"/>
		<nfsclient name="mig" options="ro,fsid=20"
			target="138.102.22.0/255.255.192.0"/>
		<nfsclient name="mig213" options="fsid=213,ro"
			target="138.102.22.213"/>
		<nfsclient name="mig217" options="ro,fsid=217" 
			target="138.102.22.217"/>
		<clusterfs name="my_cluster_fs" device="..." fstype="gfs"
			mountpoint="/usr/local/genome"/>
	</resources>

	<service autostart="1" domain="mig_fod" name="nfs" nfslock="1">
  		<ip ref="138.102.22.33"/>
		<clusterfs ref="/usr/local/genome">
			<nfsexport ref="/usr/local/genome"/>
				<nfsclient ref="mig"/>
			</nfsexport>
		</clusterfs>
	</service>

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From pbruna at it-linux.cl  Fri Aug 17 23:37:54 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Fri, 17 Aug 2007 19:37:54 -0400 (CLT)
Subject: [Linux-cluster] Not member
Message-ID: <19745236.42841187393874612.JavaMail.root@lisa.itlinux.cl>

I need a little help here. 
I run the group_tool dump fence and im getting this in both node (davinci and newton) 

davinci: 1187392572 node "newton" not a cman member, cn 1 
newton: 1187392289 node "davinci" not a cman member, cn 1 

Im attaching my cluster.conf 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070817/439ebd3a/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/octet-stream
Size: 1630 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070817/439ebd3a/attachment.obj>

From sebastian.walter at fu-berlin.de  Sun Aug 19 00:06:30 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Sun, 19 Aug 2007 02:06:30 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
Message-ID: <46C78986.4070301@fu-berlin.de>

Dear list,

this is the tragical story of my cluster running rhel/csgfs 4u5: the
cluster in generally is running fine, but when I increase the load to a 
certain level (heavy I/O), it collapses. About 20% of the nodes do crash 
(not reacting any more, but no sign of kernel panic), the others can't 
access the gfs resource.
Gfs is set up as a rgmanager service with failover domain for each node
(same problem also exists when mounting via /etc/fstab).

Who is willing to provide a happy end?

Thanks, Sebastian
**

This is what /var/log/messages gives me (on nearly all nodes):
Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting 
status for RG gfs-2
and e.g.
Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain
cluster lock: Connection timed out

[root at compute-0-3 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 53
Cluster name: dtm
Cluster ID: 741
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 10
Expected_votes: 11
Total_votes: 10
Quorum: 6
Active subsystems: 8
Node name: compute-0-3
Node ID: 4
Node addresses: 10.1.255.252

[root at compute-0-6 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 recover 4 -
[1 2 6 10 9 8 3 7 4 11]
DLM Lock Space:  "clvmd"                             7   3 recover 0 -
[1 2 6 10 9 8 3 7 4 11]
DLM Lock Space:  "Magma"                            12   5 recover 0 -
[1 2 6 10 9 8 3 7 4 11]
DLM Lock Space:  "homeneu"                          17   6 recover 0 -
[10 9 8 7 2 3 6 4 1 11]
GFS Mount Group: "homeneu"                          18   7 recover 0 -
[10 9 8 7 2 3 6 4 1 11]
User:            "usrm::manager"                    11   4 recover 0 -
[1 2 6 10 9 8 3 7 4 11]

[root at compute-0-10 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:       4036
Unlock operations:     2001
Convert operations:    1862
Completion ASTs:       7898
Blocking ASTs:           52

Lockqueue        num  waittime   ave
WAIT_RSB        3778     28862     7
WAIT_CONV         75       482     6
WAIT_GRANT      2171      7235     3
WAIT_UNLOCK      153      1606    10
Total           6177     38185     6

[root at compute-0-10 ~]# cat /proc/cluster/sm_debug
sevent state 7
02000012 sevent state 9
00000003 remove node 5 count 10
01000011 remove node 5 count 10
0100000c remove node 5 count 10
01000007 remove node 5 count 10
02000012 remove node 5 count 10
0300000b remove node 5 count 10
00000003 recover state 0





From grimme at atix.de  Sun Aug 19 09:42:07 2007
From: grimme at atix.de (Marc Grimme)
Date: Sun, 19 Aug 2007 11:42:07 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46C78986.4070301@fu-berlin.de>
References: <46C78986.4070301@fu-berlin.de>
Message-ID: <200708191142.08802.grimme@atix.de>

Hello Sebastian,
what do gfs_tool counters on the fs tell you?
And ps axf? Do you have a lot of "D" processes?
Regards Marc.
On Sunday 19 August 2007 02:06:30 Sebastian Walter wrote:
> Dear list,
>
> this is the tragical story of my cluster running rhel/csgfs 4u5: the
> cluster in generally is running fine, but when I increase the load to a
> certain level (heavy I/O), it collapses. About 20% of the nodes do crash
> (not reacting any more, but no sign of kernel panic), the others can't
> access the gfs resource.
> Gfs is set up as a rgmanager service with failover domain for each node
> (same problem also exists when mounting via /etc/fstab).
>
> Who is willing to provide a happy end?
>
> Thanks, Sebastian
> **
>
> This is what /var/log/messages gives me (on nearly all nodes):
> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting
> status for RG gfs-2
> and e.g.
> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain
> cluster lock: Connection timed out
>
> [root at compute-0-3 ~]# cat /proc/cluster/status
> Protocol version: 5.0.1
> Config version: 53
> Cluster name: dtm
> Cluster ID: 741
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 10
> Expected_votes: 11
> Total_votes: 10
> Quorum: 6
> Active subsystems: 8
> Node name: compute-0-3
> Node ID: 4
> Node addresses: 10.1.255.252
>
> [root at compute-0-6 ~]# cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           3   2 recover 4 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "clvmd"                             7   3 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "Magma"                            12   5 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "homeneu"                          17   6 recover 0 -
> [10 9 8 7 2 3 6 4 1 11]
> GFS Mount Group: "homeneu"                          18   7 recover 0 -
> [10 9 8 7 2 3 6 4 1 11]
> User:            "usrm::manager"                    11   4 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
>
> [root at compute-0-10 ~]# cat /proc/cluster/dlm_stats
> DLM stats (HZ=1000)
>
> Lock operations:       4036
> Unlock operations:     2001
> Convert operations:    1862
> Completion ASTs:       7898
> Blocking ASTs:           52
>
> Lockqueue        num  waittime   ave
> WAIT_RSB        3778     28862     7
> WAIT_CONV         75       482     6
> WAIT_GRANT      2171      7235     3
> WAIT_UNLOCK      153      1606    10
> Total           6177     38185     6
>
> [root at compute-0-10 ~]# cat /proc/cluster/sm_debug
> sevent state 7
> 02000012 sevent state 9
> 00000003 remove node 5 count 10
> 01000011 remove node 5 count 10
> 0100000c remove node 5 count 10
> 01000007 remove node 5 count 10
> 02000012 remove node 5 count 10
> 0300000b remove node 5 count 10
> 00000003 recover state 0
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss



From sebastian.walter at fu-berlin.de  Sun Aug 19 09:53:39 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Sun, 19 Aug 2007 11:53:39 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <200708191142.08802.grimme@atix.de>
References: <46C78986.4070301@fu-berlin.de> <200708191142.08802.grimme@atix.de>
Message-ID: <46C81323.5010800@fu-berlin.de>

Hi Marc!

Thanks for your help. As I restarted everything now, I can't check this. 
I will do when it's crahsing again (I will do some tests now). I 
realised that one node did hang with kernel panic. Attached is the 
screenshot.

regards
sebastian


Marc Grimme wrote:
> Hello Sebastian,
> what do gfs_tool counters on the fs tell you?
> And ps axf? Do you have a lot of "D" processes?
> Regards Marc.
> On Sunday 19 August 2007 02:06:30 Sebastian Walter wrote:
>   
>> Dear list,
>>
>> this is the tragical story of my cluster running rhel/csgfs 4u5: the
>> cluster in generally is running fine, but when I increase the load to a
>> certain level (heavy I/O), it collapses. About 20% of the nodes do crash
>> (not reacting any more, but no sign of kernel panic), the others can't
>> access the gfs resource.
>> Gfs is set up as a rgmanager service with failover domain for each node
>> (same problem also exists when mounting via /etc/fstab).
>>
>> Who is willing to provide a happy end?
>>
>> Thanks, Sebastian
>> **
>>
>> This is what /var/log/messages gives me (on nearly all nodes):
>> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting
>> status for RG gfs-2
>> and e.g.
>> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain
>> cluster lock: Connection timed out
>>
>> [root at compute-0-3 ~]# cat /proc/cluster/status
>> Protocol version: 5.0.1
>> Config version: 53
>> Cluster name: dtm
>> Cluster ID: 741
>> Cluster Member: Yes
>> Membership state: Cluster-Member
>> Nodes: 10
>> Expected_votes: 11
>> Total_votes: 10
>> Quorum: 6
>> Active subsystems: 8
>> Node name: compute-0-3
>> Node ID: 4
>> Node addresses: 10.1.255.252
>>
>> [root at compute-0-6 ~]# cat /proc/cluster/services
>> Service          Name                              GID LID State     Code
>> Fence Domain:    "default"                           3   2 recover 4 -
>> [1 2 6 10 9 8 3 7 4 11]
>> DLM Lock Space:  "clvmd"                             7   3 recover 0 -
>> [1 2 6 10 9 8 3 7 4 11]
>> DLM Lock Space:  "Magma"                            12   5 recover 0 -
>> [1 2 6 10 9 8 3 7 4 11]
>> DLM Lock Space:  "homeneu"                          17   6 recover 0 -
>> [10 9 8 7 2 3 6 4 1 11]
>> GFS Mount Group: "homeneu"                          18   7 recover 0 -
>> [10 9 8 7 2 3 6 4 1 11]
>> User:            "usrm::manager"                    11   4 recover 0 -
>> [1 2 6 10 9 8 3 7 4 11]
>>
>> [root at compute-0-10 ~]# cat /proc/cluster/dlm_stats
>> DLM stats (HZ=1000)
>>
>> Lock operations:       4036
>> Unlock operations:     2001
>> Convert operations:    1862
>> Completion ASTs:       7898
>> Blocking ASTs:           52
>>
>> Lockqueue        num  waittime   ave
>> WAIT_RSB        3778     28862     7
>> WAIT_CONV         75       482     6
>> WAIT_GRANT      2171      7235     3
>> WAIT_UNLOCK      153      1606    10
>> Total           6177     38185     6
>>
>> [root at compute-0-10 ~]# cat /proc/cluster/sm_debug
>> sevent state 7
>> 02000012 sevent state 9
>> 00000003 remove node 5 count 10
>> 01000011 remove node 5 count 10
>> 0100000c remove node 5 count 10
>> 01000007 remove node 5 count 10
>> 02000012 remove node 5 count 10
>> 0300000b remove node 5 count 10
>> 00000003 recover state 0
>>
>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>     
>
>
>
>   

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Picture 8.png
Type: image/png
Size: 23680 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070819/9112cced/attachment.png>

From grimme at atix.de  Sun Aug 19 20:12:43 2007
From: grimme at atix.de (Marc Grimme)
Date: Sun, 19 Aug 2007 22:12:43 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46C81323.5010800@fu-berlin.de>
References: <46C78986.4070301@fu-berlin.de> <200708191142.08802.grimme@atix.de>
	<46C81323.5010800@fu-berlin.de>
Message-ID: <200708192212.44151.grimme@atix.de>

Hi Sebastian,
you might also want to have a look at here:
http://www.open-sharedroot.org/Members/marc/blog/blog-on-gfs/
I collected some information about the problem you've hit (It must be that 
problem).
Next time you should also look at the console of every node. You should see 
some intesting messages before there.
Use the glock_purge gfs_tool option it will help and always keep a look on the 
gfs_tool counters and there on the locks.

BTW: the unable to obtain lock is only the rgmanager complaining about not 
being able to obtain a lock and as side effect. The problem is that a new 
lockid cannot be got within time.

Regards Marc.
On Sunday 19 August 2007 11:53:39 you wrote:
> Hi Marc!
>
> Thanks for your help. As I restarted everything now, I can't check this.
> I will do when it's crahsing again (I will do some tests now). I
> realised that one node did hang with kernel panic. Attached is the
> screenshot.
>
> regards
> sebastian
>
> Marc Grimme wrote:
> > Hello Sebastian,
> > what do gfs_tool counters on the fs tell you?
> > And ps axf? Do you have a lot of "D" processes?
> > Regards Marc.
> >
> > On Sunday 19 August 2007 02:06:30 Sebastian Walter wrote:
> >> Dear list,
> >>
> >> this is the tragical story of my cluster running rhel/csgfs 4u5: the
> >> cluster in generally is running fine, but when I increase the load to a
> >> certain level (heavy I/O), it collapses. About 20% of the nodes do crash
> >> (not reacting any more, but no sign of kernel panic), the others can't
> >> access the gfs resource.
> >> Gfs is set up as a rgmanager service with failover domain for each node
> >> (same problem also exists when mounting via /etc/fstab).
> >>
> >> Who is willing to provide a happy end?
> >>
> >> Thanks, Sebastian
> >> **
> >>
> >> This is what /var/log/messages gives me (on nearly all nodes):
> >> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting
> >> status for RG gfs-2
> >> and e.g.
> >> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain
> >> cluster lock: Connection timed out
> >>
> >> [root at compute-0-3 ~]# cat /proc/cluster/status
> >> Protocol version: 5.0.1
> >> Config version: 53
> >> Cluster name: dtm
> >> Cluster ID: 741
> >> Cluster Member: Yes
> >> Membership state: Cluster-Member
> >> Nodes: 10
> >> Expected_votes: 11
> >> Total_votes: 10
> >> Quorum: 6
> >> Active subsystems: 8
> >> Node name: compute-0-3
> >> Node ID: 4
> >> Node addresses: 10.1.255.252
> >>
> >> [root at compute-0-6 ~]# cat /proc/cluster/services
> >> Service          Name                              GID LID State    
> >> Code Fence Domain:    "default"                           3   2 recover
> >> 4 - [1 2 6 10 9 8 3 7 4 11]
> >> DLM Lock Space:  "clvmd"                             7   3 recover 0 -
> >> [1 2 6 10 9 8 3 7 4 11]
> >> DLM Lock Space:  "Magma"                            12   5 recover 0 -
> >> [1 2 6 10 9 8 3 7 4 11]
> >> DLM Lock Space:  "homeneu"                          17   6 recover 0 -
> >> [10 9 8 7 2 3 6 4 1 11]
> >> GFS Mount Group: "homeneu"                          18   7 recover 0 -
> >> [10 9 8 7 2 3 6 4 1 11]
> >> User:            "usrm::manager"                    11   4 recover 0 -
> >> [1 2 6 10 9 8 3 7 4 11]
> >>
> >> [root at compute-0-10 ~]# cat /proc/cluster/dlm_stats
> >> DLM stats (HZ=1000)
> >>
> >> Lock operations:       4036
> >> Unlock operations:     2001
> >> Convert operations:    1862
> >> Completion ASTs:       7898
> >> Blocking ASTs:           52
> >>
> >> Lockqueue        num  waittime   ave
> >> WAIT_RSB        3778     28862     7
> >> WAIT_CONV         75       482     6
> >> WAIT_GRANT      2171      7235     3
> >> WAIT_UNLOCK      153      1606    10
> >> Total           6177     38185     6
> >>
> >> [root at compute-0-10 ~]# cat /proc/cluster/sm_debug
> >> sevent state 7
> >> 02000012 sevent state 9
> >> 00000003 remove node 5 count 10
> >> 01000011 remove node 5 count 10
> >> 0100000c remove node 5 count 10
> >> 01000007 remove node 5 count 10
> >> 02000012 remove node 5 count 10
> >> 0300000b remove node 5 count 10
> >> 00000003 recover state 0
> >>
> >>
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss



From pcaulfie at redhat.com  Mon Aug 20 07:28:07 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 20 Aug 2007 08:28:07 +0100
Subject: [Linux-cluster] Multicast
In-Reply-To: <15398601.41701187368877032.JavaMail.root@lisa.itlinux.cl>
References: <15398601.41701187368877032.JavaMail.root@lisa.itlinux.cl>
Message-ID: <46C94287.9080608@redhat.com>

Patricio A. Bruna wrote:
> Hi,
> Does cluster suite use Multicast in Red Hat 5?

RHEL5 uses multicast by default. There is no broadcast option any more


> If is so, how can i test my Switchs allow multicast traffic?

I don't know - consult the manual that came with the switches?

-- 
Patrick



From maciej.bogucki at artegence.com  Mon Aug 20 07:42:03 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Mon, 20 Aug 2007 09:42:03 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46C78986.4070301@fu-berlin.de>
References: <46C78986.4070301@fu-berlin.de>
Message-ID: <46C945CB.5070801@artegence.com>

> [...cut...]
> certain level (heavy I/O), it collapses. About 20% of the nodes do crash
> (not reacting any more, but no sign of kernel panic), the others can't
> access the gfs resource.
> [...cut...]
> 
> [root at compute-0-6 ~]# cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           3   2 recover 4 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "clvmd"                             7   3 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "Magma"                            12   5 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "homeneu"                          17   6 recover 0 -
> [10 9 8 7 2 3 6 4 1 11]
> GFS Mount Group: "homeneu"                          18   7 recover 0 -
> [10 9 8 7 2 3 6 4 1 11]
> User:            "usrm::manager"                    11   4 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]

Hello,

1. Do You have Fibre-Channel SAN storage or use GNDB/iSCSI?
2. The other nodes can't access GFS fs because cluster is in recover
state. Do You have fencing properly configured?

Best Regards
Maciej Bogucki



From fajar at telkom.net.id  Mon Aug 20 07:47:47 2007
From: fajar at telkom.net.id (Fajar A. Nugraha)
Date: Mon, 20 Aug 2007 14:47:47 +0700
Subject: [Linux-cluster] Multicast
In-Reply-To: <15398601.41701187368877032.JavaMail.root@lisa.itlinux.cl>
References: <15398601.41701187368877032.JavaMail.root@lisa.itlinux.cl>
Message-ID: <46C94723.1030805@telkom.net.id>

Patricio A. Bruna wrote:
> Hi,
> Does cluster suite use Multicast in Red Hat 5?
> If is so, how can i test my Switchs allow multicast traffic?
>
The switches that I tested allow multicast traffic by default.
You can test it by running some programs that uses multicast address, like :
- ping 224.0.0.1. Note that by default RHEL ignores broadcast pings, so
you might need to run
"sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=0 " first
- http://www.jgroups.org/javagroupsnew/docs/manual/html/ch02.html#d0e455

Regards,

Fajar



From sebastian.walter at fu-berlin.de  Mon Aug 20 08:24:20 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Mon, 20 Aug 2007 10:24:20 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <200708192212.44151.grimme@atix.de>
References: <46C78986.4070301@fu-berlin.de> <200708191142.08802.grimme@atix.de>
	<46C81323.5010800@fu-berlin.de> <200708192212.44151.grimme@atix.de>
Message-ID: <46C94FB4.2030406@fu-berlin.de>

Hi Marc,

thanks for your answer. Yesterday night, I set the 'gfs_tool settune
glock_purge 50'. After putting some stress on the cluster, one node now
failed, while gfs is still running, but very, very slow (maybe it will
collapse soon). /proc/cluster/services tells me that the nodes are all
still ok, and the failed doesn't get fenced (normally, fencing works
well in the environment). But I can't ssh into the node, and on the
console I can't login (even if it still let's me type it doesn't react).

The main jobs of the cluster went to 'D' state.

This is what `gfs_tool counters /global/home` gives me (this command
also hang, but after restarting the failed node it worked):
[root at dtm ~]# gfs_tool counters /global/home

                                  locks 23532
                             locks held 10774
                           freeze count 0
                          incore inodes 10764
                       metadata buffers 4
                        unlinked inodes 1
                              quota IDs 2
                     incore log buffers 0
                         log space used 0.10%
              meta header cache entries 0
                     glock dependencies 0
                 glocks on reclaim list 0
                              log wraps 2
                   outstanding LM calls 0
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 421095
                         glock nq calls 2196642
                         glock dq calls 2051456
                   glock prefetch calls 171529
                          lm_lock calls 345420
                        lm_unlock calls 155518
                           lm callbacks 515665
                     address operations 4431
                      dentry operations 377484
                      export operations 0
                        file operations 535012
                       inode operations 892178
                       super operations 1196275
                          vm operations 0
                        block I/O reads 126680
                       block I/O writes 133808

Any suggestions?

Regards,
Sebastian


Marc Grimme wrote:
> Hi Sebastian,
> you might also want to have a look at here:
> http://www.open-sharedroot.org/Members/marc/blog/blog-on-gfs/
> I collected some information about the problem you've hit (It must be that 
> problem).
> Next time you should also look at the console of every node. You should see 
> some intesting messages before there.
> Use the glock_purge gfs_tool option it will help and always keep a look on the 
> gfs_tool counters and there on the locks.
>
> BTW: the unable to obtain lock is only the rgmanager complaining about not 
> being able to obtain a lock and as side effect. The problem is that a new 
> lockid cannot be got within time.
>
> Regards Marc.
> On Sunday 19 August 2007 11:53:39 you wrote:
>   
>> Hi Marc!
>>
>> Thanks for your help. As I restarted everything now, I can't check this.
>> I will do when it's crahsing again (I will do some tests now). I
>> realised that one node did hang with kernel panic. Attached is the
>> screenshot.
>>
>> regards
>> sebastian
>>
>> Marc Grimme wrote:
>>     
>>> Hello Sebastian,
>>> what do gfs_tool counters on the fs tell you?
>>> And ps axf? Do you have a lot of "D" processes?
>>> Regards Marc.
>>>
>>> On Sunday 19 August 2007 02:06:30 Sebastian Walter wrote:
>>>       
>>>> Dear list,
>>>>
>>>> this is the tragical story of my cluster running rhel/csgfs 4u5: the
>>>> cluster in generally is running fine, but when I increase the load to a
>>>> certain level (heavy I/O), it collapses. About 20% of the nodes do crash
>>>> (not reacting any more, but no sign of kernel panic), the others can't
>>>> access the gfs resource.
>>>> Gfs is set up as a rgmanager service with failover domain for each node
>>>> (same problem also exists when mounting via /etc/fstab).
>>>>
>>>> Who is willing to provide a happy end?
>>>>
>>>> Thanks, Sebastian
>>>> **
>>>>
>>>> This is what /var/log/messages gives me (on nearly all nodes):
>>>> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting
>>>> status for RG gfs-2
>>>> and e.g.
>>>> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain
>>>> cluster lock: Connection timed out
>>>>
>>>> [root at compute-0-3 ~]# cat /proc/cluster/status
>>>> Protocol version: 5.0.1
>>>> Config version: 53
>>>> Cluster name: dtm
>>>> Cluster ID: 741
>>>> Cluster Member: Yes
>>>> Membership state: Cluster-Member
>>>> Nodes: 10
>>>> Expected_votes: 11
>>>> Total_votes: 10
>>>> Quorum: 6
>>>> Active subsystems: 8
>>>> Node name: compute-0-3
>>>> Node ID: 4
>>>> Node addresses: 10.1.255.252
>>>>
>>>> [root at compute-0-6 ~]# cat /proc/cluster/services
>>>> Service          Name                              GID LID State    
>>>> Code Fence Domain:    "default"                           3   2 recover
>>>> 4 - [1 2 6 10 9 8 3 7 4 11]
>>>> DLM Lock Space:  "clvmd"                             7   3 recover 0 -
>>>> [1 2 6 10 9 8 3 7 4 11]
>>>> DLM Lock Space:  "Magma"                            12   5 recover 0 -
>>>> [1 2 6 10 9 8 3 7 4 11]
>>>> DLM Lock Space:  "homeneu"                          17   6 recover 0 -
>>>> [10 9 8 7 2 3 6 4 1 11]
>>>> GFS Mount Group: "homeneu"                          18   7 recover 0 -
>>>> [10 9 8 7 2 3 6 4 1 11]
>>>> User:            "usrm::manager"                    11   4 recover 0 -
>>>> [1 2 6 10 9 8 3 7 4 11]
>>>>
>>>> [root at compute-0-10 ~]# cat /proc/cluster/dlm_stats
>>>> DLM stats (HZ=1000)
>>>>
>>>> Lock operations:       4036
>>>> Unlock operations:     2001
>>>> Convert operations:    1862
>>>> Completion ASTs:       7898
>>>> Blocking ASTs:           52
>>>>
>>>> Lockqueue        num  waittime   ave
>>>> WAIT_RSB        3778     28862     7
>>>> WAIT_CONV         75       482     6
>>>> WAIT_GRANT      2171      7235     3
>>>> WAIT_UNLOCK      153      1606    10
>>>> Total           6177     38185     6
>>>>
>>>> [root at compute-0-10 ~]# cat /proc/cluster/sm_debug
>>>> sevent state 7
>>>> 02000012 sevent state 9
>>>> 00000003 remove node 5 count 10
>>>> 01000011 remove node 5 count 10
>>>> 0100000c remove node 5 count 10
>>>> 01000007 remove node 5 count 10
>>>> 02000012 remove node 5 count 10
>>>> 0300000b remove node 5 count 10
>>>> 00000003 recover state 0
>>>>
>>>>
>>>>
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>         
>
>
>
>   



From sebastian.walter at fu-berlin.de  Mon Aug 20 08:29:39 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Mon, 20 Aug 2007 10:29:39 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46C945CB.5070801@artegence.com>
References: <46C78986.4070301@fu-berlin.de> <46C945CB.5070801@artegence.com>
Message-ID: <46C950F3.5080807@fu-berlin.de>

Hi Maciej,

Maciej Bogucki wrote:
> 1. Do You have Fibre-Channel SAN storage or use GNDB/iSCSI?
This is FC SAN on QLogic HBAs.

> 2. The other nodes can't access GFS fs because cluster is in recover
> state. Do You have fencing properly configured?
>   
Fencing normally works (e.g. when I manually reboot a node, it gets
restarted again after 10 seconds). With this problem showing up, fenced
is not interacting as far as I see it.

The cluster generally works well. Only if I increase the I/O load to a
certain level, the problems appear.

Regards,
Sebastian



From maciej.bogucki at artegence.com  Mon Aug 20 09:17:51 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Mon, 20 Aug 2007 11:17:51 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46C950F3.5080807@fu-berlin.de>
References: <46C78986.4070301@fu-berlin.de> <46C945CB.5070801@artegence.com>
	<46C950F3.5080807@fu-berlin.de>
Message-ID: <46C95C3F.7010903@artegence.com>

Sebastian Walter napisa?(a):
> Hi Maciej,
> 
> Maciej Bogucki wrote:
>> 1. Do You have Fibre-Channel SAN storage or use GNDB/iSCSI?
> This is FC SAN on QLogic HBAs.
Please check that You have the latest BIOS for You HBAs, for some one
there was important updates. Also check that You have the latest
firmware for Your SAN storage. We have IBM DS4700 and without firmware
update [1] we got nodes die(kernel panic).

[1] -
http://www-304.ibm.com/jct01004c/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-5071006&brandind=5000008

Best Regards
Maciej Bogucki



From wcheng at redhat.com  Mon Aug 20 13:42:50 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Mon, 20 Aug 2007 09:42:50 -0400
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46C94FB4.2030406@fu-berlin.de>
References: <46C78986.4070301@fu-berlin.de>
	<200708191142.08802.grimme@atix.de>	<46C81323.5010800@fu-berlin.de>
	<200708192212.44151.grimme@atix.de> <46C94FB4.2030406@fu-berlin.de>
Message-ID: <46C99A5A.5030905@redhat.com>

Sebastian Walter wrote:

>  
>
>>>>>
>>>>>This is what /var/log/messages gives me (on nearly all nodes):
>>>>>Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed getting
>>>>>status for RG gfs-2
>>>>>and e.g.
>>>>>Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to obtain
>>>>>cluster lock: Connection timed out
>>>>>
>>>>>          
>>>>>

GFS glock trimming patch *could* help. However, the lock leak *here* is 
from clurgmgrd (cluster infrastructure), not GFS (filesystem) itself. So 
these two are different issues. I vaguely recall clurgmgrd did have a 
bugzilla for this and was fixed sometime ago.

Lon ?

-- Wendy







From ulisesglez31 at yahoo.com.mx  Mon Aug 20 14:10:33 2007
From: ulisesglez31 at yahoo.com.mx (ulises gonzalez)
Date: Mon, 20 Aug 2007 09:10:33 -0500 (CDT)
Subject: [Linux-cluster] software
Message-ID: <539174.54225.qm@web35210.mail.mud.yahoo.com>

Hello

I want to implement a 4-node cluster, but I don't know what software is the best to do that work. Which soft do you recomend me??

Thanks

       
---------------------------------

?S? un mejor fot?grafo!
Perfecciona tu t?cnica y encuentra las mejores fotos en:
http://mx.yahoo.com/promos/mejorambientalista.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/7f419e55/attachment.htm>

From sdake at redhat.com  Mon Aug 20 15:45:59 2007
From: sdake at redhat.com (Steven Dake)
Date: Mon, 20 Aug 2007 08:45:59 -0700
Subject: [Linux-cluster] Multicast
In-Reply-To: <15398601.41701187368877032.JavaMail.root@lisa.itlinux.cl>
References: <15398601.41701187368877032.JavaMail.root@lisa.itlinux.cl>
Message-ID: <1187624759.3537.2.camel@balance>

On Fri, 2007-08-17 at 12:41 -0400, Patricio A. Bruna wrote:
> Hi,
> Does cluster suite use Multicast in Red Hat 5?
> If is so, how can i test my Switchs allow multicast traffic?
> 
Depends on your switch model, whether it is uplinked, and whether it is
layer 2 or connected to a layer 3 device.  A description of your network
topology would be helpful.

Regards
-steve

> Thanks
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From kjoyeux at jouy.inra.fr  Mon Aug 20 15:57:34 2007
From: kjoyeux at jouy.inra.fr (kieran JOYEUX)
Date: Mon, 20 Aug 2007 17:57:34 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>
Message-ID: <46C9B9EE.4080306@jouy.inra.fr>

First of all thanks a lot to Paul, Maciej and Lon! :)

-------------------------------------------------------------------

Paul, I tried one Centos 5 as a client and didn't work. I also tried 
with the udp & nfsvers=3 option without success :( Thanks anyway!

Maciej, I didn't know i could use DRBD with RHCS, thanks! I am pretty 
sure it could resolve my issue. But i would like to try with the RedHat 
tools before.

Lon, Thanks for your cluster.conf. I tried using the nfslock option on 
my original configuration but it didn't work out.
If i understood the rest of your cluster.conf, it means using GFS which 
is, i think, not possible for my needs. Let me clarify these, i have two 
servers sharing via NFS, applications located on their local disks. The 
share's content are static and read only. All i need is to provide the 
share at any time with a tranparent failover for my clients. If i could 
do that with GFS why not, is it possible?
Thanks once again :)

Kieran



From dmp at mjm.com  Mon Aug 20 16:01:55 2007
From: dmp at mjm.com (Michael Patrimonio)
Date: Mon, 20 Aug 2007 12:01:55 -0400
Subject: [Linux-cluster] Configuring NFS Export Options
Message-ID: <576CD4D20306324798BB3AE562A2C084027AA14D@X2K3.mjm.com>

Greetings,
 
Where (or how) does a user enter the options available for the NFS export command in the system-config-cluster interface? I am trying to make file systems (and some of their sub-directories) of a NFS server highly available to users.
 
The configuration is a two-node cluster running RHEL4 Update 5 (upgraded from RHEL4 Update 4 to address other cluster-related issues with NFS). At this time, the use of GFS and RHEL5 are not viable options.
 
Any assistance would be very much appreciated.
 
Thanks & Regards,
Michael
=====
 
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/a4a51f6f/attachment.htm>

From pbruna at it-linux.cl  Mon Aug 20 16:13:49 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Mon, 20 Aug 2007 12:13:49 -0400 (CLT)
Subject: [Linux-cluster] Multicast
In-Reply-To: <1187624759.3537.2.camel@balance>
Message-ID: <15915709.43261187626429150.JavaMail.root@lisa.itlinux.cl>

Hi,
It was the configuration of broadcast in RHEL 5, now i can ping the multicast IP.

But my real problem is still there, when i open system-config-cluster it says this member is not part of a cluster.

----- "Steven Dake" <sdake at redhat.com> escribi?:
> On Fri, 2007-08-17 at 12:41 -0400, Patricio A. Bruna wrote:
> > Hi,
> > Does cluster suite use Multicast in Red Hat 5?
> > If is so, how can i test my Switchs allow multicast traffic?
> > 
> Depends on your switch model, whether it is uplinked, and whether it
> is
> layer 2 or connected to a layer 3 device.  A description of your
> network
> topology would be helpful.
> 
> Regards
> -steve
> 
> > Thanks
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From sebastian.walter at fu-berlin.de  Mon Aug 20 16:19:31 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Mon, 20 Aug 2007 18:19:31 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46C99A5A.5030905@redhat.com>
References: <46C78986.4070301@fu-berlin.de>	<200708191142.08802.grimme@atix.de>	<46C81323.5010800@fu-berlin.de>	<200708192212.44151.grimme@atix.de>
	<46C94FB4.2030406@fu-berlin.de> <46C99A5A.5030905@redhat.com>
Message-ID: <46C9BF13.1050102@fu-berlin.de>

Hi,

after putting massive load on the cluster, 55 % of the nodes died again
(after adjusting the glock_purge to 50). I don't think (and hope) that
it's the hardware, as normal filesystems don't make problems and running
it with low load also runs fine. I will check this, but it will be a
more comprehensive task. Maybe I can improve by tuning the volume better?

Here is what /var/log/messages gives me:
Aug 20 16:24:50 compute-0-10.local clurgmgrd[4283]: <err> #48: Unable to
obtain cluster lock: Connection timed out 
Aug 20 16:25:04 compute-0-3.local clurgmgrd[4280]: <err> #48: Unable to
obtain cluster lock: Connection timed out 
Aug 20 16:25:35 compute-0-10.local clurgmgrd[4283]: <err> #50: Unable to
obtain cluster lock: Connection timed out 
Aug 20 16:25:49 compute-0-3.local clurgmgrd[4280]: <err> #50: Unable to
obtain cluster lock: Connection timed out  
(these are the errors from the still running nodes, they are repeated
several times)

gfs_tool counters /global/home is blocked and not responding. Btw, I'm
running CentOS 4 Update 5 on all the nodes.

Thanks for any comment. Regards,
Sebastian

Wendy Cheng wrote:
> Sebastian Walter wrote:
>
>>  
>>
>>>>>>
>>>>>> This is what /var/log/messages gives me (on nearly all nodes):
>>>>>> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed
>>>>>> getting
>>>>>> status for RG gfs-2
>>>>>> and e.g.
>>>>>> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to
>>>>>> obtain
>>>>>> cluster lock: Connection timed out
>>>>>>
>>>>>>         
>
> GFS glock trimming patch *could* help. However, the lock leak *here*
> is from clurgmgrd (cluster infrastructure), not GFS (filesystem)
> itself. So these two are different issues. I vaguely recall clurgmgrd
> did have a bugzilla for this and was fixed sometime ago.
>
> Lon ?
>
> -- Wendy
>
>
>
>
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From Jeremyc at tasconline.com  Mon Aug 20 16:32:05 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Mon, 20 Aug 2007 11:32:05 -0500
Subject: [Linux-cluster] Cluster replication / mirroring with Netvault
	Replicator?
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F14@exchange2003.tasconline.com>

Overview: I currently run a RedHat Cluster of 8 JBoss web servers
sharing data via a Global File System. My company is looking to gain
redundancy for our RHCS GFS cluster due to problems with GFS kernel
panics. During periods of network instability the GFS mounted volumes on
the cluster will oops (Most likely due to DLM failures) and panic the
majority of the machines in the cluster. This of course dissolves the
quorum and takes the entire thing down. We need some redundancy with
GFS.

 

My goal is to have a second cluster completely separate from the first
with the GFS data replicated in an active / active configuration. We
will have our F5 Load Balancers divide traffic to both clusters evenly.
In researching replication software to accomplish this goal I came
across BakBone's NetVault Replicator. Does anybody have experience with
real-time replication via this product? Any caveats to using this for
bi-directional replication of two GFS volumes?

Jeremy Carroll
Systems Engineer
Total Administrative Services Corporation
2302 International Lane
Madison, WI 53704-3140
800-422-4661, Extension 4247
608-241-1900, Extension 4247
608-663-9605, Fax
jeremy at tasconline.com <mailto:jeremy at tasconline.com> 

 

This electronic message and any attachments hereto (this "email) may
contain information that is privileged, confidential or
otherwise protected from disclosure. The information is intended for the
use of the addressee(s) only. If the reader of this email is not the
intended recipient, any disclosure, copy, distribution or use of the
contents of attachments to this email is prohibited. If you have
received this email in error, please notify us immediately by reply
email, then delete this email and destroy any printed copies. It is the
sole responsibility of the recipient to ensure that this email is free
from virus or other defect, which may adversely affect the recipient's
business operations. We disclaim any liability for any damages,
including special and incidental or consequential damages, and any
related costs or expenses arising in any way whatsoever from the receipt
or use of this email. Thank you.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/248a4e2e/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 7197 bytes
Desc: image001.jpg
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/248a4e2e/attachment.jpg>

From sdake at redhat.com  Mon Aug 20 16:34:42 2007
From: sdake at redhat.com (Steven Dake)
Date: Mon, 20 Aug 2007 09:34:42 -0700
Subject: [Linux-cluster] Multicast
In-Reply-To: <15915709.43261187626429150.JavaMail.root@lisa.itlinux.cl>
References: <15915709.43261187626429150.JavaMail.root@lisa.itlinux.cl>
Message-ID: <1187627682.3537.6.camel@balance>

run cman_tool nodes on all nodes
send /var/log/messages

for all nodes in the cluster

Identify your network topology (Are you using a Cisco switch?  Are you
using a PIX device in a 65xx Catalyst?)  Or some other forms of switche.

Regards
-steve

On Mon, 2007-08-20 at 12:13 -0400, Patricio A. Bruna wrote:
> Hi,
> It was the configuration of broadcast in RHEL 5, now i can ping the multicast IP.
> 
> But my real problem is still there, when i open system-config-cluster it says this member is not part of a cluster.
> 
> ----- "Steven Dake" <sdake at redhat.com> escribi?:
> > On Fri, 2007-08-17 at 12:41 -0400, Patricio A. Bruna wrote:
> > > Hi,
> > > Does cluster suite use Multicast in Red Hat 5?
> > > If is so, how can i test my Switchs allow multicast traffic?
> > > 
> > Depends on your switch model, whether it is uplinked, and whether it
> > is
> > layer 2 or connected to a layer 3 device.  A description of your
> > network
> > topology would be helpful.
> > 
> > Regards
> > -steve
> > 
> > > Thanks
> > > 
> > > --
> > > Linux-cluster mailing list
> > > Linux-cluster at redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From sdake at redhat.com  Mon Aug 20 17:57:04 2007
From: sdake at redhat.com (Steven Dake)
Date: Mon, 20 Aug 2007 10:57:04 -0700
Subject: [Linux-cluster] Multicast
In-Reply-To: <28501078.43571187628704824.JavaMail.root@lisa.itlinux.cl>
References: <28501078.43571187628704824.JavaMail.root@lisa.itlinux.cl>
Message-ID: <1187632625.3537.15.camel@balance>

try running 

cman_tool services

_and_

clustat

on both nodes

Thanks
-steve


On Mon, 2007-08-20 at 12:51 -0400, Patricio A. Bruna wrote:
> Steven,
> This cluster has only two nodes (newton, davinci) that are sharing a MSA storage device, one HBA each, where exists a GFS file system.
> 
> cman_tool in newton:
> Node  Sts   Inc   Joined               Name
>    1   M      8   2007-08-18 10:45:08  davinci
>    2   M      4   2007-08-18 10:44:36  newton
> 
> cman_tool in davinci:
> Node  Sts   Inc   Joined               Name
>    1   M      4   2007-08-18 10:44:50  davinci
>    2   M      8   2007-08-18 10:45:09  newton
> 
> 
> group_tool dump fence on newton:
> 1187448277 our_nodeid 2 our_name newton
> 1187448277 listen 4 member 5 groupd 7
> 1187448278 client 3: join default
> 1187448278 delay post_join 900s post_fail 0s
> 1187448278 added 2 nodes from ccs
> 1187448278 setid default 65538
> 1187448278 start default 1 members 2
> 1187448278 do_recovery stop 0 start 1 finish 0
> 1187448278 node "davinci" not a cman member, cn 1
> 1187448278 add first victim davinci
> 1187448279 node "davinci" not a cman member, cn 1
> 1187448280 node "davinci" not a cman member, cn 1
> 1187448281 node "davinci" not a cman member, cn 1
> 1187448282 node "davinci" not a cman member, cn 1
> 1187448283 node "davinci" not a cman member, cn 1
> 1187448284 node "davinci" not a cman member, cn 1
> 1187448285 node "davinci" not a cman member, cn 1
> 1187448286 node "davinci" not a cman member, cn 1
> 1187448287 node "davinci" not a cman member, cn 1
> 1187448288 node "davinci" not a cman member, cn 1
> 1187448289 node "davinci" not a cman member, cn 1
> 1187448290 node "davinci" not a cman member, cn 1
> 1187448291 node "davinci" not a cman member, cn 1
> 1187448292 node "davinci" not a cman member, cn 1
> 1187448293 node "davinci" not a cman member, cn 1
> 1187448294 node "davinci" not a cman member, cn 1
> 1187448295 node "davinci" not a cman member, cn 1
> 1187448296 node "davinci" not a cman member, cn 1
> 1187448297 node "davinci" not a cman member, cn 1
> 1187448298 node "davinci" not a cman member, cn 1
> 1187448299 node "davinci" not a cman member, cn 1
> 1187448300 node "davinci" not a cman member, cn 1
> 1187448301 node "davinci" not a cman member, cn 1
> 1187448302 node "davinci" not a cman member, cn 1
> 1187448303 node "davinci" not a cman member, cn 1
> 1187448304 node "davinci" not a cman member, cn 1
> 1187448305 node "davinci" not a cman member, cn 1
> 1187448306 node "davinci" not a cman member, cn 1
> 1187448307 node "davinci" not a cman member, cn 1
> 1187448308 node "davinci" not a cman member, cn 1
> 1187448309 reduce victim davinci
> 1187448309 delay of 31s leaves 0 victims
> 1187448309 finish default 1
> 1187448309 stop default
> 1187448309 start default 2 members 1 2
> 1187448309 do_recovery stop 1 start 2 finish 1
> 1187448616 client 3: dump
> 1187628236 client 3: dump
> 
> Im attaching both messages
> 
> ----- "Steven Dake" <sdake at redhat.com> escribi?:
> > run cman_tool nodes on all nodes
> > send /var/log/messages
> > 
> > for all nodes in the cluster
> > 
> > Identify your network topology (Are you using a Cisco switch?  Are
> > you
> > using a PIX device in a 65xx Catalyst?)  Or some other forms of
> > switche.
> > 
> > Regards
> > -steve
> > 
> > On Mon, 2007-08-20 at 12:13 -0400, Patricio A. Bruna wrote:
> > > Hi,
> > > It was the configuration of broadcast in RHEL 5, now i can ping the
> > multicast IP.
> > > 
> > > But my real problem is still there, when i open
> > system-config-cluster it says this member is not part of a cluster.
> > > 
> > > ----- "Steven Dake" <sdake at redhat.com> escribi?:
> > > > On Fri, 2007-08-17 at 12:41 -0400, Patricio A. Bruna wrote:
> > > > > Hi,
> > > > > Does cluster suite use Multicast in Red Hat 5?
> > > > > If is so, how can i test my Switchs allow multicast traffic?
> > > > > 
> > > > Depends on your switch model, whether it is uplinked, and whether
> > it
> > > > is
> > > > layer 2 or connected to a layer 3 device.  A description of your
> > > > network
> > > > topology would be helpful.
> > > > 
> > > > Regards
> > > > -steve
> > > > 
> > > > > Thanks
> > > > > 
> > > > > --
> > > > > Linux-cluster mailing list
> > > > > Linux-cluster at redhat.com
> > > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > > > 
> > > > --
> > > > Linux-cluster mailing list
> > > > Linux-cluster at redhat.com
> > > > https://www.redhat.com/mailman/listinfo/linux-cluster
> > >



From lhh at redhat.com  Mon Aug 20 18:01:22 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 20 Aug 2007 14:01:22 -0400
Subject: [Linux-cluster] Multicast
In-Reply-To: <1187632625.3537.15.camel@balance>
References: <28501078.43571187628704824.JavaMail.root@lisa.itlinux.cl>
	<1187632625.3537.15.camel@balance>
Message-ID: <20070820180122.GC26997@redhat.com>

On Mon, Aug 20, 2007 at 10:57:04AM -0700, Steven Dake wrote:
> try running 
> 
> cman_tool services
> 
> _and_
> 
> clustat

Also, "clustat -fx" would be useful. It sounds like there's a disconnect
between what CMAN says and what system-config-cluster says.


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Aug 20 18:01:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 20 Aug 2007 14:01:51 -0400
Subject: [Linux-cluster] software
In-Reply-To: <539174.54225.qm@web35210.mail.mud.yahoo.com>
References: <539174.54225.qm@web35210.mail.mud.yahoo.com>
Message-ID: <20070820180151.GD26997@redhat.com>

On Mon, Aug 20, 2007 at 09:10:33AM -0500, ulises gonzalez wrote:
> Hello
> 
> I want to implement a 4-node cluster, but I don't know what software is the best to do that work. Which soft do you recomend me??

What kind of cluster are you trying to build - what do you want it to
do?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Aug 20 18:06:40 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 20 Aug 2007 14:06:40 -0400
Subject: [Linux-cluster] Configuring NFS Export Options
In-Reply-To: <576CD4D20306324798BB3AE562A2C084027AA14D@X2K3.mjm.com>
References: <576CD4D20306324798BB3AE562A2C084027AA14D@X2K3.mjm.com>
Message-ID: <20070820180640.GE26997@redhat.com>

On Mon, Aug 20, 2007 at 12:01:55PM -0400, Michael Patrimonio wrote:
> Greetings,
>  
> Where (or how) does a user enter the options available for the NFS export command in the system-config-cluster interface? I am trying to make file systems (and some of their sub-directories) of a NFS server highly available to users.

Which options are you looking for?  Some options are (usually)
inherited.

Your service structure should resemble:

  <service name="foo">
    <fs name="filesystem1" mountpoint="/mnt/tmp" ...>
      <nfsexports name="exports">
        <nfsclient name="client1" target="client1.mydomain.com"
		  options="rw,no_root_squash,..."/>
      </nfsexport>
    </fs>
  </service>

I *thought* the nfs client configuration dialog box allowed you to put
in name/target/options... is that not happening?

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Mon Aug 20 18:07:44 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 20 Aug 2007 14:07:44 -0400
Subject: [Linux-cluster] Cluster replication / mirroring with Netvault
	Replicator?
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F14@exchange2003.tasconline.com>
References: <BC16657A85D83746AFF99A42587CC6BC07C77F14@exchange2003.tasconline.com>
Message-ID: <20070820180744.GF26997@redhat.com>

On Mon, Aug 20, 2007 at 11:32:05AM -0500, Jeremy Carroll wrote:
> Overview: I currently run a RedHat Cluster of 8 JBoss web servers
> sharing data via a Global File System. My company is looking to gain
> redundancy for our RHCS GFS cluster due to problems with GFS kernel
> panics. During periods of network instability the GFS mounted volumes on
> the cluster will oops (Most likely due to DLM failures) and panic the
> majority of the machines in the cluster. This of course dissolves the
> quorum and takes the entire thing down. We need some redundancy with
> GFS.

Could you attach some of these panics so we can look at them?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From pbruna at it-linux.cl  Mon Aug 20 18:21:45 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Mon, 20 Aug 2007 14:21:45 -0400 (CLT)
Subject: [Linux-cluster] Multicast
In-Reply-To: <26305344.43631187633501947.JavaMail.root@lisa.itlinux.cl>
Message-ID: <20581858.43681187634105022.JavaMail.root@lisa.itlinux.cl>

Here goes the information.
Alo, im attaching my cluster.conf
Thanks.

cman_tool services - newton:

type             level name             id       state
fence            0     default          00010002 JOIN_START_WAIT [1 2]
dlm              1     clvmd            00020001 JOIN_START_WAIT [1 2]
dlm              1     rgmanager        00020002 JOIN_START_WAIT [1 2]
dlm              1     seia-home        00040001 JOIN_START_WAIT [1 2]
dlm              1     portales         00060001 JOIN_START_WAIT [1 2]
dlm              1     firmp-respaldo   00080001 JOIN_START_WAIT [1 2]
dlm              1     seiaprod-dig     000a0001 JOIN_START_WAIT [1 2]
dlm              1     firmprod-upload  000c0001 JOIN_START_WAIT [1 2]
gfs              2     seia-home        00030001 JOIN_START_WAIT [1 2]
gfs              2     portales         00050001 JOIN_START_WAIT [1 2]
gfs              2     firmp-respaldo   00070001 JOIN_START_WAIT [1 2]
gfs              2     seiaprod-dig     00090001 JOIN_START_WAIT [1 2]
gfs              2     firmprod-upload  000b0001 JOIN_START_WAIT [1 2]

cman_tool services - davinci:
type             level name             id       state
fence            0     default          00010001 JOIN_START_WAIT [1 2]
dlm              1     clvmd            00020001 JOIN_START_WAIT [1 2]
dlm              1     seia-home        00040001 JOIN_START_WAIT [1 2]
dlm              1     portales         00060001 JOIN_START_WAIT [1 2]
dlm              1     firmp-respaldo   00080001 JOIN_START_WAIT [1 2]
dlm              1     seiaprod-dig     000a0001 JOIN_START_WAIT [1 2]
dlm              1     firmprod-upload  000c0001 JOIN_START_WAIT [1 2]
dlm              1     rgmanager        00020002 JOIN_START_WAIT [1 2]
gfs              2     seia-home        00030001 JOIN_START_WAIT [1 2]
gfs              2     portales         00050001 JOIN_START_WAIT [1 2]
gfs              2     firmp-respaldo   00070001 JOIN_START_WAIT [1 2]
gfs              2     seiaprod-dig     00090001 JOIN_START_WAIT [1 2]
gfs              2     firmprod-upload  000b0001 JOIN_START_WAIT [1 2]

clustat - newton:
Member Status: Quorate

  Member Name        ID   Status
  ------ ----        ---- ------
  davinci               1 Online, rgmanager
  newton                2 Online, Local, rgmanager

  Service Name         Owner (Last)       State
  ------- ----         ----- ------       -----
  service:eseia_portal newton           started


clustat - davinci:
Member Status: Quorate

  Member Name        ID   Status
  ------ ----        ---- ------
  davinci               1 Online, Local, rgmanager
  newton                2 Online, rgmanager

  Service Name         Owner (Last)       State
  ------- ----         ----- ------       -----
  service:eseia_portal newton           started


clustat -fx - newton:
<?xml version="1.0"?>
<clustat version="4.1.1">
  <quorum quorate="1" groupmember="1"/>
  <nodes>
    <node name="davinci" state="1" local="0" estranged="0" rgmanager="1" nodeid="0x00000001"/>
    <node name="newton" state="1" local="1" estranged="0" rgmanager="1" nodeid="0x00000002"/>
  </nodes>
  <groups>
    <group name="service:eseia_portales" state="112" state_str="started"  owner="newton" last_owner="newton" restarts="1" last_transition="1187498576" last_transition_str="Sun Aug 19 00:42:56 2007"/>
  </groups>
</clustat>

clustat -fx - davinci:
<?xml version="1.0"?>
<clustat version="4.1.1">
  <quorum quorate="1" groupmember="1"/>
  <nodes>
    <node name="davinci" state="1" local="1" estranged="0" rgmanager="1" nodeid="0x00000001"/>
    <node name="newton" state="1" local="0" estranged="0" rgmanager="1" nodeid="0x00000002"/>
  </nodes>
  <groups>
    <group name="service:eseia_portales" state="112" state_str="started"  owner="newton" last_owner="newton" restarts="1" last_transition="1187498576" last_transition_str="Sun Aug 19 00:42:56 2007"/>
  </groups>
</clustat>

----- "Lon Hohberger" <lhh at redhat.com> escribi?:
> On Mon, Aug 20, 2007 at 10:57:04AM -0700, Steven Dake wrote:
> > try running 
> > 
> > cman_tool services
> > 
> > _and_
> > 
> > clustat
> 
> Also, "clustat -fx" would be useful. It sounds like there's a
> disconnect
> between what CMAN says and what system-config-cluster says.
> 
> 
> -- 
> Lon Hohberger - Software Engineer - Red Hat, Inc.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: application/octet-stream
Size: 1630 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/bf766940/attachment.obj>

From dmp at mjm.com  Mon Aug 20 18:49:04 2007
From: dmp at mjm.com (Michael Patrimonio)
Date: Mon, 20 Aug 2007 14:49:04 -0400
Subject: [Linux-cluster] Configuring NFS Export Options
References: <576CD4D20306324798BB3AE562A2C084027AA14D@X2K3.mjm.com>
	<20070820180640.GE26997@redhat.com>
Message-ID: <576CD4D20306324798BB3AE562A2C084027AA152@X2K3.mjm.com>

 
Thank you for your prompt response. I believe it is doing what it's supposed to, but the documentation is confusing.
 
When I see "NFS Client" I am expecting it to configure a highly available NFS client, not the client access for my NFS server. Most of my experience is with MC/ServiceGuard, which may also be contributing to my confusion.
 
I did a test and I can see an entry for the directory in the output of the exportfs command. However, when I try to mount the file system on the remote node, I am getting a "program not registered error".
 
=====
 
 
 

________________________________

From: linux-cluster-bounces at redhat.com on behalf of Lon Hohberger
Sent: Mon 8/20/2007 14:06
To: linux clustering
Subject: Re: [Linux-cluster] Configuring NFS Export Options



On Mon, Aug 20, 2007 at 12:01:55PM -0400, Michael Patrimonio wrote:
> Greetings,
> 
> Where (or how) does a user enter the options available for the NFS export command in the system-config-cluster interface? I am trying to make file systems (and some of their sub-directories) of a NFS server highly available to users.

Which options are you looking for?  Some options are (usually)
inherited.

Your service structure should resemble:

  <service name="foo">
    <fs name="filesystem1" mountpoint="/mnt/tmp" ...>
      <nfsexports name="exports">
        <nfsclient name="client1" target="client1.mydomain.com"
                  options="rw,no_root_squash,..."/>
      </nfsexport>
    </fs>
  </service>

I *thought* the nfs client configuration dialog box allowed you to put
in name/target/options... is that not happening?

-- Lon

--
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 5387 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/3566ba55/attachment.bin>

From grimme at atix.de  Mon Aug 20 18:52:42 2007
From: grimme at atix.de (Marc Grimme)
Date: Mon, 20 Aug 2007 20:52:42 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46C9BF13.1050102@fu-berlin.de>
References: <46C78986.4070301@fu-berlin.de> <46C99A5A.5030905@redhat.com>
	<46C9BF13.1050102@fu-berlin.de>
Message-ID: <200708202052.42581.grimme@atix.de>

On Monday 20 August 2007 18:19:31 Sebastian Walter wrote:
> Hi,
>
> after putting massive load on the cluster, 55 % of the nodes died again
> (after adjusting the glock_purge to 50). I don't think (and hope) that
> it's the hardware, as normal filesystems don't make problems and running
> it with low load also runs fine. I will check this, but it will be a
> more comprehensive task. Maybe I can improve by tuning the volume better?
>
> Here is what /var/log/messages gives me:
> Aug 20 16:24:50 compute-0-10.local clurgmgrd[4283]: <err> #48: Unable to
> obtain cluster lock: Connection timed out
> Aug 20 16:25:04 compute-0-3.local clurgmgrd[4280]: <err> #48: Unable to
> obtain cluster lock: Connection timed out
> Aug 20 16:25:35 compute-0-10.local clurgmgrd[4283]: <err> #50: Unable to
> obtain cluster lock: Connection timed out
> Aug 20 16:25:49 compute-0-3.local clurgmgrd[4280]: <err> #50: Unable to
> obtain cluster lock: Connection timed out
> (these are the errors from the still running nodes, they are repeated
> several times)
>
> gfs_tool counters /global/home is blocked and not responding. Btw, I'm
> running CentOS 4 Update 5 on all the nodes.
>
> Thanks for any comment. Regards,
> Sebastian
>
> Wendy Cheng wrote:
> > Sebastian Walter wrote:
> >>>>>> This is what /var/log/messages gives me (on nearly all nodes):
> >>>>>> Aug 18 04:39:06 compute-0-2 clurgmgrd[4225]: <err> #49: Failed
> >>>>>> getting
> >>>>>> status for RG gfs-2
> >>>>>> and e.g.
> >>>>>> Aug 18 04:45:38 compute-0-6 clurgmgrd[9074]: <err> #50: Unable to
> >>>>>> obtain
> >>>>>> cluster lock: Connection timed out
> >
> > GFS glock trimming patch *could* help. However, the lock leak *here*
> > is from clurgmgrd (cluster infrastructure), not GFS (filesystem)
> > itself. So these two are different issues. I vaguely recall clurgmgrd
> > did have a bugzilla for this and was fixed sometime ago.
> >
> > Lon ?
> >
> > -- Wendy
> >
> >
> >
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Do you also see some messages on the console of the nodes. And the gfs_tool 
counters would help before that problem occures. So let it run sometimes 
before to see if locks increase.
What kind of stress tests are you doing? I bet searching the whole filesystem.
What makes me wonder is that the gfs_tool glock_purge does not work whereas it 
worked for me with exactly the same problems. Did you set it _AFTER_ the fs 
was mounted?

Regards Marc.

-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss



From lhh at redhat.com  Mon Aug 20 19:12:52 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 20 Aug 2007 15:12:52 -0400
Subject: [Linux-cluster] Configuring NFS Export Options
In-Reply-To: <576CD4D20306324798BB3AE562A2C084027AA152@X2K3.mjm.com>
References: <576CD4D20306324798BB3AE562A2C084027AA14D@X2K3.mjm.com>
	<20070820180640.GE26997@redhat.com>
	<576CD4D20306324798BB3AE562A2C084027AA152@X2K3.mjm.com>
Message-ID: <20070820191249.GG26997@redhat.com>

On Mon, Aug 20, 2007 at 02:49:04PM -0400, Michael Patrimonio wrote:
>  
> Thank you for your prompt response. I believe it is doing what it's supposed to, but the documentation is confusing.
>  
> When I see "NFS Client" I am expecting it to configure a highly available NFS client, not the client access for my NFS server. Most of my experience is with MC/ServiceGuard, which may also be contributing to my confusion.

> I did a test and I can see an entry for the directory in the output of the exportfs command. However, when I try to mount the file system on the remote node, I am getting a "program not registered error".

Statd / lockd not running on the client-side?

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From Randy.Brown at noaa.gov  Mon Aug 20 19:26:08 2007
From: Randy.Brown at noaa.gov (Randy Brown)
Date: Mon, 20 Aug 2007 15:26:08 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
Message-ID: <46C9EAD0.1060107@noaa.gov>

in order to configure a two-node high availability NFS failover cluster, 
I need to use GFS, correct?

I am wanting to configure two machines in a cluster and use them as a 
NAS head for an ISCSI based storage unit  providing NFS file systems to 
the machines on our network.  I'd like to have the units configured so 
if one fails, the other takes over and no interruption to the nfs mounts 
(high-availability failover). Everything I'm reading indicates that GFS 
would be required to do this, but would like confirmation.  In case I'm 
just missing something.  Worst case, I guess I could just have our 
second machine set up and ready to go as a warm spare.

Any thoughts or suggestions would be greatly appreciated.

Thank you,

Randy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: randy.brown.vcf
Type: text/x-vcard
Size: 348 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/e22f977a/attachment.vcf>

From teigland at redhat.com  Mon Aug 20 19:40:06 2007
From: teigland at redhat.com (David Teigland)
Date: Mon, 20 Aug 2007 14:40:06 -0500
Subject: [Linux-cluster] Multicast
In-Reply-To: <20581858.43681187634105022.JavaMail.root@lisa.itlinux.cl>
References: <26305344.43631187633501947.JavaMail.root@lisa.itlinux.cl>
	<20581858.43681187634105022.JavaMail.root@lisa.itlinux.cl>
Message-ID: <20070820194006.GF31911@redhat.com>

On Mon, Aug 20, 2007 at 02:21:45PM -0400, Patricio A. Bruna wrote:
> Here goes the information.

Is this RHEL 5.0?  Are you using the original cman init script or the
updated one?  Could you 'chkconfig cman off' on both nodes, then reboot
both nodes, when both have come back, then manually 'service cman start'
on both nodes.  If they get stuck could you collect from both nodes:

cman_tool status
cman_tool nodes
group_tool -v
group_tool dump > groupd.txt
group_tool dump fence > fenced.txt
/var/log/messages (anything related to aisexec, cman, groupd, fencing, ...)

Thanks,
Dave



From pbruna at it-linux.cl  Mon Aug 20 20:12:32 2007
From: pbruna at it-linux.cl (Patricio A. Bruna)
Date: Mon, 20 Aug 2007 16:12:32 -0400 (CLT)
Subject: [Linux-cluster] Multicast
In-Reply-To: <20070820194006.GF31911@redhat.com>
Message-ID: <2725062.44091187640752722.JavaMail.root@lisa.itlinux.cl>

This is RHEL 5.0.
Im using the original cman init script provided by: cman-2.0.60-1.el5


----- "David Teigland" <teigland at redhat.com> escribi?:
> On Mon, Aug 20, 2007 at 02:21:45PM -0400, Patricio A. Bruna wrote:
> > Here goes the information.
> 
> Is this RHEL 5.0?  Are you using the original cman init script or the
> updated one?  Could you 'chkconfig cman off' on both nodes, then
> reboot
> both nodes, when both have come back, then manually 'service cman
> start'
> on both nodes.  If they get stuck could you collect from both nodes:
> 
> cman_tool status
> cman_tool nodes
> group_tool -v
> group_tool dump > groupd.txt
> group_tool dump fence > fenced.txt
> /var/log/messages (anything related to aisexec, cman, groupd, fencing,
> ...)
> 
> Thanks,
> Dave




From dmp at mjm.com  Mon Aug 20 20:15:03 2007
From: dmp at mjm.com (Michael Patrimonio)
Date: Mon, 20 Aug 2007 16:15:03 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
References: <46C9EAD0.1060107@noaa.gov>
Message-ID: <576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com>

Randy,
 
My application is slightly different, but I went through all of this very recently, and depending on a couple of variables, the answer may be; "no, you do not need GFS."
 
I am in the process of configuring a two-node cluster, with each acting as an NFS server. In the event of a node failure, the surviving node takes ownership of the file systems and exports them as NFS shares. Initially, we built the systems with RHEL 4 Update 4 and Cluster Suite. Our initial attempts to build the NFS services did no go well because of problems with this version of Cluster Suite's NFS facility. Red Hat support suggested going back to RHEL 4 Update 3, which was not an option for us, so we built a partially manual failover.
 
But then RHEL4 Update 5 and RHEL5 were released, and we now had two options:
 
1 - Update to the RHEL4 U5 release to configure service with NFS as it was originally designed. According to Red Hat support, the Cluster Suite NFS "bug" introduced with the RHEL4 Update 4 release was fixed in RHEL4 Update 5, but not RHEL5.
 
2 - Update to RHEL5, which repackages the whole Cluster Suite product with GFS bundled into it at no additional cost. Prior to this, GFS had to be purchased. This would have added a layer of complexity to the configuration we don't need at this time.
 
You may want to wait for someone else to address the "continuity" component of your question.
 
I hope this helps.
 
Regards,
Michael
 
P.S.--According to the documentation, the RHEL4 Update 3 release had a working "NFS" facility and a utility to import the NFS configuration from the existing /etc/exports file. Unfortunately, this was dropped with RHEL4 Update 4 and not restored with Update 5. This would have made this build of this configuration so much easier. Oh well.
=====

________________________________

From: linux-cluster-bounces at redhat.com on behalf of Randy Brown
Sent: Mon 8/20/2007 15:26
To: linux clustering
Subject: [Linux-cluster] Please correct me if I'm wrong, but...



in order to configure a two-node high availability NFS failover cluster,
I need to use GFS, correct?

I am wanting to configure two machines in a cluster and use them as a
NAS head for an ISCSI based storage unit  providing NFS file systems to
the machines on our network.  I'd like to have the units configured so
if one fails, the other takes over and no interruption to the nfs mounts
(high-availability failover). Everything I'm reading indicates that GFS
would be required to do this, but would like confirmation.  In case I'm
just missing something.  Worst case, I guess I could just have our
second machine set up and ready to go as a warm spare.

Any thoughts or suggestions would be greatly appreciated.

Thank you,

Randy


-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 6055 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/01805466/attachment.bin>

From lhh at redhat.com  Mon Aug 20 20:20:18 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 20 Aug 2007 16:20:18 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <46C9EAD0.1060107@noaa.gov>
References: <46C9EAD0.1060107@noaa.gov>
Message-ID: <20070820202018.GK26997@redhat.com>

On Mon, Aug 20, 2007 at 03:26:08PM -0400, Randy Brown wrote:
> in order to configure a two-node high availability NFS failover cluster, 
> I need to use GFS, correct?

You can use EXT3; you just can only mount the file system on one
node at a time. 

With GFS, you can export the same file system from *both* cluster nodes.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From teigland at redhat.com  Mon Aug 20 20:16:15 2007
From: teigland at redhat.com (David Teigland)
Date: Mon, 20 Aug 2007 15:16:15 -0500
Subject: [Linux-cluster] Multicast
In-Reply-To: <2725062.44091187640752722.JavaMail.root@lisa.itlinux.cl>
References: <20070820194006.GF31911@redhat.com>
	<2725062.44091187640752722.JavaMail.root@lisa.itlinux.cl>
Message-ID: <20070820201615.GG31911@redhat.com>

On Mon, Aug 20, 2007 at 04:12:32PM -0400, Patricio A. Bruna wrote:
> This is RHEL 5.0.
> Im using the original cman init script provided by: cman-2.0.60-1.el5

You'll want to update that; the cluster will be clobbered by the xen init
script without the fix.  You can work around that for now by doing the
'chkconfig cman off' and then starting the cluster manually on both nodes
with 'service cman start'.

I can't say that the init script is your main or only problem here, but we
need to eliminate that first.


> ----- "David Teigland" <teigland at redhat.com> escribi??:
> > On Mon, Aug 20, 2007 at 02:21:45PM -0400, Patricio A. Bruna wrote:
> > > Here goes the information.
> > 
> > Is this RHEL 5.0?  Are you using the original cman init script or the
> > updated one?  Could you 'chkconfig cman off' on both nodes, then
> > reboot
> > both nodes, when both have come back, then manually 'service cman
> > start'
> > on both nodes.  If they get stuck could you collect from both nodes:
> > 
> > cman_tool status
> > cman_tool nodes
> > group_tool -v
> > group_tool dump > groupd.txt
> > group_tool dump fence > fenced.txt
> > /var/log/messages (anything related to aisexec, cman, groupd, fencing,
> > ...)
> > 
> > Thanks,
> > Dave



From Randy.Brown at noaa.gov  Mon Aug 20 20:25:32 2007
From: Randy.Brown at noaa.gov (Randy Brown)
Date: Mon, 20 Aug 2007 16:25:32 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com>
References: <46C9EAD0.1060107@noaa.gov>
	<576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com>
Message-ID: <46C9F8BC.5@noaa.gov>

Thanks for the info.  I'm confused about one thing, though.  I was given 
approval to buy the Cluster suite for RHEL 5, but was told by our 
reseller that GFS was not included in the Cluster Suite and that GFS was 
about 4x as expensive as the Cluster Suite.  It seemed silly to me that 
GFS would be a separate purchase and be SO much more expensive. (I 
believe Cluster Suite was quoted at  about $500 per copy and GFS, or 
whatever we had to buy to get GFS was around $2200 per copy).  GFS seems 
like the way to go, but we can't afford $6000 for both the cluster suite 
and GFS if that's the case.  If it's included in RHCS, we may be in 
business.

Randy

Michael Patrimonio wrote:
> Randy,
>  
> My application is slightly different, but I went through all of this very recently, and depending on a couple of variables, the answer may be; "no, you do not need GFS."
>  
> I am in the process of configuring a two-node cluster, with each acting as an NFS server. In the event of a node failure, the surviving node takes ownership of the file systems and exports them as NFS shares. Initially, we built the systems with RHEL 4 Update 4 and Cluster Suite. Our initial attempts to build the NFS services did no go well because of problems with this version of Cluster Suite's NFS facility. Red Hat support suggested going back to RHEL 4 Update 3, which was not an option for us, so we built a partially manual failover.
>  
> But then RHEL4 Update 5 and RHEL5 were released, and we now had two options:
>  
> 1 - Update to the RHEL4 U5 release to configure service with NFS as it was originally designed. According to Red Hat support, the Cluster Suite NFS "bug" introduced with the RHEL4 Update 4 release was fixed in RHEL4 Update 5, but not RHEL5.
>  
> 2 - Update to RHEL5, which repackages the whole Cluster Suite product with GFS bundled into it at no additional cost. Prior to this, GFS had to be purchased. This would have added a layer of complexity to the configuration we don't need at this time.
>  
> You may want to wait for someone else to address the "continuity" component of your question.
>  
> I hope this helps.
>  
> Regards,
> Michael
>  
> P.S.--According to the documentation, the RHEL4 Update 3 release had a working "NFS" facility and a utility to import the NFS configuration from the existing /etc/exports file. Unfortunately, this was dropped with RHEL4 Update 4 and not restored with Update 5. This would have made this build of this configuration so much easier. Oh well.
> =====
>
> ________________________________
>
> From: linux-cluster-bounces at redhat.com on behalf of Randy Brown
> Sent: Mon 8/20/2007 15:26
> To: linux clustering
> Subject: [Linux-cluster] Please correct me if I'm wrong, but...
>
>
>
> in order to configure a two-node high availability NFS failover cluster,
> I need to use GFS, correct?
>
> I am wanting to configure two machines in a cluster and use them as a
> NAS head for an ISCSI based storage unit  providing NFS file systems to
> the machines on our network.  I'd like to have the units configured so
> if one fails, the other takes over and no interruption to the nfs mounts
> (high-availability failover). Everything I'm reading indicates that GFS
> would be required to do this, but would like confirmation.  In case I'm
> just missing something.  Worst case, I guess I could just have our
> second machine set up and ready to go as a warm spare.
>
> Any thoughts or suggestions would be greatly appreciated.
>
> Thank you,
>
> Randy
>
>
>   
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/180c7ef1/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: randy.brown.vcf
Type: text/x-vcard
Size: 348 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/180c7ef1/attachment.vcf>

From Randy.Brown at noaa.gov  Mon Aug 20 20:39:05 2007
From: Randy.Brown at noaa.gov (Randy Brown)
Date: Mon, 20 Aug 2007 16:39:05 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <20070820202018.GK26997@redhat.com>
References: <46C9EAD0.1060107@noaa.gov> <20070820202018.GK26997@redhat.com>
Message-ID: <46C9FBE9.6090504@noaa.gov>

Right.  That's the way I understood it to be.  Using ext3 would require 
us to have to umount and remount the file systems to the each host after 
the failure, though, correct?  In other words, would require 
administrator interaction.  GFS would do this automatically without 
impacting the users.

Randy

Lon Hohberger wrote:
> On Mon, Aug 20, 2007 at 03:26:08PM -0400, Randy Brown wrote:
>   
>> in order to configure a two-node high availability NFS failover cluster, 
>> I need to use GFS, correct?
>>     
>
> You can use EXT3; you just can only mount the file system on one
> node at a time. 
>
> With GFS, you can export the same file system from *both* cluster nodes.
>
> -- Lon
>
>   
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/47bac54f/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: randy.brown.vcf
Type: text/x-vcard
Size: 348 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/47bac54f/attachment.vcf>

From jos at xos.nl  Mon Aug 20 20:53:52 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 20 Aug 2007 22:53:52 +0200
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <46C9FBE9.6090504@noaa.gov>;
	from Randy.Brown@noaa.gov on Mon, Aug 20, 2007 at 04:39:05PM -0400
References: <46C9EAD0.1060107@noaa.gov> <20070820202018.GK26997@redhat.com>
	<46C9FBE9.6090504@noaa.gov>
Message-ID: <20070820225352.A1799@xos037.xos.nl>

On Mon, Aug 20, 2007 at 04:39:05PM -0400, Randy Brown wrote:

> Right.  That's the way I understood it to be.  Using ext3 would require 
> us to have to umount and remount the file systems to the each host after 
> the failure, though, correct?  In other words, would require 
> administrator interaction.  GFS would do this automatically without 
> impacting the users.

It doesn't need to do automatic mounts/unmounts, it is just mounted in
parallel on multiple cluster nodes and can be used (R/W) in parallel
on all nodes.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From dmp at mjm.com  Mon Aug 20 20:54:38 2007
From: dmp at mjm.com (Michael Patrimonio)
Date: Mon, 20 Aug 2007 16:54:38 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
References: <46C9EAD0.1060107@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com>
	<46C9F8BC.5@noaa.gov>
Message-ID: <576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com>

Randy,
 
I don't know what to say about the information you're getting, but I just logged in to a Red Hat Network account to verify. The Cluster Suite offering was repackaged with RHEL5, and it now has three major components (on Disk 6 of the ISO downloads). They are; Cluster, Cluster Storage (which includes GFS and lvm2-cluster) and Virtualization.
 
Regards,
Michael
=====

________________________________

From: linux-cluster-bounces at redhat.com on behalf of Randy Brown
Sent: Mon 8/20/2007 16:25
To: linux clustering
Subject: Re: [Linux-cluster] Please correct me if I'm wrong, but...


Thanks for the info.  I'm confused about one thing, though.  I was given approval to buy the Cluster suite for RHEL 5, but was told by our reseller that GFS was not included in the Cluster Suite and that GFS was about 4x as expensive as the Cluster Suite.  It seemed silly to me that GFS would be a separate purchase and be SO much more expensive. (I believe Cluster Suite was quoted at  about $500 per copy and GFS, or whatever we had to buy to get GFS was around $2200 per copy).  GFS seems like the way to go, but we can't afford $6000 for both the cluster suite and GFS if that's the case.  If it's included in RHCS, we may be in business.

Randy

Michael Patrimonio wrote: 

	Randy,
	 
	My application is slightly different, but I went through all of this very recently, and depending on a couple of variables, the answer may be; "no, you do not need GFS."
	 
	I am in the process of configuring a two-node cluster, with each acting as an NFS server. In the event of a node failure, the surviving node takes ownership of the file systems and exports them as NFS shares. Initially, we built the systems with RHEL 4 Update 4 and Cluster Suite. Our initial attempts to build the NFS services did no go well because of problems with this version of Cluster Suite's NFS facility. Red Hat support suggested going back to RHEL 4 Update 3, which was not an option for us, so we built a partially manual failover.
	 
	But then RHEL4 Update 5 and RHEL5 were released, and we now had two options:
	 
	1 - Update to the RHEL4 U5 release to configure service with NFS as it was originally designed. According to Red Hat support, the Cluster Suite NFS "bug" introduced with the RHEL4 Update 4 release was fixed in RHEL4 Update 5, but not RHEL5.
	 
	2 - Update to RHEL5, which repackages the whole Cluster Suite product with GFS bundled into it at no additional cost. Prior to this, GFS had to be purchased. This would have added a layer of complexity to the configuration we don't need at this time.
	 
	You may want to wait for someone else to address the "continuity" component of your question.
	 
	I hope this helps.
	 
	Regards,
	Michael
	 
	P.S.--According to the documentation, the RHEL4 Update 3 release had a working "NFS" facility and a utility to import the NFS configuration from the existing /etc/exports file. Unfortunately, this was dropped with RHEL4 Update 4 and not restored with Update 5. This would have made this build of this configuration so much easier. Oh well.
	=====
	
	________________________________
	
	From: linux-cluster-bounces at redhat.com on behalf of Randy Brown
	Sent: Mon 8/20/2007 15:26
	To: linux clustering
	Subject: [Linux-cluster] Please correct me if I'm wrong, but...
	
	
	
	in order to configure a two-node high availability NFS failover cluster,
	I need to use GFS, correct?
	
	I am wanting to configure two machines in a cluster and use them as a
	NAS head for an ISCSI based storage unit  providing NFS file systems to
	the machines on our network.  I'd like to have the units configured so
	if one fails, the other takes over and no interruption to the nfs mounts
	(high-availability failover). Everything I'm reading indicates that GFS
	would be required to do this, but would like confirmation.  In case I'm
	just missing something.  Worst case, I guess I could just have our
	second machine set up and ready to go as a warm spare.
	
	Any thoughts or suggestions would be greatly appreciated.
	
	Thank you,
	
	Randy
	
	
	  
	
________________________________


	--
	Linux-cluster mailing list
	Linux-cluster at redhat.com
	https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 6563 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/c223688b/attachment.bin>

From dmp at mjm.com  Mon Aug 20 20:55:10 2007
From: dmp at mjm.com (Michael Patrimonio)
Date: Mon, 20 Aug 2007 16:55:10 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
References: <46C9EAD0.1060107@noaa.gov> <20070820202018.GK26997@redhat.com>
	<46C9FBE9.6090504@noaa.gov>
Message-ID: <576CD4D20306324798BB3AE562A2C084027AA156@X2K3.mjm.com>

 
No, it does not require administrator interaction--in the service you would add a resource for the file system. This would move the file system between nodes.
 
=====
 
________________________________

From: linux-cluster-bounces at redhat.com on behalf of Randy Brown
Sent: Mon 8/20/2007 16:39
To: linux clustering
Subject: Re: [Linux-cluster] Please correct me if I'm wrong, but...


Right.  That's the way I understood it to be.  Using ext3 would require us to have to umount and remount the file systems to the each host after the failure, though, correct?  In other words, would require administrator interaction.  GFS would do this automatically without impacting the users.

Randy

Lon Hohberger wrote: 

	On Mon, Aug 20, 2007 at 03:26:08PM -0400, Randy Brown wrote:
	  

		in order to configure a two-node high availability NFS failover cluster, 
		I need to use GFS, correct?
		    

	You can use EXT3; you just can only mount the file system on one
	node at a time. 
	
	With GFS, you can export the same file system from *both* cluster nodes.
	
	-- Lon
	
	  

-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 4299 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/aa8cab1c/attachment.bin>

From jos at xos.nl  Mon Aug 20 21:01:24 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 20 Aug 2007 23:01:24 +0200
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com>;
	from dmp@mjm.com on Mon, Aug 20, 2007 at 04:54:38PM -0400
References: <46C9EAD0.1060107@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com>
	<46C9F8BC.5@noaa.gov>
	<576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com>
Message-ID: <20070820230124.B1799@xos037.xos.nl>

On Mon, Aug 20, 2007 at 04:54:38PM -0400, Michael Patrimonio wrote:

> I don't know what to say about the information you're getting, but I
> just logged in to a Red Hat Network account to verify. The Cluster Suite
> offering was repackaged with RHEL5, and it now has three major components
> (on Disk 6 of the ISO downloads). They are; Cluster, Cluster Storage
> (which includes GFS and lvm2-cluster) and Virtualization.

Virtualization has not relation with the cluster suite, it can even be
bought in combination with a Desktop.

RHEL4's RHCS is now in called "Cluster" and RHEL4's RHGFS is now
"Cluster Storage".  Not much have changed w.r.t. the subscriptions,
I think the situation might vary per country, but Cluster Storage
in indeed much more expensive than the Cluster component.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From Randy.Brown at noaa.gov  Mon Aug 20 21:02:31 2007
From: Randy.Brown at noaa.gov (Randy Brown)
Date: Mon, 20 Aug 2007 17:02:31 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <576CD4D20306324798BB3AE562A2C084027AA156@X2K3.mjm.com>
References: <46C9EAD0.1060107@noaa.gov> <20070820202018.GK26997@redhat.com>
	<46C9FBE9.6090504@noaa.gov>
	<576CD4D20306324798BB3AE562A2C084027AA156@X2K3.mjm.com>
Message-ID: <46CA0167.8060909@noaa.gov>

hmmmm.. OK.  So it looks like we can do what we want to do with or 
without GFS, and it looks like I was also misinformed about the 
cost/availability.  Interesting.  I guess I know what I'll be doing over 
the next couple of days then.  Thanks for all of the quick, informative, 
responses.  I'm sure I'll have more questions in the coming days.

Regards,
Randy

Michael Patrimonio wrote:
>  
> No, it does not require administrator interaction--in the service you would add a resource for the file system. This would move the file system between nodes.
>  
> =====
>  
> ________________________________
>
> From: linux-cluster-bounces at redhat.com on behalf of Randy Brown
> Sent: Mon 8/20/2007 16:39
> To: linux clustering
> Subject: Re: [Linux-cluster] Please correct me if I'm wrong, but...
>
>
> Right.  That's the way I understood it to be.  Using ext3 would require us to have to umount and remount the file systems to the each host after the failure, though, correct?  In other words, would require administrator interaction.  GFS would do this automatically without impacting the users.
>
> Randy
>
> Lon Hohberger wrote: 
>
> 	On Mon, Aug 20, 2007 at 03:26:08PM -0400, Randy Brown wrote:
> 	  
>
> 		in order to configure a two-node high availability NFS failover cluster, 
> 		I need to use GFS, correct?
> 		    
>
> 	You can use EXT3; you just can only mount the file system on one
> 	node at a time. 
> 	
> 	With GFS, you can export the same file system from *both* cluster nodes.
> 	
> 	-- Lon
> 	
> 	  
>
>   
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/377a452a/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: randy.brown.vcf
Type: text/x-vcard
Size: 348 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/377a452a/attachment.vcf>

From jos at xos.nl  Mon Aug 20 21:06:19 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 20 Aug 2007 23:06:19 +0200
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <46CA0167.8060909@noaa.gov>;
	from Randy.Brown@noaa.gov on Mon, Aug 20, 2007 at 05:02:31PM -0400
References: <46C9EAD0.1060107@noaa.gov> <20070820202018.GK26997@redhat.com>
	<46C9FBE9.6090504@noaa.gov>
	<576CD4D20306324798BB3AE562A2C084027AA156@X2K3.mjm.com>
	<46CA0167.8060909@noaa.gov>
Message-ID: <20070820230619.C1799@xos037.xos.nl>

On Mon, Aug 20, 2007 at 05:02:31PM -0400, Randy Brown wrote:

> hmmmm.. OK.  So it looks like we can do what we want to do with or 
> without GFS, and it looks like I was also misinformed about the 
> cost/availability.  Interesting.  I guess I know what I'll be doing over 
> the next couple of days then.  Thanks for all of the quick, informative, 
> responses.  I'm sure I'll have more questions in the coming days.

I don't know why you think you was misinformed about the costs.

Cluster Storage indeed is *much* more expensive than the Cluster
software itself (I didn't compare the figures you gave, as they may
differ per country/continent).

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From kanderso at redhat.com  Mon Aug 20 21:12:55 2007
From: kanderso at redhat.com (Kevin Anderson)
Date: Mon, 20 Aug 2007 16:12:55 -0500
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com>
References: <46C9EAD0.1060107@noaa.gov>
	<576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com>
	<46C9F8BC.5@noaa.gov>
	<576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com>
Message-ID: <1187644375.2714.36.camel@localhost.localdomain>

On Mon, 2007-08-20 at 16:54 -0400, Michael Patrimonio wrote:
> Randy,
>  
> I don't know what to say about the information you're getting, but I just logged in to a Red Hat Network account to verify. The Cluster Suite offering was repackaged with RHEL5, and it now has three major components (on Disk 6 of the ISO downloads). They are; Cluster, Cluster Storage (which includes GFS and lvm2-cluster) and Virtualization.
>  
The cluster and gfs components are actually part of the RHEL5 Advanced
Platform product grouping, so if you purchase a subscription to that,
you get the cluster components, gfs and virtualization.

Kevin

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070820/faea64a5/attachment.htm>

From jos at xos.nl  Mon Aug 20 21:18:26 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 20 Aug 2007 23:18:26 +0200
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <1187644375.2714.36.camel@localhost.localdomain>;
	from kanderso@redhat.com on Mon, Aug 20, 2007 at 04:12:55PM -0500
References: <46C9EAD0.1060107@noaa.gov>
	<576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com>
	<46C9F8BC.5@noaa.gov>
	<576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com>
	<1187644375.2714.36.camel@localhost.localdomain>
Message-ID: <20070820231826.D1799@xos037.xos.nl>

On Mon, Aug 20, 2007 at 04:12:55PM -0500, Kevin Anderson wrote:

> The cluster and gfs components are actually part of the RHEL5 Advanced
> Platform product grouping, so if you purchase a subscription to that,
> you get the cluster components, gfs and virtualization.

Does that actually work that way in the US?

Here in Europe at least that's not true, RHEL AP includes virtualization,
but cluster and cluster storage still have to be bought seperately.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From Jeremyc at tasconline.com  Mon Aug 20 21:27:37 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Mon, 20 Aug 2007 16:27:37 -0500
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <20070820231826.D1799@xos037.xos.nl>
References: <46C9EAD0.1060107@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com><46C9F8BC.5@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com><1187644375.2714.36.camel@localhost.localdomain>
	<20070820231826.D1799@xos037.xos.nl>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F18@exchange2003.tasconline.com>

Yes. In the US RHEL 5 AP Includes GFS / Cluster.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jos Vos
Sent: Monday, August 20, 2007 4:18 PM
To: linux clustering
Subject: Re: [Linux-cluster] Please correct me if I'm wrong, but...

On Mon, Aug 20, 2007 at 04:12:55PM -0500, Kevin Anderson wrote:

> The cluster and gfs components are actually part of the RHEL5 Advanced
> Platform product grouping, so if you purchase a subscription to that,
> you get the cluster components, gfs and virtualization.

Does that actually work that way in the US?

Here in Europe at least that's not true, RHEL AP includes
virtualization,
but cluster and cluster storage still have to be bought seperately.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From joseparrella at gmail.com  Mon Aug 20 21:54:40 2007
From: joseparrella at gmail.com (=?ISO-8859-1?Q?Jos=E9_Miguel_Parrella_Romero?=)
Date: Mon, 20 Aug 2007 17:54:40 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <46C9EAD0.1060107@noaa.gov>
References: <46C9EAD0.1060107@noaa.gov>
Message-ID: <46CA0DA0.1060501@gmail.com>

Randy Brown wrote:
> in order to configure a two-node high availability NFS failover cluster,
> I need to use GFS, correct?

No. Both options accomplish the same objective: sharing an storage space
and providing seamless services to your users. I'll try and explain my
experience with GFS vs. NFS, and I'm already using GFS because of things
I'll expose now. Please note that I'm not an expert in this topic, so
ignore me if this sounds insane.

So, you have a SAN and a media to access it from two nodes (say, Fiber
Channel) -- both of your nodes will 'see' a block device, where you can
make a filesystem and mount it. Looks tempting.

If you use 'ext3', you can mount the filesystem in both nodes, but
changes in one node won't be seen by the other, since changes to the
inode table reside on each node and they have no way to tell each other
something has changed.

You need some way to 'share' the filesystem. You can do this by means of
a client/server filesystem (such as NFS, Samba or Coda) or by a so
called 'distributed filesystem' such as GFS or (ew!) OCFS.

With GFS, you format the device each node see as GFS, and you can mount
it in both nodes. You need some underlying services, such as ccs, cman
and lockd, in order to get both nodes communicated. They send each other
control messages (such as 'hey, i'm using this file') and they monitor
each other ('hey, are you there?') and fence (= nightmare) properly.

This communication media is usually Ethernet, and it can get as bad as
10BaseT. This is an 'active-active' configuration, if you want to see it
that way. With NFS, you go 'active-passive': one of your nodes mount the
device and export it, the other one mounts the export.

In this case, data is shared through the communication media, and if
it's 10BaseT, then you're bottlenecking something. That's why some
people will tell you that you _need_ to use GFS.

If you need to do high-availability, then you need the passive node to
be able to mount the device and share it as an NFS export (the latter
might not be necessary, if you think of it) and do that quickly without
users being impacted. NFS helps by being stateless, and you need
something else, such as Heartbeat, to 'raise' the resources when the
other node goes down.

The 'Schillinger page' [1] proved to be a nice source of information
regarding active-active NFS using non-vendor-locked tools. It's now
down, but Google has it cached.

Finally, I want you to know that I'm a Debian user (and the whole
cluster is Debian-based), and I find GFS a quite nice piece of software.
I didn't pay anything for it, it's free (both as in beer and as in a
speech) and it works OK for me, so don't worry if it's not in your
'package' -- you should be able to use it in any Linux system.

Hope this helps,
Jose

[1]
http://64.233.169.104/search?q=cache:uGsLGtBSuWoJ:chilli.linuxmds.com/~mschilli/NFS/active-active-nfs.html+active+active+NFS&hl=es&ct=clnk&cd=1&client=iceweasel-a



From pheeh at nodeps.org  Mon Aug 20 22:27:13 2007
From: pheeh at nodeps.org (pheeh at nodeps.org)
Date: Mon, 20 Aug 2007 15:27:13 -0700 (MST)
Subject: [Linux-cluster] GNBD with multiple iSCSI initiators
Message-ID: <47186.12.152.67.72.1187648833.squirrel@12.152.67.72>

I have a question about the multipath functionality that I can't seem to
find from the RedHat documentation.
I've only configured GFS for use with a single GNBD server which exports
local drives.
The goal in this configuration is to use 4 machines, 2 GNBD servers and 2
Storage servers.
The theory I have is the following:
Export one partition on each storage server via iscsi to the GNBD servers.
Use multipath to direct traffic between the two iscsi devices.
Configure the two GNBD servers for use in a fault tolerant environment.
The question comes in when figuring out hwo GNBD multipath will direct
traffic.
would i have each gnbd server offer out an individual iscsi initiator from
one storage server?  would i have each gnbd server offer out both iscsi
initiators?
would gnbd handle the data syncing between initiators to ensure that
storage servers both have the same data?
would i have one or two lvm's with this setup?
the documentation of gnbd and iscsi seems a bit sparse...the redhat manual
seems like it glazes over the subject.
when i purchase the subscription do i get more GFS documentation?  kinda a
catch 22 in that case since i hvae to get the proof of concept runnign
before i can get the budget approved :(



From sebastian.walter at fu-berlin.de  Tue Aug 21 07:52:32 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 21 Aug 2007 09:52:32 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <200708202052.42581.grimme@atix.de>
References: <46C78986.4070301@fu-berlin.de>
	<46C99A5A.5030905@redhat.com>	<46C9BF13.1050102@fu-berlin.de>
	<200708202052.42581.grimme@atix.de>
Message-ID: <46CA99C0.6060502@fu-berlin.de>

Hi,

Marc Grimme wrote:
> Do you also see some messages on the console of the nodes. And the
> gfs_tool
> counters would help before that problem occures. So let it run sometimes 
> before to see if locks increase.
> What kind of stress tests are you doing? I bet searching the whole filesystem.
> What makes me wonder is that the gfs_tool glock_purge does not work whereas it 
> worked for me with exactly the same problems. Did you set it _AFTER_ the fs 
> was mounted?
>
>   

That makes me optimistic. I set it after the volume was mounted, so I
will give it another try setting it before mounting it. Then I will also
mail myself the output of the counters every 10 minuts. Let's see...

...with best thanks
Sebastian



From erwan at seanodes.com  Tue Aug 21 08:09:24 2007
From: erwan at seanodes.com (Erwan Velu)
Date: Tue, 21 Aug 2007 10:09:24 +0200
Subject: [Linux-cluster] Multicast
In-Reply-To: <46C94723.1030805@telkom.net.id>
References: <15398601.41701187368877032.JavaMail.root@lisa.itlinux.cl>
	<46C94723.1030805@telkom.net.id>
Message-ID: <46CA9DB4.8070102@seanodes.com>

Fajar A. Nugraha wrote:
> Patricio A. Bruna wrote:
>   
>> Hi,
>> Does cluster suite use Multicast in Red Hat 5?
>> If is so, how can i test my Switchs allow multicast traffic?
>>
>>     
> The switches that I tested allow multicast traffic by default.
> You can test it by running some programs that uses multicast address, like :
> - ping 224.0.0.1. Note that by default RHEL ignores broadcast pings, so
>   
Regarding the **IANA Guidelines for IPv4 Multicast Address Assignments 
(**http://www.rfc-archive.org/getrfc.php?rfc=3171),
using some 224.0.0/24 based addresses will bypass  some multicast mecanisms.

The 224.0.0/24 addresses are dedicated to the local network routing 
messages meaning all ports of your switch will receive this traffic.
But for the others multicast addresses, 3com/cisco and most of high end 
switches need to be configured to do the routing.

I mean you can find the "snooping" and the "querier/query" options in 
the multicast's switch configuration. Regarding the configuration, the 
mulitcast will works or not.

Snooping Off: all multicast traffic works
Snoop On: query off: Mcast is not working except for 224.0.0/24
Snoop On, query on: all multicast traffic works

This have been confirmed on a 3com 3870 switch. Its default 
configuration is set on "snoop on/query off".
In such configuration, you must enable the "query/querier" option at 
least on a switch to make the multicast working.

So pinging 224.0.0/24 could generates some false positive results 
because in some configuration 224.0.0/24 could works whereas all the 
other multicast adresses are not.

On some low & middle end switches (dell etc..), activating the snooping 
enable the "querier" at the same time so this situation doesn't exists.



From grimme at atix.de  Tue Aug 21 08:09:38 2007
From: grimme at atix.de (Marc Grimme)
Date: Tue, 21 Aug 2007 10:09:38 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46CA99C0.6060502@fu-berlin.de>
References: <46C78986.4070301@fu-berlin.de> <200708202052.42581.grimme@atix.de>
	<46CA99C0.6060502@fu-berlin.de>
Message-ID: <200708211009.38845.grimme@atix.de>

On Tuesday 21 August 2007 09:52:32 Sebastian Walter wrote:
> Hi,
>
> Marc Grimme wrote:
> > Do you also see some messages on the console of the nodes. And the
> > gfs_tool
> > counters would help before that problem occures. So let it run sometimes
> > before to see if locks increase.
> > What kind of stress tests are you doing? I bet searching the whole
> > filesystem. What makes me wonder is that the gfs_tool glock_purge does
> > not work whereas it worked for me with exactly the same problems. Did you
> > set it _AFTER_ the fs was mounted?
Sorry I mean after is right and before not ;-( .
And are you using the latest version of CS/GFS?
Do you have a lot of memory in your machines 16G or more?
>
> That makes me optimistic. I set it after the volume was mounted, so I
> will give it another try setting it before mounting it. Then I will also
> mail myself the output of the counters every 10 minuts. Let's see...
I would be interested in the counters.
Also add the process list in order to see if how much CPU-Time gfs_scand 
consumes.
i.e.
ps axwwww | sort -k4 -n | tail -10

Have fun Marc.
>
> ...with best thanks
> Sebastian
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss



From fajar at telkom.net.id  Tue Aug 21 08:29:09 2007
From: fajar at telkom.net.id (Fajar A. Nugraha)
Date: Tue, 21 Aug 2007 15:29:09 +0700
Subject: [Linux-cluster] Multicast
In-Reply-To: <46CA9DB4.8070102@seanodes.com>
References: <15398601.41701187368877032.JavaMail.root@lisa.itlinux.cl>
	<46C94723.1030805@telkom.net.id> <46CA9DB4.8070102@seanodes.com>
Message-ID: <46CAA255.7000800@telkom.net.id>

Erwan Velu wrote:
> Fajar A. Nugraha wrote:
>> Patricio A. Bruna wrote:
>>  
>>> Hi,
>>> Does cluster suite use Multicast in Red Hat 5?
>>> If is so, how can i test my Switchs allow multicast traffic?
>>>
>>>     
>> The switches that I tested allow multicast traffic by default.
>> You can test it by running some programs that uses multicast address,
>> like :
>> - ping 224.0.0.1. Note that by default RHEL ignores broadcast pings, so
>>   
> Regarding the **IANA Guidelines for IPv4 Multicast Address Assignments
> (**http://www.rfc-archive.org/getrfc.php?rfc=3171),
> using some 224.0.0/24 based addresses will bypass  some multicast
> mecanisms.
>
> The 224.0.0/24 addresses are dedicated to the local network routing
> messages meaning all ports of your switch will receive this traffic.
I see. Thanks for the information.
> But for the others multicast addresses, 3com/cisco and most of high
> end switches need to be configured to do the routing.
>
We use Cisco and foundry mostly, with the addition of some blade
switches (not sure who made those), and so far multicast (including
RHEL5 cluster, and jgroups) works without
any special multicast switch config.

Regards,

Fajar



From maciej.bogucki at artegence.com  Tue Aug 21 09:32:24 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Tue, 21 Aug 2007 11:32:24 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46C9B9EE.4080306@jouy.inra.fr>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>
	<46C9B9EE.4080306@jouy.inra.fr>
Message-ID: <46CAB128.9080109@artegence.com>

> Maciej, I didn't know i could use DRBD with RHCS, thanks! I am pretty
> sure it could resolve my issue. But i would like to try with the RedHat
> tools before.

Do You have shared storage?

Best Tegards
Maciej Bogucki



From kjoyeux at jouy.inra.fr  Tue Aug 21 09:40:42 2007
From: kjoyeux at jouy.inra.fr (kieran JOYEUX)
Date: Tue, 21 Aug 2007 11:40:42 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46CAB128.9080109@artegence.com>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>	<46C9B9EE.4080306@jouy.inra.fr>
	<46CAB128.9080109@artegence.com>
Message-ID: <46CAB31A.3050809@jouy.inra.fr>

Hi Maciej,

No shared storage, i'm using the node's local disks.

Thanks

Maciej Bogucki wrote:
>> Maciej, I didn't know i could use DRBD with RHCS, thanks! I am pretty
>> sure it could resolve my issue. But i would like to try with the RedHat
>> tools before.
>>     
>
> Do You have shared storage?
>
> Best Tegards
> Maciej Bogucki
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>   



From maciej.bogucki at artegence.com  Tue Aug 21 10:03:13 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Tue, 21 Aug 2007 12:03:13 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46CAB31A.3050809@jouy.inra.fr>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>	<46C9B9EE.4080306@jouy.inra.fr>	<46CAB128.9080109@artegence.com>
	<46CAB31A.3050809@jouy.inra.fr>
Message-ID: <46CAB861.4080508@artegence.com>

kieran JOYEUX napisa?(a):
> Hi Maciej,
> 
> No shared storage, i'm using the node's local disks.

Do You have filesystem replication between nodes? Or do You use rsync to
synchronize files between nodes? Or maybe You don't need it?

Best Regards
Maciej Bogucki



From kjoyeux at jouy.inra.fr  Tue Aug 21 11:02:04 2007
From: kjoyeux at jouy.inra.fr (kieran JOYEUX)
Date: Tue, 21 Aug 2007 13:02:04 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46CAB861.4080508@artegence.com>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>	<46C9B9EE.4080306@jouy.inra.fr>	<46CAB128.9080109@artegence.com>	<46CAB31A.3050809@jouy.inra.fr>
	<46CAB861.4080508@artegence.com>
Message-ID: <46CAC62C.7050409@jouy.inra.fr>

I don't have any replication systems. I don't really need it, rsync 
would be ok. All i want is not having that NFS stale handle error... 
With heartbeat + DRBD i have no issues about it.

What do you sugest?

Maciej Bogucki wrote:
> kieran JOYEUX napisa?(a):
>   
>> Hi Maciej,
>>
>> No shared storage, i'm using the node's local disks.
>>     
>
> Do You have filesystem replication between nodes? Or do You use rsync to
> synchronize files between nodes? Or maybe You don't need it?
>
> Best Regards
> Maciej Bogucki
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>   



From Arne.Brieseneck at vodafone.com  Tue Aug 21 13:03:00 2007
From: Arne.Brieseneck at vodafone.com (Brieseneck, Arne, VF-Group)
Date: Tue, 21 Aug 2007 15:03:00 +0200
Subject: [Linux-cluster] Add a fence device of type SUN ILOM
In-Reply-To: <58270.129.237.174.144.1187022323.squirrel@webmail.bradandkim.net>
Message-ID: <E67F1468BF7A4C418D874810215A377EC2BA07@EITO-MBX01.internal.vodafone.com>

Hi*,

I have tried to set up this configfile here:

                <clusternode name="box1" nodeid="1" votes="1">
                        <multicast addr="224.0.0.1" interface="eth0"/>
                        <fence>
                                <method name="1">
                                        <device name="box1_IPMI"/>
                                </method>
                                <method name="2">
                                        <device name="manual" nodename="box1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="box2" nodeid="2" votes="1">
                        <multicast addr="224.0.0.1" interface="eth0"/>
                        <fence>
                                <method name="1">
                                        <device name="box2_IPMI"/>
                                </method>
                                <method name="2">
                                        <device name="manual" nodename="box2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="box3" nodeid="3" votes="1">
                        <multicast addr="224.0.0.1" interface="eth0"/>
                        <fence>
                                <method name="1">
                                        <device name="box3_IPMI"/>
                                </method>
                                <method name="2">
                                        <device name="manual" nodename="box3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="box4" nodeid="4" votes="1">
                        <multicast addr="224.0.0.1" interface="eth0"/>
                        <fence>
                                <method name="1">
                                        <device name="box4_IPMI"/>
                                </method>
                                <method name="2">
                                        <device name="manual" nodename="box4"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" auth="none" ipaddr="10.0.0.2" login="power" name="box1_IPMI" passwd="powerpower"/>
                <fencedevice agent="fence_ipmilan" auth="none" ipaddr="10.0.0.3" login="power" name="box2_IPMI" passwd="powerpower"/>
                <fencedevice agent="fence_ipmilan" auth="none" ipaddr="10.0.0.4" login="power" name="box3_IPMI" passwd="powerpower"/>
                <fencedevice agent="fence_ipmilan" auth="none" ipaddr="10.0.0.5" login="power" name="box4_IPMI" passwd="powerpower"/>
                <fencedevice agent="fence_manual" name="manual"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources>
                        <ip address="192.168.50.6" monitor_link="1"/>
                </resources>
        </rm>
</cluster>





The cluster is coming up without errors. But when I shutdown the network on one node nothing happens. The only message I get is 

dlm: closing connection to node 4 

On node 2 and 3.


Any suggestions?
Is there a timeout I have to wait for?

Why I can not ping my cluster address 192.168.50.6????
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of brad at bradandkim.net
Sent: Montag, 13. August 2007 18:25
To: linux clustering
Subject: RE: [Linux-cluster] Add a fence device of type SUN ILOM


>
> Dear list members,
>
> If I understand well, here are the steps I have to follow to configure 
> a fence device that will use SUN ILOM interface :
>
> 1. Ensure that OpenIPMI and OpenIPMI-tools packages are installed on 
> the cluster nodes.
> 2. With system-config-cluster tool, add a new fence device of type 
> "IPMI Lan". Fill the form with the ILOM IP address, the name of the 
> administrator user and his password. Then associate the fence device 
> with the cluster node.
> 3. Repeat the step above for the SUN ILOM interface of each node.
> 4. Send the new configuration to the cluster 5. That's all and 
> everything will be handle correctly by the cluster.
>
> Am I ok?
>
> Best regards
> __________________
>
> Stephanie Lanthier
>
> Analyste de l'informatique
> Universite du Quebec a Montreal
> Service de l'informatique et des telecommunications 
> lanthier.stephanie at uqam.ca T?l?phone : 514-987-3000 poste 6106 Bureau 
> : PK-M535
>
>
>
>
>
>
> ------------------------------
>
> Message: 13
> Date: Fri, 10 Aug 2007 14:21:31 -0500 (CDT)
> From: brad at bradandkim.net
> Subject: Re: [Linux-cluster] Add a fence device of type SUN ILOM
> To: "linux clustering" <linux-cluster at redhat.com>
> Message-ID:
> 	<56032.129.237.174.144.1186773691.squirrel at webmail.bradandkim.net>
> Content-Type: text/plain;charset=iso-8859-1
>
>
>> brad at bradandkim.net wrote:
>>
>>>>brad at bradandkim.net wrote:
>>>>
>>>>
>>>>>>Quentin Arce wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>>>My machines are SUN Fire X4100. I see that we can define a 
>>>>>>>>>fence device of type HP ILO. I would like to know if I can use 
>>>>>>>>>the HP  ILO form in system-config-cluster tool to enter and use 
>>>>>>>>>a SUN ILOM as fence device?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>How does ILOM work? telnet or ssh? Is there an snmp interface to 
>>>>>>>>ILOM?
>>>>>>>>If so, there might be a way...by hacking on another agent.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>So, I'm a lurker on this list as I no longer have a cluster up...
>>>>>>> but
>>>>>>>I work on ILOM and I would love to see this work.  This isn't  
>>>>>>>official support, I'm a developer not a customer support person.  
>>>>>>>So, it's  more on my time.  If there is anything I can do... 
>>>>>>>Please let me know.
>>>>>>>Questions on this problem, regarding what ILOM can / can't do, 
>>>>>>>how  to check state of the server via ILOM, etc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>Quentin! That is very kind of you. If you help with the ILOM  
>>>>>>protocol, I'll help with the agent/script. This thread could form 
>>>>>>a document on how to write an arbitrary fence agent for use with 
>>>>>>rhcs.
>>>>>>
>>>>>>Where is documentation available? Generally, three things are 
>>>>>>needed from a baseboard management device in order to use it for 
>>>>>>fencing: 1)  A way to shut the system down, 2) a way to power the 
>>>>>>system up, and 3)  a way to check if it is up or down.
>>>>>>
>>>>>>What means can a script use to communicate with the ILOM card? Are 
>>>>>>there big delta's in the protocol between different ILOM versions?
>>>>>>
>>>>>>I look forward to hearing from you.
>>>>>>
>>>>>>-J
>>>>>>
>>>>>>
>>>>>>
>>>>>I am interested in seeing this thread play out as well since I have 
>>>>>26 SUN servers I am beginning to cluster.  My question is why use 
>>>>>SNMP over IPMI v2.0.  I can do the above three things with:
>>>>>
>>>>>/usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> chassis 
>>>>>power  off /usr/bin/ipmitool -U <user> -P <password> -H <ilom IP> 
>>>>>chassis power  on /usr/bin/ipmitool -U <user> -P <password> -H 
>>>>><ilom IP> chassis power status
>>>>>
>>>>>I don't need any MIB's for this either.  It seems to me this might 
>>>>>be  an easier solution than snmp, but I may be missing something.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>Oh make sure you are using lanplus mode for this.
>>>>
>>>>
>>>>
>>>
>>>Will do, and thanks.
>>>
>>>
>> That is a nice solution. There is a fence_ipmilan agent in the red 
>> hat cluster distibution...how are you invoking the above for fencing?  
>> To check if the rh agent works, here is the command line you would 
>> use (it installs into /sbin...):
>>
>> /sbin/fence_ipmilan -a <ilom IP> -l <user> -p <password> -P -o 
>> [off,on,reboot,status]
>>
>> There is a man page for fence_ipmilan that details some extra params.
>>
>> Well, I guess that solves the issue...if anyone would use an 
>> snmp-based ILOM agent, we could talk about how to construct 
>> that...otherwise, so much for my idea of this thread being 
>> instructions for creating arbitrary agents! ;)
>>
>> -J
>>
>
> I just tested it and it seems to work perfectly.  Sorry for bringing 
> the thread to a premature end  :)
>
> Brad Crotchett
> brad at bradandkim.net
> http://www.bradandkim.net
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

I believe that will do it.  I just tested shutting down the network interfaces on one node and fenced successfully used the ipmilan fencing agent to reboot the node.

Brad Crotchett
brad at bradandkim.net
http://www.bradandkim.net

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From bfields at fieldses.org  Tue Aug 21 13:14:06 2007
From: bfields at fieldses.org (J. Bruce Fields)
Date: Tue, 21 Aug 2007 09:14:06 -0400
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46CAC62C.7050409@jouy.inra.fr>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>
	<46C9B9EE.4080306@jouy.inra.fr> <46CAB128.9080109@artegence.com>
	<46CAB31A.3050809@jouy.inra.fr> <46CAB861.4080508@artegence.com>
	<46CAC62C.7050409@jouy.inra.fr>
Message-ID: <20070821131406.GA18079@fieldses.org>

On Tue, Aug 21, 2007 at 01:02:04PM +0200, kieran JOYEUX wrote:
> I don't have any replication systems. I don't really need it, rsync would 
> be ok. All i want is not having that NFS stale handle error... With 
> heartbeat + DRBD i have no issues about it.

Are you using rsync to copy the files in the exported filesystem, or are
you copying the filesystem image itself?

Copying at the file level doesn't work.  NFS needs to know that the
files really have the same identity--same inode number, same generation
number (same filehandle)--otherwise you will get stale filehandle
errors.

--b.



From kjoyeux at jouy.inra.fr  Tue Aug 21 13:45:22 2007
From: kjoyeux at jouy.inra.fr (kieran JOYEUX)
Date: Tue, 21 Aug 2007 15:45:22 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <20070821131406.GA18079@fieldses.org>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>	<46C9B9EE.4080306@jouy.inra.fr>
	<46CAB128.9080109@artegence.com>	<46CAB31A.3050809@jouy.inra.fr>
	<46CAB861.4080508@artegence.com>	<46CAC62C.7050409@jouy.inra.fr>
	<20070821131406.GA18079@fieldses.org>
Message-ID: <46CAEC72.2020102@jouy.inra.fr>

Hi Bruce, thanks for the reply :)

For the moment the two nfs share's content are the same. I use scp to 
copy them. It's static so i did it just once.

To resolve that NFS file handle error, i mounted the share with a static 
fsid for each client. Unfortunately, even with a FSID, i still have an 
error during the failover while copying from the NFS share to client's 
local disk...

cp: reading `/usr/local/genome/1Go.txt': Input/output error
cp: reading `/usr/local/genome/1Go.tyt': Input/output error
cp: cannot stat `/usr/local/genome/1Go.tzt': Permission denied

What should I try next?


J. Bruce Fields wrote:
> On Tue, Aug 21, 2007 at 01:02:04PM +0200, kieran JOYEUX wrote:
>   
>> I don't have any replication systems. I don't really need it, rsync would 
>> be ok. All i want is not having that NFS stale handle error... With 
>> heartbeat + DRBD i have no issues about it.
>>     
>
> Are you using rsync to copy the files in the exported filesystem, or are
> you copying the filesystem image itself?
>
> Copying at the file level doesn't work.  NFS needs to know that the
> files really have the same identity--same inode number, same generation
> number (same filehandle)--otherwise you will get stale filehandle
> errors.
>
> --b.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>   



From bfields at fieldses.org  Tue Aug 21 14:12:31 2007
From: bfields at fieldses.org (J. Bruce Fields)
Date: Tue, 21 Aug 2007 10:12:31 -0400
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46CAEC72.2020102@jouy.inra.fr>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>
	<46C9B9EE.4080306@jouy.inra.fr> <46CAB128.9080109@artegence.com>
	<46CAB31A.3050809@jouy.inra.fr> <46CAB861.4080508@artegence.com>
	<46CAC62C.7050409@jouy.inra.fr>
	<20070821131406.GA18079@fieldses.org>
	<46CAEC72.2020102@jouy.inra.fr>
Message-ID: <20070821141231.GA22500@fieldses.org>

On Tue, Aug 21, 2007 at 03:45:22PM +0200, kieran JOYEUX wrote:
> For the moment the two nfs share's content are the same. I use scp to copy 
> them. It's static so i did it just once.

So, you just did something like this?:

	scp -r /shares/someshare/ otherserver:/shares/

> To resolve that NFS file handle error, i mounted the share with a static 
> fsid for each client.

That's not enough.

> Unfortunately, even with a FSID, i still have an 
> error during the failover while copying from the NFS share to client's 
> local disk...
>
> cp: reading `/usr/local/genome/1Go.txt': Input/output error
> cp: reading `/usr/local/genome/1Go.tyt': Input/output error
> cp: cannot stat `/usr/local/genome/1Go.tzt': Permission denied

That's expected.  You can't just copy the files across with scp, or
rsync, or tar.

--b.



From kjoyeux at jouy.inra.fr  Tue Aug 21 14:18:11 2007
From: kjoyeux at jouy.inra.fr (kieran JOYEUX)
Date: Tue, 21 Aug 2007 16:18:11 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <20070821141231.GA22500@fieldses.org>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>	<46C9B9EE.4080306@jouy.inra.fr>
	<46CAB128.9080109@artegence.com>	<46CAB31A.3050809@jouy.inra.fr>
	<46CAB861.4080508@artegence.com>	<46CAC62C.7050409@jouy.inra.fr>	<20070821131406.GA18079@fieldses.org>	<46CAEC72.2020102@jouy.inra.fr>
	<20070821141231.GA22500@fieldses.org>
Message-ID: <46CAF423.1070309@jouy.inra.fr>

So what should I do to get rid of that error? Any ideas?

J. Bruce Fields wrote:
> On Tue, Aug 21, 2007 at 03:45:22PM +0200, kieran JOYEUX wrote:
>   
>> For the moment the two nfs share's content are the same. I use scp to copy 
>> them. It's static so i did it just once.
>>     
>
> So, you just did something like this?:
>
> 	scp -r /shares/someshare/ otherserver:/shares/
>
>   
>> To resolve that NFS file handle error, i mounted the share with a static 
>> fsid for each client.
>>     
>
> That's not enough.
>
>   
>> Unfortunately, even with a FSID, i still have an 
>> error during the failover while copying from the NFS share to client's 
>> local disk...
>>
>> cp: reading `/usr/local/genome/1Go.txt': Input/output error
>> cp: reading `/usr/local/genome/1Go.tyt': Input/output error
>> cp: cannot stat `/usr/local/genome/1Go.tzt': Permission denied
>>     
>
> That's expected.  You can't just copy the files across with scp, or
> rsync, or tar.
>
> --b.
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>   



From rainer at ultra-secure.de  Tue Aug 21 14:29:06 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Tue, 21 Aug 2007 16:29:06 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <20070821141231.GA22500@fieldses.org>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>	<46C9B9EE.4080306@jouy.inra.fr>
	<46CAB128.9080109@artegence.com>	<46CAB31A.3050809@jouy.inra.fr>
	<46CAB861.4080508@artegence.com>	<46CAC62C.7050409@jouy.inra.fr>	<20070821131406.GA18079@fieldses.org>	<46CAEC72.2020102@jouy.inra.fr>
	<20070821141231.GA22500@fieldses.org>
Message-ID: <46CAF6B2.5020002@ultra-secure.de>

J. Bruce Fields wrote:
> On Tue, Aug 21, 2007 at 03:45:22PM +0200, kieran JOYEUX wrote:
>   
>> For the moment the two nfs share's content are the same. I use scp to copy 
>> them. It's static so i did it just once.
>>     
>
> So, you just did something like this?:
>
> 	scp -r /shares/someshare/ otherserver:/shares/
>
>   
>> To resolve that NFS file handle error, i mounted the share with a static 
>> fsid for each client.
>>     
>
> That's not enough.
>
>   
>> Unfortunately, even with a FSID, i still have an 
>> error during the failover while copying from the NFS share to client's 
>> local disk...
>>
>> cp: reading `/usr/local/genome/1Go.txt': Input/output error
>> cp: reading `/usr/local/genome/1Go.tyt': Input/output error
>> cp: cannot stat `/usr/local/genome/1Go.tzt': Permission denied
>>     
>
> That's expected.  You can't just copy the files across with scp, or
> rsync, or tar.
>
>   


I think, if you're _that_ low on budget, moving the whole scenario to a
vmware server setup is a good idea.
Marc's  http://www.open-sharedroot.org/  Website contains some
information about how to do this (although for a shared root cluster).

Are you preparing this for production?


cheers,
Rainer





From rkenna at redhat.com  Tue Aug 21 14:39:28 2007
From: rkenna at redhat.com (Rob Kenna)
Date: Tue, 21 Aug 2007 10:39:28 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but... - RHEL
	5 Advanced Platform DOES include Clustering & GFS...
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F18@exchange2003.tasconline.com>
References: <46C9EAD0.1060107@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com><46C9F8BC.5@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com><1187644375.2714.36.camel@localhost.localdomain>	<20070820231826.D1799@xos037.xos.nl>
	<BC16657A85D83746AFF99A42587CC6BC07C77F18@exchange2003.tasconline.com>
Message-ID: <46CAF920.7040702@redhat.com>

Hi Folks -

Referring to http://www.redhat.com/rhel/server/advanced/, RHEL 5 
Advanced Platform does indeed include GFS and RHCS (Cluster Suite.) This 
is exclusive to RHEL 5, so you can not buy an RHEL 5 Advanced Platform 
subscription and run RHEL 4 with GFS. Also, the configuration is the 
same in all geographies.

Hope this helps.

- Rob

Jeremy Carroll wrote:
> Yes. In the US RHEL 5 AP Includes GFS / Cluster.
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jos Vos
> Sent: Monday, August 20, 2007 4:18 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Please correct me if I'm wrong, but...
>
> On Mon, Aug 20, 2007 at 04:12:55PM -0500, Kevin Anderson wrote:
>
>   
>> The cluster and gfs components are actually part of the RHEL5 Advanced
>> Platform product grouping, so if you purchase a subscription to that,
>> you get the cluster components, gfs and virtualization.
>>     
>
> Does that actually work that way in the US?
>
> Here in Europe at least that's not true, RHEL AP includes
> virtualization,
> but cluster and cluster storage still have to be bought seperately.
>
>   

-- 
Robert Kenna / Red Hat
Sr Product Mgr - Clustering & Storage
10 Technology Park Drive
Westford, MA 01886
o: (978) 392-2410 f: (978) 392-1001
c: (978) 771-6314 text: 9787716314 at vtext.com
rkenna at redhat.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070821/4c8ebfe1/attachment.htm>

From Jeremyc at tasconline.com  Tue Aug 21 14:48:24 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Tue, 21 Aug 2007 09:48:24 -0500
Subject: [Linux-cluster] Please correct me if I'm wrong,
	but... - RHEL5 Advanced Platform DOES include Clustering & GFS...
In-Reply-To: <46CAF920.7040702@redhat.com>
References: <46C9EAD0.1060107@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com><46C9F8BC.5@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com><1187644375.2714.36.camel@localhost.localdomain>	<20070820231826.D1799@xos037.xos.nl><BC16657A85D83746AFF99A42587CC6BC07C77F18@exchange2003.tasconline.com>
	<46CAF920.7040702@redhat.com>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F1A@exchange2003.tasconline.com>

That is correct. You do not get the channels for RHEL4 Cluster / GFS
when you buy RHEL 5 Advanced Platform.  So don't think you can buy RHEL
5 AP and use the extra licenses for RHEL 4 GFS / Cluster suite.

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rob Kenna
Sent: Tuesday, August 21, 2007 9:39 AM
To: linux clustering
Subject: Re: [Linux-cluster] Please correct me if I'm wrong, but... -
RHEL5 Advanced Platform DOES include Clustering & GFS...

 

Hi Folks -

Referring to http://www.redhat.com/rhel/server/advanced/, RHEL 5
Advanced Platform does indeed include GFS and RHCS (Cluster Suite.)
This is exclusive to RHEL 5, so you can not buy an RHEL 5 Advanced
Platform subscription and run RHEL 4 with GFS.  Also, the configuration
is the same in all geographies.  

Hope this helps.

- Rob

Jeremy Carroll wrote: 

Yes. In the US RHEL 5 AP Includes GFS / Cluster.
 
-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jos Vos
Sent: Monday, August 20, 2007 4:18 PM
To: linux clustering
Subject: Re: [Linux-cluster] Please correct me if I'm wrong, but...
 
On Mon, Aug 20, 2007 at 04:12:55PM -0500, Kevin Anderson wrote:
 
  

	The cluster and gfs components are actually part of the RHEL5
Advanced
	Platform product grouping, so if you purchase a subscription to
that,
	you get the cluster components, gfs and virtualization.
	    

 
Does that actually work that way in the US?
 
Here in Europe at least that's not true, RHEL AP includes
virtualization,
but cluster and cluster storage still have to be bought seperately.
 
  





-- 
Robert Kenna / Red Hat
Sr Product Mgr - Clustering & Storage
10 Technology Park Drive
Westford, MA 01886
o: (978) 392-2410 f: (978) 392-1001
c: (978) 771-6314 text: 9787716314 at vtext.com
rkenna at redhat.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070821/d989dbbb/attachment.htm>

From kjoyeux at jouy.inra.fr  Tue Aug 21 14:45:49 2007
From: kjoyeux at jouy.inra.fr (kieran JOYEUX)
Date: Tue, 21 Aug 2007 16:45:49 +0200
Subject: [Linux-cluster] NFS failover problem
In-Reply-To: <46CAF6B2.5020002@ultra-secure.de>
References: <OF7B2AE2FF.922CA280-ON85257339.0051AB0D-85257339.00559D1A@applera.com>	<46C9B9EE.4080306@jouy.inra.fr>	<46CAB128.9080109@artegence.com>	<46CAB31A.3050809@jouy.inra.fr>	<46CAB861.4080508@artegence.com>	<46CAC62C.7050409@jouy.inra.fr>	<20070821131406.GA18079@fieldses.org>	<46CAEC72.2020102@jouy.inra.fr>	<20070821141231.GA22500@fieldses.org>
	<46CAF6B2.5020002@ultra-secure.de>
Message-ID: <46CAFA9D.4030706@jouy.inra.fr>

Rainer Duffner wrote:
> I think, if you're _that_ low on budget, moving the whole scenario to a
> vmware server setup is a good idea.
> Marc's  http://www.open-sharedroot.org/  Website contains some
> information about how to do this (although for a shared root cluster).
>
> Are you preparing this for production?
>
>
> cheers,
> Rainer
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>   
Thanks Rainer, i'll take a loot at it.

Indeed i'm preparing it for production. We have pretty simple needs here 
that not require a huge architecture. It's only a high available NFS 
share providing read only applications. We do not require high capacity 
disks nor powerful servers.

you guys are telling me that RHCS are not able to work properly on this 
kind of architecture, it's pretty sad... I'll have to use heartbeat and 
DRBD instead. This solution is up & running really quickly. Plus, i have 
no NFS file handle error during the copy thanks to DRBD.

Am i the only one using RHCS that way? :)



From jos at xos.nl  Tue Aug 21 14:58:02 2007
From: jos at xos.nl (Jos Vos)
Date: Tue, 21 Aug 2007 16:58:02 +0200
Subject: [Linux-cluster] Please correct me if I'm wrong,
	but... - RHEL 5 Advanced Platform DOES include Clustering & GFS...
In-Reply-To: <46CAF920.7040702@redhat.com>;
	from rkenna@redhat.com on Tue, Aug 21, 2007 at 10:39:28AM -0400
References: <46C9EAD0.1060107@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA153@X2K3.mjm.com><46C9F8BC.5@noaa.gov><576CD4D20306324798BB3AE562A2C084027AA155@X2K3.mjm.com><1187644375.2714.36.camel@localhost.localdomain>
	<20070820231826.D1799@xos037.xos.nl>
	<BC16657A85D83746AFF99A42587CC6BC07C77F18@exchange2003.tasconline.com>
	<46CAF920.7040702@redhat.com>
Message-ID: <20070821165802.A7367@xos037.xos.nl>

On Tue, Aug 21, 2007 at 10:39:28AM -0400, Rob Kenna wrote:

> Referring to http://www.redhat.com/rhel/server/advanced/, RHEL 5 
> Advanced Platform does indeed include GFS and RHCS (Cluster Suite.) This 
> is exclusive to RHEL 5, so you can not buy an RHEL 5 Advanced Platform 
> subscription and run RHEL 4 with GFS. Also, the configuration is the 
> same in all geographies.

OK, thanks for the clarification, that explains the confusion here,
as my case was about extending a RHEL4 cluster, where besides RHEL AP
still subscriptions had to be bought for both cluster and GFS.

So, with RHEL5 clustering and GFS has become a lot cheaper.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From maciej.bogucki at artegence.com  Tue Aug 21 15:08:18 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Tue, 21 Aug 2007 17:08:18 +0200
Subject: [Linux-cluster] Cluster replication / mirroring with Netvault
	Replicator?
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F14@exchange2003.tasconline.com>
References: <BC16657A85D83746AFF99A42587CC6BC07C77F14@exchange2003.tasconline.com>
Message-ID: <46CAFFE2.5010801@artegence.com>

Jeremy Carroll napisa?(a):
> Overview: I currently run a RedHat Cluster of 8 JBoss web servers
> sharing data via a Global File System. My company is looking to gain
> redundancy for our RHCS GFS cluster due to problems with GFS kernel
> panics. During periods of network instability the GFS mounted volumes on
> the cluster will oops (Most likely due to DLM failures) and panic the
> majority of the machines in the cluster. This of course dissolves the
> quorum and takes the entire thing down. We need some redundancy with GFS.
> 
>  
> 
> My goal is to have a second cluster completely separate from the first
> with the GFS data replicated in an active / active configuration. We
> will have our F5 Load Balancers divide traffic to both clusters evenly.
> In researching replication software to accomplish this goal I came
> across BakBone?s NetVault Replicator. Does anybody have experience with
> real-time replication via this product? Any caveats to using this for
> bi-directional replication of two GFS volumes?

First all all, check why You get kernel panic.
Second does Your aplication need GFS? maybe NFS will be ok?
If you need block device replication You coul use DRBD [1], version 0.8
support active-active configuration, but with clustered filesystemn like
GFS or OCFS2 :)

[1] - http://www.drbd.org/

Best Regards
Maciej Bogucki



From Jeremyc at tasconline.com  Tue Aug 21 15:38:28 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Tue, 21 Aug 2007 10:38:28 -0500
Subject: [Linux-cluster] Cluster replication / mirroring with
	NetvaultReplicator?
In-Reply-To: <46CAFFE2.5010801@artegence.com>
References: <BC16657A85D83746AFF99A42587CC6BC07C77F14@exchange2003.tasconline.com>
	<46CAFFE2.5010801@artegence.com>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F1C@exchange2003.tasconline.com>

1: ) The app requires GFS.
2: ) I looked at DRDB. My feeling is that it is not a good product compared to Netvault Replicator which integrates with RHCS services and provides native failover. Also does DRDB do block level GFS replication without the need to mount both disks on the same host?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Maciej Bogucki
Sent: Tuesday, August 21, 2007 10:08 AM
To: linux clustering
Subject: Re: [Linux-cluster] Cluster replication / mirroring with NetvaultReplicator?

Jeremy Carroll napisa?(a):
> Overview: I currently run a RedHat Cluster of 8 JBoss web servers
> sharing data via a Global File System. My company is looking to gain
> redundancy for our RHCS GFS cluster due to problems with GFS kernel
> panics. During periods of network instability the GFS mounted volumes on
> the cluster will oops (Most likely due to DLM failures) and panic the
> majority of the machines in the cluster. This of course dissolves the
> quorum and takes the entire thing down. We need some redundancy with GFS.
> 
>  
> 
> My goal is to have a second cluster completely separate from the first
> with the GFS data replicated in an active / active configuration. We
> will have our F5 Load Balancers divide traffic to both clusters evenly.
> In researching replication software to accomplish this goal I came
> across BakBone's NetVault Replicator. Does anybody have experience with
> real-time replication via this product? Any caveats to using this for
> bi-directional replication of two GFS volumes?

First all all, check why You get kernel panic.
Second does Your aplication need GFS? maybe NFS will be ok?
If you need block device replication You coul use DRBD [1], version 0.8
support active-active configuration, but with clustered filesystemn like
GFS or OCFS2 :)

[1] - http://www.drbd.org/

Best Regards
Maciej Bogucki

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From dmp at mjm.com  Tue Aug 21 17:25:25 2007
From: dmp at mjm.com (Michael Patrimonio)
Date: Tue, 21 Aug 2007 13:25:25 -0400
Subject: [Linux-cluster] NFS Export Information
Message-ID: <576CD4D20306324798BB3AE562A2C084027AA162@X2K3.mjm.com>

Greetings,
 
After a test failover of NFS, I noticed that the export information for the directories that were moved to the standby server with the service are still beings listed in the output of the "exportfs" command on the primary server. They are also coming up on the standby server, where the service is currently running and the files systems (and their directories) are currently located.
 
Doesn't the failover process cleanup after itself by removing the entries it adds into the list of exported directories? If not, will the extraneous (and, basically, worthless) entries ever present a problem?
 
As information: the systems are running RHEL 4 Update 5 with RH Cluster Suite and "ext3" (not "GFS") in an active/active configuration.
 
Any information would be very much appreciated.
 
Thanks & Regards,
Michael
=====
 
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070821/f41e8565/attachment.htm>

From ulisesglez31 at yahoo.com.mx  Tue Aug 21 17:41:47 2007
From: ulisesglez31 at yahoo.com.mx (ulises gonzalez)
Date: Tue, 21 Aug 2007 12:41:47 -0500 (CDT)
Subject: [Linux-cluster] software
In-Reply-To: <20070820180151.GD26997@redhat.com>
Message-ID: <520532.79001.qm@web35203.mail.mud.yahoo.com>

Sorry, ... I want to build a high aviable cluster, It will offer mail, samba, dns and ltsp services..

Lon Hohberger <lhh at redhat.com> escribi?: On Mon, Aug 20, 2007 at 09:10:33AM -0500, ulises gonzalez wrote:
> Hello
> 
> I want to implement a 4-node cluster, but I don't know what software is the best to do that work. Which soft do you recomend me??

What kind of cluster are you trying to build - what do you want it to
do?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


       
---------------------------------

?S? un mejor asador!
Aprende todo sobre asados en:
http://mx.yahoo.com/promos/mejorasador.html
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070821/9776ca46/attachment.htm>

From carlopmart at gmail.com  Wed Aug 22 07:14:43 2007
From: carlopmart at gmail.com (carlopmart)
Date: Wed, 22 Aug 2007 09:14:43 +0200
Subject: [Linux-cluster] deploying two node cluster with rhel5
Message-ID: <46CBE263.2050800@gmail.com>

Hi all,

  I am really confused about how to install two node cluster with rhel5.

  My questions:

  - Is essential to setup a quorum disk or not?
  - How many votes do I need to configure under cluster.conf?
  - Do I need to put two_nodes param under cluster.conf like in rhel4?
  - What params do I need or do I need to know to put under cluster.conf?

  Somebody knows a good howto that explains how to deploy this??

  Many thanks.

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From sebastian.walter at fu-berlin.de  Wed Aug 22 08:06:21 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Wed, 22 Aug 2007 10:06:21 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <200708211009.38845.grimme@atix.de>
References: <46C78986.4070301@fu-berlin.de> <200708202052.42581.grimme@atix.de>
	<46CA99C0.6060502@fu-berlin.de> <200708211009.38845.grimme@atix.de>
Message-ID: <46CBEE7D.1010700@fu-berlin.de>

Hi Marc,

yesterday the same problem rised again, and I could observe the
counters. Btw, I'm using the newest version of RHCS/GFS (GFS-kernel-smp
2.6.9-72.2, GFS 6.1.14-0, rgmanager 1.9.68-1, cman-kernel-smp
2.6.9-50.2, cman 1.0.17-0). on one node, I have 8GB, on the others 4GB
RAM. The locks didnt change over time.

Thanks!

Sebastian

gfs_tool counters /global/home

                                  locks 2041
                             locks held 28
                           freeze count 0
                          incore inodes 20
                       metadata buffers 2
                        unlinked inodes 0
                              quota IDs 0
                     incore log buffers 0
                         log space used 0.10%
              meta header cache entries 1
                     glock dependencies 1
                 glocks on reclaim list 0
                              log wraps 0
                   outstanding LM calls 65
                  outstanding BIO calls 0
                       fh2dentry misses 0
                       glocks reclaimed 386
                         glock nq calls 214090
                         glock dq calls 214002
                   glock prefetch calls 148
                          lm_lock calls 364
                        lm_unlock calls 234
                           lm callbacks 593
                     address operations 0
                      dentry operations 46654
                      export operations 0
                        file operations 90629
                       inode operations 94213
                       super operations 173031
                          vm operations 0
                        block I/O reads 366
                       block I/O writes 292

ps axwwww | sort -k4 -n | tail -10
 6771 ?        S      0:00 [gfs_quotad]
 6772 ?        S      0:00 [gfs_inoded]
30527 ?        Ss     0:00 sshd: root at pts/0 
30529 pts/0    Ds+    0:00 -bash
17499 ?        Ss     1:15 /usr/sbin/gmond
 3796 ?        Sl     2:32 /usr/sbin/gmetad
 4251 ?        Sl     2:17 /opt/gridengine/bin/lx26-amd64/sge_qmaster
 4270 ?        Sl     5:33 /opt/gridengine/bin/lx26-amd64/sge_schedd
 3606 ?        Ss    14:50 /opt/rocks/bin/python /opt/rocks/bin/greceptor
 1802 ?        R    357:43 df -hP

cat /proc  /cluster/services

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 recover 4 -
[3 2 1 11 5 9 6 10 7 8]

DLM Lock Space:  "clvmd"                             7   3 recover 0 -
[3 2 1 11 5 9 6 10 7 8]

DLM Lock Space:  "Magma"                            17   5 recover 0 -
[3 2 1 11 5 9 6 10 7 8]

DLM Lock Space:  "homeneu"                          19   6 recover 0 -
[3 2 1 11 5 9 6 10 7 8]

GFS Mount Group: "homeneu"                          21   7 recover 0 -
[3 2 1 11 5 9 6 10 7 8]

User:            "usrm::manager"                    16   4 recover 0 -
[3 2 1 11 5 9 6 10 7 8]



Marc Grimme wrote:
> On Tuesday 21 August 2007 09:52:32 Sebastian Walter wrote:
>   
>> Hi,
>>
>> Marc Grimme wrote:
>>     
>>> Do you also see some messages on the console of the nodes. And the
>>> gfs_tool
>>> counters would help before that problem occures. So let it run sometimes
>>> before to see if locks increase.
>>> What kind of stress tests are you doing? I bet searching the whole
>>> filesystem. What makes me wonder is that the gfs_tool glock_purge does
>>> not work whereas it worked for me with exactly the same problems. Did you
>>> set it _AFTER_ the fs was mounted?
>>>       
> Sorry I mean after is right and before not ;-( .
> And are you using the latest version of CS/GFS?
> Do you have a lot of memory in your machines 16G or more?
>   
>> That makes me optimistic. I set it after the volume was mounted, so I
>> will give it another try setting it before mounting it. Then I will also
>> mail myself the output of the counters every 10 minuts. Let's see...
>>     
> I would be interested in the counters.
> Also add the process list in order to see if how much CPU-Time gfs_scand 
> consumes.
> i.e.
> ps axwwww | sort -k4 -n | tail -10
>
> Have fun Marc.
>   
>> ...with best thanks
>> Sebastian
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>     
>
>
>
>   



From jobot at wmdata.com  Wed Aug 22 09:07:21 2007
From: jobot at wmdata.com (=?iso-8859-1?Q?Borgstr=F6m_Jonas?=)
Date: Wed, 22 Aug 2007 11:07:21 +0200
Subject: [Linux-cluster] Node fencing problem
Message-ID: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net>

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas



From basv at sara.nl  Wed Aug 22 09:23:08 2007
From: basv at sara.nl (Bas van der Vlies)
Date: Wed, 22 Aug 2007 11:23:08 +0200
Subject: [Linux-cluster] NFS and gfs problems
Message-ID: <46CC007C.1080403@sara.nl>

gfs version 1.0.4
kernel 2.6.20.16
nfs-common/etch uptodate 1:1.0.10-6+etch.1
nfs-kernel-server/etch uptodate 1:1.0.10-6+etch.1

We have a five node fileserver cluster. The GFS file system works 
perfectly, but on our clients we have major problems. See examples below

Do other people see this problems and is there a solution for this problem?


bas at gb2-r39n16:~$ vim j
E72: Close error on swap filebas at gb2-r39n16:~$

bas at gb2-r39n16:~$ tar cvf bin.tar bin
bin/
bin/mpicc-wrapper-data.txt
bin/mpiexec
bin/setmvapich
bin/modenv
bin/jobnodes
bin/SMclient
bin/ssh
tar: bin.tar: Warning: Cannot close: Invalid argument
bas at gb2-r39n16:~$



bas at gb2-r39n16:~$ /usr/sbin/bonnie++
Writing with putc()...done
Writing intelligently...done
Rewriting...Can't read a full block, only got 3944 bytes.
Can't read a full block, only got 8040 bytes.
Can't read a full block, only got 3944 bytes.
Can't read a full block, only got 8040 bytes.
Can't read a full block, only got 3944 bytes.
Can't read a full block, only got 3944 bytes.
Can't read a full block, only got 3944 bytes.
Can't read a full block, only got 8040 bytes.
Bad seek offset
Error in seek(0)


-- 
********************************************************************
*                                                                  *
*  Bas van der Vlies                     e-mail: basv at sara.nl      *
*  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
*  Kruislaan 415                         fax:    +31 20 6683167    *
*  1098 SJ Amsterdam                                               *
*                                                                  *
********************************************************************



From maciej.bogucki at artegence.com  Wed Aug 22 09:49:54 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Wed, 22 Aug 2007 11:49:54 +0200
Subject: [Linux-cluster] Cluster replication / mirroring
	with	NetvaultReplicator?
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F1C@exchange2003.tasconline.com>
References: <BC16657A85D83746AFF99A42587CC6BC07C77F14@exchange2003.tasconline.com>	<46CAFFE2.5010801@artegence.com>
	<BC16657A85D83746AFF99A42587CC6BC07C77F1C@exchange2003.tasconline.com>
Message-ID: <46CC06C2.6070809@artegence.com>

> 2: ) I looked at DRDB. My feeling is that it is not a good product compared to Netvault Replicator which integrates with RHCS services and provides native failover. Also does DRDB do block level GFS replication without the need to mount both disks on the same host?

"mount bouth disk on the samo host"??? I don't understand this. DRBD 0.8
is block level replication(like RAID1 but via network), and you need
clestered filesystem on the of it. You see /dev/drbdX on both nodes, and
it only support two node clusters.

Best Regards
Maciej Bogucki



From rottmann at atix.de  Wed Aug 22 09:57:01 2007
From: rottmann at atix.de (Reiner Rottmann)
Date: Wed, 22 Aug 2007 11:57:01 +0200
Subject: [Linux-cluster] Kernel Oops gfs:gfs_glock_dq
Message-ID: <200708221157.05897.rottmann@atix.de>

Hi,

we experienced a kernel oops in gfs:gfs_glock_dq for the first time and we 
would like to know if there are others who are affected by this issue.

Although the node was successfully fenced by the cluster and could rejoin 
without any troubles we are curious about what happened. There are no 
suspicious log entries preceding the oops messages.

Is this perhaps a known bug and maybe already solved?


# cat /var/log/messages
...
Aug 17 17:45:00 node01.mgmt  Unable to handle kernel NULL pointer dereference 
at 0000000000000000 RIP:
Aug 17 17:45:00 node01.mgmt  <ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
Aug 17 17:45:00 node01.mgmt  PML4 9363d067 PGD d5119067 PMD 0
Aug 17 17:45:00 node01.mgmt  Oops: 0000 [1] SMP
Aug 17 17:45:00 node01.mgmt  CPU 3
Aug 17 17:45:00 node01.mgmt  Modules linked in: parport_pc lp parport 
netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds yenta_socket 
pcmcia_core ext3 j
bd button battery ac ohci_hcd hw_random scsi_transport_fc md5 ipv6 lock_dlm(U) 
dlm(U) gfs(U) lock_harness(U) cman(U) 8021q bonding(U) qla2300(U) qla2xxx_co
nf(U) qla2xxx(U) sd_mod scsi_mod dm_snapshot dm_mirror dm_mod tg3
Aug 17 17:45:00 node01.mgmt  Pid: 13218, comm: site_search_dev Not tainted 
2.6.9-42.0.3.ELsmp
Aug 17 17:45:00 node01.mgmt  RIP: 0010:[<ffffffffa0179aef>] 
<ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
Aug 17 17:45:00 node01.mgmt  RSP: 0018:0000010042b33ce8  EFLAGS: 00010203
Aug 17 17:45:00 node01.mgmt  RAX: 000001009cc5c368 RBX: 0000000000000000 RCX: 
00000100af058d50
Aug 17 17:45:00 node01.mgmt  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
000001009cc5c35c
Aug 17 17:45:00 node01.mgmt  RBP: 000001009cc5c330 R08: 0000000000000000 R09: 
0000000000000000
Aug 17 17:45:00 node01.mgmt  R10: 0000000000000000 R11: 00000100af058ce0 R12: 
000001009cc5c35c
Aug 17 17:45:01 node01.mgmt  R13: 0000010008fcc5e8 R14: ffffff00005a3000 R15: 
00000100234079a8
Aug 17 17:45:01 node01.mgmt  FS:  0000002a95561a20(0000) GS:ffffffff804e5300
(0000) knlGS:0000000000000000
Aug 17 17:45:01 node01.mgmt  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 17 17:45:01 node01.mgmt  CR2: 0000000000000000 CR3: 0000000001050000 CR4: 
00000000000006e0
Aug 17 17:45:01 node01.mgmt  Process site_search_dev (pid: 13218, threadinfo 
0000010042b32000, task 000001004886d7f0)
Aug 17 17:45:01 node01.mgmt  Stack: 0000010042b33cf8 00012c5400000000 
ffffffffa01b99c0 0000010008fcc5e8
Aug 17 17:45:01 node01.mgmt         0000010008fcc5c0 0000010008fcc5e8 
00000100dd2625e8 0000000000000000
Aug 17 17:45:01 node01.mgmt         0000000000000000 ffffffffa0179c08
Aug 17 17:45:01 node01.mgmt  Call 
Trace:<ffffffffa0179c08>{:gfs:gfs_glock_dq_uninit+9} 
<ffffffffa018ef63>{:gfs:gfs_flock+222}
Aug 17 17:45:01 node01.mgmt         
<ffffffffa01e1781>{:lock_dlm:get_resource+84} 
<ffffffffa01e23d7>{:lock_dlm:lm_dlm_punlock+642}
Aug 17 17:45:01 node01.mgmt         <ffffffff8018e5e4>{locks_remove_flock+97} 
<ffffffff8017a21d>{__fput+73}
Aug 17 17:45:01 node01.mgmt         <ffffffff80178e48>{filp_close+103} 
<ffffffff80178ed1>{sys_close+130}
Aug 17 17:45:01 node01.mgmt         <ffffffff8011026a>{system_call+126}
Aug 17 17:45:01 node01.mgmt
Aug 17 17:45:01 node01.mgmt  Code: 48 8b 02 0f 18 08 48 8d 45 38 48 39 c2 74 
0c ff 44 24 0c 49
Aug 17 17:45:01 node01.mgmt  RIP <ffffffffa0179aef>{:gfs:gfs_glock_dq+191} RSP 
<0000010042b33ce8>
Aug 17 17:45:01 node01.mgmt  CR2: 0000000000000000
...

$ uname -a
Linux node01 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:28:26 CDT 2006 x86_64 
x86_64 x86_64 GNU/Linux

$ rpm -qa
...
hp_qla2x00src-8.01.06-7
kernel-smp-2.6.9-42.0.3.EL
kernel-smp-devel-2.6.9-42.0.3.EL
ccs-1.0.7-0
cman-1.0.11-0
cman-kernel-smp-2.6.9-45.8
dlm-1.0.1-1
dlm-kernel-smp-2.6.9-44.3
GFS-6.1.6-1
GFS-kernel-smp-2.6.9-60.3
magma-1.0.6-0
magma-plugins-1.0.9-0
...


-- 
Gruss / Regards,

Dipl.-Ing. (FH) Reiner Rottmann

Phone: +49-89 452 3538-12

http://www.atix.de/
http://open-sharedroot.org/

PGP Key ID: 0xCA67C5A6
PGP Key Fingerprint = BF59FF006360B6E8D48F26B10D9F5A84CA67C5A6

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax: ? +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070822/72868c28/attachment.sig>

From basv at sara.nl  Wed Aug 22 11:55:11 2007
From: basv at sara.nl (Bas van der Vlies)
Date: Wed, 22 Aug 2007 13:55:11 +0200
Subject: [Linux-cluster] NFS and gfs problems
In-Reply-To: <46CC007C.1080403@sara.nl>
References: <46CC007C.1080403@sara.nl>
Message-ID: <46CC241F.8000101@sara.nl>

We just downgraded to versions 1.0.3 with some cvs updates kernel 2.6.17.4 
and nfs-kernel-server 1.0.6.

Now we can run vim/tar/bonnie++ and other unix utilities without any problem.

With this version we experienced problems when the gfs filesystems becomimg 
full the nfsd daemons will fail and to the load of the machine becomes 
equal to number of nfsd daemoms and the system is unresponsive.

Regards

Bas van der Vlies wrote:
> gfs version 1.0.4
> kernel 2.6.20.16
> nfs-common/etch uptodate 1:1.0.10-6+etch.1
> nfs-kernel-server/etch uptodate 1:1.0.10-6+etch.1
> 
> We have a five node fileserver cluster. The GFS file system works
> perfectly, but on our clients we have major problems. See examples below
> 
> Do other people see this problems and is there a solution for this problem?
> 
> 
> bas at gb2-r39n16:~$ vim j
> E72: Close error on swap filebas at gb2-r39n16:~$
> 
> bas at gb2-r39n16:~$ tar cvf bin.tar bin
> bin/
> bin/mpicc-wrapper-data.txt
> bin/mpiexec
> bin/setmvapich
> bin/modenv
> bin/jobnodes
> bin/SMclient
> bin/ssh
> tar: bin.tar: Warning: Cannot close: Invalid argument
> bas at gb2-r39n16:~$
> 
> 
> 
> bas at gb2-r39n16:~$ /usr/sbin/bonnie++
> Writing with putc()...done
> Writing intelligently...done
> Rewriting...Can't read a full block, only got 3944 bytes.
> Can't read a full block, only got 8040 bytes.
> Can't read a full block, only got 3944 bytes.
> Can't read a full block, only got 8040 bytes.
> Can't read a full block, only got 3944 bytes.
> Can't read a full block, only got 3944 bytes.
> Can't read a full block, only got 3944 bytes.
> Can't read a full block, only got 8040 bytes.
> Bad seek offset
> Error in seek(0)
> 
> 
> --
> ********************************************************************
> *                                                                  *
> *  Bas van der Vlies                     e-mail: basv at sara.nl      *
> *  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
> *  Kruislaan 415                         fax:    +31 20 6683167    *
> *  1098 SJ Amsterdam                                               *
> *                                                                  *
> ********************************************************************
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


-- 
********************************************************************
*                                                                  *
*  Bas van der Vlies                     e-mail: basv at sara.nl      *
*  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
*  Kruislaan 415                         fax:    +31 20 6683167    *
*  1098 SJ Amsterdam                                               *
*                                                                  *
********************************************************************



From Jeremyc at tasconline.com  Wed Aug 22 13:47:13 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Wed, 22 Aug 2007 08:47:13 -0500
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com>

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From bernard.chew at muvee.com  Wed Aug 22 14:05:07 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Wed, 22 Aug 2007 22:05:07 +0800
Subject: [Linux-cluster] Using cmirror
In-Reply-To: <8F3D9B0A-3ABE-4636-A27F-B1120DBC85F2@redhat.com>
Message-ID: <229C73600EB0E54DA818AB599482BCE901A3EF2D@shadowfax.sg.muvee.net>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jonathan Brassow
> Sent: Friday, August 10, 2007 11:01 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Using cmirror

> If you've set up a cluster and are using LVM, it will work the same
way as single machine mirroring.

> http://www.redhat.com/docs/manuals/csgfs/browse/4.5/
> SAC_Cluster_Logical_Volume_Manager/mirror_create.html

>  brassow

> On Aug 10, 2007, at 6:56 AM, Bernard Chew wrote:

> Hi,
>
> I read that cmirror provides user-level utilities for managing cluster

> mirroring but could not find much documentation on it. Can anyone 
> point me to any documentation / guide around?
>
> Regards,
> Bernard Chew
> IT Operations
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Hi all,

I have problems creating a mirrored linear logical volume in a 3 nodes
cluster (using lvcreate -L 960M -m1 -n lv1 vg1 /dev/sdd /dev/sdb
/dev/sdc), and encounter the following errors:

Error locking on node NodeA.hogwarts.com: device-mapper: reload ioctl
failed: Invalid argument
Error locking on node NodeB.hogwarts.com: device-mapper: reload ioctl
failed: Invalid argument
Error locking on node NodeC.hogwarts.com: device-mapper: reload ioctl
failed: Invalid argument
Failed to activate new LV

I am able to create a linear logical volume without mirroring using the
same 3 pvs with GFS filesystem (clustered) but not the above. Any help
is appreciated.

Thanks,
Bernard Chew



From bsd_daemon at msn.com  Wed Aug 22 13:56:29 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Wed, 22 Aug 2007 13:56:29 +0000
Subject: [Linux-cluster] Kernel Oops gfs:gfs_glock_dq
In-Reply-To: <200708221157.05897.rottmann@atix.de>
Message-ID: <BAY105-F2372BE0E8C9D177FFB15BDE3D50@phx.gbl>


hi,

did you look to manpages of gfs_tool ? I guess, you should change number of 
the glockd process. right now, it's not enough for your system.

. . .
. . .
# gfs_tool margs "oopses_ok,num_glockd=X"
# mount
. . .

the X must be a integer value (1...32).

Mehmet CELIK
Istanbul/TURKEY

>From: Reiner Rottmann <rottmann at atix.de>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: linux-cluster at redhat.com
>Subject: [Linux-cluster] Kernel Oops gfs:gfs_glock_dq
>Date: Wed, 22 Aug 2007 11:57:01 +0200
>
>Hi,
>
>we experienced a kernel oops in gfs:gfs_glock_dq for the first time and we
>would like to know if there are others who are affected by this issue.
>
>Although the node was successfully fenced by the cluster and could rejoin
>without any troubles we are curious about what happened. There are no
>suspicious log entries preceding the oops messages.
>
>Is this perhaps a known bug and maybe already solved?
>
>
># cat /var/log/messages
>...
>Aug 17 17:45:00 node01.mgmt  Unable to handle kernel NULL pointer 
>dereference
>at 0000000000000000 RIP:
>Aug 17 17:45:00 node01.mgmt  <ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
>Aug 17 17:45:00 node01.mgmt  PML4 9363d067 PGD d5119067 PMD 0
>Aug 17 17:45:00 node01.mgmt  Oops: 0000 [1] SMP
>Aug 17 17:45:00 node01.mgmt  CPU 3
>Aug 17 17:45:00 node01.mgmt  Modules linked in: parport_pc lp parport
>netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds yenta_socket
>pcmcia_core ext3 j
>bd button battery ac ohci_hcd hw_random scsi_transport_fc md5 ipv6 
>lock_dlm(U)
>dlm(U) gfs(U) lock_harness(U) cman(U) 8021q bonding(U) qla2300(U) 
>qla2xxx_co
>nf(U) qla2xxx(U) sd_mod scsi_mod dm_snapshot dm_mirror dm_mod tg3
>Aug 17 17:45:00 node01.mgmt  Pid: 13218, comm: site_search_dev Not tainted
>2.6.9-42.0.3.ELsmp
>Aug 17 17:45:00 node01.mgmt  RIP: 0010:[<ffffffffa0179aef>]
><ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
>Aug 17 17:45:00 node01.mgmt  RSP: 0018:0000010042b33ce8  EFLAGS: 00010203
>Aug 17 17:45:00 node01.mgmt  RAX: 000001009cc5c368 RBX: 0000000000000000 
>RCX:
>00000100af058d50
>Aug 17 17:45:00 node01.mgmt  RDX: 0000000000000000 RSI: 0000000000000000 
>RDI:
>000001009cc5c35c
>Aug 17 17:45:00 node01.mgmt  RBP: 000001009cc5c330 R08: 0000000000000000 
>R09:
>0000000000000000
>Aug 17 17:45:00 node01.mgmt  R10: 0000000000000000 R11: 00000100af058ce0 
>R12:
>000001009cc5c35c
>Aug 17 17:45:01 node01.mgmt  R13: 0000010008fcc5e8 R14: ffffff00005a3000 
>R15:
>00000100234079a8
>Aug 17 17:45:01 node01.mgmt  FS:  0000002a95561a20(0000) 
>GS:ffffffff804e5300
>(0000) knlGS:0000000000000000
>Aug 17 17:45:01 node01.mgmt  CS:  0010 DS: 0000 ES: 0000 CR0: 
>000000008005003b
>Aug 17 17:45:01 node01.mgmt  CR2: 0000000000000000 CR3: 0000000001050000 
>CR4:
>00000000000006e0
>Aug 17 17:45:01 node01.mgmt  Process site_search_dev (pid: 13218, 
>threadinfo
>0000010042b32000, task 000001004886d7f0)
>Aug 17 17:45:01 node01.mgmt  Stack: 0000010042b33cf8 00012c5400000000
>ffffffffa01b99c0 0000010008fcc5e8
>Aug 17 17:45:01 node01.mgmt         0000010008fcc5c0 0000010008fcc5e8
>00000100dd2625e8 0000000000000000
>Aug 17 17:45:01 node01.mgmt         0000000000000000 ffffffffa0179c08
>Aug 17 17:45:01 node01.mgmt  Call
>Trace:<ffffffffa0179c08>{:gfs:gfs_glock_dq_uninit+9}
><ffffffffa018ef63>{:gfs:gfs_flock+222}
>Aug 17 17:45:01 node01.mgmt
><ffffffffa01e1781>{:lock_dlm:get_resource+84}
><ffffffffa01e23d7>{:lock_dlm:lm_dlm_punlock+642}
>Aug 17 17:45:01 node01.mgmt         
><ffffffff8018e5e4>{locks_remove_flock+97}
><ffffffff8017a21d>{__fput+73}
>Aug 17 17:45:01 node01.mgmt         <ffffffff80178e48>{filp_close+103}
><ffffffff80178ed1>{sys_close+130}
>Aug 17 17:45:01 node01.mgmt         <ffffffff8011026a>{system_call+126}
>Aug 17 17:45:01 node01.mgmt
>Aug 17 17:45:01 node01.mgmt  Code: 48 8b 02 0f 18 08 48 8d 45 38 48 39 c2 
>74
>0c ff 44 24 0c 49
>Aug 17 17:45:01 node01.mgmt  RIP <ffffffffa0179aef>{:gfs:gfs_glock_dq+191} 
>RSP
><0000010042b33ce8>
>Aug 17 17:45:01 node01.mgmt  CR2: 0000000000000000
>...
>
>$ uname -a
>Linux node01 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:28:26 CDT 2006 x86_64
>x86_64 x86_64 GNU/Linux
>
>$ rpm -qa
>...
>hp_qla2x00src-8.01.06-7
>kernel-smp-2.6.9-42.0.3.EL
>kernel-smp-devel-2.6.9-42.0.3.EL
>ccs-1.0.7-0
>cman-1.0.11-0
>cman-kernel-smp-2.6.9-45.8
>dlm-1.0.1-1
>dlm-kernel-smp-2.6.9-44.3
>GFS-6.1.6-1
>GFS-kernel-smp-2.6.9-60.3
>magma-1.0.6-0
>magma-plugins-1.0.9-0
>...
>
>
>--
>Gruss / Regards,
>
>Dipl.-Ing. (FH) Reiner Rottmann
>
>Phone: +49-89 452 3538-12
>
>http://www.atix.de/
>http://open-sharedroot.org/
>
>PGP Key ID: 0xCA67C5A6
>PGP Key Fingerprint = BF59FF006360B6E8D48F26B10D9F5A84CA67C5A6
>
>**
>ATIX Informationstechnologie und Consulting AG
>Einsteinstr. 10
>85716 Unterschleissheim
>Deutschland/Germany
>
>Phone: +49-89 452 3538-0
>Fax:   +49-89 990 1766-0
>
>Registergericht: Amtsgericht Muenchen
>Registernummer: HRB 168930
>USt.-Id.: DE209485962
>
>Vorstand:
>Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)
>
>Vorsitzender des Aufsichtsrats:
>Dr. Martin Buss


><< attach4 >>




>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
Find a local pizza place, movie theater, and more?.then map the best route! 
http://maps.live.com/default.aspx?v=2&ss=yp.bars~yp.pizza~yp.movie%20theater&cp=42.358996~-71.056691&style=r&lvl=13&tilt=-90&dir=0&alt=-1000&scene=950607&encType=1&FORM=MGAC01



From jobot at wmdata.com  Wed Aug 22 14:22:24 2007
From: jobot at wmdata.com (=?iso-8859-1?Q?Borgstr=F6m_Jonas?=)
Date: Wed, 22 Aug 2007 16:22:24 +0200
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net>
	<BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com>
Message-ID: <CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net>

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Jeremyc at tasconline.com  Wed Aug 22 14:32:22 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Wed, 22 Aug 2007 09:32:22 -0500
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com>
	<CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com>

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm

Racreset


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jobot at wmdata.com  Wed Aug 22 14:49:04 2007
From: jobot at wmdata.com (=?iso-8859-1?Q?Borgstr=F6m_Jonas?=)
Date: Wed, 22 Aug 2007 16:49:04 +0200
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net>
	<BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com>
Message-ID: <CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net>

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" works.

The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mentioning anything about trying to fence any node. And neither did "group_tool dump fence".

And even if something would be wrong with the fence_drac configuration wouldn't fence_manual kick in instead?

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:32
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm

Racreset


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Jeremyc at tasconline.com  Wed Aug 22 14:56:53 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Wed, 22 Aug 2007 09:56:53 -0500
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com>
	<CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F2A@exchange2003.tasconline.com>

Try restarting fence cleanly. Stop the fenced service and run this command.

fence_tool join -c

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:49 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" works.

The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mentioning anything about trying to fence any node. And neither did "group_tool dump fence".

And even if something would be wrong with the fence_drac configuration wouldn't fence_manual kick in instead?

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:32
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm

Racreset


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Arne.Brieseneck at vodafone.com  Wed Aug 22 15:00:19 2007
From: Arne.Brieseneck at vodafone.com (Brieseneck, Arne, VF-Group)
Date: Wed, 22 Aug 2007 17:00:19 +0200
Subject: [Linux-cluster] Node fencing via 'ssh/shutdown -r now' possible
In-Reply-To: <CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net>
Message-ID: <E67F1468BF7A4C418D874810215A377EC2BB6C@EITO-MBX01.internal.vodafone.com>

Hi all,

Has anybody an idea if it would be possible to fence first via a
controlled shutdown and if that one failed after a while go to the
second fencing via fence_impi?

The advantage would be, that the filesystem of the fenced node have a
better chance to be intact after shutdown then power cut off.


Thanks 



From Jeremyc at tasconline.com  Wed Aug 22 15:03:14 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Wed, 22 Aug 2007 10:03:14 -0500
Subject: [Linux-cluster] Node fencing via 'ssh/shutdown -r now' possible
In-Reply-To: <E67F1468BF7A4C418D874810215A377EC2BB6C@EITO-MBX01.internal.vodafone.com>
References: <CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net>
	<E67F1468BF7A4C418D874810215A377EC2BB6C@EITO-MBX01.internal.vodafone.com>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F2B@exchange2003.tasconline.com>

GFS volumes will wait until fence reports back an exit status before
resuming activity. Running a shutdown command that could take 30+
seconds would be disastrous IMHO. It would lock up your cluster for
quite some time.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brieseneck, Arne,
VF-Group
Sent: Wednesday, August 22, 2007 10:00 AM
To: linux clustering
Subject: [Linux-cluster] Node fencing via 'ssh/shutdown -r now' possible

Hi all,

Has anybody an idea if it would be possible to fence first via a
controlled shutdown and if that one failed after a while go to the
second fencing via fence_impi?

The advantage would be, that the filesystem of the fenced node have a
better chance to be intact after shutdown then power cut off.


Thanks 

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jobot at wmdata.com  Wed Aug 22 15:30:53 2007
From: jobot at wmdata.com (=?iso-8859-1?Q?Borgstr=F6m_Jonas?=)
Date: Wed, 22 Aug 2007 17:30:53 +0200
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F2A@exchange2003.tasconline.com>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net>
	<BC16657A85D83746AFF99A42587CC6BC07C77F2A@exchange2003.tasconline.com>
Message-ID: <CABF801D13AA444988E62B7AF62C371D01F46EC1@WMRI000167.corp.wmdata.net>

Yes, that did the trick. I did the following on both nodes:

$ killall fenced
$ fenced
$ fence_tool join -c

After that I tested the same thing as earlier but this time the failed node was fensed!

/var/log/messages output:

Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] CLM CONFIGURATION CHANGE 
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] New Configuration: 
Aug 22 16:08:38 test-db2 kernel: dlm: closing connection to node 1
Aug 22 16:08:38 test-db2 fenced[28702]: test-db1.example.com not a cluster member after 0 sec post_fail_delay
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ]         r(0) ip(10.100.2.6)  
Aug 22 16:08:38 test-db2 fenced[28702]: fencing node "test-db1.example.com"


So I guess the only thing left to do now is to understand why it doesn't work without commands above. The only reason I can think of is that because I'm using fence_drac to power on/off the nodes so one node is usually started 10-20 seconds after the other and this usually results in one node being fenced by the other before the cluster starts up.

But after that everything appears to work, but I guess that somehow confuses fenced. 

Here's the "group dump fence" output after a restart of all nodes:

test-db1: http://pastebin.com/m73104841

test-db2: http://pastebin.com/ma427c89

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:57
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Try restarting fence cleanly. Stop the fenced service and run this command.

fence_tool join -c

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:49 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" works.

The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mentioning anything about trying to fence any node. And neither did "group_tool dump fence".

And even if something would be wrong with the fence_drac configuration wouldn't fence_manual kick in instead?

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:32
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm

Racreset


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Wed Aug 22 15:34:39 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 22 Aug 2007 11:34:39 -0400
Subject: [Linux-cluster] software
In-Reply-To: <520532.79001.qm@web35203.mail.mud.yahoo.com>
References: <20070820180151.GD26997@redhat.com>
	<520532.79001.qm@web35203.mail.mud.yahoo.com>
Message-ID: <20070822153439.GD4669@redhat.com>

On Tue, Aug 21, 2007 at 12:41:47PM -0500, ulises gonzalez wrote:
> Sorry, ... I want to build a high aviable cluster, It will offer mail, samba, dns and ltsp services..

The linux-cluster project has agents for dns (named) and samba in CVS.
For mail, you can use a script wrapper.

I've never tried making LTSP (server side) fail-over.  Actually, I've
never set up LTSP, which is a bit of a prerequisite.

We can work on that once you get the rest of it set up.

Good place to start:

  http://sources.redhat.com/cluster/faq.html

What distribution are you running?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Aug 22 15:37:11 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 22 Aug 2007 11:37:11 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <46C9FBE9.6090504@noaa.gov>
References: <46C9EAD0.1060107@noaa.gov> <20070820202018.GK26997@redhat.com>
	<46C9FBE9.6090504@noaa.gov>
Message-ID: <20070822153711.GE4669@redhat.com>

On Mon, Aug 20, 2007 at 04:39:05PM -0400, Randy Brown wrote:
> Right.  That's the way I understood it to be.  Using ext3 would require 
> us to have to umount and remount the file systems to the each host after 
> the failure, though, correct?  In other words, would require 
> administrator interaction.  GFS would do this automatically without 
> impacting the users.

One would hope that the failover process handles mounting and
unmounting.  You do, however, need shared storage and suitable fencing
mechanisms in place.

-- Lon


-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Aug 22 15:38:07 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 22 Aug 2007 11:38:07 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <20070822153711.GE4669@redhat.com>
References: <46C9EAD0.1060107@noaa.gov> <20070820202018.GK26997@redhat.com>
	<46C9FBE9.6090504@noaa.gov> <20070822153711.GE4669@redhat.com>
Message-ID: <20070822153807.GF4669@redhat.com>

On Wed, Aug 22, 2007 at 11:37:11AM -0400, Lon Hohberger wrote:
> On Mon, Aug 20, 2007 at 04:39:05PM -0400, Randy Brown wrote:
> > Right.  That's the way I understood it to be.  Using ext3 would require 
> > us to have to umount and remount the file systems to the each host after 
> > the failure, though, correct?  In other words, would require 
> > administrator interaction.  GFS would do this automatically without 
> > impacting the users.
> 
> One would hope that the failover process handles mounting and
> unmounting.  You do, however, need shared storage and suitable fencing
> mechanisms in place.

... that is, with ext3.

With GFS, you just mount it on both nodes and the recovery occurs in the
background.  Again, shared storage + fencing are required.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Aug 22 15:40:01 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 22 Aug 2007 11:40:01 -0400
Subject: [Linux-cluster] NFS Export Information
In-Reply-To: <576CD4D20306324798BB3AE562A2C084027AA162@X2K3.mjm.com>
References: <576CD4D20306324798BB3AE562A2C084027AA162@X2K3.mjm.com>
Message-ID: <20070822153956.GG4669@redhat.com>

On Tue, Aug 21, 2007 at 01:25:25PM -0400, Michael Patrimonio wrote:
> Greetings,
>  
> After a test failover of NFS, I noticed that the export information for the directories that were moved to the standby server with the service are still beings listed in the output of the "exportfs" command on the primary server. They are also coming up on the standby server, where the service is currently running and the files systems (and their directories) are currently located.

> Doesn't the failover process cleanup after itself by removing the entries it adds into the list of exported directories? If not, will the extraneous (and, basically, worthless) entries ever present a problem?

They're supposed to be removed.  Unless you're using /etc/exports, which
the cluster software won't manage...

> As information: the systems are running RHEL 4 Update 5 with RH Cluster Suite and "ext3" (not "GFS") in an active/active configuration.
>  

Could you show us how it's configured at this point?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Aug 22 15:42:05 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 22 Aug 2007 11:42:05 -0400
Subject: [Linux-cluster] deploying two node cluster with rhel5
In-Reply-To: <46CBE263.2050800@gmail.com>
References: <46CBE263.2050800@gmail.com>
Message-ID: <20070822154205.GH4669@redhat.com>

On Wed, Aug 22, 2007 at 09:14:43AM +0200, carlopmart wrote:
> Hi all,
> 
>  I am really confused about how to install two node cluster with rhel5.
> 
>  My questions:
> 
>  - Is essential to setup a quorum disk or not?

No

>  - How many votes do I need to configure under cluster.conf?

1

>  - Do I need to put two_nodes param under cluster.conf like in rhel4?

Yes

>  - What params do I need or do I need to know to put under cluster.conf?

* <cman expected_votes="1" two_node="1" .../>
* 1 vote per cluster node
* Requisite fence device configuration

... I think that's about it.  There's Bob Peterson's GFS/NFS cookbook
which is probably very useful:

  http://sources.redhat.com/cluster/doc/nfscookbook.pdf

There's also Red Hat's docs (available on the web) and the faq:

  http://sources.redhat.com/cluster/faq.html

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Wed Aug 22 15:48:37 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 22 Aug 2007 11:48:37 -0400
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net>
References: <BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com>
	<CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net>
Message-ID: <20070822154837.GI4669@redhat.com>

On Wed, Aug 22, 2007 at 04:49:04PM +0200, Borgstr?m Jonas wrote:
> Yes, both "fence_drac ..." and "fence_node test-db1.example.com" works.
> 
> The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mentioning anything about trying to fence any node. And neither did "group_tool dump fence".
> 
> And even if something would be wrong with the fence_drac configuration wouldn't fence_manual kick in instead?

Yes.  Also, it looks like fencing did not "quietly" complete - since
rgmanager never recovered (failed-over) the service from the dead node.

If a node dies unexpectedly, rgmanager waits until cman has finished
fencing that node before initiating a failover.  That's why it was
still reported as 'started' on the dead node in the clustat output.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From claudio.tassini at gmail.com  Wed Aug 22 16:03:14 2007
From: claudio.tassini at gmail.com (Claudio Tassini)
Date: Wed, 22 Aug 2007 18:03:14 +0200
Subject: [Linux-cluster] Remove journal from GFS
Message-ID: <39fdf1c70708220903w38d0ea03uc5ba28ceef9c5e81@mail.gmail.com>

Hi all,
i have a cluster in which two nodes share a GFS filesystem . I created the
filesystem with 3 journals - just to be sure - , and now I need to add TWO
cluster nodes and make the GFS fs visible to all the 4 nodes. When I try to
add a journal to the fs (to have 4 of them), I got this:

gfs_jadd  -j 1 -T -v /gfsfilesystem
Requested size (65536 blocks) greater than available space (2 blocks)

So i read that it's normal, but I really need to add a journal to this fs. I
was thinking about deleting one the current journals (I have 3 and I'm using
2), and then create two "half-sized" new journals, to reach the number of
4.

There is some other way I can achieve my goal? I don't think I can shrink
GFS, and I simply can't backup all data, create a new FS and then restore
(it's a 1 TB fs with production maildirs inside - you know... a VERY lot of
small, really small files, it would mean a 2-days downtime).

Any help would be really appreciated.

-- 
Claudio Tassini
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070822/438817da/attachment.htm>

From dmp at mjm.com  Wed Aug 22 16:02:47 2007
From: dmp at mjm.com (Michael Patrimonio)
Date: Wed, 22 Aug 2007 12:02:47 -0400
Subject: [Linux-cluster] NFS Export Information
References: <576CD4D20306324798BB3AE562A2C084027AA162@X2K3.mjm.com>
	<20070822153956.GG4669@redhat.com>
Message-ID: <576CD4D20306324798BB3AE562A2C084027AA172@X2K3.mjm.com>

 
Can I respond to your e-mail, exclusively?
 
=====
 
 
 

________________________________

From: linux-cluster-bounces at redhat.com on behalf of Lon Hohberger
Sent: Wed 8/22/2007 11:40
To: linux clustering
Subject: Re: [Linux-cluster] NFS Export Information



On Tue, Aug 21, 2007 at 01:25:25PM -0400, Michael Patrimonio wrote:
> Greetings,
> 
> After a test failover of NFS, I noticed that the export information for the directories that were moved to the standby server with the service are still beings listed in the output of the "exportfs" command on the primary server. They are also coming up on the standby server, where the service is currently running and the files systems (and their directories) are currently located.

> Doesn't the failover process cleanup after itself by removing the entries it adds into the list of exported directories? If not, will the extraneous (and, basically, worthless) entries ever present a problem?

They're supposed to be removed.  Unless you're using /etc/exports, which
the cluster software won't manage...

> As information: the systems are running RHEL 4 Update 5 with RH Cluster Suite and "ext3" (not "GFS") in an active/active configuration.
> 

Could you show us how it's configured at this point?

--
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster


-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 4447 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070822/6ebda3ef/attachment.bin>

From Jeremyc at tasconline.com  Wed Aug 22 16:14:07 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Wed, 22 Aug 2007 11:14:07 -0500
Subject: [Linux-cluster] Remove journal from GFS
In-Reply-To: <39fdf1c70708220903w38d0ea03uc5ba28ceef9c5e81@mail.gmail.com>
References: <39fdf1c70708220903w38d0ea03uc5ba28ceef9c5e81@mail.gmail.com>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F2F@exchange2003.tasconline.com>

Adding an additional journal takes up minimum of 32M of space per
journal. It looks like you have no free space on the GFS to add the
journal. 

 

You will have to grow your GFS volume by at least 32M to add the
additional Journal. Then run this command.

 

Gfs_jadd -T -v -J 32 /gfsfilesystem

 

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Claudio Tassini
Sent: Wednesday, August 22, 2007 11:03 AM
To: linux clustering
Subject: [Linux-cluster] Remove journal from GFS

 

Hi all,

 

i have a cluster in which two nodes share a GFS filesystem . I created
the filesystem with 3 journals - just to be sure - , and now I need to
add TWO cluster nodes and make the GFS fs visible to all the 4 nodes.
When I try to add a journal to the fs (to have 4 of them), I got this: 

 

gfs_jadd  -j 1 -T -v /gfsfilesystem

Requested size (65536 blocks) greater than available space (2 blocks)

 

So i read that it's normal, but I really need to add a journal to this
fs. I was thinking about deleting one the current journals (I have 3 and
I'm using 2), and then create two "half-sized" new journals, to reach
the number of 4.  

 

There is some other way I can achieve my goal? I don't think I can
shrink GFS, and I simply can't backup all data, create a new FS and then
restore (it's a 1 TB fs with production maildirs inside - you know... a
VERY lot of small, really small files, it would mean a 2-days downtime).


 

Any help would be really appreciated.


-- 
Claudio Tassini 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070822/8a699a01/attachment.htm>

From lhh at redhat.com  Wed Aug 22 16:28:36 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 22 Aug 2007 12:28:36 -0400
Subject: [Linux-cluster] Node fencing via 'ssh/shutdown -r now' possible
In-Reply-To: <E67F1468BF7A4C418D874810215A377EC2BB6C@EITO-MBX01.internal.vodafone.com>
References: <CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net>
	<E67F1468BF7A4C418D874810215A377EC2BB6C@EITO-MBX01.internal.vodafone.com>
Message-ID: <20070822162835.GJ4669@redhat.com>

On Wed, Aug 22, 2007 at 05:00:19PM +0200, Brieseneck, Arne, VF-Group wrote:
> Hi all,
> 
> Has anybody an idea if it would be possible to fence first via a
> controlled shutdown second fencing via fence_impi?


> The advantage would be, that the filesystem of the fenced node have a
> better chance to be intact after shutdown then power cut off.

The disadvantages are:

(a) it increases cluster recovery time by *minutes* at the very least
(how long do you wait before cutting the power?)

(b) it might hang during shutdown, seemingly responding to the shut-down
request but never actually reaching 'off' state...

(c) the node - if it's actually having a problem - could still write to
the disk while we're trying to fence it.  Some could argue that this
data might be bad (particularly if the node's problem is a bad memory
stick), and lessening the window for this is a good thing.

There are also some IPMI implementation-specific problems (bugs?)
preventing this from being possible.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From claudio.tassini at gmail.com  Wed Aug 22 16:50:50 2007
From: claudio.tassini at gmail.com (Claudio Tassini)
Date: Wed, 22 Aug 2007 18:50:50 +0200
Subject: [Linux-cluster] Remove journal from GFS
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F2F@exchange2003.tasconline.com>
References: <39fdf1c70708220903w38d0ea03uc5ba28ceef9c5e81@mail.gmail.com>
	<BC16657A85D83746AFF99A42587CC6BC07C77F2F@exchange2003.tasconline.com>
Message-ID: <39fdf1c70708220950l9e126can13b8909699a59e71@mail.gmail.com>

I know that, but I can't grow the FS because I don't have hardware space to
do this, so I'm trying to find an alternate way...

2007/8/22, Jeremy Carroll <Jeremyc at tasconline.com>:
>
>  Adding an additional journal takes up minimum of 32M of space per
> journal. It looks like you have no free space on the GFS to add the journal.
>
>
>
>
> You will have to grow your GFS volume by at least 32M to add the
> additional Journal. Then run this command.
>
>
>
> Gfs_jadd ?T ?v ?J 32 /gfsfilesystem
>
>
>  ------------------------------
>
> *From:* linux-cluster-bounces at redhat.com [mailto:
> linux-cluster-bounces at redhat <linux-cluster-bounces at redhat.com>.com<linux-cluster-bounces at redhat.com>]
> *On Behalf Of *Claudio Tassini
> *Sent:* Wednesday, August 22, 2007 11:03 AM
> *To:* linux clustering
> *Subject:* [Linux-cluster] Remove journal from GFS
>
>
>
> Hi all,
>
>
>
> i have a cluster in which two nodes share a GFS filesystem . I created the
> filesystem with 3 journals - just to be sure - , and now I need to add TWO
> cluster nodes and make the GFS fs visible to all the 4 nodes. When I try to
> add a journal to the fs (to have 4 of them), I got this:
>
>
>
> gfs_jadd  -j 1 -T -v /gfsfilesystem
>
> Requested size (65536 blocks) greater than available space (2 blocks)
>
>
>
> So i read that it's normal, but I really need to add a journal to this fs.
> I was thinking about deleting one the current journals (I have 3 and I'm
> using 2), and then create two "half-sized" new journals, to reach the number
> of 4.
>
>
>
> There is some other way I can achieve my goal? I don't think I can shrink
> GFS, and I simply can't backup all data, create a new FS and then restore
> (it's a 1 TB fs with production maildirs inside - you know... a VERY lot of
> small, really small files, it would mean a 2-days downtime).
>
>
>
> Any help would be really appreciated.
>
>
> --
> Claudio Tassini
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Claudio Tassini
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070822/7528f778/attachment.htm>

From Randy.Brown at noaa.gov  Wed Aug 22 18:18:00 2007
From: Randy.Brown at noaa.gov (Randy Brown)
Date: Wed, 22 Aug 2007 14:18:00 -0400
Subject: [Linux-cluster] Please correct me if I'm wrong, but...
In-Reply-To: <20070822153807.GF4669@redhat.com>
References: <46C9EAD0.1060107@noaa.gov> <20070820202018.GK26997@redhat.com>
	<46C9FBE9.6090504@noaa.gov> <20070822153711.GE4669@redhat.com>
	<20070822153807.GF4669@redhat.com>
Message-ID: <46CC7DD8.2040608@noaa.gov>

We have ISCSI based storage and are using GFS during a demo period, but 
I don't think we can afford it after the demo period is over.

The fencing is another issue for us.  I need to learn more about that.  
I'm not sure I'm fully grasping all I need to there.  I'm still trying 
to get nfs file systems mounted on a remote host. :)

Thanks,
Randy


Lon Hohberger wrote:
> On Wed, Aug 22, 2007 at 11:37:11AM -0400, Lon Hohberger wrote:
>   
>> On Mon, Aug 20, 2007 at 04:39:05PM -0400, Randy Brown wrote:
>>     
>>> Right.  That's the way I understood it to be.  Using ext3 would require 
>>> us to have to umount and remount the file systems to the each host after 
>>> the failure, though, correct?  In other words, would require 
>>> administrator interaction.  GFS would do this automatically without 
>>> impacting the users.
>>>       
>> One would hope that the failover process handles mounting and
>> unmounting.  You do, however, need shared storage and suitable fencing
>> mechanisms in place.
>>     
>
> ... that is, with ext3.
>
> With GFS, you just mount it on both nodes and the recovery occurs in the
> background.  Again, shared storage + fencing are required.
>
>   
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070822/c0197873/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: randy.brown.vcf
Type: text/x-vcard
Size: 348 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070822/c0197873/attachment.vcf>

From hlawatschek at atix.de  Thu Aug 23 06:56:27 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Thu, 23 Aug 2007 08:56:27 +0200
Subject: [Linux-cluster] Kernel Oops gfs:gfs_glock_dq
In-Reply-To: <BAY105-F2372BE0E8C9D177FFB15BDE3D50@phx.gbl>
References: <BAY105-F2372BE0E8C9D177FFB15BDE3D50@phx.gbl>
Message-ID: <200708230856.27454.hlawatschek@atix.de>

Hi Mehmet,

how do you explain the kernel dump with the number of gfs_glockd ? By default 
there is one per gfs mount. 
Is there a bugzilla that adresses this problem ?

In my opinion it looks like the panic occurs during a flock release. 

Mark

On Wednesday 22 August 2007 15:56:29 mehmet celik wrote:
> hi,
>
> did you look to manpages of gfs_tool ? I guess, you should change number of
> the glockd process. right now, it's not enough for your system.
>
> . . .
> . . .
> # gfs_tool margs "oopses_ok,num_glockd=X"
> # mount
> . . .
>
> the X must be a integer value (1...32).
>
> Mehmet CELIK
> Istanbul/TURKEY
>
> >From: Reiner Rottmann <rottmann at atix.de>
> >Reply-To: linux clustering <linux-cluster at redhat.com>
> >To: linux-cluster at redhat.com
> >Subject: [Linux-cluster] Kernel Oops gfs:gfs_glock_dq
> >Date: Wed, 22 Aug 2007 11:57:01 +0200
> >
> >Hi,
> >
> >we experienced a kernel oops in gfs:gfs_glock_dq for the first time and we
> >would like to know if there are others who are affected by this issue.
> >
> >Although the node was successfully fenced by the cluster and could rejoin
> >without any troubles we are curious about what happened. There are no
> >suspicious log entries preceding the oops messages.
> >
> >Is this perhaps a known bug and maybe already solved?
> >
> >
> ># cat /var/log/messages
> >...
> >Aug 17 17:45:00 node01.mgmt  Unable to handle kernel NULL pointer
> >dereference
> >at 0000000000000000 RIP:
> >Aug 17 17:45:00 node01.mgmt  <ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
> >Aug 17 17:45:00 node01.mgmt  PML4 9363d067 PGD d5119067 PMD 0
> >Aug 17 17:45:00 node01.mgmt  Oops: 0000 [1] SMP
> >Aug 17 17:45:00 node01.mgmt  CPU 3
> >Aug 17 17:45:00 node01.mgmt  Modules linked in: parport_pc lp parport
> >netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds yenta_socket
> >pcmcia_core ext3 j
> >bd button battery ac ohci_hcd hw_random scsi_transport_fc md5 ipv6
> >lock_dlm(U)
> >dlm(U) gfs(U) lock_harness(U) cman(U) 8021q bonding(U) qla2300(U)
> >qla2xxx_co
> >nf(U) qla2xxx(U) sd_mod scsi_mod dm_snapshot dm_mirror dm_mod tg3
> >Aug 17 17:45:00 node01.mgmt  Pid: 13218, comm: site_search_dev Not tainted
> >2.6.9-42.0.3.ELsmp
> >Aug 17 17:45:00 node01.mgmt  RIP: 0010:[<ffffffffa0179aef>]
> ><ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
> >Aug 17 17:45:00 node01.mgmt  RSP: 0018:0000010042b33ce8  EFLAGS: 00010203
> >Aug 17 17:45:00 node01.mgmt  RAX: 000001009cc5c368 RBX: 0000000000000000
> >RCX:
> >00000100af058d50
> >Aug 17 17:45:00 node01.mgmt  RDX: 0000000000000000 RSI: 0000000000000000
> >RDI:
> >000001009cc5c35c
> >Aug 17 17:45:00 node01.mgmt  RBP: 000001009cc5c330 R08: 0000000000000000
> >R09:
> >0000000000000000
> >Aug 17 17:45:00 node01.mgmt  R10: 0000000000000000 R11: 00000100af058ce0
> >R12:
> >000001009cc5c35c
> >Aug 17 17:45:01 node01.mgmt  R13: 0000010008fcc5e8 R14: ffffff00005a3000
> >R15:
> >00000100234079a8
> >Aug 17 17:45:01 node01.mgmt  FS:  0000002a95561a20(0000)
> >GS:ffffffff804e5300
> >(0000) knlGS:0000000000000000
> >Aug 17 17:45:01 node01.mgmt  CS:  0010 DS: 0000 ES: 0000 CR0:
> >000000008005003b
> >Aug 17 17:45:01 node01.mgmt  CR2: 0000000000000000 CR3: 0000000001050000
> >CR4:
> >00000000000006e0
> >Aug 17 17:45:01 node01.mgmt  Process site_search_dev (pid: 13218,
> >threadinfo
> >0000010042b32000, task 000001004886d7f0)
> >Aug 17 17:45:01 node01.mgmt  Stack: 0000010042b33cf8 00012c5400000000
> >ffffffffa01b99c0 0000010008fcc5e8
> >Aug 17 17:45:01 node01.mgmt         0000010008fcc5c0 0000010008fcc5e8
> >00000100dd2625e8 0000000000000000
> >Aug 17 17:45:01 node01.mgmt         0000000000000000 ffffffffa0179c08
> >Aug 17 17:45:01 node01.mgmt  Call
> >Trace:<ffffffffa0179c08>{:gfs:gfs_glock_dq_uninit+9}
> ><ffffffffa018ef63>{:gfs:gfs_flock+222}
> >Aug 17 17:45:01 node01.mgmt
> ><ffffffffa01e1781>{:lock_dlm:get_resource+84}
> ><ffffffffa01e23d7>{:lock_dlm:lm_dlm_punlock+642}
> >Aug 17 17:45:01 node01.mgmt
> ><ffffffff8018e5e4>{locks_remove_flock+97}
> ><ffffffff8017a21d>{__fput+73}
> >Aug 17 17:45:01 node01.mgmt         <ffffffff80178e48>{filp_close+103}
> ><ffffffff80178ed1>{sys_close+130}
> >Aug 17 17:45:01 node01.mgmt         <ffffffff8011026a>{system_call+126}
> >Aug 17 17:45:01 node01.mgmt
> >Aug 17 17:45:01 node01.mgmt  Code: 48 8b 02 0f 18 08 48 8d 45 38 48 39 c2
> >74
> >0c ff 44 24 0c 49
> >Aug 17 17:45:01 node01.mgmt  RIP <ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
> >RSP
> ><0000010042b33ce8>
> >Aug 17 17:45:01 node01.mgmt  CR2: 0000000000000000
> >...
> >
> >$ uname -a
> >Linux node01 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:28:26 CDT 2006 x86_64
> >x86_64 x86_64 GNU/Linux
> >
> >$ rpm -qa
> >...
> >hp_qla2x00src-8.01.06-7
> >kernel-smp-2.6.9-42.0.3.EL
> >kernel-smp-devel-2.6.9-42.0.3.EL
> >ccs-1.0.7-0
> >cman-1.0.11-0
> >cman-kernel-smp-2.6.9-45.8
> >dlm-1.0.1-1
> >dlm-kernel-smp-2.6.9-44.3
> >GFS-6.1.6-1
> >GFS-kernel-smp-2.6.9-60.3
> >magma-1.0.6-0
> >magma-plugins-1.0.9-0
> >...
> >
> >
> >--
> >Gruss / Regards,
> >
> >Dipl.-Ing. (FH) Reiner Rottmann
> >
> >Phone: +49-89 452 3538-12
> >
> >http://www.atix.de/
> >http://open-sharedroot.org/
> >
> >PGP Key ID: 0xCA67C5A6
> >PGP Key Fingerprint = BF59FF006360B6E8D48F26B10D9F5A84CA67C5A6
> >
> >**
> >ATIX Informationstechnologie und Consulting AG
> >Einsteinstr. 10
> >85716 Unterschleissheim
> >Deutschland/Germany
> >
> >Phone: +49-89 452 3538-0
> >Fax:   +49-89 990 1766-0
> >
> >Registergericht: Amtsgericht Muenchen
> >Registernummer: HRB 168930
> >USt.-Id.: DE209485962
> >
> >Vorstand:
> >Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)
> >
> >Vorsitzender des Aufsichtsrats:
> >Dr. Martin Buss
> >
> >
> ><< attach4 >>
> >
> >
> >
> >
> >--
> >Linux-cluster mailing list
> >Linux-cluster at redhat.com
> >https://www.redhat.com/mailman/listinfo/linux-cluster
>
> _________________________________________________________________
> Find a local pizza place, movie theater, and more?.then map the best
> route!
> http://maps.live.com/default.aspx?v=2&ss=yp.bars~yp.pizza~yp.movie%20theate
>r&cp=42.358996~-71.056691&style=r&lvl=13&tilt=-90&dir=0&alt=-1000&scene=9506
>07&encType=1&FORM=MGAC01
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Dipl.-Ing. Mark Hlawatschek
Phone: +49-89 452 3538-15
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss




From jobot at wmdata.com  Thu Aug 23 07:40:04 2007
From: jobot at wmdata.com (=?iso-8859-1?Q?Borgstr=F6m_Jonas?=)
Date: Thu, 23 Aug 2007 09:40:04 +0200
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <CABF801D13AA444988E62B7AF62C371D01F46EC1@WMRI000167.corp.wmdata.net>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F2A@exchange2003.tasconline.com>
	<CABF801D13AA444988E62B7AF62C371D01F46EC1@WMRI000167.corp.wmdata.net>
Message-ID: <CABF801D13AA444988E62B7AF62C371D01F46FE5@WMRI000167.corp.wmdata.net>

Ok, I've done some more testing now and manully running "killall fenced; fenced; fence-tool join -c" after each reboot on every node seems to be the only way to get fenced in a working state. But that's obviously not a good solution.

I tried adding clean_start="" to cluster.conf but that didn't help.

Is this perhaps a fenced bug? And if so, how can I help debugging it or work around it?

Regards,
Jonas

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: den 22 augusti 2007 17:31
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, that did the trick. I did the following on both nodes:

$ killall fenced
$ fenced
$ fence_tool join -c

After that I tested the same thing as earlier but this time the failed node was fensed!

/var/log/messages output:

Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] CLM CONFIGURATION CHANGE 
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] New Configuration: 
Aug 22 16:08:38 test-db2 kernel: dlm: closing connection to node 1
Aug 22 16:08:38 test-db2 fenced[28702]: test-db1.example.com not a cluster member after 0 sec post_fail_delay
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ]         r(0) ip(10.100.2.6)  
Aug 22 16:08:38 test-db2 fenced[28702]: fencing node "test-db1.example.com"


So I guess the only thing left to do now is to understand why it doesn't work without commands above. The only reason I can think of is that because I'm using fence_drac to power on/off the nodes so one node is usually started 10-20 seconds after the other and this usually results in one node being fenced by the other before the cluster starts up.

But after that everything appears to work, but I guess that somehow confuses fenced. 

Here's the "group dump fence" output after a restart of all nodes:

test-db1: http://pastebin.com/m73104841

test-db2: http://pastebin.com/ma427c89

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:57
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Try restarting fence cleanly. Stop the fenced service and run this command.

fence_tool join -c

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:49 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" works.

The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mentioning anything about trying to fence any node. And neither did "group_tool dump fence".

And even if something would be wrong with the fence_drac configuration wouldn't fence_manual kick in instead?

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:32
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm

Racreset


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Jeremyc at tasconline.com  Thu Aug 23 13:18:39 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Thu, 23 Aug 2007 08:18:39 -0500
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <CABF801D13AA444988E62B7AF62C371D01F46FE5@WMRI000167.corp.wmdata.net>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F2A@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EC1@WMRI000167.corp.wmdata.net>
	<CABF801D13AA444988E62B7AF62C371D01F46FE5@WMRI000167.corp.wmdata.net>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F36@exchange2003.tasconline.com>


Can you post your fence configuration from cluster.conf?

Example: "<fence_daemon post_fail_delay="0" post_join_delay="120"/>"

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Thursday, August 23, 2007 2:40 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Ok, I've done some more testing now and manully running "killall fenced; fenced; fence-tool join -c" after each reboot on every node seems to be the only way to get fenced in a working state. But that's obviously not a good solution.

I tried adding clean_start="" to cluster.conf but that didn't help.

Is this perhaps a fenced bug? And if so, how can I help debugging it or work around it?

Regards,
Jonas

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: den 22 augusti 2007 17:31
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, that did the trick. I did the following on both nodes:

$ killall fenced
$ fenced
$ fence_tool join -c

After that I tested the same thing as earlier but this time the failed node was fensed!

/var/log/messages output:

Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] CLM CONFIGURATION CHANGE 
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] New Configuration: 
Aug 22 16:08:38 test-db2 kernel: dlm: closing connection to node 1
Aug 22 16:08:38 test-db2 fenced[28702]: test-db1.example.com not a cluster member after 0 sec post_fail_delay
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ]         r(0) ip(10.100.2.6)  
Aug 22 16:08:38 test-db2 fenced[28702]: fencing node "test-db1.example.com"


So I guess the only thing left to do now is to understand why it doesn't work without commands above. The only reason I can think of is that because I'm using fence_drac to power on/off the nodes so one node is usually started 10-20 seconds after the other and this usually results in one node being fenced by the other before the cluster starts up.

But after that everything appears to work, but I guess that somehow confuses fenced. 

Here's the "group dump fence" output after a restart of all nodes:

test-db1: http://pastebin.com/m73104841

test-db2: http://pastebin.com/ma427c89

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:57
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Try restarting fence cleanly. Stop the fenced service and run this command.

fence_tool join -c

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:49 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" works.

The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mentioning anything about trying to fence any node. And neither did "group_tool dump fence".

And even if something would be wrong with the fence_drac configuration wouldn't fence_manual kick in instead?

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:32
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm

Racreset


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jobot at wmdata.com  Thu Aug 23 13:24:24 2007
From: jobot at wmdata.com (=?iso-8859-1?Q?Borgstr=F6m_Jonas?=)
Date: Thu, 23 Aug 2007 15:24:24 +0200
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F36@exchange2003.tasconline.com>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F2A@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EC1@WMRI000167.corp.wmdata.net><CABF801D13AA444988E62B7AF62C371D01F46FE5@WMRI000167.corp.wmdata.net>
	<BC16657A85D83746AFF99A42587CC6BC07C77F36@exchange2003.tasconline.com>
Message-ID: <CABF801D13AA444988E62B7AF62C371D01F47257@WMRI000167.corp.wmdata.net>

Here the cluster.conf I'm currently using:

http://pastebin.com/m35cd58d0

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 23 augusti 2007 15:19
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem


Can you post your fence configuration from cluster.conf?

Example: "<fence_daemon post_fail_delay="0" post_join_delay="120"/>"

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Thursday, August 23, 2007 2:40 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Ok, I've done some more testing now and manully running "killall fenced; fenced; fence-tool join -c" after each reboot on every node seems to be the only way to get fenced in a working state. But that's obviously not a good solution.

I tried adding clean_start="" to cluster.conf but that didn't help.

Is this perhaps a fenced bug? And if so, how can I help debugging it or work around it?

Regards,
Jonas

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: den 22 augusti 2007 17:31
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, that did the trick. I did the following on both nodes:

$ killall fenced
$ fenced
$ fence_tool join -c

After that I tested the same thing as earlier but this time the failed node was fensed!

/var/log/messages output:

Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] CLM CONFIGURATION CHANGE 
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] New Configuration: 
Aug 22 16:08:38 test-db2 kernel: dlm: closing connection to node 1
Aug 22 16:08:38 test-db2 fenced[28702]: test-db1.example.com not a cluster member after 0 sec post_fail_delay
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ]         r(0) ip(10.100.2.6)  
Aug 22 16:08:38 test-db2 fenced[28702]: fencing node "test-db1.example.com"


So I guess the only thing left to do now is to understand why it doesn't work without commands above. The only reason I can think of is that because I'm using fence_drac to power on/off the nodes so one node is usually started 10-20 seconds after the other and this usually results in one node being fenced by the other before the cluster starts up.

But after that everything appears to work, but I guess that somehow confuses fenced. 

Here's the "group dump fence" output after a restart of all nodes:

test-db1: http://pastebin.com/m73104841

test-db2: http://pastebin.com/ma427c89

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:57
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Try restarting fence cleanly. Stop the fenced service and run this command.

fence_tool join -c

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:49 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" works.

The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mentioning anything about trying to fence any node. And neither did "group_tool dump fence".

And even if something would be wrong with the fence_drac configuration wouldn't fence_manual kick in instead?

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:32
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm

Racreset


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Jeremyc at tasconline.com  Thu Aug 23 13:29:58 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Thu, 23 Aug 2007 08:29:58 -0500
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <CABF801D13AA444988E62B7AF62C371D01F47257@WMRI000167.corp.wmdata.net>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F2A@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EC1@WMRI000167.corp.wmdata.net><CABF801D13AA444988E62B7AF62C371D01F46FE5@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F36@exchange2003.tasconline.com>
	<CABF801D13AA444988E62B7AF62C371D01F47257@WMRI000167.corp.wmdata.net>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F38@exchange2003.tasconline.com>

I could remove "clean_start" entirely. It's default is zero. I would also increase your post_join_delay to two minutes. That will give fenced time to see all the cluster members on a fresh restart. 

FYI: If you are doing a total reboot of the cluster make sure all of your nodes come up at around the same time. 

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Thursday, August 23, 2007 8:24 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Here the cluster.conf I'm currently using:

http://pastebin.com/m35cd58d0

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 23 augusti 2007 15:19
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem


Can you post your fence configuration from cluster.conf?

Example: "<fence_daemon post_fail_delay="0" post_join_delay="120"/>"

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Thursday, August 23, 2007 2:40 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Ok, I've done some more testing now and manully running "killall fenced; fenced; fence-tool join -c" after each reboot on every node seems to be the only way to get fenced in a working state. But that's obviously not a good solution.

I tried adding clean_start="" to cluster.conf but that didn't help.

Is this perhaps a fenced bug? And if so, how can I help debugging it or work around it?

Regards,
Jonas

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: den 22 augusti 2007 17:31
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, that did the trick. I did the following on both nodes:

$ killall fenced
$ fenced
$ fence_tool join -c

After that I tested the same thing as earlier but this time the failed node was fensed!

/var/log/messages output:

Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] CLM CONFIGURATION CHANGE 
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] New Configuration: 
Aug 22 16:08:38 test-db2 kernel: dlm: closing connection to node 1
Aug 22 16:08:38 test-db2 fenced[28702]: test-db1.example.com not a cluster member after 0 sec post_fail_delay
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ]         r(0) ip(10.100.2.6)  
Aug 22 16:08:38 test-db2 fenced[28702]: fencing node "test-db1.example.com"


So I guess the only thing left to do now is to understand why it doesn't work without commands above. The only reason I can think of is that because I'm using fence_drac to power on/off the nodes so one node is usually started 10-20 seconds after the other and this usually results in one node being fenced by the other before the cluster starts up.

But after that everything appears to work, but I guess that somehow confuses fenced. 

Here's the "group dump fence" output after a restart of all nodes:

test-db1: http://pastebin.com/m73104841

test-db2: http://pastebin.com/ma427c89

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:57
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Try restarting fence cleanly. Stop the fenced service and run this command.

fence_tool join -c

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:49 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" works.

The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mentioning anything about trying to fence any node. And neither did "group_tool dump fence".

And even if something would be wrong with the fence_drac configuration wouldn't fence_manual kick in instead?

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:32
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm

Racreset


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From JESmall at directv.com  Thu Aug 23 14:42:53 2007
From: JESmall at directv.com (Small, Jon E)
Date: Thu, 23 Aug 2007 07:42:53 -0700
Subject: [Linux-cluster] Problem with latest fence_apc
Message-ID: <40CEA9CB6E0E8444B59DD4F4E848F71061B5E8@CORVUS.LA.FRD.DIRECTV.COM>

I installed the latest RPM (fence-1.32.45-1.0.2.i686.rpm) that includes
the fence_apc fix for APC firmware 3.x and it still does not work in our
environment. The Status action works, but Off & On do not work.

 

Here is the environment information:

-          RedHat 4 Update #5

-          APC model #AP7968

-          APC Firmware 3.3.3

 

Attached is the copy of /tmp/apclog from an execution of an Off action:

./fence_apc -a 10.13.2.146 -l apc -o Off -n 3 -p apc -v

 

 

Incidentally, I attempted to use the fence_apc_snmp as well, but it
fails with the following statement:

 

./fence_apc_snmp -a 10.13.2.146 -c private -n 3 -o Status -v

called with {'udpport': '161', 'option': 'status', 'ipaddr':
'10.13.2.146', 'community': 'private', 'switch': '', 'port': '3'}

invalid status outletStatusOff

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070823/4e68699f/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: apclog
Type: application/octet-stream
Size: 61947 bytes
Desc: apclog
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070823/4e68699f/attachment.obj>

From jobot at wmdata.com  Thu Aug 23 14:46:37 2007
From: jobot at wmdata.com (=?iso-8859-1?Q?Borgstr=F6m_Jonas?=)
Date: Thu, 23 Aug 2007 16:46:37 +0200
Subject: [Linux-cluster] Node fencing problem
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F38@exchange2003.tasconline.com>
References: <CABF801D13AA444988E62B7AF62C371D01EEB6C1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F26@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46E80@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F29@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EA1@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F2A@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F46EC1@WMRI000167.corp.wmdata.net><CABF801D13AA444988E62B7AF62C371D01F46FE5@WMRI000167.corp.wmdata.net><BC16657A85D83746AFF99A42587CC6BC07C77F36@exchange2003.tasconline.com><CABF801D13AA444988E62B7AF62C371D01F47257@WMRI000167.corp.wmdata.net>
	<BC16657A85D83746AFF99A42587CC6BC07C77F38@exchange2003.tasconline.com>
Message-ID: <CABF801D13AA444988E62B7AF62C371D01F472C5@WMRI000167.corp.wmdata.net>

Ok, after removing "clean_start" and increasing post_join_delay to 120 I can start both cluster nodes without any "post join" fencing.

Here's the "group_tool dump fence" output from both nodes after a successful start of both nodes: http://pastebin.com/m5daed71b

After a simulated network error on test-db1 the following additional lines are printed in the fence log on test-db2:

1187876402 stop default
1187876402 start default 7 members 2 
1187876402 do_recovery stop 2 start 7 finish 1
1187876402 finish default 7
1187876421 client 3: dump

As far as I understand this means that fenced on test-db2 was notified that test-db1 disappeared but decided not to fence it for some unknown reason.

Can anybody explain this behavior?

/ Jonas

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 23 augusti 2007 15:30
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

I could remove "clean_start" entirely. It's default is zero. I would also increase your post_join_delay to two minutes. That will give fenced time to see all the cluster members on a fresh restart. 

FYI: If you are doing a total reboot of the cluster make sure all of your nodes come up at around the same time. 

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Thursday, August 23, 2007 8:24 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Here the cluster.conf I'm currently using:

http://pastebin.com/m35cd58d0

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 23 augusti 2007 15:19
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem


Can you post your fence configuration from cluster.conf?

Example: "<fence_daemon post_fail_delay="0" post_join_delay="120"/>"

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Thursday, August 23, 2007 2:40 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Ok, I've done some more testing now and manully running "killall fenced; fenced; fence-tool join -c" after each reboot on every node seems to be the only way to get fenced in a working state. But that's obviously not a good solution.

I tried adding clean_start="" to cluster.conf but that didn't help.

Is this perhaps a fenced bug? And if so, how can I help debugging it or work around it?

Regards,
Jonas

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: den 22 augusti 2007 17:31
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, that did the trick. I did the following on both nodes:

$ killall fenced
$ fenced
$ fence_tool join -c

After that I tested the same thing as earlier but this time the failed node was fensed!

/var/log/messages output:

Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] CLM CONFIGURATION CHANGE 
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] New Configuration: 
Aug 22 16:08:38 test-db2 kernel: dlm: closing connection to node 1
Aug 22 16:08:38 test-db2 fenced[28702]: test-db1.example.com not a cluster member after 0 sec post_fail_delay
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ]         r(0) ip(10.100.2.6)  
Aug 22 16:08:38 test-db2 fenced[28702]: fencing node "test-db1.example.com"


So I guess the only thing left to do now is to understand why it doesn't work without commands above. The only reason I can think of is that because I'm using fence_drac to power on/off the nodes so one node is usually started 10-20 seconds after the other and this usually results in one node being fenced by the other before the cluster starts up.

But after that everything appears to work, but I guess that somehow confuses fenced. 

Here's the "group dump fence" output after a restart of all nodes:

test-db1: http://pastebin.com/m73104841

test-db2: http://pastebin.com/ma427c89

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:57
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Try restarting fence cleanly. Stop the fenced service and run this command.

fence_tool join -c

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:49 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" works.

The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mentioning anything about trying to fence any node. And neither did "group_tool dump fence".

And even if something would be wrong with the fence_drac configuration wouldn't fence_manual kick in instead?

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 16:32
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm

Racreset


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root at test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root at test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root at test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root at test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeremy Carroll
Sent: den 22 augusti 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also can you run "cman_tool services" on both nodes to make sure Fenced is running?

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Borgstr?m Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node-test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1 the filsystem will become writeable on that node as well and probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From bsd_daemon at msn.com  Thu Aug 23 15:23:50 2007
From: bsd_daemon at msn.com (mehmet celik)
Date: Thu, 23 Aug 2007 15:23:50 +0000
Subject: [Linux-cluster] Kernel Oops gfs:gfs_glock_dq
In-Reply-To: <200708230856.27454.hlawatschek@atix.de>
Message-ID: <BAY105-F211B45FA181A333AEF7967E3D60@phx.gbl>


hi mark. thanx for information. I am very much sorry. Just I worked to 
guess. Maybe, It's be helper to it.

>From: Mark Hlawatschek <hlawatschek at atix.de>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: linux clustering <linux-cluster at redhat.com>
>Subject: Re: [Linux-cluster] Kernel Oops gfs:gfs_glock_dq
>
>Hi Mehmet,
>
>how do you explain the kernel dump with the number of gfs_glockd ? By 
>default
>there is one per gfs mount.
>Is there a bugzilla that adresses this problem ?
>
>In my opinion it looks like the panic occurs during a flock release.
>
>Mark
>
>On Wednesday 22 August 2007 15:56:29 mehmet celik wrote:
> > hi,
> >
> > did you look to manpages of gfs_tool ? I guess, you should change number 
>of
> > the glockd process. right now, it's not enough for your system.
> >
> > . . .
> > . . .
> > # gfs_tool margs "oopses_ok,num_glockd=X"
> > # mount
> > . . .
> >
> > the X must be a integer value (1...32).
> >
> > Mehmet CELIK
> > Istanbul/TURKEY
> >
> > >From: Reiner Rottmann <rottmann at atix.de>
> > >Reply-To: linux clustering <linux-cluster at redhat.com>
> > >To: linux-cluster at redhat.com
> > >Subject: [Linux-cluster] Kernel Oops gfs:gfs_glock_dq
> > >Date: Wed, 22 Aug 2007 11:57:01 +0200
> > >
> > >Hi,
> > >
> > >we experienced a kernel oops in gfs:gfs_glock_dq for the first time and 
>we
> > >would like to know if there are others who are affected by this issue.
> > >
> > >Although the node was successfully fenced by the cluster and could 
>rejoin
> > >without any troubles we are curious about what happened. There are no
> > >suspicious log entries preceding the oops messages.
> > >
> > >Is this perhaps a known bug and maybe already solved?
> > >
> > >
> > ># cat /var/log/messages
> > >...
> > >Aug 17 17:45:00 node01.mgmt  Unable to handle kernel NULL pointer
> > >dereference
> > >at 0000000000000000 RIP:
> > >Aug 17 17:45:00 node01.mgmt  <ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
> > >Aug 17 17:45:00 node01.mgmt  PML4 9363d067 PGD d5119067 PMD 0
> > >Aug 17 17:45:00 node01.mgmt  Oops: 0000 [1] SMP
> > >Aug 17 17:45:00 node01.mgmt  CPU 3
> > >Aug 17 17:45:00 node01.mgmt  Modules linked in: parport_pc lp parport
> > >netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds yenta_socket
> > >pcmcia_core ext3 j
> > >bd button battery ac ohci_hcd hw_random scsi_transport_fc md5 ipv6
> > >lock_dlm(U)
> > >dlm(U) gfs(U) lock_harness(U) cman(U) 8021q bonding(U) qla2300(U)
> > >qla2xxx_co
> > >nf(U) qla2xxx(U) sd_mod scsi_mod dm_snapshot dm_mirror dm_mod tg3
> > >Aug 17 17:45:00 node01.mgmt  Pid: 13218, comm: site_search_dev Not 
>tainted
> > >2.6.9-42.0.3.ELsmp
> > >Aug 17 17:45:00 node01.mgmt  RIP: 0010:[<ffffffffa0179aef>]
> > ><ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
> > >Aug 17 17:45:00 node01.mgmt  RSP: 0018:0000010042b33ce8  EFLAGS: 
>00010203
> > >Aug 17 17:45:00 node01.mgmt  RAX: 000001009cc5c368 RBX: 
>0000000000000000
> > >RCX:
> > >00000100af058d50
> > >Aug 17 17:45:00 node01.mgmt  RDX: 0000000000000000 RSI: 
>0000000000000000
> > >RDI:
> > >000001009cc5c35c
> > >Aug 17 17:45:00 node01.mgmt  RBP: 000001009cc5c330 R08: 
>0000000000000000
> > >R09:
> > >0000000000000000
> > >Aug 17 17:45:00 node01.mgmt  R10: 0000000000000000 R11: 
>00000100af058ce0
> > >R12:
> > >000001009cc5c35c
> > >Aug 17 17:45:01 node01.mgmt  R13: 0000010008fcc5e8 R14: 
>ffffff00005a3000
> > >R15:
> > >00000100234079a8
> > >Aug 17 17:45:01 node01.mgmt  FS:  0000002a95561a20(0000)
> > >GS:ffffffff804e5300
> > >(0000) knlGS:0000000000000000
> > >Aug 17 17:45:01 node01.mgmt  CS:  0010 DS: 0000 ES: 0000 CR0:
> > >000000008005003b
> > >Aug 17 17:45:01 node01.mgmt  CR2: 0000000000000000 CR3: 
>0000000001050000
> > >CR4:
> > >00000000000006e0
> > >Aug 17 17:45:01 node01.mgmt  Process site_search_dev (pid: 13218,
> > >threadinfo
> > >0000010042b32000, task 000001004886d7f0)
> > >Aug 17 17:45:01 node01.mgmt  Stack: 0000010042b33cf8 00012c5400000000
> > >ffffffffa01b99c0 0000010008fcc5e8
> > >Aug 17 17:45:01 node01.mgmt         0000010008fcc5c0 0000010008fcc5e8
> > >00000100dd2625e8 0000000000000000
> > >Aug 17 17:45:01 node01.mgmt         0000000000000000 ffffffffa0179c08
> > >Aug 17 17:45:01 node01.mgmt  Call
> > >Trace:<ffffffffa0179c08>{:gfs:gfs_glock_dq_uninit+9}
> > ><ffffffffa018ef63>{:gfs:gfs_flock+222}
> > >Aug 17 17:45:01 node01.mgmt
> > ><ffffffffa01e1781>{:lock_dlm:get_resource+84}
> > ><ffffffffa01e23d7>{:lock_dlm:lm_dlm_punlock+642}
> > >Aug 17 17:45:01 node01.mgmt
> > ><ffffffff8018e5e4>{locks_remove_flock+97}
> > ><ffffffff8017a21d>{__fput+73}
> > >Aug 17 17:45:01 node01.mgmt         <ffffffff80178e48>{filp_close+103}
> > ><ffffffff80178ed1>{sys_close+130}
> > >Aug 17 17:45:01 node01.mgmt         <ffffffff8011026a>{system_call+126}
> > >Aug 17 17:45:01 node01.mgmt
> > >Aug 17 17:45:01 node01.mgmt  Code: 48 8b 02 0f 18 08 48 8d 45 38 48 39 
>c2
> > >74
> > >0c ff 44 24 0c 49
> > >Aug 17 17:45:01 node01.mgmt  RIP 
><ffffffffa0179aef>{:gfs:gfs_glock_dq+191}
> > >RSP
> > ><0000010042b33ce8>
> > >Aug 17 17:45:01 node01.mgmt  CR2: 0000000000000000
> > >...
> > >
> > >$ uname -a
> > >Linux node01 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:28:26 CDT 2006 
>x86_64
> > >x86_64 x86_64 GNU/Linux
> > >
> > >$ rpm -qa
> > >...
> > >hp_qla2x00src-8.01.06-7
> > >kernel-smp-2.6.9-42.0.3.EL
> > >kernel-smp-devel-2.6.9-42.0.3.EL
> > >ccs-1.0.7-0
> > >cman-1.0.11-0
> > >cman-kernel-smp-2.6.9-45.8
> > >dlm-1.0.1-1
> > >dlm-kernel-smp-2.6.9-44.3
> > >GFS-6.1.6-1
> > >GFS-kernel-smp-2.6.9-60.3
> > >magma-1.0.6-0
> > >magma-plugins-1.0.9-0
> > >...
> > >
> > >
> > >--
> > >Gruss / Regards,
> > >
> > >Dipl.-Ing. (FH) Reiner Rottmann
> > >
> > >Phone: +49-89 452 3538-12
> > >
> > >http://www.atix.de/
> > >http://open-sharedroot.org/
> > >
> > >PGP Key ID: 0xCA67C5A6
> > >PGP Key Fingerprint = BF59FF006360B6E8D48F26B10D9F5A84CA67C5A6
> > >
> > >**
> > >ATIX Informationstechnologie und Consulting AG
> > >Einsteinstr. 10
> > >85716 Unterschleissheim
> > >Deutschland/Germany
> > >
> > >Phone: +49-89 452 3538-0
> > >Fax:   +49-89 990 1766-0
> > >
> > >Registergericht: Amtsgericht Muenchen
> > >Registernummer: HRB 168930
> > >USt.-Id.: DE209485962
> > >
> > >Vorstand:
> > >Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)
> > >
> > >Vorsitzender des Aufsichtsrats:
> > >Dr. Martin Buss
> > >
> > >
> > ><< attach4 >>
> > >
> > >
> > >
> > >
> > >--
> > >Linux-cluster mailing list
> > >Linux-cluster at redhat.com
> > >https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> > _________________________________________________________________
> > Find a local pizza place, movie theater, and more.then map the best
> > route!
> > 
>http://maps.live.com/default.aspx?v=2&ss=yp.bars~yp.pizza~yp.movie%20theate
> >r&cp=42.358996~-71.056691&style=r&lvl=13&tilt=-90&dir=0&alt=-1000&scene=9506
> >07&encType=1&FORM=MGAC01
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
>--
>Gruss / Regards,
>
>Dipl.-Ing. Mark Hlawatschek
>Phone: +49-89 452 3538-15
>http://www.atix.de/
>http://www.open-sharedroot.org/
>
>**
>ATIX Informationstechnologie und Consulting AG
>Einsteinstr. 10
>85716 Unterschleissheim
>Deutschland/Germany
>
>Phone: +49-89 452 3538-0
>Fax:   +49-89 990 1766-0
>
>Registergericht: Amtsgericht Muenchen
>Registernummer: HRB 168930
>USt.-Id.: DE209485962
>
>Vorstand:
>Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)
>
>Vorsitzender des Aufsichtsrats:
>Dr. Martin Buss
>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster

_________________________________________________________________
Now you can see trouble?before he arrives 
http://newlivehotmail.com/?ocid=TXT_TAGHM_migration_HM_viral_protection_0507



From JFillman at cucbc.com  Thu Aug 23 22:33:45 2007
From: JFillman at cucbc.com (James Fillman)
Date: Thu, 23 Aug 2007 15:33:45 -0700
Subject: [Linux-cluster] GFS blows up after a SAN path failover
Message-ID: <66F461DD7EDEEF4AA928FCC80B425B52AFCB90@c2kp01mail.cucbc.com>

I've been working for several weeks to get my RHEL5 cluster configured
with GFS and multipathing to a Sun 6140 SAN. The multipathing has been
giving me the most trouble and after some tests this morning I concluded
that everything is working great. But then I took a closer look at the
nodes in my cluster and all was not as it seemed.

While running a "write" test to the GFS volume on all my nodes, I
simulated a SAN path failure, causing multipathd to fail over to the
second path. All the multipathing worked as it should, however my GFS
volumes blew up and my cluster is now in disarray.

Details:
- RHEL5-64
- GFS1
- multipathing
- Sun 6140 SAN
- stock RH qlogic driver


This is the typical dmesg output on my cluster nodes:
device-mapper: multipath: Failing path 8:32.
device-mapper: multipath: Failing path 8:80.
Buffer I/O error on device dm-11, logical block 9811906
lost page write due to I/O error on dm-11
GFS: fsid=p1_logging_mdi:mdi_log.4: fatal: I/O error
GFS: fsid=p1_logging_mdi:mdi_log.4:   block = 13073841
GFS: fsid=p1_logging_mdi:mdi_log.4:   function = gfs_logbh_wait
GFS: fsid=p1_logging_mdi:mdi_log.4:   file =
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.c, line =
925
GFS: fsid=p1_logging_mdi:mdi_log.4:   time = 1187897620
GFS: fsid=p1_logging_mdi:mdi_log.4: about to withdraw from the cluster
GFS: fsid=p1_logging_mdi:mdi_log.4: telling LM to withdraw
Buffer I/O error on device dm-11, logical block 9811907
lost page write due to I/O error on dm-11
Buffer I/O error on device dm-11, logical block 9811908
lost page write due to I/O error on dm-11

Please help. The GFS volumes should be able to handle a path failover.
Especially when it only takes several seconds for the failover to occur.
This sounds like it's timout related but where do I tweak the timeout
values? At the scsi layer? If so how?  At the HBA layer? At the GFS
layer? 

Trying to reboot any one of my nodes is now a problem. They all just
hang trying to umount the GFS volumes. 

I'm quite new to this so I'm hoping for some serious hand holding.

cheers,
--james





From kadlec at sunserv.kfki.hu  Fri Aug 24 08:43:58 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Fri, 24 Aug 2007 10:43:58 +0200 (MEST)
Subject: [Linux-cluster] Nodes show up extremely slowly (or not at all)
Message-ID: <Pine.GSO.4.64.0708241023170.8481@sunserv.kfki.hu>

Hi,

This is a test GFS installation on ubuntu dapper machines:

linux-2.6.23-rc3
openais-0.80.3
device-mapper-1.02.23
cluster-2.01.00
LVM2-2.02.28

When starting up, up to 'cman_tool join' all looks OK. 
However, the nodes see each other after a very long time only:

root at lxserv0:~# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     92   2007-08-24 10:14:43  lxserv0-gfs
   2   M    112   2007-08-24 10:29:30  lxserv1-gfs
   3   M    112   2007-08-24 10:29:30  web0-gfs
   4   X      0                        web1-gfs
   5   X      0                        saturn-gfs

root at lxserv1:~# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X    112                        lxserv0-gfs
   2   M    108   2007-08-24 10:16:20  lxserv1-gfs
   3   X    112                        web0-gfs
   4   X      0                        web1-gfs
   5   X      0                        saturn-gfs

It took 13 minutes(!) for 'lsxerv0' to notice 'lxserv1'.
As I'm writing this, 'lxserv1' still does not see neither 'lxserv0' nor 
'web0'.

What can be wrong here? Is it an openais-related problem?
How could I debug the situation? 

Thanks any suggestion!

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From changerv at gmail.com  Fri Aug 24 09:57:03 2007
From: changerv at gmail.com (Changer Van)
Date: Fri, 24 Aug 2007 17:57:03 +0800
Subject: [Linux-cluster] luci can not restart
Message-ID: <9fa3c2e50708240257m31d487cei89538217363ae362@mail.gmail.com>

Hi all,

I tried to install RHCS on RHEL 5. RHCS and dependences were installed.
Luci server has been initialized successfully, but when luci service was
restarted,
it failed. Could any one tell me how to fix the problem?

Thanks in advance.
-- 
Regards,
Changer
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070824/15fa905f/attachment.htm>

From sebastian.walter at fu-berlin.de  Fri Aug 24 10:37:15 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Fri, 24 Aug 2007 12:37:15 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <200708211009.38845.grimme@atix.de>
References: <46C78986.4070301@fu-berlin.de> <200708202052.42581.grimme@atix.de>
	<46CA99C0.6060502@fu-berlin.de> <200708211009.38845.grimme@atix.de>
Message-ID: <46CEB4DB.4050901@fu-berlin.de>

Hi list,

just an update. In my scripts, there is nothing about searching the
whole file system, but I see several "df" processes blocking the system
with 100 % CPU. I will update firmwares now and check for better QLogic
drivers. Thanks!

Regards,
Sebastian

Marc Grimme wrote:
> On Tuesday 21 August 2007 09:52:32 Sebastian Walter wrote:
>   
>> Hi,
>>
>> Marc Grimme wrote:
>>     
>>> Do you also see some messages on the console of the nodes. And the
>>> gfs_tool
>>> counters would help before that problem occures. So let it run sometimes
>>> before to see if locks increase.
>>> What kind of stress tests are you doing? I bet searching the whole
>>> filesystem. What makes me wonder is that the gfs_tool glock_purge does
>>> not work whereas it worked for me with exactly the same problems. Did you
>>> set it _AFTER_ the fs was mounted?
>>>       
> Sorry I mean after is right and before not ;-( .
> And are you using the latest version of CS/GFS?
> Do you have a lot of memory in your machines 16G or more?
>   
>> That makes me optimistic. I set it after the volume was mounted, so I
>> will give it another try setting it before mounting it. Then I will also
>> mail myself the output of the counters every 10 minuts. Let's see...
>>     
> I would be interested in the counters.
> Also add the process list in order to see if how much CPU-Time gfs_scand 
> consumes.
> i.e.
> ps axwwww | sort -k4 -n | tail -10
>
> Have fun Marc.
>   
>> ...with best thanks
>> Sebastian
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>     
>
>
>
>   



From bernard.chew at muvee.com  Fri Aug 24 11:01:15 2007
From: bernard.chew at muvee.com (Bernard Chew)
Date: Fri, 24 Aug 2007 19:01:15 +0800
Subject: [Linux-cluster] Using cmirror
In-Reply-To: <229C73600EB0E54DA818AB599482BCE901A3EF2D@shadowfax.sg.muvee.net>
Message-ID: <229C73600EB0E54DA818AB599482BCE901A3F346@shadowfax.sg.muvee.net>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bernard Chew
> Sent: Wednesday, August 22, 2007 10:05 PM
> To: linux clustering
> Subject: RE: [Linux-cluster] Using cmirror

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jonathan
Brassow
> Sent: Friday, August 10, 2007 11:01 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Using cmirror

> If you've set up a cluster and are using LVM, it will work the same
> way as single machine mirroring.

> http://www.redhat.com/docs/manuals/csgfs/browse/4.5/
> SAC_Cluster_Logical_Volume_Manager/mirror_create.html

>  brassow

> On Aug 10, 2007, at 6:56 AM, Bernard Chew wrote:

> Hi,
>
> I read that cmirror provides user-level utilities for managing cluster

> mirroring but could not find much documentation on it. Can anyone 
> point me to any documentation / guide around?
>
> Regards,
> Bernard Chew
> IT Operations
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

> Hi all,

> I have problems creating a mirrored linear logical volume in a 3 nodes
cluster (using lvcreate -L 960M -m1 -n lv1 vg1 /dev/sdd /dev/sdb
/dev/sdc), and encounter the following errors:

> Error locking on node NodeA.hogwarts.com: device-mapper: reload ioctl
> failed: Invalid argument
> Error locking on node NodeB.hogwarts.com: device-mapper: reload ioctl
> failed: Invalid argument
> Error locking on node NodeC.hogwarts.com: device-mapper: reload ioctl
> failed: Invalid argument
> Failed to activate new LV

> I am able to create a linear logical volume without mirroring using
the same 3 pvs with GFS filesystem (clustered) but not the above. Any
help is appreciated.

> Thanks,
> Bernard Chew

Hi,

Managed to find out the reason; the cmirror modules were not loaded into
the kernel even though it was configured to start.

Regards,
Bernard Chew



From bsheets at singlefin.net  Fri Aug 24 11:06:15 2007
From: bsheets at singlefin.net (Brian Sheets)
Date: Fri, 24 Aug 2007 11:06:15 +0000 (UTC)
Subject: [Linux-cluster] fence_apc_snmp woes
Message-ID: <873211268.81391187953575543.JavaMail.root@v-mailhost2.mxpath.net>

I have a 2 node cluster on debian. Below is my cluster.conf. If I down node1's nic 
node2 sees and tries to fence


Aug 24 10:57:38 oc-index4 fenced[7599]: oc-index3 not a cluster member after 0 sec post_fail_delay
Aug 24 10:57:38 oc-index4 fenced[7599]: fencing node "oc-index3"
Aug 24 10:57:38 oc-index4 fence_manual: Node 172.16.14.100 needs to be reset before recovery can procede.  Waiting for 172.16.14.100 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 172.16.14.100)
Aug 24 10:59:34 oc-index4 fenced[7599]: fence "oc-index3" success

It states that it's fencing, but never does, and if I do a fence_ack_manual, then fence_apc_snmp gets run and the node1 gets powered down.

what am I missing?

<?xml version="1.0"?>
<cluster name="index" config_version="2">
<cman two_node="1" expected_votes="1">
</cman>
<clusternodes>
<clusternode name="oc-index3" votes="1">
        <fence>
                <method name="single">
                       <device name="oc-cab1-pdu2" port="18" option="off"/>
                </method>
        </fence>
</clusternode>

<clusternode name="oc-index4" votes="1">
        <fence>
                <method name="single">
                  <device name="oc-cab1-pdu1" port="16" option="off"/>
                  </method>
        </fence>
</clusternode>

</clusternodes>
<fencedevices>
        <fencedevice name="oc-cab1-pdu2" agent="fence_apc_snmp" ipaddr="172.16.14.9" login="apc" passwd="xxxx"/>
        <fencedevice name="oc-cab1-pdu1" agent="fence_apc_snmp" ipaddr="172.16.14.8" login="apc" passwd="xxxx"/>
</fencedevices>
</cluster>



From grimme at atix.de  Fri Aug 24 11:35:09 2007
From: grimme at atix.de (Marc Grimme)
Date: Fri, 24 Aug 2007 13:35:09 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46CEB4DB.4050901@fu-berlin.de>
References: <46C78986.4070301@fu-berlin.de> <200708211009.38845.grimme@atix.de>
	<46CEB4DB.4050901@fu-berlin.de>
Message-ID: <200708241335.11076.grimme@atix.de>

Hi Sebastian,
just to double check. Fencing and everything works as expected, right?

2nd the latest RHEL4 kernel is 2.6.9-55.0.2 (is that also available for 
centos?). If yes you might think about updating. I'm not sure if something 
was updated within dlm/gfs but my tests were done with 2.6.9-55.0.2 and I 
didn't encounter those problems whereas before I had huge amounts of locks 
(~2times the number of files on the fs).

On Friday 24 August 2007 12:37:15 Sebastian Walter wrote:
> Hi list,
>
> just an update. In my scripts, there is nothing about searching the
> whole file system, but I see several "df" processes blocking the system
> with 100 % CPU. I will update firmwares now and check for better QLogic
> drivers. Thanks!
I fear that a firmware update will not change anything but it's always a good 
option ;-) . I also doubt about the Qlogic drivers cause the ones in 2.6.9-55 
are quite ok (did you configure multipathing properly?).
Is that df and everything running concurrently on different nodes?

Last but not least are the "unable to obtain locks" messages the only messages 
that you see when getting problems?

Regards Marc.
>
> Regards,
> Sebastian
>
> Marc Grimme wrote:
> > On Tuesday 21 August 2007 09:52:32 Sebastian Walter wrote:
> >> Hi,
> >>
> >> Marc Grimme wrote:
> >>> Do you also see some messages on the console of the nodes. And the
> >>> gfs_tool
> >>> counters would help before that problem occures. So let it run
> >>> sometimes before to see if locks increase.
> >>> What kind of stress tests are you doing? I bet searching the whole
> >>> filesystem. What makes me wonder is that the gfs_tool glock_purge does
> >>> not work whereas it worked for me with exactly the same problems. Did
> >>> you set it _AFTER_ the fs was mounted?
> >
> > Sorry I mean after is right and before not ;-( .
> > And are you using the latest version of CS/GFS?
> > Do you have a lot of memory in your machines 16G or more?
> >
> >> That makes me optimistic. I set it after the volume was mounted, so I
> >> will give it another try setting it before mounting it. Then I will also
> >> mail myself the output of the counters every 10 minuts. Let's see...
> >
> > I would be interested in the counters.
> > Also add the process list in order to see if how much CPU-Time gfs_scand
> > consumes.
> > i.e.
> > ps axwwww | sort -k4 -n | tail -10
> >
> > Have fun Marc.
> >
> >> ...with best thanks
> >> Sebastian
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10 
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax:   +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962

Vorstand: 
Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)

Vorsitzender des Aufsichtsrats:
Dr. Martin Buss



From sebastian.walter at fu-berlin.de  Fri Aug 24 11:44:30 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Fri, 24 Aug 2007 13:44:30 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <200708241335.11076.grimme@atix.de>
References: <46C78986.4070301@fu-berlin.de> <200708211009.38845.grimme@atix.de>
	<46CEB4DB.4050901@fu-berlin.de> <200708241335.11076.grimme@atix.de>
Message-ID: <46CEC49E.4020201@fu-berlin.de>

Hi Marc,

thank you for your answer. Fencing works fine, also everything else
works for long times, except when the I/O raises a certain level....
The kernel instead could indeed be a problem! 2.6.9-55.02 is the
standard Update5 kernel also for CentOS, but I had to downgrade as all
the cs/gfs packages were dependant on 2.6.9-55.0

I totally forgot about it... I will compile everything for the new
kernel now, let's see.

Regards,
Sebastian

PS: Btw, the hanging df processes came from the daily logwatch...


Marc Grimme wrote:
> Hi Sebastian,
> just to double check. Fencing and everything works as expected, right?
>
> 2nd the latest RHEL4 kernel is 2.6.9-55.0.2 (is that also available for 
> centos?). If yes you might think about updating. I'm not sure if something 
> was updated within dlm/gfs but my tests were done with 2.6.9-55.0.2 and I 
> didn't encounter those problems whereas before I had huge amounts of locks 
> (~2times the number of files on the fs).
>
> On Friday 24 August 2007 12:37:15 Sebastian Walter wrote:
>   
>> Hi list,
>>
>> just an update. In my scripts, there is nothing about searching the
>> whole file system, but I see several "df" processes blocking the system
>> with 100 % CPU. I will update firmwares now and check for better QLogic
>> drivers. Thanks!
>>     
> I fear that a firmware update will not change anything but it's always a good 
> option ;-) . I also doubt about the Qlogic drivers cause the ones in 2.6.9-55 
> are quite ok (did you configure multipathing properly?).
> Is that df and everything running concurrently on different nodes?
>
> Last but not least are the "unable to obtain locks" messages the only messages 
> that you see when getting problems?
>
> Regards Marc.
>   
>> Regards,
>> Sebastian
>>
>> Marc Grimme wrote:
>>     
>>> On Tuesday 21 August 2007 09:52:32 Sebastian Walter wrote:
>>>       
>>>> Hi,
>>>>
>>>> Marc Grimme wrote:
>>>>         
>>>>> Do you also see some messages on the console of the nodes. And the
>>>>> gfs_tool
>>>>> counters would help before that problem occures. So let it run
>>>>> sometimes before to see if locks increase.
>>>>> What kind of stress tests are you doing? I bet searching the whole
>>>>> filesystem. What makes me wonder is that the gfs_tool glock_purge does
>>>>> not work whereas it worked for me with exactly the same problems. Did
>>>>> you set it _AFTER_ the fs was mounted?
>>>>>           
>>> Sorry I mean after is right and before not ;-( .
>>> And are you using the latest version of CS/GFS?
>>> Do you have a lot of memory in your machines 16G or more?
>>>
>>>       
>>>> That makes me optimistic. I set it after the volume was mounted, so I
>>>> will give it another try setting it before mounting it. Then I will also
>>>> mail myself the output of the counters every 10 minuts. Let's see...
>>>>         
>>> I would be interested in the counters.
>>> Also add the process list in order to see if how much CPU-Time gfs_scand
>>> consumes.
>>> i.e.
>>> ps axwwww | sort -k4 -n | tail -10
>>>
>>> Have fun Marc.
>>>
>>>       
>>>> ...with best thanks
>>>> Sebastian
>>>>
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>         
>
>
>
>   



From kadlec at sunserv.kfki.hu  Fri Aug 24 13:12:44 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Fri, 24 Aug 2007 15:12:44 +0200 (MEST)
Subject: [Linux-cluster] Re: Nodes show up extremely slowly (or not at all)
In-Reply-To: <Pine.GSO.4.64.0708241023170.8481@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708241023170.8481@sunserv.kfki.hu>
Message-ID: <Pine.GSO.4.64.0708241439110.8481@sunserv.kfki.hu>

On Fri, 24 Aug 2007, Kadlecsik Jozsi wrote:

> It took 13 minutes(!) for 'lsxerv0' to notice 'lxserv1'.

It turned out to be a switch-related multicast problem: openais used the 
default(?) 239.192.6.53 address, while the switch strangely enough passes 
the 224.0.0.x addresses only. So by setting the multicast address in
cluster.conf explicitly to 224.0.0.3 cured the problem.

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From rmccabe at redhat.com  Fri Aug 24 13:23:58 2007
From: rmccabe at redhat.com (Ryan McCabe)
Date: Fri, 24 Aug 2007 09:23:58 -0400
Subject: [Linux-cluster] luci can not restart
In-Reply-To: <9fa3c2e50708240257m31d487cei89538217363ae362@mail.gmail.com>
References: <9fa3c2e50708240257m31d487cei89538217363ae362@mail.gmail.com>
Message-ID: <20070824132358.GA13518@redhat.com>

On Fri, Aug 24, 2007 at 05:57:03PM +0800, Changer Van wrote:
> Hi all,
> 
> I tried to install RHCS on RHEL 5. RHCS and dependences were installed.
> Luci server has been initialized successfully, but when luci service was
> restarted,
> it failed. Could any one tell me how to fix the problem?

Which version of luci do you have?

As a start, you may want to check the files in /var/lib/luci/log/
for information that may tip us off as to what went wrong.


Ryan



From sebastian.walter at fu-berlin.de  Fri Aug 24 13:27:38 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Fri, 24 Aug 2007 15:27:38 +0200
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <200708241335.11076.grimme@atix.de>
References: <46C78986.4070301@fu-berlin.de> <200708211009.38845.grimme@atix.de>
	<46CEB4DB.4050901@fu-berlin.de> <200708241335.11076.grimme@atix.de>
Message-ID: <46CEDCCA.7060609@fu-berlin.de>

Hi Marc,

did you install the CSGFS packages from the Red Hat repository? I tried
to compile the packages on the new kernel using the centos .src.rpm's
but they also rely on the older kernel. I would have to use a generic
www.kernel.org kernel to compile the cluster-suite packages from source,
which I want to avoid. If the Redhat modules are built on 2.6.9-55.0.2,
there would be a problem with the CentOS repository...

thanks & regards,
Sebastian

Marc Grimme wrote:
> Hi Sebastian,
> just to double check. Fencing and everything works as expected, right?
>
> 2nd the latest RHEL4 kernel is 2.6.9-55.0.2 (is that also available for 
> centos?). If yes you might think about updating. I'm not sure if something 
> was updated within dlm/gfs but my tests were done with 2.6.9-55.0.2 and I 
> didn't encounter those problems whereas before I had huge amounts of locks 
> (~2times the number of files on the fs).
>
> On Friday 24 August 2007 12:37:15 Sebastian Walter wrote:
>   
>> Hi list,
>>
>> just an update. In my scripts, there is nothing about searching the
>> whole file system, but I see several "df" processes blocking the system
>> with 100 % CPU. I will update firmwares now and check for better QLogic
>> drivers. Thanks!
>>     
> I fear that a firmware update will not change anything but it's always a good 
> option ;-) . I also doubt about the Qlogic drivers cause the ones in 2.6.9-55 
> are quite ok (did you configure multipathing properly?).
> Is that df and everything running concurrently on different nodes?
>
> Last but not least are the "unable to obtain locks" messages the only messages 
> that you see when getting problems?
>
> Regards Marc.
>   
>> Regards,
>> Sebastian
>>
>> Marc Grimme wrote:
>>     
>>> On Tuesday 21 August 2007 09:52:32 Sebastian Walter wrote:
>>>       
>>>> Hi,
>>>>
>>>> Marc Grimme wrote:
>>>>         
>>>>> Do you also see some messages on the console of the nodes. And the
>>>>> gfs_tool
>>>>> counters would help before that problem occures. So let it run
>>>>> sometimes before to see if locks increase.
>>>>> What kind of stress tests are you doing? I bet searching the whole
>>>>> filesystem. What makes me wonder is that the gfs_tool glock_purge does
>>>>> not work whereas it worked for me with exactly the same problems. Did
>>>>> you set it _AFTER_ the fs was mounted?
>>>>>           
>>> Sorry I mean after is right and before not ;-( .
>>> And are you using the latest version of CS/GFS?
>>> Do you have a lot of memory in your machines 16G or more?
>>>
>>>       
>>>> That makes me optimistic. I set it after the volume was mounted, so I
>>>> will give it another try setting it before mounting it. Then I will also
>>>> mail myself the output of the counters every 10 minuts. Let's see...
>>>>         
>>> I would be interested in the counters.
>>> Also add the process list in order to see if how much CPU-Time gfs_scand
>>> consumes.
>>> i.e.
>>> ps axwwww | sort -k4 -n | tail -10
>>>
>>> Have fun Marc.
>>>
>>>       
>>>> ...with best thanks
>>>> Sebastian
>>>>
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>         
>
>
>
>   



From nlam87346 at library.usyd.edu.au  Fri Aug 24 13:36:40 2007
From: nlam87346 at library.usyd.edu.au (Nikolas Lam)
Date: Fri, 24 Aug 2007 23:36:40 +1000
Subject: [Linux-cluster] Re: Nodes show up extremely slowly (or not at all)
In-Reply-To: <Pine.GSO.4.64.0708241439110.8481@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708241023170.8481@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708241439110.8481@sunserv.kfki.hu>
Message-ID: <1187962600.3353.37.camel@lits17.library.usyd.edu.au>

On Fri, 2007-08-24 at 15:12 +0200, Kadlecsik Jozsi wrote:
> On Fri, 24 Aug 2007, Kadlecsik Jozsi wrote:
> 
> > It took 13 minutes(!) for 'lsxerv0' to notice 'lxserv1'.
> 
> It turned out to be a switch-related multicast problem: openais used the 
> default(?) 239.192.6.53 address, while the switch strangely enough passes 
> the 224.0.0.x addresses only. So by setting the multicast address in
> cluster.conf explicitly to 224.0.0.3 cured the problem.


Hi Jozsef,

I think I might be having similar problems. However I seem to be unable
to specify a multicast address manually using conga / luci (the web
based tool). When I press "apply" I see the reconfiguration happening,
but when I look at my cluster.conf, there's no multicast address in
there:

<?xml version="1.0"?>
<cluster alias="clust0" config_version="10" name="clust0">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="30"/>
        <clusternodes>
                <clusternode name="host0.library.usyd.edu.au" nodeid="1"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="host0"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="host1.library.usyd.edu.au" nodeid="2"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="host1"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice action="reboot" agent="fence_ilo" force="1"
hostname="192.168.0.3" login="clust0" name="host0" passwd="ilopasswd"/>
                <fencedevice action="reboot" agent="fence_ilo" force="1"
hostname="192.168.0.4" login="clust0" name="host1" passwd="ilopasswd"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
        <fence_xvmd/>
</cluster>


Also when I try

 netstat -gn

itill seems to show the multicast address that openais chose by itself:

[root at host0 ~]# netstat -gn
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      224.0.0.1
eth0            1      224.0.0.251
eth0            1      225.0.0.12
eth0            1      239.192.205.2
eth0            1      224.0.0.1
[root at host0 ~]# 


Could I please have a look at your cluster.conf?

Regards,

Nik Lam
-- 
Unix-like Systems Administrator
----------------------------------------------------------
Library IT Services                   
Level 1, Fisher Library (Building F03)
The University of Sydney, NSW 2006, Australia
Phone: +61 2 9351 4304       Fax: +61 2 9351 7769
CRICOS Provider Number: 00026A
----------------------------------------------------------



From Jeremyc at tasconline.com  Fri Aug 24 13:35:55 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Fri, 24 Aug 2007 08:35:55 -0500
Subject: [Linux-cluster] fence_apc_snmp woes
In-Reply-To: <873211268.81391187953575543.JavaMail.root@v-mailhost2.mxpath.net>
References: <873211268.81391187953575543.JavaMail.root@v-mailhost2.mxpath.net>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F4B@exchange2003.tasconline.com>

Try to run the fence agent manually. See if it works. Most likely the
problem is fenced cannot successfully run the program to power down the
switch. Post your results.

Ex:
fence_apc_snmp -v -a oc-index3 -c <passwd> -n 18 -o off 


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brian Sheets
Sent: Friday, August 24, 2007 6:06 AM
To: linux clustering
Subject: [Linux-cluster] fence_apc_snmp woes

I have a 2 node cluster on debian. Below is my cluster.conf. If I down
node1's nic 
node2 sees and tries to fence


Aug 24 10:57:38 oc-index4 fenced[7599]: oc-index3 not a cluster member
after 0 sec post_fail_delay
Aug 24 10:57:38 oc-index4 fenced[7599]: fencing node "oc-index3"
Aug 24 10:57:38 oc-index4 fence_manual: Node 172.16.14.100 needs to be
reset before recovery can procede.  Waiting for 172.16.14.100 to rejoin
the cluster or for manual acknowledgement that it has been reset (i.e.
fence_ack_manual -n 172.16.14.100)
Aug 24 10:59:34 oc-index4 fenced[7599]: fence "oc-index3" success

It states that it's fencing, but never does, and if I do a
fence_ack_manual, then fence_apc_snmp gets run and the node1 gets
powered down.

what am I missing?

<?xml version="1.0"?>
<cluster name="index" config_version="2">
<cman two_node="1" expected_votes="1">
</cman>
<clusternodes>
<clusternode name="oc-index3" votes="1">
        <fence>
                <method name="single">
                       <device name="oc-cab1-pdu2" port="18"
option="off"/>
                </method>
        </fence>
</clusternode>

<clusternode name="oc-index4" votes="1">
        <fence>
                <method name="single">
                  <device name="oc-cab1-pdu1" port="16" option="off"/>
                  </method>
        </fence>
</clusternode>

</clusternodes>
<fencedevices>
        <fencedevice name="oc-cab1-pdu2" agent="fence_apc_snmp"
ipaddr="172.16.14.9" login="apc" passwd="xxxx"/>
        <fencedevice name="oc-cab1-pdu1" agent="fence_apc_snmp"
ipaddr="172.16.14.8" login="apc" passwd="xxxx"/>
</fencedevices>
</cluster>

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From orkcu at yahoo.com  Fri Aug 24 13:44:25 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Fri, 24 Aug 2007 06:44:25 -0700 (PDT)
Subject: [Linux-cluster] GFS hangs, nodes die
In-Reply-To: <46CEDCCA.7060609@fu-berlin.de>
Message-ID: <716434.56879.qm@web50603.mail.re2.yahoo.com>


--- Sebastian Walter <sebastian.walter at fu-berlin.de>
wrote:

> Hi Marc,
> 
> did you install the CSGFS packages from the Red Hat
> repository? I tried
> to compile the packages on the new kernel using the
> centos .src.rpm's
> but they also rely on the older kernel. I would have
> to use a generic
> www.kernel.org kernel to compile the cluster-suite
> packages from source,
> which I want to avoid. If the Redhat modules are
> built on 2.6.9-55.0.2,
> there would be a problem with the CentOS
> repository...

in 
http://mirrors.kernel.org/redhat/redhat/linux/updates/enterprise/4AS/en/RHCS/SRPMS/?C=M;O=A
there is nothing after Jun 22 (when
kernel-2.6.9-55.0.2 was released)
nore here:
http://mirrors.kernel.org/redhat/redhat/linux/updates/enterprise/4AS/en/RHGFS/SRPMS/?C=M;O=A

so, I guess, if there are RHCS&GFS binaries for the
new kernel, they came from the same old source or
redhat do not make a public release of the new sources

cu
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


       
____________________________________________________________________________________
Sick sense of humor? Visit Yahoo! TV's 
Comedy with an Edge to see what's on, when. 
http://tv.yahoo.com/collections/222



From kadlec at sunserv.kfki.hu  Fri Aug 24 14:03:07 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Fri, 24 Aug 2007 16:03:07 +0200 (MEST)
Subject: [Linux-cluster] Re: Nodes show up extremely slowly (or not at all)
In-Reply-To: <1187962600.3353.37.camel@lits17.library.usyd.edu.au>
References: <Pine.GSO.4.64.0708241023170.8481@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708241439110.8481@sunserv.kfki.hu>
	<1187962600.3353.37.camel@lits17.library.usyd.edu.au>
Message-ID: <Pine.GSO.4.64.0708241600340.8481@sunserv.kfki.hu>

Hi,

On Fri, 24 Aug 2007, Nikolas Lam wrote:

> > It turned out to be a switch-related multicast problem: openais used the 
> > default(?) 239.192.6.53 address, while the switch strangely enough passes 
> > the 224.0.0.x addresses only. So by setting the multicast address in
> > cluster.conf explicitly to 224.0.0.3 cured the problem.
> 
> I think I might be having similar problems. However I seem to be unable
> to specify a multicast address manually using conga / luci (the web
> based tool). 
[..]
> Could I please have a look at your cluster.conf?

Sure, I attached ours. Please note, you must specify the multicast address 
for cman and every clusternode too and the address must be the same 
everywhere. (See 
http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.)

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary
-------------- next part --------------
<?xml version="1.0"?>
<cluster name="kfki" config_version="6">

<cman>
	<multicast addr="224.0.0.3"/>
</cman>

<clusternodes>
<clusternode name="lxserv0-gfs" nodeid="1">
	<fence>
		<method name="fabric">
			<device name="aoe" hostname="lxserv0" mac="00:07:E9:11:70:A3"/>
		</method>
	</fence>
	<multicast addr="224.0.0.3" interface="eth1"/>  
</clusternode>
<clusternode name="lxserv1-gfs" nodeid="2">
	<fence>
		<method name="fabric">
			<device name="aoe" hostname="lxserv1" mac="-00:16:76:83:1E:13"/>
		</method>
	</fence>
	<multicast addr="224.0.0.3" interface="eth1"/>  
</clusternode>
<clusternode name="web0-gfs" nodeid="3">
	<fence>
		<method name="fabric">
			<device name="aoe" hostname="web0" mac="00:07:E9:11:71:C1"/>
		</method>
	</fence>
	<multicast addr="224.0.0.3" interface="eth1"/>  
</clusternode>
<clusternode name="web1-gfs" nodeid="4">
	<fence>
		<method name="fabric">
			<device name="aoe" hostname="web1" mac="00:16:76:83:0B:EB"/>
		</method>
	</fence>
	<multicast addr="224.0.0.3" interface="eth1"/>  
</clusternode>
<clusternode name="saturn-gfs" nodeid="5">
	<fence>
		<method name="fabric">
			<device name="aoe" hostname="saturn" mac="00:1B:21:00:31:17"/>
		</method>
	</fence>
	<multicast addr="224.0.0.3" interface="eth1"/>  
</clusternode>
</clusternodes>


<fencedevices>
	<fencedevice name="aoe" agent="fence_aoe"/>
</fencedevices>

</cluster>

From bfilipek at crscold.com  Fri Aug 24 14:14:30 2007
From: bfilipek at crscold.com (Brad Filipek)
Date: Fri, 24 Aug 2007 09:14:30 -0500
Subject: [Linux-cluster] dissapointed with GFS performance
Message-ID: <9C01E18EF3BC2448A3B1A4812EB87D0278CE8E@SRVEDI.upark.crscold.com>

Is there any way to "tune" gfs so that it will run faster? I have a GFS
partition with data on it which our internal order system accesses. It
is a text based (command line) driven application which is installed on
the local disks, and accesses the data through the /data directory which
is mounted on the GFS partition. When I run the program, I can notice
latency between menu changes. To confirm that it is GFS causing the
latency, I created an ext3 partition on my SAN, unmounted /data from the
GFS partition, then mounted /data to the newly created EXT3 partition. I
placed the exact same data on this EXT3 partition, fired up my
application, and it whips through menu changes instantly. This tells me
that the SAN is fast, but GFS is not. 

 

What can I do to speed up GFS, or is this just the way it is? 

 

Thanks, 

Brad Filipek



 


Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070824/030cf774/attachment.htm>

From Alexandre.Racine at mhicc.org  Fri Aug 24 14:24:06 2007
From: Alexandre.Racine at mhicc.org (Alexandre Racine)
Date: Fri, 24 Aug 2007 10:24:06 -0400
Subject: [Linux-cluster] dissapointed with GFS performance
References: <9C01E18EF3BC2448A3B1A4812EB87D0278CE8E@SRVEDI.upark.crscold.com>
Message-ID: <C43CF0825BF59D4FBC1F6A2AF45EB88D3C3430@cumulonimbus.RG.local>



You'll never have the same performance with remote data and local data.
You can test those with real numbers if you want. You can try those two hard drive benchmark programs : bonnie++ and iozone.

I am actually working on improving performance right now.

Where to begin:
-Do you have a 1000gbs link?
-Do you use jumbo frames?

You also have to know what your software is actually doing. Reading 1 text file or 2000? You did not post your gfs version, but I know that there was a patch lately for directory listing that was sooo slow.




Alexandre Racine
Projets sp?ciaux
514-461-1300 poste 3304
alexandre.racine at mhicc.org



-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of Brad Filipek
Sent: Fri 2007-08-24 10:14
To: linux clustering
Subject: [Linux-cluster] dissapointed with GFS performance
 
Is there any way to "tune" gfs so that it will run faster? I have a GFS
partition with data on it which our internal order system accesses. It
is a text based (command line) driven application which is installed on
the local disks, and accesses the data through the /data directory which
is mounted on the GFS partition. When I run the program, I can notice
latency between menu changes. To confirm that it is GFS causing the
latency, I created an ext3 partition on my SAN, unmounted /data from the
GFS partition, then mounted /data to the newly created EXT3 partition. I
placed the exact same data on this EXT3 partition, fired up my
application, and it whips through menu changes instantly. This tells me
that the SAN is fast, but GFS is not. 

 

What can I do to speed up GFS, or is this just the way it is? 

 

Thanks, 

Brad Filipek



 


Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 3958 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070824/91649a91/attachment.bin>

From Jeremyc at tasconline.com  Fri Aug 24 14:42:25 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Fri, 24 Aug 2007 09:42:25 -0500
Subject: [Linux-cluster] dissapointed with GFS performance
In-Reply-To: <C43CF0825BF59D4FBC1F6A2AF45EB88D3C3430@cumulonimbus.RG.local>
References: <9C01E18EF3BC2448A3B1A4812EB87D0278CE8E@SRVEDI.upark.crscold.com>
	<C43CF0825BF59D4FBC1F6A2AF45EB88D3C3430@cumulonimbus.RG.local>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F4E@exchange2003.tasconline.com>

RedHat Performance FAQ
http://kbase.redhat.com/faq/FAQ_78_3152.shtm

GFS Performance Tuning
http://sourceware.org/cluster/faq.html#gfs_tuning

Mount with noatime
http://man.chinaunix.net/linux/redhat/rh-gfs-en-6.0/s1-manage-atimeconf.html

Turn off disk quotas
http://www.redhat.com/archives/linux-cluster/2006-August/msg00237.html



-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alexandre Racine
Sent: Friday, August 24, 2007 9:24 AM
To: linux clustering
Subject: RE: [Linux-cluster] dissapointed with GFS performance



You'll never have the same performance with remote data and local data.
You can test those with real numbers if you want. You can try those two hard drive benchmark programs : bonnie++ and iozone.

I am actually working on improving performance right now.

Where to begin:
-Do you have a 1000gbs link?
-Do you use jumbo frames?

You also have to know what your software is actually doing. Reading 1 text file or 2000? You did not post your gfs version, but I know that there was a patch lately for directory listing that was sooo slow.




Alexandre Racine
Projets sp?ciaux
514-461-1300 poste 3304
alexandre.racine at mhicc.org



-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of Brad Filipek
Sent: Fri 2007-08-24 10:14
To: linux clustering
Subject: [Linux-cluster] dissapointed with GFS performance
 
Is there any way to "tune" gfs so that it will run faster? I have a GFS
partition with data on it which our internal order system accesses. It
is a text based (command line) driven application which is installed on
the local disks, and accesses the data through the /data directory which
is mounted on the GFS partition. When I run the program, I can notice
latency between menu changes. To confirm that it is GFS causing the
latency, I created an ext3 partition on my SAN, unmounted /data from the
GFS partition, then mounted /data to the newly created EXT3 partition. I
placed the exact same data on this EXT3 partition, fired up my
application, and it whips through menu changes instantly. This tells me
that the SAN is fast, but GFS is not. 

 

What can I do to speed up GFS, or is this just the way it is? 

 

Thanks, 

Brad Filipek



 


Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.





From breeves at redhat.com  Fri Aug 24 14:52:44 2007
From: breeves at redhat.com (Bryn M. Reeves)
Date: Fri, 24 Aug 2007 15:52:44 +0100
Subject: [Linux-cluster] dissapointed with GFS performance
In-Reply-To: <9C01E18EF3BC2448A3B1A4812EB87D0278CE8E@SRVEDI.upark.crscold.com>
References: <9C01E18EF3BC2448A3B1A4812EB87D0278CE8E@SRVEDI.upark.crscold.com>
Message-ID: <46CEF0BC.4020207@redhat.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Brad Filipek wrote:
> Is there any way to ?tune? gfs so that it will run faster? I have a GFS
> partition with data on it which our internal order system accesses. It
> is a text based (command line) driven application which is installed on
> the local disks, and accesses the data through the /data directory which
> is mounted on the GFS partition. When I run the program, I can notice
> latency between menu changes. To confirm that it is GFS causing the
> latency, I created an ext3 partition on my SAN, unmounted /data from the
> GFS partition, then mounted /data to the newly created EXT3 partition. I
> placed the exact same data on this EXT3 partition, fired up my
> application, and it whips through menu changes instantly. This tells me
> that the SAN is fast, but GFS is not.
> 
> What can I do to speed up GFS, or is this just the way it is?

There are things you can do to tune GFS, most of them referenced in the
FAQ or in posts to this lists. It's might be that you'll get bigger
improvements by tuning the app to work better with GFS though, i.e. to
avoid operations that are known to be expensive over GFS.

Can you give us some idea of how the application accesses & uses the
data in the GFS file system?

Regards,
Bryn.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGzvC86YSQoMYUY94RAqC5AJ9rSEs6n7dVXCYXLNmwqdR77Gaa0QCfb2iv
GkdRTpE/C4eBcFrAz7hIReI=
=fV1R
-----END PGP SIGNATURE-----



From nlam87346 at library.usyd.edu.au  Fri Aug 24 15:44:01 2007
From: nlam87346 at library.usyd.edu.au (Nikolas Lam)
Date: Sat, 25 Aug 2007 01:44:01 +1000
Subject: [Linux-cluster] Re: Nodes show up extremely slowly (or not at all)
In-Reply-To: <Pine.GSO.4.64.0708241600340.8481@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708241023170.8481@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708241439110.8481@sunserv.kfki.hu>
	<1187962600.3353.37.camel@lits17.library.usyd.edu.au>
	<Pine.GSO.4.64.0708241600340.8481@sunserv.kfki.hu>
Message-ID: <1187970241.3353.42.camel@lits17.library.usyd.edu.au>

On Fri, 2007-08-24 at 16:03 +0200, Kadlecsik Jozsi wrote:
> Hi,
> 
> On Fri, 24 Aug 2007, Nikolas Lam wrote:
> 
> > > It turned out to be a switch-related multicast problem: openais used the 
> > > default(?) 239.192.6.53 address, while the switch strangely enough passes 
> > > the 224.0.0.x addresses only. So by setting the multicast address in
> > > cluster.conf explicitly to 224.0.0.3 cured the problem.
> > 
> > I think I might be having similar problems. However I seem to be unable
> > to specify a multicast address manually using conga / luci (the web
> > based tool). 
> [..]
> > Could I please have a look at your cluster.conf?
> 
> Sure, I attached ours. Please note, you must specify the multicast address 
> for cman and every clusternode too and the address must be the same 
> everywhere. (See 
> http://sources.redhat.com/cluster/doc/cluster_schema_rhel5.html.)


Excellent - thanks! Got the cluster's own communications to happen on my
specified addresses, but I can't seem to get the fence_xvmd to use a
specified address. Does anyone have an example cluster.conf for a dom0
cluster where they've specified the multicast IP address for the virt
guest daemon (fence_xvmd) ?

Nik



From JESmall at directv.com  Fri Aug 24 16:05:29 2007
From: JESmall at directv.com (Small, Jon E)
Date: Fri, 24 Aug 2007 09:05:29 -0700
Subject: [Linux-cluster] Problem with latest fence_apc
In-Reply-To: <40CEA9CB6E0E8444B59DD4F4E848F71061B5E8@CORVUS.LA.FRD.DIRECTV.COM>
References: <40CEA9CB6E0E8444B59DD4F4E848F71061B5E8@CORVUS.LA.FRD.DIRECTV.COM>
Message-ID: <40CEA9CB6E0E8444B59DD4F4E848F71061B79C@CORVUS.LA.FRD.DIRECTV.COM>

I got the fence_apc_snmp to work with the following modification to the
Python script.

 

c151 152

           self.state_on  = 'outletStatusOn'

           self.state_off = 'outletStatusOff'

.

c184 185

           self.state_on  = 'outletStatusOn'

           self.state_off = 'outletStatusOff'

.

 

One other important environment element is that I am using v3.8.8 of the
powernet388.mib file.

 

 

  _____  

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Small, Jon E
Sent: Thursday, August 23, 2007 7:43 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Problem with latest fence_apc

 

I installed the latest RPM (fence-1.32.45-1.0.2.i686.rpm) that includes
the fence_apc fix for APC firmware 3.x and it still does not work in our
environment. The Status action works, but Off & On do not work.

 

Here is the environment information:

-          RedHat 4 Update #5

-          APC model #AP7968

-          APC Firmware 3.3.3

 

Attached is the copy of /tmp/apclog from an execution of an Off action:

./fence_apc -a 10.13.2.146 -l apc -o Off -n 3 -p apc -v

 

 

Incidentally, I attempted to use the fence_apc_snmp as well, but it
fails with the following statement:

 

./fence_apc_snmp -a 10.13.2.146 -c private -n 3 -o Status -v

called with {'udpport': '161', 'option': 'status', 'ipaddr':
'10.13.2.146', 'community': 'private', 'switch': '', 'port': '3'}

invalid status outletStatusOff

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070824/145c23d9/attachment.htm>

From sdake at redhat.com  Fri Aug 24 16:17:11 2007
From: sdake at redhat.com (Steven Dake)
Date: Fri, 24 Aug 2007 09:17:11 -0700
Subject: [Linux-cluster] people having trouble with the default multicast
	address 239.192.x.x
Message-ID: <1187972233.3559.4.camel@balance>

If you are having problems with the default multicast address
239.192.x.x it is probably because your switch is not sending those
packets because of a configuration issue.

If you have this problem please contact me.  I would like to make a
cookbook of switch manufacturer/models and possible solutions.

Using 224.0.0.x is not recommended as these multicast packets are
broadcoast over the network.

Regards
-steve



From bfilipek at crscold.com  Fri Aug 24 19:58:42 2007
From: bfilipek at crscold.com (Brad Filipek)
Date: Fri, 24 Aug 2007 14:58:42 -0500
Subject: [Linux-cluster] Problem with latest fence_apc
References: <40CEA9CB6E0E8444B59DD4F4E848F71061B5E8@CORVUS.LA.FRD.DIRECTV.COM>
Message-ID: <9C01E18EF3BC2448A3B1A4812EB87D0278CFD7@SRVEDI.upark.crscold.com>

Try removing the "switch=" variable from the cluster.conf file.

 

$cat /etc/cluster/cluster.conf | grep switch $emacs
/etc/cluster/cluster.conf

 

#remove the switch variable

<device name="APC" port="x" switch="0"/> 

 

change to 

 

<device name="APC" port="x" />

 

A RedHat support technician suggested this to me to resolve a fence_apc
issue and it worked. 

 

Brad Filipek

 

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Small, Jon E
Sent: Thursday, August 23, 2007 9:43 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Problem with latest fence_apc

 

I installed the latest RPM (fence-1.32.45-1.0.2.i686.rpm) that includes
the fence_apc fix for APC firmware 3.x and it still does not work in our
environment. The Status action works, but Off & On do not work.

 

Here is the environment information:

-          RedHat 4 Update #5

-          APC model #AP7968

-          APC Firmware 3.3.3

 

Attached is the copy of /tmp/apclog from an execution of an Off action:

./fence_apc -a 10.13.2.146 -l apc -o Off -n 3 -p apc -v

 

 

Incidentally, I attempted to use the fence_apc_snmp as well, but it
fails with the following statement:

 

./fence_apc_snmp -a 10.13.2.146 -c private -n 3 -o Status -v

called with {'udpport': '161', 'option': 'status', 'ipaddr':
'10.13.2.146', 'community': 'private', 'switch': '', 'port': '3'}

invalid status outletStatusOff


Confidentiality Notice: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient or the employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. 

If you have received this communication in error, please notify us immediately by email reply or by telephone and immediately delete this message and any attachments.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070824/632fb830/attachment.htm>

From Nick.Couchman at seakr.com  Fri Aug 24 22:12:28 2007
From: Nick.Couchman at seakr.com (Nick Couchman)
Date: Fri, 24 Aug 2007 16:12:28 -0600
Subject: [Linux-cluster] Service Failover/Fencing Problems
Message-ID: <46CF036B.87A6.0099.1@seakr.com>

Hi, all,
I'm attempting to set up a clustered file system using the cluster suite, GFS, Samba, and NFS.  I'm having a couple of issues with the configuration.
 
Right now I have two virtual machines in the cluster.  They're sharing some iSCSI storage and are configured in a two-node cluster.  I used Conga to do the configuration.  I have a fairly simple setup so far: two nodes, two "virtual IP" resources, two Samba resources, and a single file share (for now).  Since I'm in VMware VMs, I don't really have any fencing devices right now - maybe that's part of the problem.  The physical configuration that this is going in will have two identical Dell PE2650s with DRACs, so I can use the DRAC controllers for fencing.  I have two failover domains set up - each node is the primary (priority = 1) machine in one of the domains and the backup in the other domain.

Back to the problem, though...I'm attempting to test scenarios where ones of the servers becomes unavailable.  So far, my tests have caused the cluster to behave erratically and not succeed in moving services to any host.  Here's what I've tried for my tests, so far:
1) Disconnecting the virtual network interface on one of the VMs that's part of the cluster.  This caused the other machine to recognize that the machine was no longer in the cluster, but the other machine never did anything about restarting the service - it just left it as "dead."  I had to manually start the service, even though I have it configured to auto-start and I have the policy set to "relocate."
2) Shutting down one of the VMs (forcibly - hard shutdown - like a power supply failure).  In this instance, I started getting messages on the second node that it was attempting to fence the first node and kept failing the fencing.  Conga showed the node as "unknown" but still showed the original service running on the first node, though, and when I attempted to migrate the service manually, Conga hung up trying to migrate it (probably waiting on the fencing to occur).

So those the problems I'm currently encountering.  I'd really like to get this setup figured out so that I can get a cluster that fails IPs and Samba/NFS services successfully between nodes.  Do I have to have a fencing device in order for this to work correctly, or is there some other way to configure the cluster to tell it to just fail the services if it detects a problem?  I know that could cause some bad scenarios, i.e. if you lose connectivity between the servers but the servers still have connectivity to other networks/clients, but for now I just need a simple setup that will allow me to have a node fail and everything migrate successfully.

Thanks,
Nick




From bsheets at singlefin.net  Sat Aug 25 01:00:35 2007
From: bsheets at singlefin.net (Brian Sheets)
Date: Sat, 25 Aug 2007 01:00:35 +0000 (UTC)
Subject: [Linux-cluster] fence_apc_snmp woes
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F4B@exchange2003.tasconline.com>
Message-ID: <786463971.91731188003635915.JavaMail.root@v-mailhost2.mxpath.net>

Yes, not only does it work manually, when I run fence_ack_manual, that causes the fence_apc_snmp to work.

So, 

System failure
fenced says its trying to fence
then the fenced says its waiting for ack, which never comes and the system doesn't get powered down
then I run fence_ack_manual, and then the system powers down.




----- Original Message -----
From: "Jeremy Carroll" <Jeremyc at tasconline.com>
To: "linux clustering" <linux-cluster at redhat.com>
Sent: Friday, August 24, 2007 6:35:55 AM (GMT-0800) America/Los_Angeles
Subject: RE: [Linux-cluster] fence_apc_snmp woes

Try to run the fence agent manually. See if it works. Most likely the
problem is fenced cannot successfully run the program to power down the
switch. Post your results.

Ex:
fence_apc_snmp -v -a oc-index3 -c <passwd> -n 18 -o off 


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Brian Sheets
Sent: Friday, August 24, 2007 6:06 AM
To: linux clustering
Subject: [Linux-cluster] fence_apc_snmp woes

I have a 2 node cluster on debian. Below is my cluster.conf. If I down
node1's nic 
node2 sees and tries to fence


Aug 24 10:57:38 oc-index4 fenced[7599]: oc-index3 not a cluster member
after 0 sec post_fail_delay
Aug 24 10:57:38 oc-index4 fenced[7599]: fencing node "oc-index3"
Aug 24 10:57:38 oc-index4 fence_manual: Node 172.16.14.100 needs to be
reset before recovery can procede.  Waiting for 172.16.14.100 to rejoin
the cluster or for manual acknowledgement that it has been reset (i.e.
fence_ack_manual -n 172.16.14.100)
Aug 24 10:59:34 oc-index4 fenced[7599]: fence "oc-index3" success

It states that it's fencing, but never does, and if I do a
fence_ack_manual, then fence_apc_snmp gets run and the node1 gets
powered down.

what am I missing?

<?xml version="1.0"?>
<cluster name="index" config_version="2">
<cman two_node="1" expected_votes="1">
</cman>
<clusternodes>
<clusternode name="oc-index3" votes="1">
        <fence>
                <method name="single">
                       <device name="oc-cab1-pdu2" port="18"
option="off"/>
                </method>
        </fence>
</clusternode>

<clusternode name="oc-index4" votes="1">
        <fence>
                <method name="single">
                  <device name="oc-cab1-pdu1" port="16" option="off"/>
                  </method>
        </fence>
</clusternode>

</clusternodes>
<fencedevices>
        <fencedevice name="oc-cab1-pdu2" agent="fence_apc_snmp"
ipaddr="172.16.14.9" login="apc" passwd="xxxx"/>
        <fencedevice name="oc-cab1-pdu1" agent="fence_apc_snmp"
ipaddr="172.16.14.8" login="apc" passwd="xxxx"/>
</fencedevices>
</cluster>

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From zepp157 at yahoo.com  Sat Aug 25 06:05:08 2007
From: zepp157 at yahoo.com (Jon S)
Date: Fri, 24 Aug 2007 23:05:08 -0700 (PDT)
Subject: [Linux-cluster] init scripts and the scope of a service
Message-ID: <978838.23700.qm@web63502.mail.re1.yahoo.com>

Given the Cluster Suite's preference to disable, restart or relocate a service after an init script resource returns non-zero, I'm curious how best to handle a significant number of applications tied to a single IP resource.  The easy thing to do is to just define a bunch of init scripts in a single service with the IP resource.  Problem is, of course, that if one script returns non-zero, all resources in the service are stopped and restarted.

I can see several ways to script my way past the this issue.  Before I do, I was wondering how others handle this situation.

Reminds me of the classic Spock line, "The needs of the many..."



      ____________________________________________________________________________________
Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center.
http://autos.yahoo.com/green_center/ 



From jos at xos.nl  Sat Aug 25 07:18:02 2007
From: jos at xos.nl (Jos Vos)
Date: Sat, 25 Aug 2007 09:18:02 +0200
Subject: [Linux-cluster] init scripts and the scope of a service
In-Reply-To: <978838.23700.qm@web63502.mail.re1.yahoo.com>;
	from zepp157@yahoo.com on Fri, Aug 24, 2007 at 11:05:08PM -0700
References: <978838.23700.qm@web63502.mail.re1.yahoo.com>
Message-ID: <20070825091802.A15362@xos037.xos.nl>

On Fri, Aug 24, 2007 at 11:05:08PM -0700, Jon S wrote:

> Given the Cluster Suite's preference to disable, restart or relocate a
> service after an init script resource returns non-zero, I'm curious how
> best to handle a significant number of applications tied to a single
> IP resource.  The easy thing to do is to just define a bunch of init
> scripts in a single service with the IP resource.  Problem is, of course,
> that if one script returns non-zero, all resources in the service are
> stopped and restarted.

I never tried defining multiple services in one script, I assume it
works as you say.

But I prefer another solution: create one service script and in that
script call the various init scripts (or start the applications
directly, all dependent on the type of application).  Still, if your
script is asked to "stop" the service, you should stop all apps, as
you don't know the reason.

The only way to circumvent this is to make the "status" of this
all-in-one service script more intelligent: if one application has
a problem, it could try to restart that specific application and, if
that succeeds, just return 0 as status for the service as if nothing
has happened.

The other solution is to use different services, but that requires
different IP addresses.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From Jeremyc at tasconline.com  Sat Aug 25 18:08:37 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Sat, 25 Aug 2007 13:08:37 -0500
Subject: [Linux-cluster] init scripts and the scope of a service
In-Reply-To: <20070825091802.A15362@xos037.xos.nl>
References: <978838.23700.qm@web63502.mail.re1.yahoo.com>
	<20070825091802.A15362@xos037.xos.nl>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F5A@exchange2003.tasconline.com>

If I had two High Availability services on the same IP it would not be
good. If one of the services bound to the same IP would fail the IP is
stripped away when it is relocated to a different machine. Causing the
second service to fail. I believe this is the situation your are
currently in.

Use one IP per service. If you cannot do this for architectural reasons
use Piranha (LVS) to front-end load balance services on your back end
cluster. That is what it is there for.

http://www.redhat.com/support/wpapers/piranha/x32.html


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jos Vos
Sent: Saturday, August 25, 2007 2:18 AM
To: linux clustering
Subject: Re: [Linux-cluster] init scripts and the scope of a service

On Fri, Aug 24, 2007 at 11:05:08PM -0700, Jon S wrote:

> Given the Cluster Suite's preference to disable, restart or relocate a
> service after an init script resource returns non-zero, I'm curious
how
> best to handle a significant number of applications tied to a single
> IP resource.  The easy thing to do is to just define a bunch of init
> scripts in a single service with the IP resource.  Problem is, of
course,
> that if one script returns non-zero, all resources in the service are
> stopped and restarted.

I never tried defining multiple services in one script, I assume it
works as you say.

But I prefer another solution: create one service script and in that
script call the various init scripts (or start the applications
directly, all dependent on the type of application).  Still, if your
script is asked to "stop" the service, you should stop all apps, as
you don't know the reason.

The only way to circumvent this is to make the "status" of this
all-in-one service script more intelligent: if one application has
a problem, it could try to restart that specific application and, if
that succeeds, just return 0 as status for the service as if nothing
has happened.

The other solution is to use different services, but that requires
different IP addresses.

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From nlam87346 at library.usyd.edu.au  Sun Aug 26 04:01:33 2007
From: nlam87346 at library.usyd.edu.au (Nikolas Lam)
Date: Sun, 26 Aug 2007 14:01:33 +1000
Subject: [Linux-cluster] people having trouble with the default
	multicast address 239.192.x.x
In-Reply-To: <1187972233.3559.4.camel@balance>
References: <1187972233.3559.4.camel@balance>
Message-ID: <1188100893.3759.57.camel@lits19.library.usyd.edu.au>

On Fri, 2007-08-24 at 09:17 -0700, Steven Dake wrote:
> If you are having problems with the default multicast address
> 239.192.x.x it is probably because your switch is not sending those
> packets because of a configuration issue.
> 
> If you have this problem please contact me.  I would like to make a
> cookbook of switch manufacturer/models and possible solutions.
> 
> Using 224.0.0.x is not recommended as these multicast packets are
> broadcoast over the network.
> 
> Regards
> -steve

Hi Steve,

I'm not sure if the problems I'm having are caused by switch
configuration or compatibility - would the following be a good way to
determine if the multicast communications are working?

1) Check what multicast group (IP) the nodes have chosen using netstat
-gn ?

e.g. for host0, I'm guessing that cman is using 239.192.205.2 (as I
think 225.0.0.12 is for the fence_xvmd Xen virtualised host group and I
can't see any activity on 224.0.0.251).

[root at host0 ~]# netstat -g
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      all-systems.mcast.net
eth0            1      224.0.0.251
eth0            1      225.0.0.12
eth0            1      239.192.205.2
eth0            1      all-systems.mcast.net
[root at host0 ~]# 



2) Use tcpdump to check that this node is receiving transmissions from
*other* nodes to the multicast group.

e.g. 

[root at host0 ~]# tcpdump -i any -nn host 239.192.205.2 and ip multicast
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol
decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 96
bytes
13:42:03.674676 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
118
13:42:03.767820 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
74
13:42:03.962382 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
74
13:42:04.162652 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
118
13:42:04.267768 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
74
13:42:04.450195 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
74
13:42:04.650646 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
118
13:42:04.766404 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
74
13:42:04.940367 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
74

9 packets captured
19 packets received by filter
0 packets dropped by kernel
[root at host0 ~]# 

In the above example I can only see transmissions to 239.192.205.2 from
172.31.99.50 (hosts0's IP address). If things were working properly, I'd
be able to see transmissions from other source addresses wouldn't I?

E.g. I should have at least some entries like this

13:42:04.940367 IP 172.31.99.100.5149 > 239.192.205.2.5405: UDP, length
74

where the source address is a different node, right?


Regards,

Nik Lam














From basv at sara.nl  Mon Aug 27 08:29:26 2007
From: basv at sara.nl (Bas van der Vlies)
Date: Mon, 27 Aug 2007 10:29:26 +0200
Subject: [Linux-cluster] NFS and gfs problems
In-Reply-To: <46CC241F.8000101@sara.nl>
References: <46CC007C.1080403@sara.nl> <46CC241F.8000101@sara.nl>
Message-ID: <46D28B66.6020509@sara.nl>

Are we the only site that tried to run cluster version 1.0.4 in combo with 
  NFS?



Bas van der Vlies wrote:
> We just downgraded to versions 1.0.3 with some cvs updates kernel 2.6.17.4
> and nfs-kernel-server 1.0.6.
> 
> Now we can run vim/tar/bonnie++ and other unix utilities without any problem.
> 
> With this version we experienced problems when the gfs filesystems becomimg
> full the nfsd daemons will fail and to the load of the machine becomes
> equal to number of nfsd daemoms and the system is unresponsive.
> 
> Regards
> 
> Bas van der Vlies wrote:
>> gfs version 1.0.4
>> kernel 2.6.20.16
>> nfs-common/etch uptodate 1:1.0.10-6+etch.1
>> nfs-kernel-server/etch uptodate 1:1.0.10-6+etch.1
>>
>> We have a five node fileserver cluster. The GFS file system works
>> perfectly, but on our clients we have major problems. See examples below
>>
>> Do other people see this problems and is there a solution for this problem?
>>
>>
>> bas at gb2-r39n16:~$ vim j
>> E72: Close error on swap filebas at gb2-r39n16:~$
>>
>> bas at gb2-r39n16:~$ tar cvf bin.tar bin
>> bin/
>> bin/mpicc-wrapper-data.txt
>> bin/mpiexec
>> bin/setmvapich
>> bin/modenv
>> bin/jobnodes
>> bin/SMclient
>> bin/ssh
>> tar: bin.tar: Warning: Cannot close: Invalid argument
>> bas at gb2-r39n16:~$
>>
>>
>>
>> bas at gb2-r39n16:~$ /usr/sbin/bonnie++
>> Writing with putc()...done
>> Writing intelligently...done
>> Rewriting...Can't read a full block, only got 3944 bytes.
>> Can't read a full block, only got 8040 bytes.
>> Can't read a full block, only got 3944 bytes.
>> Can't read a full block, only got 8040 bytes.
>> Can't read a full block, only got 3944 bytes.
>> Can't read a full block, only got 3944 bytes.
>> Can't read a full block, only got 3944 bytes.
>> Can't read a full block, only got 8040 bytes.
>> Bad seek offset
>> Error in seek(0)
>>

-- 
********************************************************************
*                                                                  *
*  Bas van der Vlies                     e-mail: basv at sara.nl      *
*  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
*  Kruislaan 415                         fax:    +31 20 6683167    *
*  1098 SJ Amsterdam                                               *
*                                                                  *
********************************************************************



From stanojr at blackhole.sk  Mon Aug 27 14:17:31 2007
From: stanojr at blackhole.sk (Pavel Stano)
Date: Mon, 27 Aug 2007 16:17:31 +0200
Subject: [Linux-cluster] NFS and gfs problems
In-Reply-To: <46D28B66.6020509@sara.nl>
References: <46CC007C.1080403@sara.nl> <46CC241F.8000101@sara.nl>
	<46D28B66.6020509@sara.nl>
Message-ID: <46D2DCFB.2090301@blackhole.sk>

i have same problem, kernel 2.6.21.6-grsec x86_64 with cluster 1.0.4 
from source = cannot write file over nfs wich is bigger than cca 2kb
write() return EINVAL or something like that
i dont try 1.0.3

but i am droping gfs at all from our servers and i am going use xfs on 
one machine + nfs to all other, because gfs and all that cluster stuff 
is not very stable

Bas van der Vlies wrote:
> Are we the only site that tried to run cluster version 1.0.4 in combo 
> with  NFS?
> 
> 
> 
> Bas van der Vlies wrote:
>> We just downgraded to versions 1.0.3 with some cvs updates kernel 
>> 2.6.17.4
>> and nfs-kernel-server 1.0.6.
>>
>> Now we can run vim/tar/bonnie++ and other unix utilities without any 
>> problem.
>>
>> With this version we experienced problems when the gfs filesystems 
>> becomimg
>> full the nfsd daemons will fail and to the load of the machine becomes
>> equal to number of nfsd daemoms and the system is unresponsive.
>>
>> Regards
>>
>> Bas van der Vlies wrote:
>>> gfs version 1.0.4
>>> kernel 2.6.20.16
>>> nfs-common/etch uptodate 1:1.0.10-6+etch.1
>>> nfs-kernel-server/etch uptodate 1:1.0.10-6+etch.1
>>>
>>> We have a five node fileserver cluster. The GFS file system works
>>> perfectly, but on our clients we have major problems. See examples below
>>>
>>> Do other people see this problems and is there a solution for this 
>>> problem?
>>>
>>>
>>> bas at gb2-r39n16:~$ vim j
>>> E72: Close error on swap filebas at gb2-r39n16:~$
>>>
>>> bas at gb2-r39n16:~$ tar cvf bin.tar bin
>>> bin/
>>> bin/mpicc-wrapper-data.txt
>>> bin/mpiexec
>>> bin/setmvapich
>>> bin/modenv
>>> bin/jobnodes
>>> bin/SMclient
>>> bin/ssh
>>> tar: bin.tar: Warning: Cannot close: Invalid argument
>>> bas at gb2-r39n16:~$
>>>
>>>
>>>
>>> bas at gb2-r39n16:~$ /usr/sbin/bonnie++
>>> Writing with putc()...done
>>> Writing intelligently...done
>>> Rewriting...Can't read a full block, only got 3944 bytes.
>>> Can't read a full block, only got 8040 bytes.
>>> Can't read a full block, only got 3944 bytes.
>>> Can't read a full block, only got 8040 bytes.
>>> Can't read a full block, only got 3944 bytes.
>>> Can't read a full block, only got 3944 bytes.
>>> Can't read a full block, only got 3944 bytes.
>>> Can't read a full block, only got 8040 bytes.
>>> Bad seek offset
>>> Error in seek(0)
>>>
> 


-- 
Pavel Stano



From basv at sara.nl  Mon Aug 27 14:24:33 2007
From: basv at sara.nl (Bas van der Vlies)
Date: Mon, 27 Aug 2007 16:24:33 +0200
Subject: [Linux-cluster] NFS and gfs problems
In-Reply-To: <46D2DCFB.2090301@blackhole.sk>
References: <46CC007C.1080403@sara.nl>
	<46CC241F.8000101@sara.nl>	<46D28B66.6020509@sara.nl>
	<46D2DCFB.2090301@blackhole.sk>
Message-ID: <46D2DEA1.3030605@sara.nl>

Pavel Stano wrote:
> i have same problem, kernel 2.6.21.6-grsec x86_64 with cluster 1.0.4
> from source = cannot write file over nfs wich is bigger than cca 2kb
> write() return EINVAL or something like that
> i dont try 1.0.3
> 
Pavel,

  Thanks for the info.  Our 1.0.3 version with NFS is very stable. Only the 
last time our filesystems are becoming full and we encounter sky high load 
and hanging NFS-servers. We are now back at this level and done a clean up 
and now everything works as expected again.

I do not know if there is stilol development for the gfs1 STABLE release!

> but i am droping gfs at all from our servers and i am going use xfs on
> one machine + nfs to all other, because gfs and all that cluster stuff
> is not very stable
> 
It is some work, but it is more scalable then one server with serveral 
clients. On our site we have 5 servers with 700 clients.

> Bas van der Vlies wrote:
>> Are we the only site that tried to run cluster version 1.0.4 in combo
>> with  NFS?
>>
>>
>>
>> Bas van der Vlies wrote:
>>> We just downgraded to versions 1.0.3 with some cvs updates kernel
>>> 2.6.17.4
>>> and nfs-kernel-server 1.0.6.
>>>
>>> Now we can run vim/tar/bonnie++ and other unix utilities without any
>>> problem.
>>>
>>> With this version we experienced problems when the gfs filesystems
>>> becomimg
>>> full the nfsd daemons will fail and to the load of the machine becomes
>>> equal to number of nfsd daemoms and the system is unresponsive.
>>>
>>> Regards
>>>
>>> Bas van der Vlies wrote:
>>>> gfs version 1.0.4
>>>> kernel 2.6.20.16
>>>> nfs-common/etch uptodate 1:1.0.10-6+etch.1
>>>> nfs-kernel-server/etch uptodate 1:1.0.10-6+etch.1
>>>>
>>>> We have a five node fileserver cluster. The GFS file system works
>>>> perfectly, but on our clients we have major problems. See examples below
>>>>
>>>> Do other people see this problems and is there a solution for this
>>>> problem?
>>>>
>>>>
>>>> bas at gb2-r39n16:~$ vim j
>>>> E72: Close error on swap filebas at gb2-r39n16:~$
>>>>
>>>> bas at gb2-r39n16:~$ tar cvf bin.tar bin
>>>> bin/
>>>> bin/mpicc-wrapper-data.txt
>>>> bin/mpiexec
>>>> bin/setmvapich
>>>> bin/modenv
>>>> bin/jobnodes
>>>> bin/SMclient
>>>> bin/ssh
>>>> tar: bin.tar: Warning: Cannot close: Invalid argument
>>>> bas at gb2-r39n16:~$
>>>>
>>>>
>>>>
>>>> bas at gb2-r39n16:~$ /usr/sbin/bonnie++
>>>> Writing with putc()...done
>>>> Writing intelligently...done
>>>> Rewriting...Can't read a full block, only got 3944 bytes.
>>>> Can't read a full block, only got 8040 bytes.
>>>> Can't read a full block, only got 3944 bytes.
>>>> Can't read a full block, only got 8040 bytes.
>>>> Can't read a full block, only got 3944 bytes.
>>>> Can't read a full block, only got 3944 bytes.
>>>> Can't read a full block, only got 3944 bytes.
>>>> Can't read a full block, only got 8040 bytes.
>>>> Bad seek offset
>>>> Error in seek(0)
>>>>
> 
> 
> --
> Pavel Stano
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


-- 
********************************************************************
*                                                                  *
*  Bas van der Vlies                     e-mail: basv at sara.nl      *
*  SARA - Academic Computing Services    phone:  +31 20 592 8012   *
*  Kruislaan 415                         fax:    +31 20 6683167    *
*  1098 SJ Amsterdam                                               *
*                                                                  *
********************************************************************



From sdake at redhat.com  Sun Aug 26 18:06:43 2007
From: sdake at redhat.com (Steven Dake)
Date: Sun, 26 Aug 2007 11:06:43 -0700
Subject: [Linux-cluster] people having trouble with the default
	multicast address 239.192.x.x
In-Reply-To: <1188100893.3759.57.camel@lits19.library.usyd.edu.au>
References: <1187972233.3559.4.camel@balance>
	<1188100893.3759.57.camel@lits19.library.usyd.edu.au>
Message-ID: <1188151603.3592.4.camel@balance>

On Sun, 2007-08-26 at 14:01 +1000, Nikolas Lam wrote:
> On Fri, 2007-08-24 at 09:17 -0700, Steven Dake wrote:
> > If you are having problems with the default multicast address
> > 239.192.x.x it is probably because your switch is not sending those
> > packets because of a configuration issue.
> > 
> > If you have this problem please contact me.  I would like to make a
> > cookbook of switch manufacturer/models and possible solutions.
> > 
> > Using 224.0.0.x is not recommended as these multicast packets are
> > broadcoast over the network.
> > 
> > Regards
> > -steve
> 
> Hi Steve,
> 
> I'm not sure if the problems I'm having are caused by switch
> configuration or compatibility - would the following be a good way to
> determine if the multicast communications are working?
> 
> 1) Check what multicast group (IP) the nodes have chosen using netstat
> -gn ?
> 
> e.g. for host0, I'm guessing that cman is using 239.192.205.2 (as I
> think 225.0.0.12 is for the fence_xvmd Xen virtualised host group and I
> can't see any activity on 224.0.0.251).
> 
> [root at host0 ~]# netstat -g
> IPv6/IPv4 Group Memberships
> Interface       RefCnt Group
> --------------- ------ ---------------------
> lo              1      all-systems.mcast.net
> eth0            1      224.0.0.251
> eth0            1      225.0.0.12
> eth0            1      239.192.205.2
> eth0            1      all-systems.mcast.net
> [root at host0 ~]# 
> 
> 

239.192.205.2 is the multicast address chosen by RHCS.

> 
> 2) Use tcpdump to check that this node is receiving transmissions from
> *other* nodes to the multicast group.
> 

Data will only be sent when a transmission is requested - ie: when using
a checkpoint, event service, or plock operation.

> e.g. 
> 
> [root at host0 ~]# tcpdump -i any -nn host 239.192.205.2 and ip multicast
> tcpdump: WARNING: Promiscuous mode not supported on the "any" device
> tcpdump: verbose output suppressed, use -v or -vv for full protocol
> decode
> listening on any, link-type LINUX_SLL (Linux cooked), capture size 96
> bytes
> 13:42:03.674676 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
> 118
> 13:42:03.767820 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
> 74
> 13:42:03.962382 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
> 74
> 13:42:04.162652 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
> 118
> 13:42:04.267768 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
> 74
> 13:42:04.450195 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
> 74
> 13:42:04.650646 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
> 118
> 13:42:04.766404 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
> 74
> 13:42:04.940367 IP 172.31.99.50.5149 > 239.192.205.2.5405: UDP, length
> 74
> 
> 9 packets captured
> 19 packets received by filter
> 0 packets dropped by kernel
> [root at host0 ~]# 
> 
> In the above example I can only see transmissions to 239.192.205.2 from
> 172.31.99.50 (hosts0's IP address). If things were working properly, I'd
> be able to see transmissions from other source addresses wouldn't I?
> 

This is generally true for this phase of the protocol which is the
membership phase.  It looks to me from this trace like you have a
firewall configured on port 5405 for UDP.  Try turning off your firewall
for this port and see if that fixes the problem.

> E.g. I should have at least some entries like this
> 
> 13:42:04.940367 IP 172.31.99.100.5149 > 239.192.205.2.5405: UDP, length
> 74
> 
> where the source address is a different node, right?
> 

Once the protocol is in the "operational" mode meaning a membership has
been obtained, this is less likely unless there is plock/evt/ckpt
traffic on one of those nodes.  Multicast isn't randomly sent - it is
only sent when data is transmitted.

Pls send a copy of your /var/log/messages and I'll take a look if the
firewall configuration is not your issue.

Regards
-steve

> 
> Regards,
> 
> Nik Lam
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 



From ben.yarwood at juno.co.uk  Tue Aug 28 09:08:33 2007
From: ben.yarwood at juno.co.uk (Ben Yarwood)
Date: Tue, 28 Aug 2007 10:08:33 +0100
Subject: [Linux-cluster] gfs_grow
Message-ID: <089b01c7e953$02bebb00$083c3100$@yarwood@juno.co.uk>

I am using a 3 Node cluster using RHEL4U4.

I ran a gfs_grow yesterday on one of our filesystems but stupidly missed a process that was using the same file system.  The grow
process hung and when I got it to exit, the file system is now reporting as having grown to the larger size but no extra space has
appeared.  Basically my file system grew from 14TB to 15TB and my usage also grew from 13TB to 14TB.

Does anyone know if it's possible to get this space back?  I know I could probably do as gfs_fsck but given the size of the file
system, this would take a few days according to some previous reports.

Thanks
Ben






From sebastian.walter at fu-berlin.de  Tue Aug 28 12:32:29 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 28 Aug 2007 14:32:29 +0200
Subject: [Linux-cluster] Compile Error on RHEL4U5 - kernel 2.6.9-55.0.2.ELsmp
Message-ID: <46D415DD.6060501@fu-berlin.de>

Dear list,

according to the documentation on http://sources.redhat.com/cluster/ and
http://sources.redhat.com/cluster/doc/usage.txt, the source should match
the actual RHEL4 kernel. As of update 5, this is version 2.6.9-55.0.2.

So I did what the notes on this page tell me:

cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout -r RHEL4 cluster

After changing to cluster, ./configure and make, I get the following error:


  CC [M]  /root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.o
/root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.c:27:28:
cluster/cnxman.h: No such file or directory
/root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.c: In function
`get_my_nodeid':
/root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.c:136: error:
dereferencing pointer to incomplete type
/root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.c:139: error:
invalid application of `sizeof' to incomplete type `kcl_cluster_node'
/root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.c:140: warning:
implicit declaration of function `kcl_get_node_by_nodeid'
/root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.c:148: error:
dereferencing pointer to incomplete type
/root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.c: In function
`gfs_setattr':
/root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.c:1589: warning:
ignoring return value of `inode_setattr', declared with attribute
warn_unused_result
make[5]: *** [/root/csgfs-build/cluster/gfs-kernel/src/gfs/ops_inode.o]
Error 1
make[4]: *** [_module_/root/csgfs-build/cluster/gfs-kernel/src/gfs] Error 2
make[4]: Leaving directory `/usr/src/kernels/2.6.9-55.0.2.EL-smp-x86_64'
make[3]: *** [all] Error 2
make[3]: Leaving directory `/root/csgfs-build/cluster/gfs-kernel/src/gfs'
make[2]: *** [install] Error 2
make[2]: Leaving directory `/root/csgfs-build/cluster/gfs-kernel/src'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/root/csgfs-build/cluster/gfs-kernel'
make: *** [all] Error 2
[root at host cluster]#


What did I do wrong? Any help would be greatly appreciated!

Regards,
Sebastian



From nlam87346 at library.usyd.edu.au  Tue Aug 28 14:17:18 2007
From: nlam87346 at library.usyd.edu.au (Nikolas Lam)
Date: Wed, 29 Aug 2007 00:17:18 +1000
Subject: [Linux-cluster] people having trouble with the default
	multicast address 239.192.x.x
In-Reply-To: <1188151603.3592.4.camel@balance>
References: <1187972233.3559.4.camel@balance>
	<1188100893.3759.57.camel@lits19.library.usyd.edu.au>
	<1188151603.3592.4.camel@balance>
Message-ID: <1188310638.3551.55.camel@lits19.library.usyd.edu.au>

Hi Steve,

Thanks for offering to look. Here's the setup:

VLAN ID 99 has 172.31.99.0/24 routed into it
VLAN ID 100 has 172.31.100.0/24 routed into it

I've got two physical nodes:

aphost0 (172.31.99.50 on its eth0 in VLAN ID 99)
aphost1 (172.31.100.50 on its eth0 in VLAN ID 100)

Each of these servers are dom0 hosts for Xen domU guests using the
dom0's bridged eth1. The domU cluster is having the same problem as the
dom0 cluster, but let's just ignore that for the time being.

Here's what I'm certain about as the network equipment. Both hosts are
patched into the same catalyst 2960 G switch. aphost0 into a port
assigned to VLAN ID 99 only and aphost1 into a port assigned to VLAN ID
100 only.

>From here, I'm less certain, but I think the following is true: The cat
2960 is linked using Rapid Spanning Tree Protocol to two layer 3
switches - a Catalyst 6506 and a Catalyst 3750G-12S. I'm also quite
certain that access control lists are applied on these switches, however
I had one of the network admins do a quick check on the routers and he
said IGMP should be working and the multicast IP range that I've been
assigned internally (239.224.72.0/24) has not been used within the
university for a long time.

Indeed there are some video streams running around the campus using
neighbouring ranges of multicast IPs and they seem to work well enough
on the same network infrastructure.

I've tried disabling iptables on the hosts and still see the same
results, so it's not firewalls.

Rebooting the nodes is very slow, because they sit for four or five
minutes waiting at "Starting fencing ..."

I'm going to send you the messages file off-list, it's just a bit too
big, especially with all the spurious fencing and multipath errors that
are going into it.

Regards,

Nik Lam












From rpeterso at redhat.com  Tue Aug 28 14:30:32 2007
From: rpeterso at redhat.com (Bob Peterson)
Date: Tue, 28 Aug 2007 09:30:32 -0500
Subject: [Linux-cluster] gfs_grow
In-Reply-To: <089b01c7e953$02bebb00$083c3100$@yarwood@juno.co.uk>
References: <089b01c7e953$02bebb00$083c3100$@yarwood@juno.co.uk>
Message-ID: <1188311433.16661.17.camel@technetium.msp.redhat.com>

On Tue, 2007-08-28 at 10:08 +0100, Ben Yarwood wrote:
> I am using a 3 Node cluster using RHEL4U4.
> 
> I ran a gfs_grow yesterday on one of our filesystems but stupidly missed a process that was using the same file system.  The grow
> process hung and when I got it to exit, the file system is now reporting as having grown to the larger size but no extra space has
> appeared.  Basically my file system grew from 14TB to 15TB and my usage also grew from 13TB to 14TB.
> 
> Does anyone know if it's possible to get this space back?  I know I could probably do as gfs_fsck but given the size of the file
> system, this would take a few days according to some previous reports.
> 
> Thanks
> Ben

Hi Ben,

The fact that there was a process using the file system shouldn't have
been a problem and gfs_grow should have been able to work around it.
It would have been interesting to see where gfs_grow was "hung" but it's
too late for that now.  My guess is that you killed gfs_grow before it
was able to update the resource group index properly.

In RHEL4U4 there is a feature to gfs_fsck to change and repair damaged
RGs and RG indexes.  Things get tricky for the code once the file system
has been extended though, so although you probably don't want to hear
this, you should probably make a backup of your data first, just to be
safe.

Running gfs_fsck will take a while on a file system that big, but it
depends on the speed of your hardware.  I'd expect it to take less than
a day to complete.  If you can't afford the down time, it might be
helpful to know that the RG repair is done before any of the passes, so
in theory you could probably try to use it to repair the RGs and then
kill the gfs_fsck.  Newer versions of gfs_fsck will catch <ctrl-c>
interrupts and give you options to skip around parts, but I don't think
that's in RHEL4U4 (I think it got into RHEL4.5).

So I guess my recommendation is:

1. Make a backup of your data
2. Wait until most people have gone home for the day
3. Unmount the file system from ALL nodes.
4. Run gfs_fsck.
5. Watch the gfs_fsck output for messages about finding and fixing
   RG damage just so you know it did something.
6. Let gfs_fsck run overnight.
7. If you need the file system back and it's still running by morning,
   you could kill it manually.  It would be better to let it run, but
   it shouldn't do any harm to kill it prematurely if necessary.
8. Remount the file system and see if df shows the correct values.

I hope this helps.

Regards,

Bob Peterson
Red Hat Cluster Suite




From ben.yarwood at juno.co.uk  Tue Aug 28 15:13:53 2007
From: ben.yarwood at juno.co.uk (Ben Yarwood)
Date: Tue, 28 Aug 2007 16:13:53 +0100
Subject: [Linux-cluster] gfs_grow
In-Reply-To: <1188311433.16661.17.camel@technetium.msp.redhat.com>
References: <089b01c7e953$02bebb00$083c3100$@yarwood@juno.co.uk>
	<1188311433.16661.17.camel@technetium.msp.redhat.com>
Message-ID: <08da01c7e986$0ba9df00$22fd9d00$@yarwood@juno.co.uk>



> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob
> Peterson
> Sent: 28 August 2007 15:31
> To: linux clustering
> Subject: Re: [Linux-cluster] gfs_grow
> 
> On Tue, 2007-08-28 at 10:08 +0100, Ben Yarwood wrote:
> > I am using a 3 Node cluster using RHEL4U4.
> >
> > I ran a gfs_grow yesterday on one of our filesystems but stupidly missed a process that was using
> the same file system.  The grow
> > process hung and when I got it to exit, the file system is now reporting as having grown to the
> larger size but no extra space has
> > appeared.  Basically my file system grew from 14TB to 15TB and my usage also grew from 13TB to 14TB.
> >
> > Does anyone know if it's possible to get this space back?  I know I could probably do as gfs_fsck
> but given the size of the file
> > system, this would take a few days according to some previous reports.
> >
> > Thanks
> > Ben
> 
> Hi Ben,
> 
> The fact that there was a process using the file system shouldn't have
> been a problem and gfs_grow should have been able to work around it.
> It would have been interesting to see where gfs_grow was "hung" but it's
> too late for that now.  My guess is that you killed gfs_grow before it
> was able to update the resource group index properly.
> 
> In RHEL4U4 there is a feature to gfs_fsck to change and repair damaged
> RGs and RG indexes.  Things get tricky for the code once the file system
> has been extended though, so although you probably don't want to hear
> this, you should probably make a backup of your data first, just to be
> safe.
> 
> Running gfs_fsck will take a while on a file system that big, but it
> depends on the speed of your hardware.  I'd expect it to take less than
> a day to complete.  If you can't afford the down time, it might be
> helpful to know that the RG repair is done before any of the passes, so
> in theory you could probably try to use it to repair the RGs and then
> kill the gfs_fsck.  Newer versions of gfs_fsck will catch <ctrl-c>
> interrupts and give you options to skip around parts, but I don't think
> that's in RHEL4U4 (I think it got into RHEL4.5).
> 
> So I guess my recommendation is:
> 
> 1. Make a backup of your data
> 2. Wait until most people have gone home for the day
> 3. Unmount the file system from ALL nodes.
> 4. Run gfs_fsck.
> 5. Watch the gfs_fsck output for messages about finding and fixing
>    RG damage just so you know it did something.
> 6. Let gfs_fsck run overnight.
> 7. If you need the file system back and it's still running by morning,
>    you could kill it manually.  It would be better to let it run, but
>    it shouldn't do any harm to kill it prematurely if necessary.
> 8. Remount the file system and see if df shows the correct values.
> 
> I hope this helps.
> 
> Regards,
> 
> Bob Peterson
> Red Hat Cluster Suite
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster






From ben.yarwood at juno.co.uk  Tue Aug 28 15:54:18 2007
From: ben.yarwood at juno.co.uk (Ben Yarwood)
Date: Tue, 28 Aug 2007 16:54:18 +0100
Subject: [Linux-cluster] gfs_grow
In-Reply-To: <1188311433.16661.17.camel@technetium.msp.redhat.com>
References: <089b01c7e953$02bebb00$083c3100$@yarwood@juno.co.uk>
	<1188311433.16661.17.camel@technetium.msp.redhat.com>
Message-ID: <08dd01c7e98b$b1623dc0$1426b940$@yarwood@juno.co.uk>

Sorry for the blank message.

Thanks for the advice, unfortunately the file system is part of a back end for a website which is permanently in use so taking it
offline overnight is not really an option.  We do have a virtually live backup copy so I will get this fully synchronised and then
try the gfs_fsck process you suggested.  

As a quick solution, I think I'll just unmount the file system from all nodes and run gfs_fsck until I see pass one start then kill
it.  Hopefully the problem will have been solved.  If this doesn't work, I'll mount the backup and run the full gfs_fsck.  If that
doesn't work, I'll rebuild the whole file system.  

Regarding gfs_grow, in my experience it often/always fails when the file system which is being grown is in use.  Previously I have
just waited until it fails and then run it again once I have stopped any process that is accessing the particular file system.  I'm
not sure what possessed me to hit <ctrl-c>, not my finest moment!

Regards
Ben


> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob
> Peterson
> Sent: 28 August 2007 15:31
> To: linux clustering
> Subject: Re: [Linux-cluster] gfs_grow
> 
> On Tue, 2007-08-28 at 10:08 +0100, Ben Yarwood wrote:
> > I am using a 3 Node cluster using RHEL4U4.
> >
> > I ran a gfs_grow yesterday on one of our filesystems but stupidly missed a process that was using
> the same file system.  The grow
> > process hung and when I got it to exit, the file system is now reporting as having grown to the
> larger size but no extra space has
> > appeared.  Basically my file system grew from 14TB to 15TB and my usage also grew from 13TB to 14TB.
> >
> > Does anyone know if it's possible to get this space back?  I know I could probably do as gfs_fsck
> but given the size of the file
> > system, this would take a few days according to some previous reports.
> >
> > Thanks
> > Ben
> 
> Hi Ben,
> 
> The fact that there was a process using the file system shouldn't have
> been a problem and gfs_grow should have been able to work around it.
> It would have been interesting to see where gfs_grow was "hung" but it's
> too late for that now.  My guess is that you killed gfs_grow before it
> was able to update the resource group index properly.
> 
> In RHEL4U4 there is a feature to gfs_fsck to change and repair damaged
> RGs and RG indexes.  Things get tricky for the code once the file system
> has been extended though, so although you probably don't want to hear
> this, you should probably make a backup of your data first, just to be
> safe.
> 
> Running gfs_fsck will take a while on a file system that big, but it
> depends on the speed of your hardware.  I'd expect it to take less than
> a day to complete.  If you can't afford the down time, it might be
> helpful to know that the RG repair is done before any of the passes, so
> in theory you could probably try to use it to repair the RGs and then
> kill the gfs_fsck.  Newer versions of gfs_fsck will catch <ctrl-c>
> interrupts and give you options to skip around parts, but I don't think
> that's in RHEL4U4 (I think it got into RHEL4.5).
> 
> So I guess my recommendation is:
> 
> 1. Make a backup of your data
> 2. Wait until most people have gone home for the day
> 3. Unmount the file system from ALL nodes.
> 4. Run gfs_fsck.
> 5. Watch the gfs_fsck output for messages about finding and fixing
>    RG damage just so you know it did something.
> 6. Let gfs_fsck run overnight.
> 7. If you need the file system back and it's still running by morning,
>    you could kill it manually.  It would be better to let it run, but
>    it shouldn't do any harm to kill it prematurely if necessary.
> 8. Remount the file system and see if df shows the correct values.
> 
> I hope this helps.
> 
> Regards,
> 
> Bob Peterson
> Red Hat Cluster Suite
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster






From eric at bootseg.com  Tue Aug 28 16:02:26 2007
From: eric at bootseg.com (Eric Kerin)
Date: Tue, 28 Aug 2007 12:02:26 -0400
Subject: [Linux-cluster] fence_apc_snmp woes
In-Reply-To: <786463971.91731188003635915.JavaMail.root@v-mailhost2.mxpath.net>
References: <786463971.91731188003635915.JavaMail.root@v-mailhost2.mxpath.net>
Message-ID: <46D44712.10905@bootseg.com>

Brian Sheets wrote:
> Yes, not only does it work manually, when I run fence_ack_manual, that causes the fence_apc_snmp to work.
>
> So, 
>
> System failure
> fenced says its trying to fence
> then the fenced says its waiting for ack, which never comes and the system doesn't get powered down
> then I run fence_ack_manual, and then the system powers down.
>   
It looks like a fence_manual config entry is still loaded in your 
cluster for fencing that node.  Did you change your cluster.conf file 
without increasing the version number and telling cman to reload the 
config?    fence_ack_manual should only do something if fence_manual is 
setup for that node, also, you should not have seen that fence_manual 
config entry when fence_apc_snmp is configured as the sole fence agent.

Thanks,
Eric
eric at bootseg.com
> <?xml version="1.0"?>
> <cluster name="index" config_version="2">
> <cman two_node="1" expected_votes="1">
> </cman>
> <clusternodes>
> <clusternode name="oc-index3" votes="1">
>         <fence>
>                 <method name="single">
>                        <device name="oc-cab1-pdu2" port="18"
> option="off"/>
>                 </method>
>         </fence>
> </clusternode>
>
> <clusternode name="oc-index4" votes="1">
>         <fence>
>                 <method name="single">
>                   <device name="oc-cab1-pdu1" port="16" option="off"/>
>                   </method>
>         </fence>
> </clusternode>
>
> </clusternodes>
> <fencedevices>
>         <fencedevice name="oc-cab1-pdu2" agent="fence_apc_snmp"
> ipaddr="172.16.14.9" login="apc" passwd="xxxx"/>
>         <fencedevice name="oc-cab1-pdu1" agent="fence_apc_snmp"
> ipaddr="172.16.14.8" login="apc" passwd="xxxx"/>
> </fencedevices>
> </cluster>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From Charles_McKinnis at Dell.com  Tue Aug 28 17:56:37 2007
From: Charles_McKinnis at Dell.com (Charles_McKinnis at Dell.com)
Date: Tue, 28 Aug 2007 12:56:37 -0500
Subject: [Linux-cluster] GFS, SELinux denial
Message-ID: <7E3508B66E998043A1A8654D3F72461A60AA9C@AUSX3MPC128.aus.amer.dell.com>

I am having issues with a server running gfs and an SELinux error. When
/etc/init.d/gfs start or service gfs start is run, it results in a
SELinux denial. If mount -a -t gfs is run as root it works fine. The
scripts also work if setenforce 0 is used. Running setsebool -P
allow_mount_anyfile=1 does not fix the problem (as seen in sealert),
although it is set.

Thank you,
Charles McKinnis

# cat /etc/fstab
/dev/VolGroup00/LogVol00 /                       ext3    defaults
1 1
LABEL=/boot             /boot                   ext3    defaults
1 2
devpts                  /dev/pts                devpts  gid=5,mode=620
0 0
tmpfs                   /dev/shm                tmpfs   defaults
0 0
proc                    /proc                   proc    defaults
0 0
sysfs                   /sys                    sysfs   defaults
0 0
/dev/VolGroup00/LogVol01 swap                    swap    defaults
0 0
/dev/hda                /media/cdrecorder       auto
pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed 0
0
/dev/winchester/array	/opt/winchester		gfs
rw,localflocks,localcaching,oopses_ok 	0 0

# /etc/init.d/gfs stop
Mounting GFS filesystems:  /sbin/mount.gfs: error 13 mounting
/dev/winchester/array on /opt/winchester

# tail /var/log/messages
Aug 28 11:56:24 ronnie-vidrine kernel: Trying to join cluster
"lock_nolock", "dm-2"
Aug 28 11:56:24 ronnie-vidrine kernel: Joined cluster. Now mounting
FS...
Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Trying
to acquire journal lock...
Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Looking
at journal...
Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Done Aug
28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Trying to
acquire journal lock...
Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Looking
at journal...
Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Done Aug
28 11:56:24 ronnie-vidrine kernel: SELinux: (dev dm-2, type gfs)
getxattr errno 13
Aug 28 11:56:26 ronnie-vidrine setroubleshoot:      SELinux prevented
/sbin/mount.gfs2 from mounting on the file or directory     "/" (type
"unlabeled_t").      For complete SELinux messages. run sealert -l
c3fabd9a-3aac-4af4-aa26-300e19aab70e

# sealert -l c3fabd9a-3aac-4af4-aa26-300e19aab70e
Summary
    SELinux prevented /sbin/mount.gfs2 from mounting on the file or
directory
    "/" (type "unlabeled_t").

Detailed Description
    SELinux prevented /sbin/mount.gfs2 from mounting a filesystem on the
file or
    directory "/" of type "unlabeled_t". By default SELinux limits the
mounting
    of filesystems to only some files or directories (those with types
that have
    the mountpoint attribute). The type "unlabeled_t" does not have this
    attribute. You can either relabel the file or directory or set the
boolean
    "allow_mount_anyfile" to true to allow mounting on any file or
directory.

Allowing Access
    Changing the "allow_mount_anyfile" boolean to true will allow this
access:
    "setsebool -P allow_mount_anyfile=1."

    The following command will allow this access:
    setsebool -P allow_mount_anyfile=1

Additional Information        

Source Context                user_u:system_r:mount_t
Target Context                system_u:object_r:unlabeled_t
Target Objects                / [ dir ]
Affected RPM Packages         gfs2-utils-0.1.25-1.el5
                              [application]filesystem-2.4.0-1 [target]
Policy RPM                    selinux-policy-2.4.6-30.el5
Selinux Enabled               True
Policy Type                   targeted
MLS Enabled                   True
Enforcing Mode                Enforcing
Plugin Name                   plugins.allow_mount_anyfile
Host Name                     server.net
Platform                      Linux server.net
                              2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21
EST 2007
                              i686 i686
Alert Count                   14
Line Numbers                  

Raw Audit Messages            

avc: denied { read } for comm="mount.gfs" dev=dm-2 egid=0 euid=0
exe="/sbin/mount.gfs2" exit=-13 fsgid=0 fsuid=0 gid=0 items=0 name="/"
pid=4802 scontext=user_u:system_r:mount_t:s0 sgid=0
subj=user_u:system_r:mount_t:s0 suid=0 tclass=dir
tcontext=system_u:object_r:unlabeled_t:s0 tty=pts1 uid=0



From orkcu at yahoo.com  Tue Aug 28 18:15:22 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Tue, 28 Aug 2007 11:15:22 -0700 (PDT)
Subject: [Linux-cluster] GFS, SELinux denial
In-Reply-To: <7E3508B66E998043A1A8654D3F72461A60AA9C@AUSX3MPC128.aus.amer.dell.com>
Message-ID: <19186.17697.qm@web50603.mail.re2.yahoo.com>


--- Charles_McKinnis at Dell.com wrote:

> I am having issues with a server running gfs and an
> SELinux error. When
> /etc/init.d/gfs start or service gfs start is run,
> it results in a
> SELinux denial. If mount -a -t gfs is run as root it
> works fine. The
> scripts also work if setenforce 0 is used. Running
> setsebool -P
> allow_mount_anyfile=1 does not fix the problem (as
> seen in sealert),
> although it is set.
> 
> Thank you,
> Charles McKinnis

> /dev/winchester/array	/opt/winchester		gfs
> rw,localflocks,localcaching,oopses_ok 	0 0
> 

> /dev/winchester/array on /opt/winchester

what is the out put of:
ls -lZ /opt/ 

?

just to see if the mount point as any SElinux label

> /sbin/mount.gfs2 from mounting on the file or
        ^^^^^^^^^^

gfs2 ? 

cu
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


      ____________________________________________________________________________________
Luggage? GPS? Comic books? 
Check out fitting gifts for grads at Yahoo! Search
http://search.yahoo.com/search?fr=oni_on_mail&p=graduation+gifts&cs=bz



From chrismcc at pricegrabber.com  Tue Aug 28 18:37:30 2007
From: chrismcc at pricegrabber.com (Christopher McCrory)
Date: Tue, 28 Aug 2007 11:37:30 -0700
Subject: [Linux-cluster] GFS with Dell MD3000?
Message-ID: <1188326250.17340.5.camel@wednesday.pricegrabber.com>

Hello...

  Has anyone tried GFS with the Dell MD3000 array?  This is a SAS
hardware raid array that can support four server connections including
shared access.  (very nice piece of hardware).  I have no experience
with GFS (yet).


TIA

  

-- 
Christopher McCrory
 "The guy that keeps the servers running"

To the optimist, the glass is half full.
To the pessimist, the glass is half empty.
To the engineer, the glass is twice as big as it needs to be.



From rohara at redhat.com  Tue Aug 28 18:41:02 2007
From: rohara at redhat.com (Ryan O'Hara)
Date: Tue, 28 Aug 2007 13:41:02 -0500
Subject: [Linux-cluster] GFS with Dell MD3000?
In-Reply-To: <1188326250.17340.5.camel@wednesday.pricegrabber.com>
References: <1188326250.17340.5.camel@wednesday.pricegrabber.com>
Message-ID: <46D46C3E.5050409@redhat.com>

Christopher McCrory wrote:
> 
>   Has anyone tried GFS with the Dell MD3000 array?  This is a SAS
> hardware raid array that can support four server connections including
> shared access.  (very nice piece of hardware).  I have no experience
> with GFS (yet).

Yes. I've tested GFS with this array using the mptsas drivers that are 
shipped with RHEL5. Works fine.

Ryan



From rohara at redhat.com  Tue Aug 28 18:43:51 2007
From: rohara at redhat.com (Ryan O'Hara)
Date: Tue, 28 Aug 2007 13:43:51 -0500
Subject: [Linux-cluster] GFS, SELinux denial
In-Reply-To: <7E3508B66E998043A1A8654D3F72461A60AA9C@AUSX3MPC128.aus.amer.dell.com>
References: <7E3508B66E998043A1A8654D3F72461A60AA9C@AUSX3MPC128.aus.amer.dell.com>
Message-ID: <46D46CE7.4060202@redhat.com>

Charles_McKinnis at Dell.com wrote:
> I am having issues with a server running gfs and an SELinux error. When
> /etc/init.d/gfs start or service gfs start is run, it results in a
> SELinux denial. If mount -a -t gfs is run as root it works fine. The
> scripts also work if setenforce 0 is used. Running setsebool -P
> allow_mount_anyfile=1 does not fix the problem (as seen in sealert),
> although it is set.


What selinux policy are you using? The policy must be such that gfs (or 
gfs2) are declared to support/usr selinux xattrs.


> # cat /etc/fstab
> /dev/VolGroup00/LogVol00 /                       ext3    defaults
> 1 1
> LABEL=/boot             /boot                   ext3    defaults
> 1 2
> devpts                  /dev/pts                devpts  gid=5,mode=620
> 0 0
> tmpfs                   /dev/shm                tmpfs   defaults
> 0 0
> proc                    /proc                   proc    defaults
> 0 0
> sysfs                   /sys                    sysfs   defaults
> 0 0
> /dev/VolGroup00/LogVol01 swap                    swap    defaults
> 0 0
> /dev/hda                /media/cdrecorder       auto
> pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed 0
> 0
> /dev/winchester/array	/opt/winchester		gfs
> rw,localflocks,localcaching,oopses_ok 	0 0
> 
> # /etc/init.d/gfs stop
> Mounting GFS filesystems:  /sbin/mount.gfs: error 13 mounting
> /dev/winchester/array on /opt/winchester
> 
> # tail /var/log/messages
> Aug 28 11:56:24 ronnie-vidrine kernel: Trying to join cluster
> "lock_nolock", "dm-2"
> Aug 28 11:56:24 ronnie-vidrine kernel: Joined cluster. Now mounting
> FS...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Trying
> to acquire journal lock...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Looking
> at journal...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Done Aug
> 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Trying to
> acquire journal lock...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Looking
> at journal...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Done Aug
> 28 11:56:24 ronnie-vidrine kernel: SELinux: (dev dm-2, type gfs)
> getxattr errno 13
> Aug 28 11:56:26 ronnie-vidrine setroubleshoot:      SELinux prevented
> /sbin/mount.gfs2 from mounting on the file or directory     "/" (type
> "unlabeled_t").      For complete SELinux messages. run sealert -l
> c3fabd9a-3aac-4af4-aa26-300e19aab70e
> 
> # sealert -l c3fabd9a-3aac-4af4-aa26-300e19aab70e
> Summary
>     SELinux prevented /sbin/mount.gfs2 from mounting on the file or
> directory
>     "/" (type "unlabeled_t").
> 
> Detailed Description
>     SELinux prevented /sbin/mount.gfs2 from mounting a filesystem on the
> file or
>     directory "/" of type "unlabeled_t". By default SELinux limits the
> mounting
>     of filesystems to only some files or directories (those with types
> that have
>     the mountpoint attribute). The type "unlabeled_t" does not have this
>     attribute. You can either relabel the file or directory or set the
> boolean
>     "allow_mount_anyfile" to true to allow mounting on any file or
> directory.
> 
> Allowing Access
>     Changing the "allow_mount_anyfile" boolean to true will allow this
> access:
>     "setsebool -P allow_mount_anyfile=1."
> 
>     The following command will allow this access:
>     setsebool -P allow_mount_anyfile=1
> 
> Additional Information        
> 
> Source Context                user_u:system_r:mount_t
> Target Context                system_u:object_r:unlabeled_t
> Target Objects                / [ dir ]
> Affected RPM Packages         gfs2-utils-0.1.25-1.el5
>                               [application]filesystem-2.4.0-1 [target]
> Policy RPM                    selinux-policy-2.4.6-30.el5
> Selinux Enabled               True
> Policy Type                   targeted
> MLS Enabled                   True
> Enforcing Mode                Enforcing
> Plugin Name                   plugins.allow_mount_anyfile
> Host Name                     server.net
> Platform                      Linux server.net
>                               2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21
> EST 2007
>                               i686 i686
> Alert Count                   14
> Line Numbers                  
> 
> Raw Audit Messages            
> 
> avc: denied { read } for comm="mount.gfs" dev=dm-2 egid=0 euid=0
> exe="/sbin/mount.gfs2" exit=-13 fsgid=0 fsuid=0 gid=0 items=0 name="/"
> pid=4802 scontext=user_u:system_r:mount_t:s0 sgid=0
> subj=user_u:system_r:mount_t:s0 suid=0 tclass=dir
> tcontext=system_u:object_r:unlabeled_t:s0 tty=pts1 uid=0
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From chrismcc at pricegrabber.com  Tue Aug 28 20:13:12 2007
From: chrismcc at pricegrabber.com (Christopher McCrory)
Date: Tue, 28 Aug 2007 13:13:12 -0700
Subject: [Linux-cluster] GFS with Dell MD3000?
In-Reply-To: <46D46C3E.5050409@redhat.com>
References: <1188326250.17340.5.camel@wednesday.pricegrabber.com>
	<46D46C3E.5050409@redhat.com>
Message-ID: <1188331992.17340.19.camel@wednesday.pricegrabber.com>

Hello...

On Tue, 2007-08-28 at 13:41 -0500, Ryan O'Hara wrote:
> Christopher McCrory wrote:
> > 
> >   Has anyone tried GFS with the Dell MD3000 array?  This is a SAS
> > hardware raid array that can support four server connections including
> > shared access.  (very nice piece of hardware).  I have no experience
> > with GFS (yet).
> 
> Yes. I've tested GFS with this array using the mptsas drivers that are 
> shipped with RHEL5. Works fine.
> 


I have a rather large collection of static files that I serve up with
apache.  I would like to take one MD3000 to connect to four servers.
Mark one as read/write for updating and the other three as readonly.  I
realize with GFS any could write, but I would like it to be as simple as
possible.


What kind of fencing did you use?  Or is the MD3000 able to handle that
part?




> Ryan
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
-- 
Christopher McCrory
 "The guy that keeps the servers running"

To the optimist, the glass is half full.
To the pessimist, the glass is half empty.
To the engineer, the glass is twice as big as it needs to be.



From kanderso at redhat.com  Tue Aug 28 20:36:26 2007
From: kanderso at redhat.com (Kevin Anderson)
Date: Tue, 28 Aug 2007 15:36:26 -0500
Subject: [Linux-cluster] GFS with Dell MD3000?
In-Reply-To: <1188331992.17340.19.camel@wednesday.pricegrabber.com>
References: <1188326250.17340.5.camel@wednesday.pricegrabber.com>
	<46D46C3E.5050409@redhat.com>
	<1188331992.17340.19.camel@wednesday.pricegrabber.com>
Message-ID: <1188333386.2727.34.camel@dhcp80-204.msp.redhat.com>

On Tue, 2007-08-28 at 13:13 -0700, Christopher McCrory wrote:
> Hello...
> 
> On Tue, 2007-08-28 at 13:41 -0500, Ryan O'Hara wrote:
> > Christopher McCrory wrote:
> > > 
> > >   Has anyone tried GFS with the Dell MD3000 array?  This is a SAS
> > > hardware raid array that can support four server connections including
> > > shared access.  (very nice piece of hardware).  I have no experience
> > > with GFS (yet).
> > 
> > Yes. I've tested GFS with this array using the mptsas drivers that are 
> > shipped with RHEL5. Works fine.
> > 
> 
> 
> I have a rather large collection of static files that I serve up with
> apache.  I would like to take one MD3000 to connect to four servers.
> Mark one as read/write for updating and the other three as readonly.  I
> realize with GFS any could write, but I would like it to be as simple as
> possible.
> 
> 
The MD3000 we used has two controllers in a active/passive
configuration.  This only allows you to connect two nodes to the array
to use it in a GFS configuration.  

> What kind of fencing did you use?  Or is the MD3000 able to handle that
> part?

Fencing needs to be done separate from the storage array.

Kevin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070828/96eb9622/attachment.htm>

From Charles_McKinnis at Dell.com  Tue Aug 28 20:41:37 2007
From: Charles_McKinnis at Dell.com (Charles_McKinnis at Dell.com)
Date: Tue, 28 Aug 2007 15:41:37 -0500
Subject: [Linux-cluster] GFS, SELinux denial
In-Reply-To: <46D46CE7.4060202@redhat.com>
References: <7E3508B66E998043A1A8654D3F72461A60AA9C@AUSX3MPC128.aus.amer.dell.com>
	<46D46CE7.4060202@redhat.com>
Message-ID: <7E3508B66E998043A1A8654D3F72461A60AAA6@AUSX3MPC128.aus.amer.dell.com>

The SELinux policy is set to Enabled/Enforcing. I am not sure how to
check the gfs/gfs2 policy. Can I check it from the shell?

Thank you,
Charles

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara
Sent: Tuesday, August 28, 2007 1:44 PM
To: linux clustering
Subject: Re: [Linux-cluster] GFS, SELinux denial

Charles_McKinnis at Dell.com wrote:
> I am having issues with a server running gfs and an SELinux error. 
> When /etc/init.d/gfs start or service gfs start is run, it results in 
> a SELinux denial. If mount -a -t gfs is run as root it works fine. The

> scripts also work if setenforce 0 is used. Running setsebool -P
> allow_mount_anyfile=1 does not fix the problem (as seen in sealert), 
> although it is set.


What selinux policy are you using? The policy must be such that gfs (or
gfs2) are declared to support/usr selinux xattrs.


> # cat /etc/fstab
> /dev/VolGroup00/LogVol00 /                       ext3    defaults
> 1 1
> LABEL=/boot             /boot                   ext3    defaults
> 1 2
> devpts                  /dev/pts                devpts  gid=5,mode=620
> 0 0
> tmpfs                   /dev/shm                tmpfs   defaults
> 0 0
> proc                    /proc                   proc    defaults
> 0 0
> sysfs                   /sys                    sysfs   defaults
> 0 0
> /dev/VolGroup00/LogVol01 swap                    swap    defaults
> 0 0
> /dev/hda                /media/cdrecorder       auto
> pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed
0
> 0
> /dev/winchester/array	/opt/winchester		gfs
> rw,localflocks,localcaching,oopses_ok 	0 0
> 
> # /etc/init.d/gfs stop
> Mounting GFS filesystems:  /sbin/mount.gfs: error 13 mounting
> /dev/winchester/array on /opt/winchester
> 
> # tail /var/log/messages
> Aug 28 11:56:24 ronnie-vidrine kernel: Trying to join cluster
> "lock_nolock", "dm-2"
> Aug 28 11:56:24 ronnie-vidrine kernel: Joined cluster. Now mounting
> FS...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Trying
> to acquire journal lock...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0:
Looking
> at journal...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Done
Aug
> 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Trying to
> acquire journal lock...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1:
Looking
> at journal...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Done
Aug
> 28 11:56:24 ronnie-vidrine kernel: SELinux: (dev dm-2, type gfs)
> getxattr errno 13
> Aug 28 11:56:26 ronnie-vidrine setroubleshoot:      SELinux prevented
> /sbin/mount.gfs2 from mounting on the file or directory     "/" (type
> "unlabeled_t").      For complete SELinux messages. run sealert -l
> c3fabd9a-3aac-4af4-aa26-300e19aab70e
> 
> # sealert -l c3fabd9a-3aac-4af4-aa26-300e19aab70e
> Summary
>     SELinux prevented /sbin/mount.gfs2 from mounting on the file or
> directory
>     "/" (type "unlabeled_t").
> 
> Detailed Description
>     SELinux prevented /sbin/mount.gfs2 from mounting a filesystem on
the
> file or
>     directory "/" of type "unlabeled_t". By default SELinux limits the
> mounting
>     of filesystems to only some files or directories (those with types
> that have
>     the mountpoint attribute). The type "unlabeled_t" does not have
this
>     attribute. You can either relabel the file or directory or set the
> boolean
>     "allow_mount_anyfile" to true to allow mounting on any file or
> directory.
> 
> Allowing Access
>     Changing the "allow_mount_anyfile" boolean to true will allow this
> access:
>     "setsebool -P allow_mount_anyfile=1."
> 
>     The following command will allow this access:
>     setsebool -P allow_mount_anyfile=1
> 
> Additional Information        
> 
> Source Context                user_u:system_r:mount_t
> Target Context                system_u:object_r:unlabeled_t
> Target Objects                / [ dir ]
> Affected RPM Packages         gfs2-utils-0.1.25-1.el5
>                               [application]filesystem-2.4.0-1 [target]
> Policy RPM                    selinux-policy-2.4.6-30.el5
> Selinux Enabled               True
> Policy Type                   targeted
> MLS Enabled                   True
> Enforcing Mode                Enforcing
> Plugin Name                   plugins.allow_mount_anyfile
> Host Name                     server.net
> Platform                      Linux server.net
>                               2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21
> EST 2007
>                               i686 i686
> Alert Count                   14
> Line Numbers                  
> 
> Raw Audit Messages            
> 
> avc: denied { read } for comm="mount.gfs" dev=dm-2 egid=0 euid=0
> exe="/sbin/mount.gfs2" exit=-13 fsgid=0 fsuid=0 gid=0 items=0 name="/"
> pid=4802 scontext=user_u:system_r:mount_t:s0 sgid=0
> subj=user_u:system_r:mount_t:s0 suid=0 tclass=dir
> tcontext=system_u:object_r:unlabeled_t:s0 tty=pts1 uid=0
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Tue Aug 28 20:45:54 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 28 Aug 2007 16:45:54 -0400
Subject: [Linux-cluster] init scripts and the scope of a service
In-Reply-To: <978838.23700.qm@web63502.mail.re1.yahoo.com>
References: <978838.23700.qm@web63502.mail.re1.yahoo.com>
Message-ID: <20070828204554.GM17551@redhat.com>

On Fri, Aug 24, 2007 at 11:05:08PM -0700, Jon S wrote:
> Given the Cluster Suite's preference to disable, restart or relocate a service after an init script resource returns non-zero, I'm curious how best to handle a significant number of applications tied to a single IP resource.  The easy thing to do is to just define a bunch of init scripts in a single service with the IP resource.  Problem is, of course, that if one script returns non-zero, all resources in the service are stopped and restarted.
> 
> I can see several ways to script my way past the this issue.  Before I do, I was wondering how others handle this situation.

In CVS for 5.1, you can add a special attribute to the tags in
cluster.conf:

   <script name="this_app" file="/bin/this_app"
           __independent_subtree="1"/>

This means that a failure of this script will cause a stop/start of that
script without stop/start of the rest of the siblings in the service
tree.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From Charles_McKinnis at Dell.com  Tue Aug 28 20:46:32 2007
From: Charles_McKinnis at Dell.com (Charles_McKinnis at Dell.com)
Date: Tue, 28 Aug 2007 15:46:32 -0500
Subject: [Linux-cluster] GFS, SELinux denial
In-Reply-To: <19186.17697.qm@web50603.mail.re2.yahoo.com>
References: <7E3508B66E998043A1A8654D3F72461A60AA9C@AUSX3MPC128.aus.amer.dell.com>
	<19186.17697.qm@web50603.mail.re2.yahoo.com>
Message-ID: <7E3508B66E998043A1A8654D3F72461A60AAA7@AUSX3MPC128.aus.amer.dell.com>

# ls -lZ /opt
drwxr-xr-x  root root user_u:object_r:usr_t            java
drwxrwxr-x  root root system_u:object_r:usr_t          lgtonmc
drwxrwxr-x  root root system_u:object_r:usr_t          nsr
drwxr-xr-x  root root system_u:object_r:public_content_rw_t winchester
 
# ls -lZd /opt
drwxrwxr-x  root root system_u:object_r:usr_t          /opt 

Thank you,
Charles

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Roger Pe?a
Sent: Tuesday, August 28, 2007 1:15 PM
To: linux clustering
Subject: Re: [Linux-cluster] GFS, SELinux denial


--- Charles_McKinnis at Dell.com wrote:

> I am having issues with a server running gfs and an SELinux error. 
> When /etc/init.d/gfs start or service gfs start is run, it results in 
> a SELinux denial. If mount -a -t gfs is run as root it works fine. The 
> scripts also work if setenforce 0 is used. Running setsebool -P
> allow_mount_anyfile=1 does not fix the problem (as seen in sealert), 
> although it is set.
> 
> Thank you,
> Charles McKinnis

> /dev/winchester/array	/opt/winchester		gfs
> rw,localflocks,localcaching,oopses_ok 	0 0
> 

> /dev/winchester/array on /opt/winchester

what is the out put of:
ls -lZ /opt/ 

?

just to see if the mount point as any SElinux label

> /sbin/mount.gfs2 from mounting on the file or
        ^^^^^^^^^^

gfs2 ? 

cu
roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


      ____________________________________________________________________________________
Luggage? GPS? Comic books? 
Check out fitting gifts for grads at Yahoo! Search http://search.yahoo.com/search?fr=oni_on_mail&p=graduation+gifts&cs=bz

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From orkcu at yahoo.com  Wed Aug 29 01:51:43 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Tue, 28 Aug 2007 18:51:43 -0700 (PDT)
Subject: [Linux-cluster] GFS, SELinux denial
In-Reply-To: <7E3508B66E998043A1A8654D3F72461A60AAA7@AUSX3MPC128.aus.amer.dell.com>
Message-ID: <742532.21379.qm@web50603.mail.re2.yahoo.com>


--- Charles_McKinnis at Dell.com wrote:

> # ls -lZ /opt
> drwxr-xr-x  root root
> system_u:object_r:public_content_rw_t winchester
>  
> # ls -lZd /opt
> drwxrwxr-x  root root system_u:object_r:usr_t       
>   /opt 

sorry, I just miss read your first messages:

> > When /etc/init.d/gfs start or service gfs start is
> > run, it results in a SELinux denial. 
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>If mount -a -t gfs is run as root it works fine. 
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^     

very wierd

roger

__________________________________________
RedHat Certified ( RHCE )
Cisco Certified ( CCNA & CCDA )


       
____________________________________________________________________________________
Be a better Heartthrob. Get better relationship answers from someone who knows. Yahoo! Answers - Check it out. 
http://answers.yahoo.com/dir/?link=list&sid=396545433



From jon.skelton at gmail.com  Wed Aug 29 09:31:48 2007
From: jon.skelton at gmail.com (Jon Skelton)
Date: Wed, 29 Aug 2007 02:31:48 -0700
Subject: [Linux-cluster] init scripts and the scope of a service
In-Reply-To: <20070828204554.GM17551@redhat.com>
References: <978838.23700.qm@web63502.mail.re1.yahoo.com>
	<20070828204554.GM17551@redhat.com>
Message-ID: <b637d2000708290231r207ca263m7bf2eee1645a0514@mail.gmail.com>

On 8/28/07, Lon Hohberger <lhh at redhat.com> wrote:
>
>
> In CVS for 5.1, you can add a special attribute to the tags in
> cluster.conf:
>
>    <script name="this_app" file="/bin/this_app"
>            __independent_subtree="1"/>
>
> This means that a failure of this script will cause a stop/start of that
> script without stop/start of the rest of the siblings in the service
> tree.



That's awesome.  :)  Thanks much.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070829/571837a9/attachment.htm>

From stuarta at squashedfrog.net  Wed Aug 29 10:29:30 2007
From: stuarta at squashedfrog.net (Stuart Auchterlonie)
Date: Wed, 29 Aug 2007 11:29:30 +0100
Subject: [Linux-cluster] Re: Nodes show up extremely slowly (or not at all)
In-Reply-To: <1187962600.3353.37.camel@lits17.library.usyd.edu.au>
References: <Pine.GSO.4.64.0708241023170.8481@sunserv.kfki.hu>	<Pine.GSO.4.64.0708241439110.8481@sunserv.kfki.hu>
	<1187962600.3353.37.camel@lits17.library.usyd.edu.au>
Message-ID: <46D54A8A.5040509@squashedfrog.net>



Nikolas Lam wrote:
> On Fri, 2007-08-24 at 15:12 +0200, Kadlecsik Jozsi wrote:
>> On Fri, 24 Aug 2007, Kadlecsik Jozsi wrote:
>>
>>> It took 13 minutes(!) for 'lsxerv0' to notice 'lxserv1'.
>> It turned out to be a switch-related multicast problem: openais used the 
>> default(?) 239.192.6.53 address, while the switch strangely enough passes 
>> the 224.0.0.x addresses only. So by setting the multicast address in
>> cluster.conf explicitly to 224.0.0.3 cured the problem.
> 
> 
> Hi Jozsef,
> 
> I think I might be having similar problems. However I seem to be unable
> to specify a multicast address manually using conga / luci (the web
> based tool). When I press "apply" I see the reconfiguration happening,
> but when I look at my cluster.conf, there's no multicast address in
> there:
> 

I reported this one a few days ago and it's been fixed.
As you discovered the workaround is to manually configure
the multicast address the first time.

Stuart



From kadlec at sunserv.kfki.hu  Wed Aug 29 13:01:56 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Wed, 29 Aug 2007 15:01:56 +0200 (MEST)
Subject: [Linux-cluster] clvmd startup timed out
Message-ID: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu>

Hi,

In our GFS cluster clvmd does not start up sometimes. It seems the stuck 
clmvd cannot recover, even when the node is a cluster member and the 
quorum is OK:

root at lxserv0:~# cman_tool status
Version: 6.0.1
Config Version: 6
Cluster Name: kfki
Cluster Id: 1583
Cluster Member: Yes
Cluster Generation: 516
Membership state: Cluster-Member
Nodes: 4
Expected votes: 5
Total votes: 4
Quorum: 3  
Active subsystems: 7
Flags: 
Ports Bound: 0 11  
Node name: lxserv0-gfs
Node ID: 1
Multicast addresses: 224.0.0.3 
Node addresses: 192.168.192.15 

root at lxserv0:~# cman_tool services
type             level name     id       state       
fence            0     default  00010001 none        
[1 2 3 4]
dlm              1     clvmd    00010002 none        
[1 2 3 4]

In consequence, 'vgchange -aly' hangs in the init script and we cannot 
proceed to mount the GFS volumes.

When starting clvmd in debug mode, all I get is

root at lxserv0:~# /sbin/clvmd -T 10 -d
CLVMD[b7d7d6c0]: Aug 29 14:57:16 CLVMD started
CLVMD[b7d7d6c0]: Aug 29 14:57:16 Connected to CMAN
CLVMD[b7d7d6c0]: Aug 29 14:57:19 CMAN initialisation complete

Nothing more and vgchange still hangs.

What is missing here?

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From Jeremyc at tasconline.com  Wed Aug 29 13:07:56 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Wed, 29 Aug 2007 08:07:56 -0500
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com>

Can you post the contents of the [GLOBAL] seconds of /etc/lvm/lvm.conf?

Thanks.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kadlecsik Jozsi
Sent: Wednesday, August 29, 2007 8:02 AM
To: linux clustering
Subject: [Linux-cluster] clvmd startup timed out

Hi,

In our GFS cluster clvmd does not start up sometimes. It seems the stuck

clmvd cannot recover, even when the node is a cluster member and the 
quorum is OK:

root at lxserv0:~# cman_tool status
Version: 6.0.1
Config Version: 6
Cluster Name: kfki
Cluster Id: 1583
Cluster Member: Yes
Cluster Generation: 516
Membership state: Cluster-Member
Nodes: 4
Expected votes: 5
Total votes: 4
Quorum: 3  
Active subsystems: 7
Flags: 
Ports Bound: 0 11  
Node name: lxserv0-gfs
Node ID: 1
Multicast addresses: 224.0.0.3 
Node addresses: 192.168.192.15 

root at lxserv0:~# cman_tool services
type             level name     id       state       
fence            0     default  00010001 none        
[1 2 3 4]
dlm              1     clvmd    00010002 none        
[1 2 3 4]

In consequence, 'vgchange -aly' hangs in the init script and we cannot 
proceed to mount the GFS volumes.

When starting clvmd in debug mode, all I get is

root at lxserv0:~# /sbin/clvmd -T 10 -d
CLVMD[b7d7d6c0]: Aug 29 14:57:16 CLVMD started
CLVMD[b7d7d6c0]: Aug 29 14:57:16 Connected to CMAN
CLVMD[b7d7d6c0]: Aug 29 14:57:19 CMAN initialisation complete

Nothing more and vgchange still hangs.

What is missing here?

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From kadlec at sunserv.kfki.hu  Wed Aug 29 13:14:45 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Wed, 29 Aug 2007 15:14:45 +0200 (MEST)
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu>
	<BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com>
Message-ID: <Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu>

On Wed, 29 Aug 2007, Jeremy Carroll wrote:

> Can you post the contents of the [GLOBAL] seconds of /etc/lvm/lvm.conf?

global {
    
    # The file creation mask for any files and directories created.
    # Interpreted as octal if the first digit is zero.
    umask = 077

    # Allow other users to read the files
    #umask = 022

    # Enabling test mode means that no changes to the on disk metadata
    # will be made.  Equivalent to having the -t option on every
    # command.  Defaults to off.
    test = 0

    # Whether or not to communicate with the kernel device-mapper.
    # Set to 0 if you want to use the tools to manipulate LVM metadata 
    # without activating any logical volumes.
    # If the device-mapper kernel driver is not present in your kernel
    # setting this to 0 should suppress the error messages.
    activation = 1

    # If we can't communicate with device-mapper, should we try running 
    # the LVM1 tools?
    # This option only applies to 2.4 kernels and is provided to help you
    # switch between device-mapper kernels and LVM1 kernels.
    # The LVM1 tools need to be installed with .lvm1 suffices
    # e.g. vgscan.lvm1 and they will stop working after you start using
    # the new lvm2 on-disk metadata format.
    # The default value is set when the tools are built.
    # fallback_to_lvm1 = 0

    # The default metadata format that commands should use - "lvm1" or "lvm2".
    # The command line override is -M1 or -M2.
    # Defaults to "lvm1" if compiled in, else "lvm2".
    # format = "lvm1"

    # Location of proc filesystem
    proc = "/proc"

    # Type of locking to use. Defaults to file-based locking (1).
    # Turn locking off by setting to 0 (dangerous: risks metadata corruption
    # if LVM2 commands get run concurrently).
    # locking_type = 1

    # Local non-LV directory that holds file-based locks while commands are
    # in progress.  A directory like /tmp that may get wiped on reboot is OK.
    # locking_dir = "/var/lock/lvm"

    # Other entries can go here to allow you to load shared libraries
    # e.g. if support for LVM1 metadata was compiled as a shared library use
    #   format_libraries = "liblvm2format1.so" 
    # Full pathnames can be given.

    # Search this directory first for shared libraries.
    #  library_dir = "/lib/lvm2"
    
    # Enable these three for cluster LVM when clvmd is running.
    # Remember to remove the "locking_type = 1" above.
    #
    #   locking_library = "liblvm2clusterlock.so"
    #   locking_type = 2
    #   library_dir = "/lib/lvm2"
    locking_type = 2
    locking_library = "/lib/liblvm2clusterlock.so"
}

Hope it helps.

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From Jeremyc at tasconline.com  Wed Aug 29 13:24:57 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Wed, 29 Aug 2007 08:24:57 -0500
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com>
	<Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com>

Uncomment / Change / Add these lines.

library_dir = "/lib/lvm2"
locking_type = 3
fallback_to_local_locking = 1
fallback_to_clustered_locking = 0
locking_dir = "/var/lock/lvm"

If this does not work, enable lvm-clustering from the command line by
typing in this command.

lvmconf  --enable-cluster

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kadlecsik Jozsi
Sent: Wednesday, August 29, 2007 8:15 AM
To: linux clustering
Subject: RE: [Linux-cluster] clvmd startup timed out

On Wed, 29 Aug 2007, Jeremy Carroll wrote:

> Can you post the contents of the [GLOBAL] seconds of
/etc/lvm/lvm.conf?

global {
    
    # The file creation mask for any files and directories created.
    # Interpreted as octal if the first digit is zero.
    umask = 077

    # Allow other users to read the files
    #umask = 022

    # Enabling test mode means that no changes to the on disk metadata
    # will be made.  Equivalent to having the -t option on every
    # command.  Defaults to off.
    test = 0

    # Whether or not to communicate with the kernel device-mapper.
    # Set to 0 if you want to use the tools to manipulate LVM metadata 
    # without activating any logical volumes.
    # If the device-mapper kernel driver is not present in your kernel
    # setting this to 0 should suppress the error messages.
    activation = 1

    # If we can't communicate with device-mapper, should we try running 
    # the LVM1 tools?
    # This option only applies to 2.4 kernels and is provided to help
you
    # switch between device-mapper kernels and LVM1 kernels.
    # The LVM1 tools need to be installed with .lvm1 suffices
    # e.g. vgscan.lvm1 and they will stop working after you start using
    # the new lvm2 on-disk metadata format.
    # The default value is set when the tools are built.
    # fallback_to_lvm1 = 0

    # The default metadata format that commands should use - "lvm1" or
"lvm2".
    # The command line override is -M1 or -M2.
    # Defaults to "lvm1" if compiled in, else "lvm2".
    # format = "lvm1"

    # Location of proc filesystem
    proc = "/proc"

    # Type of locking to use. Defaults to file-based locking (1).
    # Turn locking off by setting to 0 (dangerous: risks metadata
corruption
    # if LVM2 commands get run concurrently).
    # locking_type = 1

    # Local non-LV directory that holds file-based locks while commands
are
    # in progress.  A directory like /tmp that may get wiped on reboot
is OK.
    # locking_dir = "/var/lock/lvm"

    # Other entries can go here to allow you to load shared libraries
    # e.g. if support for LVM1 metadata was compiled as a shared library
use
    #   format_libraries = "liblvm2format1.so" 
    # Full pathnames can be given.

    # Search this directory first for shared libraries.
    #  library_dir = "/lib/lvm2"
    
    # Enable these three for cluster LVM when clvmd is running.
    # Remember to remove the "locking_type = 1" above.
    #
    #   locking_library = "liblvm2clusterlock.so"
    #   locking_type = 2
    #   library_dir = "/lib/lvm2"
    locking_type = 2
    locking_library = "/lib/liblvm2clusterlock.so"
}

Hope it helps.

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From kadlec at sunserv.kfki.hu  Wed Aug 29 13:33:50 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Wed, 29 Aug 2007 15:33:50 +0200 (MEST)
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com>
	<Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu>
	<BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com>
Message-ID: <Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>

On Wed, 29 Aug 2007, Jeremy Carroll wrote:

> Uncomment / Change / Add these lines.
> 
> library_dir = "/lib/lvm2"
> locking_type = 3
> fallback_to_local_locking = 1
> fallback_to_clustered_locking = 0
> locking_dir = "/var/lock/lvm"

The directory /lib/lvm2 does not exists at all. 
The LVM2 version as it reported (on a working node):

root at lxserv1:~# lvm version
  LVM version:     2.02.28-cvs (2007-07-17)
  Library version: 1.02.23-cvs (2007-08-21)
  Driver version:  4.11.0

LVM2 was installed from source which created the clvmd, lvmdump, lvm 
binary, the links as the different commands and the liblvm2clusterlock.so 
library (+ man pages).
 
Earlier I tried to set locking_type to 3 and it did not work at all.
 
> If this does not work, enable lvm-clustering from the command line by
> typing in this command.
> 
> lvmconf  --enable-cluster

There is no lvmconf command.

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From kadlec at sunserv.kfki.hu  Wed Aug 29 13:53:04 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Wed, 29 Aug 2007 15:53:04 +0200 (MEST)
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com>
	<Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu>
	<BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com>
	<Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>
Message-ID: <Pine.GSO.4.64.0708291549400.22113@sunserv.kfki.hu>

On Wed, 29 Aug 2007, Kadlecsik Jozsi wrote:

> On Wed, 29 Aug 2007, Jeremy Carroll wrote:
> 
> > Uncomment / Change / Add these lines.
> > 
> > library_dir = "/lib/lvm2"
> > locking_type = 3
> > fallback_to_local_locking = 1
> > fallback_to_clustered_locking = 0
> > locking_dir = "/var/lock/lvm"
> 
> The directory /lib/lvm2 does not exists at all. 
> The LVM2 version as it reported (on a working node):
> 
> root at lxserv1:~# lvm version
>   LVM version:     2.02.28-cvs (2007-07-17)
>   Library version: 1.02.23-cvs (2007-08-21)
>   Driver version:  4.11.0
> 
> LVM2 was installed from source which created the clvmd, lvmdump, lvm 
> binary, the links as the different commands and the liblvm2clusterlock.so 
> library (+ man pages).
>  
> Earlier I tried to set locking_type to 3 and it did not work at all.

If locking_type is set to 3, lvm reports:

root at lxserv0:~# lvm version
  Unknown locking type requested.
  Locking type 3 initialisation failed.

LVM2 was compiled according to the instructions in usage.txt:

Install LVM2/CLVM (optional)
  cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2 login cvs
  cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2 checkout LVM2
  cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2
  the password is "cvs"
  cd LVM2
  ./configure --with-clvmd=cman --with-cluster=shared
  make; make install

Peeking into the source code, type 3 would be supported if LVM2 was 
configured as '--with-cluster=internal'.

Should I recompile LVM2?

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From Jeremyc at tasconline.com  Wed Aug 29 14:00:12 2007
From: Jeremyc at tasconline.com (Jeremy Carroll)
Date: Wed, 29 Aug 2007 09:00:12 -0500
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <Pine.GSO.4.64.0708291549400.22113@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com><Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com><Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708291549400.22113@sunserv.kfki.hu>
Message-ID: <BC16657A85D83746AFF99A42587CC6BC07C77F7B@exchange2003.tasconline.com>

I would.

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kadlecsik Jozsi
Sent: Wednesday, August 29, 2007 8:53 AM
To: linux clustering
Subject: RE: [Linux-cluster] clvmd startup timed out

On Wed, 29 Aug 2007, Kadlecsik Jozsi wrote:

> On Wed, 29 Aug 2007, Jeremy Carroll wrote:
> 
> > Uncomment / Change / Add these lines.
> > 
> > library_dir = "/lib/lvm2"
> > locking_type = 3
> > fallback_to_local_locking = 1
> > fallback_to_clustered_locking = 0
> > locking_dir = "/var/lock/lvm"
> 
> The directory /lib/lvm2 does not exists at all. 
> The LVM2 version as it reported (on a working node):
> 
> root at lxserv1:~# lvm version
>   LVM version:     2.02.28-cvs (2007-07-17)
>   Library version: 1.02.23-cvs (2007-08-21)
>   Driver version:  4.11.0
> 
> LVM2 was installed from source which created the clvmd, lvmdump, lvm 
> binary, the links as the different commands and the
liblvm2clusterlock.so 
> library (+ man pages).
>  
> Earlier I tried to set locking_type to 3 and it did not work at all.

If locking_type is set to 3, lvm reports:

root at lxserv0:~# lvm version
  Unknown locking type requested.
  Locking type 3 initialisation failed.

LVM2 was compiled according to the instructions in usage.txt:

Install LVM2/CLVM (optional)
  cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2 login cvs
  cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2 checkout LVM2
  cvs -d :pserver:cvs at sources.redhat.com:/cvs/lvm2
  the password is "cvs"
  cd LVM2
  ./configure --with-clvmd=cman --with-cluster=shared
  make; make install

Peeking into the source code, type 3 would be supported if LVM2 was 
configured as '--with-cluster=internal'.

Should I recompile LVM2?

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Charles_McKinnis at Dell.com  Wed Aug 29 15:35:57 2007
From: Charles_McKinnis at Dell.com (Charles_McKinnis at Dell.com)
Date: Wed, 29 Aug 2007 10:35:57 -0500
Subject: [Linux-cluster] GFS, SELinux denial
In-Reply-To: <46D46CE7.4060202@redhat.com>
References: <7E3508B66E998043A1A8654D3F72461A60AA9C@AUSX3MPC128.aus.amer.dell.com>
	<46D46CE7.4060202@redhat.com>
Message-ID: <7E3508B66E998043A1A8654D3F72461A60AABA@AUSX3MPC128.aus.amer.dell.com>

The problem was the lack of xattr on the gfs. When we added them it
works correctly. Thank you for the assistance.

Charles McKinnis 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara
Sent: Tuesday, August 28, 2007 1:44 PM
To: linux clustering
Subject: Re: [Linux-cluster] GFS, SELinux denial

Charles_McKinnis at Dell.com wrote:
> I am having issues with a server running gfs and an SELinux error. 
> When /etc/init.d/gfs start or service gfs start is run, it results in 
> a SELinux denial. If mount -a -t gfs is run as root it works fine. The

> scripts also work if setenforce 0 is used. Running setsebool -P
> allow_mount_anyfile=1 does not fix the problem (as seen in sealert), 
> although it is set.


What selinux policy are you using? The policy must be such that gfs (or
gfs2) are declared to support/usr selinux xattrs.


> # cat /etc/fstab
> /dev/VolGroup00/LogVol00 /                       ext3    defaults
> 1 1
> LABEL=/boot             /boot                   ext3    defaults
> 1 2
> devpts                  /dev/pts                devpts  gid=5,mode=620
> 0 0
> tmpfs                   /dev/shm                tmpfs   defaults
> 0 0
> proc                    /proc                   proc    defaults
> 0 0
> sysfs                   /sys                    sysfs   defaults
> 0 0
> /dev/VolGroup00/LogVol01 swap                    swap    defaults
> 0 0
> /dev/hda                /media/cdrecorder       auto
> pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed
0
> 0
> /dev/winchester/array	/opt/winchester		gfs
> rw,localflocks,localcaching,oopses_ok 	0 0
> 
> # /etc/init.d/gfs stop
> Mounting GFS filesystems:  /sbin/mount.gfs: error 13 mounting
> /dev/winchester/array on /opt/winchester
> 
> # tail /var/log/messages
> Aug 28 11:56:24 ronnie-vidrine kernel: Trying to join cluster
> "lock_nolock", "dm-2"
> Aug 28 11:56:24 ronnie-vidrine kernel: Joined cluster. Now mounting
> FS...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Trying
> to acquire journal lock...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0:
Looking
> at journal...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Done
Aug
> 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Trying to
> acquire journal lock...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1:
Looking
> at journal...
> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Done
Aug
> 28 11:56:24 ronnie-vidrine kernel: SELinux: (dev dm-2, type gfs)
> getxattr errno 13
> Aug 28 11:56:26 ronnie-vidrine setroubleshoot:      SELinux prevented
> /sbin/mount.gfs2 from mounting on the file or directory     "/" (type
> "unlabeled_t").      For complete SELinux messages. run sealert -l
> c3fabd9a-3aac-4af4-aa26-300e19aab70e
> 
> # sealert -l c3fabd9a-3aac-4af4-aa26-300e19aab70e
> Summary
>     SELinux prevented /sbin/mount.gfs2 from mounting on the file or
> directory
>     "/" (type "unlabeled_t").
> 
> Detailed Description
>     SELinux prevented /sbin/mount.gfs2 from mounting a filesystem on
the
> file or
>     directory "/" of type "unlabeled_t". By default SELinux limits the
> mounting
>     of filesystems to only some files or directories (those with types
> that have
>     the mountpoint attribute). The type "unlabeled_t" does not have
this
>     attribute. You can either relabel the file or directory or set the
> boolean
>     "allow_mount_anyfile" to true to allow mounting on any file or
> directory.
> 
> Allowing Access
>     Changing the "allow_mount_anyfile" boolean to true will allow this
> access:
>     "setsebool -P allow_mount_anyfile=1."
> 
>     The following command will allow this access:
>     setsebool -P allow_mount_anyfile=1
> 
> Additional Information        
> 
> Source Context                user_u:system_r:mount_t
> Target Context                system_u:object_r:unlabeled_t
> Target Objects                / [ dir ]
> Affected RPM Packages         gfs2-utils-0.1.25-1.el5
>                               [application]filesystem-2.4.0-1 [target]
> Policy RPM                    selinux-policy-2.4.6-30.el5
> Selinux Enabled               True
> Policy Type                   targeted
> MLS Enabled                   True
> Enforcing Mode                Enforcing
> Plugin Name                   plugins.allow_mount_anyfile
> Host Name                     server.net
> Platform                      Linux server.net
>                               2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21
> EST 2007
>                               i686 i686
> Alert Count                   14
> Line Numbers                  
> 
> Raw Audit Messages            
> 
> avc: denied { read } for comm="mount.gfs" dev=dm-2 egid=0 euid=0
> exe="/sbin/mount.gfs2" exit=-13 fsgid=0 fsuid=0 gid=0 items=0 name="/"
> pid=4802 scontext=user_u:system_r:mount_t:s0 sgid=0
> subj=user_u:system_r:mount_t:s0 suid=0 tclass=dir
> tcontext=system_u:object_r:unlabeled_t:s0 tty=pts1 uid=0
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From rohara at redhat.com  Wed Aug 29 17:01:46 2007
From: rohara at redhat.com (Ryan O'Hara)
Date: Wed, 29 Aug 2007 12:01:46 -0500
Subject: [Linux-cluster] GFS, SELinux denial
In-Reply-To: <7E3508B66E998043A1A8654D3F72461A60AABA@AUSX3MPC128.aus.amer.dell.com>
References: <7E3508B66E998043A1A8654D3F72461A60AA9C@AUSX3MPC128.aus.amer.dell.com>	<46D46CE7.4060202@redhat.com>
	<7E3508B66E998043A1A8654D3F72461A60AABA@AUSX3MPC128.aus.amer.dell.com>
Message-ID: <46D5A67A.3040807@redhat.com>

Charles_McKinnis at Dell.com wrote:
> The problem was the lack of xattr on the gfs. When we added them it
> works correctly. Thank you for the assistance.

Excellent. Glad that worked. I'll talk to the right people and see 
if/how we can get gfs or gfs2 added to the core policy as a filesystem 
that supports selinux xattrs. The problem is that there are older 
version of gfs(1) that did not support selinux xattrs, so making that 
change to the selinux policy could potentially break older versions of 
gfs. On the other hand, gfs2 had support for selinux xattrs early on, so 
it shouldn't have this problem.

Let me know if you encounter any other problem with gfs/selinux.

Ryan



> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan O'Hara
> Sent: Tuesday, August 28, 2007 1:44 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS, SELinux denial
> 
> Charles_McKinnis at Dell.com wrote:
>> I am having issues with a server running gfs and an SELinux error. 
>> When /etc/init.d/gfs start or service gfs start is run, it results in 
>> a SELinux denial. If mount -a -t gfs is run as root it works fine. The
> 
>> scripts also work if setenforce 0 is used. Running setsebool -P
>> allow_mount_anyfile=1 does not fix the problem (as seen in sealert), 
>> although it is set.
> 
> 
> What selinux policy are you using? The policy must be such that gfs (or
> gfs2) are declared to support/usr selinux xattrs.
> 
> 
>> # cat /etc/fstab
>> /dev/VolGroup00/LogVol00 /                       ext3    defaults
>> 1 1
>> LABEL=/boot             /boot                   ext3    defaults
>> 1 2
>> devpts                  /dev/pts                devpts  gid=5,mode=620
>> 0 0
>> tmpfs                   /dev/shm                tmpfs   defaults
>> 0 0
>> proc                    /proc                   proc    defaults
>> 0 0
>> sysfs                   /sys                    sysfs   defaults
>> 0 0
>> /dev/VolGroup00/LogVol01 swap                    swap    defaults
>> 0 0
>> /dev/hda                /media/cdrecorder       auto
>> pamconsole,fscontext=system_u:object_r:removable_t,exec,noauto,managed
> 0
>> 0
>> /dev/winchester/array	/opt/winchester		gfs
>> rw,localflocks,localcaching,oopses_ok 	0 0
>>
>> # /etc/init.d/gfs stop
>> Mounting GFS filesystems:  /sbin/mount.gfs: error 13 mounting
>> /dev/winchester/array on /opt/winchester
>>
>> # tail /var/log/messages
>> Aug 28 11:56:24 ronnie-vidrine kernel: Trying to join cluster
>> "lock_nolock", "dm-2"
>> Aug 28 11:56:24 ronnie-vidrine kernel: Joined cluster. Now mounting
>> FS...
>> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Trying
>> to acquire journal lock...
>> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0:
> Looking
>> at journal...
>> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=0: Done
> Aug
>> 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Trying to
>> acquire journal lock...
>> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1:
> Looking
>> at journal...
>> Aug 28 11:56:24 ronnie-vidrine kernel: GFS: fsid=dm-2.0: jid=1: Done
> Aug
>> 28 11:56:24 ronnie-vidrine kernel: SELinux: (dev dm-2, type gfs)
>> getxattr errno 13
>> Aug 28 11:56:26 ronnie-vidrine setroubleshoot:      SELinux prevented
>> /sbin/mount.gfs2 from mounting on the file or directory     "/" (type
>> "unlabeled_t").      For complete SELinux messages. run sealert -l
>> c3fabd9a-3aac-4af4-aa26-300e19aab70e
>>
>> # sealert -l c3fabd9a-3aac-4af4-aa26-300e19aab70e
>> Summary
>>     SELinux prevented /sbin/mount.gfs2 from mounting on the file or
>> directory
>>     "/" (type "unlabeled_t").
>>
>> Detailed Description
>>     SELinux prevented /sbin/mount.gfs2 from mounting a filesystem on
> the
>> file or
>>     directory "/" of type "unlabeled_t". By default SELinux limits the
>> mounting
>>     of filesystems to only some files or directories (those with types
>> that have
>>     the mountpoint attribute). The type "unlabeled_t" does not have
> this
>>     attribute. You can either relabel the file or directory or set the
>> boolean
>>     "allow_mount_anyfile" to true to allow mounting on any file or
>> directory.
>>
>> Allowing Access
>>     Changing the "allow_mount_anyfile" boolean to true will allow this
>> access:
>>     "setsebool -P allow_mount_anyfile=1."
>>
>>     The following command will allow this access:
>>     setsebool -P allow_mount_anyfile=1
>>
>> Additional Information        
>>
>> Source Context                user_u:system_r:mount_t
>> Target Context                system_u:object_r:unlabeled_t
>> Target Objects                / [ dir ]
>> Affected RPM Packages         gfs2-utils-0.1.25-1.el5
>>                               [application]filesystem-2.4.0-1 [target]
>> Policy RPM                    selinux-policy-2.4.6-30.el5
>> Selinux Enabled               True
>> Policy Type                   targeted
>> MLS Enabled                   True
>> Enforcing Mode                Enforcing
>> Plugin Name                   plugins.allow_mount_anyfile
>> Host Name                     server.net
>> Platform                      Linux server.net
>>                               2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:21
>> EST 2007
>>                               i686 i686
>> Alert Count                   14
>> Line Numbers                  
>>
>> Raw Audit Messages            
>>
>> avc: denied { read } for comm="mount.gfs" dev=dm-2 egid=0 euid=0
>> exe="/sbin/mount.gfs2" exit=-13 fsgid=0 fsuid=0 gid=0 items=0 name="/"
>> pid=4802 scontext=user_u:system_r:mount_t:s0 sgid=0
>> subj=user_u:system_r:mount_t:s0 suid=0 tclass=dir
>> tcontext=system_u:object_r:unlabeled_t:s0 tty=pts1 uid=0
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From dwalgamo at gmail.com  Wed Aug 29 17:02:04 2007
From: dwalgamo at gmail.com (David Walgamotte)
Date: Wed, 29 Aug 2007 12:02:04 -0500
Subject: [Linux-cluster] cluster help pleaz
Message-ID: <77ad9a6b0708291002y4385f095jc4bcd30530223035@mail.gmail.com>

Installed the cluster software on Redhat es 5. When I start the cluster both
systems drop of the network. Not sure what I'm doing wrong and help is much
appreciated.

cluster.conf

<?xml version="1.0"?>
<cluster alias="xpance_cluster" config_version="4"
name="app_cluster"><fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="30"/><clusternodes><clusternode name="webserver1.domain.com"
nodeid="1" votes="1"><fence/></clusternode><clusternode name="
webserver2.domain.com" nodeid="2"
votes="1"><fence/></clusternode></clusternodes><cman expected_votes="1"
two_node="1"/><fencedevices/><rm><failoverdomains><failoverdomain name="
app.domain.com" ordered="1" restricted="1"><failoverdomainnode name="
webserver1.domain.com" priority="1"/><failoverdomainnode name="
webserver2.domain.com"
priority="2"/></failoverdomain></failoverdomains><resources><ip address="
192.168.1.100" monitor_link="1"/><ip address="192.168.1.102"
monitor_link="1"/><script file="/etc/rc.d/init.d/xpancenet"
name="xpancenet"/><fs device="/dev/VolGroup00/LogVol04" force_fsck="0"
force_unmount="0" fsid="8603" fstype="ext3" mountpoint="/usr/local"
name="app_filesys" options="" self_fence="0"/><fs
device="/dev/VolGroup00/LogVol02" force_fsck="0" force_unmount="0"
fsid="11465" fstype="ext3" mountpoint="/usr/local" name="appfs2"
self_fence="0"/></resources><service autostart="1" domain="app.domain.com"
exclusive="0" name="xpancenet"><ip ref="192.168.1.100"/><ip ref="
192.168.1.102"/><script ref="app"/><fs ref="app_filesys"/><fs
ref="xnetfs2"/></service></rm></cluster>

-- 
Thank You, Regards,
Dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070829/c3b39dce/attachment.htm>

From kadlec at sunserv.kfki.hu  Wed Aug 29 21:07:48 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Wed, 29 Aug 2007 23:07:48 +0200 (MEST)
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <BC16657A85D83746AFF99A42587CC6BC07C77F7B@exchange2003.tasconline.com>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com><Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com><Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708291549400.22113@sunserv.kfki.hu>
	<BC16657A85D83746AFF99A42587CC6BC07C77F7B@exchange2003.tasconline.com>
Message-ID: <Pine.GSO.4.64.0708292307130.24893@sunserv.kfki.hu>

On Wed, 29 Aug 2007, Jeremy Carroll wrote:

> I would.

You were right, the recompiling helped. Thanks a lot!

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From kadlec at sunserv.kfki.hu  Thu Aug 30 12:03:03 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Thu, 30 Aug 2007 14:03:03 +0200 (MEST)
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <Pine.GSO.4.64.0708292307130.24893@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com><Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com><Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708291549400.22113@sunserv.kfki.hu>
	<BC16657A85D83746AFF99A42587CC6BC07C77F7B@exchange2003.tasconline.com>
	<Pine.GSO.4.64.0708292307130.24893@sunserv.kfki.hu>
Message-ID: <Pine.GSO.4.64.0708301346550.17235@sunserv.kfki.hu>

On Wed, 29 Aug 2007, Kadlecsik Jozsi wrote:

> On Wed, 29 Aug 2007, Jeremy Carroll wrote:
> 
> > I would.
> 
> You were right, the recompiling helped. Thanks a lot!

This is simply ridiculous: after rebooting one of the machines, the same 
behaviour appeared again: clvmd seems to wait for something indefinitely, 
thus vgchange cannot proceed. Nothing interesting when clvmd started with 
the debugging enabled. 

lvm.conf contains the lines you suggested. I don't have lvmconf command.
What could I try to do to solve the problem?

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From pcaulfie at redhat.com  Thu Aug 30 12:10:32 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 30 Aug 2007 13:10:32 +0100
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <Pine.GSO.4.64.0708301346550.17235@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com><Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com><Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>	<Pine.GSO.4.64.0708291549400.22113@sunserv.kfki.hu>	<BC16657A85D83746AFF99A42587CC6BC07C77F7B@exchange2003.tasconline.com>	<Pine.GSO.4.64.0708292307130.24893@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708301346550.17235@sunserv.kfki.hu>
Message-ID: <46D6B3B8.4090308@redhat.com>

Kadlecsik Jozsi wrote:
> On Wed, 29 Aug 2007, Kadlecsik Jozsi wrote:
> 
>> On Wed, 29 Aug 2007, Jeremy Carroll wrote:
>>
>>> I would.
>> You were right, the recompiling helped. Thanks a lot!
> 
> This is simply ridiculous: after rebooting one of the machines, the same 
> behaviour appeared again: clvmd seems to wait for something indefinitely, 
> thus vgchange cannot proceed. Nothing interesting when clvmd started with 
> the debugging enabled. 

What do you define as "nothing interesting". It might not win a pulitzer prize
but it could be an important clue ...

It might also be worth enabling LVM debugging in /etc/lvm.conf. If you have
clvmd from 2.02.28 then those values will also propogate to clvmd.

Patrick



From kadlec at sunserv.kfki.hu  Thu Aug 30 12:28:04 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Thu, 30 Aug 2007 14:28:04 +0200 (MEST)
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <46D6B3B8.4090308@redhat.com>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com><Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com><Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708291549400.22113@sunserv.kfki.hu>
	<BC16657A85D83746AFF99A42587CC6BC07C77F7B@exchange2003.tasconline.com>
	<Pine.GSO.4.64.0708292307130.24893@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708301346550.17235@sunserv.kfki.hu>
	<46D6B3B8.4090308@redhat.com>
Message-ID: <Pine.GSO.4.64.0708301414240.17908@sunserv.kfki.hu>

On Thu, 30 Aug 2007, Patrick Caulfield wrote:

> > This is simply ridiculous: after rebooting one of the machines, the same 
> > behaviour appeared again: clvmd seems to wait for something indefinitely, 
> > thus vgchange cannot proceed. Nothing interesting when clvmd started with 
> > the debugging enabled. 
> 
> What do you define as "nothing interesting". It might not win a pulitzer 
> prize but it could be an important clue ...

root at web1:~# clvmd -d
CLVMD[b7dec6c0]: Aug 30 13:39:15 CLVMD started
CLVMD[b7dec6c0]: Aug 30 13:39:15 Connected to CMAN
CLVMD[b7dec6c0]: Aug 30 13:39:19 CMAN initialisation complete
 
> It might also be worth enabling LVM debugging in /etc/lvm.conf. If you have
> clvmd from 2.02.28 then those values will also propogate to clvmd.

Nothing more emitted by clvmd. But lvm generated some log, I attached the 
file (level = 7).

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary
-------------- next part --------------
config/config.c:960   log/activation not found in config: defaulting to 0
commands/toolcontext.c:145   Logging initialised at Thu Aug 30 14:22:56 2007
config/config.c:955   Setting global/umask to 63
commands/toolcontext.c:164   Set umask to 0077
config/config.c:932   Setting devices/dir to /dev
config/config.c:932   Setting global/proc to /proc
config/config.c:955   Setting global/activation to 1
config/config.c:960   global/suffix not found in config: defaulting to 1
config/config.c:932   Setting global/units to h
device/dev-cache.c:497   devices/preferred_names not found in config file: using built-in preferences
regex/matcher.c:262   Matcher built with 3 dfa states
config/config.c:955   Setting devices/ignore_suspended_devices to 0
config/config.c:932   Setting devices/cache_dir to /etc/lvm/cache
config/config.c:955   Setting devices/write_cache_state to 1
device/dev-io.c:440   Opened /etc/lvm/cache/.cache RO
device/dev-cache.c:264   /dev/system/services: Added to device cache
device/dev-cache.c:264   /dev/ram11: Added to device cache
device/dev-cache.c:264   /dev/system/var: Added to device cache
device/dev-cache.c:264   /dev/disk/by-path/pci-0000:02:03.0-scsi-0:0:0:0-part3: Added to device cache
device/dev-cache.c:264   /dev/ram10: Added to device cache
device/dev-cache.c:264   /dev/disk/by-path/pci-0000:02:03.0-scsi-0:0:0:0-part2: Added to device cache
device/dev-cache.c:264   /dev/disk/by-uuid/a7c5defb-2e80-414f-8dc1-0cf7fad0c6c6: Added to device cache
device/dev-cache.c:264   /dev/system/root: Added to device cache
device/dev-cache.c:261   /dev/mapper/system-services: Aliased to /dev/system/services in device cache
device/dev-cache.c:264   /dev/ram12: Added to device cache
device/dev-cache.c:264   /dev/disk/by-id/scsi-1AMCC_DL713790FFF4290013DC-part1: Added to device cache
device/dev-cache.c:261   /dev/disk/by-path/pci-0000:02:03.0-scsi-0:0:1:0-part1: Aliased to /dev/disk/by-id/scsi-1AMCC_DL713790FFF4290013DC-part1 in device cache
device/dev-cache.c:264   /dev/ram6: Added to device cache
device/dev-cache.c:261   /dev/mapper/system-var: Aliased to /dev/system/var in device cache
device/dev-cache.c:264   /dev/ram13: Added to device cache
device/dev-cache.c:264   /dev/ram14: Added to device cache
device/dev-cache.c:261   /dev/disk/by-path/pci-0000:02:03.0-scsi-0:0:0:0-part1: Aliased to /dev/disk/by-uuid/a7c5defb-2e80-414f-8dc1-0cf7fad0c6c6 in device cache (preferred name)
device/dev-cache.c:264   /dev/ram5: Added to device cache
device/dev-cache.c:264   /dev/ram1: Added to device cache
device/dev-cache.c:261   /dev/sda1: Aliased to /dev/disk/by-path/pci-0000:02:03.0-scsi-0:0:0:0-part1 in device cache (preferred name)
device/dev-cache.c:261   /dev/sdb1: Aliased to /dev/disk/by-id/scsi-1AMCC_DL713790FFF4290013DC-part1 in device cache (preferred name)
device/dev-cache.c:261   /dev/disk/by-id/scsi-1AMCC_UL500727FFF4290013F7-part1: Aliased to /dev/sda1 in device cache
device/dev-cache.c:264   /dev/ram0: Added to device cache
device/dev-cache.c:261   /dev/mapper/system-root: Aliased to /dev/system/root in device cache
device/dev-cache.c:261   /dev/disk/by-id/scsi-1AMCC_UL500727FFF4290013F7-part3: Aliased to /dev/disk/by-path/pci-0000:02:03.0-scsi-0:0:0:0-part3 in device cache (preferred name)
device/dev-cache.c:264   /dev/ram2: Added to device cache
device/dev-cache.c:261   /dev/sda2: Aliased to /dev/disk/by-path/pci-0000:02:03.0-scsi-0:0:0:0-part2 in device cache (preferred name)
device/dev-cache.c:264   /dev/ram8: Added to device cache
device/dev-cache.c:264   /dev/ram9: Added to device cache
device/dev-cache.c:264   /dev/ram15: Added to device cache
device/dev-cache.c:261   /dev/disk/by-id/scsi-1AMCC_UL500727FFF4290013F7-part2: Aliased to /dev/sda2 in device cache
device/dev-cache.c:264   /dev/ram3: Added to device cache
device/dev-cache.c:261   /dev/sda3: Aliased to /dev/disk/by-id/scsi-1AMCC_UL500727FFF4290013F7-part3 in device cache (preferred name)
device/dev-cache.c:264   /dev/ram4: Added to device cache
device/dev-cache.c:264   /dev/ram7: Added to device cache
filters/filter-persistent.c:132   Loaded persistent filter cache from /etc/lvm/cache/.cache
device/dev-io.c:486   Closed /etc/lvm/cache/.cache
config/config.c:955   Setting activation/reserved_stack to 256
config/config.c:955   Setting activation/reserved_memory to 8192
config/config.c:955   Setting activation/process_priority to -18
format1/format1.c:572   Initialised format: lvm1
format_pool/format_pool.c:358   Initialised format: pool
format_text/format-text.c:1993   Initialised format: lvm2
config/config.c:938   global/format not found in config: defaulting to lvm2
striped/striped.c:228   Initialised segtype: striped
zero/zero.c:110   Initialised segtype: zero
error/errseg.c:110   Initialised segtype: error
snapshot/snapshot.c:179   Initialised segtype: snapshot
mirror/mirrored.c:562   Initialised segtype: mirror
config/config.c:955   Setting backup/retain_days to 30
config/config.c:955   Setting backup/retain_min to 10
config/config.c:932   Setting backup/archive_dir to /etc/lvm/archive
config/config.c:932   Setting backup/backup_dir to /etc/lvm/backup
config/config.c:960   global/fallback_to_lvm1 not found in config: defaulting to 0
lvmcmdline.c:843   Parsing: vgchange -aly
lvmcmdline.c:871 vgchange  Processing: vgchange -aly
lvmcmdline.c:874 vgchange  O_DIRECT will be used
config/config.c:955 vgchange  Setting global/locking_type to 3
locking/locking.c:244 vgchange  Cluster locking selected.
toollib.c:571 vgchange  Finding all volume groups
device/dev-io.c:390 vgchange  /dev/ram0: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram0 RO
device/dev-io.c:134 vgchange  /dev/ram0: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram0: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram0
device/dev-io.c:440 vgchange  Opened /dev/system/root RO O_DIRECT
device/dev-io.c:134 vgchange  /dev/system/root: block size is 4096 bytes
label/label.c:185 vgchange  /dev/system/root: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/system/root
device/dev-io.c:390 vgchange  /dev/ram1: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram1 RO
device/dev-io.c:134 vgchange  /dev/ram1: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram1: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram1
device/dev-io.c:440 vgchange  Opened /dev/sda1 RO O_DIRECT
device/dev-io.c:134 vgchange  /dev/sda1: block size is 1024 bytes
label/label.c:185 vgchange  /dev/sda1: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/sda1
device/dev-io.c:440 vgchange  Opened /dev/system/var RO O_DIRECT
device/dev-io.c:134 vgchange  /dev/system/var: block size is 4096 bytes
label/label.c:185 vgchange  /dev/system/var: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/system/var
device/dev-io.c:390 vgchange  /dev/ram2: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram2 RO
device/dev-io.c:134 vgchange  /dev/ram2: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram2: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram2
device/dev-io.c:440 vgchange  Opened /dev/sda2 RO O_DIRECT
device/dev-io.c:134 vgchange  /dev/sda2: block size is 4096 bytes
label/label.c:185 vgchange  /dev/sda2: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/sda2
device/dev-io.c:440 vgchange  Opened /dev/system/services RO O_DIRECT
device/dev-io.c:134 vgchange  /dev/system/services: block size is 4096 bytes
label/label.c:185 vgchange  /dev/system/services: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/system/services
device/dev-io.c:390 vgchange  /dev/ram3: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram3 RO
device/dev-io.c:134 vgchange  /dev/ram3: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram3: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram3
device/dev-io.c:440 vgchange  Opened /dev/sda3 RO O_DIRECT
device/dev-io.c:134 vgchange  /dev/sda3: block size is 512 bytes
label/label.c:162 vgchange  /dev/sda3: lvm2 label detected
cache/lvmcache.c:655 vgchange  lvmcache: /dev/sda3: now orphaned
format_text/format-text.c:1119 vgchange  /dev/sda3: Found metadata at 8192 size 1301 for system (ao6yoM-YOvN-YLUt-fsKC-Q3XK-L1QW-JM7zQy)
cache/lvmcache.c:655 vgchange  lvmcache: /dev/sda3: now in VG system
cache/lvmcache.c:468 vgchange  lvmcache: /dev/sda3: setting system VGID to ao6yoMYOvNYLUtfsKCQ3XKL1QWJM7zQy
cache/lvmcache.c:690 vgchange  lvmcache: /dev/sda3: VG system: Set creation host to saturn.
device/dev-io.c:486 vgchange  Closed /dev/sda3
device/dev-io.c:390 vgchange  /dev/ram4: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram4 RO
device/dev-io.c:134 vgchange  /dev/ram4: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram4: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram4
device/dev-io.c:390 vgchange  /dev/ram5: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram5 RO
device/dev-io.c:134 vgchange  /dev/ram5: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram5: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram5
device/dev-io.c:390 vgchange  /dev/ram6: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram6 RO
device/dev-io.c:134 vgchange  /dev/ram6: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram6: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram6
device/dev-io.c:390 vgchange  /dev/ram7: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram7 RO
device/dev-io.c:134 vgchange  /dev/ram7: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram7: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram7
device/dev-io.c:390 vgchange  /dev/ram8: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram8 RO
device/dev-io.c:134 vgchange  /dev/ram8: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram8: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram8
device/dev-io.c:390 vgchange  /dev/ram9: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram9 RO
device/dev-io.c:134 vgchange  /dev/ram9: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram9: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram9
device/dev-io.c:390 vgchange  /dev/ram10: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram10 RO
device/dev-io.c:134 vgchange  /dev/ram10: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram10: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram10
device/dev-io.c:390 vgchange  /dev/ram11: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram11 RO
device/dev-io.c:134 vgchange  /dev/ram11: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram11: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram11
device/dev-io.c:390 vgchange  /dev/ram12: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram12 RO
device/dev-io.c:134 vgchange  /dev/ram12: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram12: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram12
device/dev-io.c:390 vgchange  /dev/ram13: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram13 RO
device/dev-io.c:134 vgchange  /dev/ram13: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram13: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram13
device/dev-io.c:390 vgchange  /dev/ram14: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram14 RO
device/dev-io.c:134 vgchange  /dev/ram14: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram14: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram14
device/dev-io.c:390 vgchange  /dev/ram15: Not using O_DIRECT
device/dev-io.c:440 vgchange  Opened /dev/ram15 RO
device/dev-io.c:134 vgchange  /dev/ram15: block size is 1024 bytes
label/label.c:185 vgchange  /dev/ram15: No label detected
label/label.c:284 vgchange  <backtrace>
device/dev-io.c:486 vgchange  Closed /dev/ram15
device/dev-io.c:440 vgchange  Opened /dev/sdb1 RO O_DIRECT
device/dev-io.c:134 vgchange  /dev/sdb1: block size is 512 bytes
label/label.c:162 vgchange  /dev/sdb1: lvm2 label detected
cache/lvmcache.c:655 vgchange  lvmcache: /dev/sdb1: now orphaned
format_text/format-text.c:1119 vgchange  /dev/sdb1: Found metadata at 39936 size 1068 for data (o7IRHo-V2po-At56-bK1t-kY8v-kJpt-2s1VSF)
cache/lvmcache.c:655 vgchange  lvmcache: /dev/sdb1: now in VG data
cache/lvmcache.c:468 vgchange  lvmcache: /dev/sdb1: setting data VGID to o7IRHoV2poAt56bK1tkY8vkJpt2s1VSF
cache/lvmcache.c:690 vgchange  lvmcache: /dev/sdb1: VG data: Set creation host to saturn.
device/dev-io.c:486 vgchange  Closed /dev/sdb1
locking/cluster_locking.c:408 vgchange  Locking V_data at 0x1

From pcaulfie at redhat.com  Thu Aug 30 12:58:31 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 30 Aug 2007 13:58:31 +0100
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <Pine.GSO.4.64.0708301414240.17908@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com><Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com><Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>	<Pine.GSO.4.64.0708291549400.22113@sunserv.kfki.hu>	<BC16657A85D83746AFF99A42587CC6BC07C77F7B@exchange2003.tasconline.com>	<Pine.GSO.4.64.0708292307130.24893@sunserv.kfki.hu>	<Pine.GSO.4.64.0708301346550.17235@sunserv.kfki.hu>	<46D6B3B8.4090308@redhat.com>
	<Pine.GSO.4.64.0708301414240.17908@sunserv.kfki.hu>
Message-ID: <46D6BEF7.9070905@redhat.com>

Kadlecsik Jozsi wrote:
> On Thu, 30 Aug 2007, Patrick Caulfield wrote:
> 
>>> This is simply ridiculous: after rebooting one of the machines, the same 
>>> behaviour appeared again: clvmd seems to wait for something indefinitely, 
>>> thus vgchange cannot proceed. Nothing interesting when clvmd started with 
>>> the debugging enabled. 
>> What do you define as "nothing interesting". It might not win a pulitzer 
>> prize but it could be an important clue ...
> 
> root at web1:~# clvmd -d
> CLVMD[b7dec6c0]: Aug 30 13:39:15 CLVMD started
> CLVMD[b7dec6c0]: Aug 30 13:39:15 Connected to CMAN
> CLVMD[b7dec6c0]: Aug 30 13:39:19 CMAN initialisation complete
>  
>> It might also be worth enabling LVM debugging in /etc/lvm.conf. If you have
>> clvmd from 2.02.28 then those values will also propogate to clvmd.
> 
> Nothing more emitted by clvmd. But lvm generated some log, I attached the 
> file (level = 7).
> 


So it's waiting for a DLM lock for the VG "data".

At this point it could be

a) another node is holding a lock or
b) A DLM bug, probably similar or identical to
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238490


for a)

# mount -t debugfs debugfs /debug
# cat /debug/dlm/clvmd

and look for that lock and see which node is holding it.

for b)

Check the kernel you have and perhaps apply the patch shown in that bugzilla
comments 18 or 19.


Patrick



From chris at cmiware.com  Thu Aug 30 15:50:08 2007
From: chris at cmiware.com (Chris Harms)
Date: Thu, 30 Aug 2007 10:50:08 -0500
Subject: [Linux-cluster] modclusterd memory leak
In-Reply-To: <20070815155956.GA11255@redhat.com>
References: <46C1C7EB.8090300@cmiware.com> <20070815155956.GA11255@redhat.com>
Message-ID: <46D6E730.7030205@cmiware.com>

As an update, if I stop modclusterd on one node, the other node's memory 
usage seems to reset to 0% utilization according to top without having 
to restart on both nodes.

Chris

Ryan McCabe wrote:
> On Tue, Aug 14, 2007 at 10:19:07AM -0500, Chris Harms wrote:
>   
>> We installed the 5.1 Beta RPMs of the cluster suite and have left our 
>> cluster running unfettered for over a week. It now appears modclusterd 
>> has a slow memory leak.  Its consuming 1.5% and climbing of our 16GB of 
>> RAM which is up from 1.3% yesterday.  I would be happy to do some tests 
>> and send along the results.  Please advise.
>>     
>
> Hi,
>
> Thanks for the report. I've created a bugzilla ticket to track this
> here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=252357
>
>
> Ryan
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   



From Robert.Hell at fabasoft.com  Thu Aug 30 17:43:50 2007
From: Robert.Hell at fabasoft.com (Hell, Robert)
Date: Thu, 30 Aug 2007 19:43:50 +0200
Subject: [Linux-cluster] postgres-8 resource
Message-ID: <B710F3299F04664DB6B37C258FDEEB94C6509D@FABAMAIL.fabagl.fabasoft.com>

Hi,

 

I have some trouble when using the PostgreSQL 8.2.4 with the provided
postgres-8 OCF-Script from cvs.

 

Has anyone successfully started postgres with as postgres-8 resource
yet?

 

One problem is that listen_address is inserted in the config-file - this
should be listen_addresses.

After fixing this, I'm still having problems:

Aug 30 19:37:06 pg-ba-001 clurgmgrd: [31089]: <err> Generating New
Config File
/etc/cluster/postgres-8/postgres-8:postgresql_vts1/postgresql.conf From
/data1/db1/pgdata/postgresql.conf

Aug 30 19:37:06 pg-ba-001 clurgmgrd: [31089]: <err> Generating New
Config File
/etc/cluster/postgres-8/postgres-8:postgresql_vts1/postgresql.conf From
/data1/db1/pgdata/postgresql.conf > Succeed

Aug 30 19:37:06 pg-ba-001 clurgmgrd: [31089]: <err> Trying to execute
sudo -u postgres /usr/bin/postmaster -c
config_file=/etc/cluster/postgres-8/postgres-8:postgresql_vts1/postgresq
l.conf -> some debugging, works fine when executed manually

Aug 30 19:37:06 pg-ba-001 clurgmgrd: [31089]: <err> Starting Service
postgres-8:postgresql_vts1 > Failed

Aug 30 19:37:06 pg-ba-001 clurgmgrd[31089]: <notice> start on postgres-8
"postgresql_vts1" returned 1 (generic error)

Aug 30 19:37:06 pg-ba-001 clurgmgrd[31089]: <warning> #68: Failed to
start service:pg-ba-vts1; return value: 1

 

Any ideas how to determine why it won't start?

 

Best regards,

Robert

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070830/7d812d04/attachment.htm>

From slacker.ar at gmail.com  Thu Aug 30 19:36:50 2007
From: slacker.ar at gmail.com (Dario Hernan)
Date: Thu, 30 Aug 2007 16:36:50 -0300
Subject: [Linux-cluster] https cluster
Message-ID: <9bc19ef30708301236q1fb0dcfepa7019a877ebf08b6@mail.gmail.com>

Hi all, I have two RHEL5 in cluster, there are running some services
and I'm going to add a webapp with SSL, (https) I'd like if you can
give me some advice about it.
If I have to take care with something, etc...

Thanks in advance!
Dario
P.D: please excuse my poor english



From david.costakos at gmail.com  Thu Aug 30 23:45:24 2007
From: david.costakos at gmail.com (Dave Costakos)
Date: Thu, 30 Aug 2007 16:45:24 -0700
Subject: [Linux-cluster] qdisk votes not in cman
Message-ID: <6b6836c60708301645r6933957dx43066c5af4b8ef1f@mail.gmail.com>

I have a cluster running RHEL5 U1 beta with two nodes and a quorum disk (at
least it's in the cluster.conf file).  However, when I run cman_tool nodes
and cman_tool status, I don't see the qdisk votes represented.  I've
confirmed that qdiskd is indeed running.  I'd appreciate it if someone could
help point me to my mistake here.

Thanks in advance.

Relevant info (too much??)

uname -rm: 2.6.18-36.el5xen x86_64
/etc/redhat-release: Red Hat Enterprise Linux Server release 5.1 Beta
(Tikanga)
rpm -q cman: cman-2.0.70-1.el5
CMAN/Qdisk Info:
[root at test-ibm-3650 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M   2256   2007-08-30 16:36:38  test-ibm-3650.qualcomm.com
   2   M   2264   2007-08-30 16:36:38  test-ibm-3550.qualcomm.com
[root at test-ibm-3650 ~]# cman_tool status
Version: 6.0.1
Config Version: 18
Cluster Name: xen_cluster
Cluster Id: 31668
Cluster Member: Yes
Cluster Generation: 2264
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Quorum: 2
Active subsystems: 9
Flags:
Ports Bound: 0 11 177
Node name: test-ibm-3650.qualcomm.com
Node ID: 1
Multicast addresses: 239.192.123.48
Node addresses: 10.45.5.7
[root at test-ibm-3650 ~]# service qdiskd status
qdiskd (pid 8994) is running...

syslog qdisk messages:
Aug 30 16:37:06 test-ibm-3650 qdiskd[8852]: <info> Quorum Daemon
Initializing
Aug 30 16:37:07 test-ibm-3650 qdiskd[8852]: <info> Heuristic: '/bin/ping
test-ibm-3550.qualcomm.com -c1 -t1' UP
Aug 30 16:37:07 test-ibm-3650 qdiskd[8852]: <info> Heuristic: '/bin/ping
test-ibm-3650.qualcomm.com -c1 -t1' UP
Aug 30 16:37:26 test-ibm-3650 qdiskd[8852]: <info> Initial score 2/2
Aug 30 16:37:26 test-ibm-3650 qdiskd[8852]: <info> Initialization complete
Aug 30 16:37:26 test-ibm-3650 qdiskd[8852]: <notice> Score sufficient for
master operation (2/2; required=1); upgrading
Aug 30 16:37:38 test-ibm-3650 qdiskd[8852]: <info> Assuming master role
Aug 30 16:38:11 test-ibm-3650 qdiskd[8994]: <info> Quorum Daemon
Initializing
Aug 30 16:38:12 test-ibm-3650 qdiskd[8994]: <info> Heuristic: '/bin/ping
test-ibm-3550.qualcomm.com -c1 -t1' UP
Aug 30 16:38:12 test-ibm-3650 qdiskd[8994]: <info> Heuristic: '/bin/ping
test-ibm-3650.qualcomm.com -c1 -t1' UP
Aug 30 16:38:32 test-ibm-3650 qdiskd[8994]: <info> Initial score 2/2
Aug 30 16:38:32 test-ibm-3650 qdiskd[8994]: <info> Initialization complete
Aug 30 16:38:32 test-ibm-3650 qdiskd[8994]: <notice> Score sufficient for
master operation (2/2; required=1); upgrading
Aug 30 16:38:44 test-ibm-3650 qdiskd[8994]: <info> Assuming master role

cluster.conf


<?xml version="1.0" ?>
<cluster alias="xen_cluster" config_version="20" name="xen_cluster">
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="20"/>
        <clusternodes>
                <clusternode name="test-ibm-3650.qualcomm.com" nodeid="1"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsa-test-ibm-3650"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="test-ibm-3550.qualcomm.com" nodeid="2"
votes="1">
                        <fence>
                                <method name="1">
                                        <device name="rsa-test-ibm-3550"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_rsa" hostname="rsa-test-ibm-3650"
ipaddr="rsa-test-ibm-3650" login="Administrator" name="rsa-test-ibm-3650"
passwd="R0cknR011"/>
                <fencedevice agent="fence_rsa" hostname="rsa-test-ibm-3550"
ipaddr="rsa-test-ibm-3550" login="Administrator" name="rsa-test-ibm-3550"
passwd="R0cknR011"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="vm_failover" ordered="0"
restricted="0">
                                <failoverdomainnode name="
test-ibm-3650.qualcomm.com" priority="1"/>
                                <failoverdomainnode name="
test-ibm-3550.qualcomm.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <clusterfs device="/dev/cluster_vg/xen_volume"
force_unmount="0" fsid="13127" fstype="gfs" mountpoint="/xen" name="xen"/>
                </resources>
                <vm autostart="1" domain="vm_failover" exclusive="0"
name="dave-xen1" path="/xen/vm-config" recovery="restart"/>
                <vm autostart="1" domain="vm_failover" exclusive="0"
name="dave-xen2" path="/xen/vm-config:/etc/xen" recovery="relocate"/>
        </rm>
        <quorumd device="/dev/mpath/1HITACHI_750403220023p1" interval="2"
log_level="7" min_score="1" tko="10" votes="2">
                <heuristic interval="2" program="/bin/ping
test-ibm-3550.qualcomm.com -c1 -t1" score="1"/>
                <heuristic interval="2" program="/bin/ping
test-ibm-3650.qualcomm.com -c1 -t1" score="1"/>
        </quorumd>
</cluster>




-- 
Dave Costakos
mailto:david.costakos at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070830/dd128899/attachment.htm>

From Robert.Hell at fabasoft.com  Fri Aug 31 06:52:12 2007
From: Robert.Hell at fabasoft.com (Hell, Robert)
Date: Fri, 31 Aug 2007 08:52:12 +0200
Subject: [Linux-cluster] qdisk votes not in cman
In-Reply-To: <6b6836c60708301645r6933957dx43066c5af4b8ef1f@mail.gmail.com>
References: <6b6836c60708301645r6933957dx43066c5af4b8ef1f@mail.gmail.com>
Message-ID: <B710F3299F04664DB6B37C258FDEEB94C650E5@FABAMAIL.fabagl.fabasoft.com>

Hi,

 

for me it seems that qdiskd cannot successfully register in cman.

 

I think I found the reason for your problem in the source (libcman.c):

#define MAX_CLUSTER_NAME_LEN           16

...

int cman_register_quorum_device(cman_handle_t handle, char *name, int
votes)

{

        struct cman_handle *h = (struct cman_handle *)handle;

        char buf[strlen(name)+1 + sizeof(int)];

        VALIDATE_HANDLE(h);

 

        if (strlen(name) > MAX_CLUSTER_NAME_LEN)

        {

                errno = EINVAL;

                return -1;

        }

 

        memcpy(buf, &votes, sizeof(int));

        strcpy(buf+sizeof(int), name);

        return info_call(h, CMAN_CMD_REG_QUORUMDEV, buf,
strlen(name)+sizeof(int), NULL, 0);

}

 

==>     try using a shorter qdiskd device name than
"/dev/mpath/1HITACHI_750403220023p1" (< 16 chars).

 

Regards,

Robert

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dave Costakos
Sent: Freitag, 31. August 2007 01:45
To: linux-cluster at redhat.com
Subject: [Linux-cluster] qdisk votes not in cman

 


I have a cluster running RHEL5 U1 beta with two nodes and a quorum disk
(at least it's in the cluster.conf file).  However, when I run cman_tool
nodes and cman_tool status, I don't see the qdisk votes represented.
I've confirmed that qdiskd is indeed running.  I'd appreciate it if
someone could help point me to my mistake here. 

Thanks in advance.

Relevant info (too much??)

uname -rm: 2.6.18-36.el5xen x86_64
/etc/redhat-release: Red Hat Enterprise Linux Server release 5.1 Beta
(Tikanga)
rpm -q cman: cman-2.0.70-1.el5
CMAN/Qdisk Info:
[root at test-ibm-3650 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M   2256   2007-08-30 16:36:38  test-ibm-3650.qualcomm.com 
   2   M   2264   2007-08-30 16:36:38  test-ibm-3550.qualcomm.com
[root at test-ibm-3650 ~]# cman_tool status
Version: 6.0.1
Config Version: 18
Cluster Name: xen_cluster 
Cluster Id: 31668
Cluster Member: Yes
Cluster Generation: 2264
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Quorum: 2  
Active subsystems: 9
Flags: 
Ports Bound: 0 11 177  
Node name: test-ibm-3650.qualcomm.com
Node ID: 1
Multicast addresses: 239.192.123.48 
Node addresses: 10.45.5.7 
[root at test-ibm-3650 ~]# service qdiskd status
qdiskd (pid 8994) is running...

syslog qdisk messages:
Aug 30 16:37:06 test-ibm-3650 qdiskd[8852]: <info> Quorum Daemon
Initializing 
Aug 30 16:37:07 test-ibm-3650 qdiskd[8852]: <info> Heuristic: '/bin/ping
test-ibm-3550.qualcomm.com -c1 -t1' UP 
Aug 30 16:37:07 test-ibm-3650 qdiskd[8852]: <info> Heuristic: '/bin/ping
test-ibm-3650.qualcomm.com -c1 -t1' UP 
Aug 30 16:37:26 test-ibm-3650 qdiskd[8852]: <info> Initial score 2/2 
Aug 30 16:37:26 test-ibm-3650 qdiskd[8852]: <info> Initialization
complete 
Aug 30 16:37:26 test-ibm-3650 qdiskd[8852]: <notice> Score sufficient
for master operation (2/2; required=1); upgrading 
Aug 30 16:37:38 test-ibm-3650 qdiskd[8852]: <info> Assuming master role 
Aug 30 16:38:11 test-ibm-3650 qdiskd[8994]: <info> Quorum Daemon
Initializing 
Aug 30 16:38:12 test-ibm-3650 qdiskd[8994]: <info> Heuristic: '/bin/ping
test-ibm-3550.qualcomm.com -c1 -t1' UP 
Aug 30 16:38:12 test-ibm-3650 qdiskd[8994]: <info> Heuristic: '/bin/ping
test-ibm-3650.qualcomm.com -c1 -t1' UP 
Aug 30 16:38:32 test-ibm-3650 qdiskd[8994]: <info> Initial score 2/2 
Aug 30 16:38:32 test-ibm-3650 qdiskd[8994]: <info> Initialization
complete 
Aug 30 16:38:32 test-ibm-3650 qdiskd[8994]: <notice> Score sufficient
for master operation (2/2; required=1); upgrading 
Aug 30 16:38:44 test-ibm-3650 qdiskd[8994]: <info> Assuming master role 

cluster.conf


<?xml version="1.0" ?>
<cluster alias="xen_cluster" config_version="20" name="xen_cluster"> 
        <fence_daemon clean_start="0" post_fail_delay="0"
post_join_delay="20"/>
        <clusternodes>
                <clusternode name=" test-ibm-3650.qualcomm.com
<http://test-ibm-3650.qualcomm.com> " nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device
name="rsa-test-ibm-3650"/> 
                                </method>
                        </fence>
                </clusternode>
                <clusternode name=" test-ibm-3550.qualcomm.com
<http://test-ibm-3550.qualcomm.com> " nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device
name="rsa-test-ibm-3550"/> 
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices> 
                <fencedevice agent="fence_rsa"
hostname="rsa-test-ibm-3650" ipaddr="rsa-test-ibm-3650"
login="Administrator" name="rsa-test-ibm-3650" passwd="R0cknR011"/> 
                <fencedevice agent="fence_rsa"
hostname="rsa-test-ibm-3550" ipaddr="rsa-test-ibm-3550"
login="Administrator" name="rsa-test-ibm-3550" passwd="R0cknR011"/> 
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="vm_failover" ordered="0"
restricted="0"> 
                                <failoverdomainnode
name="test-ibm-3650.qualcomm.com" priority="1"/>
                                <failoverdomainnode name="
test-ibm-3550.qualcomm.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources> 
                        <clusterfs device="/dev/cluster_vg/xen_volume"
force_unmount="0" fsid="13127" fstype="gfs" mountpoint="/xen"
name="xen"/>
                </resources> 
                <vm autostart="1" domain="vm_failover" exclusive="0"
name="dave-xen1" path="/xen/vm-config" recovery="restart"/>
                <vm autostart="1" domain="vm_failover" exclusive="0"
name="dave-xen2" path="/xen/vm-config:/etc/xen" recovery="relocate"/> 
        </rm>
        <quorumd device="/dev/mpath/1HITACHI_750403220023p1"
interval="2" log_level="7" min_score="1" tko="10" votes="2">
                <heuristic interval="2" program="/bin/ping
test-ibm-3550.qualcomm.com -c1 -t1" score="1"/>
                <heuristic interval="2" program="/bin/ping
test-ibm-3650.qualcomm.com -c1 -t1" score="1"/>
        </quorumd>
</cluster>




-- 
Dave Costakos
mailto: david.costakos at gmail.com <mailto:david.costakos at gmail.com>  

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070831/5db098c3/attachment.htm>

From kadlec at sunserv.kfki.hu  Fri Aug 31 07:32:30 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Fri, 31 Aug 2007 09:32:30 +0200 (MEST)
Subject: [Linux-cluster] clvmd startup timed out
In-Reply-To: <46D6BEF7.9070905@redhat.com>
References: <Pine.GSO.4.64.0708291256480.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F75@exchange2003.tasconline.com><Pine.GSO.4.64.0708291511470.22113@sunserv.kfki.hu><BC16657A85D83746AFF99A42587CC6BC07C77F79@exchange2003.tasconline.com><Pine.GSO.4.64.0708291526130.22113@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708291549400.22113@sunserv.kfki.hu>
	<BC16657A85D83746AFF99A42587CC6BC07C77F7B@exchange2003.tasconline.com>
	<Pine.GSO.4.64.0708292307130.24893@sunserv.kfki.hu>
	<Pine.GSO.4.64.0708301346550.17235@sunserv.kfki.hu>
	<46D6B3B8.4090308@redhat.com>
	<Pine.GSO.4.64.0708301414240.17908@sunserv.kfki.hu>
	<46D6BEF7.9070905@redhat.com>
Message-ID: <Pine.GSO.4.64.0708310927020.16749@sunserv.kfki.hu>

On Thu, 30 Aug 2007, Patrick Caulfield wrote:

> So it's waiting for a DLM lock for the VG "data".
> 
> At this point it could be
> 
> a) another node is holding a lock or
> b) A DLM bug, probably similar or identical to
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=238490

I upgraded to 2.6.23-rc4 and now it seems clvmd works OK. 
Thanks again your help, I hope there'll be no need to reopen this thread 
;-).

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From carlopmart at gmail.com  Fri Aug 31 09:49:37 2007
From: carlopmart at gmail.com (carlopmart)
Date: Fri, 31 Aug 2007 11:49:37 +0200
Subject: [Linux-cluster] fence_xvmd doesn't starts
Message-ID: <46D7E431.2020100@gmail.com>

Hi all,

  I am running standalone xen host using rhel5 with three rhel5 xen 
guest with cluster-suite. I have setup fence_xvm as a fence device on 
all three guest. On the host side I have setup fence_xvmd on 
cluster.conf file.

  My problems starts when I need to restart xen server host. Every time 
that reboots, fence_xvmd doesn't starts. If I execute "service cman 
restart" all its ok: fence_xvmd starts. Why?? How can I fix it??

Many thanks.

-- 
CL Martinez
carlopmart {at} gmail {d0t} com



From alain.richard at equation.fr  Fri Aug 31 10:46:50 2007
From: alain.richard at equation.fr (Alain RICHARD)
Date: Fri, 31 Aug 2007 12:46:50 +0200
Subject: [Linux-cluster] RE: qdisk votes not in cman
Message-ID: <30E8283B-B35E-4DE2-A8B6-9D59ED51C3E8@equation.fr>

I have the same problem, qdisk votes not in cman and it was  
effectively a problem with the device name I was using :

<quorumd device="/dev/mpath/qdisk1" .../>

replacing with :

<quorumd device="/dev/mpath/qdsk1" .../>

has solved the problem (the first device name length was 17 chars,  
and the second 16 chars, hitting the #define MAX_CLUSTER_NAME_LEN  16  
limitation).

Perhaps a better error reporting is needed in qdiskd to shows that we  
have hit this problem. Also using a generic name like "qdisk device"  
when qdiskd is registering its node to cman is a better approach.


regards,

-- 
Alain RICHARD <mailto:alain.richard at equation.fr>
EQUATION SA <http://www.equation.fr/>
Tel : +33 477 79 48 00     Fax : +33 477 79 48 01
E-Liance, Op?rateur des entreprises et collectivit?s,
Liaisons Fibre optique, SDSL et ADSL <http://www.e-liance.fr>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070831/7e9b426b/attachment.htm>

From maalgi at ono.com  Fri Aug 31 11:58:37 2007
From: maalgi at ono.com (maalgi at ono.com)
Date: Fri, 31 Aug 2007 13:58:37 +0200 (CEST)
Subject: [Linux-cluster] Problem with Fence
Message-ID: <9807820.172681188561517870.JavaMail.root@resprs02>

Hello, the first thing, sorry for my English i'am Spanish.
I am trying to mount in an environment of tests a system of cluster with two PC.
None of the pc have devices fence installed.
Every PC has two cards of network ethernet, one of which this one in a private 
network 10.0.0. X, for test to another node.
Another card has an ip 192.168.X.X, of my net and a virtual ip 192.168. X.X1 
to the access to the services of the cluster.

eth0------>192.168.56.15 (IP computer)
eth0:1---->192.168.56.24 (IP Virtual)
eth1------>10.0.0.1 (IP fence)

eth0------>192.168.56.16 (IP computer)
eth0:1---->192.168.56.24 (IP Virtual)
eth1------>10.0.0.2 (IP fence)

The interface eth0:1 (192.168.56.24) is up when cluster is up while in the other 
node, inteface and cluster service are down


Is possible configure the cluster this way?
That type of device fence must i to use?

I probe to configure cluster (eth0:1) with WTI, APC... fence devices, and i have same ressults.

/etc/init.d/ccsd start OK
/etc/init.d/fenced start FAIL
..........

Thank you very much and regards.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070831/cb8268a1/attachment.htm>

From maalgi at ono.com  Fri Aug 31 12:01:56 2007
From: maalgi at ono.com (maalgi at ono.com)
Date: Fri, 31 Aug 2007 14:01:56 +0200 (CEST)
Subject: [Linux-cluster] Problem with fence
Message-ID: <31905388.172911188561716343.JavaMail.root@resprs02>

Hello, the first thing, sorry for my English i'am Spanish.
I am trying to mount in an environment of tests a system of cluster with two PC.
None of the pc have devices fence installed.
Every PC has two cards of network ethernet, one of which this one in a private 
network 10.0.0. X, for test to another node.
Another card has an ip 192.168.X.X, of my net and a virtual ip 192.168. X.X1 
to the access to the services of the cluster.
----------------------------------------------------
eth0------>192.168.56.15 (IP computer)
eth0:1---->192.168.56.24 (IP Virtual)
eth1------>10.0.0.1 (IP fence)
---------------------------------------------------
eth0------>192.168.56.16 (IP computer)
eth0:1---->192.168.56.24 (IP Virtual)
eth1------>10.0.0.2 (IP fence)

The interface eth0:1 (192.168.56.24) is up when cluster is up while in the other 
node, inteface and cluster service are down


Is possible configure the cluster this way?
That type of device fence must i to use?

I probe to configure cluster (eth0:1) with WTI, APC... fence devices, and i have same ressults.

/etc/init.d/ccsd start OK
/etc/init.d/fenced start FAIL
..........

Thank you very much and regards.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070831/4ab3efe8/attachment.htm>

From claudio.tassini at gmail.com  Fri Aug 31 12:58:27 2007
From: claudio.tassini at gmail.com (Claudio Tassini)
Date: Fri, 31 Aug 2007 14:58:27 +0200
Subject: [Linux-cluster] expected_votes meaning
Message-ID: <39fdf1c70708310558u20d9aa01qb98e25223afaf218@mail.gmail.com>

Hi all,
this could sound quite obvious but I really can't find a reference guide to
cluster.conf parameters. What is the exact meaning of the expected_votes
cman attribute ?  Is it the total number of votes on which cman calculates
the quorum needing of a member or is it the actual number of votes needed by
a member to boot?

I'll try to explain: 4-nodes cluster. 1 vote per node. 1 quorum disk with 3
votes. expected_votes set to 7 as read in the cluster faq. The faq says that
every single node should be able to remain in the cluster as long as it can
reach the quorum, with a total vote count of 4 (3 for the quorum, 1 for the
node itself), which is claimed to be enough with an expected_votes of 7.
This is why I'm quite confused... if expected_votes is 7, doesn't this mean
that the "1 node+qdisk" vote count should be 7 to prevent quorum being
dissolved?


-- 
Claudio Tassini
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070831/5e2b5eda/attachment.htm>

From pcaulfie at redhat.com  Fri Aug 31 13:15:32 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 31 Aug 2007 14:15:32 +0100
Subject: [Linux-cluster] expected_votes meaning
In-Reply-To: <39fdf1c70708310558u20d9aa01qb98e25223afaf218@mail.gmail.com>
References: <39fdf1c70708310558u20d9aa01qb98e25223afaf218@mail.gmail.com>
Message-ID: <46D81474.1070408@redhat.com>

Claudio Tassini wrote:
> Hi all,
> 
> this could sound quite obvious but I really can't find a reference guide
> to cluster.conf parameters. What is the exact meaning of the
> expected_votes cman attribute ?  Is it the total number of votes on
> which cman calculates the quorum needing of a member or is it the actual
> number of votes needed by a member to boot?

expected_votes is the total number of votes in a fully functional cluster.
Quorum is calculated from this as (half the number of expected_votes)+1. when
the cluster has fewer than quorum votes then services will stall.


> I'll try to explain: 4-nodes cluster. 1 vote per node. 1 quorum disk
> with 3 votes. expected_votes set to 7 as read in the cluster faq. The
> faq says that every single node should be able to remain in the cluster
> as long as it can reach the quorum, with a total vote count of 4 (3 for
> the quorum, 1 for the node itself), which is claimed to be enough with
> an expected_votes of 7. This is why I'm quite confused... if
> expected_votes is 7, doesn't this mean that the "1 node+qdisk" vote
> count should be 7 to prevent quorum being dissolved?

So, in your system expected_votes is 7, quorum is 4. so any 1 node + the quorum
disk will be a valid cluster. Which sounds sensible :)

Patrick



From Alexandre.Racine at mhicc.org  Fri Aug 31 13:17:35 2007
From: Alexandre.Racine at mhicc.org (Alexandre Racine)
Date: Fri, 31 Aug 2007 09:17:35 -0400
Subject: [Linux-cluster] expected_votes meaning
References: <39fdf1c70708310558u20d9aa01qb98e25223afaf218@mail.gmail.com>
Message-ID: <C43CF0825BF59D4FBC1F6A2AF45EB88D3C3441@cumulonimbus.RG.local>

Hi Claudio

#man cman 

will give you infos on expected_votes.



Alexandre Racine
Projets sp?ciaux
514-461-1300 poste 3304
alexandre.racine at mhicc.org



-------- Message d'origine--------
De: linux-cluster-bounces at redhat.com de la part de Claudio Tassini
Date: ven. 2007-08-31 08:58
?: linux clustering
Objet : [Linux-cluster] expected_votes meaning
 
Hi all,
this could sound quite obvious but I really can't find a reference guide to
cluster.conf parameters. What is the exact meaning of the expected_votes
cman attribute ?  Is it the total number of votes on which cman calculates
the quorum needing of a member or is it the actual number of votes needed by
a member to boot?

I'll try to explain: 4-nodes cluster. 1 vote per node. 1 quorum disk with 3
votes. expected_votes set to 7 as read in the cluster faq. The faq says that
every single node should be able to remain in the cluster as long as it can
reach the quorum, with a total vote count of 4 (3 for the quorum, 1 for the
node itself), which is claimed to be enough with an expected_votes of 7.
This is why I'm quite confused... if expected_votes is 7, doesn't this mean
that the "1 node+qdisk" vote count should be 7 to prevent quorum being
dissolved?


-- 
Claudio Tassini

-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 3212 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070831/cd0021ac/attachment.bin>

From kadlec at sunserv.kfki.hu  Fri Aug 31 13:52:46 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Fri, 31 Aug 2007 15:52:46 +0200 (MEST)
Subject: [Linux-cluster] quorum lost in spite of 'leave remove'
Message-ID: <Pine.GSO.4.64.0708311542080.16749@sunserv.kfki.hu>

Hello,

In spite of having 'fence_tool leave' and 'cman_tool leave remove' in the 
'cman' init script, when stopping the five-member cluster, it looses 
quorum when only two machines run the cluster components:

root at web1:~# cman_tool status
Version: 6.0.1
Config Version: 6
Cluster Name: kfki
Cluster Id: 1583
Cluster Member: Yes
Cluster Generation: 748
Membership state: Cluster-Member
Nodes: 2
Expected votes: 5
Total votes: 2
Quorum: 3 Activity blocked
Active subsystems: 7
Flags: 
Ports Bound: 0 11  
Node name: web1-gfs
Node ID: 4
Multicast addresses: 224.0.0.3 
Node addresses: 192.168.192.6 

root at web1:~# cman_tool nodes 
Node  Sts   Inc   Joined               Name
   1   X    728                        lxserv0-gfs
   2   M    728   2007-08-31 09:19:09  lxserv1-gfs
   3   X    728                        web0-gfs
   4   M    724   2007-08-31 09:18:48  web1-gfs
   5   X    728                        saturn-gfs

'/etc/init.d/cman stop' was issued and executed successfully on the tree 
other nodes.

What can be the problem here?

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From claudio.tassini at gmail.com  Fri Aug 31 13:53:11 2007
From: claudio.tassini at gmail.com (Claudio Tassini)
Date: Fri, 31 Aug 2007 15:53:11 +0200
Subject: [Linux-cluster] expected_votes meaning
In-Reply-To: <46D81474.1070408@redhat.com>
References: <39fdf1c70708310558u20d9aa01qb98e25223afaf218@mail.gmail.com>
	<46D81474.1070408@redhat.com>
Message-ID: <39fdf1c70708310653w556c9814m5e2627ee0fda9298@mail.gmail.com>

You said that quorum is (expected_votes/2)+1 , which means 7/2=3.5+1=
4.5... is it rounded up or down?

2007/8/31, Patrick Caulfield <pcaulfie at redhat.com>:
>
> Claudio Tassini wrote:
> > Hi all,
> >
> > this could sound quite obvious but I really can't find a reference guide
> > to cluster.conf parameters. What is the exact meaning of the
> > expected_votes cman attribute ?  Is it the total number of votes on
> > which cman calculates the quorum needing of a member or is it the actual
> > number of votes needed by a member to boot?
>
> expected_votes is the total number of votes in a fully functional cluster.
> Quorum is calculated from this as (half the number of expected_votes)+1.
> when
> the cluster has fewer than quorum votes then services will stall.
>
>
> > I'll try to explain: 4-nodes cluster. 1 vote per node. 1 quorum disk
> > with 3 votes. expected_votes set to 7 as read in the cluster faq. The
> > faq says that every single node should be able to remain in the cluster
> > as long as it can reach the quorum, with a total vote count of 4 (3 for
> > the quorum, 1 for the node itself), which is claimed to be enough with
> > an expected_votes of 7. This is why I'm quite confused... if
> > expected_votes is 7, doesn't this mean that the "1 node+qdisk" vote
> > count should be 7 to prevent quorum being dissolved?
>
> So, in your system expected_votes is 7, quorum is 4. so any 1 node + the
> quorum
> disk will be a valid cluster. Which sounds sensible :)
>
> Patrick
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Claudio Tassini
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070831/5fc7ff6f/attachment.htm>

From pcaulfie at redhat.com  Fri Aug 31 14:08:06 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 31 Aug 2007 15:08:06 +0100
Subject: [Linux-cluster] expected_votes meaning
In-Reply-To: <39fdf1c70708310653w556c9814m5e2627ee0fda9298@mail.gmail.com>
References: <39fdf1c70708310558u20d9aa01qb98e25223afaf218@mail.gmail.com>	<46D81474.1070408@redhat.com>
	<39fdf1c70708310653w556c9814m5e2627ee0fda9298@mail.gmail.com>
Message-ID: <46D820C6.3060601@redhat.com>

Claudio Tassini wrote:
> You said that quorum is (expected_votes/2)+1 , which means 7/2=3.5+1=
> 4.5 ... is it rounded up or down?
> 

Down.

-- 
Patrick



From pcaulfie at redhat.com  Fri Aug 31 14:42:54 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 31 Aug 2007 15:42:54 +0100
Subject: [Linux-cluster] quorum lost in spite of 'leave remove'
In-Reply-To: <Pine.GSO.4.64.0708311542080.16749@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708311542080.16749@sunserv.kfki.hu>
Message-ID: <46D828EE.5070103@redhat.com>

Kadlecsik Jozsi wrote:
> Hello,
> 
> In spite of having 'fence_tool leave' and 'cman_tool leave remove' in the 
> 'cman' init script, when stopping the five-member cluster, it looses 
> quorum when only two machines run the cluster components:
> 
> root at web1:~# cman_tool status
> Version: 6.0.1
> Config Version: 6
> Cluster Name: kfki
> Cluster Id: 1583
> Cluster Member: Yes
> Cluster Generation: 748
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 5
> Total votes: 2
> Quorum: 3 Activity blocked
> Active subsystems: 7
> Flags: 
> Ports Bound: 0 11  
> Node name: web1-gfs
> Node ID: 4
> Multicast addresses: 224.0.0.3 
> Node addresses: 192.168.192.6 
> 
> root at web1:~# cman_tool nodes 
> Node  Sts   Inc   Joined               Name
>    1   X    728                        lxserv0-gfs
>    2   M    728   2007-08-31 09:19:09  lxserv1-gfs
>    3   X    728                        web0-gfs
>    4   M    724   2007-08-31 09:18:48  web1-gfs
>    5   X    728                        saturn-gfs
> 
> '/etc/init.d/cman stop' was issued and executed successfully on the tree 
> other nodes.
> 
> What can be the problem here?


It looks like a bug to me. Congratulations - you're the first person to spot that!

-- 
Patrick




From Augusto at prpb.mpf.gov.br  Fri Aug 31 15:07:49 2007
From: Augusto at prpb.mpf.gov.br (Augusto Lima)
Date: Fri, 31 Aug 2007 12:07:49 -0300
Subject: [Linux-cluster] Discovering the world of clustering
Message-ID: <s6d804be.026@smtp.prpb.mpf.gov.br>

Hi, i'm Augusto and i don't know much about clusters.
I'm from brazil, so my english it's not quite good.
I have an idea to test in my organization.

We have two large DELL servers with 6GB RAM each and a Xeon 3GHz
processor each.
They have also lots of disk space.

We want to cluster the 2 servers and run VMware Server on them, trying
to utilize most of the processors and the available RAM all the time.
We have plans to make 6 Virtual Machines running on top of them.
We also want to take advantage of High Availbility on our
configuration, meaning that if one servers goes down, the other have to
hold the 6 VMs for a period of time.
We can't afford any paid solution, since our organization does'nt
support that kind of implementation.

So, i'm wondering if anyone can give a opinion about if it is possible
and how can i do it using only free solutions.

Thanks in advance,

Augusto



From mm at yuhu.biz  Fri Aug 31 18:33:13 2007
From: mm at yuhu.biz (Marian Marinov)
Date: Fri, 31 Aug 2007 21:33:13 +0300
Subject: [Linux-cluster] Discovering the world of clustering
In-Reply-To: <s6d804be.026@smtp.prpb.mpf.gov.br>
References: <s6d804be.026@smtp.prpb.mpf.gov.br>
Message-ID: <200708312133.14035.mm@yuhu.biz>

You can always use Xen virtual machines which can easy migrate from machine to 
machine.

http://www.cl.cam.ac.uk/research/srg/netos/xen/

You can have the Xen virtual machines over a GFS cluster.

Best regards
  Marian Marinov
On Friday 31 August 2007 18:07:49 Augusto Lima wrote:
> Hi, i'm Augusto and i don't know much about clusters.
> I'm from brazil, so my english it's not quite good.
> I have an idea to test in my organization.
>
> We have two large DELL servers with 6GB RAM each and a Xeon 3GHz
> processor each.
> They have also lots of disk space.
>
> We want to cluster the 2 servers and run VMware Server on them, trying
> to utilize most of the processors and the available RAM all the time.
> We have plans to make 6 Virtual Machines running on top of them.
> We also want to take advantage of High Availbility on our
> configuration, meaning that if one servers goes down, the other have to
> hold the 6 VMs for a period of time.
> We can't afford any paid solution, since our organization does'nt
> support that kind of implementation.
>
> So, i'm wondering if anyone can give a opinion about if it is possible
> and how can i do it using only free solutions.
>
> Thanks in advance,
>
> Augusto
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From kadlec at sunserv.kfki.hu  Fri Aug 31 18:41:04 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Fri, 31 Aug 2007 20:41:04 +0200 (MEST)
Subject: [Linux-cluster] quorum lost in spite of 'leave remove'
In-Reply-To: <46D828EE.5070103@redhat.com>
References: <Pine.GSO.4.64.0708311542080.16749@sunserv.kfki.hu>
	<46D828EE.5070103@redhat.com>
Message-ID: <Pine.GSO.4.64.0708312035250.6149@sunserv.kfki.hu>

Hello,

On Fri, 31 Aug 2007, Patrick Caulfield wrote:

> > '/etc/init.d/cman stop' was issued and executed successfully on the tree 
> > other nodes.
> 
> It looks like a bug to me. Congratulations - you're the first person to spot that!

In the source tree of cluster-2.01.00 (and in cluster-2.00.00 too) the 
flag CMAN_SHUTDOWN_REMOVED is only set by cman_tool but never used 
anywhere. It seems the functionality is simply missing.

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From kadlec at sunserv.kfki.hu  Fri Aug 31 19:50:32 2007
From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsi)
Date: Fri, 31 Aug 2007 21:50:32 +0200 (MEST)
Subject: [Linux-cluster] quorum lost in spite of 'leave remove'
In-Reply-To: <Pine.GSO.4.64.0708312035250.6149@sunserv.kfki.hu>
References: <Pine.GSO.4.64.0708311542080.16749@sunserv.kfki.hu>
	<46D828EE.5070103@redhat.com>
	<Pine.GSO.4.64.0708312035250.6149@sunserv.kfki.hu>
Message-ID: <Pine.GSO.4.64.0708312146500.12565@sunserv.kfki.hu>

On Fri, 31 Aug 2007, Kadlecsik Jozsi wrote:

> > > '/etc/init.d/cman stop' was issued and executed successfully on the tree 
> > > other nodes.
> > 
> > It looks like a bug to me. Congratulations - you're the first person to spot that!
> 
> In the source tree of cluster-2.01.00 (and in cluster-2.00.00 too) the 
> flag CMAN_SHUTDOWN_REMOVED is only set by cman_tool but never used 
> anywhere. It seems the functionality is simply missing.

This is a little bit more complicated than that... Same value called and 
used as CMAN_SHUTDOWN_REMOVED here, SHUTDOWN_REMOVE there.

Best regards,
Jozsef
--
E-mail : kadlec at sunserv.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



From david.costakos at gmail.com  Fri Aug 31 20:21:42 2007
From: david.costakos at gmail.com (Dave Costakos)
Date: Fri, 31 Aug 2007 13:21:42 -0700
Subject: [Linux-cluster] RE: qdisk votes not in cman
In-Reply-To: <30E8283B-B35E-4DE2-A8B6-9D59ED51C3E8@equation.fr>
References: <30E8283B-B35E-4DE2-A8B6-9D59ED51C3E8@equation.fr>
Message-ID: <6b6836c60708311321vc3a22d1v2e10a91b7d2dce99@mail.gmail.com>

Thank you gentlemen!  It worked.



On 8/31/07, Alain RICHARD <alain.richard at equation.fr> wrote:
>
> I have the same problem, qdisk votes not in cman and it was effectively a
> problem with the device name I was using :
> <quorumd device="/dev/mpath/qdisk1" .../>
>
> replacing with :
>
> <quorumd device="/dev/mpath/qdsk1" .../>
>
> has solved the problem (the first device name length was 17 chars, and the
> second 16 chars, hitting the #define MAX_CLUSTER_NAME_LEN  16 limitation).
>
> Perhaps a better error reporting is needed in qdiskd to shows that we have
> hit this problem. Also using a generic name like "qdisk device" when qdiskd
> is registering its node to cman is a better approach.
>
>
> regards,
>
>
> --
>
> Alain RICHARD <mailto:alain.richard at equation.fr<alain.richard at equation.fr>
> >
>
> EQUATION SA <http://www.equation.fr/>
>
> Tel : +33 477 79 48 00     Fax : +33 477 79 48 01
>
> E-Liance, Op?rateur des entreprises et collectivit?s,
>
> Liaisons Fibre optique, SDSL et ADSL <http://www.e-liance.fr>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Dave Costakos
mailto:david.costakos at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070831/71b0fe54/attachment.htm>