From jpyeron at pdinc.us Fri Sep 1 12:02:30 2006 From: jpyeron at pdinc.us (Jason Pyeron) Date: Fri, 1 Sep 2006 08:02:30 -0400 Subject: OT: RE: [Linux-cluster] php4-xslt package??? In-Reply-To: <44F6E9E9.5070703@gmail.com> Message-ID: <200609011202.k81C2f919734@ns.pyerotechnics.com> This is the wrong list for this, but look at http://public.pdinc.us/rpms/php-xslt/index.jsp -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Sr. Consultant 10 West 24th Street #100 - - +1 (443) 269-1555 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, purge the message from your system and notify the sender immediately. Any other use of the email by you is prohibited. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Anthony Sent: Thursday, August 31, 2006 9:54 To: redhat-sysadmin-list at redhat.com; UNIX-Administration at yahoogroups.com; linux-cluster at redhat.com Subject: [Linux-cluster] php4-xslt package??? Hello, i am unable to find the php4-xslt package for Red Hat Enterprise Linux 4 AS. any help? -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 6433 bytes Desc: not available URL: From basv at sara.nl Fri Sep 1 13:59:04 2006 From: basv at sara.nl (Bas van der Vlies) Date: Fri, 01 Sep 2006 15:59:04 +0200 Subject: [Linux-cluster] ANNOUNCE: new version gfs_2_deb utils (0.2.1) Message-ID: <44F83CA8.3080105@sara.nl> = gfs_2_deb - utilities = This is a release of the SARA package gfs_2_deb that contains utilities that we use to make debian packages from the RedHat Cluster Software (GFS). This is utilities are for version 1.0.3 and cvs updates. All init.d scripts in the debian package start at runlevel 3 and the scripts start in the right order. We have choosen this setup for these reasons, default runlevel is 2: 1) When a node is fenced, the node is rebooted and is ready for cluster mode. 2) We can easily switch from run levels to join or leave the cluster See README for further info The package can be downloaded at: ftp://ftp.sara.nl/pub/outgoing/gfs_2_deb.tar.gz Regards -- -- ******************************************************************** * * * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services phone: +31 20 592 8012 * * Kruislaan 415 fax: +31 20 6683167 * * 1098 SJ Amsterdam * * * ******************************************************************** -- -- ******************************************************************** * * * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services phone: +31 20 592 8012 * * Kruislaan 415 fax: +31 20 6683167 * * 1098 SJ Amsterdam * * * ******************************************************************** From mbrookov at mines.edu Fri Sep 1 19:11:44 2006 From: mbrookov at mines.edu (Matthew B. Brookover) Date: Fri, 01 Sep 2006 13:11:44 -0600 Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? 
In-Reply-To: <20060831191626.99599.qmail@web50613.mail.yahoo.com> References: <20060831191626.99599.qmail@web50613.mail.yahoo.com> Message-ID: <1157137904.26485.6.camel@merlin.Mines.EDU> I have an iscsi scan that would not work with out LVM. As with your EMC SAN I can expand a volume and expand a GFS file system within it. Where I get into trouble is identifying the volumes after a reboot. What was /dev/sdb may be /dev/sdc next time. LVM allows you to name your volumes and helps to track them down when the system is restarted. There are similar problems when SCSI ID numbers get swapped around. Matt On Thu, 2006-08-31 at 12:16 -0700, Roger Pe?a Escobio wrote: > Hi > > I was wondering why in the docs and examples the GFS > filesystem is build on top of a lv "partition" ? > I can understand that if I build the GFS in a direct > scsi attached storage because is not easy to grow the > "device" without destroy the data but the same apply > in an SAN enviroment? > We have here a EMC SAN, where is relative easy to grow > a LUN, so can we skip the LVM layer and build the GFS > filesystem directly over the emcpower device ? > > there is any advantage of using LVM in this scenario? > > thanks in advance > roger > > __________________________________________ > RedHat Certified Engineer ( RHCE ) > Cisco Certified Network Associate ( CCNA ) > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From daves at ActiveState.com Fri Sep 1 23:40:24 2006 From: daves at ActiveState.com (David Sparks) Date: Fri, 01 Sep 2006 16:40:24 -0700 Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? In-Reply-To: <20060831191626.99599.qmail@web50613.mail.yahoo.com> References: <20060831191626.99599.qmail@web50613.mail.yahoo.com> Message-ID: <44F8C4E8.9050601@activestate.com> > I was wondering why in the docs and examples the GFS > filesystem is build on top of a lv "partition" ? > I can understand that if I build the GFS in a direct > scsi attached storage because is not easy to grow the > "device" without destroy the data but the same apply > in an SAN enviroment? > We have here a EMC SAN, where is relative easy to grow > a LUN, so can we skip the LVM layer and build the GFS > filesystem directly over the emcpower device ? A variation of this question, what about creating GFS directly on the block device (ie /dev/sdb) instead of creating partitions (ie /dev/sdb1)? When increasing a filesystem, this removes the step of increasing the partition size, which is usually the scariest part (because you are usually deleting the partition table, and recreating it with the same starting layout, hoping that your existing filesystem will be intact). Does parted support GFS? It doesn't support XFS which is another FS I am using. So I asked myself, why bother creating a partition table at all? I have been running the fs directly on the block device for some time now without issue (XFS, haven't tried GFS). A setup like this has a weakness in that people who aren't familiar with it may come along with fdisk and corrupt the disk by creating a partition table on it. You might rename fdisk as a basic preventative. ds > > there is any advantage of using LVM in this scenario? 
> > thanks in advance > roger From orkcu at yahoo.com Sat Sep 2 02:25:28 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 1 Sep 2006 19:25:28 -0700 (PDT) Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? In-Reply-To: <1157137904.26485.6.camel@merlin.Mines.EDU> Message-ID: <20060902022528.92665.qmail@web50608.mail.yahoo.com> --- "Matthew B. Brookover" wrote: > I have an iscsi scan that would not work with out > LVM. As with your EMC > SAN I can expand a volume and expand a GFS file > system within it. Where > I get into trouble is identifying the volumes after > a reboot. What > was /dev/sdb may be /dev/sdc next time. LVM allows > you to name your > volumes and helps to track them down when the system > is restarted. > There are similar problems when SCSI ID numbers get > swapped around. > yes, I know what you mean I was looking for something like ext{2,3} label for the filesystem but I could'n find anything for gfs :-( so I am hopping that PowerPath kernel module always identify the LUN with the same emcpower device :-) if that is not true I will be forced to move to LVM under GFS :-) thanks roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From jprats at cesca.es Sat Sep 2 11:12:40 2006 From: jprats at cesca.es (=?ISO-8859-1?Q?Jordi_Prats_Catal=E0?=) Date: Sat, 02 Sep 2006 13:12:40 +0200 Subject: [Linux-cluster] clustat problem Message-ID: <44F96728.8090902@cesca.es> Hi, I'm getting different outputs of clustat utility on each node: node1: # clustat Member Status: Quorate Member Name Status ------ ---- ------ node1 Online, Local, rgmanager node2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- ptoheczas node2 started xoqil node2 started ymsgh node1 started vofcvhas node2 started node2: # clustat Member Status: Quorate Member Name Status ------ ---- ------ node1 Online, rgmanager node2 Online, Local, rgmanager (disappears service's info) Rebooting disapears this problem (displays same info in both nodes) for a few weeks. After that it appears again. Do you know what's going on? Thanks, -- ...................................................................... __ / / Jordi Prats Catal? C E / S / C A Departament de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... From filipe.miranda at gmail.com Sat Sep 2 14:28:29 2006 From: filipe.miranda at gmail.com (Filipe Miranda) Date: Sat, 2 Sep 2006 11:28:29 -0300 Subject: [Linux-cluster] clustat problem In-Reply-To: <44F96728.8090902@cesca.es> References: <44F96728.8090902@cesca.es> Message-ID: Hi there, I'm having the same problem! I'm using RHEL3.8 for Itanium and RedHat Cluster Suite U8. The cluster is composed of 2 HP 4CPUs servers and we are using an EMC ClarionCX700 to hold the quorum partitions and data partitions. One more thing that I noticed, eventhough the members are shown ative on both nodes, any action on the node that shows the active service does not get propagated to the other member. I already checked the configuration of the rawdevices, and I also used the shutil utility and it reported no problems with the quorum partitions. Does anybody have any suggestions? 
Thank you, On 9/2/06, Jordi Prats Catal? wrote: > > Hi, > I'm getting different outputs of clustat utility on each node: > > node1: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, Local, rgmanager > node2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > ptoheczas node2 started > xoqil node2 started > ymsgh node1 started > vofcvhas node2 started > > node2: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, rgmanager > node2 Online, Local, rgmanager > > > (disappears service's info) > > Rebooting disapears this problem (displays same info in both nodes) for > a few weeks. After that it appears again. > > Do you know what's going on? > > Thanks, > > -- > ...................................................................... > __ > / / Jordi Prats Catal? > C E / S / C A Departament de Sistemes > /_/ Centre de Supercomputaci? de Catalunya > > Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona > T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es > ...................................................................... > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jprats at cesca.es Sat Sep 2 14:50:42 2006 From: jprats at cesca.es (=?ISO-8859-1?Q?Jordi_Prats_Catal=E0?=) Date: Sat, 02 Sep 2006 16:50:42 +0200 Subject: [Linux-cluster] clustat problem In-Reply-To: References: <44F96728.8090902@cesca.es> Message-ID: <44F99A42.4000402@cesca.es> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, My software versions are: # clustat -v clustat version 1.9.43 Connected via: CMAN/SM Plugin v1.1.4 # cat /etc/redhat-release Red Hat Enterprise Linux ES release 4 (Nahant Update 2) My cluster is composed of 2 HP ProLiant DL360 G4p: 4 Xeon processors each node also. Filipe Miranda wrote: > Hi there, > > I'm having the same problem! > I'm using RHEL3.8 for Itanium and RedHat Cluster Suite U8. The cluster > is composed of 2 HP 4CPUs servers and we are using an EMC ClarionCX700 > to hold the quorum partitions and data partitions. > One more thing that I noticed, eventhough the members are shown ative on > both nodes, any action on the node that shows the active service does > not get propagated to the other member. > > I already checked the configuration of the rawdevices, and I also used > the shutil utility and it reported no problems with the quorum partitions. > > Does anybody have any suggestions? > > Thank you, > > > On 9/2/06, *Jordi Prats Catal?* > wrote: > > Hi, > I'm getting different outputs of clustat utility on each node: > > node1: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, Local, rgmanager > node2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > ptoheczas node2 started > xoqil node2 started > ymsgh node1 started > vofcvhas node2 started > > node2: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, rgmanager > node2 Online, Local, rgmanager > > > (disappears service's info) > > Rebooting disapears this problem (displays same info in both nodes) for > a few weeks. After that it appears again. > > Do you know what's going on? > > Thanks, > > -- > ...................................................................... > __ > / / Jordi Prats Catal? 
> C E / S / C A Departament de Sistemes > /_/ Centre de Supercomputaci? de Catalunya > > Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona > T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es > > ...................................................................... > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster - -- ...................................................................... __ / / Jordi Prats Catal? C E / S / C A Departament de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... pgp:0x5D0D1321 ...................................................................... -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFE+ZpCHGTYFl0NEyERAiwrAJ47HGNxNQ6D5PcKPHXszw1JWenILwCbBB9m T6KJAv7tjOoJ6A6XGswECs0= =o8m8 -----END PGP SIGNATURE----- From eric at nitrate.nl Mon Sep 4 09:17:01 2006 From: eric at nitrate.nl (E. de Ruiter) Date: Mon, 04 Sep 2006 11:17:01 +0200 Subject: [Linux-cluster] economy filesystem cluster Message-ID: <44FBEF0D.6090409@nitrate.nl> Hi, I'm planning to build a webserver cluster, and as part of that I'm looking for solutions that allows every node in the cluster to access the same filesystem. The most easy way would be via nfs, but my requirements state that there should be no single point of failure (ofcourse not completely possible but the cluster should not be affected by the downtime of 1 machine). A san or other some other piece of extra hardware is currently not possible within the current budget. The system will have a low number of writes (only some uploaded files and some generated templates but the majority of the load will be reads) but a rsync solution or something like that is not feasible since loadbalancing needs the file to be directly available on all nodes. What I have: - 1 loadbalancing machine - 1 database server - 2 webfrontends - 1 management server (slave db / backup load balancer etc) In the future I plan on adding some extra database servers + webfrontends All machines are very similar and have (dual) xeon processors. The requirements are that all machines have access to the filesystem, and no single machine may affect the availability of (a part of) the filesystem. Searching the internet resulted in some possible solutions: - GFS with only gnbd (http://gfs.wikidev.net/GNBD_installation). This only exports the specified partitions over the network and has (in my mind) no advantages over using plain nfs (it adds no redundancy) - GFS with gnbd in combination with drbd (mentioned a few times on the mailing list). This looks promising but I couldn't find a definitive answer to the questions raised here on the mailinglist: - drbd 0.7 only allows 1 node to have write-access. Is it possible to construct a simple failover scenario without serious risks of corruption when drbd has "failed-over" but gfs has not. - drbd 0.8 seems to have support for multi(2)-master configuration, but is it stable enough for a production environment and can it work together with gfs - GFS in combination with clvm (network raid?). 
Mentioned a few times here on the mailinglist but most posts claim it is not stable enough, and documentation seems completely missing. - economy configuration from the GFS Administrator's Guide (http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-ov-perform.html#S2-OV-ECONOMY) The problem with this is: - is there a need to have separate gnbd servers? Or can the gnbd servers be run on the application servers. - it is not documented how to configure this, and it is not clear whether this configuration gives me the redundancy I want. What I was thinking of is the following: - One node acts as a gnbd server - Each node has his own disk - Each node mounts a gnbd device. - Each node creates a raid-1 (own disk + gnbd device) - GFS is run on top of the raid-1 But it is not clear to me if this is feasible since I rely on a single gnbd server. Maybe I can have 2 gnbd servers where the disks are synced with drbd (0.8?), but that creates issues with fencing (according to some posts here). And also the raid-1 should read only from it's local disk and only if that fails it should read from the gnbd device, but I don't know if that is possible. Or maybe clvm (network raid?) would be an option but I couldn't find any documentation for that. Can this be done with gfs / clvm / drbd or are there other solutions more appropriate for this case? (other filesystems I've seen, like pvfs2/intermezzo/lustre, are either not production ready, abandoned or don't have support for redundancy) Thanks, Eric de Ruiter From sdake at redhat.com Tue Sep 5 06:39:34 2006 From: sdake at redhat.com (Steven Dake) Date: Mon, 04 Sep 2006 23:39:34 -0700 Subject: [Linux-cluster] cman and bond isssue In-Reply-To: <44F4519E.1030305@atichile.com> References: <44F4519E.1030305@atichile.com> Message-ID: <1157438374.12305.46.camel@shih.broked.org> One possible problem is that your switch doesn't properly support multicast or jumbo frames. I suggest ensuring your MTU on all machines is 1500 to test the jumbo frames possibility. I have seen many switches advertised as jumbo frames which fail to operate properly in multicast or heavy broadcast environments. Regards -steve On Tue, 2006-08-29 at 10:39 -0400, Luis Godoy Gonzalez wrote: > Hello > > I have a problem with the installation of the Cluster Suite. I've > configured 2 nodes cluster, add some services to test de installation > and this worked OK. > > But when I configured "bond" for ethernet interfaces, the communication > between the Cluster nodes doesn't work well. Although networking at IP > level works fine, when I reboot one of the nodes the other one goes to > kernel panic. > > I've lost a lot of time debbuging this problem and I finally decide to > replace one switch ( D-Link DGS-1016D Gigabit Switch ) putting another > from other installation ( D-Link 10/100 ) and the cluster Works Fine now. > > But Now, I'm not sure if the problem is with the hardware switch ( > D-Link ) or with the software. > Any ideas ? > > I have RHE4 U2 and Cluster Suite 4 U2 using HP DL380 and external RAID. > > Some error messages are below > > --------------------------------------------------------- > # SM: 03000002 process_stop_request: uevent already set > > SM: Assertion failed on line 106 of file > /usr/src/build/615121-i686/BUILD/cman- > kernel-2.6.9-39/smp/src/sm_membership.c > SM: assertion: "node" > SM: time = 256523 > nodeid=1 > > Kernel panic - not syncing: SM: Record message above and reboot. 
> ---------------------------------------------------------------------- > > Thanks in advance for any help. > Luis G. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From damian.osullivan at hp.com Tue Sep 5 12:50:23 2006 From: damian.osullivan at hp.com (O'Sullivan, Damian) Date: Tue, 5 Sep 2006 13:50:23 +0100 Subject: [Linux-cluster] CMAN and interface Message-ID: <644A0966265D9D40AC7584FCE956111302EDD634@dubexc01.emea.cpqcorp.net> Hi, How do I ensure that CMAN uses a specific interface? I have a 2 node cluster with 6 ethernet interfaces. I have a cross over cable beween the 2 eth0 interfaces on both nodes. All other interfaces are connected to a common switch with VLANs for each interface. When this switch is reloaded/rebooted the nodes try to fence each other and soon as the switch comes back each node is shutdown by the fencing agent. I see there is a way with multicast but is that the only way and how does one set up addresses for this? Thanks, D. From mwill at penguincomputing.com Tue Sep 5 16:06:56 2006 From: mwill at penguincomputing.com (Michael Will) Date: Tue, 5 Sep 2006 09:06:56 -0700 Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? Message-ID: <433093DF7AD7444DA65EFAFE3987879C245253@jellyfish.highlyscyld.com> My number one reason for using a partition table under lvm is avoiding to place filesystem data where it could be damaged by accidentally installing a bootblock or partitiontable on the wrong device. Michael -----Original Message----- From: David Sparks [mailto:daves at ActiveState.com] Sent: Fri Sep 01 16:40:47 2006 To: linux clustering Subject: Re: [Linux-cluster] is necesary to to build GFS on top of LVM ? > I was wondering why in the docs and examples the GFS > filesystem is build on top of a lv "partition" ? > I can understand that if I build the GFS in a direct > scsi attached storage because is not easy to grow the > "device" without destroy the data but the same apply > in an SAN enviroment? > We have here a EMC SAN, where is relative easy to grow > a LUN, so can we skip the LVM layer and build the GFS > filesystem directly over the emcpower device ? A variation of this question, what about creating GFS directly on the block device (ie /dev/sdb) instead of creating partitions (ie /dev/sdb1)? When increasing a filesystem, this removes the step of increasing the partition size, which is usually the scariest part (because you are usually deleting the partition table, and recreating it with the same starting layout, hoping that your existing filesystem will be intact). Does parted support GFS? It doesn't support XFS which is another FS I am using. So I asked myself, why bother creating a partition table at all? I have been running the fs directly on the block device for some time now without issue (XFS, haven't tried GFS). A setup like this has a weakness in that people who aren't familiar with it may come along with fdisk and corrupt the disk by creating a partition table on it. You might rename fdisk as a basic preventative. ds > > there is any advantage of using LVM in this scenario? > > thanks in advance > roger -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lhh at redhat.com Tue Sep 5 16:44:40 2006 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 05 Sep 2006 12:44:40 -0400 Subject: [Linux-cluster] clustat problem In-Reply-To: <44F99A42.4000402@cesca.es> References: <44F96728.8090902@cesca.es> <44F99A42.4000402@cesca.es> Message-ID: <1157474680.3610.35.camel@rei.boston.devel.redhat.com> On Sat, 2006-09-02 at 16:50 +0200, Jordi Prats Catal? wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi, > My software versions are: > > # clustat -v > clustat version 1.9.43 It should be fixed in the U4 release. There were cases where the main thread would block, causing problems with the CMAN service manager (which then caused the cluster to cease to function normally). Additionally, there is an issue with the DLM which has been worked around by switching the way locks are taken. Either one of these problems causes clustat to hang and/or produce no output. Versions: rgmanager-1.9.53 magma-plugins-1.0.9 magma-1.0.6 -- Lon From ivanp at yu.net Wed Sep 6 09:32:58 2006 From: ivanp at yu.net (Ivan Pantovic) Date: Wed, 06 Sep 2006 11:32:58 +0200 Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? In-Reply-To: <20060902022528.92665.qmail@web50608.mail.yahoo.com> References: <20060902022528.92665.qmail@web50608.mail.yahoo.com> Message-ID: <44FE95CA.1050202@yu.net> You can use udev with scsi_id to map that lun always on the same place instead using lvm to find volumes. There is another thing you should consider. It is cLVM not LVM. Roger PeXa Escobio wrote: > > --- "Matthew B. Brookover" wrote: > > >>I have an iscsi scan that would not work with out >>LVM. As with your EMC >>SAN I can expand a volume and expand a GFS file >>system within it. Where >>I get into trouble is identifying the volumes after >>a reboot. What >>was /dev/sdb may be /dev/sdc next time. LVM allows >>you to name your >>volumes and helps to track them down when the system >>is restarted. >>There are similar problems when SCSI ID numbers get >>swapped around. >> > > yes, I know what you mean > I was looking for something like ext{2,3} label for > the filesystem but I could'n find anything for gfs :-( > > so I am hopping that PowerPath kernel module always > identify the LUN with the same emcpower device :-) > if that is not true I will be forced to move to LVM > under GFS :-) > > > thanks > roger > > > > > __________________________________________ > RedHat Certified Engineer ( RHCE ) > Cisco Certified Network Associate ( CCNA ) > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Ivan Pantovic, System Engineer ----- YUnet International http://www.eunet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 311 9901; Fax: +381 11 311 9901; Mob: +381 63 302 288 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 311 9901. 
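For anyone following the device-naming side of this thread later, a minimal sketch of Ivan's udev + scsi_id suggestion, assuming a RHEL4-era udev: the WWID, rule file name and symlink below are made-up examples, and the exact rule syntax (including whether matches use '=' or '==') varies between udev releases, so treat it as a starting point rather than a recipe.

  # 1. ask scsi_id for the LUN's persistent identifier
  /sbin/scsi_id -g -u -s /block/sdb

  # 2. /etc/udev/rules.d/60-san.rules -- give that identifier a stable name
  #    (replace the RESULT value with the id printed above)
  KERNEL=="sd*", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/%k", RESULT=="360060160abc0123456789abc0123456", SYMLINK="san/gfs0"

Pointing fstab or the cluster service at /dev/san/gfs0 instead of /dev/sdb (or the bare emcpower device) keeps the name stable across reboots and SCSI renumbering without having to put LVM underneath GFS.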
----- From riaan at obsidian.co.za Wed Sep 6 15:09:54 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Wed, 06 Sep 2006 17:09:54 +0200 Subject: [Linux-cluster] data journaling for increased performance Message-ID: <44FEE4C2.1070205@obsidian.co.za> has anyone been able to use GFS data journaling to get any measurable performance boost? For those unfamiliar: http://www.redhat.com/docs/manuals/csgfs/browse/rh-gfs-en/s1-manage-data-journal.html We have a 2.6 TB maildir mail store (e.g. lots of small files) and think of implementing it (we will take any performance increase we can get as long as it does not impact reliability), and even though it will only apply to new files. Also, is it possible to check if the inherit_jdata (for directories) or jdata (for files) flag has been set? Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From celso at webbertek.com.br Thu Sep 7 03:59:46 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 07 Sep 2006 00:59:46 -0300 Subject: [Linux-cluster] Is IPMI fencing considered certified by Red Hat? Message-ID: <44FF9932.9020508@webbertek.com.br> Hello friends, Regarding Red Hat Cluster Suite and/or GFS, could someone from Red Hat please tell me if the use of IPMI embedded devices from the servers' motherboards is officially certified by Red Hat? I'd like to have this information so that we can recommend (or not) to customers the use of IPMI as a secure form of fencing. We had some bad experiences recently on some servers where only one of the onboard NICs listened to the IPMI over LAN packets, so it appeared to us that sometimes IPMI is not that safe as a fence device. Of course the Cluster software will assume nothing when the fencing fails, but the bad thing is that there is no automatic failover on this situation. Thank you all, Celso. -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. From celso at webbertek.com.br Thu Sep 7 04:24:27 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 07 Sep 2006 01:24:27 -0300 Subject: [Linux-cluster] Write log messages to a different file In-Reply-To: <1156958955.4501.245.camel@rei.boston.devel.redhat.com> References: <5F08B160555AC946B5AB743B85FF406D05ABF26F@ex2k.bankofamerica.com> <1156958955.4501.245.camel@rei.boston.devel.redhat.com> Message-ID: <44FF9EFB.5090007@webbertek.com.br> Hello, Are there plans to implement those loggin facilities to the other daemons? It'd be very interesting to have the fence messages and related stuff to a separete file. Thanks, Celso. Lon Hohberger escreveu: > On Wed, 2006-08-30 at 11:15 -0400, Brown, Rodrick R wrote: >> You need to modify /etc/syslog.conf >> local4.* /var/log/cluster.log > > I think it's daemon.*, not local4.*, by default. You can make rgmanager > use local4 by tweaking the tag, though: > > > > ... but this doesn't change CMAN, CCS, GuLM, etc. > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. 
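For reference, a hedged sketch of how this logging split is usually wired up, assuming sysklogd and the RHEL4-era rgmanager; the log_facility attribute and the file names below are assumptions to verify against your own versions.

  # /etc/syslog.conf -- cman, ccsd and fenced log to the daemon facility
  # by default (older syslogd wants a tab between selector and file)
  daemon.*          /var/log/cluster.log

  # if rgmanager is switched to local4, roughly as Lon describes above:
  #   <rm log_facility="local4"> ... </rm>     (attribute in cluster.conf)
  local4.*          /var/log/rgmanager.log

  # pick up the change
  service syslog restart

This only redirects daemons that already talk to syslog; it does not give the cluster its own facility, which is the gap Matthew's follow-up below touches on.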
From celso at webbertek.com.br Thu Sep 7 04:28:32 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 07 Sep 2006 01:28:32 -0300 Subject: [Linux-cluster] clustat problem In-Reply-To: References: <44F96728.8090902@cesca.es> Message-ID: <44FF9FF0.1000404@webbertek.com.br> Hi Filipe! I think your case is a little bit different from Jordi's case, since you are using Cluster Suite v3 and he is using v4. From my own experience, under CSv3 I had this kind o problem when using high latency quorum devices. So I had to change from disk tiebraker to network tiebraker. I imagine you're using disk tiebraker, aren't you? Please, would someone please confirm that Filipe's case could be solved by changing the heartbeat method? It worked for me in the past, but I'm not pretty sure that this was the actual solution. Thanks, Celso. Filipe Miranda escreveu: > Hi there, > > I'm having the same problem! > I'm using RHEL3.8 for Itanium and RedHat Cluster Suite U8. The cluster > is composed of 2 HP 4CPUs servers and we are using an EMC ClarionCX700 > to hold the quorum partitions and data partitions. > One more thing that I noticed, eventhough the members are shown ative on > both nodes, any action on the node that shows the active service does > not get propagated to the other member. > > I already checked the configuration of the rawdevices, and I also used > the shutil utility and it reported no problems with the quorum partitions. > > Does anybody have any suggestions? > > Thank you, > > > On 9/2/06, *Jordi Prats Catal?* > wrote: > > Hi, > I'm getting different outputs of clustat utility on each node: > > node1: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, Local, rgmanager > node2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > ptoheczas node2 started > xoqil node2 started > ymsgh node1 started > vofcvhas node2 started > > node2: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, rgmanager > node2 Online, Local, rgmanager > > > (disappears service's info) > > Rebooting disapears this problem (displays same info in both nodes) for > a few weeks. After that it appears again. > > Do you know what's going on? > > Thanks, > > -- > ...................................................................... > __ > / / Jordi Prats Catal? > C E / S / C A Departament de Sistemes > /_/ Centre de Supercomputaci? de Catalunya > > Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona > T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es > > ...................................................................... > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > Esta mensagem foi verificada pelo sistema de antiv?rus e > acredita-se estar livre de perigo. > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. 
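A small check that follows from Lon's note earlier in the thread (the clustat hang was fixed around U4): before tuning tiebreakers or timers, confirm both nodes are running matching, recent packages. Package names below are the stock ones for RHCS v4 and v3; adjust for your channels.

  # RHCS v4 (Jordi's case) -- Lon cites rgmanager-1.9.53, magma-1.0.6
  # and magma-plugins-1.0.9 as containing the fixes
  clustat -v
  rpm -q rgmanager magma magma-plugins

  # RHCS v3 / clumanager (Filipe's case)
  rpm -q clumanager redhat-config-cluster

A version mismatch between the two nodes, or anything older than the versions above, is worth ruling out before reaching for cludb tuning parameters.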
From Matthew.Patton.ctr at osd.mil Thu Sep 7 13:19:07 2006 From: Matthew.Patton.ctr at osd.mil (Patton, Matthew F, CTR, OSD-PA&E) Date: Thu, 7 Sep 2006 09:19:07 -0400 Subject: [Linux-cluster] Write log messages to a different file Message-ID: Classification: UNCLASSIFIED > Are there plans to implement those loggin facilities to the other > daemons? unfortunately Redhat probably can't get away with defining a new facility: "cluster" but it would be nice if they'd settle on a localN. daemon sorta fits but then it would pollute the regular daemon stream with all it's noise. I can't stand the stock RH syslog.conf. But hey, that's my perogative. Every daemon should have an option to specify the facility. But this is unix - nobody does anything in a consistant manner. Shoot, even the LVM tools aren't consistant with each other. While I'm on my rant, please stop using XML to configure daemons. I don't mean eg. the cluster configuration itself, but like settings for rgmanager. What facility it uses does NOT belong anywhere but in /etc/sysconfig. I'm all for new stuff and fixing new stuff but I wish the larger Linux/unix community would spend some time fixing all the garbage that's been around for 30+ years. -------------- next part -------------- An HTML attachment was scrubbed... URL: From teigland at redhat.com Thu Sep 7 14:10:37 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 7 Sep 2006 09:10:37 -0500 Subject: [Linux-cluster] CMAN and interface In-Reply-To: <644A0966265D9D40AC7584FCE956111302EDD634@dubexc01.emea.cpqcorp.net> References: <644A0966265D9D40AC7584FCE956111302EDD634@dubexc01.emea.cpqcorp.net> Message-ID: <20060907141037.GB7775@redhat.com> On Tue, Sep 05, 2006 at 01:50:23PM +0100, O'Sullivan, Damian wrote: > Hi, > > How do I ensure that CMAN uses a specific interface? I have a 2 node > cluster with 6 ethernet interfaces. I have a cross over cable beween the > 2 eth0 interfaces on both nodes. All other interfaces are connected to a > common switch with VLANs for each interface. When this switch is > reloaded/rebooted the nodes try to fence each other and soon as the > switch comes back each node is shutdown by the fencing agent. > > I see there is a way with multicast but is that the only way and how > does one set up addresses for this? The node names in cluster.conf should be the name assigned to the interface you want cman/dlm to use for heartbeating/locking. So, in your case it sounds like you should use the name of the address on eth0 in cluster.conf (I think you can use IP addresses as node names, too, but I'm not certain.) Dave From teigland at redhat.com Thu Sep 7 14:21:27 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 7 Sep 2006 09:21:27 -0500 Subject: [Linux-cluster] data journaling for increased performance In-Reply-To: <44FEE4C2.1070205@obsidian.co.za> References: <44FEE4C2.1070205@obsidian.co.za> Message-ID: <20060907142127.GC7775@redhat.com> On Wed, Sep 06, 2006 at 05:09:54PM +0200, Riaan van Niekerk wrote: > has anyone been able to use GFS data journaling to get any measurable > performance boost? For those unfamiliar: > > http://www.redhat.com/docs/manuals/csgfs/browse/rh-gfs-en/s1-manage-data-journal.html > > We have a 2.6 TB maildir mail store (e.g. lots of small files) and think > of implementing it (we will take any performance increase we can get as > long as it does not impact reliability), and even though it will only > apply to new files. 
> > Also, is it possible to check if the inherit_jdata (for directories) or > jdata (for files) flag has been set? 'gfs_tool stat' will show the gfs-specific flags on files or directories. Dave From damian.osullivan at hp.com Thu Sep 7 14:24:57 2006 From: damian.osullivan at hp.com (O'Sullivan, Damian) Date: Thu, 7 Sep 2006 15:24:57 +0100 Subject: [Linux-cluster] CMAN and interface In-Reply-To: <20060907141037.GB7775@redhat.com> Message-ID: <644A0966265D9D40AC7584FCE956111302F17C0C@dubexc01.emea.cpqcorp.net> > -----Original Message----- > From: David Teigland [mailto:teigland at redhat.com] > Sent: 07 September 2006 15:11 > To: O'Sullivan, Damian > Cc: Linux-cluster at redhat.com > Subject: Re: [Linux-cluster] CMAN and interface > The node names in cluster.conf should be the name assigned to > the interface you want cman/dlm to use for > heartbeating/locking. So, in your case it sounds like you > should use the name of the address on eth0 in cluster.conf (I > think you can use IP addresses as node names, too, but I'm > not certain.) > > Dave > Thanks Dave, I assume it is no problem to change the node names in the cluster.conf file on a running cluster? D. From teigland at redhat.com Thu Sep 7 14:30:31 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 7 Sep 2006 09:30:31 -0500 Subject: [Linux-cluster] CMAN and interface In-Reply-To: <644A0966265D9D40AC7584FCE956111302F17C0C@dubexc01.emea.cpqcorp.net> References: <20060907141037.GB7775@redhat.com> <644A0966265D9D40AC7584FCE956111302F17C0C@dubexc01.emea.cpqcorp.net> Message-ID: <20060907143031.GD7775@redhat.com> On Thu, Sep 07, 2006 at 03:24:57PM +0100, O'Sullivan, Damian wrote: > > -----Original Message----- > > From: David Teigland [mailto:teigland at redhat.com] > > Sent: 07 September 2006 15:11 > > To: O'Sullivan, Damian > > Cc: Linux-cluster at redhat.com > > Subject: Re: [Linux-cluster] CMAN and interface > > > The node names in cluster.conf should be the name assigned to > > the interface you want cman/dlm to use for > > heartbeating/locking. So, in your case it sounds like you > > should use the name of the address on eth0 in cluster.conf (I > > think you can use IP addresses as node names, too, but I'm > > not certain.) > > > > Dave > > > > Thanks Dave, > > I assume it is no problem to change the node names in the cluster.conf > file on a running cluster? I think it would be a problem, although I can't say exactly what would break or how badly. One thing that would break is fencing, since fenced would be given the old name to fence and wouldn't be able to find it in cluster.conf when looking up fencing paramters. You probably need to stop both nodes, change the names, then have them rejoin the cluster. Dave From titi.titi75 at caramail.com Thu Sep 7 17:03:15 2006 From: titi.titi75 at caramail.com (titi.titi75) Date: Thu Sep 07 17:03:15 GMT+00:00 2006 Subject: [Linux-cluster] lock_nolock to lock_dlm trouble Message-ID: <10949343554084@lycos-europe.com> An HTML attachment was scrubbed... 
URL: From filipe.miranda at gmail.com Thu Sep 7 17:16:12 2006 From: filipe.miranda at gmail.com (Filipe Miranda) Date: Thu, 7 Sep 2006 14:16:12 -0300 Subject: [Linux-cluster] clustat problem In-Reply-To: <44FF9FF0.1000404@webbertek.com.br> References: <44F96728.8090902@cesca.es> <44FF9FF0.1000404@webbertek.com.br> Message-ID: Hello Celso, Well, your suggestion might be the solution to the problem, but since I think its a quorum latency problem, would the parameters "cludb -p clumemb%rtp 50" and "cludb -p cluquorumd%rtp 50" help on this this issue? I was digging into the Cluster Suite documentation and I found these parameters. Would those help on this issue without changing the heartbeat method? Also take a look in this Kbase bellow, it has some interesting tunning parameters for Red Hat's Clsuter Suite v3: http://kbase.redhat.com/faq/FAQ_79_7722.shtm Regards, Filipe Miranda On 9/7/06, Celso K. Webber wrote: > > Hi Filipe! > > I think your case is a little bit different from Jordi's case, since you > are using Cluster Suite v3 and he is using v4. > > From my own experience, under CSv3 I had this kind o problem when using > high latency quorum devices. So I had to change from disk tiebraker to > network tiebraker. I imagine you're using disk tiebraker, aren't you? > > Please, would someone please confirm that Filipe's case could be solved > by changing the heartbeat method? It worked for me in the past, but I'm > not pretty sure that this was the actual solution. > > Thanks, > > Celso. > > Filipe Miranda escreveu: > > Hi there, > > > > I'm having the same problem! > > I'm using RHEL3.8 for Itanium and RedHat Cluster Suite U8. The cluster > > is composed of 2 HP 4CPUs servers and we are using an EMC ClarionCX700 > > to hold the quorum partitions and data partitions. > > One more thing that I noticed, eventhough the members are shown ative on > > both nodes, any action on the node that shows the active service does > > not get propagated to the other member. > > > > I already checked the configuration of the rawdevices, and I also used > > the shutil utility and it reported no problems with the quorum > partitions. > > > > Does anybody have any suggestions? > > > > Thank you, > > > > > > On 9/2/06, *Jordi Prats Catal?* > > wrote: > > > > Hi, > > I'm getting different outputs of clustat utility on each node: > > > > node1: > > # clustat > > Member Status: Quorate > > > > Member Name Status > > ------ ---- ------ > > node1 Online, Local, rgmanager > > node2 Online, rgmanager > > > > Service Name Owner (Last) State > > ------- ---- ----- ------ ----- > > ptoheczas node2 started > > xoqil node2 started > > ymsgh node1 started > > vofcvhas node2 started > > > > node2: > > # clustat > > Member Status: Quorate > > > > Member Name Status > > ------ ---- ------ > > node1 Online, rgmanager > > node2 Online, Local, rgmanager > > > > > > (disappears service's info) > > > > Rebooting disapears this problem (displays same info in both nodes) > for > > a few weeks. After that it appears again. > > > > Do you know what's going on? > > > > Thanks, > > > > -- > > > ...................................................................... > > __ > > / / Jordi Prats Catal? > > C E / S / C A Departament de Sistemes > > /_/ Centre de Supercomputaci? de Catalunya > > > > Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona > > T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es > > > > > ...................................................................... 
> > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > -- > > Esta mensagem foi verificada pelo sistema de antiv?rus e > > acredita-se estar livre de perigo. > > > > > > ------------------------------------------------------------------------ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > *Celso Kopp Webber* > > celso at webbertek.com.br > > *Webbertek - Opensource Knowledge* > (41) 8813-1919 > (41) 3284-3035 > > > -- > Esta mensagem foi verificada pelo sistema de antiv?rus e > acredita-se estar livre de perigo. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- --- Filipe T Miranda Red Hat Certified Engineer -------------- next part -------------- An HTML attachment was scrubbed... URL: From teigland at redhat.com Thu Sep 7 17:53:03 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 7 Sep 2006 12:53:03 -0500 Subject: [Linux-cluster] lock_nolock to lock_dlm trouble In-Reply-To: <10949343554084@lycos-europe.com> References: <10949343554084@lycos-europe.com> Message-ID: <20060907175303.GI7775@redhat.com> > I had to remove a storage system from my cluster. It was formatted using > lock_dlm before being removed. An then, it was plugged on a single > server, using the "lockproto=lock_nolock" option. > > Now, I put it back in the cluster, but I can 't mount it with the > standard lock_dlm (but it's ok with the lock_nolock option, but it of > course prevents me to share it). The error is > > GFS: Trying to join cluster "lock_dlm", "alpha_cluster:vol001" > lock_dlm: new lockspace error -17 > GFS: can't mount proto = lock_dlm, table = alpha_cluster:vol001, hostdata = -17 is EEXIST, meaning a dlm lockspace with the name "vol001" already exists. "cman_tool services" should display it. Do you have another fs with the same name? You may need to reboot the system to get it back in shape. Dave PS. please send text instead of html mail From lhh at redhat.com Thu Sep 7 21:50:14 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 07 Sep 2006 17:50:14 -0400 Subject: [Linux-cluster] Is IPMI fencing considered certified by Red Hat? In-Reply-To: <44FF9932.9020508@webbertek.com.br> References: <44FF9932.9020508@webbertek.com.br> Message-ID: <1157665814.3610.251.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-07 at 00:59 -0300, Celso K. Webber wrote: > Hello friends, > > Regarding Red Hat Cluster Suite and/or GFS, could someone from Red Hat > please tell me if the use of IPMI embedded devices from the servers' > motherboards is officially certified by Red Hat? > > I'd like to have this information so that we can recommend (or not) to > customers the use of IPMI as a secure form of fencing. > > We had some bad experiences recently on some servers where only one of > the onboard NICs listened to the IPMI over LAN packets, so it appeared > to us that sometimes IPMI is not that safe as a fence device. Of course > the Cluster software will assume nothing when the fencing fails, but the > bad thing is that there is no automatic failover on this situation. It's supported, but there are a couple of caveats that you should be aware of: (a) You should, if possible, use the IPMI-enabled NIC only for IPMI traffic. At least, you should not use it for cluster communication traffic - though it is fine for service-related (e.g. rgmanager, etc.) 
and other traffic. That way, the IPMI-enabled port can't become a single point of failure. Here's why: If IPMI and cluster traffic are using the same NIC, then that NIC failing (or becoming disconnected) will cause the node to be evicted -- but prevent fencing, because the IPMI host will be unreachable. Similarly, on a machine with a single power supply + IPMI fencing in a cluster, the power cord becomes a SPF - if you pull the power, the host is dead and fencing cannot complete (because IPMI does not have power either!), which leads to... (b) If you do not have *both* dual power supplies and dual NICs, you need something else (in addition to IPMI) if NSPF is a requirement for your particular installation. For example, what one linux-cluster user did was add their fiber channel switch as a secondary fence device (in its own fence level). His cluster tries to fence using IPMI. Failing that, the cluster falls back to fencing via the fiber switch. (c) You often need to disable ACPI on hardware which has IPMI if you intend to use IPMI for fencing. This can vary on a per-machine basis, so you should check first. If a host does a "graceful shutdown" when you fence it via IPMI, you need to disable ACPI on that host (e.g. boot with acpi=off). The server should turn off immediately (or within 4-5 seconds, like when holding an ATX power button in to force a machine off). Hope that helps! -- Lon From peter.huesser at psi.ch Thu Sep 7 22:49:27 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 00:49:27 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connect to cluster infrastructure Message-ID: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> Hello I try to make a two node cluster run. Unfortunately if I run the "/etc/init.d/cman start" command I get in "/var/log/messages" entries like: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Initial status:: Inquorate Cluster manager shutdown. Attemping to reconnect... Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Initial status:: Inquorate Cluster manager shutdown. Attemping to reconnect... "dmesg" shows: CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: forming a new cluster CMAN: quorum regained, resuming activity CMAN: sendmsg failed: -13 CMAN: No functional network interfaces, leaving cluster CMAN: sendmsg failed: -13 CMAN: we are leaving the cluster. CMAN: Waiting to join or form a Linux-cluster CMAN: sendmsg failed: -13 .... The "/etc/hosts" file is correctly set up. "iptables" are disabled ("service iptables stop"). The "cluster.conf" file looks like: I found a thread about this topic in June and August but these did not help me. Any ideas what could be wrong. Sorry, it is possible that I make a complete stupid error (this is my first cluster I set up). Thanks' for any help Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From orkcu at yahoo.com Thu Sep 7 22:59:18 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Thu, 7 Sep 2006 15:59:18 -0700 (PDT) Subject: [Linux-cluster] Two node cluster: node cannot connect to cluster infrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> Message-ID: <20060907225918.76101.qmail@web50612.mail.yahoo.com> --- Huesser Peter wrote: > Hello > > > > I try to make a two node cluster run. 
Unfortunately > if I run the > "/etc/init.d/cman start" command I get in > "/var/log/messages" entries > like: > > > > Connected to cluster infrastruture via: CMAN/SM > Plugin v1.1.5 > > Initial status:: Inquorate > > Cluster manager shutdown. Attemping to > reconnect... > > Connected to cluster infrastruture via: CMAN/SM > Plugin v1.1.5 > > Initial status:: Inquorate > > Cluster manager shutdown. Attemping to > reconnect... > > > > "dmesg" shows: > > > > CMAN: sendmsg failed: -13 > > CMAN: sendmsg failed: -13 > > CMAN: sendmsg failed: -13 > > CMAN: forming a new cluster > > CMAN: quorum regained, resuming activity > > CMAN: sendmsg failed: -13 > > CMAN: No functional network interfaces, leaving > cluster > > CMAN: sendmsg failed: -13 > > CMAN: we are leaving the cluster. > > CMAN: Waiting to join or form a Linux-cluster > > CMAN: sendmsg failed: -13 > > .... shutdown the iptables just to see if anything change cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From rodgersr at yahoo.com Thu Sep 7 23:07:28 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Thu, 7 Sep 2006 16:07:28 -0700 (PDT) Subject: [Linux-cluster] Issue stonith commands to the failing node twice?? Message-ID: <20060907230728.27221.qmail@web34207.mail.mud.yahoo.com> I am using an older version of clumanger (about 2 yrs old) and I notice that when the active node goes down the back will actually issue stonith commands twice. They are about 60 seconds apart. Does this happen to anyone else?? -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Fri Sep 8 07:44:50 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 09:44:50 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connect to clusterinfrastructure In-Reply-To: <20060907225918.76101.qmail@web50612.mail.yahoo.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19E37@MAILBOX0A.psi.ch> Hello Roger > > shutdown the iptables just to see if anything change > Thanks' for the answer but I already tried this. No effect. Pedro From titi.titi75 at caramail.com Fri Sep 8 08:45:08 2006 From: titi.titi75 at caramail.com (titi.titi75) Date: Fri Sep 08 08:45:08 GMT+00:00 2006 Subject: [Linux-cluster] Re: lock_nolock to lock_dlm trouble - solved Message-ID: <16513365574571@lycos-europe.com> Hello, Ignore my precedent post. I made a mistake. The problem wasn't because of a lock manager modification, but because of a conflict with the FSName. I solved my problem with a 'gfs_tool sb /dev/sanstock3/vol001 table alpha_cluster:vol003' command Thank's Jerome From peter.huesser at psi.ch Fri Sep 8 09:25:30 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 11:25:30 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connect to clusterinfrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> Message-ID: <8E2924888511274B95014C2DD906E58AD19E58@MAILBOX0A.psi.ch> I forgot to mention, that I first execute "/etc/init.d/ccsd start" on all servers and afterwards "/etc/init.d/cman start". In the "/var/log/messages" file I see (after some time) a line like "Cluster is quorate. Allowing connections" which sounds interesting but already on the next line I see "Cluster manager shutdown. Attempting to reconnect...". 
Later I only have the entries you see below. Pedro Hello I try to make a two node cluster run. Unfortunately if I run the "/etc/init.d/cman start" command I get in "/var/log/messages" entries like: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Initial status:: Inquorate Cluster manager shutdown. Attemping to reconnect... Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Initial status:: Inquorate Cluster manager shutdown. Attemping to reconnect... "dmesg" shows: CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: forming a new cluster CMAN: quorum regained, resuming activity CMAN: sendmsg failed: -13 CMAN: No functional network interfaces, leaving cluster CMAN: sendmsg failed: -13 CMAN: we are leaving the cluster. CMAN: Waiting to join or form a Linux-cluster CMAN: sendmsg failed: -13 .... The "/etc/hosts" file is correctly set up. "iptables" are disabled ("service iptables stop"). The "cluster.conf" file looks like: I found a thread about this topic in June and August but these did not help me. Any ideas what could be wrong. Sorry, it is possible that I make a complete stupid error (this is my first cluster I set up). Thanks' for any help Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From orkcu at yahoo.com Fri Sep 8 13:08:23 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 8 Sep 2006 06:08:23 -0700 (PDT) Subject: [Linux-cluster] Two node cluster: node cannot connect to clusterinfrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E37@MAILBOX0A.psi.ch> Message-ID: <20060908130823.8334.qmail@web50610.mail.yahoo.com> --- Huesser Peter wrote: > Hello Roger > > > > > shutdown the iptables just to see if anything > change > > > > Thanks' for the answer but I already tried this. No > effect. in both nodes? even before the start the ccsd daemon? ok, that was a guess sorry if it not help you :-( cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From peter.huesser at psi.ch Fri Sep 8 13:17:46 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 15:17:46 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connect toclusterinfrastructure In-Reply-To: <20060908130823.8334.qmail@web50610.mail.yahoo.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19E73@MAILBOX0A.psi.ch> > > in both nodes? > even before the start the ccsd daemon? > Yes (unfortunately). > > ok, that was a guess > sorry if it not help you :-( > I am glad for any answer. I am looking after the problem for quit a long time now and do not see a solution. Pedro From m.catanese at kinetikon.com Fri Sep 8 12:58:03 2006 From: m.catanese at kinetikon.com (Matteo Catanese) Date: Fri, 8 Sep 2006 14:58:03 +0200 Subject: [Linux-cluster] system-config-cluster problem Message-ID: I've setup a cluster some month ago. Cluster is working , but still not in production. Today, after summer break, i did all the updates for my rhat and CS First i disabled all services, then i patched one machine and rebooted, then the other one and rebooted. 
Cluster works perfectly: [root at lvzbe1 ~]# uname -a Linux lvzbe1.lavazza.it 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux [root at lvzbe1 ~]# clustat -v clustat version 1.9.53 Connected via: CMAN/SM Plugin v1.1.7.1 [root at lvzbe1 ~]# clustat Member Status: Quorate Member Name Status ------ ---- ------ lvzbe1 Online, Local, rgmanager lvzbe2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- oracle lvzbe1 started [root at lvzbe1 ~]# But when i try to run system-config-cluster,it pops out: Poorly Formed XML error A problem was encoutered while reading configuration file /etc/ cluster/clluster.conf. Details or the error appear below. Click the "New" button to create a new configuration file. To continue anyway(Not Recommended!), click the "ok" button. Relax-NG validity error : Extra element rm in interleave /etc/cluster/cluster.conf:35: element rm: Relax-NG validity error : Element cluster failed to validate content /etc/cluster/cluster.conf fails to validate I clicked the "cancel" button, to not to damage all. Conf file is immutated since Jul 13 2006 Matteo -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 2192 bytes Desc: not available URL: From orkcu at yahoo.com Fri Sep 8 13:55:50 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 8 Sep 2006 06:55:50 -0700 (PDT) Subject: [Linux-cluster] system-config-cluster problem In-Reply-To: Message-ID: <20060908135552.37715.qmail@web50611.mail.yahoo.com> > But when i try to run system-config-cluster,it pops > out: > > Poorly Formed XML error > A problem was encoutered while reading configuration > file /etc/ > cluster/clluster.conf. ^^^^^^^^ this extra 'l' is a tipo error or is actually there ? cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From orkcu at yahoo.com Fri Sep 8 14:10:43 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 8 Sep 2006 07:10:43 -0700 (PDT) Subject: [Linux-cluster] Two node cluster: node cannot connect toclusterinfrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E73@MAILBOX0A.psi.ch> Message-ID: <20060908141043.44667.qmail@web50611.mail.yahoo.com> --- Huesser Peter wrote: > > > > in both nodes? > > even before the start the ccsd daemon? > > > Yes (unfortunately). > > > > ok, that was a guess > > sorry if it not help you :-( > > > I am glad for any answer. I am looking after the > problem for quit a long > time now and do not see a solution. > my fisrt time with rhcs I had something like that, I did a lot of things, but the last one before the node join the cluster was a : cman_tooy join I did that in the first node without configure the second node, I didn't had to do that for the second node After that first experience I had others rhcs installations from scratch and never I had to do the "cman_tool join" again ... but maybe you need it ... cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! 
Mail has the best spam protection around http://mail.yahoo.com From jparsons at redhat.com Fri Sep 8 13:44:44 2006 From: jparsons at redhat.com (Jim Parsons) Date: Fri, 08 Sep 2006 09:44:44 -0400 Subject: [Linux-cluster] system-config-cluster problem References: Message-ID: <450173CC.6070401@redhat.com> Hi Matteo, Sorry for the scary warning. I will look at this issue this morning. Before the s-c-cluster app reads in a cluster.conf, it runs the file against 'xmllint --relaxng' and checks for errors. A bad cluster.conf file could wreak havoc in the GUI. Sometimes errors creep in from hand editing, but they can also occur if we have missed an xml construct we use in the schema file. I'll let you know what is up. Unfortunately, the relaxNG error messages are not very descriptive, but they improve with every release of the validation checker. Thanks for sending your conf file. -J Matteo Catanese wrote: > I've setup a cluster some month ago. > > Cluster is working , but still not in production. > > Today, after summer break, i did all the updates for my rhat and CS > > First i disabled all services, then i patched one machine and > rebooted, then the other one and rebooted. > > > Cluster works perfectly: > > > [root at lvzbe1 ~]# uname -a > Linux lvzbe1.lavazza.it 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 > EDT 2006 i686 i686 i386 GNU/Linux > [root at lvzbe1 ~]# clustat -v > clustat version 1.9.53 > Connected via: CMAN/SM Plugin v1.1.7.1 > [root at lvzbe1 ~]# clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > lvzbe1 Online, Local, rgmanager > lvzbe2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > oracle lvzbe1 started > [root at lvzbe1 ~]# > > > But when i try to run system-config-cluster,it pops out: > > Poorly Formed XML error > A problem was encoutered while reading configuration file /etc/ > cluster/clluster.conf. > Details or the error appear below. Click the "New" button to create a > new configuration file. > To continue anyway(Not Recommended!), click the "ok" button. > > > Relax-NG validity error : Extra element rm in interleave > /etc/cluster/cluster.conf:35: element rm: Relax-NG validity error : > Element cluster failed to validate content > /etc/cluster/cluster.conf fails to validate > > > > > I clicked the "cancel" button, to not to damage all. > > Conf file is immutated since Jul 13 2006 > > Matteo > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > From peter.huesser at psi.ch Fri Sep 8 16:22:38 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 18:22:38 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connecttoclusterinfrastructure In-Reply-To: <20060908141043.44667.qmail@web50611.mail.yahoo.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19E87@MAILBOX0A.psi.ch> > my fisrt time with rhcs I had something like that, I > did a lot of things, but the last one before the node > join the cluster was a : > cman_tooy join > > I did that in the first node without configure the > second node, I didn't had to do that for the second > node Did not help either. What I do not understand is, that in some situations the node gets quorated but immediately afterwards is shutdowned ??? > > After that first experience I had others rhcs > installations from scratch and never I had to do the > "cman_tool join" again ... 
> What do you mean with installation from scratch. Did you recompile the packages by yourself ? Pedro From orkcu at yahoo.com Fri Sep 8 16:30:21 2006 From: orkcu at yahoo.com (Roger Peña Escobio) Date: Fri, 8 Sep 2006 09:30:21 -0700 (PDT) Subject: RE: [Linux-cluster] Two node cluster: node cannot connecttoclusterinfrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E87@MAILBOX0A.psi.ch> Message-ID: <20060908163021.99026.qmail@web50610.mail.yahoo.com> > > After that first experience I had others rhcs > > installations from scratch and never I had to do > the > > "cman_tool join" again ... > > > What do you mean with installation from scratch. Did > you recompile the > packages by yourself ? > not so "from scratch" ;-) I use the CentOS 4 recompilation of rhcs and rhgfs. What I meant was a complete installation of the cluster, including the operating system, so no previous conf files, no cache, and no other files taken from a previous working system. cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ From Darrell.Frazier at crc.army.mil Fri Sep 8 17:14:39 2006 From: Darrell.Frazier at crc.army.mil (Frazier, Darrell USA CRC (Contractor)) Date: Fri, 8 Sep 2006 12:14:39 -0500 Subject: [Linux-cluster] Odd/Even Nodes for RHCS/GFS Message-ID: Hello, I have heard that RHCS may have an issue with an odd number of nodes vs. an even number of nodes. Has anyone heard of this? Thanx. Darrell J. Frazier Unix System Administrator US Army Combat Readiness Center Fort Rucker, Alabama 36362 CAUTION: This electronic transmission may contain information protected by deliberative process or other privilege, which is protected from disclosure under the Freedom of Information Act, 5 U.S.C. § 552. The information is intended for the use of the individual or agency to which it was sent. If you are not the intended recipient, be aware that any disclosure, distribution or use of the contents of this information is prohibited. Do not release outside of DoD channels without prior authorization from the sender. The sender provides no assurance as to the integrity of the content of this electronic transmission after it has been sent and received by the intended email recipient. From rodgersr at yahoo.com Fri Sep 8 18:16:36 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Fri, 8 Sep 2006 11:16:36 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas Message-ID: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> Does anyone know of a good solution for providing failover for something like a Dell 1850? The issue here is that the power source plug in the back provides power for both the internal power controller and the node itself. So if you pull the cord it will not fail over, because it cannot Stonith the failed node (power controller is down also). -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From busyadmin at gmail.com Fri Sep 8 23:18:18 2006 From: busyadmin at gmail.com (Ken Johnson) Date: Fri, 8 Sep 2006 17:18:18 -0600 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> Message-ID: <200609081718.18842.ken@novell.com> On Fri, 8 Sep 2006 11:16:36 -0700, Rick Rodgers wrote: > Does anyone know of a good solution to providing good failover > for somthing like a Dell 1850? The issue here is that the power > souce plug in the back provides power for both the internal power > controller and the node itself. So if you pull the cord it will not > failover because it can not Stonith the failed node (power controller is > down also). I've used the fence_ipmi and fence_drac agents for these systems successfully. - Ken From eric at bootseg.com Sat Sep 9 00:10:57 2006 From: eric at bootseg.com (Eric Kerin) Date: Fri, 08 Sep 2006 20:10:57 -0400 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> Message-ID: <1157760657.16147.7.camel@mechanism.localnet> On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: > Does anyone know of a good solution to providing good failover > for somthing like a Dell 1850? The issue here is that the power > souce plug in the back provides power for both the internal power > controller > and the node itself. So if you pull the cord it will not failover > because > it can not Stonith the failed node (power controller is down also). > While you can't eliminate your chances of that happening while using the internal fence device, you can reduce the chance by using dual power supplies. Obviously if both power supplies go to the same PDU then you only buy so much. For my cluster, I use two external power controllers (APC 7900's) to fence my nodes, two to provide redundant power paths and no single point of failure for power. While I could use the built in RIB card (HP Servers) this method reduces the possible failure points. Thanks, Eric Kerin eric at bootseg.com From danwest at comcast.net Tue Sep 5 11:43:42 2006 From: danwest at comcast.net (danwest) Date: Tue, 05 Sep 2006 07:43:42 -0400 Subject: [Linux-cluster] 2-node fencing question (IPMI/ACPI question) In-Reply-To: <1154633146.28677.70.camel@ayanami.boston.redhat.com> References: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> <1154633146.28677.70.camel@ayanami.boston.redhat.com> Message-ID: <1157456622.4378.7.camel@belmont.site> What happens if the servers you are using require ACPI=on in order to boot. For instance IBM X366 servers need ACPI set in order to boot. With ACPI=on both nodes reboot when a fence occurs(see "both nodes off problem" in thread below). This is not desirable, especially with active/active clusters. Thanks, dan > Sorry I didn't see this earlier! > > On Wed, 2006-08-02 at 15:50 +0000, danwest at comcast.net wrote: > > It seems like a significant problem to have fence_ipmilan issue a power-off followed by a power-on with a 2 node cluster. > > Generally, the chances of this occurring are very, very small, though > not impossible. > > However, it could very well be that IPMI hardware modules are slow > enough at processing requests that this could pose a problem. What > hardware has this happened on? 
Was ACPI disabled on boot in the host OS > (it should be; see below)? > > > > This seems to make a 2-node cluster with ipmi fencing pointless. > > I'm pretty sure that 'both-nodes-off problem' can only occur if all of > the following criteria are met: > > (a) while using a separate NICs for IPMI and cluster traffic (the > recommended configuration), > > (b) in the event of a network partition, such that both nodes can not > see each other but can see each other's IPMI port, and > > (c) if both nodes send their power-off packets at or near the exact same > time. > > The time window for (c) increases significantly (5+ seconds) if the > cluster nodes are enabling ACPI power events on boot. This is one of > the reasons why booting with acpi=off is required when using IPMI, iLO, > or other integrated power management solutions. > > If booting with acpi=off, does the problem persist? > > > It looks like fence_ipmilan needs to support sending a cycle instead of a poweroff than a poweron? > > The reason fence_ipmilan functions this way (off, status, on) is because > that we require a confirmation that the node has lost power. I am not > sure that it is possible to confirm the node has rebooted using IPMI. > > Arguably, it also might not be necessary to make such a confirmation in > this particular case. > > > According to fence_ipmilan.c it looks like cycle is not an option although it is an option for ipmitool. (ipmitool -H -U -P chassis power cycle) > > Looks like you're on the right track. > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From marcelosoaressouza at gmail.com Fri Sep 8 14:18:52 2006 From: marcelosoaressouza at gmail.com (Marcelo Souza) Date: Fri, 8 Sep 2006 10:18:52 -0400 Subject: [Linux-cluster] Slackware Package for openmpi 1.1.1 and mpich2 1.0.4p1 Message-ID: <12c9ca330609080718m4e15793fle202afa21a5b3227@mail.gmail.com> If interest anyone i make Slackware packages, i486, for openmpi 1.1.1 and mpich2 1.0.4p1 TGZ http://www.cebacad.net/slackware/openmpi-1.1.1-i486-1goa.tgz http://www.cebacad.net/slackware/mpich2-1.0.4p1-i486-1goa.tgz signed with my pgp key http://www.cebacad.net/slackware/openmpi-1.1.1-i486-1goa.tgz.asc http://www.cebacad.net/slackware/mpich2-1.0.4p1-i486-1goa.tgz.asc MD5 http://www.cebacad.net/slackware/openmpi-1.1.1-i486-1goa.tgz.md5 http://www.cebacad.net/slackware/mpich2-1.0.4p1-i486-1goa.tgz.md5 see ya Marcelo Souza (marcelo at cebacad.net) http://marcelo.cebacad.net http://slackbeowulf.cebacad.net From rodgersr at yahoo.com Mon Sep 11 02:02:03 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 19:02:03 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <200609081718.18842.ken@novell.com> Message-ID: <20060911020203.67886.qmail@web34206.mail.mud.yahoo.com> How does this help? The power controller is still down ----- Original Message ---- From: Ken Johnson To: linux-cluster at redhat.com Sent: Friday, September 8, 2006 4:18:18 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Fri, 8 Sep 2006 11:16:36 -0700, Rick Rodgers wrote: > Does anyone know of a good solution to providing good failover > for somthing like a Dell 1850? The issue here is that the power > souce plug in the back provides power for both the internal power > controller and the node itself. 
So if you pull the cord it will not > failover because it can not Stonith the failed node (power controller is > down also). I've used the fence_ipmi and fence_drac agents for these systems successfully. - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From busyadmin at gmail.com Mon Sep 11 04:02:19 2006 From: busyadmin at gmail.com (Ken Johnson) Date: Sun, 10 Sep 2006 22:02:19 -0600 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060911020203.67886.qmail@web34206.mail.mud.yahoo.com> References: <200609081718.18842.ken@novell.com> <20060911020203.67886.qmail@web34206.mail.mud.yahoo.com> Message-ID: <1c0e77670609102102w79125384xd648b11b3e3dc889@mail.gmail.com> On Sun, 10 Sep 2006 at 19:02 -0700, Rick Rodgers wrote: > How does this help? The power controller is still down Sorry, I obviously don't understand your question. I thought you were looking for fencing solutions for these devices (1850's). - Ken From rodgersr at yahoo.com Mon Sep 11 04:14:11 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 21:14:11 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1c0e77670609102102w79125384xd648b11b3e3dc889@mail.gmail.com> Message-ID: <20060911041411.22029.qmail@web34202.mail.mud.yahoo.com> Yes I was, but if the power controller is down (unreachable) and the system (node) is hung how can these fence anything? By pulling the plug you loose both and you can not be sure of anything since you can not successfully issue a power cycle command. Thanks for your input though. ----- Original Message ---- From: Ken Johnson To: linux clustering Sent: Sunday, September 10, 2006 9:02:19 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Sun, 10 Sep 2006 at 19:02 -0700, Rick Rodgers wrote: > How does this help? The power controller is still down Sorry, I obviously don't understand your question. I thought you were looking for fencing solutions for these devices (1850's). - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Mon Sep 11 04:15:16 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 21:15:16 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1c0e77670609102102w79125384xd648b11b3e3dc889@mail.gmail.com> Message-ID: <20060911041516.22450.qmail@web34202.mail.mud.yahoo.com> can these agents do anything if the power controller is unaccessable? ----- Original Message ---- From: Ken Johnson To: linux clustering Sent: Sunday, September 10, 2006 9:02:19 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Sun, 10 Sep 2006 at 19:02 -0700, Rick Rodgers wrote: > How does this help? The power controller is still down Sorry, I obviously don't understand your question. I thought you were looking for fencing solutions for these devices (1850's). - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From busyadmin at gmail.com Mon Sep 11 04:42:16 2006 From: busyadmin at gmail.com (Ken Johnson) Date: Sun, 10 Sep 2006 22:42:16 -0600 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060911041411.22029.qmail@web34202.mail.mud.yahoo.com> References: <1c0e77670609102102w79125384xd648b11b3e3dc889@mail.gmail.com> <20060911041411.22029.qmail@web34202.mail.mud.yahoo.com> Message-ID: <1c0e77670609102142t267518e3l99a3a2a6eb5fd498@mail.gmail.com> On Sun, 10 Sep 2006 at 21:14 -0700, Rick Rodgers wrote: > Yes I was, but if the power controller is down (unreachable) > and the system (node) is hung how can these fence anything? > By pulling the plug you loose both and you can not be sure of anything > since you can not successfully issue a power cycle command. I'm not sure I understand what you mean by "if the power controller is down". These systems can be configured with redundant power supplies and if both power supplies fail then there's not anything you can do to fence a system. > Thanks for your input though. sure, np - Ken From rodgersr at yahoo.com Mon Sep 11 06:19:12 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 23:19:12 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1c0e77670609102142t267518e3l99a3a2a6eb5fd498@mail.gmail.com> Message-ID: <20060911061912.39375.qmail@web34204.mail.mud.yahoo.com> Yes that is what my point is. These systems use the same power cord for the powercontroller and system power. If you pull the plug then no failover can happen because the backup node can not shoot the active node because it can not talk to the active nodes power controller. This means a pull of the plug and no failover. Seem like we really should havea way to failover. ----- Original Message ---- From: Ken Johnson To: linux clustering Sent: Sunday, September 10, 2006 9:42:16 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Sun, 10 Sep 2006 at 21:14 -0700, Rick Rodgers wrote: > Yes I was, but if the power controller is down (unreachable) > and the system (node) is hung how can these fence anything? > By pulling the plug you loose both and you can not be sure of anything > since you can not successfully issue a power cycle command. I'm not sure I understand what you mean by "if the power controller is down". These systems can be configured with redundant power supplies and if both power supplies fail then there's not anything you can do to fence a system. > Thanks for your input though. sure, np - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Mon Sep 11 06:23:33 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 23:23:33 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1c0e77670609102142t267518e3l99a3a2a6eb5fd498@mail.gmail.com> Message-ID: <20060911062333.48552.qmail@web34210.mail.mud.yahoo.com> When you say redundant power supply, do you mean they have the same IP address?. If not, how does Clumanger handle talking to two power supplys? And if one goes down how does it know to talk to the other? Is there a configuration in cluster.xml? 
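[Aside for readers of the archive: Lon answers this further down the thread - clumanager 1.2 has no real notion of backup power controllers (multiple devices are "all or nothing"), while the CS4-generation cluster.conf does, via ordered fence "methods" (levels) per node. As a rough sketch only, with made-up device names, addresses and passwords, a node with its IPMI controller as the first level and an external APC switch as the backup level would look something like:

    <clusternode name="node1" votes="1">
      <fence>
        <method name="1">
          <device name="node1-ipmi"/>
        </method>
        <method name="2">
          <device name="apc-pdu" port="1"/>
        </method>
      </fence>
    </clusternode>
    ...
    <fencedevices>
      <fencedevice name="node1-ipmi" agent="fence_ipmilan" ipaddr="10.0.0.11" login="admin" passwd="secret"/>
      <fencedevice name="apc-pdu" agent="fence_apc" ipaddr="10.0.0.20" login="apc" passwd="apc"/>
    </fencedevices>

fenced tries method "1" first and only moves on to method "2" if every device in the first level fails; the exact attribute names should be checked against the fence agent man pages for the installed release.]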
----- Original Message ---- From: Ken Johnson To: linux clustering Sent: Sunday, September 10, 2006 9:42:16 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Sun, 10 Sep 2006 at 21:14 -0700, Rick Rodgers wrote: > Yes I was, but if the power controller is down (unreachable) > and the system (node) is hung how can these fence anything? > By pulling the plug you loose both and you can not be sure of anything > since you can not successfully issue a power cycle command. I'm not sure I understand what you mean by "if the power controller is down". These systems can be configured with redundant power supplies and if both power supplies fail then there's not anything you can do to fence a system. > Thanks for your input though. sure, np - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Mon Sep 11 07:16:51 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Mon, 11 Sep 2006 09:16:51 +0200 Subject: [Linux-cluster] CS4 Update 2 & Patch watchdog on rgmanager process Message-ID: <45050D63.2040507@bull.net> Hi I tried to apply the watchdog path on CS4 U2 , which normally should launch a reboot if the process clurmgrd disappears for any reason, but it seems not to work on Update 2 ... We have now two clurmgrd processes launched at rgmanager start, and I tried to kill it about 10 times, but it leads to a reboot of the node only once. Any idea ? Which is exactly the expected behavior with the watchdog patch ? Thanks Alain Moull? From jos at xos.nl Mon Sep 11 07:59:44 2006 From: jos at xos.nl (Jos Vos) Date: Mon, 11 Sep 2006 09:59:44 +0200 Subject: [Linux-cluster] GFS and (missing) filesystem labels Message-ID: <200609110759.k8B7xij01034@xos037.xos.nl> Hi, It seems that you can not add a filesystem label to a GFS filesystem. Especially when using iSCSI, it would be handy to have a method to be sure that you mount the right SCSI device, in case the device name has changed due to a failure of another (i)SCSI device. Is there a good solution for this? Thanks, -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From riaan at obsidian.co.za Mon Sep 11 09:19:21 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Mon, 11 Sep 2006 11:19:21 +0200 Subject: [Linux-cluster] GFS and (missing) filesystem labels In-Reply-To: <200609110759.k8B7xij01034@xos037.xos.nl> References: <200609110759.k8B7xij01034@xos037.xos.nl> Message-ID: <45052A19.1000804@obsidian.co.za> Jos Vos wrote: > Hi, > > It seems that you can not add a filesystem label to a GFS filesystem. > > Especially when using iSCSI, it would be handy to have a method to be > sure that you mount the right SCSI device, in case the device name > has changed due to a failure of another (i)SCSI device. > > Is there a good solution for this? > > Thanks, > hi Jos a) use LVM. it does not care what the underlying physical volume names are, it will do the "right thing" w.r.t. volume groups and logical volumes names b) your multipathing solution (e.g. EMC PowerPath with its persistent mapping functionality of paths to for example /dev/emcpowera1) might also solve this problem, if you want to avoid using LVM. 
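[Aside: to make (a) and (b) a bit more concrete - LVM identifies physical volumes by the UUID in their metadata rather than by the /dev/sd* name, so the /dev/<vg>/<lv> path stays the same across reboots no matter which SCSI name the LUN comes up with. A rough sketch, with example volume names:

    # the VG/LV path is stable even if the PV moved from /dev/sdb to /dev/sdc
    pvs -o pv_name,pv_uuid,vg_name
    lvdisplay /dev/san_vg/gfs01

    # with EMC PowerPath it is common to point LVM at the emcpower devices
    # only, in /etc/lvm/lvm.conf, so each underlying path is not scanned twice:
    filter = [ "a|^/dev/emcpower.*|", "r|^/dev/sd.*|" ]

The filter line is an illustration rather than a tested PowerPath configuration - check it against your multipath documentation.]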
note - on SANs with multiple paths to the same LUN/partition, using labels to mount does not work (you will get an error message about duplicate labels), which is probably why the functionality is not there to begin with, and probably will not be either. HTH Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From peter.huesser at psi.ch Mon Sep 11 09:54:46 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Mon, 11 Sep 2006 11:54:46 +0200 Subject: [Linux-cluster] Immediate shutdown after getting quorate of two node cluster Message-ID: <8E2924888511274B95014C2DD906E58AD19EDE@MAILBOX0A.psi.ch> Hello I try to run a two node cluster. Starting ccsd on both servers is no problem. But if I try to start cman I get the following lines in my "/var/log/messages" file: Sep 11 11:44:26 server01 ccsd[24972]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Sep 11 11:44:26 server01 ccsd[24972]: Initial status:: Inquorate Sep 11 11:44:57 server01 ccsd[24972]: Cluster is quorate. Allowing connections. Sep 11 11:44:57 server01 ccsd[24972]: Cluster manager shutdown. Attemping to reconnect... Sep 11 11:44:58 server01 ccsd[24972]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Sep 11 11:44:58 server01 ccsd[24972]: Initial status:: Inquorate Sep 11 11:45:29 server01 ccsd[24972]: Cluster is quorate. Allowing connections. Sep 11 11:45:29 server01 ccsd[24972]: Cluster manager shutdown. Attemping to reconnect... ... Why is the daemon shutdown after getting quorated ? Any ideas ? Thanks' Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From sara_sodagar at yahoo.com Mon Sep 11 11:26:31 2006 From: sara_sodagar at yahoo.com (sara sodagar) Date: Mon, 11 Sep 2006 04:26:31 -0700 (PDT) Subject: [Linux-cluster] Question about using Lock manager Message-ID: <20060911112631.98356.qmail@web31801.mail.mud.yahoo.com> hi I am new in GFS concept and planning to use RHEL4 GFS to immplement clustering.My SAN is HDS 9585 and I have 4 HS-20 web servers and 2 IBM HS20 ftp servers. My question is about the place of lock amanger in this configuration. Should I set up lock manager on a separate host or would it be possible to have a node with both roles of lock manager and apache ? please let me know the impact of having a node with both roles ? If I should set up Lock manager and its RLM on different nodes please let me know the best configuration . I would be greatful if any one can help me regarding this matter. --Best regards. Sara __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From lists at brimer.org Mon Sep 11 12:47:14 2006 From: lists at brimer.org (Barry Brimer) Date: Mon, 11 Sep 2006 07:47:14 -0500 (CDT) Subject: [Linux-cluster] Question about using Lock manager In-Reply-To: <20060911112631.98356.qmail@web31801.mail.mud.yahoo.com> References: <20060911112631.98356.qmail@web31801.mail.mud.yahoo.com> Message-ID: > hi > I am new in GFS concept and planning to use RHEL4 GFS > to immplement clustering.My SAN is HDS 9585 and I have > 4 HS-20 web servers and 2 IBM HS20 ftp servers. > My question is about the place of lock amanger in this > configuration. > Should I set up lock manager on a separate host or > would it be possible to have a node with both roles of > lock manager and apache ? please let me know the > impact of having a node with both roles ? 
> If I should set up Lock manager and its RLM on > different nodes please let me know the best > configuration . I would recommend using lock_dlm. With lock_dlm, each node manages locks for the files it uses. Hope this helps. Barry From lhh at redhat.com Mon Sep 11 13:55:19 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 09:55:19 -0400 Subject: [Linux-cluster] Issue stonith commands to the failing node twice?? In-Reply-To: <20060907230728.27221.qmail@web34207.mail.mud.yahoo.com> References: <20060907230728.27221.qmail@web34207.mail.mud.yahoo.com> Message-ID: <1157982919.3610.274.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-07 at 16:07 -0700, Rick Rodgers wrote: > I am using an older version of clumanger (about 2 yrs old) and I > notice > that when the active node goes down the back will actually issue > stonith commands twice. They are about 60 seconds apart. Does this > happen to anyone else?? It's "normal" if you're using the disk tiebreaker. That is, it's been around for so long that people are used to it ;) Basically, both membership transitions and quorum disk transitions are causing full recovery (including STONITH). However, only one should cause a STONITH event -- the one that happens last. There is a switch which should fix it in 1.2.34, but it has to be enabled manually ('cludb -p cluquorumd%disk_quorum 1'). -- Lon -------------- next part -------------- A non-text attachment was scrubbed... Name: clumanager-1.2.31-179363.patch Type: text/x-patch Size: 5553 bytes Desc: not available URL: From lhh at redhat.com Mon Sep 11 13:58:40 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 09:58:40 -0400 Subject: [Linux-cluster] Odd/Even Nodes for RHCS/GFS In-Reply-To: References: Message-ID: <1157983120.3610.279.camel@rei.boston.devel.redhat.com> On Fri, 2006-09-08 at 12:14 -0500, Frazier, Darrell USA CRC (Contractor) wrote: > Hello, > > > > I have heard that RHCS may have an issue with odd number of nodes vs > even number of nodes. Has anyone heard of this? Thanx. The only special considerations are with two-node clusters, because there is no easy way to declare a majority in two node clusters. So, there has to be a way to decide which node is "alive" and which one is "dead" in the case of a network partition. There are several ways to do this. Otherwise, even vs. odd should not matter. If there are any issues WRT even vs. odd, it's probably a bug. -- Lon From lhh at redhat.com Mon Sep 11 14:07:27 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 10:07:27 -0400 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> Message-ID: <1157983647.3610.287.camel@rei.boston.devel.redhat.com> On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: > Does anyone know of a good solution to providing good failover > for somthing like a Dell 1850? The issue here is that the power > souce plug in the back provides power for both the internal power > controller > and the node itself. So if you pull the cord it will not failover > because > it can not Stonith the failed node (power controller is down also). Generally, you can't handle this without external fencing. 
https://www.redhat.com/archives/linux-cluster/2006-September/msg00026.html -- Lon From jparsons at redhat.com Mon Sep 11 14:40:12 2006 From: jparsons at redhat.com (James Parsons) Date: Mon, 11 Sep 2006 10:40:12 -0400 Subject: [Linux-cluster] system-config-cluster problem In-Reply-To: References: Message-ID: <4505754C.5020806@redhat.com> Matteo Catanese wrote: > I've setup a cluster some month ago. > > Cluster is working , but still not in production. > > Today, after summer break, i did all the updates for my rhat and CS > > First i disabled all services, then i patched one machine and > rebooted, then the other one and rebooted. > > > Cluster works perfectly: > > > [root at lvzbe1 ~]# uname -a > Linux lvzbe1.lavazza.it 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 > EDT 2006 i686 i686 i386 GNU/Linux > [root at lvzbe1 ~]# clustat -v > clustat version 1.9.53 > Connected via: CMAN/SM Plugin v1.1.7.1 > [root at lvzbe1 ~]# clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > lvzbe1 Online, Local, rgmanager > lvzbe2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > oracle lvzbe1 started > [root at lvzbe1 ~]# > > > But when i try to run system-config-cluster,it pops out: > > Poorly Formed XML error > A problem was encoutered while reading configuration file /etc/ > cluster/clluster.conf. > Details or the error appear below. Click the "New" button to create a > new configuration file. > To continue anyway(Not Recommended!), click the "ok" button. > > > Relax-NG validity error : Extra element rm in interleave > /etc/cluster/cluster.conf:35: element rm: Relax-NG validity error : > Element cluster failed to validate content > /etc/cluster/cluster.conf fails to validate > Hi Matteo, Here is why the conf file is failing validation: In your conf lines specifying your two FS's, you have an fstype attribute but no fsid attribute. I spoke with Lon, who is the Grand Resource Guru, and he says that the two should be exclusive, that is, an fsid should not be necessary just because you are specifying an fstype. So this is a bug in the relaxNG schema validation file. A fix for this will be in the next update, and until then, using the conf file that you attached, please just disregard the warning message. For completeness sake, I am attaching a fixed version of the relaxNG file that you can drop into /usr/share/system-config-cluster/misc, if you want. Thanks for finding this issue. -Jim -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cluster.ng URL: From riaan at obsidian.co.za Mon Sep 11 15:31:14 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Mon, 11 Sep 2006 17:31:14 +0200 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1157983647.3610.287.camel@rei.boston.devel.redhat.com> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> <1157983647.3610.287.camel@rei.boston.devel.redhat.com> Message-ID: <45058142.3040901@obsidian.co.za> Lon Hohberger wrote: > On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: >> Does anyone know of a good solution to providing good failover >> for somthing like a Dell 1850? The issue here is that the power >> souce plug in the back provides power for both the internal power >> controller >> and the node itself. So if you pull the cord it will not failover >> because >> it can not Stonith the failed node (power controller is down also). 
> > Generally, you can't handle this without external fencing. > > https://www.redhat.com/archives/linux-cluster/2006-September/msg00026.html > > -- Lon > Lon - having reread that previous posting of yours, and esp the last paragraph: +++ (c) ... If a host does a "graceful shutdown" when you fence it via IPMI, you need to disable ACPI on that host (e.g. boot with acpi=off). The server should turn off immediately (or within 4-5 seconds, like when holding an ATX power button in to force a machine off). ++++ Just so I am absolutely sure about this: Is the above the only scenario when would have to disable ACPI? e.g. a graceful shutdown is easy to spot. If I don't see one in the logs, that means I can leave ACPI on? Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From lhh at redhat.com Mon Sep 11 15:32:18 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 11:32:18 -0400 Subject: [Linux-cluster] 2-node fencing question (IPMI/ACPI question) In-Reply-To: <1157456622.4378.7.camel@belmont.site> References: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> <1154633146.28677.70.camel@ayanami.boston.redhat.com> <1157456622.4378.7.camel@belmont.site> Message-ID: <1157988738.3610.367.camel@rei.boston.devel.redhat.com> On Tue, 2006-09-05 at 07:43 -0400, danwest wrote: > What happens if the servers you are using require ACPI=on in order to > boot. For instance IBM X366 servers need ACPI set in order to boot. > With ACPI=on both nodes reboot when a fence occurs(see "both nodes off > problem" in thread below). This is not desirable, especially with > active/active clusters. Hopefully, the X366 either turns off immediately or can be configured to do so upon getting the "power off" command with ACPI enabled. If it does not, then you will need remote power control or fabric-level fencing. Here is some relevant background information. If you look at the IPMI v1.5 and v2 specifications, the instruction 0 for power control is supposed force the system to S4/S5 (soft-off) state immediately (for use in emergency situations). If you then look at the ipmitool source code, you will find that it uses the 0 instruction when you do a 'chassis power off' command. (quote, source = http://www.intel.com/design/servers/ipmi/pdf/IPMIv2_0_rev1_0_E3_markup.pdf - page 403): [3:0] - chassis control 0h = power down. Force system into soft off (S4/S45) state. This is for `emergency' management power down actions. The command does not initiate a clean shut-down of the operating system prior to powering down the system. (/quote) The reason linux-cluster often needs ACPI disabled with IPMI is because in many cases, machines which receive this "emergency power off" instruction do not appear to operate as what is stated in the IPMI specification. That is, some do a full, complete, clean shutdown when ACPI is enabled. If the shutdown never completes, fencing will never complete and the cluster will never recover. Now, not all machines behave this way. If your machine powers off immediately with ACPI enabled, then you do not need to disable ACPI. (Note: cheating by switching the acpid event for power button presses to /sbin/poweroff -fn does *not* count!) It is possible that some machines are - quite simply - twiddling the motherboard's soft power button. 
In that case, it is possible that those machines can also be configured to do an immediate-off in the BIOS when the power button is pressed, thereby alleviating the need for booting with ACPI disabled. There may be other ways to work around the ACPI/IPMI problem on your specific hardware; this is just an example. Booting with ACPI disabled is the general "quick fix", which works immediately for the majority of machines with IPMI - and does not require hardware-specific configuration. Booting with ACPI disabled also works for other types of integrated power management (iLO, RSA, DRAC, etc.) which often suffer the same problems. As noted by others in separate emails to this list, it would be nice if we could use the reboot operations more often - rather than "off, on" cycles in all cases. Most fencing solutions can not (as far as I know) confirm that a machine has rebooted the way it can confirm that a machine is "off" or "on". Of course, "reboot" does not suffer the theoretical "everyone off at once" problem, and it should eliminate the need boot with ACPI disabled. -- Lon From rodgersr at yahoo.com Mon Sep 11 15:52:37 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Mon, 11 Sep 2006 08:52:37 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <45058142.3040901@obsidian.co.za> Message-ID: <20060911155237.47711.qmail@web34215.mail.mud.yahoo.com> Graceful shutdown? The question I also have is: In a two node cluster when you shoutdown (shutdown/reboot command) the active node should this cause a failover? ----- Original Message ---- From: Riaan van Niekerk To: linux clustering Sent: Monday, September 11, 2006 8:31:14 AM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas Lon Hohberger wrote: > On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: >> Does anyone know of a good solution to providing good failover >> for somthing like a Dell 1850? The issue here is that the power >> souce plug in the back provides power for both the internal power >> controller >> and the node itself. So if you pull the cord it will not failover >> because >> it can not Stonith the failed node (power controller is down also). > > Generally, you can't handle this without external fencing. > > https://www.redhat.com/archives/linux-cluster/2006-September/msg00026.html > > -- Lon > Lon - having reread that previous posting of yours, and esp the last paragraph: +++ (c) ... If a host does a "graceful shutdown" when you fence it via IPMI, you need to disable ACPI on that host (e.g. boot with acpi=off). The server should turn off immediately (or within 4-5 seconds, like when holding an ATX power button in to force a machine off). ++++ Just so I am absolutely sure about this: Is the above the only scenario when would have to disable ACPI? e.g. a graceful shutdown is easy to spot. If I don't see one in the logs, that means I can leave ACPI on? Riaan begin:vcard fn:Riaan van Niekerk n:van Niekerk;Riaan org:Obsidian Systems;Obsidian Red Hat Consulting email;internet:riaan at obsidian.co.za title:Systems Architect tel;work:+27 11 792 6500 tel;fax:+27 11 792 6522 tel;cell:+27 82 921 8768 x-mozilla-html:FALSE url:http://www.obsidian.co.za version:2.1 end:vcard -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rodgersr at yahoo.com Mon Sep 11 16:24:50 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Mon, 11 Sep 2006 09:24:50 -0700 (PDT) Subject: [Linux-cluster] Clumanger reboots to same node Message-ID: <20060911162450.93432.qmail@web34204.mail.mud.yahoo.com> Somtimes during testing when you use the powerctroller to reboot the active node, clumanger will not fail over but instead restart the services on the same node. Has anyone seen this? -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Mon Sep 11 16:33:20 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 12:33:20 -0400 Subject: [Linux-cluster] Issue stonith commands to the failing node twice?? In-Reply-To: <20060911161153.82338.qmail@web34201.mail.mud.yahoo.com> References: <20060911161153.82338.qmail@web34201.mail.mud.yahoo.com> Message-ID: <1157992400.3610.393.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-11 at 09:11 -0700, Rick Rodgers wrote: > Thanks. > I woas wondering if Clumanager can work with dual power controllers? > So if one controller goes down and it needs to shoot the node it can > use the other > controller to shoot the node. If so how does that get configured into > clumanager? Clumanager 1.2.x's use of multiple power controllers is basically "all or nothing". That is, if you have two power controllers listed, both must succeed or STONITH fails. There is no equivalent (in clumanager 1.2.x) to RHCS4's / RHGFS6.0's / RHGFS6.1's "fence level" construct, which allows you to configure backup fencing. Each fence level is tried in sequence (each fence level may have one or more devices to try). The first level which fully succeeds ends the fencing operation (successfully). If no level succeeds, fencing fails (and is retried on RHCS4). -- Lon From venilton.junior at sercompe.com.br Mon Sep 11 19:29:16 2006 From: venilton.junior at sercompe.com.br (Venilton Junior) Date: Mon, 11 Sep 2006 16:29:16 -0300 Subject: [Linux-cluster] GFS questions Message-ID: Hi, I'm wondering if I could deploy a cluster solution with 3 nodes accessing the same storage area without using GFS. Are there any other solutions that allow me to access the same file system without using GFS? I have 3 nodes running RHAS4 and they're accessing an EVA4000. I'd like to run a Huge SMTP server on this infrastructure and I'm seeking for all possibilities to do that. Does anyone have an idea? Best regards, Venilton C. Junior -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Mon Sep 11 20:32:34 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 16:32:34 -0400 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <45058142.3040901@obsidian.co.za> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> <1157983647.3610.287.camel@rei.boston.devel.redhat.com> <45058142.3040901@obsidian.co.za> Message-ID: <1158006754.3610.406.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-11 at 17:31 +0200, Riaan van Niekerk wrote: > Lon Hohberger wrote: > > On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: > >> Does anyone know of a good solution to providing good failover > >> for somthing like a Dell 1850? The issue here is that the power > >> souce plug in the back provides power for both the internal power > >> controller > >> and the node itself. 
So if you pull the cord it will not failover > >> because > >> it can not Stonith the failed node (power controller is down also). > > > > Generally, you can't handle this without external fencing. > > > > https://www.redhat.com/archives/linux-cluster/2006-September/msg00026.html > > > > -- Lon > > > > Lon - having reread that previous posting of yours, and esp the last > paragraph: > > +++ > (c) ... If a host does a "graceful shutdown" when > you fence it via IPMI, you need to disable ACPI on that host (e.g. boot > with acpi=off). The server should turn off immediately (or within 4-5 > seconds, like when holding an ATX power button in to force a machine > off). > ++++ > > Just so I am absolutely sure about this: Is the above the only scenario > when would have to disable ACPI? e.g. a graceful shutdown is easy to > spot. If I don't see one in the logs, that means I can leave ACPI on? Basically, yes. If you want to be sure, watch the machine's console while you perform a power off using the integrated power management. If the machine shuts off immediately (while ACPI is enabled) then leaving it enabled should not cause any problems with the cluster. Note: Setting acpid to do /sbin/poweroff or its likeness does not count as an "instant off"... Don't cheat :) -- Lon From lhh at redhat.com Mon Sep 11 20:33:27 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 16:33:27 -0400 Subject: [Linux-cluster] Immediate shutdown after getting quorate of two node cluster In-Reply-To: <8E2924888511274B95014C2DD906E58AD19EDE@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD19EDE@MAILBOX0A.psi.ch> Message-ID: <1158006807.3610.408.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-11 at 11:54 +0200, Huesser Peter wrote: > Hello > > > > I try to run a two node cluster. Starting ccsd on both servers is no > problem. But if I try to start cman I get the following lines in my > ?/var/log/messages? file: > > > > Sep 11 11:44:26 server01 ccsd[24972]: Connected to cluster > infrastruture via: CMAN/SM Plugin v1.1.5 > > Sep 11 11:44:26 server01 ccsd[24972]: Initial status:: Inquorate > > Sep 11 11:44:57 server01 ccsd[24972]: Cluster is quorate. Allowing > connections. > > Sep 11 11:44:57 server01 ccsd[24972]: Cluster manager shutdown. > Attemping to reconnect... > > Sep 11 11:44:58 server01 ccsd[24972]: Connected to cluster > infrastruture via: CMAN/SM Plugin v1.1.5 > > Sep 11 11:44:58 server01 ccsd[24972]: Initial status:: Inquorate > > Sep 11 11:45:29 server01 ccsd[24972]: Cluster is quorate. Allowing > connections. > > Sep 11 11:45:29 server01 ccsd[24972]: Cluster manager shutdown. > Attemping to reconnect... > > ? > > > > Why is the daemon shutdown after getting quorated ? Any ideas ? What does the dmesg output look like? -- Lon From lhh at redhat.com Mon Sep 11 21:10:19 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 17:10:19 -0400 Subject: [Linux-cluster] CS4 Update 2 & Patch watchdog on rgmanager process In-Reply-To: <45050D63.2040507@bull.net> References: <45050D63.2040507@bull.net> Message-ID: <1158009019.3610.418.camel@rei.boston.devel.redhat.com> Hi, The self-watchdog patch adds a process which monitors the "real" clurgmgrd. The monitoring process should be the lower-numbered PID (it's the parent of the one doing the work). The monitoring process watches for crash signals (SIGBUS, SIGSEGV, etc.), and will simply exit if you kill the child with SIGKILL. 
So, basically, killing the higher-numbered PID with something like SIGSEGV should cause the node to reboot. -- Lon From rico_tsang at macroview.com Tue Sep 12 03:09:31 2006 From: rico_tsang at macroview.com (Rico Tsang) Date: Tue, 12 Sep 2006 11:09:31 +0800 Subject: [Linux-cluster] GFS questions Message-ID: <61E6BBD96354E1419428314BA80EA8B9750A2D@exchsvr.macroview.com> Dear Venilton, You may want to take a look at the list of shared file systems in Wiki: http://en.wikipedia.org/wiki/List_of_file_systems#Shared_disk_file_syste ms I think that IBM GPFS or Polyserve are some of the well-known SAN file systems that you can check. Regards, Rico _____ From: Venilton Junior [mailto:venilton.junior at sercompe.com.br] Sent: Tuesday, September 12, 2006 3:29 AM To: linux-cluster at redhat.com Subject: [Linux-cluster] GFS questions Hi, I'm wondering if I could deploy a cluster solution with 3 nodes accessing the same storage area without using GFS. Are there any other solutions that allow me to access the same file system without using GFS? I have 3 nodes running RHAS4 and they're accessing an EVA4000. I'd like to run a Huge SMTP server on this infrastructure and I'm seeking for all possibilities to do that. Does anyone have an idea? Best regards, Venilton C. Junior -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Tue Sep 12 05:10:49 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Tue, 12 Sep 2006 07:10:49 +0200 Subject: [Linux-cluster] Immediate shutdown after getting quorate oftwo node cluster In-Reply-To: <1158006807.3610.408.camel@rei.boston.devel.redhat.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19F1A@MAILBOX0A.psi.ch> > What does the dmesg output look like? It looks the following CMAN: forming a new cluster CMAN: quorum regained, resuming activity CMAN: sendmsg failed: -13 CMAN: No functional network interfaces, leaving cluster CMAN: sendmsg failed: -13 CMAN: we are leaving the cluster. CMAN: Waiting to join or form a Linux-cluster CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: forming a new cluster CMAN: quorum regained, resuming activity CMAN: sendmsg failed: -13 CMAN: No functional network interfaces, leaving cluster CMAN: sendmsg failed: -13 CMAN: we are leaving the cluster. CMAN: Waiting to join or form a Linux-cluster CMAN: sendmsg failed: -13 Can't interpret the "No functional network interface". No firewall is running on the system. /etc/hosts.{allow,deny} makes no restriction. The /etc/hosts file is set up correctly. Pedro From dan.hawker at astrium.eads.net Tue Sep 12 08:25:33 2006 From: dan.hawker at astrium.eads.net (HAWKER, Dan) Date: Tue, 12 Sep 2006 09:25:33 +0100 Subject: [Linux-cluster] CLVMD - Do I need it??? Message-ID: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> Hi All, Have an EMC SAN unit on the way. I plan to use it as the central store for a couple of servers setup as a cluster, using GFS. As the SAN unit can handle all of its own Logical Volume management natively, I presume I don't have to use/implement CLVMD and hence can cut one layer of complexity in the disk structure away. Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its configuration??? TIA Dan -- Dan Hawker Linux System Administrator Astrium -- This email is for the intended addressee only. If you have received it in error then you must not use, retain, disseminate or otherwise deal with it. 
Please notify the sender by return email. The views of the author may not necessarily constitute the views of Astrium Limited. Nothing in this email shall bind Astrium Limited in any contract or obligation. Astrium Limited, Registered in England and Wales No. 2449259 Registered Office: Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England From m.catanese at kinetikon.com Tue Sep 12 08:45:30 2006 From: m.catanese at kinetikon.com (Matteo Catanese) Date: Tue, 12 Sep 2006 10:45:30 +0200 Subject: [Linux-cluster] system-config-cluster problem Message-ID: <6B5FFF19-58EF-42B0-81E1-98D280314168@kinetikon.com> Thx a lot James and Lon, i feel more relaxed now :-) I will disregard that warning message and wait until next patch. Ciao Matteo From pcaulfie at redhat.com Tue Sep 12 09:33:24 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 12 Sep 2006 10:33:24 +0100 Subject: [Linux-cluster] Two node cluster: node cannot connect to cluster infrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> Message-ID: <45067EE4.6010503@redhat.com> Huesser Peter wrote: > Hello > > > > I try to make a two node cluster run. Unfortunately if I run the > ?/etc/init.d/cman start? command I get in ?/var/log/messages? entries like: > > > > Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 > > Initial status:: Inquorate > > Cluster manager shutdown. Attemping to reconnect... > > Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 > > Initial status:: Inquorate > > Cluster manager shutdown. Attemping to reconnect... > > > > ?dmesg? shows: > > > > CMAN: sendmsg failed: -13 > That's a kernel/userspace mismatch. Upgrade the userspace cman tools. -- patrick From pcaulfie at redhat.com Tue Sep 12 09:37:58 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 12 Sep 2006 10:37:58 +0100 Subject: [Linux-cluster] Immediate shutdown after getting quorate oftwo node cluster In-Reply-To: <8E2924888511274B95014C2DD906E58AD19F1A@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD19F1A@MAILBOX0A.psi.ch> Message-ID: <45067FF6.1010605@redhat.com> Huesser Peter wrote: >> What does the dmesg output look like? > > It looks the following > > CMAN: forming a new cluster > CMAN: quorum regained, resuming activity > CMAN: sendmsg failed: -13 > CMAN: No functional network interfaces, leaving cluster > CMAN: sendmsg failed: -13 > CMAN: we are leaving the cluster. > CMAN: Waiting to join or form a Linux-cluster > CMAN: sendmsg failed: -13 > CMAN: sendmsg failed: -13 > CMAN: sendmsg failed: -13 > CMAN: sendmsg failed: -13 > CMAN: sendmsg failed: -13 That's a kernel/userspace mismatch Upgrade the cman user tools. (I'm going to put that text on a macro key!) -- patrick From riaan at obsidian.co.za Tue Sep 12 10:10:13 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Tue, 12 Sep 2006 12:10:13 +0200 Subject: [Linux-cluster] post_fail_delay versus deadnode_timeout Message-ID: <45068785.9070404@obsidian.co.za> hi We are trying to capture diskdumps when a lock_dlm kernel panic happens and need to increase either post_fail_delay or deadnode_timeout to prevent the dumping node from being fenced. Is there any advantages or disadvantages to using either? Which is recommended? 
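[Aside: the two knobs live in different places and do different things. deadnode_timeout is how long cman waits for heartbeats before declaring a node dead; post_fail_delay is how long fenced then waits after the failure before actually fencing. A sketch of where each is set on RHEL4-era clusters (the values are examples, and the /proc path may differ between releases):

    # cluster.conf - wait 2 minutes after a node is declared dead before fencing it
    <fence_daemon post_fail_delay="120" post_join_delay="3"/>

    # cman runtime tunable - give a silent node longer before it is declared dead
    echo 120 > /proc/cluster/config/cman/deadnode_timeout
    cat /proc/cluster/config/cman/deadnode_timeout

Raising deadnode_timeout delays failure detection for every kind of failure, while post_fail_delay only postpones the fencing step once a failure has already been detected.]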
post_fail_delay and diskdump has come up previously, with some good answers from David http://www.redhat.com/archives/linux-cluster/2006-June/msg00037.html note: for capturing a "sysrq t", we manually increase deadnode_timeout, and decrease it back again, but don't have this luxury with a kernel panic (which can happen at any time). Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From lhh at redhat.com Tue Sep 12 14:04:19 2006 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 12 Sep 2006 10:04:19 -0400 Subject: [Linux-cluster] CLVMD - Do I need it??? In-Reply-To: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> References: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> Message-ID: <1158069859.3610.437.camel@rei.boston.devel.redhat.com> On Tue, 2006-09-12 at 09:25 +0100, HAWKER, Dan wrote: > > Hi All, > > Have an EMC SAN unit on the way. I plan to use it as the central store for a > couple of servers setup as a cluster, using GFS. As the SAN unit can handle > all of its own Logical Volume management natively, I presume I don't have to > use/implement CLVMD and hence can cut one layer of complexity in the disk > structure away. > Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its > configuration??? You don't need CLVM if you intend to use the internal array tools, but it's a "nice to have" thing. After all, we've had GFS (and simple failover, for that matter) for a few years -- while CLVM is a relatively new technology. Some SANs can do this internally too, of course. For example, if you had CLVM and you add another array, I'm pretty sure you could use CLVM to extend an existing logical volume on to the second array while the cluster is running. -- Lon From lists at brimer.org Tue Sep 12 14:04:38 2006 From: lists at brimer.org (Barry Brimer) Date: Tue, 12 Sep 2006 09:04:38 -0500 (CDT) Subject: [Linux-cluster] CLVMD - Do I need it??? In-Reply-To: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> References: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> Message-ID: > Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its > configuration??? My understanding is that you will continue to need clvmd. From peter.huesser at psi.ch Tue Sep 12 14:20:30 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Tue, 12 Sep 2006 16:20:30 +0200 Subject: [Linux-cluster] Immediate shutdown after getting quorate oftwonode cluster In-Reply-To: <45067FF6.1010605@redhat.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19F7C@MAILBOX0A.psi.ch> > > > That's a kernel/userspace mismatch > > Upgrade the cman user tools. > Thanks' and sorry if you had to repeat some stuff again. In fact I had the newest versions installed. What I did now was to recompile all the packages for the clustersuite and install these packages. After this the "quorated" problem was solved. One node was now a member of the cluster but I still could not get the other one to be a member. After reboot of both systems both nodes were clustermembers so this works now (I got another message from Jari who told me that something could be wrong with my fence domain and I should reboot it). At the moment it looks much better than this morning. Something with my fencing is not correctly set up and services are not correctly working. 
Maybe I have to contact the mailing list later but for the moment thanks' to all who gave an answer. Pedro From danwest at comcast.net Tue Sep 12 16:12:30 2006 From: danwest at comcast.net (danwest at comcast.net) Date: Tue, 12 Sep 2006 16:12:30 +0000 Subject: [Linux-cluster] qdiskd not properly failing nodes?? Message-ID: <091220061612.9370.4506DC6E0006D5FF0000249A22007481849B9C0A99020E0B@comcast.net> Below is the qdisk configuration for a simple 2 node cluster with a webserver services. The service is configured with 3 heuristics below. # cat /tmp/qdisk_status Node ID: 1 Score (current / min req. / max allowed): 4 / 2 / 4 Current state: Master Current disk state: None Visible Set: { 1 2 } Master Node ID: 1 Quorate Set: { 1 2 } Causing the last 2 heuristics to fail causes the score to fall below ? and in theory should reboot the node. So far I get confirmation in /var/log/messages but no actual reboot ( See below ). The service (webserver) also remains on the node that dropped below ?. # cat /tmp/qdisk_status Node ID: 1 Score (current / min req. / max allowed): 1 / 2 / 4 Current state: None Current disk state: None Visible Set: { 1 2 } Master Node ID: 2 Quorate Set: { } /var/log/messages Sep 12 11:34:02 SERVER1 qdiskd[7495]: Score insufficient for master operation (1/2; max=4); downgrading Sep 12 11:34:04 SERVER1 qdiskd[7495]: Node 2 is the master Sep 12 11:34:02 SERVER2 qdiskd[9780]: Node 1 shutdown Sep 12 11:34:02 SERVER2 qdiskd[9780]: Making bid for master Sep 12 11:34:03 SERVER2 qdiskd[9780]: Assuming master role Any idea why the server is not getting rebooted/fenced? Thanks, Dan From jbrassow at redhat.com Tue Sep 12 17:41:11 2006 From: jbrassow at redhat.com (Jonathan Brassow) Date: Tue, 12 Sep 2006 12:41:11 -0500 Subject: [Linux-cluster] CLVMD - Do I need it??? In-Reply-To: <1158069859.3610.437.camel@rei.boston.devel.redhat.com> References: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> <1158069859.3610.437.camel@rei.boston.devel.redhat.com> Message-ID: <1158082871.988.4.camel@hydrogen.msp.redhat.com> On Tue, 2006-09-12 at 10:04 -0400, Lon Hohberger wrote: > On Tue, 2006-09-12 at 09:25 +0100, HAWKER, Dan wrote: > > > > Hi All, > > > > Have an EMC SAN unit on the way. I plan to use it as the central store for a > > couple of servers setup as a cluster, using GFS. As the SAN unit can handle > > all of its own Logical Volume management natively, I presume I don't have to > > use/implement CLVMD and hence can cut one layer of complexity in the disk > > structure away. > > > Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its > > configuration??? > > You don't need CLVM if you intend to use the internal array tools, but > it's a "nice to have" thing. After all, we've had GFS (and simple > failover, for that matter) for a few years -- while CLVM is a relatively > new technology. Some SANs can do this internally too, of course. > > For example, if you had CLVM and you add another array, I'm pretty sure > you could use CLVM to extend an existing logical volume on to the second > array while the cluster is running. I think one of the big things is naming - ensuring that the device name is always the same on all nodes in the cluster - regardless of any devices added/changed/removed. If you can do that, in addition to storage management, then there is probably no need to involve LVM (cluster or not). If you plan to use LVM on top of the storage device, then you must use clvmd. 
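For anyone wondering what "you must use clvmd" means in practice, a minimal sketch for RHEL4-era lvm2-cluster follows; the volume group name is a placeholder and the exact service names may differ elsewhere:

    # /etc/lvm/lvm.conf on every node: switch LVM to cluster-wide locking
    locking_type = 3

    # start the cluster LVM daemon once cman and fenced are up
    service clvmd start

    # mark an existing volume group as clustered (vg_shared is a placeholder)
    vgchange -c y vg_shared

With that in place, an lvcreate or lvextend issued on one node is propagated to the others, which is what keeps the device names consistent across the cluster.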
brassow From DylanV at semaphore.com Wed Sep 13 04:44:27 2006 From: DylanV at semaphore.com (Dylan Vanderhoof) Date: Tue, 12 Sep 2006 21:44:27 -0700 Subject: [Linux-cluster] Some newbie questions Message-ID: I'm getting ready to start using GFS for a project at my company and believe I have a sane migration path, but I wanted to ask for a sanity check from people who are using it first. The eventual architecture will be multiple iSCSI targets as part of a single GFS filesystem using CLVM, primarily so adding more disk is fairly seamless, and if I understand correctly, can be done without any downtime. (Is there downtime required for the fs grow step?) This also will allow multipath io for some extra redundancy in the future. (Obviously, the iSCSI targets are SPOFs, but that's unavoidable). This points me towards using DLM, of course, but in the initial install I only have a single node and will be adding other nodes in the fairly near future. Can I transition from nolock to using DLM? I would assume so, but I haven't seen anything indicating how that would be done. Other than those couple questions, I believe everything to be fairly straightforward. Looking forward to trying GFS out! Thanks, Dylan Vanderhoof Sr. Software Developer Semaphore Corporation From jos at xos.nl Wed Sep 13 06:34:52 2006 From: jos at xos.nl (Jos Vos) Date: Wed, 13 Sep 2006 08:34:52 +0200 Subject: [Linux-cluster] Some newbie questions In-Reply-To: ; from DylanV@semaphore.com on Tue, Sep 12, 2006 at 09:44:27PM -0700 References: Message-ID: <20060913083452.B14844@xos037.xos.nl> On Tue, Sep 12, 2006 at 09:44:27PM -0700, Dylan Vanderhoof wrote: > This points me towards using DLM, of course, but in the initial install > I only have a single node and will be adding other nodes in the fairly > near future. Can I transition from nolock to using DLM? I would assume > so, but I haven't seen anything indicating how that would be done. Yes, this can be done (on an unmounted fs) using: gfs_tool sb proto lock_dlm Note that you better should add enough journals to the filesystem when creating it. You can add journals later, but only if there is (enough) space left on the device after the filesystem, which is normally not the case (if your filesystem occupies the whole device). -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From DylanV at semaphore.com Wed Sep 13 06:57:04 2006 From: DylanV at semaphore.com (Dylan Vanderhoof) Date: Tue, 12 Sep 2006 23:57:04 -0700 Subject: [Linux-cluster] Some newbie questions Message-ID: > -----Original Message----- > From: Jos Vos [mailto:jos at xos.nl] > Sent: Tuesday, September 12, 2006 11:35 PM > To: linux clustering > Subject: Re: [Linux-cluster] Some newbie questions > > > Yes, this can be done (on an unmounted fs) using: > > gfs_tool sb proto lock_dlm > > Note that you better should add enough journals to the filesystem > when creating it. You can add journals later, but only if there > is (enough) space left on the device after the filesystem, which > is normally not the case (if your filesystem occupies the whole > device). Interesting. I hadn't considered that. Is there a document somewhere that shows how large a journal is? Or rather, is there a cost to adding more than I will likely need to be safe? If the fs is grown onto additional iSCSI targets, can journals be added at that point as well utilizing the additional space on those devices? 
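A hedged sketch of the operations being discussed; the device names and mount points are placeholders, and the journal figures are from memory, so check gfs_mkfs(8), gfs_jadd(8) and gfs_grow(8) before relying on them:

    # switch an unmounted filesystem from lock_nolock to lock_dlm
    gfs_tool sb /dev/vg0/gfs0 proto lock_dlm

    # at creation time: one journal per node that will mount the filesystem;
    # journals default to roughly 128 MB each (-J changes the size)
    gfs_mkfs -p lock_dlm -t mycluster:gfs0 -j 4 /dev/vg0/gfs0

    # after extending the underlying volume, add journals first, then grow,
    # because gfs_jadd needs free space that gfs_grow would otherwise consume
    gfs_jadd -j 2 /mnt/gfs0
    gfs_grow /mnt/gfs0

The cost of extra journals is essentially the disk space they occupy, so over-provisioning a few at mkfs time is usually cheaper than trying to add them later.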
Thanks, Dylan From Alain.Moulle at bull.net Wed Sep 13 07:51:31 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Wed, 13 Sep 2006 09:51:31 +0200 Subject: [Linux-cluster] CS4 Update 2 & Patch watchdog on Message-ID: <4507B883.8060400@bull.net> >> The self-watchdog patch adds a process which monitors the "real" >> clurgmgrd. The monitoring process should be the lower-numbered PID >> (it's the parent of the one doing the work). >> The monitoring process watches for crash signals (SIGBUS, SIGSEGV, >> etc.), and will simply exit if you kill the child with SIGKILL. >> So, basically, killing the higher-numbered PID with something like >> SIGSEGV should cause the node to reboot. >> -- Lon Thanks Lon, I understand. And if I kill -9 (SIGKILL) the higher-numbered PID at test purpose, is it expected to reboot or not ? I see in code : case SIGCHLD: case SIGILL: case SIGFPE: case SIGSEGV: case SIGBUS: setup_signal(i, SIG_DFL); break; default: setup_signal(i, signal_handler); but can't conclude for a SIGKILL on higher-numbered PID process ... Thanks again Alain Moull? From dan.hawker at astrium.eads.net Wed Sep 13 08:25:08 2006 From: dan.hawker at astrium.eads.net (HAWKER, Dan) Date: Wed, 13 Sep 2006 09:25:08 +0100 Subject: [Linux-cluster] CLVMD - Do I need it??? Message-ID: <7F6B06837A5DBD49AC6E1650EFF5490601223032@auk52177.ukr.astrium.corp> > > Have an EMC SAN unit on the way. I plan to use it as the central store for a > > couple of servers setup as a cluster, using GFS. As the SAN unit can handle > > all of its own Logical Volume management natively, I presume I don't have to > > use/implement CLVMD and hence can cut one layer of complexity in the disk > > structure away. > > > Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its > > configuration??? > > You don't need CLVM if you intend to use the internal array tools, but > it's a "nice to have" thing. After all, we've had GFS (and simple > failover, for that matter) for a few years -- while CLVM is a relatively > new technology. Some SANs can do this internally too, of course. > > For example, if you had CLVM and you add another array, I'm pretty sure > you could use CLVM to extend an existing logical volume on to the second > array while the cluster is running. >I think one of the big things is naming - ensuring that the device name >is always the same on all nodes in the cluster - regardless of any >devices added/changed/removed. If you can do that, in addition to >storage management, then there is probably no need to involve LVM >(cluster or not). If you plan to use LVM on top of the storage device, >then you must use clvmd. > > brassow Thanks for the replies. So, the decision is purely a matter of policy rather than any technical reasons. Didn't think of the possibility of extending the cluster storage by utilising CLVM. Makes sense, nice feature, that may make me use CLVM anyway. Guess I'll have a think and make a decision. Thanks again Dan -- Dan Hawker Linux System Administrator Astrium -- This email is for the intended addressee only. If you have received it in error then you must not use, retain, disseminate or otherwise deal with it. Please notify the sender by return email. The views of the author may not necessarily constitute the views of Astrium Limited. Nothing in this email shall bind Astrium Limited in any contract or obligation. Astrium Limited, Registered in England and Wales No. 
2449259 Registered Office: Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England From lhh at redhat.com Wed Sep 13 14:07:42 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 10:07:42 -0400 Subject: [Linux-cluster] qdiskd not properly failing nodes?? In-Reply-To: <091220061612.9370.4506DC6E0006D5FF0000249A22007481849B9C0A99020E0B@comcast.net> References: <091220061612.9370.4506DC6E0006D5FF0000249A22007481849B9C0A99020E0B@comcast.net> Message-ID: <1158156462.11241.5.camel@rei.boston.devel.redhat.com> On Tue, 2006-09-12 at 16:12 +0000, danwest at comcast.net wrote: > Any idea why the server is not getting rebooted/fenced? Did you start fenced ? Qdisk doesn't handle fencing; it still relies on CMAN to handle the fencing bit. -- Lon From lhh at redhat.com Wed Sep 13 14:18:10 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 10:18:10 -0400 Subject: [Linux-cluster] CS4 Update 2 & Patch watchdog on In-Reply-To: <4507B883.8060400@bull.net> References: <4507B883.8060400@bull.net> Message-ID: <1158157090.11241.8.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 09:51 +0200, Alain Moulle wrote: > >> The self-watchdog patch adds a process which monitors the "real" > >> clurgmgrd. The monitoring process should be the lower-numbered PID > >> (it's the parent of the one doing the work). > > >> The monitoring process watches for crash signals (SIGBUS, SIGSEGV, > >> etc.), and will simply exit if you kill the child with SIGKILL. > > >> So, basically, killing the higher-numbered PID with something like > >> SIGSEGV should cause the node to reboot. > > >> -- Lon > > Thanks Lon, I understand. > And if I kill -9 (SIGKILL) the higher-numbered PID at test purpose, > is it expected to reboot or not ? > > I see in code : > case SIGCHLD: > case SIGILL: > case SIGFPE: > case SIGSEGV: > case SIGBUS: > setup_signal(i, SIG_DFL); > break; > default: > setup_signal(i, signal_handler); > but can't conclude for a SIGKILL on higher-numbered PID process ... No, sigkill will just cause the watchdog to commit suicide: if (waitpid(child, &status, 0) <= 0) continue; if (WIFEXITED(status)) exit(WEXITSTATUS(status)); if (WIFSIGNALED(status)) { if (WTERMSIG(status) == SIGKILL) { clulog(LOG_CRIT, "Watchdog: Daemon killed, exiting\n"); raise(SIGKILL); Use something like SIGSEGV (e.g. to simulate a crash) and the nanny/watchdog process should reboot the node. -- Lon From lhh at redhat.com Wed Sep 13 14:22:30 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 10:22:30 -0400 Subject: [Linux-cluster] qdiskd not properly failing nodes?? In-Reply-To: <1158156462.11241.5.camel@rei.boston.devel.redhat.com> References: <091220061612.9370.4506DC6E0006D5FF0000249A22007481849B9C0A99020E0B@comcast.net> <1158156462.11241.5.camel@rei.boston.devel.redhat.com> Message-ID: <1158157350.11241.14.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 10:07 -0400, Lon Hohberger wrote: > On Tue, 2006-09-12 at 16:12 +0000, danwest at comcast.net wrote: > > > Any idea why the server is not getting rebooted/fenced? > > Did you start fenced ? Qdisk doesn't handle fencing; it still relies on > CMAN to handle the fencing bit. If you want, I could add something to cause the node to reboot itself on the down-transition when it detects its score is insufficient to continue as part of the master partition. 
E.g., right here: Sep 12 11:34:02 SERVER1 qdiskd[7495]: Score insufficient for master operation (1/2; max=4); downgrading [ reboot(RB_AUTOBOOT); /* if new configuration thing is set */ ] -- Lon From isplist at logicore.net Wed Sep 13 14:40:12 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 09:40:12 -0500 Subject: [Linux-cluster] Can't mount multiple GFS volumes? Message-ID: <200691394012.985163@leena> I have a need for non contiguous storage and wish to mount multiple GFS logical volumes. However, I cannot seem to get past this following error and others related. -Command # mount -t gfs /dev/vgcomp/str1 /lvstr1 mount: File exists [root at dev new]# -Error Log Sep 12 16:22:22 dev kernel: GFS: Trying to join cluster "lock_dlm", "vgcomp:gfscomp" Sep 12 16:22:22 dev kernel: dlm: gfscomp: lockspace already in use Sep 12 16:22:22 dev kernel: lock_dlm: new lockspace error -17 Sep 12 16:22:22 dev kernel: GFS: can't mount proto = lock_dlm, table = vgcomp:gfscomp, hostdata = Sep 12 16:22:23 dev hald[2168]: Timed out waiting for hotplug event 395. Rebasing to 396 There are two physical drives attached to a FC network. I would like to have access to each on their own, not as part of a single volume group of storage. Anyone have some ideas, things I can try, to start getting closer to something that works? I've tried all I can think of. Running RHEL4 with all latest updates. Let me know what info you need and I'll be happy to provide it of course. Thank you. Mike From isplist at logicore.net Wed Sep 13 14:44:18 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 09:44:18 -0500 Subject: [Linux-cluster] Cluster.conf documentation? Message-ID: <200691394418.802318@leena> I've looked but cannot seem to find good documentation on the cluster.conf file itself. Is there documentation somewhere which clearly talks about only the cluster.conf options, how to best build the file, available options, etc. Thanks. Mike From isplist at logicore.net Wed Sep 13 14:45:58 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 09:45:58 -0500 Subject: [Linux-cluster] Fencing using brocade Message-ID: <200691394558.252508@leena> I want to use my brocade switch as the fencing device for my cluster. I cannot find any documentation showing what I need to set up on the brocade itself and within the cluster.conf file as well to make this work. My cluster works fine... until a node dies of course or other problems come up. Thanks in advance for any help. Mike From jparsons at redhat.com Wed Sep 13 14:51:00 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 13 Sep 2006 10:51:00 -0400 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <200691394558.252508@leena> References: <200691394558.252508@leena> Message-ID: <45081AD4.5050801@redhat.com> isplist at logicore.net wrote: >I want to use my brocade switch as the fencing device for my cluster. I cannot >find any documentation showing what I need to set up on the brocade itself and >within the cluster.conf file as well to make this work. > >My cluster works fine... until a node dies of course or other problems come >up. > >Thanks in advance for any help. > >Mike > > The system-config-cluster application supports brocade fencing. It is a two part process - first you define the switch as a fence device; type brocade, then you select a node an click "Manage fencing for this node" and declare a fence instance. 
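For the command-line case, the relevant conf fragments usually look something like the sketch below; the device name, address, credentials and port number are placeholders, and each node must be cabled to its own switch port for this to isolate only that node (see fence_brocade(8)):

    <fencedevices>
        <fencedevice agent="fence_brocade" ipaddr="10.0.0.5" login="admin"
                     name="brocade1" passwd="secret"/>
    </fencedevices>

    <clusternode name="node1.example.com" nodeid="1" votes="1">
        <fence>
            <method name="1">
                <device name="brocade1" port="4"/>
            </method>
        </fence>
    </clusternode>

fence_brocade simply logs into the switch and disables the named port, so the port attribute has to be the physical switch port that the node's HBA (not a shared hub uplink) is plugged into.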
-J From isplist at logicore.net Wed Sep 13 15:00:54 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 10:00:54 -0500 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <45081AD4.5050801@redhat.com> Message-ID: <200691310054.237706@leena> >> I want to use my brocade switch as the fencing device for my cluster. I >> cannot find any documentation showing what I need to set up on the brocade >> itself and within the cluster.conf file as well to make this work. > The system-config-cluster application supports brocade fencing. It is a > two part process - first you define the switch as a fence device; type > brocade, then you select a node an click "Manage fencing for this node" > and declare a fence instance. Ah, I'm at the command line :). So, there is nothing I need to do on the brocade itself then? The cluster ports aren't connected directly, they are connected into a compaq hub, then the hub is connected into the brocade. The brocade seems to know about the external ports however since they are listed when I look on the switch. As for the conf file, I've not found enough information on how to build a good conf file so know this one is probably not even complete. Been working on other parts of the problems then wanting to get to this. From jparsons at redhat.com Wed Sep 13 15:08:07 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 13 Sep 2006 11:08:07 -0400 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <200691310054.237706@leena> References: <200691310054.237706@leena> Message-ID: <45081ED7.9060505@redhat.com> isplist at logicore.net wrote: >>>I want to use my brocade switch as the fencing device for my cluster. I >>>cannot find any documentation showing what I need to set up on the brocade >>>itself and within the cluster.conf file as well to make this work. >>> >>> > > > >>The system-config-cluster application supports brocade fencing. It is a >>two part process - first you define the switch as a fence device; type >>brocade, then you select a node an click "Manage fencing for this node" >>and declare a fence instance. >> >> > >Ah, I'm at the command line :). > >So, there is nothing I need to do on the brocade itself then? The cluster >ports aren't connected directly, they are connected into a compaq hub, then >the hub is connected into the brocade. The brocade seems to know about the >external ports however since they are listed when I look on the switch. > >As for the conf file, I've not found enough information on how to build a good >conf file so know this one is probably not even complete. Been working on >other parts of the problems then wanting to get to this. > Why build it yourself? The app will do it for you, and not make a typo that could cost you valuable time. If you don't have X running on your nodes, no problem - just install the s-c-cluster app anywhere...it will let you configure a cluster and then save out the conf file which you can propogate to the cluster yourself, if you want. -J > > > > > > > > > > > > > > > > > name="brocade" passwd="xxx"/> > > > > > > > > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > From cjk at techma.com Wed Sep 13 15:26:22 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Wed, 13 Sep 2006 11:26:22 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... Message-ID: Good morning.. 
Some oddness regarding clusterring on RHEL5beta1 (could be me) I have a two node cluster and the cluster components installed. I have two nics in each node, the second of which I want to use for openais. I have my cluster.conf pointing to the primary nic and I have openais pointing to 192.168.0.0 (my second nics are on 192.168.0.1 and 2) Things seem to start ok on both nodes but they don't appear to be talking to eachother. For instance, clustat on the first node shows both nodes active even if node2 is down. Actually, openais seems to be doing fine, but cman looks to be acting up. This config was created using s-c-cluster and indeed it looks good. Am I missing some new fundemental thing with the new cluster versions? I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but this is my first attempt at the new (openais based) clusterring. Any thoughts? Corey -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Wed Sep 13 15:28:34 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 11:28:34 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... In-Reply-To: References: Message-ID: <1158161314.11241.16.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 11:26 -0400, Kovacs, Corey J. wrote: > Good morning.. > > Some oddness regarding clusterring on RHEL5beta1 (could be me) > > I have a two node cluster and the cluster components installed. > I have two nics in each node, the second of which I want to use for > openais. > > I have my cluster.conf pointing to the primary nic and I have openais > pointing to > 192.168.0.0 (my second nics are on 192.168.0.1 and 2) > > Things seem to start ok on both nodes but they don't appear to be > talking to eachother. > For instance, clustat on the first node shows both nodes active even > if node2 is down. > > Actually, openais seems to be doing fine, but cman looks to be acting > up. > > This config was created using s-c-cluster and indeed it looks good. Am > I missing some > new fundemental thing with the new cluster versions? > > I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but > this is my first > attempt at the new (openais based) clusterring. > > > Any thoughts? For a start, you can always try cman_tool status / cman_tool nodes, just to take clustat out of the picture. -- Lon From cjk at techma.com Wed Sep 13 15:39:38 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Wed, 13 Sep 2006 11:39:38 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_RHEL5_cluster_problem...?= In-Reply-To: <1158161314.11241.16.camel@rei.boston.devel.redhat.com> Message-ID: Ok, that looks better. Both nodes show up as being memebers using cman_tool status and cman_tool nodes. Also, seems I forgot to start rgmanager. Once I started it, the "test" service I configured started up. stopping rgmanager, openais, cman in that order on the second node, caused node1 to fence node2. clustat still doesn't report correct status for me, but at least I am getting some status back. Thanks Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Wednesday, September 13, 2006 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] RHEL5 cluster problem... On Wed, 2006-09-13 at 11:26 -0400, Kovacs, Corey J. wrote: > Good morning.. > > Some oddness regarding clusterring on RHEL5beta1 (could be me) > > I have a two node cluster and the cluster components installed. 
> I have two nics in each node, the second of which I want to use for > openais. > > I have my cluster.conf pointing to the primary nic and I have openais > pointing to 192.168.0.0 (my second nics are on 192.168.0.1 and 2) > > Things seem to start ok on both nodes but they don't appear to be > talking to eachother. > For instance, clustat on the first node shows both nodes active even > if node2 is down. > > Actually, openais seems to be doing fine, but cman looks to be acting > up. > > This config was created using s-c-cluster and indeed it looks good. Am > I missing some new fundemental thing with the new cluster versions? > > I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but > this is my first attempt at the new (openais based) clusterring. > > > Any thoughts? For a start, you can always try cman_tool status / cman_tool nodes, just to take clustat out of the picture. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From frank at opticalart.de Wed Sep 13 15:42:39 2006 From: frank at opticalart.de (Frank Hellmann) Date: Wed, 13 Sep 2006 17:42:39 +0200 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <200691310054.237706@leena> References: <200691310054.237706@leena> Message-ID: <450826EF.90203@opticalart.de> Hi! I can only recommend the system-config-cluster GUI, but if you feel brave enough you can do it by hand This example is for a sanbox2, but it should get you going: ... .... ... And don't forget to check the fence_brocade manpage for your brocade switch for further options... Cheers, Frank... isplist at logicore.net wrote: >>> I want to use my brocade switch as the fencing device for my cluster. I >>> cannot find any documentation showing what I need to set up on the brocade >>> itself and within the cluster.conf file as well to make this work. >>> > > >> The system-config-cluster application supports brocade fencing. It is a >> two part process - first you define the switch as a fence device; type >> brocade, then you select a node an click "Manage fencing for this node" >> and declare a fence instance. >> > > Ah, I'm at the command line :). > > So, there is nothing I need to do on the brocade itself then? The cluster > ports aren't connected directly, they are connected into a compaq hub, then > the hub is connected into the brocade. The brocade seems to know about the > external ports however since they are listed when I look on the switch. > > As for the conf file, I've not found enough information on how to build a good > conf file so know this one is probably not even complete. Been working on > other parts of the problems then wanting to get to this. > > > > > > > > > > > > > > > > > name="brocade" passwd="xxx"/> > > > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- -------------------------------------------------------------------------- Frank Hellmann Optical Art GmbH Waterloohain 7a DI Supervisor http://www.opticalart.de 22769 Hamburg frank at opticalart.de Tel: ++49 40 5111051 Fax: ++49 40 43169199 -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Wed Sep 13 16:10:35 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 12:10:35 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... 
In-Reply-To: References: Message-ID: <1158163835.11241.20.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 11:39 -0400, Kovacs, Corey J. wrote: > Ok, that looks better. Both nodes show up as being memebers > using cman_tool status and cman_tool nodes. Also, seems I > forgot to start rgmanager. Once I started it, the "test" > service I configured started up. stopping rgmanager, openais, cman > in that order on the second node, caused node1 to fence node2. rgmanager caused a node to get fenced? :o I know there have been some pretty big rgmanager bugs fixed since B1 freeze, but that one is news to me. Let me see if there are any newer rgmanager packages available. -- Lon From jparsons at redhat.com Wed Sep 13 16:14:13 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 13 Sep 2006 12:14:13 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... In-Reply-To: References: Message-ID: <45082E55.1080506@redhat.com> Kovacs, Corey J. wrote: > Good morning.. > > Some oddness regarding clusterring on RHEL5beta1 (could be me) > > I have a two node cluster and the cluster components installed. > I have two nics in each node, the second of which I want to use for > openais. > > I have my cluster.conf pointing to the primary nic and I have openais > pointing to > 192.168.0.0 (my second nics are on 192.168.0.1 and 2) > > Things seem to start ok on both nodes but they don't appear to be > talking to eachother. > For instance, clustat on the first node shows both nodes active even > if node2 is down. > > Actually, openais seems to be doing fine, but cman looks to be acting up. > > This config was created using s-c-cluster and indeed it looks good. Am > I missing some > new fundemental thing with the new cluster versions? > > I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but > this is my first > attempt at the new (openais based) clusterring. > > > Any thoughts? > EEk. s-c-cluster is NOT updated completely for rhel5 cluster in the beta 1 release - Sorry, Corey. Are you aware that each node needs an explicit 'nodeid' attribute value in the conf file, in addition to the name attribute? This just needs to be a unique integer value...a simple enumeration of the nodes, 1, 2, 3... -J From cjk at techma.com Wed Sep 13 16:31:02 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Wed, 13 Sep 2006 12:31:02 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_RHEL5_cluster_problem...?= In-Reply-To: <45082E55.1080506@redhat.com> Message-ID: James, didn't know that s-c-cluster wasn't updated, good to know, not a problem as I like the command line anyway :) I did know about the nodeid so that's good to go. Thanks for the heads up on s-c-cluster Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Parsons Sent: Wednesday, September 13, 2006 12:14 PM To: linux clustering Subject: Re: [Linux-cluster] RHEL5 cluster problem... Kovacs, Corey J. wrote: > Good morning.. > > Some oddness regarding clusterring on RHEL5beta1 (could be me) > > I have a two node cluster and the cluster components installed. > I have two nics in each node, the second of which I want to use for > openais. > > I have my cluster.conf pointing to the primary nic and I have openais > pointing to 192.168.0.0 (my second nics are on 192.168.0.1 and 2) > > Things seem to start ok on both nodes but they don't appear to be > talking to eachother. > For instance, clustat on the first node shows both nodes active even > if node2 is down. 
> > Actually, openais seems to be doing fine, but cman looks to be acting up. > > This config was created using s-c-cluster and indeed it looks good. Am > I missing some new fundemental thing with the new cluster versions? > > I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but > this is my first attempt at the new (openais based) clusterring. > > > Any thoughts? > EEk. s-c-cluster is NOT updated completely for rhel5 cluster in the beta 1 release - Sorry, Corey. Are you aware that each node needs an explicit 'nodeid' attribute value in the conf file, in addition to the name attribute? This just needs to be a unique integer value...a simple enumeration of the nodes, 1, 2, 3... -J -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From cjk at techma.com Wed Sep 13 16:34:25 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Wed, 13 Sep 2006 12:34:25 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_RHEL5_cluster_problem...?= In-Reply-To: <1158163835.11241.20.camel@rei.boston.devel.redhat.com> Message-ID: Lon, no I don't believe rgmanager is the culprit rather it happened when rgmanager was not running. I went shutdown cman on node2 so in the beginning of my playing, I stopped openais, then cman (rgmamanager came later) and the node was fenced. I did this in the simplest way of course... service openais stop service cman stop anyway, I'll keep an eye on it. Thanks for your suggestions/help Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Wednesday, September 13, 2006 12:11 PM To: linux clustering Subject: RE: [Linux-cluster] RHEL5 cluster problem... On Wed, 2006-09-13 at 11:39 -0400, Kovacs, Corey J. wrote: > Ok, that looks better. Both nodes show up as being memebers using > cman_tool status and cman_tool nodes. Also, seems I forgot to start > rgmanager. Once I started it, the "test" > service I configured started up. stopping rgmanager, openais, cman in > that order on the second node, caused node1 to fence node2. rgmanager caused a node to get fenced? :o I know there have been some pretty big rgmanager bugs fixed since B1 freeze, but that one is news to me. Let me see if there are any newer rgmanager packages available. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Wed Sep 13 16:56:57 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 12:56:57 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... In-Reply-To: References: Message-ID: <1158166617.11241.22.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 12:34 -0400, Kovacs, Corey J. wrote: > Lon, no I don't believe rgmanager is the culprit rather it happened when > rgmanager was not running. I went shutdown cman on node2 so in the beginning > of my playing, I stopped openais, then cman (rgmamanager came later) and the > node was fenced. I did this in the simplest way of course... > > service openais stop > service cman stop > > anyway, I'll keep an eye on it. Yeah, I read that last message wrong. -- Lon From lhh at redhat.com Wed Sep 13 21:46:30 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 17:46:30 -0400 Subject: [Linux-cluster] qdiskd not properly failing nodes?? 
In-Reply-To: <2d4e61a8f96d5bf89f1d86611e4712d3@comcast.net> References: <2d4e61a8f96d5bf89f1d86611e4712d3@comcast.net> Message-ID: <1158183990.11241.65.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 15:40 -0400, Andrea Westervelt wrote: > > > ______________________________________________________________________ > > Lon, > > fenced is running and based on the manpage it seems like dropping > below a score of ? should cause a reboot? It currently expects the quorate partition (remember, this node is no longer quorate) to fence the node rather than taking action itself. > I guess I am a little confused on what the heuristics/scoring are > meant to do. Can you explain the role of the master partition and > what the expected outcome of an insufficient score should be? The master node is a node with sufficient score to declare itself online according to the heuristics that you supply in the qdisk configuration. Assuming it maintains its score, it arbitrates what other nodes join the "master" partition. If a node becomes part of the master partition, the node advertises quorum device votes to CMAN. Insufficient scores should cause a node to remove itself from the master partition and tell CMAN that the quorum device is offline. This should cause CMAN on a node in the qdisk master partition to fence the node (assuming that this causes the node to transition from quorate->inquorate). I'm guessing what is happening here in your case is that CMAN is still seeing the node - even though it's inquorate - and it's not fencing it -- is that right? A transition from quorate->inquorate should cause the node to get fenced. That sounds like a bug (pretty easy to fix, too). -- Lon From lhh at redhat.com Wed Sep 13 21:58:59 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 17:58:59 -0400 Subject: [Linux-cluster] qdiskd not properly failing nodes?? In-Reply-To: <1158183990.11241.65.camel@rei.boston.devel.redhat.com> References: <2d4e61a8f96d5bf89f1d86611e4712d3@comcast.net> <1158183990.11241.65.camel@rei.boston.devel.redhat.com> Message-ID: <1158184739.11241.73.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 17:46 -0400, Lon Hohberger wrote: > I'm guessing what is happening here in your case is that CMAN is still > seeing the node - even though it's inquorate - and it's not fencing it > -- is that right? A transition from quorate->inquorate should cause the > node to get fenced. > > That sounds like a bug (pretty easy to fix, too). The easiest fix is to make it reboot on the S_RUN->S_NONE transition like it says in the man page (but allow a configuration parameter to override it). This would make it work exactly stated, and wouldn't require any changes to your configuration. -- Lon From lhh at redhat.com Wed Sep 13 22:24:43 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 18:24:43 -0400 Subject: [Linux-cluster] [PATCH] reboot flag + score fix In-Reply-To: <1158184739.11241.73.camel@rei.boston.devel.redhat.com> References: <2d4e61a8f96d5bf89f1d86611e4712d3@comcast.net> <1158183990.11241.65.camel@rei.boston.devel.redhat.com> <1158184739.11241.73.camel@rei.boston.devel.redhat.com> Message-ID: <1158186283.11241.82.camel@rei.boston.devel.redhat.com> This implements a reboot flag which must be explicitly disabled. Upon a transition from majority score to less than majority score, a node will reboot unless the reboot flag is explicitly set to 0 in the cluster configuration. This makes qdiskd operate consistently with section 2.2 in the manual page. 
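For reference, the kind of configuration this change applies to is sketched below; the scores, intervals and heuristic programs are placeholders, and the reboot attribute name is an assumption based on the flag this patch introduces, so check qdisk(5) for the final spelling:

    <quorumd interval="1" tko="10" votes="1" label="qdisk" min_score="2" reboot="1">
        <heuristic program="ping -c1 -t1 10.0.0.254" score="2" interval="2"/>
        <heuristic program="/usr/local/bin/check_app.sh" score="1" interval="2"/>
    </quorumd>

With min_score set to a majority of the maximum score, losing enough heuristics drops the node below the threshold, and with the new flag left at its default the node reboots itself on that downward transition instead of waiting to be fenced.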
-- Lon -------------- next part -------------- A non-text attachment was scrubbed... Name: qdisk-transition.patch Type: text/x-patch Size: 2919 bytes Desc: not available URL: From isplist at logicore.net Thu Sep 14 02:17:25 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 21:17:25 -0500 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <450826EF.90203@opticalart.de> Message-ID: <2006913211725.520516@leena> In my case, the nodes are connected to a hub, which is in turn connected to the brocade. Do I just use the brocade's port still? I have not been able to find clear information on building a proper cluster.conf file either so have bits of this and that. This is what I've got... you're sample and the bits and pieces I've been using. Mike On Wed, 13 Sep 2006 17:42:39 +0200, Frank Hellmann wrote: > Hi! > > I can only recommend the system-config-cluster GUI, but if you feel brave > enough you can do it by hand > > This example is for a sanbox2, but it should get you going: > > ... > > > > > > > > > > > > > > > > > .... > > > login="username" name="sanbox" passwd="password"/> > > ... > > And don't forget to check the fence_brocade manpage for your brocade switch > for further options... > > Cheers, > > Frank... > > isplist at logicore.net wrote: > > > I want to use my brocade switch as the > fencing device for my cluster. I cannot find any documentation showing what > I need to set up on the brocade itself and within the cluster.conf file as > well to make this work. > >>> The system-config-cluster application supports brocade fencing. It is a >>> two part process - first you define the switch as a fence device; type >>> brocade, then you select a node an click "Manage fencing for this node" >>> and declare a fence instance. >> Ah, I'm at the command line :). So, there is nothing I need to do on the >> brocade itself then? The cluster ports aren't connected directly, they >> are connected into a compaq hub, then the hub is connected into the >> brocade. The brocade seems to know about the external ports however since >> they are listed when I look on the switch. As for the conf file, I've not >> found enough information on how to build a good conf file so know this >> one is probably not even complete. Been working on other parts of the >> problems then wanting to get to this. 
> config_version="40" name="vgcomp"> > post_fail_delay="0" post_join_delay="3"/> > name="cweb92.companions.com" nodeid="92" votes="1"/> > name="cweb93.companions.com" nodeid="93" votes="1"/> > name="cweb94.companions.com" nodeid="94" votes="1"/> > name="dev.companions.com" nodeid="99" votes="1"/> > name="qm247.companions.com" nodeid="247" votes="1"/> > name="qm248.companions.com" nodeid="248" votes="1"/> > name="qm249.companions.com" nodeid="249" votes="1"/> > name="qm250.companions.com" nodeid="250" votes="1"/> >> > ipaddr="x.x.x.x" login="xxx" name="brocade" passwd="xxx"/> >> -- >> Linux-cluster mailing list Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- ------------------------------------------------------------------------- > - Frank Hellmann Optical Art GmbH Waterloohain 7a DI Supervisor > http://www.opticalart.de 22769 Hamburg frank at opticalart.de Tel: ++49 40 > 5111051 Fax: ++49 40 43169199 From isplist at logicore.net Thu Sep 14 02:34:12 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 21:34:12 -0500 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <450826EF.90203@opticalart.de> Message-ID: <2006913213412.698710@leena> Anyone have any thoughts on this config? Make sense, not, needs work? Can do the job but not the best? Etc. Thanks. The nodes are all connected into a compaq FC hub. That hub is then connected into a brocade switch. I'd like to use the brocade switch as the fencing device. From eric at bootseg.com Thu Sep 14 02:48:28 2006 From: eric at bootseg.com (Eric Kerin) Date: Wed, 13 Sep 2006 22:48:28 -0400 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <2006913213412.698710@leena> References: <2006913213412.698710@leena> Message-ID: <1158202108.2411.4.camel@mechanism.localnet> On Wed, 2006-09-13 at 21:34 -0500, isplist at logicore.net wrote: > The nodes are all connected into a compaq FC hub. That hub is then connected > into a brocade switch. I'd like to use the brocade switch as the fencing > device. > Sadly, that won't work. The fence script for brocade instructs it to turn off a specified port. All your machines hook up to the switch through a single port. So when a node acts up, you disconnect ALL of your nodes nodes from the storage at the same time, since they all are connected to port 0 (through the hub). For SAN fabric fencing to work, each server needs to be connected to the brocade switch on it's own switch port. Thanks, Eric From isplist at logicore.net Thu Sep 14 02:50:49 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 21:50:49 -0500 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <1158202108.2411.4.camel@mechanism.localnet> Message-ID: <2006913215049.413248@leena> > For SAN fabric fencing to work, each server needs to be connected to the > brocade switch on it's own switch port. Darn, thought that would be the case :). Well, I'm looking at a large McData switch and from what I've seen, those are also supported so, guess that's the next way to go. Thanks! Mike From isplist at logicore.net Thu Sep 14 03:19:35 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 22:19:35 -0500 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <1158202108.2411.4.camel@mechanism.localnet> Message-ID: <2006913221935.494886@leena> > Sadly, that won't work. The fence script for brocade instructs it to > turn off a specified port. 
All your machines hook up to the switch > through a single port. So when a node acts up, you disconnect ALL of > your nodes nodes from the storage at the same time, since they all are > connected to port 0 (through the hub). Since hubs are much cheaper than switches, and from the brocade's point of view, it can see unique ports even on the hub... would it not be worth adding this functionality to the fencing functions? Mike From erling.nygaard at gmail.com Thu Sep 14 06:59:59 2006 From: erling.nygaard at gmail.com (Erling Nygaard) Date: Thu, 14 Sep 2006 08:59:59 +0200 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <2006913211725.520516@leena> References: <450826EF.90203@opticalart.de> <2006913211725.520516@leena> Message-ID: Mike If I understand your description correctly, you have all your nodes connected into a FC hub. This hub is then connected to one port of the Brocade FC switch. So all the nodes are on a single public Arbitrated Loop. I assume that all the FC-connected storage is on another port on the Brocade? I can see one potential problem with this setup. If the fencing is done by disabling the port on the Brocade the entire loop will be disconnected from the switch. So instead of fencing one node the entire loop (containing all nodes) will be fenced. (Cut off from the storage) Only way I can see this work is to configure the fencing work with the wwnn/wwpn of the nodes instead of the port on the Brocade. Instead of having a fencing operation block all traffic on a given Brocade port you need to have the Brocade block traffic to a given wwnn/wwpn (the wwnn/wwpn of the FC-HBA of the node to be fenced) I have not played with such a setup for a number of years, so I can't really tell you how this should be done. And of course, if you have the storage connected to the same FC-hub, this won't work at all. In that case the traffic between the storage and the nodes would not be controlled by the Brocade at all... This should at least point out a potential problem :-) Erling On 9/14/06, isplist at logicore.net wrote: > In my case, the nodes are connected to a hub, which is in turn connected to > the brocade. Do I just use the brocade's port still? > > I have not been able to find clear information on building a proper > cluster.conf file either so have bits of this and that. > > This is what I've got... you're sample and the bits and pieces I've been > using. > > > > > > > > > > > > > > > > > name="brocade" passwd="xxx"/> > > > > > > > > Mike > > > On Wed, 13 Sep 2006 17:42:39 +0200, Frank Hellmann wrote: > > Hi! > > > > I can only recommend the system-config-cluster GUI, but if you feel brave > > enough you can do it by hand > > > > This example is for a sanbox2, but it should get you going: > > > > ... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > login="username" name="sanbox" passwd="password"/> > > > > ... > > > > And don't forget to check the fence_brocade manpage for your brocade switch > > for further options... > > > > Cheers, > > > > Frank... > > > > isplist at logicore.net wrote: > > > I want to use my brocade switch as the > > fencing device for my cluster. I cannot find any documentation showing what > > I need to set up on the brocade itself and within the cluster.conf file as > > well to make this work. > > > >>> The system-config-cluster application supports brocade fencing. 
It is a > >>> two part process - first you define the switch as a fence device; type > >>> brocade, then you select a node an click "Manage fencing for this node" > >>> and declare a fence instance. > >> Ah, I'm at the command line :). So, there is nothing I need to do on the > >> brocade itself then? The cluster ports aren't connected directly, they > >> are connected into a compaq hub, then the hub is connected into the > >> brocade. The brocade seems to know about the external ports however since > >> they are listed when I look on the switch. As for the conf file, I've not > >> found enough information on how to build a good conf file so know this > >> one is probably not even complete. Been working on other parts of the > >> problems then wanting to get to this. >> config_version="40" name="vgcomp"> >> post_fail_delay="0" post_join_delay="3"/> >> name="cweb92.companions.com" nodeid="92" votes="1"/> >> name="cweb93.companions.com" nodeid="93" votes="1"/> >> name="cweb94.companions.com" nodeid="94" votes="1"/> >> name="dev.companions.com" nodeid="99" votes="1"/> >> name="qm247.companions.com" nodeid="247" votes="1"/> >> name="qm248.companions.com" nodeid="248" votes="1"/> >> name="qm249.companions.com" nodeid="249" votes="1"/> >> name="qm250.companions.com" nodeid="250" votes="1"/> > >> >> ipaddr="x.x.x.x" login="xxx" name="brocade" passwd="xxx"/> > >> -- > >> Linux-cluster mailing list Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > -- ------------------------------------------------------------------------- > > - Frank Hellmann Optical Art GmbH Waterloohain 7a DI Supervisor > > http://www.opticalart.de 22769 Hamburg frank at opticalart.de Tel: ++49 40 > > 5111051 Fax: ++49 40 43169199 > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- - Mac OS X. Because making Unix user-friendly is easier than debugging Windows From pcaulfie at redhat.com Thu Sep 14 07:51:14 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 14 Sep 2006 08:51:14 +0100 Subject: [Linux-cluster] Can't mount multiple GFS volumes? In-Reply-To: <200691394012.985163@leena> References: <200691394012.985163@leena> Message-ID: <450909F2.7070106@redhat.com> isplist at logicore.net wrote: > I have a need for non contiguous storage and wish to mount multiple GFS > logical volumes. However, I cannot seem to get past this following error and > others related. > > -Command > # mount -t gfs /dev/vgcomp/str1 /lvstr1 > mount: File exists > [root at dev new]# > > -Error Log > Sep 12 16:22:22 dev kernel: GFS: Trying to join cluster "lock_dlm", > "vgcomp:gfscomp" > Sep 12 16:22:22 dev kernel: dlm: gfscomp: lockspace already in use > Sep 12 16:22:22 dev kernel: lock_dlm: new lockspace error -17 When you created the GFS volumes using gfs_mkfs did you give them different names ? All filesystems in a cluster must have unique names. -- patrick From frank at opticalart.de Thu Sep 14 08:05:01 2006 From: frank at opticalart.de (Frank Hellmann) Date: Thu, 14 Sep 2006 10:05:01 +0200 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <2006913221935.494886@leena> References: <2006913221935.494886@leena> Message-ID: <45090D2D.30606@opticalart.de> isplist at logicore.net wrote: >> Sadly, that won't work. The fence script for brocade instructs it to >> turn off a specified port. All your machines hook up to the switch >> through a single port. 
So when a node acts up, you disconnect ALL of >> your nodes nodes from the storage at the same time, since they all are >> connected to port 0 (through the hub). >> > > Since hubs are much cheaper than switches, and from the brocade's point of > view, it can see unique ports even on the hub... would it not be worth adding > this functionality to the fencing functions? > > Mike > > Can you try to turn off a single port of that hub via the brocade switch? I doubt that there is any method in the brocade switch to do that, but I could be wrong here. Also if the hub is manageable there might be a way to disable certain ports directly at the hub. If neither works, you'll need to think of a different setup for SAN fencing, like putting the nodes onto their own FC-switch, or consider power fencing via a network manageable pdu or ups. Cheers, Frank... -- -------------------------------------------------------------------------- Frank Hellmann Optical Art GmbH Waterloohain 7a DI Supervisor http://www.opticalart.de 22769 Hamburg frank at opticalart.de Tel: ++49 40 5111051 Fax: ++49 40 43169199 From chekov at ucla.edu Thu Sep 14 10:01:50 2006 From: chekov at ucla.edu (Alan Wood) Date: Thu, 14 Sep 2006 03:01:50 -0700 (PDT) Subject: [Linux-cluster] GFS and the Dell pv220s or iSCSI In-Reply-To: <20060816123335.74AB373340@hormel.redhat.com> References: <20060816123335.74AB373340@hormel.redhat.com> Message-ID: sorry I've been away from the list and only getting to this 1-month old thread now... Brendan, I have a pv220s which I used for a GFS cluster last year with disasterous consequences. Performance was terrible for multiple concurrent users (one of the chief thing you are worried about in selecting SCSI over SATA in the first place). In addition, the support I got from Dell, while attentive, ended after 3 months with "we do not support using the pv220s in an active-active linux cluster". This is after I had reverted out of GFS and was using linux-ha to do failover and getting SCSI reservation errors which led to data loss... I have since moved on to iSCSI as a few people on the list suggested you do. Instead of the Dell/EMC box most people were talking about I went with a Promise vtrak M300i. as far as I could see there were only a couple of minor differences and the promise box was less than half the price when fully stocked with SATA drives (because Dell totally rips you off on the price of the drives). It supports SATA II and NCQ. so far performance has been just as good as with the pv220s in clustered config (I am only using 10K drives in the 220 though). I have just bought a couple of Qlogic HBAs (in the US $500 instead of the $2K someone mentioned in Brazil) but have yet to test them. I also bought a second enclosure and am hoping to use lvm mirroring and multipathing as soon as its good to go in order to have full redundancy: http://www.redhat.com/f/summitfiles/presentation/May31/Clustering%20and%20Storage/StorageUninterrupted.pdf btw, HP does offer an iSCSI head unit that you can then daisy-chain SCSI or SATA enclosures off of -- so if you really want SCSI disks that would be an option: http://h18006.www1.hp.com/products/storageworks/msa1510i/index.html I haven't tested it (and last I heard HP only officially supported it in Windows) but if anyone else on the list has experience with it I'd be curious to hear it. now that I have a 10gig switch available to me I'm also curious to try out a 10gig iSCSI enclosure but haven't seen any on the market... 
-alan > ------------------------------ > > Message: 5 > Date: Tue, 15 Aug 2006 21:29:59 +0100 > From: Brendan Heading > Subject: [Linux-cluster] Setting up a GFS cluster > To: linux-cluster at redhat.com > Message-ID: <44E22EC7.8020506 at clara.co.uk> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hi all, > > I'm planning to build a cluster using a pair of PE1950s, using RHEL 3 > (or 4) with RHCS. Plan at the moment is to use GFS. Most of our stuff is > Dell, therefore the obvious choice is to use a Dell PowerVault 220S as > the shared storage device. > > Before I kick off with this idea I'd be interested to hear if anyone had > any issues with this kind of setup, or if there were any general > performance problems. Are there other SCSI enclosures which might be > better or more appropriate for these purposes ? > > Regards > > Brendan > > > > ------------------------------ > > Message: 7 > Date: Tue, 15 Aug 2006 23:23:40 -0300 > From: "Celso K. Webber" > Subject: Re: [Linux-cluster] Setting up a GFS cluster > To: linux clustering > Message-ID: <44E281AC.4010608 at webbertek.com.br> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hello Brendan, > > Although Dell hardware is an excellent choice for Linux, the PV220S > solution is terrible at performance under a cluster environment. > > The reason is that the PV220S itself does not manage RAID devices, it is > in fact a JBOD (Just a Bunch Of Disks). The RAID management is done by > the SCSI controllers within the servers (PERC 3/DC or PERC 4/DC). > > Since there is a possibility of one of the machines going down, together > with data in the controller's write cache, this solution automatically > disable the write cache (write through mode) when you set the > controllers in "cluster mode". > > The end result is very poor performance, specially on write operations. > It's not uncommon that Dell provides the PV-220S with 15K RPM disks to > compensate this performance penalty due to lack of write cache. > > As far as I can tell, Red Hat did support the PV220S solution in the > past, during the RHEL 2.1 era, but it is not supported anymore as > certified shared storage for cluster solutions (RHCS or RHGFS). > > If you still plan to go on, be warned that the PV220S performs better in > Cluster Mode if you set up the data transfer rate to 160 MB/s instead of > 320 MB/s (the PERC 3/DC supports transfer rates of up to 160 MB/s while > the PERC 4/DC supports up to 320 MB/s). This is a known issue at Dell > support queues. > > As an extra information, there were too many problems about reliability > with the PV220S when used in Cluster Mode, this can be seen by the large > amount of firmware updates for the PERC 3/DC and 4/DC (LSI Logic based > chipset, megaraid driver on Linux). More recent firmware versions seem > to have corrected most logical drive corruption problems I've > experienced, so I believe the PV220S is still worth a try if you can > live with the poor performance issue. > > Maybe a Dell|EMC AX-100 using iSCSI could a better choice with a not so > high price tag. > > Sorry for the long message, I believe this information can be useful to > others. > > Best regards, > > Celso. > > Brendan Heading escreveu: >> Hi all, >> >> I'm planning to build a cluster using a pair of PE1950s, using RHEL 3 >> (or 4) with RHCS. Plan at the moment is to use GFS. Most of our stuff is >> Dell, therefore the obvious choice is to use a Dell PowerVault 220S as >> the shared storage device. 
>> >> Before I kick off with this idea I'd be interested to hear if anyone had >> any issues with this kind of setup, or if there were any general >> performance problems. Are there other SCSI enclosures which might be >> better or more appropriate for these purposes ? >> >> Regards >> >> Brendan >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > From bosse at klykken.com Thu Sep 14 11:20:43 2006 From: bosse at klykken.com (Bosse Klykken) Date: Thu, 14 Sep 2006 13:20:43 +0200 Subject: [Linux-cluster] Cluster node won't rejoin cluster after fencing, stops at cman Message-ID: <45093B0B.10709@klykken.com> Hi. I'm having some issues with a two-node failover cluster on RHEL4/U3 with kernel 2.6.9-34.0.1.ELsmp, ccs-1.0.3-0, cman-1.0.4-0, fence-1.32.18-0 and rgmanager-1.9.46-0. After a mishap where I accidentaly caused a failover of services with power fencing of server01, the system will not rejoin the cluster after boot. I have tried using both the init.d scripts and starting the daemons manually to troubleshoot this further, to no avail. I'm able to start ccsd properly (although it logs the cluster as inquorate) but it fails completely on cman, claiming that connection is refused. If anyone could help me by giving me some tips, directing me to the proper documentation addressing this issue or downright pointing out my problem, I would be most grateful. [server01] # service ccsd start Starting ccsd: [ OK ] ---8<--- /var/log/messages Sep 14 00:33:28 server01 ccsd[30227]: Starting ccsd 1.0.3: Sep 14 00:33:28 server01 ccsd[30227]: Built: Jan 25 2006 16:54:43 Sep 14 00:33:28 server01 ccsd[30227]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. Sep 14 00:33:28 server01 ccsd[30227]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Sep 14 00:33:28 server01 ccsd[30227]: Initial status:: Inquorate Sep 14 00:33:29 server01 ccsd: startup succeeded ---8<--- [server01] # service cman start Starting cman: [FAILED] ---8<--- /var/log/messages Sep 14 00:39:07 server01 ccsd[31417]: Cluster is not quorate. Refusing connection. Sep 14 00:39:07 server01 ccsd[31417]: Error while processing connect: Connection refused Sep 14 00:39:07 server01 ccsd[31417]: cluster.conf (cluster name = something_cluster, version = 46) found. Sep 14 00:39:07 server01 ccsd[31417]: Remote copy of cluster.conf is from quorate node. 
Sep 14 00:39:07 server01 ccsd[31417]: Local version # : 46 Sep 14 00:39:07 server01 ccsd[31417]: Remote version #: 46 Sep 14 00:39:07 server01 cman: cman_tool: Node is already active failed Sep 14 00:39:12 server01 kernel: CMAN: sending membership request ---8<--- [server01] # cat /proc/cluster/status Protocol version: 5.0.1 Config version: 46 Cluster name: something_cluster Cluster ID: 47540 Cluster Member: No Membership state: Joining [server01] # cat /proc/cluster/nodes Node Votes Exp Sts Name [server02] # cat /proc/cluster/status Protocol version: 5.0.1 Config version: 46 Cluster name: something_cluster Cluster ID: 47540 Cluster Member: Yes Membership state: Cluster-Member Nodes: 1 Expected_votes: 1 Total_votes: 1 Quorum: 1 Active subsystems: 4 Node name: server02 Node addresses: xx.xx.xx.134 [server02] # cat /proc/cluster/nodes Node Votes Exp Sts Name 1 1 1 X server01 2 1 1 M server02 [server01] # cat /etc/cluster/cluster.conf ---8<--- > > > > > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > From carlopmart at gmail.com Wed Sep 20 14:03:42 2006 From: carlopmart at gmail.com (carlopmart) Date: Wed, 20 Sep 2006 16:03:42 +0200 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <45114037.1040108@redhat.com> References: <4510FA68.2000403@gmail.com> <45114037.1040108@redhat.com> Message-ID: <45114A3E.2080202@gmail.com> Thanks Jim, but If i change iLO fence for a GNBD fence, results are the same for my three questions, or do I need to configure one gnbd fence for each node??? Jim Parsons wrote: > carlopmart wrote: > >> Hi all, >> >> Sorry for this toppic, but i have serious doubts about using cluster >> suite under some deployments. My questions: >> >> a) How can I configure status check on a service script? for exmaple: >> I have two nodes with CS U4 with postfix service running on two nodes >> and using DLM as a lock manager. If I stop postfix from the script and >> I wait status check, nothing happens and rgmanager returns an ok for >> the service, but this service is stopped !!!. >> >> b) is it posible to startup only one node on a two-node cluster? i >> have tested this feature, but this node doesn't startup ( i am using >> iLO as a fencing method, but I have tested gnbd too and the result is >> the same). >> >> c) why relocate service doesn't works?? I have attached my config. >> For example, if I reboot one node, all services go to the second. This >> is ok, but when this primary node is up, services continue getting up >> in the previous node and they don't migrate towards the other node. >> >> >> I suppose that I am doing something wrong but i don't know what. >> Somebody can helps me?? >> >> many thanks. > > Below in the conf file, you have one ilo device declared under the > fencedevice section, and both nodes are using it. This would mean that > if one node were ever fenced, then both nodes would be fenced. ilo is a > per node fence type - they are rarely shared. I think you should have a > fencedevice block of type ilo for each node, and then the fence section > under each node should ref the appropriate device....that is, node1 > should use it's built-in ilo and node2 should use its own. 
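(For illustration, the per-node layout described above would look roughly like the excerpt below -- the hostnames, logins and device names are placeholders of mine, not the real config, and the exact attribute names are worth checking against the fence_ilo man page:

<fencedevices>
  <fencedevice agent="fence_ilo" hostname="ilo-node1.example.com" login="Administrator" passwd="xxx" name="ilo_node1"/>
  <fencedevice agent="fence_ilo" hostname="ilo-node2.example.com" login="Administrator" passwd="xxx" name="ilo_node2"/>
</fencedevices>

and inside each clusternode block a fence section that references only that node's own iLO, e.g. for node1:

<fence>
  <method name="1">
    <device name="ilo_node1"/>
  </method>
</fence>

That way fencing node1 acts only on node1 instead of taking both nodes down.)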
> > -J > >> >> >> >> ------------------------------------------------------------------------ >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > login="Administrator" name="fence_iLO" passwd="fenceilo"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > name="dbserver" recovery="relocate"> >> >> >> >> >> >> >> >> >> ------------------------------------------------------------------------ >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- CL Martinez carlopmart {at} gmail {d0t} com From lhh at redhat.com Wed Sep 20 18:19:21 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 20 Sep 2006 14:19:21 -0400 Subject: [Linux-cluster] clmrmtabd not running. Can anyone fill me in? In-Reply-To: <20060919001307.32980.qmail@web34201.mail.mud.yahoo.com> References: <20060919001307.32980.qmail@web34201.mail.mud.yahoo.com> Message-ID: <1158776361.7388.21.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-18 at 17:13 -0700, Rick Rodgers wrote: > I am using Clumanager version 1.2.24. I notice that clurmtabd is not > running for my services. IS this correct? If so does anyone know why? > > Also if it is dupposed to be running what should the cluster.xml look > lilke to > make that happen? It only runs if an NFS service is running. -- Lon From lhh at redhat.com Wed Sep 20 18:26:15 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 20 Sep 2006 14:26:15 -0400 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <4510FA68.2000403@gmail.com> References: <4510FA68.2000403@gmail.com> Message-ID: <1158776775.7388.29.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-20 at 10:23 +0200, carlopmart wrote: > Hi all, > > Sorry for this toppic, but i have serious doubts about using cluster > suite under some deployments. My questions: > > a) How can I configure status check on a service script? for exmaple: > I have two nodes with CS U4 with postfix service running on two nodes > and using DLM as a lock manager. If I stop postfix from the script and I > wait status check, nothing happens and rgmanager returns an ok for the > service, but this service is stopped !!!. The status check in the postfix script must return nonzero if the service is stopped. > b) is it posible to startup only one node on a two-node cluster? i > have tested this feature, but this node doesn't startup ( i am using iLO > as a fencing method, but I have tested gnbd too and the result is the same). Yes, but the other node must be fenced first. > c) why relocate service doesn't works?? I have attached my config. For > example, if I reboot one node, all services go to the second. This is > ok, but when this primary node is up, services continue getting up in > the previous node and they don't migrate towards the other node. Kill the 'exclusive' attribute. It doesn't do what you think it does (and is probably the source of your problem). -- Lon From jab at ufba.br Wed Sep 20 20:20:31 2006 From: jab at ufba.br (Jeronimo Bezerra) Date: Wed, 20 Sep 2006 17:20:31 -0300 Subject: [Linux-cluster] Troubles to install GFS on Debian Message-ID: <1158783631.28886.8.camel@localhost.localdomain> Hello All. I'm having a big trouble here to compile the gfs on Debian. 
I downloaded from CVS, and did the follow: cd /usr/src ln -s linux-source-2.6.16 linux-2.6 cd cluster ./configure make After that, the make command returns the follow: make[2]: Entering directory `/usr/src/cluster/group/daemon (...) gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o joinleave.o joinleave.c joinleave.c: In function `do_leave': joinleave.c:129: warning: long long unsigned int format, uint64_t arg (arg 7) joinleave.c:136: warning: long long unsigned int format, uint64_t arg (arg 7) gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o main.o main.c main.c: In function `app_deliver': main.c:180: warning: int format, different type arg (arg 6) gcc -L//usr/src/cluster/group/../cman/lib -L//usr/lib64/openais -L//usr/lib64 -o groupd app.o cpg.o cman.o joinleave.o main.o -lcman -lcpg /usr/bin/ld: cannot find -lcpg collect2: ld returned 1 exit status make[2]: *** [groupd] Error 1 make[2]: Leaving directory `/usr/src/cluster/group/daemon' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/cluster/group' make: *** [all] Error 2 In the directory /usr/src/cluster/group/daemon : app.c app.o cman.c cman.o cpg.c cpg.o CVS gd_internal.h groupd.h joinleave.c joinleave.o main.c main.o Makefile ie, the gcc loads the cman (-lcman) but doesn't with cgp (-lcpg). Why this happen? Is there a best howto to install gfs on Debian? or another way?? Thanks a lot Jeronimo Bezerra From rpeterso at redhat.com Wed Sep 20 20:38:36 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 20 Sep 2006 15:38:36 -0500 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <1158783631.28886.8.camel@localhost.localdomain> References: <1158783631.28886.8.camel@localhost.localdomain> Message-ID: <4511A6CC.4080301@redhat.com> Jeronimo Bezerra wrote: > Hello All. > > I'm having a big trouble here to compile the gfs on Debian. I downloaded > from CVS, and did the follow: > > cd /usr/src > ln -s linux-source-2.6.16 linux-2.6 > cd cluster > ./configure > make > > After that, the make command returns the follow: > > make[2]: Entering directory `/usr/src/cluster/group/daemon > (...) > gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o > joinleave.o joinleave.c > joinleave.c: In function `do_leave': > joinleave.c:129: warning: long long unsigned int format, uint64_t arg > (arg 7) > joinleave.c:136: warning: long long unsigned int format, uint64_t arg > (arg 7) > gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o main.o > main.c > main.c: In function `app_deliver': > main.c:180: warning: int format, different type arg (arg 6) > gcc -L//usr/src/cluster/group/../cman/lib -L//usr/lib64/openais > -L//usr/lib64 -o groupd app.o cpg.o cman.o joinleave.o main.o -lcman > -lcpg > /usr/bin/ld: cannot find -lcpg > collect2: ld returned 1 exit status > make[2]: *** [groupd] Error 1 > make[2]: Leaving directory `/usr/src/cluster/group/daemon' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/usr/src/cluster/group' > make: *** [all] Error 2 > > In the directory /usr/src/cluster/group/daemon : > > app.c app.o cman.c cman.o cpg.c cpg.o CVS gd_internal.h groupd.h > joinleave.c joinleave.o main.c main.o Makefile > > ie, the gcc loads the cman (-lcman) but doesn't with cgp (-lcpg). > > Why this happen? Is there a best howto to install gfs on Debian? or > another way?? > > Thanks a lot > > Jeronimo Bezerra Hi Jeronimo, libcpg is part of openais, so I suspect you're missing openais. 
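A quick way to confirm that is to check whether the library is anywhere the linker will look, and if it isn't, build and install openais before re-running make in the cluster tree. Rough sketch only -- the paths are just the usual defaults, adjust them to wherever your openais checkout and libraries actually live:

# is libcpg present at all?
ls /usr/lib/libcpg* /usr/lib64/libcpg* /usr/local/lib/libcpg* 2>/dev/null

# if nothing turns up, build and install openais first
cd /usr/src/openais
make
make install    # depending on the openais Makefile you may need DESTDIR=... here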
Here is a usage.txt file that is a good resource for building the development tree of cluster/GFS from source. http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/doc/usage.txt?cvsroot=cluster Regards, Bob Peterson Red Hat Cluster Suite From jab at ufba.br Wed Sep 20 20:59:45 2006 From: jab at ufba.br (Jeronimo Bezerra) Date: Wed, 20 Sep 2006 17:59:45 -0300 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <4511A6CC.4080301@redhat.com> References: <1158783631.28886.8.camel@localhost.localdomain> <4511A6CC.4080301@redhat.com> Message-ID: <1158785985.31019.5.camel@localhost.localdomain> Hello Bob, thanks :) I installed openais but I didn't see that was in /usr/local/usr/include/openais, and in the Debian the default location is /usr/include. I fix it. After that, I received another error: make[2]: Leaving directory `/usr/src/cluster/group/tool' make -C dlm_controld all make[2]: Entering directory `/usr/src/cluster/group/dlm_controld' gcc -Wall -g -I//usr/include -I../config -idirafter /include/linux -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c -o main.o main.c main.c: In function `setup_uevent': main.c:183: error: `NETLINK_KOBJECT_UEVENT' undeclared (first use in this function) main.c:183: error: (Each undeclared identifier is reported only once main.c:183: error: for each function it appears in.) make[2]: *** [main.o] Error 1 make[2]: Leaving directory `/usr/src/cluster/group/dlm_controld' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/cluster/group' make: *** [all] Error 2 I will try to resolve it tonight. But if you would like to help, please! :) Thank you again! Jeronimo Em Qua, 2006-09-20 ?s 15:38 -0500, Robert Peterson escreveu: > Jeronimo Bezerra wrote: > > Hello All. > > > > I'm having a big trouble here to compile the gfs on Debian. I downloaded > > from CVS, and did the follow: > > > > cd /usr/src > > ln -s linux-source-2.6.16 linux-2.6 > > cd cluster > > ./configure > > make > > > > After that, the make command returns the follow: > > > > make[2]: Entering directory `/usr/src/cluster/group/daemon > > (...) > > gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o > > joinleave.o joinleave.c > > joinleave.c: In function `do_leave': > > joinleave.c:129: warning: long long unsigned int format, uint64_t arg > > (arg 7) > > joinleave.c:136: warning: long long unsigned int format, uint64_t arg > > (arg 7) > > gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o main.o > > main.c > > main.c: In function `app_deliver': > > main.c:180: warning: int format, different type arg (arg 6) > > gcc -L//usr/src/cluster/group/../cman/lib -L//usr/lib64/openais > > -L//usr/lib64 -o groupd app.o cpg.o cman.o joinleave.o main.o -lcman > > -lcpg > > /usr/bin/ld: cannot find -lcpg > > collect2: ld returned 1 exit status > > make[2]: *** [groupd] Error 1 > > make[2]: Leaving directory `/usr/src/cluster/group/daemon' > > make[1]: *** [all] Error 2 > > make[1]: Leaving directory `/usr/src/cluster/group' > > make: *** [all] Error 2 > > > > In the directory /usr/src/cluster/group/daemon : > > > > app.c app.o cman.c cman.o cpg.c cpg.o CVS gd_internal.h groupd.h > > joinleave.c joinleave.o main.c main.o Makefile > > > > ie, the gcc loads the cman (-lcman) but doesn't with cgp (-lcpg). > > > > Why this happen? Is there a best howto to install gfs on Debian? or > > another way?? > > > > Thanks a lot > > > > Jeronimo Bezerra > Hi Jeronimo, > > libcpg is part of openais, so I suspect you're missing openais. 
> Here is a usage.txt file that is a good resource for building the > development > tree of cluster/GFS from source. > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/doc/usage.txt?cvsroot=cluster > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rpeterso at redhat.com Wed Sep 20 21:07:10 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 20 Sep 2006 16:07:10 -0500 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <1158785985.31019.5.camel@localhost.localdomain> References: <1158783631.28886.8.camel@localhost.localdomain> <4511A6CC.4080301@redhat.com> <1158785985.31019.5.camel@localhost.localdomain> Message-ID: <4511AD7E.6020302@redhat.com> Jeronimo Bezerra wrote: > Hello Bob, thanks :) > > I installed openais but I didn't see that was > in /usr/local/usr/include/openais, and in the Debian the default > location is /usr/include. I fix it. After that, I received another > error: > > make[2]: Leaving directory `/usr/src/cluster/group/tool' > make -C dlm_controld all > make[2]: Entering directory `/usr/src/cluster/group/dlm_controld' > gcc -Wall -g -I//usr/include -I../config -idirafter /include/linux > -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c > -o main.o main.c > main.c: In function `setup_uevent': > main.c:183: error: `NETLINK_KOBJECT_UEVENT' undeclared (first use in > this function) > main.c:183: error: (Each undeclared identifier is reported only once > main.c:183: error: for each function it appears in.) > make[2]: *** [main.o] Error 1 > make[2]: Leaving directory `/usr/src/cluster/group/dlm_controld' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/usr/src/cluster/group' > make: *** [all] Error 2 > > I will try to resolve it tonight. But if you would like to help, > please! :) > > Thank you again! > > Jeronimo > Sounds like you're not picking up netlink.h. Regards, Bob Peterson Red Hat Cluster Suite From sdake at redhat.com Wed Sep 20 23:22:16 2006 From: sdake at redhat.com (Steven Dake) Date: Wed, 20 Sep 2006 16:22:16 -0700 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <4511AD7E.6020302@redhat.com> References: <1158783631.28886.8.camel@localhost.localdomain> <4511A6CC.4080301@redhat.com> <1158785985.31019.5.camel@localhost.localdomain> <4511AD7E.6020302@redhat.com> Message-ID: <1158794536.20300.7.camel@shih.broked.org> Jeronimo, I suspect you have old kernel include headers which did not support the uevent mechanism. For example on my 2.6.9 kernel I am using with 2.6.9 include headers, there is no support for uevents. You can work around this problem by defining NETLINK_KOBJECT_UEVENT to be whatever value is in your kernel (found in include/linux/netlink.h in your new kernel sources you are installing) in main.c. Alternatively you could upgrade your kernel include headers. You didn't state which version of debian you are using, but updating the kernel headers could cause problems, so I'd stick with the workaround above. You should also verify your kernel supports uevents. This can be done by checking for a copy of the file lib/kobject_uevent.c within the kernel source tree. It also needs to be enabled in the kernel configuration options. I would expect debian unstable and also FC6 to support these features. 
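Concretely, something like the following should show the value to use for the define and whether the uevent code is present at all -- the path assumes the 2.6.16 source tree mentioned earlier in the thread, substitute your own:

# the value to put in the NETLINK_KOBJECT_UEVENT define in group/dlm_controld/main.c
grep NETLINK_KOBJECT_UEVENT /usr/src/linux-source-2.6.16/include/linux/netlink.h

# the uevent support mentioned above lives here
ls /usr/src/linux-source-2.6.16/lib/kobject_uevent.c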
Regards -steve On Wed, 2006-09-20 at 16:07 -0500, Robert Peterson wrote: > Jeronimo Bezerra wrote: > > Hello Bob, thanks :) > > > > I installed openais but I didn't see that was > > in /usr/local/usr/include/openais, and in the Debian the default > > location is /usr/include. I fix it. After that, I received another > > error: > > > > make[2]: Leaving directory `/usr/src/cluster/group/tool' > > make -C dlm_controld all > > make[2]: Entering directory `/usr/src/cluster/group/dlm_controld' > > gcc -Wall -g -I//usr/include -I../config -idirafter /include/linux > > -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c > > -o main.o main.c > > main.c: In function `setup_uevent': > > main.c:183: error: `NETLINK_KOBJECT_UEVENT' undeclared (first use in > > this function) > > main.c:183: error: (Each undeclared identifier is reported only once > > main.c:183: error: for each function it appears in.) > > make[2]: *** [main.o] Error 1 > > make[2]: Leaving directory `/usr/src/cluster/group/dlm_controld' > > make[1]: *** [all] Error 2 > > make[1]: Leaving directory `/usr/src/cluster/group' > > make: *** [all] Error 2 > > > > I will try to resolve it tonight. But if you would like to help, > > please! :) > > > > Thank you again! > > > > Jeronimo > > > Sounds like you're not picking up netlink.h. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rodgersr at yahoo.com Wed Sep 20 23:45:21 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Wed, 20 Sep 2006 16:45:21 -0700 (PDT) Subject: [Linux-cluster] Clurmtabd needs to be run manually? Message-ID: <20060920234521.37344.qmail@web34214.mail.mud.yahoo.com> When I start clumanager I notice that clumrmtabd does not start up. The man page says the service manager daemon will automatically start is for each mount point. But this does not seem to happen. I can start it manually. Does anyone know if this is the expected behavior? Is it accepatable to manually start it? Below is my cluster.xml file. -------------- next part -------------- An HTML attachment was scrubbed... URL: From RMoody at mweb.com Thu Sep 21 08:54:17 2006 From: RMoody at mweb.com (Robert Moody - MWEB) Date: Thu, 21 Sep 2006 10:54:17 +0200 Subject: [Linux-cluster] Testing a ipmi fence. Message-ID: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com> Hi all, I have configured this before but now that I want to show someone how it is working do you think it will work. Anyway here is my problem. I have 3 dell 2850's with an onboard ipmi interface on. I have configured these to work on the lan on a private network. Now there was a command that I have used before to manually fence a node to test if the fencing is working. For the life of me I can not remember what it was. Anyone done this recently? Thanks, Robert. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jab at ufba.br Thu Sep 21 11:04:38 2006 From: jab at ufba.br (Jeronimo Bezerra) Date: Thu, 21 Sep 2006 08:04:38 -0300 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <1158794536.20300.7.camel@shih.broked.org> References: <1158783631.28886.8.camel@localhost.localdomain> <4511A6CC.4080301@redhat.com> <1158785985.31019.5.camel@localhost.localdomain> <4511AD7E.6020302@redhat.com> <1158794536.20300.7.camel@shih.broked.org> Message-ID: <1158836678.31019.10.camel@localhost.localdomain> Thanks Steve! I'll try to upgrade my kernel-headers. 
This box is just for tests. My debian is 3.1 Sarge, and my kernel is 2.6.16. Thanks again Jeronimo Em Qua, 2006-09-20 ?s 16:22 -0700, Steven Dake escreveu: > Jeronimo, > > I suspect you have old kernel include headers which did not support the > uevent mechanism. For example on my 2.6.9 kernel I am using with 2.6.9 > include headers, there is no support for uevents. You can work around > this problem by defining NETLINK_KOBJECT_UEVENT to be whatever value is > in your kernel (found in include/linux/netlink.h in your new kernel > sources you are installing) in main.c. Alternatively you could upgrade > your kernel include headers. You didn't state which version of debian > you are using, but updating the kernel headers could cause problems, so > I'd stick with the workaround above. > > You should also verify your kernel supports uevents. This can be done > by checking for a copy of the file lib/kobject_uevent.c within the > kernel source tree. It also needs to be enabled in the kernel > configuration options. I would expect debian unstable and also FC6 to > support these features. > > Regards > -steve > > On Wed, 2006-09-20 at 16:07 -0500, Robert Peterson wrote: > > Jeronimo Bezerra wrote: > > > Hello Bob, thanks :) > > > > > > I installed openais but I didn't see that was > > > in /usr/local/usr/include/openais, and in the Debian the default > > > location is /usr/include. I fix it. After that, I received another > > > error: > > > > > > make[2]: Leaving directory `/usr/src/cluster/group/tool' > > > make -C dlm_controld all > > > make[2]: Entering directory `/usr/src/cluster/group/dlm_controld' > > > gcc -Wall -g -I//usr/include -I../config -idirafter /include/linux > > > -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c > > > -o main.o main.c > > > main.c: In function `setup_uevent': > > > main.c:183: error: `NETLINK_KOBJECT_UEVENT' undeclared (first use in > > > this function) > > > main.c:183: error: (Each undeclared identifier is reported only once > > > main.c:183: error: for each function it appears in.) > > > make[2]: *** [main.o] Error 1 > > > make[2]: Leaving directory `/usr/src/cluster/group/dlm_controld' > > > make[1]: *** [all] Error 2 > > > make[1]: Leaving directory `/usr/src/cluster/group' > > > make: *** [all] Error 2 > > > > > > I will try to resolve it tonight. But if you would like to help, > > > please! :) > > > > > > Thank you again! > > > > > > Jeronimo > > > > > Sounds like you're not picking up netlink.h. > > > > Regards, > > > > Bob Peterson > > Red Hat Cluster Suite > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From carlopmart at gmail.com Thu Sep 21 11:12:53 2006 From: carlopmart at gmail.com (carlopmart) Date: Thu, 21 Sep 2006 13:12:53 +0200 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <1158776775.7388.29.camel@rei.boston.devel.redhat.com> References: <4510FA68.2000403@gmail.com> <1158776775.7388.29.camel@rei.boston.devel.redhat.com> Message-ID: <451273B5.3090804@gmail.com> Lon Hohberger wrote: > On Wed, 2006-09-20 at 10:23 +0200, carlopmart wrote: >> Hi all, >> >> Sorry for this toppic, but i have serious doubts about using cluster >> suite under some deployments. My questions: >> >> a) How can I configure status check on a service script? 
for exmaple: >> I have two nodes with CS U4 with postfix service running on two nodes >> and using DLM as a lock manager. If I stop postfix from the script and I >> wait status check, nothing happens and rgmanager returns an ok for the >> service, but this service is stopped !!!. > > The status check in the postfix script must return nonzero if the > service is stopped. Lon, I use original's postfix script and returns this if postfix is up: "master (pid 957) is running..." when postfix isn't up, script returns: "master is stopped". Do I need to change this message to "0" for status check works ok?? > >> b) is it posible to startup only one node on a two-node cluster? i >> have tested this feature, but this node doesn't startup ( i am using iLO >> as a fencing method, but I have tested gnbd too and the result is the same). > > Yes, but the other node must be fenced first. Then, I can't startup only one node, when both are stopped, right?? > >> c) why relocate service doesn't works?? I have attached my config. For >> example, if I reboot one node, all services go to the second. This is >> ok, but when this primary node is up, services continue getting up in >> the previous node and they don't migrate towards the other node. > > Kill the 'exclusive' attribute. It doesn't do what you think it does > (and is probably the source of your problem). Thanks, now works ok. > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- CL Martinez carlopmart {at} gmail {d0t} com From gforte at leopard.us.udel.edu Thu Sep 21 12:11:53 2006 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Thu, 21 Sep 2006 08:11:53 -0400 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <451273B5.3090804@gmail.com> References: <4510FA68.2000403@gmail.com> <1158776775.7388.29.camel@rei.boston.devel.redhat.com> <451273B5.3090804@gmail.com> Message-ID: <45128189.3060108@leopard.us.udel.edu> > Lon, I use original's postfix script and returns this if postfix is up: > "master (pid 957) is running..." when postfix isn't up, script returns: > "master is stopped". Do I need to change this message to "0" for status > check works ok?? That's just a message that's printed. return status is the value given in a statement of the form 'return X', or 0 if no such statement is explicitly reached. All executables return a status value to the shell, where 0 is taken to mean "OK", and non-zero means "something bad happened". The postfix script appears to return the correct values in each case. My guess would be that it's cluster configuration problem, but I didn't see anything about postfix in the conf that you pasted ... > Then, I can't startup only one node, when both are stopped, right?? No, you definitely can do this, if the cluster is configured correctly. The problem may be in your fencing method - the first thing the booted node will do when cman starts is to try to contact the other node. When it times out, it'll try to fence the other node and won't continue until it does. If the fence process fails, it'll hang there, which I'm guessing is what you're seeing. So the problem is most likely that fencing is failing, either due to misconfiguration or because the other node is powered off and so its iLo agent isn't responding. 
Since iLo is supposed to be able to power-up a switched-off server, my guess is there's a problem with your fencing configuration - did you fix it so that you have a separate fencedevice entry for each node? -g From Alain.Moulle at bull.net Thu Sep 21 13:23:42 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Thu, 21 Sep 2006 15:23:42 +0200 Subject: [Linux-cluster] CS4 U2 / clustat not responding Message-ID: <4512925E.1090007@bull.net> Hi Could you give me all patches (or defect) numbers available to avoid clustat stalled on CS4 U2 ? Thanks Alain From f.hackenberger at mediatransfer.com Thu Sep 21 13:57:59 2006 From: f.hackenberger at mediatransfer.com (Falk Hackenberger - MediaTransfer AG Netresearch & Consulting) Date: Thu, 21 Sep 2006 15:57:59 +0200 Subject: [Linux-cluster] search experiences RedHat CS and lvm2 snapshots on both nodes Message-ID: <45129A67.6080007@mediatransfer.com> Hello all, I have running 2 nodes (active-pasive) on one san. Because the san have no snapshot functionality I use lvm2 snapshots. the disks on the san are one Volume group with many Logical volumes. have you experiences with setups wich are: service1 runs on node1 an need Logical volume1 service2 runs on node2 an need Logical volume2 it is posible to say in such a setup snapshot1 on on node1 on Logical volume1 snapshot2 on on node2 on Logical volume2 remember both Logical volumes are on one Volume group. experiences, recommendations? thanks, falk From lhh at redhat.com Thu Sep 21 14:07:20 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 21 Sep 2006 10:07:20 -0400 Subject: [Linux-cluster] Testing a ipmi fence. In-Reply-To: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com> References: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com> Message-ID: <1158847640.7388.83.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-21 at 10:54 +0200, Robert Moody - MWEB wrote: > Hi all, > > I have configured this before but now that I want to show someone how > it is working do you think it will work. > > Anyway here is my problem. I have 3 dell 2850's with an onboard ipmi > interface on. I have configured these to work on the lan on a private > network. > > Now there was a command that I have used before to manually fence a > node to test if the fencing is working. For the life of me I can not > remember what it was. > > Anyone done this recently? fence_node -n ? From lhh at redhat.com Thu Sep 21 14:11:03 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 21 Sep 2006 10:11:03 -0400 Subject: [Linux-cluster] Testing a ipmi fence. In-Reply-To: <1158847640.7388.83.camel@rei.boston.devel.redhat.com> References: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com> <1158847640.7388.83.camel@rei.boston.devel.redhat.com> Message-ID: <1158847863.7388.88.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-21 at 10:07 -0400, Lon Hohberger wrote: > On Thu, 2006-09-21 at 10:54 +0200, Robert Moody - MWEB wrote: > > Hi all, > > > > I have configured this before but now that I want to show someone how > > it is working do you think it will work. > > > > Anyway here is my problem. I have 3 dell 2850's with an onboard ipmi > > interface on. I have configured these to work on the lan on a private > > network. > > > > Now there was a command that I have used before to manually fence a > > node to test if the fencing is working. For the life of me I can not > > remember what it was. > > > > Anyone done this recently? 
> > fence_node -n er, fence_node -- Lon From lhh at redhat.com Thu Sep 21 14:18:36 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 21 Sep 2006 10:18:36 -0400 Subject: [Linux-cluster] Clurmtabd needs to be run manually? In-Reply-To: <20060920234521.37344.qmail@web34214.mail.mud.yahoo.com> References: <20060920234521.37344.qmail@web34214.mail.mud.yahoo.com> Message-ID: <1158848316.7388.96.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-20 at 16:45 -0700, Rick Rodgers wrote: > When I start clumanager I notice that clumrmtabd does not start up. > The man > page says the service manager daemon will automatically start is for > each mount > point. But this does not seem to happen. > > I can start it manually. Does anyone know if this is the expected > behavior? Is it accepatable to manually start it? Below is my > cluster.xml file. > > name="service-core" userscript="/etc/init.d/service-core"> > > ipaddress="10.20.70.104" netmask="255.255.255.0"/> > > > options=""/> > > > It looks like there is no clumanager-managed NFS component to the service, which is why it's not being started. If you're not running NFS, then you don't need clurmtabd. You can start it manually, or you can tweak the scripts for your /service mountpoint and make it start if you need to. The easiest thing to do is just add a dummy export entry to the service. -- Lon From RMoody at mweb.com Thu Sep 21 14:54:20 2006 From: RMoody at mweb.com (Robert Moody - MWEB) Date: Thu, 21 Sep 2006 16:54:20 +0200 Subject: [Linux-cluster] Testing a ipmi fence. References: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com><1158847640.7388.83.camel@rei.boston.devel.redhat.com> <1158847863.7388.88.camel@rei.boston.devel.redhat.com> Message-ID: <6586D1F97DDEDE408BEEF44402F3797808446D@mwmx4.mweb.com> Ok I get the barney award. The one command I did not write down and document cause duh it was so easy is the one that I forget. Thanks guys I feel really clever right now. ;-) (Very sheepishly looks at the ground....) -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Lon Hohberger Sent: Thu 9/21/2006 4:11 PM To: linux clustering Subject: Re: [Linux-cluster] Testing a ipmi fence. On Thu, 2006-09-21 at 10:07 -0400, Lon Hohberger wrote: > On Thu, 2006-09-21 at 10:54 +0200, Robert Moody - MWEB wrote: > > Hi all, > > > > I have configured this before but now that I want to show someone how > > it is working do you think it will work. > > > > Anyway here is my problem. I have 3 dell 2850's with an onboard ipmi > > interface on. I have configured these to work on the lan on a private > > network. > > > > Now there was a command that I have used before to manually fence a > > node to test if the fencing is working. For the life of me I can not > > remember what it was. > > > > Anyone done this recently? > > fence_node -n er, fence_node -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Thu Sep 21 14:57:56 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 21 Sep 2006 10:57:56 -0400 Subject: [Linux-cluster] CS4 U2 / clustat not responding In-Reply-To: <4512925E.1090007@bull.net> References: <4512925E.1090007@bull.net> Message-ID: <1158850676.7388.104.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-21 at 15:23 +0200, Alain Moulle wrote: > Hi > Could you give me all patches (or defect) numbers available to avoid > clustat stalled on CS4 U2 ? 
> Thanks > Alain Please look at errata notes for bugs fixed. Especially look at magma, magma-plugins, and rgmanager bugzillas. Some patches may be incremental (and therefore may not apply to U2). Several bugs may cause this symptom. Here are the errata: https://rhn.redhat.com/errata/RHBA-2006-0557.html https://rhn.redhat.com/errata/RHBA-2006-0552.html https://rhn.redhat.com/errata/RHBA-2006-0551.html https://rhn.redhat.com/errata/RHBA-2006-0241.html https://rhn.redhat.com/errata/RHBA-2006-0240.html https://rhn.redhat.com/errata/RHBA-2006-0239.html If you would like to fork U2, your quickest bet to get something working is to take a diff of the sources between U2 and U4 and pull out things like resource agent changes and such. You can run the U4 version of magma-plugins, magma, and rgmanager on the U2 infrastructure if this makes it your particular environment. That is, you can leave cman[-kernel], dlm[-kernel], ccsd, gfs, and the rest of the system at U2, and just upgrade magma, magma-plugins, and rgmanager (though, you can safely update ccsd too) if you want, and it will probably save you time and effort. -- Lon From dist-list at LEXUM.UMontreal.CA Thu Sep 21 16:03:13 2006 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Thu, 21 Sep 2006 12:03:13 -0400 Subject: [Linux-cluster] Throttling HTTP traffic when using director ? Message-ID: <4512B7C1.5010109@lexum.umontreal.ca> Hello, Our setup : 1 director in front of 4 web servers. (redhat cluster suite). All servers are behind Pix Firewall We need a way to stop abusing users (based on a download limit per day for ex.). The prob is that we need to do it on the director level (so we cannot use apache2 modules) and it has to be dynamic based on the bandwidth usage of the client. Should I use iptable's traffic shaping capabilities for that ? Do you have any advice for this particular situation. Thanks !! F From jab at ufba.br Thu Sep 21 17:25:13 2006 From: jab at ufba.br (Jeronimo Bezerra) Date: Thu, 21 Sep 2006 14:25:13 -0300 Subject: [Linux-cluster] Troubles to install GFS on Debian] Message-ID: <1158859513.31019.37.camel@localhost.localdomain> I forget to send to list :). So, where can I find the lock_dlm_plock.h? I already searched im my linux box and nothing. The openais is installed too. Thank Jeronimo -------------- next part -------------- An embedded message was scrubbed... From: Jeronimo Bezerra Subject: Re: [Linux-cluster] Troubles to install GFS on Debian Date: Thu, 21 Sep 2006 10:35:48 -0300 Size: 4002 URL: From teigland at redhat.com Thu Sep 21 18:03:34 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 21 Sep 2006 13:03:34 -0500 Subject: [Linux-cluster] Troubles to install GFS on Debian] In-Reply-To: <1158859513.31019.37.camel@localhost.localdomain> References: <1158859513.31019.37.camel@localhost.localdomain> Message-ID: <20060921180334.GA24022@redhat.com> On Thu, Sep 21, 2006 at 02:25:13PM -0300, Jeronimo Bezerra wrote: > I forget to send to list :). > > So, where can I find the lock_dlm_plock.h? > > I already searched im my linux box and nothing. 
Code in cvs head needs the kernel in this git tree: git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6.git Dave From Zelikov_Mikhail at emc.com Thu Sep 21 19:12:56 2006 From: Zelikov_Mikhail at emc.com (Zelikov_Mikhail at emc.com) Date: Thu, 21 Sep 2006 15:12:56 -0400 Subject: [Linux-cluster] Unable to lock any resource Message-ID: <9B2FEC4CE7E80B4A965F1D9ADF22B17304A65B92@CORPUSMX40B.corp.emc.com> I am debugging a program that uses DLM (lock_resource()) to lock a resource. If I kill the process within GDB and leave it running for a long time (for example overnight), I am not longer able to lock any resources. I obviously killed gdb and verified that I have no leftovers. To verify that it is not just my resource that I can not lock I use: dlmtest from ...dlm/tests/usertests/ directory to lock any resource: [root at bof227 usertest]# ./dlmtest -m NL TEST locking TEST NL ... lock: Invalid argument The error code returned on the lock_resources is EINVAL (22). I can obviously fix this by rebooting the system, however it is a pain. I tried to fix it by restarting cman and clvmd services - no success. And I can not reload dlm kernel module as it is in use. The content of dlm_stats shows that there is the same number of locks as unlocks: [root at bof227 usertest]# cat /proc/cluster/dlm_stats DLM stats (HZ=1000) Lock operations: 21 Unlock operations: 21 Convert operations: 0 Completion ASTs: 42 Blocking ASTs: 0 Lockqueue num waittime ave WAIT_RSB 19 8 0 Total 19 8 0 I was wondering if anybody could provide an insight on this. I was also wondering if there is a better way to deal with this than just rebooting the system. Thanks, Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Blank Bkgrd.gif Type: image/gif Size: 145 bytes Desc: Blank Bkgrd.gif URL: From phung at cs.columbia.edu Thu Sep 21 22:54:46 2006 From: phung at cs.columbia.edu (Dan B. Phung) Date: Thu, 21 Sep 2006 18:54:46 -0400 Subject: [Linux-cluster] kernel oops on mount and sendmsg failed: -22 Message-ID: <45131836.7010802@ncl.cs.columbia.edu> I have a two node cluster, one node (node A) runs linux kernel 2.6.11.12 while the other (node B) runs 2.6.18. both are running cman_tool version 5.0.1. I first start up node A, then node B joins. node A can mount the GFS file systems, but when node B tries that, it gets a kernel oops, which is pasted at the end of the email (see "KERNEL OOPS output"). So I reboot node B and try to rejoin, but it seems to not be able to communicate with node A correctly, as if the cluster is in some stale state (see "node B rejoin kernel messages"). Upon viewing node A, it seemed to have received the join message, but it looks like it didn't send an ack or something, and then node A simply quits...(see "node A kernel messages"). I think the problem lies in my use of two different cluster software versions (even though --version doesn't say so), but the newest -rSTABLE doesn't compile with 2.6.11.12 anymore. What is the recommended solution for a cluster that must run different kernel versions? 
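(For what it's worth, the userland tools reporting the same version doesn't say much about the kernel side; the quickest comparison I know of is to look at what the cman kernel module itself reports on each node, e.g.

cat /proc/cluster/status | head -3
cat /proc/cluster/nodes

and check whether the protocol versions and member lists agree.)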
tia, dan --- BUG: unable to handle kernel NULL pointer dereference at virtual address 0000001c printing eip: c01825e6 *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: lock_dlm dlm gfs lock_harness cman qla2xxx firmware_class scsi_transport_fc ppdev parport_pc lp parport sg sd_mod scsi_mod ide_generic ide_cd cdrom evdev i2c_piix4 psmouse i2c_core serio_raw sworks_agp agpgart rtc pcspkr ext3 jbd mbcache dm_mirror dm_snapshot dm_mod ide_disk serverworks generic ohci_hcd ide_core usbcore tg3 thermal processor fan unix CPU: 2 EIP: 0060:[] Tainted: GF VLI EFLAGS: 00010293 (2.6.18 #1) EIP is at do_add_mount+0x66/0x130 eax: 0000000c ebx: f3843f24 ecx: c24fbac0 edx: f443f550 esi: df907200 edi: 00000000 ebp: 00000000 esp: f3843df4 ds: 007b es: 007b ss: 0068 Process mount (pid: 14922, ti=f3842000 task=f443f550 task.ti=f3842000) Stack: c0394388 00000000 00000000 f49a1000 f3843f24 00000000 c018321d df907200 f3843f24 00000000 00000000 f49a1000 df907200 c033a5c0 fffffffe 00000000 c0175080 c24fbac0 f3843ef8 00000050 f4998000 dfb98c40 c24fbac0 df98330c Call Trace: [] do_mount+0x33d/0x760 [] link_path_walk+0x80/0x100 [] __handle_mm_fault+0x233/0x980 [] __handle_mm_fault+0x4d6/0x980 [] __alloc_pages+0x4f/0x2f0 [] __get_free_pages+0x2d/0x40 [] copy_mount_options+0x47/0x130 [] sys_mount+0x9d/0xe0 [] syscall_call+0x7/0xb Code: e4 89 e0 8b 4b 04 25 00 e0 ff ff 8b 10 8b 41 64 3b 82 58 04 00 00 0f 85 a1 00 00 00 8b 41 14 3b 46 14 0f 84 ac 00 00 00 8b 46 10 <8b> 40 10 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 55 8b 44 EIP: [] do_add_mount+0x66/0x130 SS:ESP 0068:f3843df4 CMAN: Waiting to join or form a Linux-cluster CMAN: sending membership request (message repeated 30 times) CMAN: Been in JOINWAIT for too long - giving up CMAN: sendmsg failed: -22 CMAN: node blade14 rejoining CMAN: too many transition restarts - will die CMAN: we are leaving the cluster. Inconsistent cluster view From pcaulfie at redhat.com Fri Sep 22 07:36:16 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 22 Sep 2006 08:36:16 +0100 Subject: [Linux-cluster] kernel oops on mount and sendmsg failed: -22 In-Reply-To: <45131836.7010802@ncl.cs.columbia.edu> References: <45131836.7010802@ncl.cs.columbia.edu> Message-ID: <45139270.3040401@redhat.com> Dan B. Phung wrote: > I have a two node cluster, one node (node A) runs linux kernel 2.6.11.12 > while the other (node B) runs 2.6.18. both are running cman_tool > version 5.0.1. I first start up node A, then node B joins. node A can > mount the GFS file systems, but when node B tries that, it gets a kernel > oops, which is pasted at the end of the email (see "KERNEL OOPS output"). > So I reboot node B and try to rejoin, but it seems to not be able to > communicate with node A correctly, as if the cluster is in some stale > state (see "node B rejoin kernel messages"). Upon viewing node A, it > seemed to have received the join message, but it looks like it didn't > send an ack or something, and then node A simply quits...(see "node A > kernel messages"). > > I think the problem lies in my use of two different cluster software > versions (even though --version doesn't say so), but the newest -rSTABLE > doesn't compile with 2.6.11.12 anymore. What is the recommended > solution for a cluster that must run different kernel versions? 
> > tia, > dan > > --- > > > > BUG: unable to handle kernel NULL pointer dereference at virtual > address 0000001c > printing eip: > c01825e6 > *pde = 00000000 > Oops: 0000 [#1] > PREEMPT SMP > Modules linked in: lock_dlm dlm gfs lock_harness cman qla2xxx > firmware_class scsi_transport_fc ppdev parport_pc lp parport sg sd_mod > scsi_mod ide_generic ide_cd cdrom evdev i2c_piix4 psmouse i2c_core > serio_raw sworks_agp agpgart rtc pcspkr ext3 jbd mbcache dm_mirror > dm_snapshot dm_mod ide_disk serverworks generic ohci_hcd ide_core > usbcore tg3 thermal processor fan unix > CPU: 2 > EIP: 0060:[] Tainted: GF VLI > EFLAGS: 00010293 (2.6.18 #1) > EIP is at do_add_mount+0x66/0x130 > eax: 0000000c ebx: f3843f24 ecx: c24fbac0 edx: f443f550 > esi: df907200 edi: 00000000 ebp: 00000000 esp: f3843df4 > ds: 007b es: 007b ss: 0068 > Process mount (pid: 14922, ti=f3842000 task=f443f550 task.ti=f3842000) > Stack: c0394388 00000000 00000000 f49a1000 f3843f24 00000000 c018321d > df907200 > f3843f24 00000000 00000000 f49a1000 df907200 c033a5c0 fffffffe > 00000000 > c0175080 c24fbac0 f3843ef8 00000050 f4998000 dfb98c40 c24fbac0 > df98330c > Call Trace: > [] do_mount+0x33d/0x760 > [] link_path_walk+0x80/0x100 > [] __handle_mm_fault+0x233/0x980 > [] __handle_mm_fault+0x4d6/0x980 > [] __alloc_pages+0x4f/0x2f0 > [] __get_free_pages+0x2d/0x40 > [] copy_mount_options+0x47/0x130 > [] sys_mount+0x9d/0xe0 > [] syscall_call+0x7/0xb > Code: e4 89 e0 8b 4b 04 25 00 e0 ff ff 8b 10 8b 41 64 3b 82 58 04 00 > 00 0f 85 a1 00 00 00 8b 41 14 3b 46 14 0f 84 ac 00 00 00 8b 46 10 <8b> > 40 10 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 55 8b 44 > EIP: [] do_add_mount+0x66/0x130 SS:ESP 0068:f3843df4 > > > CMAN: Waiting to join or form a Linux-cluster > CMAN: sending membership request (message repeated 30 times) > CMAN: Been in JOINWAIT for too long - giving up > CMAN: sendmsg failed: -22 > > > CMAN: node blade14 rejoining > CMAN: too many transition restarts - will die > CMAN: we are leaving the cluster. Inconsistent cluster view That's a known bug. Upgrade the kernel component of cman. -- patrick From carlopmart at gmail.com Fri Sep 22 08:07:17 2006 From: carlopmart at gmail.com (carlopmart) Date: Fri, 22 Sep 2006 10:07:17 +0200 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <45128189.3060108@leopard.us.udel.edu> References: <4510FA68.2000403@gmail.com> <1158776775.7388.29.camel@rei.boston.devel.redhat.com> <451273B5.3090804@gmail.com> <45128189.3060108@leopard.us.udel.edu> Message-ID: <451399B5.20409@gmail.com> Greg Forte wrote: >> Lon, I use original's postfix script and returns this if postfix is >> up: "master (pid 957) is running..." when postfix isn't up, script >> returns: "master is stopped". Do I need to change this message to "0" >> for status check works ok?? > > That's just a message that's printed. return status is the value given > in a statement of the form 'return X', or 0 if no such statement is > explicitly reached. All executables return a status value to the shell, > where 0 is taken to mean "OK", and non-zero means "something bad happened". > > The postfix script appears to return the correct values in each case. > My guess would be that it's cluster configuration problem, but I didn't > see anything about postfix in the conf that you pasted ... Lon, I have attached postfix script > >> Then, I can't startup only one node, when both are stopped, right?? > > No, you definitely can do this, if the cluster is configured correctly. 
> The problem may be in your fencing method - the first thing the booted > node will do when cman starts is to try to contact the other node. When > it times out, it'll try to fence the other node and won't continue until > it does. If the fence process fails, it'll hang there, which I'm > guessing is what you're seeing. So the problem is most likely that > fencing is failing, either due to misconfiguration or because the other > node is powered off and so its iLo agent isn't responding. Since iLo is > supposed to be able to power-up a switched-off server, my guess is > there's a problem with your fencing configuration - did you fix it so > that you have a separate fencedevice entry for each node? I have changed iLO fence for gnbd fence. But can not boot only one node. > > -g > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- CL Martinez carlopmart {at} gmail {d0t} com -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: imss-mta URL: From sandra-llistes at fib.upc.edu Fri Sep 22 10:23:18 2006 From: sandra-llistes at fib.upc.edu (sandra-llistes) Date: Fri, 22 Sep 2006 12:23:18 +0200 Subject: [Linux-cluster] GFS and samba problem Message-ID: <4513B996.8050804@fib.upc.edu> Hello, We have two Fedora 5 Servers clustered with GFS. We installed samba and exported the same shares in both of them. All went fine at first, with people accessing to theirs own files and so, but for some programs (minitab, matlab, ...) people need to access the same file at once. Then samba begins to fail and clients hang. In order to fix samba is necessary to restart the service. We've tried to put the shares in a filesystem without GFS and all goes well, people can access the same file without problems simultaneously. Is a weird behaviour because the shares are exported from the two servers, but we really only access files simoultaneuosly using the first server (the second is used for other linux clients than don't access the same shares), the other server exports the shares too but isn't used by that clients. I don't know how to debug this problem to see what is happening. It seems something related to GFS and Samba. I have seen mails of people with samba+GFS problems, but we aren't using the same configuration, and the GFS rpm are updated: GFS-6.1.5-0.FC5.1 GFS-kernel-2.6.15.1-5.FC5.32 Any help will be greatly apreciated. Thanks, Sandra From peter.huesser at psi.ch Fri Sep 22 15:42:55 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 22 Sep 2006 17:42:55 +0200 Subject: [Linux-cluster] Cannot restart service after "failed" state Message-ID: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch> Hello I have defined a web-services (for testing it contains an IP and two script resources). I sometimes happens that I produce failed state of the cluster. After this I am not able to restart the service anymore. Even after a reboot of all (two) clustermembers it is not possible. Do I have to remove by hand some kind of "lock" file. Greetings Pedro -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jstoner at opsource.net Fri Sep 22 16:27:07 2006 From: jstoner at opsource.net (Jeff Stoner) Date: Fri, 22 Sep 2006 17:27:07 +0100 Subject: [Linux-cluster] Cannot restart service after "failed" state Message-ID: <38A48FA2F0103444906AD22E14F1B5A3042C6E19@mailxchg01.corp.opsource.net> Check for errors in the logs files for the service itself (you didn't say exactly what it is) and in /var/log/message for Cluster-related messages for more specific information about why it won't start. We can't help very much without knowing what is wrong. --Jeff SME - UNIX OpSource Inc. PGP Key ID 0x6CB364CA ________________________________ I have defined a web-services (for testing it contains an IP and two script resources). I sometimes happens that I produce failed state of the cluster. After this I am not able to restart the service anymore. Even after a reboot of all (two) clustermembers it is not possible. Do I have to remove by hand some kind of "lock" file. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Fri Sep 22 20:28:43 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Fri, 22 Sep 2006 13:28:43 -0700 (PDT) Subject: [Linux-cluster] Disk tie breaker -how does it work? Message-ID: <20060922202843.42656.qmail@web34205.mail.mud.yahoo.com> Does anyone know much about the details of how a disk tiebreaker works in a two member node? Or any docs to point to? --------------------------------- Yahoo! Messenger with Voice. Make PC-to-Phone Calls to the US (and 30+ countries) for 2?/min or less. -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Fri Sep 22 20:43:59 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 22 Sep 2006 22:43:59 +0200 Subject: [Linux-cluster] Cannot restart service after "failed" state In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A3042C6E19@mailxchg01.corp.opsource.net> Message-ID: <8E2924888511274B95014C2DD906E58AD1A316@MAILBOX0A.psi.ch> You were right with the check of the error log. I should have read it more carefully before writing to the group. The problem was with one of the scripts. What I was curious was that after a restart of both of the serves I had the same problem again. But I have to reformulate my problem and want to start a new thread. Thanks Pedro ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Stoner Sent: Freitag, 22. September 2006 18:27 To: linux clustering Subject: RE: [Linux-cluster] Cannot restart service after "failed" state Check for errors in the logs files for the service itself (you didn't say exactly what it is) and in /var/log/message for Cluster-related messages for more specific information about why it won't start. We can't help very much without knowing what is wrong. --Jeff SME - UNIX OpSource Inc. PGP Key ID 0x6CB364CA ________________________________ I have defined a web-services (for testing it contains an IP and two script resources). I sometimes happens that I produce failed state of the cluster. After this I am not able to restart the service anymore. Even after a reboot of all (two) clustermembers it is not possible. Do I have to remove by hand some kind of "lock" file. -------------- next part -------------- An HTML attachment was scrubbed... URL: From celso at webbertek.com.br Sat Sep 23 03:08:14 2006 From: celso at webbertek.com.br (Celso K. 
Webber) Date: Sat, 23 Sep 2006 00:08:14 -0300 Subject: [Linux-cluster] LVM and Multipath with EMC PowerPath (Was: CLVMD - Do I need it) In-Reply-To: <1158082871.988.4.camel@hydrogen.msp.redhat.com> References: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> <1158069859.3610.437.camel@rei.boston.devel.redhat.com> <1158082871.988.4.camel@hydrogen.msp.redhat.com> Message-ID: <4514A51E.1080507@webbertek.com.br> Hello all, After reading a thread on this list (CLVMD - Do I need it), I started playing around with CLVM, just to make sure two problems I had in the past were solved: 1) LVM normally cannot be used on shared disks, because the first server that "sees" the PVs will initialize them, and the other server will see the LVM objects as inactive. This is solved in LVM2 when used together with CLVM, right? I'm not pretty sure about the mecanics of CLVM, but I imagine it shares device UUIDs between the machines. So far, so good. 2) The other problem is not directly related to CLVM, but I found no solution for it (yet). In my setup, I have multiple paths to the same devices in the shared storage (either in a SAN or DAS). Under the EMC solution, we employ PowerPath to solve the multiple devices issue for each LUN. It works quite well. But LVM is not aware of PowerPath's multiple path aggregation, so when it scans the PVs on the LUN's partitions, it "finds" duplicates for the PVs, like this: [root at csumccaixa12 network-scripts]# pvscan Found duplicate PV 7v9XUzPHIRqe6E0fA6hgCR3ybeaJoiWm: using /dev/sdc1 not /dev/emcpowerb1 Found duplicate PV 3eKnMIm00kg6DXn4MW1UX9QCFh96ykwG: using /dev/emcpowerc1 not /dev/sdb1 Found duplicate PV 3T00PR5Ky1XrBesYHRtyowoBQLWDO1kd: using /dev/sdd1 not /dev/emcpowera1 Found duplicate PV 3eKnMIm00kg6DXn4MW1UX9QCFh96ykwG: using /dev/sde1 not /dev/emcpowerc1 Found duplicate PV 7v9XUzPHIRqe6E0fA6hgCR3ybeaJoiWm: using /dev/sdf1 not /dev/sdc1 Found duplicate PV 3T00PR5Ky1XrBesYHRtyowoBQLWDO1kd: using /dev/sdg1 not /dev/sdd1 PV /dev/sda3 VG vg0 lvm2 [59.81 GB / 37.75 GB free] PV /dev/sdg1 lvm2 [127.43 GB] PV /dev/sde1 lvm2 [127.43 GB] PV /dev/sdf1 lvm2 [127.43 GB] Total: 4 [442.10 GB] / in use: 1 [59.81 GB] / in no VG: 3 [382.29 GB] You can see above that the /dev/emcpowerX devices were declined in favor of the real Linux devices. "vg0" is a VG in the internal disks (/dev/sda). The problem I see here is that whenever the specific device that LVM2 chose goes down because of a link failure, LVM will not automatically failover to another device, will it? In my tests it didn't. Another matter is that using the /dev/emcpowerX devices I have also load balancing, so even if LVM2 did failover to the other paths (the other devices), I would loose the load balancing feature I can achieve with PowerPath. Question 1: did anyone solve this problem? Does device-mapper-multipath solve this problem? Question 2: is there a way to "force" which devices LVM should employ when scanning the PVs over the disks Linux recognize? Thank you all for any hints on this. Regards, Celso. -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. From isplist at logicore.net Sat Sep 23 03:14:25 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Fri, 22 Sep 2006 22:14:25 -0500 Subject: [Linux-cluster] Can't mount multiple GFS volumes? 
In-Reply-To: <20060915135516.GA17451@redhat.com> Message-ID: <2006922221425.220068@leena> Sorry about the delay. I trashed my whole setup since I bought new hardware. Once I have it all up and running, I'll see if I'm still having the same problems and post again. Thanks much Dave. Mike > Could you send the output of 'cman_tool services' from all nodes before > and after you try to mount? Thanks > Dave From ben.yarwood at juno.co.uk Sat Sep 23 11:09:07 2006 From: ben.yarwood at juno.co.uk (Ben Yarwood) Date: Sat, 23 Sep 2006 12:09:07 +0100 Subject: [Linux-cluster] LVM and Multipath with EMC PowerPath (Was: CLVMD -Do I need it) In-Reply-To: <4514A51E.1080507@webbertek.com.br> Message-ID: <007f01c6df00$b010e890$3964a8c0@WS076> Good document on emc powerlink site about setting up gfs6.1 and powerpath. https://powerlink.emc.com/nsepn/webapps/btg548664833igtcuup4826/km/live1/en_ US/Offering_Technical/Technical_Documentation/300-003-820_a01_elccnt_0.pdf?m tcs=ZXZlbnRUeXBlPUttQ2xpY2tTZWFyY2hSZXN1bHRzRXZlbnQsZG9jdW1lbnRJZD0wOTAxNDA2 NjgwMTg3YjFhLGRhdGFTb3VyY2U9RENUTV9lbl9VU18w Page 18 I believe has the filtering solution you are after for point 2. Ben > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Celso K. Webber > Sent: 23 September 2006 04:08 > To: linux clustering > Subject: [Linux-cluster] LVM and Multipath with EMC PowerPath > (Was: CLVMD -Do I need it) > > Hello all, > > After reading a thread on this list (CLVMD - Do I need it), I > started playing around with CLVM, just to make sure two > problems I had in the past were solved: > > 1) LVM normally cannot be used on shared disks, because the > first server that "sees" the PVs will initialize them, and > the other server will see the LVM objects as inactive. This > is solved in LVM2 when used together with CLVM, right? I'm > not pretty sure about the mecanics of CLVM, but I imagine it > shares device UUIDs between the machines. So far, so good. > > 2) The other problem is not directly related to CLVM, but I > found no solution for it (yet). In my setup, I have multiple > paths to the same devices in the shared storage (either in a > SAN or DAS). Under the EMC solution, we employ PowerPath to > solve the multiple devices issue for each LUN. It works quite > well. But LVM is not aware of PowerPath's multiple path > aggregation, so when it scans the PVs on the LUN's > partitions, it "finds" duplicates for the PVs, like this: > [root at csumccaixa12 network-scripts]# pvscan > Found duplicate PV 7v9XUzPHIRqe6E0fA6hgCR3ybeaJoiWm: using > /dev/sdc1 not /dev/emcpowerb1 > Found duplicate PV 3eKnMIm00kg6DXn4MW1UX9QCFh96ykwG: using > /dev/emcpowerc1 not /dev/sdb1 > Found duplicate PV 3T00PR5Ky1XrBesYHRtyowoBQLWDO1kd: using > /dev/sdd1 not /dev/emcpowera1 > Found duplicate PV 3eKnMIm00kg6DXn4MW1UX9QCFh96ykwG: using > /dev/sde1 not /dev/emcpowerc1 > Found duplicate PV 7v9XUzPHIRqe6E0fA6hgCR3ybeaJoiWm: using > /dev/sdf1 not /dev/sdc1 > Found duplicate PV 3T00PR5Ky1XrBesYHRtyowoBQLWDO1kd: using > /dev/sdg1 not /dev/sdd1 > PV /dev/sda3 VG vg0 lvm2 [59.81 GB / 37.75 GB free] > PV /dev/sdg1 lvm2 [127.43 GB] > PV /dev/sde1 lvm2 [127.43 GB] > PV /dev/sdf1 lvm2 [127.43 GB] > Total: 4 [442.10 GB] / in use: 1 [59.81 GB] / in no VG: 3 > [382.29 GB] > > You can see above that the /dev/emcpowerX devices were > declined in favor of the real Linux devices. "vg0" is a VG in > the internal disks (/dev/sda). 
> > The problem I see here is that whenever the specific device > that LVM2 chose goes down because of a link failure, LVM will > not automatically failover to another device, will it? In my > tests it didn't. > > Another matter is that using the /dev/emcpowerX devices I > have also load balancing, so even if LVM2 did failover to the > other paths (the other devices), I would loose the load > balancing feature I can achieve with PowerPath. > > > Question 1: did anyone solve this problem? Does > device-mapper-multipath solve this problem? > > Question 2: is there a way to "force" which devices LVM > should employ when scanning the PVs over the disks Linux recognize? > > > Thank you all for any hints on this. > > Regards, > > Celso. > -- > *Celso Kopp Webber* > > celso at webbertek.com.br > > *Webbertek - Opensource Knowledge* > (41) 8813-1919 > (41) 3284-3035 > > > -- > Esta mensagem foi verificada pelo sistema de antiv?rus e > acredita-se estar livre de perigo. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > From bosse at klykken.com Sat Sep 23 16:00:44 2006 From: bosse at klykken.com (Bosse Klykken) Date: Sat, 23 Sep 2006 18:00:44 +0200 Subject: [Linux-cluster] Cannot restart service after "failed" state In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch> Message-ID: <45155A2C.8030806@klykken.com> Huesser Peter wrote: > I have defined a web-services (for testing it contains an IP and two > script resources). I sometimes happens that I produce failed state of > the cluster. After this I am not able to restart the service anymore. > Even after a reboot of all (two) clustermembers it is not possible. Do I > have to remove by hand some kind of ?lock? file. If the problem is that you're unable to restart the service when it is in "failed" modus, you could try this: clusvcadm -d service # disables the failed service clusvcadm -e service # enables/starts the now disabled service .../Bosse From peter.huesser at psi.ch Sat Sep 23 19:26:33 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Sat, 23 Sep 2006 21:26:33 +0200 Subject: [Linux-cluster] Cannot restart service after "failed" state In-Reply-To: <45155A2C.8030806@klykken.com> Message-ID: <8E2924888511274B95014C2DD906E58AD1A318@MAILBOX0A.psi.ch> > > If the problem is that you're unable to restart the service when it is > in "failed" modus, you could try this: > > clusvcadm -d service # disables the failed service > clusvcadm -e service # enables/starts the now disabled service > Thanks for the hint but the problem was that I did (and still do) not understand the concept of the different resource types within a service. There you can choose between "Add a Shared Resource to this service", "Attach a new Private Resource to the Selection" and "Attach a Shared Resource to the selection". I played around a little bit and everything works now as expected. But I have to search the web first before asking some more specific questions. Greetings Pedro From peter.huesser at psi.ch Sat Sep 23 19:53:24 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Sat, 23 Sep 2006 21:53:24 +0200 Subject: [Linux-cluster] When is a failed node fenced Message-ID: <8E2924888511274B95014C2DD906E58AD1A319@MAILBOX0A.psi.ch> Hello Maybe I do not understand the concept of fencing in the right way. I created a two node cluster with a webservice running on it. The failover works fine now. 
I also configured a fencing device (ipmilan). Fencing by hand works also fine (using a command like: "fence_ipmilan -a server_con -l loginname -p my_passwort -o off") but if I initiate a failover on one of the nodes I expect the services to switch to the other node (which works) and to let this node shutdown the failed node. The second does not happen. Any idea? Thanks Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From orkcu at yahoo.com Sat Sep 23 21:16:13 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Sat, 23 Sep 2006 14:16:13 -0700 (PDT) Subject: [Linux-cluster] LVM and Multipath with EMC PowerPath (Was: CLVMD - Do I need it) In-Reply-To: <4514A51E.1080507@webbertek.com.br> Message-ID: <20060923211613.93874.qmail@web50601.mail.yahoo.com> --- "Celso K. Webber" wrote: > Hello all, > > After reading a thread on this list (CLVMD - Do I > need it), I started > playing around with CLVM, just to make sure two > problems I had in the > past were solved: > > 1) LVM normally cannot be used on shared disks, [...] > 2) The other problem is not directly related to > CLVM, but I found no > solution for it (yet). In my setup, I have multiple > paths to the same > devices in the shared storage (either in a SAN or > DAS). Under the EMC > solution, we employ PowerPath to solve the multiple > devices issue for > each LUN. It works quite well. But LVM is not aware > of PowerPath's > multiple path aggregation, so when it scans the PVs > on the LUN's > partitions, it "finds" duplicates for the PVs, like > this: tyhe solution for this is to "filter" the "under-powerpath devices" :-) I mean, to filter to not scan the devices exported by the SAN or DAS, and just use the powerpath devices for the LVM check the file /etc/lvm/lvm.conf > Another matter is that using the /dev/emcpowerX > devices I have also load > balancing, so even if LVM2 did failover to the other > paths (the other > devices), I would loose the load balancing feature I > can achieve with > PowerPath. if you put LVM over powerpath, you are layering the environment, do you? path load balancing and failover are cover by powerpath, and LVM do its own business :-) > Question 2: is there a way to "force" which devices > LVM should employ > when scanning the PVs over the disks Linux > recognize? /etc/lvm/lvm.conf # A filter that tells LVM2 to only use a restricted set of devices. # The filter consists of an array of regular expressions. These # expressions can be delimited by a character of your choice, and # prefixed with either an 'a' (for accept) or 'r' (for reject). # The first expression found to match a device name determines if # the device will be accepted or rejected (ignored). Devices that # don't match any patterns are accepted. cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From peter.huesser at psi.ch Sat Sep 23 22:19:59 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Sun, 24 Sep 2006 00:19:59 +0200 Subject: [Linux-cluster] Problem with loadbalancer Message-ID: <8E2924888511274B95014C2DD906E58AD1A31C@MAILBOX0A.psi.ch> Hello I just started learning how to do loadbalancing using LVS and piranha. As an example I wanted to have a loadbalancer running in front of one webserver (testing). I want to use direct routing. The IP of the loadbalancer is e.g. 
236.25.1.1, that of the webserver is 236.25.1.2 (web01), and the VIP is 236.25.1.3. Here is the lvs.cf file I created with the aid of piranha_gui:

serial_no = 67
primary = 236.25.1.1
service = lvs
backup_active = 0
backup = 0.0.0.0
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 18
network = direct
debug_level = NONE
monitor_links = 0
virtual webserver {
     active = 1
     address = 236.25.1.3 eth0:1
     vip_nmask = 255.255.255.0
     port = 80
     send = "GET / HTTP/1.0\r\n\r\n"
     expect = "HTTP"
     use_regex = 0
     load_monitor = ruptime
     scheduler = wlc
     protocol = tcp
     timeout = 6
     reentry = 15
     quiesce_server = 1
     server web01 {
         address = 236.25.1.2
         active = 1
         weight = 1
     }
}

- The webservice on web01 is running correctly.
- I can ping the VIP 236.25.1.3.
- The output of "/sbin/ip addr" looks fine: the interface eth0 has the right secondary IP.
- If I run "tcpdump host web01" I see that there is some communication (e.g. a "GET / HTTP/1.0") between the loadbalancer and web01.
- The output of "/sbin/sysctl net.ipv4.ip_forward" is "net.ipv4.ip_forward = 1".

But if I try to connect to the VIP on port 80 I get a "connection refused". Something is wrong, but what? In /var/log/messages I have a lot of the following lines:

Sep 24 00:09:35 loadbalancer nanny[3925]: READ to 236.25.1.2:80 timed out
Sep 24 00:09:47 loadbalancer nanny[3925]: READ to 236.25.1.2:80 timed out

Any idea where I should look to solve the problem? By the way: how can I increase the debug level? As far as I have seen, it is not possible with the GUI.

Thanks in advance

Pedro

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rodgersr at yahoo.com Sun Sep 24 00:20:57 2006
From: rodgersr at yahoo.com (Rick Rodgers)
Date: Sat, 23 Sep 2006 17:20:57 -0700 (PDT)
Subject: [Linux-cluster] Workings of Tiebreaker IP (RHCS)
Message-ID: <20060924002057.43217.qmail@web34214.mail.mud.yahoo.com>

I pulled a message from 2005 about tiebreakers. I have some questions, and it does not seem to agree with what I see clumanager do.

>> Hello,
>>
>> To completely understand what the role of a tiebreaker IP within a two
>> or four node RHCS cluster is, I've searched redhat and Google. I can't
>> however find anything describing the precise workings of the
>> tiebreaker IP. I would really like to know exactly what happens when
>> the tiebreaker is used and how (maybe even some kind of flow diagram).
>>
>> Can anyone here maybe explain that to me, or point me in the direction
>> of more specific information regarding the tiebreaker?

>The tiebreaker IP address is used as an additional vote in the event
>that half the nodes become unreachable or dead in a 2 or 4 node cluster
>on RHCS.

>The IP address must reside on the same network as is used for cluster
>communication. To be a little more specific, if your cluster is using
>eth0 for communication, your IP address used for a tiebreaker must be
>reachable only via eth0 (otherwise, you will end up with a split brain).

>When enabled, the nodes ping the given IP address at regular intervals.
>When the IP address is not reachable, the tiebreaker is considered
>"dead". When it is reachable, it is considered "alive".

>It acts as an additional vote (like an extra cluster member), except for
>one key difference: Unless the default configuration is overridden, the

How does this work? Does the node trying to become the active node access the tiebreaker and put a lock on it? How does it reserve it? Just pinging it would not prevent the other node from doing the same.
>IP tiebreaker may not be used to *form* a quorum where one did not >exist >previously. >So, if one node of a two node cluster is online, it will never become >quorate unless the other node comes online (or administrator override, >see man pages for "cluforce" and "cludb"). >So, in a 2 node cluster, if one node fails and the other node is >online >(and the tiebreaker is still "alive" according to that node), the >remaining node considers itself quorate and "shoots" (aka STONITHs, >aka >fences) the dead node and takes over services. >If a network partition occurs such that both nodes see the tiebreaker >but not each other, the first one to fence the other will naturally >win. >Ok, moving on... >The disk tiebreaker works in a similar way, except that it lets the >cluster limp in along in a safe, semi-split-brain (split brain) in a >network outage. What I mean is that because there's state information >written to/read from the shared raw partitions, the nodes can actually >tell via other means whether or not the other node is "alive" or not >as >opposed to relying solely on the network traffic. >Both nodes update state information on the shared partitions. When >one >node detects that the other node has not updated its information for a >period of time, that node is "down" according to the disk subsystem. >If >this coincides with a "down" status from the membership daemon, the >node >is fenced and services are failed over. If the node never goes down >(and keeps updating its information on the shared partitions), then >the I do not use a IP tiebreaker. I have a two nodes system. When the active node shows it is down via memebership but up via disk then Clumanager determines it is in an ?uncertain state? and shoots it. >node is never fenced and services never fail over. -- Lon --------------------------------- Talk is cheap. Use Yahoo! Messenger to make PC-to-Phone calls. Great rates starting at 1?/min. -------------- next part -------------- An HTML attachment was scrubbed... URL: From linuxr at gmail.com Mon Sep 25 04:15:33 2006 From: linuxr at gmail.com (Marc ) Date: Mon, 25 Sep 2006 00:15:33 -0400 Subject: [Linux-cluster] starting point Message-ID: Hi, I am interested in setting up a linux cluster. I have the following hardware: Dell 2950 server x2, each has PERC4 RAID controller and the Dell remote access client (DRAC) 8 GB RAM (2) x 74 GB for the OS (10) x 300 GB for storage Core 2 duo processors (I think) LPE 11000 HBA's x 2 per machine Switch: unknown Fiber connections OS: Red Hat Enterprise Server (AS) 4 EMT64 - latest update (3 I think) SAN: unknown (volumes/LUN's provided for test) enterprise app: Perforce The goal is to cluster the application so that there is no possibility of downtime. This is to be a dev environment that will have quite a lot of users. The organization has already used Perforce and likes that, just wants to migrate to a Linux/SAN/GFS environment. I have worked with clustering and also a lot with linux, but not together, so that is my challenge. I am wondering how things like LVM and NFS come into play with the GFS once it is all up and running. It is a given that SAMBA will probably have to be running on there at some point, not sure how that plays into the mix. Also I am worried about block size. Perforce is a CVS type database that will store code as flat (tiny) text files, even only storing updates. Great for text storage. 
However, this is a multimedia type company, and much of the data may be full multimedia files (jpeg, video, game stuff, music, you name it). Therefore if a developer writes 10 edits to a C++ application in text form, it only stores the changes and not even the whole text file each time. However (how's this for contrast?) ----if a developer edits a video clip and stores it ten times, perforce saves it ten separate times, each at least as big as the first. So although I hear that I should avoid the 64k block size, I don't know what to go to, realistically. If anyone has specifically grappled with this I would love to know more about how/why you decided whatever you did for your situation. Does anyone know of a good starting point, best practices, HOWTO's, etc.? I am reviewing Karl Knopper's book 'Enterprise Linux Cluster'. I have to get this cranked out NOW. I really need some sort of guidelines or outline since I need to set this up as a project. Any information is GREATLY appreciated. Thanks Marc -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Mon Sep 25 06:52:51 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Mon, 25 Sep 2006 08:52:51 +0200 Subject: [Linux-cluster] starting point In-Reply-To: Message-ID: <8E2924888511274B95014C2DD906E58AD1A349@MAILBOX0A.psi.ch> Hi I also read the bock of Knopper which I find a good for understanding. If you want more concrete details about the redhat cluster suite read the RedHat documentation: http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ (you can also get it in pdf format but I did not find the link anymore. If you want it I can send it to you). Some documentation in the same style can be found on Wikipedia: http://gfs.wikidev.net/Installation. Maybe the FAQ can help you too: http://sources.redhat.com/cluster/faq.html. Pedro Does anyone know of a good starting point, best practices, HOWTO's, etc.? I am reviewing Karl Knopper's book 'Enterprise Linux Cluster'. I have to get this cranked out NOW. I really need some sort of guidelines or outline since I need to set this up as a project. Any information is GREATLY appreciated. -------------- next part -------------- An HTML attachment was scrubbed... URL: From redhat at watson-wilson.ca Mon Sep 25 12:30:50 2006 From: redhat at watson-wilson.ca (Neil Watson) Date: Mon, 25 Sep 2006 08:30:50 -0400 Subject: [Linux-cluster] starting point In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A349@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD1A349@MAILBOX0A.psi.ch> Message-ID: <20060925123050.GB31534@ettin> On Mon, Sep 25, 2006 at 08:52:51AM +0200, Huesser Peter wrote: > Does anyone know of a good starting point, best practices, HOWTO's, etc.? http://technocrat.watson-wilson.ca/db2-cluster.pdf -- Neil Watson | Gentoo Linux System Administrator | Uptime 7 days http://watson-wilson.ca | 2.6.17.6 AMD Athlon(tm) MP 2000+ x 2 From jos at xos.nl Mon Sep 25 14:11:42 2006 From: jos at xos.nl (Jos Vos) Date: Mon, 25 Sep 2006 16:11:42 +0200 Subject: [Linux-cluster] IPMI fencing on an IBM x366 Message-ID: <200609251411.k8PEBg406654@xos037.xos.nl> Hi, Is it possible to use the built-in IPMI support of an IBM x366 server with RHEL CS? I think it is not compatible with RSA II, and I also tried IPMI Lan, but none of them seems to work. Any suggestions? 
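If the x366's BMC answers standard IPMI over LAN, one route is to point fence_ipmilan at it directly in cluster.conf. A sketch only - the node name, BMC address and credentials below are placeholders, and the BMC's LAN channel and user have to be enabled on the machine first:

        <fencedevices>
                <fencedevice agent="fence_ipmilan" name="ipmi-x366-1"
                             ipaddr="192.168.1.11" login="admin" passwd="secret"/>
        </fencedevices>

        <clusternode name="x366-1" votes="1">
                <fence>
                        <method name="1">
                                <device name="ipmi-x366-1"/>
                        </method>
                </fence>
        </clusternode>

Running the agent by hand first (fence_ipmilan -a 192.168.1.11 -l admin -p secret -o off, as shown earlier in the thread) helps separate a BMC problem from a cluster.conf problem.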
Thanks, -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From rpeterso at redhat.com Mon Sep 25 14:13:39 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 25 Sep 2006 09:13:39 -0500 Subject: [Linux-cluster] starting point In-Reply-To: References: Message-ID: <4517E413.9040702@redhat.com> Marc wrote: > Does anyone know of a good starting point, best practices, HOWTO's, > etc.? I am reviewing Karl Knopper's book 'Enterprise Linux Cluster'. > I have to get this cranked out NOW. I really need some sort of > guidelines or outline since I need to set this up as a project. Any > information is GREATLY appreciated. > > Thanks > Marc Hi Marc, I recommend the "Unofficial" NFS/GFS cookbook (but I'm biased): http://sources.redhat.com/cluster/doc/nfscookbook.pdf Regards, Bob Peterson Red Hat Cluster Suite From isplist at logicore.net Mon Sep 25 15:17:40 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 25 Sep 2006 10:17:40 -0500 Subject: [Linux-cluster] General FC Question Message-ID: <2006925101740.355825@leena> After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, etc settings. My initial device now comes up as sdc when it used to be sda. Is there some way of allowing GFS to see the storage in some way that it can know which device is which when I add a new one or remove one, etc? Hard loop ID's on the FC side I think but is there anything on the GFS side? Mike From peter.huesser at psi.ch Mon Sep 25 19:31:42 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Mon, 25 Sep 2006 21:31:42 +0200 Subject: [Linux-cluster] piranha Message-ID: <8E2924888511274B95014C2DD906E58AD1A3A6@MAILBOX0A.psi.ch> Hello I sent a similar question a few days ago and did not get any answer. Maybe the time (Saturday night) was unfavorable or the question was not that clear. So I try it once more: I want to run a loadbalancer in front of two webserver (using direct routing). But if I connect to port 80 of the loadbalancer I get a "connection refused". 1) Did anybody had a similar problem? 2) How can I increase the debuglevel? Thanks' in advance Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Mon Sep 25 20:53:41 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Mon, 25 Sep 2006 22:53:41 +0200 Subject: [Linux-cluster] piranha In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A3A6@MAILBOX0A.psi.ch> Message-ID: <8E2924888511274B95014C2DD906E58AD1A3A7@MAILBOX0A.psi.ch> By the way: I started the "pulse" daemon in the debug modus ("pulse -v -n") and got the following output: nanny: Opening TCP socket to remote service port 80... nanny: Connecting socket to remote address... nanny: DEBUG -- Posting CONNECT poll() nanny: Sending len=16, text="GET / HTTP/1.0 " nanny: DEBUG -- Posting READ poll() nanny: DEBUG -- READ poll() completed (1,1) nanny: Posting READ I/O; expecting 4 character(s)... nanny: DEBUG -- READ returned 4 nanny: READ expected len=4, text="HTTP" nanny: READ got len=4, text=HTTP nanny: avail: 1 active: 1: count: 13 pulse: DEBUG -- setting SEND_heartbeat timer pulse: DEBUG -- setting SEND_heartbeat timer pulse: DEBUG -- setting NEED_heartbeat timer pulse: DEBUG -- setting SEND_heartbeat timer nanny: Opening TCP socket to remote service port 80... ... For me this looks as if everything is ok. 
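One LVS detail worth bearing in mind here (an illustrative aside, using the addresses from above): the virtual service is installed in the kernel IPVS table rather than opened as a listening socket on the director, and with direct routing each real server must also carry the VIP itself and keep quiet about it in ARP. A quick way to check both sides:

        /sbin/ipvsadm -L -n
                # on the director: the 236.25.1.3:80 virtual service and its
                # real server web01 should be listed here (netstat will not
                # show port 80, since IPVS handles it inside the kernel)

        /sbin/ip addr add 236.25.1.3/32 dev lo label lo:0
        /sbin/sysctl -w net.ipv4.conf.all.arp_ignore=1
        /sbin/sysctl -w net.ipv4.conf.all.arp_announce=2
                # on web01 (sketch, 2.6 kernels): give the real server the VIP
                # on loopback and stop it answering ARP for that address

If the real servers lack the VIP, or nanny has marked them all unavailable, a connect to the VIP is typically refused or dropped even though the health checks themselves look fine.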
"nanny" sends from time to time a "GET / HTTP/1.0" request and the response ("HTTP" only first four letters) correspondence with what is expected. The problem is that pulse is not opening port 80 on the loadbalancer for reveiving http-request. A "netstat -anp" verifies this. Hello I sent a similar question a few days ago and did not get any answer. Maybe the time (Saturday night) was unfavorable or the question was not that clear. So I try it once more: I want to run a loadbalancer in front of two webserver (using direct routing). But if I connect to port 80 of the loadbalancer I get a "connection refused". 1) Did anybody had a similar problem? 2) How can I increase the debuglevel? Thanks' in advance Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From jpenalbae at gmail.com Mon Sep 25 23:11:32 2006 From: jpenalbae at gmail.com (=?ISO-8859-1?Q?Jaime_Pe=F1alba?=) Date: Tue, 26 Sep 2006 01:11:32 +0200 Subject: [Linux-cluster] General FC Question In-Reply-To: <2006925101740.355825@leena> References: <2006925101740.355825@leena> Message-ID: You can try multipath-tools or some other software that will group disks by WWN (World Wide Name). Regards, Jaime. 2006/9/25, isplist at logicore.net : > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. > > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From isplist at logicore.net Mon Sep 25 23:56:47 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 25 Sep 2006 18:56:47 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: Message-ID: <2006925185647.450867@leena> I'll take a look at that, wwn might work so long as all my storage devices supports it. Mike On Tue, 26 Sep 2006 01:11:32 +0200, Jaime Pe?alba wrote: > You can try multipath-tools or some other software that will group > > disks by WWN (World Wide Name). > > Regards, > Jaime. > > > 2006/9/25, isplist at logicore.net : >> After adding storage, my cluster comes up with different /dev/sda, >> /dev/sdb, >> etc settings. My initial device now comes up as sdc when it used to be >> sda. >> >> Is there some way of allowing GFS to see the storage in some way that it >> can >> know which device is which when I add a new one or remove one, etc? >> >> Hard loop ID's on the FC side I think but is there anything on the GFS >> side? >> >> Mike >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster From orkcu at yahoo.com Tue Sep 26 01:05:29 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Mon, 25 Sep 2006 18:05:29 -0700 (PDT) Subject: [Linux-cluster] General FC Question In-Reply-To: <2006925185647.450867@leena> Message-ID: <20060926010530.88046.qmail@web50601.mail.yahoo.com> --- "isplist at logicore.net" wrote: > I'll take a look at that, wwn might work so long as > all my storage devices > supports it. but the LUNs that you export from the same SAN will show the same wwn, do they? the SAN's wwn I guess maybe somre kind of LUNs ID can be mapped with udev so the same name apply to the same LUNs Id, I am just guessing. 
Of course, the other way is to use LVM, LVM can help because it have "IDs" that helps to gruop always the same PV no matter if you add new devices :-) cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From isplist at logicore.net Tue Sep 26 01:29:14 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 25 Sep 2006 20:29:14 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: <20060926010530.88046.qmail@web50601.mail.yahoo.com> Message-ID: <2006925202914.382809@leena> Problem is that SCSI devices are changing when I add/remove storage devices. For example, a device that was all set up as sda is now sdc upon reboot. Mike On Mon, 25 Sep 2006 18:05:29 -0700 (PDT), Pe?a wrote: > > > --- "isplist at logicore.net" > wrote: > >> I'll take a look at that, wwn might work so long as >> all my storage devices >> supports it. > but the LUNs that you export from the same SAN will > show the same wwn, do they? > the SAN's wwn I guess > > maybe somre kind of LUNs ID can be mapped with udev so > the same name apply to the same LUNs Id, I am just > guessing. Of course, the other way is to use LVM, LVM > can help because it have "IDs" that helps to gruop > always the same PV no matter if you add new devices > :-) > > cu > roger > > __________________________________________ > RedHat Certified Engineer ( RHCE ) > Cisco Certified Network Associate ( CCNA ) > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com From orkcu at yahoo.com Tue Sep 26 01:39:41 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Mon, 25 Sep 2006 18:39:41 -0700 (PDT) Subject: [Linux-cluster] General FC Question In-Reply-To: <2006925202914.382809@leena> Message-ID: <20060926013941.31337.qmail@web50613.mail.yahoo.com> --- "isplist at logicore.net" wrote: > Problem is that SCSI devices are changing when I > add/remove storage devices. > For example, a device that was all set up as sda is > now sdc upon reboot. > yes, I understand you the same happen when you add LUNs exported from a SAN the OS will see more devices , and maybe something that was sda now is sde :-( with LVM you will not have any problem no matther how the name of the device change (sda -> sdb or sde) with GFS alone I guess you will maybe with the help of multipath or powerpath or any other "device mapper" tool you can map a device to something unique and invariable across addiction of new real scsi devices or LUNs to the system cu roger > Mike > > > On Mon, 25 Sep 2006 18:05:29 -0700 (PDT), Pe?a > wrote: > > > > > > --- "isplist at logicore.net" > > wrote: > > > >> I'll take a look at that, wwn might work so long > as > >> all my storage devices > >> supports it. > > but the LUNs that you export from the same SAN > will > > show the same wwn, do they? > > the SAN's wwn I guess > > > > maybe somre kind of LUNs ID can be mapped with > udev so > > the same name apply to the same LUNs Id, I am just > > guessing. 
Of course, the other way is to use LVM, > LVM > > can help because it have "IDs" that helps to gruop > > always the same PV no matter if you add new > devices > > :-) > > > > cu > > roger > > > > __________________________________________ > > RedHat Certified Engineer ( RHCE ) > > Cisco Certified Network Associate ( CCNA ) > > > > __________________________________________________ > > Do You Yahoo!? > > Tired of spam? Yahoo! Mail has the best spam > protection around > > http://mail.yahoo.com > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From kanderso at redhat.com Tue Sep 26 07:19:45 2006 From: kanderso at redhat.com (Kevin Anderson) Date: Tue, 26 Sep 2006 02:19:45 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: <2006925101740.355825@leena> References: <2006925101740.355825@leena> Message-ID: <1159255185.2997.3.camel@localhost.localdomain> On Mon, 2006-09-25 at 10:17 -0500, isplist at logicore.net wrote: > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. > > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > You should be using lvm2 and lvm2_cluster to handle this issue. LVM2 handles the name changing of the device on reboot. This often happens depending on the scan order for the devices. By using a volume manager, you make these changes transparent. You also have the advantage of not being tied to single devices, but able to concatenate or stripe your filesystem across multiple devices. You must also use lvm2-cluster to ensure any changes you make to the volume information is consistent across the cluster. > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > From cjk at techma.com Tue Sep 26 11:37:59 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Tue, 26 Sep 2006 07:37:59 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_General_FC_Question?= In-Reply-To: <2006925101740.355825@leena> Message-ID: You don't say which FC cards you are using but if it's qlogic, then the driver can be set to combine the devices. Basically whats happened is that your machine is picking up the alternate path to the device, which is a perfectly valid thing to do, it's just not what you need at this point. It may be as simple as your secondary controller actually has the lun you are trying to access. To work around yo might just be able to reset the seconday controller and force the primary to take over the LUN. This happens quite a bit depending on your setup. The Qlogic drivers, when setup for failover, will coelesce the devices into a single device by the WWID of the LUN. If that's not an option, then try the multipath tools support in RHEL4.2 or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be /dev/mpath/mpath0 etc, or whatever you set them to instead. 
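A minimal /etc/multipath.conf for that approach can be as small as one alias per LUN (a sketch - the WWID shown is a placeholder you would read from the LUN itself):

        multipaths {
                multipath {
                        wwid   3600508b4000116370000a00000c00000
                        alias  gfsdisk1
                }
        }

After "service multipathd start", "multipath -l" should list the aliased map, which appears as /dev/mapper/gfsdisk1 and keeps the same name no matter how the underlying sdX devices come up.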
Even without failover, the latest Qlogic drivers will make both paths active so that you never end up with a dead path upon boot up. Hope this helps. Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net Sent: Monday, September 25, 2006 11:18 AM To: linux-cluster Subject: [Linux-cluster] General FC Question After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, etc settings. My initial device now comes up as sdc when it used to be sda. Is there some way of allowing GFS to see the storage in some way that it can know which device is which when I add a new one or remove one, etc? Hard loop ID's on the FC side I think but is there anything on the GFS side? Mike -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From cjk at techma.com Tue Sep 26 11:44:00 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Tue, 26 Sep 2006 07:44:00 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_General_FC_Question?= In-Reply-To: Message-ID: One more thing, when using more than one path (basically anyu san setup) the device mappings will wrap around for every path, so for two paths... single hba, dual controller.. three disks will look like this... disk1=/dev/sda disk2=/dev/sdb disk3=/dev/sdc disk1=/dev/sdd disk2=/dev/sde disk3=/dev/sde and four like this.. disk1=/dev/sda disk2=/dev/sdb disk3=/dev/sdc disk4=/dev/sdd disk1=/dev/sde disk2=/dev/sde disk3=/dev/sdf disk4=/dev/sdg Or for dual hba, dual controller (4 paths) disk1=/dev/sda disk2=/dev/sdb disk3=/dev/sdc disk4=/dev/sdd disk1=/dev/sde disk2=/dev/sde disk3=/dev/sdf disk4=/dev/sdg disk1=/dev/sdh disk2=/dev/sdi disk3=/dev/sdj disk4=/dev/sdk disk1=/dev/sdl disk2=/dev/sdm disk3=/dev/sdn disk4=/dev/sdo etc... Cheers With the Qlogic drivers in failover mode, you'll get this.. disk1=/dev/sda disk2=/dev/sdb disk3=/dev/sdc disk4=/dev/sdd even though there are multiple paths Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. Sent: Tuesday, September 26, 2006 7:38 AM To: isplist at logicore.net; linux clustering Subject: RE: [Linux-cluster] General FC Question You don't say which FC cards you are using but if it's qlogic, then the driver can be set to combine the devices. Basically whats happened is that your machine is picking up the alternate path to the device, which is a perfectly valid thing to do, it's just not what you need at this point. It may be as simple as your secondary controller actually has the lun you are trying to access. To work around yo might just be able to reset the seconday controller and force the primary to take over the LUN. This happens quite a bit depending on your setup. The Qlogic drivers, when setup for failover, will coelesce the devices into a single device by the WWID of the LUN. If that's not an option, then try the multipath tools support in RHEL4.2 or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be /dev/mpath/mpath0 etc, or whatever you set them to instead. Even without failover, the latest Qlogic drivers will make both paths active so that you never end up with a dead path upon boot up. Hope this helps. 
Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net Sent: Monday, September 25, 2006 11:18 AM To: linux-cluster Subject: [Linux-cluster] General FC Question After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, etc settings. My initial device now comes up as sdc when it used to be sda. Is there some way of allowing GFS to see the storage in some way that it can know which device is which when I add a new one or remove one, etc? Hard loop ID's on the FC side I think but is there anything on the GFS side? Mike -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From isplist at logicore.net Tue Sep 26 13:26:34 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Tue, 26 Sep 2006 08:26:34 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: <1159255185.2997.3.camel@localhost.localdomain> Message-ID: <200692682634.118034@leena> > You should be using lvm2 and lvm2_cluster to handle this issue. LVM2 > handles the name changing of the device on reboot. This often happens > depending on the scan order for the devices. By using a volume manager, Yes, this is what I'm using. I'll reply to the next message about gear as well. Since I've just added the hardware, I guess I've not had enough time to notice how well LVM2 handles this. I just noticed it right off the bat after boot up. The problem came up when I was not able to mount my previously set up GFS FS. Mike From isplist at logicore.net Tue Sep 26 13:32:00 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Tue, 26 Sep 2006 08:32:00 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: Message-ID: <20069268320.047255@leena> > You don't say which FC cards you are using but if it's qlogic, then the > driver can be set to combine the devices. Basically whats happened is that >your machine is picking up the alternate path to the device, which is a >perfectly valid thing to do, it's just not what you need at this point. It >may be as simple as your The cards are all Qlogic, the switch is going to be an ED-5000 next week, the storage is mostly Xyratex chassis. >secondary controller actually has the lun you are trying to access. To work >around yo might just be able to reset the seconday controller and force the >primary to take over the LUN. This happens quite a bit depending on your >setup. I do have options on the Xyratex to combine the two controllers into one number but that's the loop ID. Not sure I've seen anything for LUN control yet. > Even without failover, the latest Qlogic drivers will make both paths active > so that you never end up with a dead path upon boot up. Path's seem fine, I mean, the storage does show up. It's just that my initial device has moved to another /dev/sdx number. In another message, am I to understand that once the physical device is set up and running as a GFS volume, that LVM2 will always see it no matter if the /dev number changes? I'll also have to look into your suggestions. Thank you very much. 
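For reference: LVM2 identifies every PV by the UUID stored in its on-disk label, not by the /dev/sdX name, so a volume group reassembles correctly however the names shuffle. A quick way to see it (illustrative only; "vg_san" is a placeholder):

        pvs -o pv_name,pv_uuid,vg_name    # the UUID column is what LVM matches on
        vgscan                            # rescan devices and rebuild the LVM cache after names change
        vgchange -ay vg_san               # activate the VG from whichever sdX devices its PVs now sit on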
Mike From isplist at logicore.net Tue Sep 26 13:35:36 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Tue, 26 Sep 2006 08:35:36 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: Message-ID: <200692683536.307927@leena> > One more thing, when using more than one path (basically anyu san setup) the > device mappings will wrap around for every path, so for two paths... single > hba, dual controller.. Right, and of course, this is what's happened. New disks have shown up and the old disk now shows up as a new device number. By the way, is there a way to clear everything in the LVM2 cache and setup info? It is now confused and seeing a lot of trashed information. Since the setup is new, I can start from scratch so wish to nuke all the old info. Mike > > three disks will look like this... > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk1=/dev/sdd > disk2=/dev/sde > disk3=/dev/sde > > and four like this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > > > Or for dual hba, dual controller (4 paths) > > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > disk1=/dev/sdh > disk2=/dev/sdi > disk3=/dev/sdj > disk4=/dev/sdk > disk1=/dev/sdl > disk2=/dev/sdm > disk3=/dev/sdn > disk4=/dev/sdo > > etc... > > Cheers > > With the Qlogic drivers in failover mode, you'll get this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > > even though there are multiple paths > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. > Sent: Tuesday, September 26, 2006 7:38 AM > To: isplist at logicore.net; linux clustering > Subject: RE: [Linux-cluster] General FC Question > > You don't say which FC cards you are using but if it's qlogic, then the > driver can be set to combine the devices. Basically whats happened is that > your machine is picking up the alternate path to the device, which is a > perfectly valid thing to do, it's just not what you need at this point. It > may be as simple as your > > secondary controller actually has the lun you are trying to access. To work > around yo might just be able to reset the seconday controller and force the > primary to take over the LUN. This happens quite a bit depending on your > setup. The Qlogic drivers, when setup for failover, will coelesce the > devices > into a single device by the WWID of the LUN. If that's not an option, then > try the multipath tools support in > RHEL4.2 > or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be > /dev/mpath/mpath0 etc, or whatever you set them to instead. > > Even without failover, the latest Qlogic drivers will make both paths active > so that you never end up with a dead path upon boot up. > > > Hope this helps. > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net > Sent: Monday, September 25, 2006 11:18 AM > To: linux-cluster > Subject: [Linux-cluster] General FC Question > > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. 
> > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From isplist at logicore.net Tue Sep 26 13:46:22 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Tue, 26 Sep 2006 08:46:22 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: Message-ID: <200692684622.663564@leena> PS: Is my problem hard loop ID's or LUN's? Could I achieve what I need either way or is it one or thew other? On Tue, 26 Sep 2006 07:44:00 -0400, Kovacs, Corey J. wrote: > One more thing, when using more than one path (basically anyu san setup) the > > device > mappings will wrap around for every path, so for two paths... single hba, > dual controller.. > > > three disks will look like this... > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk1=/dev/sdd > disk2=/dev/sde > disk3=/dev/sde > > and four like this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > > > Or for dual hba, dual controller (4 paths) > > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > disk1=/dev/sdh > disk2=/dev/sdi > disk3=/dev/sdj > disk4=/dev/sdk > disk1=/dev/sdl > disk2=/dev/sdm > disk3=/dev/sdn > disk4=/dev/sdo > > etc... > > Cheers > > With the Qlogic drivers in failover mode, you'll get this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > > even though there are multiple paths > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. > Sent: Tuesday, September 26, 2006 7:38 AM > To: isplist at logicore.net; linux clustering > Subject: RE: [Linux-cluster] General FC Question > > You don't say which FC cards you are using but if it's qlogic, then the > driver can be set to combine the devices. Basically whats happened is that > your machine is picking up the alternate path to the device, which is a > perfectly valid thing to do, it's just not what you need at this point. It > may be as simple as your > > secondary controller actually has the lun you are trying to access. To work > around yo might just be able to reset the seconday controller and force the > primary to take over the LUN. This happens quite a bit depending on your > setup. The Qlogic drivers, when setup for failover, will coelesce the > devices > into a single device by the WWID of the LUN. If that's not an option, then > try the multipath tools support in > RHEL4.2 > or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be > /dev/mpath/mpath0 etc, or whatever you set them to instead. > > Even without failover, the latest Qlogic drivers will make both paths active > so that you never end up with a dead path upon boot up. > > > Hope this helps. 
> > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net > Sent: Monday, September 25, 2006 11:18 AM > To: linux-cluster > Subject: [Linux-cluster] General FC Question > > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. > > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jaap at sara.nl Tue Sep 26 14:02:11 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Tue, 26 Sep 2006 16:02:11 +0200 Subject: [Linux-cluster] Files are there, but not. Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB016936D4@douwes.ka.sara.nl> Hi All, We are running a GFS / NFS cluster with 5 fileservers. Each server exports the same storage as a different NFS server. nfs1, nfs2....nfs5 On nodes that mount on nfs1 we have the following problem: root# ls -l ls: CHGCAR: No such file or directory ls: CHG: No such file or directory ls: WAVECAR: No such file or directory total 17616 -rw------- 1 xxxxxxxx yyy 1612 Sep 26 15:43 CONTCAR -rw------- 1 xxxxxxxx yyy 167 Sep 26 14:11 DOSCAR Other fileservers display the 3 missing files normally and they are accessible. On the fileservers we also get these kind messages: h_update: A.TCNQ/CHGCAR already up-to-date! fh_update: A.TCNQ/CHGCAR already up-to-date! fh_update: A.TCNQ/CHGCAR already up-to-date! fh_update: A.TCNQ/CHGCAR already up-to-date! fh_update: A.TCNQ/WAVECAR already up-to-date! fh_update: A.TCNQ/WAVECAR already up-to-date! fh_update: A.TCNQ/WAVECAR already up-to-date! fh_update: A+B/CHGCAR already up-to-date! fh_update: A+B/CHGCAR already up-to-date! fh_update: A+B/CHGCAR already up-to-date! fh_update: A+B/WAVECAR already up-to-date! I don't know where to start looking to trac this problem. If i reboot the nfs1 server the problem is gone, but in time the problem comes back with other files, until now on the same fileserver. Maybe someone has seen this problem before? We use GFS version CVS 1.0.3 stable. with kernel 2.6.17.11 Met vriendelijke groet, Kind Regards, Jaap P. Dijkshoorn Systems Programmer mailto:jaap at sara.nl http://home.sara.nl/~jaapd SARA Computing & Networking Services Kruislaan 415 1098 SJ Amsterdam Tel: +31-(0)20-5923000 Fax: +31-(0)20-6683167 http://www.sara.nl From peter.huesser at psi.ch Tue Sep 26 14:26:29 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Tue, 26 Sep 2006 16:26:29 +0200 Subject: [Linux-cluster] Realserver configuration using loadbalancer Message-ID: <8E2924888511274B95014C2DD906E58AD1A3F9@MAILBOX0A.psi.ch> Hello If I run a loadbalancer in front of the webservers (using piranha_gui and pulse) is there anything I have configure on the real webservers? Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From cjk at techma.com Tue Sep 26 15:19:41 2006 From: cjk at techma.com (Kovacs, Corey J.) 
Date: Tue, 26 Sep 2006 11:19:41 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_General_FC_Question?= In-Reply-To: <200692684622.663564@leena> Message-ID: I'd say LUN. If you cat out /proc/scsi/scsi you'll see the luns are repeated. The qlogic based failover doesn't have anything to do with settings on the SAN (combining luns etc) it does it at the scsi layer (on the host). sort of like "secure path" from HP. What you are seeing is the presence of both paths by the driver. The RedHat qlogic driver seems a bit crippled since they'd (and the upstream kernel devs) would rather you used the device mapper multipath solution instead. The path of least resistence is to get the qlogic drivers from the qlogic site (not the stock redhat drivers) and install them. A better long term solution is prolly to go ahead and figure out the multipath device mapper stuff. Cheers. Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net Sent: Tuesday, September 26, 2006 9:46 AM To: linux-cluster Subject: RE: [Linux-cluster] General FC Question PS: Is my problem hard loop ID's or LUN's? Could I achieve what I need either way or is it one or thew other? On Tue, 26 Sep 2006 07:44:00 -0400, Kovacs, Corey J. wrote: > One more thing, when using more than one path (basically anyu san setup) the > > device > mappings will wrap around for every path, so for two paths... single hba, > dual controller.. > > > three disks will look like this... > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk1=/dev/sdd > disk2=/dev/sde > disk3=/dev/sde > > and four like this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > > > Or for dual hba, dual controller (4 paths) > > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > disk1=/dev/sdh > disk2=/dev/sdi > disk3=/dev/sdj > disk4=/dev/sdk > disk1=/dev/sdl > disk2=/dev/sdm > disk3=/dev/sdn > disk4=/dev/sdo > > etc... > > Cheers > > With the Qlogic drivers in failover mode, you'll get this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > > even though there are multiple paths > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. > Sent: Tuesday, September 26, 2006 7:38 AM > To: isplist at logicore.net; linux clustering > Subject: RE: [Linux-cluster] General FC Question > > You don't say which FC cards you are using but if it's qlogic, then the > driver can be set to combine the devices. Basically whats happened is that > your machine is picking up the alternate path to the device, which is a > perfectly valid thing to do, it's just not what you need at this point. It > may be as simple as your > > secondary controller actually has the lun you are trying to access. To work > around yo might just be able to reset the seconday controller and force the > primary to take over the LUN. This happens quite a bit depending on your > setup. The Qlogic drivers, when setup for failover, will coelesce the > devices > into a single device by the WWID of the LUN. If that's not an option, then > try the multipath tools support in > RHEL4.2 > or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be > /dev/mpath/mpath0 etc, or whatever you set them to instead. 
> > Even without failover, the latest Qlogic drivers will make both paths active > so that you never end up with a dead path upon boot up. > > > Hope this helps. > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net > Sent: Monday, September 25, 2006 11:18 AM > To: linux-cluster > Subject: [Linux-cluster] General FC Question > > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. > > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From wcheng at redhat.com Tue Sep 26 16:02:58 2006 From: wcheng at redhat.com (Wendy Cheng) Date: Tue, 26 Sep 2006 12:02:58 -0400 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB016936D4@douwes.ka.sara.nl> References: <339554D0FE9DD94A8E5ACE4403676CEB016936D4@douwes.ka.sara.nl> Message-ID: <45194F32.2040403@redhat.com> Jaap Dijkshoorn wrote: >Hi All, > >We are running a GFS / NFS cluster with 5 fileservers. Each server >exports the same storage as a different NFS server. nfs1, nfs2....nfs5 > >On nodes that mount on nfs1 we have the following problem: > >root# ls -l >ls: CHGCAR: No such file or directory >ls: CHG: No such file or directory >ls: WAVECAR: No such file or directory >total 17616 >-rw------- 1 xxxxxxxx yyy 1612 Sep 26 15:43 CONTCAR >-rw------- 1 xxxxxxxx yyy 167 Sep 26 14:11 DOSCAR > >Other fileservers display the 3 missing files normally and they are >accessible. On the fileservers we also get these kind messages: > > Look like NFS client side caching issue. What's the kernel version you have in the nfs client machine (do a "uname -a")? -- Wendy >h_update: A.TCNQ/CHGCAR already up-to-date! >fh_update: A.TCNQ/CHGCAR already up-to-date! >fh_update: A.TCNQ/CHGCAR already up-to-date! >fh_update: A.TCNQ/CHGCAR already up-to-date! >fh_update: A.TCNQ/WAVECAR already up-to-date! >fh_update: A.TCNQ/WAVECAR already up-to-date! >fh_update: A.TCNQ/WAVECAR already up-to-date! >fh_update: A+B/CHGCAR already up-to-date! >fh_update: A+B/CHGCAR already up-to-date! >fh_update: A+B/CHGCAR already up-to-date! >fh_update: A+B/WAVECAR already up-to-date! > >I don't know where to start looking to trac this problem. If i reboot >the nfs1 server the problem is gone, but in time the problem comes back >with other files, until now on the same fileserver. > >Maybe someone has seen this problem before? > >We use GFS version CVS 1.0.3 stable. with kernel 2.6.17.11 > > > > >Met vriendelijke groet, Kind Regards, > >Jaap P. Dijkshoorn >Systems Programmer >mailto:jaap at sara.nl http://home.sara.nl/~jaapd > >SARA Computing & Networking Services >Kruislaan 415 1098 SJ Amsterdam >Tel: +31-(0)20-5923000 >Fax: +31-(0)20-6683167 >http://www.sara.nl > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > -- S. 
Wendy Cheng wcheng at redhat.com From jpenalbae at gmail.com Tue Sep 26 18:05:45 2006 From: jpenalbae at gmail.com (=?ISO-8859-1?Q?Jaime_Pe=F1alba?=) Date: Tue, 26 Sep 2006 20:05:45 +0200 Subject: [Linux-cluster] General FC Question In-Reply-To: <200692684622.663564@leena> References: <200692684622.663564@leena> Message-ID: Hi, I will recommend you again using multipath-tools which uses device-mapper. Here is an example. # cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: HP Model: HSV100 Rev: 3025 Type: RAID ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 00 Lun: 01 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 00 Lun: 02 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 01 Lun: 00 Vendor: HP Model: HSV100 Rev: 3025 Type: RAID ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 01 Lun: 01 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 01 Lun: 02 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: HP Model: HSV100 Rev: 3025 Type: RAID ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 00 Lun: 01 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 00 Lun: 02 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 01 Lun: 00 Vendor: HP Model: HSV100 Rev: 3025 Type: RAID ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 01 Lun: 01 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 01 Lun: 02 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 That output wont help you to identify each disk. Here is the multipath-tools output from those disks: # multipath -v3 ..... truncated output ....... 3600508b4000116370000a00000c00000 0:0:0:1 sda 8:0 [ready] 3600508b40001168a0000e00000090000 0:0:0:2 sdb 8:16 [ready] 3600508b4000116370000a00000c00000 0:0:1:1 sdc 8:32 [faulty] 3600508b40001168a0000e00000090000 0:0:1:2 sdd 8:48 [faulty] 3600508b4000116370000a00000c00000 1:0:0:1 sde 8:64 [ready] 3600508b40001168a0000e00000090000 1:0:0:2 sdf 8:80 [ready] 3600508b4000116370000a00000c00000 1:0:1:1 sdg 8:96 [faulty] 3600508b40001168a0000e00000090000 1:0:1:2 sdh 8:112 [faulty] ..... truncated output ....... It finds each disk WWN and groups them. # multipath -l storage.old () [size=100 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active] \_ 0:0:0:1 sda 8:0 [active] \_ 0:0:1:1 sdc 8:32 [active] \_ 1:0:0:1 sde 8:64 [active] \_ 1:0:1:1 sdg 8:96 [failed] storage () [size=100 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active] \_ 0:0:0:2 sdb 8:16 [active] \_ 0:0:1:2 sdd 8:48 [failed] \_ 1:0:0:2 sdf 8:80 [active] \_ 1:0:1:2 sdh 8:112 [failed] And WWNs are aliased to names in the /etc/multipath.conf file, example: ---------------------- multipath { wwid 3600508b4000116370000a00000c00000 alias storage.old } multipath { wwid 3600508b40001168a0000e00000090000 alias storage } --------------------- So it will create /dev/mapper/storage and /dev/mapper/storage.old # dmsetup ls storage.old (253, 4) storage3 (253, 3) storage2 (253, 2) storage1 (253, 1) storage (253, 0) storage.old2 (253, 6) storage.old1 (253, 5) My devices are partiotioned so, direct access for each partition is automatically created. 
/dev/mapper/storage (hole disk) /dev/mapper/storage1 (first partition) /dev/mapper/storage2 (second) /dev/mapper/storage3 (third) This way you can tell gfs to access those mapper devices which dont care about the order of found disks, just WWNs. About the device-mapper question, you can clean all devices by doing # dmsetup remove_all Or just remove one device # dmsetup remove storage I hope this helps you. Regards, Jaime. 2006/9/26, isplist at logicore.net : > PS: Is my problem hard loop ID's or LUN's? Could I achieve what I need either > way or is it one or thew other? > > > On Tue, 26 Sep 2006 07:44:00 -0400, Kovacs, Corey J. wrote: > > One more thing, when using more than one path (basically anyu san setup) the > > > > device > > mappings will wrap around for every path, so for two paths... single hba, > > dual controller.. > > > > > > three disks will look like this... > > > > disk1=/dev/sda > > disk2=/dev/sdb > > disk3=/dev/sdc > > disk1=/dev/sdd > > disk2=/dev/sde > > disk3=/dev/sde > > > > and four like this.. > > > > disk1=/dev/sda > > disk2=/dev/sdb > > disk3=/dev/sdc > > disk4=/dev/sdd > > disk1=/dev/sde > > disk2=/dev/sde > > disk3=/dev/sdf > > disk4=/dev/sdg > > > > > > Or for dual hba, dual controller (4 paths) > > > > > > disk1=/dev/sda > > disk2=/dev/sdb > > disk3=/dev/sdc > > disk4=/dev/sdd > > disk1=/dev/sde > > disk2=/dev/sde > > disk3=/dev/sdf > > disk4=/dev/sdg > > disk1=/dev/sdh > > disk2=/dev/sdi > > disk3=/dev/sdj > > disk4=/dev/sdk > > disk1=/dev/sdl > > disk2=/dev/sdm > > disk3=/dev/sdn > > disk4=/dev/sdo > > > > etc... > > > > Cheers > > > > With the Qlogic drivers in failover mode, you'll get this.. > > > > disk1=/dev/sda > > disk2=/dev/sdb > > disk3=/dev/sdc > > disk4=/dev/sdd > > > > even though there are multiple paths > > > > > > Corey > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. > > Sent: Tuesday, September 26, 2006 7:38 AM > > To: isplist at logicore.net; linux clustering > > Subject: RE: [Linux-cluster] General FC Question > > > > You don't say which FC cards you are using but if it's qlogic, then the > > driver can be set to combine the devices. Basically whats happened is that > > your machine is picking up the alternate path to the device, which is a > > perfectly valid thing to do, it's just not what you need at this point. It > > may be as simple as your > > > > secondary controller actually has the lun you are trying to access. To work > > around yo might just be able to reset the seconday controller and force the > > primary to take over the LUN. This happens quite a bit depending on your > > setup. The Qlogic drivers, when setup for failover, will coelesce the > > devices > > into a single device by the WWID of the LUN. If that's not an option, then > > try the multipath tools support in > > RHEL4.2 > > or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be > > /dev/mpath/mpath0 etc, or whatever you set them to instead. > > > > Even without failover, the latest Qlogic drivers will make both paths active > > so that you never end up with a dead path upon boot up. > > > > > > Hope this helps. 
> > > > > > Corey > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net > > Sent: Monday, September 25, 2006 11:18 AM > > To: linux-cluster > > Subject: [Linux-cluster] General FC Question > > > > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > > etc settings. My initial device now comes up as sdc when it used to be sda. > > > > Is there some way of allowing GFS to see the storage in some way that it can > > know which device is which when I add a new one or remove one, etc? > > > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > > > Mike > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From jaap at sara.nl Wed Sep 27 06:56:59 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Wed, 27 Sep 2006 08:56:59 +0200 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <45194F32.2040403@redhat.com> Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB016936E0@douwes.ka.sara.nl> Hi Wendy, > > > > > Look like NFS client side caching issue. What's the kernel > version you > have in the nfs client machine (do a "uname -a")? > It is the same as our NFS server, kernel 2.6.17.11 Regards, Jaap From jaap at sara.nl Wed Sep 27 08:44:10 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Wed, 27 Sep 2006 10:44:10 +0200 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB016936E0@douwes.ka.sara.nl> Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB016936E9@douwes.ka.sara.nl> Wendy, Just to be complete. We see the problems occure on one of our GFS fileservers that is acting as a NFS fileserver. So on both server and clients connected to that GFS/NFS server, the files are missing. On the other GFS/NFS fileservers and clients connected to those servers the files are still available. So the same ls command on fileserver 2,3,4,5 gives a normal view of all the files. Regards, Jaap > > > Hi Wendy, > > > > > > > > > Look like NFS client side caching issue. What's the kernel > > version you > > have in the nfs client machine (do a "uname -a")? > > > > It is the same as our NFS server, kernel 2.6.17.11 > > Regards, > Jaap > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From rpeterso at redhat.com Wed Sep 27 14:21:44 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 27 Sep 2006 09:21:44 -0500 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB016936E9@douwes.ka.sara.nl> References: <339554D0FE9DD94A8E5ACE4403676CEB016936E9@douwes.ka.sara.nl> Message-ID: <451A88F8.4010503@redhat.com> Jaap Dijkshoorn wrote: > Wendy, > > Just to be complete. We see the problems occure on one of our GFS > fileservers that is acting as a NFS fileserver. So on both server and > clients connected to that GFS/NFS server, the files are missing. On the > other GFS/NFS fileservers and clients connected to those servers the > files are still available. > > So the same ls command on fileserver 2,3,4,5 gives a normal view of all > the files. 
> > Regards, > Jaap > Hi Jaap, Your problem may very well be the same as bugzilla bz 190756: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=190756 I've been trying unsuccessfully to recreate this problem at RHEL4U4. I need to reproduce the problem in our lab in order to debug the problem. I suspect that NFS changes made in U4 may have changed the timing to make the problem much less likely to occur. I may need to go back to U3 to recreate it. If you can give me any information that can help me reproduce the problem, I would be grateful. Regards, Bob Peterson Red Hat Cluster Suite From jaap at sara.nl Wed Sep 27 15:03:45 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Wed, 27 Sep 2006 17:03:45 +0200 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <451A88F8.4010503@redhat.com> Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB016936FE@douwes.ka.sara.nl> Hi Bob, > > > Hi Jaap, > > Your problem may very well be the same as bugzilla bz 190756: > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=190756 It looks like it! > > I've been trying unsuccessfully to recreate this problem at RHEL4U4. > I need to reproduce the problem in our lab in order to debug > the problem. > I suspect that NFS changes made in U4 may have changed the timing > to make the problem much less likely to occur. I may need to go back > to U3 to recreate it. > > If you can give me any information that can help me reproduce the > problem, I would be grateful. I have asked the user who is having this problem what exactly is happening with those files during his job. I hope this will give us a clue in what ways those files are touched and/or deleted etc. All files are read/write by the users through NFS. But the strange thing is that on 4 of the 5 servers the files are still available, on GFS as well on the clients through NFS. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Thanks already for the effort. I hope we can tackle this bug! Best Regards, Jaap From peter.huesser at psi.ch Wed Sep 27 16:13:54 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Wed, 27 Sep 2006 18:13:54 +0200 Subject: [Linux-cluster] Realserver configuration using loadbalancer In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A3F9@MAILBOX0A.psi.ch> Message-ID: <8E2924888511274B95014C2DD906E58AD1A460@MAILBOX0A.psi.ch> I found the solution. One also has to manipulate the real webservers.
This is not described in the official "Red Hat Cluster Suite" documentation. Pedro ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Huesser Peter Sent: Montag, 25. September 2006 22:54 To: linux clustering Subject: RE: [Linux-cluster] piranha By the way: I started the "pulse" daemon in the debug modus ("pulse -v -n") and got the following output: nanny: Opening TCP socket to remote service port 80... nanny: Connecting socket to remote address... nanny: DEBUG -- Posting CONNECT poll() nanny: Sending len=16, text="GET / HTTP/1.0 " nanny: DEBUG -- Posting READ poll() nanny: DEBUG -- READ poll() completed (1,1) nanny: Posting READ I/O; expecting 4 character(s)... nanny: DEBUG -- READ returned 4 nanny: READ expected len=4, text="HTTP" nanny: READ got len=4, text=HTTP nanny: avail: 1 active: 1: count: 13 pulse: DEBUG -- setting SEND_heartbeat timer pulse: DEBUG -- setting SEND_heartbeat timer pulse: DEBUG -- setting NEED_heartbeat timer pulse: DEBUG -- setting SEND_heartbeat timer nanny: Opening TCP socket to remote service port 80... ... For me this looks as if everything is ok. "nanny" sends from time to time a "GET / HTTP/1.0" request and the response ("HTTP" only first four letters) correspondence with what is expected. The problem is that pulse is not opening port 80 on the loadbalancer for reveiving http-request. A "netstat -anp" verifies this. Hello I sent a similar question a few days ago and did not get any answer. Maybe the time (Saturday night) was unfavorable or the question was not that clear. So I try it once more: I want to run a loadbalancer in front of two webserver (using direct routing). But if I connect to port 80 of the loadbalancer I get a "connection refused". 1) Did anybody had a similar problem? 2) How can I increase the debuglevel? Thanks' in advance Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Wed Sep 27 16:59:56 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 27 Sep 2006 11:59:56 -0500 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB016936FE@douwes.ka.sara.nl> References: <339554D0FE9DD94A8E5ACE4403676CEB016936FE@douwes.ka.sara.nl> Message-ID: <451AAE0C.70702@redhat.com> Jaap Dijkshoorn wrote: > It looks like it! > > I have aksed the user who is having this problem, what exactly is > happening with those files during his job. I hope this will give us a > clue in what ways those files are touched and/or deleted etc. > > All files are read/write by the users through NFS. But that strange > thing is that on 4 of the 5 servers the files are still available, on > GFS as well on the clients through NFS. > > thanks already for the effort. I hope we can tackle this bug! > > Best Regards, > Jaap > Hi Jaap, Soon after I sent the last email, I did recreate the problem here in our lab, though it was after several days of trying. That's good: It means the U4 is very stable, and it means I can probably work on the problem without the need for further information from people in the field. I did just update the bugzilla, but here's what I know so far: This is hard to explain, so let me simplify by calling "A" the cluster node that shows the files correctly, and "B" the cluster node that say the files are missing. Let's further say that an example "missing" file is: /mnt/gfs/subdir/xyz. 
So "ls /mnt/gfs/subdir/xyz" from "A" shows the file correctly, while the same command from "B" produces "No such file or directory". The biggest clue I've found today is this: It looks as if "B" somehow seems to have the wrong inode cached for "subdir". In other words, a stat command run on the directory "/mnt/gfs/subdir" shows the wrong directory inode (possibly a deleted subdirectory?) on "B" whereas "A" has the correct inode for "subdir" with the same stat command. I'm not sure yet if this incorrect cached inode is coming from GFS, or whether it's in the Linux vfs. I'm still investigating. Please update the bugzilla if you get more information. In the meanwhile, I'll continue working on the problem and I'll keep the bugzilla up to date when I find out more. Regards, Bob Peterson Red Hat Cluster Suite From tmornini at engineyard.com Wed Sep 27 20:29:49 2006 From: tmornini at engineyard.com (Tom Mornini) Date: Wed, 27 Sep 2006 13:29:49 -0700 Subject: [Linux-cluster] Re: [Xen-users] what do you recommend for cluster fs ?? In-Reply-To: References: <68729346-BD22-4B3D-84B1-948F79D72CDA@engineyard.com> Message-ID: <738F62ED-86F1-4DEA-9F9D-A97B09327137@engineyard.com> On Sep 27, 2006, at 12:56 PM, Anand Gupta wrote: > Hello Tom > > Thanks for the response. You're welcome. > We use CLVM. > > Would you mind sharing howto / documentation on how to get CLVM and > GFS setup ? I found a consultant to help out. It's a difficult configuration and requires a lot of time, patience, and expertise. We liked him so much he's now available for consulting rates through these other companies that I'm involved in: www.engineyard.com www.qualityhumans.com -- -- Tom Mornini From celso at webbertek.com.br Thu Sep 28 04:36:16 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 28 Sep 2006 01:36:16 -0300 Subject: [Linux-cluster] IPMI fencing on an IBM x366 In-Reply-To: <200609251411.k8PEBg406654@xos037.xos.nl> References: <200609251411.k8PEBg406654@xos037.xos.nl> Message-ID: <451B5140.90404@webbertek.com.br> Hi Jos, I've configured a pair of x366s in the past successfully to use the builtin IPMI device as a fence device, these systems run RHELv3u6 and Cluster Suite v3u6. They seem to work quite well until today. Since the systems are into production in a client, we didn't upgrade anything since then but it seems to work ok. I've not tried anything with IBM servers under Cluster Suite v4, though, but I imagine it works ok too. The IPMI device on the x366 servers was configured in the traditional way, login+password with admin rights, IP+netmask configured and everything worked ok. IBM's implementation of IPMI over LAN respond to "pings" (ICMP echos), while Dell's don't. But you will not be able to connect to the IPMI device from the machine itself, you have to try from an outside machine, ok? Hope this gives you some light. Best regards, Celso. Jos Vos escreveu: > Hi, > > Is it possible to use the built-in IPMI support of an IBM x366 server > with RHEL CS? > > I think it is not compatible with RSA II, and I also tried IPMI Lan, > but none of them seems to work. > > Any suggestions? > Thanks, > > -- > -- Jos Vos > -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 > -- Amsterdam, The Netherlands | Fax: +31 20 6948204 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- *Celso Kopp Webber* -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. 
From celso at webbertek.com.br Thu Sep 28 04:36:30 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 28 Sep 2006 01:36:30 -0300 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? Message-ID: <451B514E.4000607@webbertek.com.br> Hello all, I'm having a strange problem. Here is the scenario: * 2-node GFS cluster on 2 Dell PE-2900 servers; * 1 Dell|EMC CX300 storage, with servers direct attached using two HBAs each; * RHEL AS 4 Update 4, no updates applied; * Red Hat Cluster Suite v4 Update 4, no updates applied; * Red Hat GFS Update 4, no updates applied; * Using IPMI over LAN fencing. The Cluster was configured quite straight forward, the GFS filesystems worked fine. Since the Dell PowerEdge x9xx series now support IPMI on both LOMs (onboard NICs) as a configurable failover option, we decided to "channel bond" eth0 and eth1 (onboard NICs) together to have both the normal network traffic and also the heartbeat traffic over a redundant channel (bond0). Since IPMI works over both NICs, fencing is expected to work even if one of the NICs/cables goes down. Now the problem: whenever I pull both cables from one server, the servers almost simultaneously detect each other as offline (the logs show "serverX lost too many heartbeats, removing it from the Cluster"). A few seconds later and one server fences the other, at the same time!!! As far as I can tell, there is some delay between the sending of the "power off" IPMI command and the real poweroff from the IPMI embedded controller. By the way, there is no "normal shutdown" caused by ACPI or APM, these are both turned off in the servers. So it seems that when the first server kills the other, there is enough time to the second server to send the IPMI command to kill the first server also, and a few seconds later both are turned off, so my redundant environment goes down alltogether. Question: does someone is aware of a solution for this? Is there a way a server can notify the other that it is removing it from the cluster? Maybe using a shared disk? By the way, I didn't experimented with the new shared disk feature under CS v4, only with CS v3. Thank you all in advance. Regards, Celso. -- *Celso Kopp Webber* -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. From riaan at obsidian.co.za Thu Sep 28 08:58:13 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Thu, 28 Sep 2006 10:58:13 +0200 Subject: [Linux-cluster] piranha In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A464@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD1A464@MAILBOX0A.psi.ch> Message-ID: <451B8EA5.5070202@obsidian.co.za> hi Pedro Care to tell us what you did to the real servers? If this is an omission in the documentation, please file a bugzilla against the RHCS manual. tnx Riaan Huesser Peter wrote: > I found the solution. One also has to manipulate the real webservers. > This is not described in the official ?Red Hat Cluster Suite? documentation. > > > > Pedro > > > > ------------------------------------------------------------------------ > > *From:* linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] *On Behalf Of *Huesser Peter > *Sent:* Montag, 25. September 2006 22:54 > *To:* linux clustering > *Subject:* RE: [Linux-cluster] piranha > > > > By the way: I started the ?pulse? daemon in the debug modus (?pulse ?v > ?n?) and got the following output: > > > > nanny: Opening TCP socket to remote service port 80... 
> > nanny: Connecting socket to remote address... > > nanny: DEBUG -- Posting CONNECT poll() > > nanny: Sending len=16, text="GET / HTTP/1.0 > > > > " > > nanny: DEBUG -- Posting READ poll() > > nanny: DEBUG -- READ poll() completed (1,1) > > nanny: Posting READ I/O; expecting 4 character(s)... > > nanny: DEBUG -- READ returned 4 > > nanny: READ expected len=4, text="HTTP" > > nanny: READ got len=4, text=HTTP > > nanny: avail: 1 active: 1: count: 13 > > pulse: DEBUG -- setting SEND_heartbeat timer > > pulse: DEBUG -- setting SEND_heartbeat timer > > pulse: DEBUG -- setting NEED_heartbeat timer > > pulse: DEBUG -- setting SEND_heartbeat timer > > nanny: Opening TCP socket to remote service port 80... > > ? > > > > For me this looks as if everything is ok. ?nanny? sends from time to > time a ?GET / HTTP/1.0? request and the response (?HTTP? only first four > letters) correspondence with what is expected. The problem is that pulse > is not opening port 80 on the loadbalancer for reveiving http-request. A > ?netstat ?anp? verifies this. > > > > > > Hello > > > > I sent a similar question a few days ago and did not get any answer. > Maybe the time (Saturday night) was unfavorable or the question was not > that clear. So I try it once more: > > > > I want to run a loadbalancer in front of two webserver (using direct > routing). But if I connect to port 80 of the loadbalancer I get a > ?connection refused?. > > > > 1) Did anybody had a similar problem? > > 2) How can I increase the debuglevel? > > > > Thanks? in advance > > > > Pedro > > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From isplist at logicore.net Thu Sep 28 11:21:33 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 28 Sep 2006 06:21:33 -0500 Subject: [Linux-cluster] McDara Message-ID: <200692862133.433361@leena> Anyone using a McData ED-5000 or ED-6064 as their fence/hub who might be able to help me? Mike From csim at ices.utexas.edu Thu Sep 28 12:56:15 2006 From: csim at ices.utexas.edu (Chris Simmons) Date: Thu, 28 Sep 2006 07:56:15 -0500 Subject: [Linux-cluster] piranha In-Reply-To: <451B8EA5.5070202@obsidian.co.za> References: <8E2924888511274B95014C2DD906E58AD1A464@MAILBOX0A.psi.ch> <451B8EA5.5070202@obsidian.co.za> Message-ID: <20060928125615.GA21742@ices.utexas.edu> I imagine he had to add an iptables rule to his real servers to utilize direct routing. Older documentation contains direct routing examples, however, the latest incarnation does not. It only contains examples for NAT. Something like the following should work: iptables -t nat -A PREROUTING -p tcp -d VIP --dport 80 -j REDIRECT Chris * Riaan van Niekerk [2006-09-28 10:58:13 +0200]: > hi Pedro > > Care to tell us what you did to the real servers? > > If this is an omission in the documentation, please file a bugzilla > against the RHCS manual. > > tnx > Riaan > > Huesser Peter wrote: > >I found the solution. One also has to manipulate the real webservers. > >This is not described in the official ?Red Hat Cluster Suite? > >documentation. 
> > > > > > > > Pedro > > > > > > > >------------------------------------------------------------------------ > > > >*From:* linux-cluster-bounces at redhat.com > >[mailto:linux-cluster-bounces at redhat.com] *On Behalf Of *Huesser Peter > >*Sent:* Montag, 25. September 2006 22:54 > >*To:* linux clustering > >*Subject:* RE: [Linux-cluster] piranha > > > > > > > >By the way: I started the ?pulse? daemon in the debug modus (?pulse > >?v ?n?) and got the following output: > > > > > > > >nanny: Opening TCP socket to remote service port 80... > > > >nanny: Connecting socket to remote address... > > > >nanny: DEBUG -- Posting CONNECT poll() > > > >nanny: Sending len=16, text="GET / HTTP/1.0 > > > > > > > >" > > > >nanny: DEBUG -- Posting READ poll() > > > >nanny: DEBUG -- READ poll() completed (1,1) > > > >nanny: Posting READ I/O; expecting 4 character(s)... > > > >nanny: DEBUG -- READ returned 4 > > > >nanny: READ expected len=4, text="HTTP" > > > >nanny: READ got len=4, text=HTTP > > > >nanny: avail: 1 active: 1: count: 13 > > > >pulse: DEBUG -- setting SEND_heartbeat timer > > > >pulse: DEBUG -- setting SEND_heartbeat timer > > > >pulse: DEBUG -- setting NEED_heartbeat timer > > > >pulse: DEBUG -- setting SEND_heartbeat timer > > > >nanny: Opening TCP socket to remote service port 80... > > > >? > > > > > > > >For me this looks as if everything is ok. ?nanny? sends from time to > >time a ?GET / HTTP/1.0? request and the response (?HTTP? only > >first four letters) correspondence with what is expected. The problem is > >that pulse is not opening port 80 on the loadbalancer for reveiving > >http-request. A ?netstat ?anp? verifies this. > > > > > > > > > > > >Hello > > > > > > > >I sent a similar question a few days ago and did not get any answer. > >Maybe the time (Saturday night) was unfavorable or the question was not > >that clear. So I try it once more: > > > > > > > >I want to run a loadbalancer in front of two webserver (using direct > >routing). But if I connect to port 80 of the loadbalancer I get a > >?connection refused?. > > > > > > > >1) Did anybody had a similar problem? > > > >2) How can I increase the debuglevel? > > > > > > > >Thanks? in advance > > > > > > > > Pedro > > > > > > > > > >------------------------------------------------------------------------ > > > >-- > >Linux-cluster mailing list > >Linux-cluster at redhat.com > >https://www.redhat.com/mailman/listinfo/linux-cluster > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From peter.huesser at psi.ch Thu Sep 28 14:01:00 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Thu, 28 Sep 2006 16:01:00 +0200 Subject: [Linux-cluster] piranha In-Reply-To: <451B8EA5.5070202@obsidian.co.za> Message-ID: <8E2924888511274B95014C2DD906E58A011078DF@MAILBOX0A.psi.ch> > > hi Pedro > > Care to tell us what you did to the real servers? > I found a solution in LVS-HowTo in chapter 5.7. Here is the link: http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.arp_problem.html# 2.6_arp I had to change the /etc/sysctl.conf and let the lo:1 listen to the VIP without responding to arp requests. Pedro From dbrieck at gmail.com Thu Sep 28 16:15:43 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 28 Sep 2006 12:15:43 -0400 Subject: [Linux-cluster] Hard lockups during file transfer to GNBD/GFS device Message-ID: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> Here is our setup: 2 GNBD servers attached to a shared SCSI array. 
Each (of 9) nodes uses multipath to import the shared device from both servers. We are also using GFS on to of that for our shared storage. What is happening is that I need to transfer a large number of files (about 1.5 million) from a nodes local storage to the network storage. I'm using rsync locally to move all the files. Orginally my problem was that the oom killer would start running partway through the transfer and the machine would then be unusable (however it was still up enough that it wasn't fenced). Here is that log: Sep 27 12:21:43 db2 kernel: oom-killer: gfp_mask=0xd0 Sep 27 12:21:43 db2 kernel: Mem-info: Sep 27 12:21:43 db2 kernel: DMA per-cpu: Sep 27 12:21:43 db2 kernel: cpu 0 hot: low 2, high 6, batch 1 Sep 27 12:21:43 db2 kernel: cpu 0 cold: low 0, high 2, batch 1 Sep 27 12:21:43 db2 kernel: cpu 1 hot: low 2, high 6, batch 1 Sep 27 12:21:43 db2 kernel: cpu 1 cold: low 0, high 2, batch 1 Sep 27 12:21:43 db2 kernel: cpu 2 hot: low 2, high 6, batch 1 Sep 27 12:21:43 db2 kernel: cpu 2 cold: low 0, high 2, batch 1 Sep 27 12:21:43 db2 kernel: cpu 3 hot: low 2, high 6, batch 1 Sep 27 12:21:43 db2 kernel: cpu 3 cold: low 0, high 2, batch 1 Sep 27 12:21:43 db2 kernel: cpu 4 hot: low 2, high 6, batch 1 Sep 27 12:21:44 db2 kernel: cpu 4 cold: low 0, high 2, batch 1 Sep 27 12:21:53 db2 in[15473]: 1159374113||chericee at herr-sacco.com |2852|timeout|1 Sep 27 12:21:54 db2 kernel: cpu 5 hot: low 2, high 6, batch 1 Sep 27 12:21:54 db2 kernel: cpu 5 cold: low 0, high 2, batch 1 Sep 27 12:21:54 db2 kernel: cpu 6 hot: low 2, high 6, batch 1 Sep 27 12:21:54 db2 kernel: cpu 6 cold: low 0, high 2, batch 1 Sep 27 12:21:54 db2 kernel: cpu 7 hot: low 2, high 6, batch 1 Sep 27 12:21:54 db2 kernel: cpu 7 cold: low 0, high 2, batch 1 Sep 27 12:21:54 db2 kernel: Normal per-cpu: Sep 27 12:21:54 db2 kernel: cpu 0 hot: low 32, high 96, batch 16 Sep 27 12:21:54 db2 kernel: cpu 0 cold: low 0, high 32, batch 16 Sep 27 12:21:54 db2 kernel: cpu 1 hot: low 32, high 96, batch 16 Sep 27 12:21:54 db2 kernel: cpu 1 cold: low 0, high 32, batch 16 Sep 27 12:21:54 db2 kernel: cpu 2 hot: low 32, high 96, batch 16 Sep 27 12:27:59 db2 syslogd 1.4.1: restart. Sep 27 12:27:59 db2 syslog: syslogd startup succeeded Sep 27 12:27:59 db2 kernel: klogd 1.4.1, log source = /proc/kmsg started. Sep 27 12:27:59 db2 kernel: Linux version 2.6.9-42.0.2.ELsmp ( buildsvn at build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Wed Aug 23 00:17:26 CDT 2006 I found a few postings saying that using the hugemem kernel would solve the problems (they claimed it was a known SMP bug by redhat) so all my systems are now running on that kernel. It did solve the out of memory problem, but it seems to have introduced some new ones. 
Here are the logs from the most recent crashes: Sep 28 11:15:05 db2 kernel: do_IRQ: stack overflow: 412 Sep 28 11:15:05 db2 kernel: [<02107c6b>] do_IRQ+0x49/0x1ae<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 Sep 28 11:15:05 db2 kernel: printing eip: Sep 28 11:15:05 db2 kernel: 0212928c Sep 28 11:15:05 db2 kernel: *pde = 00004001 Sep 28 11:15:05 db2 kernel: Oops: 0002 [#1] Sep 28 11:15:05 db2 kernel: SMP Sep 28 11:15:05 db2 kernel: Modules linked in: mptctl mptbase dell_rbu nfsd exportfs lockd nfs_acl parport_pc lp parport autofs4 i 2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dm_round_robin gnbd(U) dlm(U) cman(U) sunrpc ipmi_devintf ipmi_si ipmi_msghandl er iptable_filter iptable_mangle iptable_nat ip_conntrack ip_tables md5 ipv6 dm_multipath joydev button battery ac uhci_hcd ehci_h cd hw_random e1000 bonding(U) floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod Sep 28 11:15:05 db2 kernel: CPU: 1548750336 Sep 28 11:15:05 db2 kernel: EIP: 0060:[<0212928c>] Not tainted VLI Sep 28 11:15:05 db2 kernel: EFLAGS: 00010002 (2.6.9-42.0.2.ELhugemem) Sep 28 11:15:05 db2 kernel: EIP is at internal_add_timer+0x84/0x8c Sep 28 11:15:05 db2 kernel: eax: 00000000 ebx: 023b7900 ecx: 023b8680 edx: 02447620 Sep 28 11:15:05 db2 kernel: esi: 00000000 edi: 023b7900 ebp: 02ee0c94 esp: 48552fb4 Sep 28 11:15:05 db2 kernel: ds: 007b es: 007b ss: 0068 Sep 28 11:15:05 db2 kernel: Process (pid: 1, threadinfo=48552000 task=6d641a00) Sep 28 11:17:54 db2 syslogd 1.4.1: restart. Sep 28 11:17:54 db2 syslog: syslogd startup succeeded Sep 28 11:17:54 db2 kernel: klogd 1.4.1, log source = /proc/kmsg started. Sep 28 11:17:54 db2 syslog: klogd startup succeeded Sep 28 11:17:54 db2 kernel: Linux version 2.6.9-42.0.2.ELhugemem ( buildsvn at build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6- 3)) #1 SMP Wed Aug 23 00:38:38 CDT 2006 The GNBD servers stay online and don't have any problems, it's just the client where all the trouble is coming from. Is this a bug or is something not setup right? If you need more info I'll be happy to provide it. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Thu Sep 28 17:09:17 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Thu, 28 Sep 2006 10:09:17 -0700 (PDT) Subject: [Linux-cluster] clurmtabd Message-ID: <20060928170917.70680.qmail@web34214.mail.mud.yahoo.com> Is there anyway to have clurmtabd monitor all the subdirectories of a mount point. (ie. specify a parent directory but have nodes mounting off some of the subdirectories) Or do you always have to have a clurmtabd running for each subdirectory mount point --------------------------------- Talk is cheap. Use Yahoo! Messenger to make PC-to-Phone calls. Great rates starting at 1?/min. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dbrieck at gmail.com Thu Sep 28 19:08:58 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 28 Sep 2006 15:08:58 -0400 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> References: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> Message-ID: <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> On 9/28/06, David Brieck Jr. wrote: > Here is our setup: 2 GNBD servers attached to a shared SCSI array. Each (of 9) nodes uses multipath to import the shared device from both servers. 
We are also using GFS on to of that for our shared storage. > > What is happening is that I need to transfer a large number of files (about 1.5 million) from a nodes local storage to the network storage. I'm using rsync locally to move all the files. Orginally my problem was that the oom killer would start running partway through the transfer and the machine would then be unusable (however it was still up enough that it wasn't fenced). Here is that log: > > > > I found a few postings saying that using the hugemem kernel would solve the problems (they claimed it was a known SMP bug by redhat) so all my systems are now running on that kernel. It did solve the out of memory problem, but it seems to have introduced some new ones. Here are the logs from the most recent crashes: > > > > > The GNBD servers stay online and don't have any problems, it's just the client where all the trouble is coming from. Is this a bug or is something not setup right? > > If you need more info I'll be happy to provide it. > > Thanks. I just tried to more the same data by tar-ing it up to the network, same result. Again, this is about 94GB and 1.5 million files that I seem to be unable to move from local storage to shared. Anyone have any suggestions? From dbrieck at gmail.com Thu Sep 28 19:27:20 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 28 Sep 2006 15:27:20 -0400 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> References: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> Message-ID: <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> On 9/28/06, David Brieck Jr. wrote: > On 9/28/06, David Brieck Jr. wrote: > > Here is our setup: 2 GNBD servers attached to a shared SCSI array. Each (of 9) nodes uses multipath to import the shared device from both servers. We are also using GFS on to of that for our shared storage. > > > > What is happening is that I need to transfer a large number of files (about 1.5 million) from a nodes local storage to the network storage. I'm using rsync locally to move all the files. Orginally my problem was that the oom killer would start running partway through the transfer and the machine would then be unusable (however it was still up enough that it wasn't fenced). Here is that log: > > > > > > > > I found a few postings saying that using the hugemem kernel would solve the problems (they claimed it was a known SMP bug by redhat) so all my systems are now running on that kernel. It did solve the out of memory problem, but it seems to have introduced some new ones. Here are the logs from the most recent crashes: > > > > > > > > > > The GNBD servers stay online and don't have any problems, it's just the client where all the trouble is coming from. Is this a bug or is something not setup right? > > > > If you need more info I'll be happy to provide it. > > > > Thanks. > > > I just tried to more the same data by tar-ing it up to the network, > same result. Again, this is about 94GB and 1.5 million files that I > seem to be unable to move from local storage to shared. Anyone have > any suggestions? 
> I forgot to include the kernel message, see below: Sep 28 15:01:56 db2 kernel: do_IRQ: stack overflow: 460 Sep 28 15:01:56 db2 kernel: [<02107c6b>] do_IRQ+0x49/0x1ae Sep 28 15:01:56 db2 kernel: [] tcp_in_window+0x1c6/0x3ad [ip_conntrack] Sep 28 15:01:56 db2 kernel: [] tcp_packet+0x338/0x412 [ip_conntrack] Sep 28 15:01:56 db2 kernel: [] __ip_conntrack_find+0xf/0xa1 [ip_conntrack] Sep 28 15:01:56 db2 kernel: [] ip_conntrack_in+0x1dc/0x2a6 [ip_conntrack] Sep 28 15:01:56 db2 kernel: [<0228227b>] nf_iterate+0x40/0x81 Sep 28 15:01:56 db2 kernel: [<022927d8>] dst_output+0x0/0x1a Sep 28 15:01:56 db2 kernel: [<02282581>] nf_hook_slow+0x47/0xbc Sep 28 15:01:56 db2 kernel: [<022927d8>] dst_output+0x0/0x1a Sep 28 15:01:56 db2 kernel: [<02293093>] ip_queue_xmit+0x395/0x3f9 Sep 28 15:04:39 db2 syslogd 1.4.1: restart. From teigland at redhat.com Thu Sep 28 19:58:44 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 28 Sep 2006 14:58:44 -0500 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> References: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> Message-ID: <20060928195844.GB25242@redhat.com> On Thu, Sep 28, 2006 at 03:27:20PM -0400, David Brieck Jr. wrote: > I forgot to include the kernel message, see below: > > Sep 28 15:01:56 db2 kernel: do_IRQ: stack overflow: 460 > Sep 28 15:01:56 db2 kernel: [<02107c6b>] do_IRQ+0x49/0x1ae > Sep 28 15:01:56 db2 kernel: [] tcp_in_window+0x1c6/0x3ad > [ip_conntrack] > Sep 28 15:01:56 db2 kernel: [] tcp_packet+0x338/0x412 > [ip_conntrack] > Sep 28 15:01:56 db2 kernel: [] __ip_conntrack_find+0xf/0xa1 > [ip_conntrack] > Sep 28 15:01:56 db2 kernel: [] ip_conntrack_in+0x1dc/0x2a6 > [ip_conntrack] > Sep 28 15:01:56 db2 kernel: [<0228227b>] nf_iterate+0x40/0x81 > Sep 28 15:01:56 db2 kernel: [<022927d8>] dst_output+0x0/0x1a > Sep 28 15:01:56 db2 kernel: [<02282581>] nf_hook_slow+0x47/0xbc > Sep 28 15:01:56 db2 kernel: [<022927d8>] dst_output+0x0/0x1a > Sep 28 15:01:56 db2 kernel: [<02293093>] ip_queue_xmit+0x395/0x3f9 > Sep 28 15:04:39 db2 syslogd 1.4.1: restart. Could you try it without multipath? You have quite a few layers there. Dave From teigland at redhat.com Thu Sep 28 20:01:00 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 28 Sep 2006 15:01:00 -0500 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? In-Reply-To: <451B514E.4000607@webbertek.com.br> References: <451B514E.4000607@webbertek.com.br> Message-ID: <20060928200100.GC25242@redhat.com> On Thu, Sep 28, 2006 at 01:36:30AM -0300, Celso K. Webber wrote: > So it seems that when the first server kills the other, there is enough > time to the second server to send the IPMI command to kill the first > server also, and a few seconds later both are turned off, so my > redundant environment goes down alltogether. > > Question: does someone is aware of a solution for this? Is there a way a > server can notify the other that it is removing it from the cluster? > Maybe using a shared disk? By the way, I didn't experimented with the > new shared disk feature under CS v4, only with CS v3. The new qdisk should be a good way to solve this. Dave From celso at webbertek.com.br Fri Sep 29 13:15:37 2006 From: celso at webbertek.com.br (Celso K. 
Webber) Date: Fri, 29 Sep 2006 10:15:37 -0300 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? In-Reply-To: <20060928200100.GC25242@redhat.com> References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> Message-ID: <451D1C79.3050601@webbertek.com.br> Hello David, Do you know (or someone else) where can I find documentation about the new qdisk mechanism? I imagine I should configure it by editing cluster.conf directly, isn't it? The GUI does not mantion the "shared state" configuration as it did under Cluster Suite v3. Thank you all. Celso. David Teigland escreveu: > On Thu, Sep 28, 2006 at 01:36:30AM -0300, Celso K. Webber wrote: >> Question: does someone is aware of a solution for this? Is there a way a >> server can notify the other that it is removing it from the cluster? >> Maybe using a shared disk? By the way, I didn't experimented with the >> new shared disk feature under CS v4, only with CS v3. > > The new qdisk should be a good way to solve this. > Dave > > -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. From jparsons at redhat.com Fri Sep 29 13:27:12 2006 From: jparsons at redhat.com (James Parsons) Date: Fri, 29 Sep 2006 09:27:12 -0400 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? In-Reply-To: <451D1C79.3050601@webbertek.com.br> References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> <451D1C79.3050601@webbertek.com.br> Message-ID: <451D1F30.5030807@redhat.com> Celso K. Webber wrote: > Hello David, > > Do you know (or someone else) where can I find documentation about the > new qdisk mechanism? > > I imagine I should configure it by editing cluster.conf directly, isn't > it? The GUI does not mantion the "shared state" configuration as it did > under Cluster Suite v3. The GUI will support it in 4U5 and in RHEL5. -J > > > Thank you all. > > Celso. > > David Teigland escreveu: > >> On Thu, Sep 28, 2006 at 01:36:30AM -0300, Celso K. Webber wrote: >> >>> Question: does someone is aware of a solution for this? Is there a >>> way a server can notify the other that it is removing it from the >>> cluster? Maybe using a shared disk? By the way, I didn't >>> experimented with the new shared disk feature under CS v4, only with >>> CS v3. >> >> >> The new qdisk should be a good way to solve this. >> Dave >> >> > From lhh at redhat.com Fri Sep 29 13:40:25 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 29 Sep 2006 09:40:25 -0400 Subject: [Linux-cluster] clurmtabd In-Reply-To: <20060928170917.70680.qmail@web34214.mail.mud.yahoo.com> References: <20060928170917.70680.qmail@web34214.mail.mud.yahoo.com> Message-ID: <1159537225.27578.9.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-28 at 10:09 -0700, Rick Rodgers wrote: > Is there anyway to have clurmtabd monitor all the subdirectories > of a mount point. (ie. specify a parent directory but have nodes > mounting off some of the subdirectories) Or do you always have to have > a clurmtabd running for each subdirectory mount point It matches based on the parent mount point, and should sync all subdirectories present in /var/lib/nfs/rmtab... e.g. clurmtabd /foo Clients which mount /foo/bar, /foo/bar/1, etc. 
should all have entries in /foo/.clumanager/rmtab -- Lon From riaan at obsidian.co.za Fri Sep 29 13:41:51 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Fri, 29 Sep 2006 15:41:51 +0200 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? In-Reply-To: <451D1C79.3050601@webbertek.com.br> References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> <451D1C79.3050601@webbertek.com.br> Message-ID: <451D229F.80005@obsidian.co.za> qdisk is part of newer versions of cman. "man qdisk" is the best source of information (that I am aware of) for the new quorum disk functionality. Unfortunately the Cluster Suite docs have not been updated with the qdisk subsystem. However, the CS update 4 release notes mention it. http://www.redhat.com/docs/manuals/csgfs/ slightly off-topic rant: unfortunately it is very difficult to tell when Red Hat update their documentation. At the above link, there is a red "(Updated)" if something is updated recently, but "recently" is a very vague term. I have even submitted Bugzilla Bug 195890: RFE "(Updated)" and "(New)" labels in documentation should have dates without success. Riaan Celso K. Webber wrote: > Hello David, > > Do you know (or someone else) where can I find documentation about the > new qdisk mechanism? > > I imagine I should configure it by editing cluster.conf directly, isn't > it? The GUI does not mantion the "shared state" configuration as it did > under Cluster Suite v3. > > Thank you all. > > Celso. > > David Teigland escreveu: >> On Thu, Sep 28, 2006 at 01:36:30AM -0300, Celso K. Webber wrote: >>> Question: does someone is aware of a solution for this? Is there a >>> way a server can notify the other that it is removing it from the >>> cluster? Maybe using a shared disk? By the way, I didn't experimented >>> with the new shared disk feature under CS v4, only with CS v3. >> >> The new qdisk should be a good way to solve this. >> Dave >> >> > -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From dbrieck at gmail.com Fri Sep 29 13:51:08 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Fri, 29 Sep 2006 09:51:08 -0400 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <20060928195844.GB25242@redhat.com> References: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> <20060928195844.GB25242@redhat.com> Message-ID: <8c1094290609290651r62cec5f9n28278d6a81c3e6ef@mail.gmail.com> On 9/28/06, David Teigland wrote: > > Could you try it without multipath? You have quite a few layers there. > Dave > > Thanks for the response. I unloaded gfs, clvm, gnbd and multipath, then reloaded gnbd, clvm and gfs. It was only talking to one of the gnbd servers and without multipath. Here's the log from this crash. It seems to have more info in it. I'm kinda confused why it still has references to multipath though. I unloaded the multipath module so I'm not sure why it's still in there.
Sep 29 09:39:26 db2 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000 Sep 29 09:39:26 db2 kernel: printing eip: Sep 29 09:39:26 db2 kernel: f882d427 Sep 29 09:39:26 db2 kernel: *pde = 00004001 Sep 29 09:39:26 db2 kernel: Oops: 0000 [#1] Sep 29 09:39:26 db2 kernel: SMP Sep 29 09:39:26 db2 kernel: Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) gnbd(U) mptctl mptbase dell_rbu nfsd exportfs lockd nfs_acl parport_pc lp p arport autofs4 i2c_dev i2c_core dm_round_robin dlm(U) cman(U) sunrpc ipmi_devintf ipmi_si ipmi_msghandler iptable_filter iptable_mangle iptable_nat ip_conntr ack ip_tables md5 ipv6 dm_multipath joydev button battery ac uhci_hcd ehci_hcd hw_random e1000 bonding(U) floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm _mod megaraid_mbox megaraid_mm sd_mod scsi_mod Sep 29 09:39:26 db2 kernel: CPU: 5 Sep 29 09:39:26 db2 kernel: EIP: 0060:[] Not tainted VLI Sep 29 09:39:26 db2 kernel: EFLAGS: 00010286 (2.6.9-42.0.2.ELhugemem) Sep 29 09:39:26 db2 kernel: EIP is at journal_start+0x23/0x9e [jbd] Sep 29 09:39:26 db2 kernel: eax: 00000000 ebx: 8ca9b300 ecx: e1f0b400 edx: 00000042 Sep 29 09:39:26 db2 kernel: esi: e1f0bc00 edi: 1ef03000 ebp: 02325e78 esp: 1ef03bc0 Sep 29 09:39:26 db2 kernel: ds: 007b es: 007b ss: 0068 Sep 29 09:39:26 db2 kernel: Process rsync (pid: 20038, threadinfo=1ef03000 task=d9f178b0) Sep 29 09:39:26 db2 kernel: Stack: d406cde8 1ef03c00 00000031 f88a8c55 d406cde8 1ef03c00 0216fc5c d406cde8 Sep 29 09:39:26 db2 kernel: 0216fcf1 3d38f768 3d38f770 0000000a 02170076 00000080 00000080 00000080 Sep 29 09:39:26 db2 kernel: bf756da8 8b255598 00000000 00000086 00000000 39ffe980 021700e3 02148548 Sep 29 09:39:26 db2 kernel: Call Trace: Sep 29 09:39:26 db2 kernel: [] ext3_dquot_drop+0x14/0x3b [ext3] Sep 29 09:39:26 db2 kernel: [<0216fc5c>] clear_inode+0xb4/0x102 Sep 29 09:39:26 db2 kernel: [<0216fcf1>] dispose_list+0x47/0x6d Sep 29 09:39:26 db2 kernel: [<02170076>] prune_icache+0x193/0x1ec Sep 29 09:39:26 db2 kernel: [<021700e3>] shrink_icache_memory+0x14/0x2b Sep 29 09:39:26 db2 kernel: [<02148548>] shrink_slab+0xf8/0x161 Sep 29 09:39:26 db2 kernel: [<0214952c>] try_to_free_pages+0xd1/0x1a7 Sep 29 09:39:26 db2 kernel: [<02142f1d>] __alloc_pages+0x1b5/0x29d Sep 29 09:39:26 db2 kernel: [<02140e51>] generic_file_buffered_write+0x1a1/0x533 Sep 29 09:39:26 db2 kernel: [<0214156c>] __generic_file_aio_write_nolock+0x389/0x3b7 Sep 29 09:39:26 db2 kernel: [<021415d3>] generic_file_aio_write_nolock+0x39/0x7f Sep 29 09:39:26 db2 kernel: [<02141736>] generic_file_write_nolock+0x84/0x99 Sep 29 09:39:26 db2 kernel: [] gfs_glock_nq+0xe3/0x116 [gfs] Sep 29 09:39:26 db2 kernel: [<021204e9>] autoremove_wake_function+0x0/0x2d Sep 29 09:39:26 db2 kernel: [] gfs_trans_begin_i+0xfd/0x15a [gfs] Sep 29 09:39:26 db2 kernel: [] do_do_write_buf+0x2a6/0x452 [gfs] Sep 29 09:39:26 db2 kernel: [] do_write_buf+0x11b/0x15e [gfs] Sep 29 09:39:26 db2 kernel: [] walk_vm+0xd7/0x100 [gfs] Sep 29 09:39:26 db2 kernel: [] __gfs_write+0xa1/0xbb [gfs] Sep 29 09:39:26 db2 kernel: [] do_write_buf+0x0/0x15e [gfs] Sep 29 09:39:26 db2 kernel: [] gfs_write+0xb/0xe [gfs] Sep 29 09:39:26 db2 kernel: [<0215a52f>] vfs_write+0xb6/0xe2 Sep 29 09:39:26 db2 kernel: [<0215a5f9>] sys_write+0x3c/0x62 Sep 29 09:39:26 db2 kernel: Code: <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43 Sep 29 09:39:26 db2 kernel: in_atomic():0[expected: 0], irqs_disabled():1 Sep 29 09:39:26 db2 kernel: [<02120209>] __might_sleep+0x7d/0x88 Sep 29 09:39:26 db2 kernel: [<0215537c>] 
rw_vm+0xe4/0x29c Sep 29 09:39:26 db2 kernel: [] new_handle+0x38/0x40 [jbd] Sep 29 09:39:26 db2 kernel: [] new_handle+0x38/0x40 [jbd] Sep 29 09:39:26 db2 kernel: [<021557f3>] get_user_size+0x30/0x57 Sep 29 09:39:26 db2 kernel: [] new_handle+0x38/0x40 [jbd] Sep 29 09:39:26 db2 kernel: [<021061bb>] show_registers+0x115/0x16c Sep 29 09:39:26 db2 kernel: [<02106352>] die+0xdb/0x16b Sep 29 09:39:26 db2 kernel: [<02122a14>] vprintk+0x136/0x14a Sep 29 09:39:26 db2 kernel: [<0211b236>] do_page_fault+0x421/0x5f7 Sep 29 09:39:26 db2 kernel: [] journal_start+0x23/0x9e [jbd] Sep 29 09:39:26 db2 kernel: [<0211cec9>] activate_task+0x88/0x95 Sep 29 09:39:26 db2 kernel: [<0211d3f4>] try_to_wake_up+0x28e/0x299 Sep 29 09:39:26 db2 kernel: [<0211ae15>] do_page_fault+0x0/0x5f7 Sep 29 09:39:26 db2 kernel: [] journal_start+0x23/0x9e [jbd] Sep 29 09:39:26 db2 kernel: [] ext3_dquot_drop+0x14/0x3b [ext3] Sep 29 09:39:26 db2 kernel: [<0216fc5c>] clear_inode+0xb4/0x102 Sep 29 09:39:26 db2 kernel: [<0216fcf1>] dispose_list+0x47/0x6d Sep 29 09:39:26 db2 kernel: [<02170076>] prune_icache+0x193/0x1ec Sep 29 09:39:26 db2 kernel: [<021700e3>] shrink_icache_memory+0x14/0x2b Sep 29 09:39:26 db2 kernel: [<02148548>] shrink_slab+0xf8/0x161 Sep 29 09:39:26 db2 kernel: [<0214952c>] try_to_free_pages+0xd1/0x1a7 Sep 29 09:39:26 db2 kernel: [<02142f1d>] __alloc_pages+0x1b5/0x29d Sep 29 09:39:26 db2 kernel: [<02140e51>] generic_file_buffered_write+0x1a1/0x533 Sep 29 09:39:26 db2 kernel: [<0214156c>] __generic_file_aio_write_nolock+0x389/0x3b7 Sep 29 09:39:26 db2 kernel: [<021415d3>] generic_file_aio_write_nolock+0x39/0x7f Sep 29 09:39:26 db2 kernel: [<02141736>] generic_file_write_nolock+0x84/0x99 Sep 29 09:39:26 db2 kernel: [] gfs_glock_nq+0xe3/0x116 [gfs] Sep 29 09:39:26 db2 kernel: [<021204e9>] autoremove_wake_function+0x0/0x2d Sep 29 09:39:26 db2 kernel: [] gfs_trans_begin_i+0xfd/0x15a [gfs] Sep 29 09:39:26 db2 kernel: [] do_do_write_buf+0x2a6/0x452 [gfs] Sep 29 09:39:26 db2 kernel: [] do_write_buf+0x11b/0x15e [gfs] Sep 29 09:39:26 db2 kernel: [] walk_vm+0xd7/0x100 [gfs] Sep 29 09:39:26 db2 kernel: [] __gfs_write+0xa1/0xbb [gfs] Sep 29 09:39:26 db2 kernel: [] do_write_buf+0x0/0x15e [gfs] Sep 29 09:39:26 db2 kernel: [] gfs_write+0xb/0xe [gfs] Sep 29 09:39:26 db2 kernel: [<0215a52f>] vfs_write+0xb6/0xe2 Sep 29 09:39:26 db2 kernel: [<0215a5f9>] sys_write+0x3c/0x62 Sep 29 09:39:26 db2 kernel: Bad EIP value. Sep 29 09:39:26 db2 kernel: <0>Fatal exception: panic in 5 seconds Sep 29 09:42:17 db2 syslogd 1.4.1: restart. Thanks again for your help. From lhh at redhat.com Fri Sep 29 13:58:21 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 29 Sep 2006 09:58:21 -0400 Subject: [Linux-cluster] IPMI fencing on an IBM x366 In-Reply-To: <200609251411.k8PEBg406654@xos037.xos.nl> References: <200609251411.k8PEBg406654@xos037.xos.nl> Message-ID: <1159538301.27578.13.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-25 at 16:11 +0200, Jos Vos wrote: > Hi, > > Is it possible to use the built-in IPMI support of an IBM x366 server > with RHEL CS? > > I think it is not compatible with RSA II, and I also tried IPMI Lan, > but none of them seems to work. Maybe this patch would help? http://bugzilla.redhat.com/bugzilla/attachment.cgi?id=135803 It enables IPMI Lan+ operation; you'll need to add lanplus=1 to the fence device definition. 
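
To illustrate what that change amounts to, here is a minimal sketch of a fence
device entry in /etc/cluster/cluster.conf with the lanplus attribute added.
The device name, IP address, and credentials below are placeholders, not
values from this thread, and the exact attribute set depends on the
fence_ipmilan agent version you end up with:

    <fencedevices>
            <!-- lanplus="1" enables IPMI v2 "lanplus" operation -->
            <fencedevice agent="fence_ipmilan" name="ipmi-node1"
                         ipaddr="10.0.0.50" login="admin" passwd="secret"
                         lanplus="1"/>
    </fencedevices>

The node's <fence>/<method> block then references the device by name as
before; lanplus="1" is the only new piece.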
-- Lon

From lhh at redhat.com  Fri Sep 29 13:59:18 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 09:59:18 -0400
Subject: [Linux-cluster] Realserver configuration using loadbalancer
In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A3F9@MAILBOX0A.psi.ch>
References: <8E2924888511274B95014C2DD906E58AD1A3F9@MAILBOX0A.psi.ch>
Message-ID: <1159538358.27578.15.camel@rei.boston.devel.redhat.com>

On Tue, 2006-09-26 at 16:26 +0200, Huesser Peter wrote:
> Hello
>
> If I run a loadbalancer in front of the webservers (using piranha_gui
> and pulse) is there anything I have to configure on the real webservers?

Not usually, unless you're trying to do direct routing.

-- Lon

From lhh at redhat.com  Fri Sep 29 14:05:08 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 10:05:08 -0400
Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve?
In-Reply-To: <451D1C79.3050601@webbertek.com.br>
References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> <451D1C79.3050601@webbertek.com.br>
Message-ID: <1159538708.27578.21.camel@rei.boston.devel.redhat.com>

On Fri, 2006-09-29 at 10:15 -0300, Celso K. Webber wrote:
> Hello David,
>
> Do you (or someone else) know where I can find documentation about the
> new qdisk mechanism?

I'll assist you -- most of the documentation is in the manual pages:

man qdisk

The only "difficult" part is getting the heuristics right. In your case,
you'd want heuristics which monitor network connectivity, so that when you
pull the cables, the node (which is still alive, despite the fact that it
has lost network connectivity) will remove itself from the cluster, and the
other node will fence it.

The example in the manual page for pinging a router should be of some use,
but you may very well have a better method of determining network
connectivity.

Oh, and to note something which isn't actually mentioned in the manual page
-- the qdisk partition should be around 10 MB. :)

-- Lon

From lhh at redhat.com  Fri Sep 29 14:07:14 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 10:07:14 -0400
Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve?
In-Reply-To: <1159538708.27578.21.camel@rei.boston.devel.redhat.com>
References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> <451D1C79.3050601@webbertek.com.br> <1159538708.27578.21.camel@rei.boston.devel.redhat.com>
Message-ID: <1159538834.27578.24.camel@rei.boston.devel.redhat.com>

On Fri, 2006-09-29 at 10:05 -0400, Lon Hohberger wrote:
> On Fri, 2006-09-29 at 10:15 -0300, Celso K. Webber wrote:
> > Hello David,
> >
> > Do you (or someone else) know where I can find documentation about the
> > new qdisk mechanism?
>
> I'll assist you -- most of the documentation is in the manual pages:

Note: Please keep it on-list for posterity.

-- Lon
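
To make the heuristics advice above concrete, here is a minimal cluster.conf
sketch of a <quorumd> section with a single router-ping heuristic in the
spirit of the man page example. The device path, label, ping target, and
score values are placeholder assumptions, not tested values; man qdisk
remains the authoritative reference:

    <!-- placeholder device/label/router values, assuming the partition was
         initialised for qdisk beforehand -->
    <quorumd interval="1" tko="10" votes="1" min_score="1"
             label="cluster_qdisk" device="/dev/sdb1">
            <heuristic program="ping -c1 -w1 192.168.1.254"
                       score="1" interval="2"/>
    </quorumd>

With this setup, a node that can no longer reach the router fails the
heuristic, drops below min_score, and removes itself from the cluster, which
is the behaviour described above. The partition named in device= only needs
to be on the order of 10 MB.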
From jos at xos.nl  Fri Sep 29 16:30:39 2006
From: jos at xos.nl (Jos Vos)
Date: Fri, 29 Sep 2006 18:30:39 +0200
Subject: [Linux-cluster] IPMI fencing on an IBM x366
In-Reply-To: <1159538301.27578.13.camel@rei.boston.devel.redhat.com>; from lhh@redhat.com on Fri, Sep 29, 2006 at 09:58:21AM -0400
References: <200609251411.k8PEBg406654@xos037.xos.nl> <1159538301.27578.13.camel@rei.boston.devel.redhat.com>
Message-ID: <20060929183039.A8483@xos037.xos.nl>

On Fri, Sep 29, 2006 at 09:58:21AM -0400, Lon Hohberger wrote:
> Maybe this patch would help?
>
> http://bugzilla.redhat.com/bugzilla/attachment.cgi?id=135803
>
> It enables IPMI Lan+ operation; you'll need to add lanplus=1 to the
> fence device definition.

In the meantime I solved the problem. It was a password problem
(PASSW*O*RD vs. PASSW*0*RD :-( ).

--
-- Jos Vos
-- X/OS Experts in Open Systems BV | Phone: +31 20 6938364
-- Amsterdam, The Netherlands      | Fax:   +31 20 6948204

From lhh at redhat.com  Fri Sep 29 20:30:13 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 16:30:13 -0400
Subject: [Linux-cluster] Cannot restart service after "failed" state
In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch>
References: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch>
Message-ID: <1159561813.30820.0.camel@rei.boston.devel.redhat.com>

On Fri, 2006-09-22 at 17:42 +0200, Huesser Peter wrote:
> Hello
>
> I have defined a web service (for testing it contains an IP and two
> script resources). It sometimes happens that I produce a failed state
> of the cluster. After this I am not able to restart the service anymore,
> even after a reboot of both cluster members. Do I have to remove some
> kind of "lock" file by hand?

This sounds like bug 208011, FYI.

-- Lon

From lhh at redhat.com  Fri Sep 29 20:42:50 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 16:42:50 -0400
Subject: [Linux-cluster] Disk tie breaker -how does it work?
In-Reply-To: <20060922202843.42656.qmail@web34205.mail.mud.yahoo.com>
References: <20060922202843.42656.qmail@web34205.mail.mud.yahoo.com>
Message-ID: <1159562570.31184.2.camel@rei.boston.devel.redhat.com>

On Fri, 2006-09-22 at 13:28 -0700, Rick Rodgers wrote:
> Does anyone know much about the details of how a disk tiebreaker
> works in a two-member cluster? Or any docs to point to?

http://people.redhat.com/lhh/rhcm-3-internals.odt

(Note: OpenOffice 2.0 format)

It's mostly up to date.

-- Lon

From rodgersr at yahoo.com  Fri Sep 29 21:37:41 2006
From: rodgersr at yahoo.com (Rick Rodgers)
Date: Fri, 29 Sep 2006 14:37:41 -0700 (PDT)
Subject: [Linux-cluster] clurmtabd
In-Reply-To: <1159537225.27578.9.camel@rei.boston.devel.redhat.com>
Message-ID: <20060929213741.97281.qmail@web34212.mail.mud.yahoo.com>

It does not seem to work that way. I tested it and it only got what was
mounted on the specified directory, not the subdirectories. Has this changed
recently (in the last 2 years)?

Lon Hohberger wrote:

On Thu, 2006-09-28 at 10:09 -0700, Rick Rodgers wrote:
> Is there any way to have clurmtabd monitor all the subdirectories
> of a mount point (i.e. specify a parent directory but have nodes
> mounting off some of the subdirectories)? Or do you always have to have
> a clurmtabd running for each subdirectory mount point?

It matches based on the parent mount point, and should sync all
subdirectories present in /var/lib/nfs/rmtab... e.g.

clurmtabd /foo

Clients which mount /foo/bar, /foo/bar/1, etc. should all have entries in
/foo/.clumanager/rmtab

-- Lon

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

---------------------------------
Get your own web address for just $1.99/1st yr. We'll help. Yahoo! Small Business.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rodgersr at yahoo.com  Fri Sep 29 21:39:59 2006
From: rodgersr at yahoo.com (Rick Rodgers)
Date: Fri, 29 Sep 2006 14:39:59 -0700 (PDT)
Subject: [Linux-cluster] Disk tie breaker -how does it work?
In-Reply-To: <1159562570.31184.2.camel@rei.boston.devel.redhat.com>
Message-ID: <20060929213959.98975.qmail@web34209.mail.mud.yahoo.com>

The link to the page cannot be found.

Lon Hohberger wrote:

On Fri, 2006-09-22 at 13:28 -0700, Rick Rodgers wrote:
> Does anyone know much about the details of how a disk tiebreaker
> works in a two-member cluster? Or any docs to point to?

http://people.redhat.com/lhh/rhcm-3-internals.odt

(Note: OpenOffice 2.0 format)

It's mostly up to date.

-- Lon

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

---------------------------------
Do you Yahoo!? Everyone is raving about the all-new Yahoo! Mail.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jotheswaran at renaissance-it.com  Sat Sep 30 06:43:33 2006
From: jotheswaran at renaissance-it.com (Jotheswaran M)
Date: Sat, 30 Sep 2006 12:13:33 +0530
Subject: [Linux-cluster] Red Hat Linux AS 4 U3 Clustering
Message-ID: <7BED60E643BD1C4F8A84E3F0B411C14A0F3F31@srit_mail.renaissance-it.com>

Hi All,

I am new to this forum. I have a problem with Red Hat Linux AS 4 U3
clustering. I have used IBM xSeries 366 servers with two HBAs and DS4300 SAN
storage. I have installed and configured the OS and the clustering without
any issues. I am running Oracle 9i as the database; it has been configured in
the cluster, it works fine, and I can also fail it over and it works fine.

The problem is that if I shut down one server or remove the power cord of one
server, the cluster doesn't switch over, but if I go through a normal
shutdown, the cluster switches over. Can you guys help me to resolve this,
please?

Regards,
Jotheswaran M
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From peter.huesser at psi.ch  Sat Sep 30 07:09:51 2006
From: peter.huesser at psi.ch (Huesser Peter)
Date: Sat, 30 Sep 2006 09:09:51 +0200
Subject: [Linux-cluster] Realserver configuration using loadbalancer
In-Reply-To: <1159538358.27578.15.camel@rei.boston.devel.redhat.com>
Message-ID: <8E2924888511274B95014C2DD906E58A01107956@MAILBOX0A.psi.ch>

> > If I run a loadbalancer in front of the webservers (using piranha_gui
> > and pulse) is there anything I have to configure on the real webservers?
>
> Not usually, unless you're trying to do direct routing.
>

Thanks for your answer. I found the solution and posted it a few days ago.

Pedro