From jpyeron at pdinc.us Fri Sep 1 12:02:30 2006 From: jpyeron at pdinc.us (Jason Pyeron) Date: Fri, 1 Sep 2006 08:02:30 -0400 Subject: OT: RE: [Linux-cluster] php4-xslt package??? In-Reply-To: <44F6E9E9.5070703@gmail.com> Message-ID: <200609011202.k81C2f919734@ns.pyerotechnics.com> This is the wrong list for this, but look at http://public.pdinc.us/rpms/php-xslt/index.jsp -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Sr. Consultant 10 West 24th Street #100 - - +1 (443) 269-1555 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, purge the message from your system and notify the sender immediately. Any other use of the email by you is prohibited. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Anthony Sent: Thursday, August 31, 2006 9:54 To: redhat-sysadmin-list at redhat.com; UNIX-Administration at yahoogroups.com; linux-cluster at redhat.com Subject: [Linux-cluster] php4-xslt package??? Hello, i am unable to find the php4-xslt package for Red Hat Enterprise Linux 4 AS. any help? -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 6433 bytes Desc: not available URL: From basv at sara.nl Fri Sep 1 13:59:04 2006 From: basv at sara.nl (Bas van der Vlies) Date: Fri, 01 Sep 2006 15:59:04 +0200 Subject: [Linux-cluster] ANNOUNCE: new version gfs_2_deb utils (0.2.1) Message-ID: <44F83CA8.3080105@sara.nl> = gfs_2_deb - utilities = This is a release of the SARA package gfs_2_deb that contains utilities that we use to make debian packages from the RedHat Cluster Software (GFS). This is utilities are for version 1.0.3 and cvs updates. All init.d scripts in the debian package start at runlevel 3 and the scripts start in the right order. We have choosen this setup for these reasons, default runlevel is 2: 1) When a node is fenced, the node is rebooted and is ready for cluster mode. 2) We can easily switch from run levels to join or leave the cluster See README for further info The package can be downloaded at: ftp://ftp.sara.nl/pub/outgoing/gfs_2_deb.tar.gz Regards -- -- ******************************************************************** * * * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services phone: +31 20 592 8012 * * Kruislaan 415 fax: +31 20 6683167 * * 1098 SJ Amsterdam * * * ******************************************************************** -- -- ******************************************************************** * * * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services phone: +31 20 592 8012 * * Kruislaan 415 fax: +31 20 6683167 * * 1098 SJ Amsterdam * * * ******************************************************************** From mbrookov at mines.edu Fri Sep 1 19:11:44 2006 From: mbrookov at mines.edu (Matthew B. Brookover) Date: Fri, 01 Sep 2006 13:11:44 -0600 Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? 
In-Reply-To: <20060831191626.99599.qmail@web50613.mail.yahoo.com> References: <20060831191626.99599.qmail@web50613.mail.yahoo.com> Message-ID: <1157137904.26485.6.camel@merlin.Mines.EDU> I have an iscsi scan that would not work with out LVM. As with your EMC SAN I can expand a volume and expand a GFS file system within it. Where I get into trouble is identifying the volumes after a reboot. What was /dev/sdb may be /dev/sdc next time. LVM allows you to name your volumes and helps to track them down when the system is restarted. There are similar problems when SCSI ID numbers get swapped around. Matt On Thu, 2006-08-31 at 12:16 -0700, Roger Pe?a Escobio wrote: > Hi > > I was wondering why in the docs and examples the GFS > filesystem is build on top of a lv "partition" ? > I can understand that if I build the GFS in a direct > scsi attached storage because is not easy to grow the > "device" without destroy the data but the same apply > in an SAN enviroment? > We have here a EMC SAN, where is relative easy to grow > a LUN, so can we skip the LVM layer and build the GFS > filesystem directly over the emcpower device ? > > there is any advantage of using LVM in this scenario? > > thanks in advance > roger > > __________________________________________ > RedHat Certified Engineer ( RHCE ) > Cisco Certified Network Associate ( CCNA ) > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From daves at ActiveState.com Fri Sep 1 23:40:24 2006 From: daves at ActiveState.com (David Sparks) Date: Fri, 01 Sep 2006 16:40:24 -0700 Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? In-Reply-To: <20060831191626.99599.qmail@web50613.mail.yahoo.com> References: <20060831191626.99599.qmail@web50613.mail.yahoo.com> Message-ID: <44F8C4E8.9050601@activestate.com> > I was wondering why in the docs and examples the GFS > filesystem is build on top of a lv "partition" ? > I can understand that if I build the GFS in a direct > scsi attached storage because is not easy to grow the > "device" without destroy the data but the same apply > in an SAN enviroment? > We have here a EMC SAN, where is relative easy to grow > a LUN, so can we skip the LVM layer and build the GFS > filesystem directly over the emcpower device ? A variation of this question, what about creating GFS directly on the block device (ie /dev/sdb) instead of creating partitions (ie /dev/sdb1)? When increasing a filesystem, this removes the step of increasing the partition size, which is usually the scariest part (because you are usually deleting the partition table, and recreating it with the same starting layout, hoping that your existing filesystem will be intact). Does parted support GFS? It doesn't support XFS which is another FS I am using. So I asked myself, why bother creating a partition table at all? I have been running the fs directly on the block device for some time now without issue (XFS, haven't tried GFS). A setup like this has a weakness in that people who aren't familiar with it may come along with fdisk and corrupt the disk by creating a partition table on it. You might rename fdisk as a basic preventative. ds > > there is any advantage of using LVM in this scenario? 
> > thanks in advance > roger From orkcu at yahoo.com Sat Sep 2 02:25:28 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 1 Sep 2006 19:25:28 -0700 (PDT) Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? In-Reply-To: <1157137904.26485.6.camel@merlin.Mines.EDU> Message-ID: <20060902022528.92665.qmail@web50608.mail.yahoo.com> --- "Matthew B. Brookover" wrote: > I have an iscsi scan that would not work with out > LVM. As with your EMC > SAN I can expand a volume and expand a GFS file > system within it. Where > I get into trouble is identifying the volumes after > a reboot. What > was /dev/sdb may be /dev/sdc next time. LVM allows > you to name your > volumes and helps to track them down when the system > is restarted. > There are similar problems when SCSI ID numbers get > swapped around. > yes, I know what you mean I was looking for something like ext{2,3} label for the filesystem but I could'n find anything for gfs :-( so I am hopping that PowerPath kernel module always identify the LUN with the same emcpower device :-) if that is not true I will be forced to move to LVM under GFS :-) thanks roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From jprats at cesca.es Sat Sep 2 11:12:40 2006 From: jprats at cesca.es (=?ISO-8859-1?Q?Jordi_Prats_Catal=E0?=) Date: Sat, 02 Sep 2006 13:12:40 +0200 Subject: [Linux-cluster] clustat problem Message-ID: <44F96728.8090902@cesca.es> Hi, I'm getting different outputs of clustat utility on each node: node1: # clustat Member Status: Quorate Member Name Status ------ ---- ------ node1 Online, Local, rgmanager node2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- ptoheczas node2 started xoqil node2 started ymsgh node1 started vofcvhas node2 started node2: # clustat Member Status: Quorate Member Name Status ------ ---- ------ node1 Online, rgmanager node2 Online, Local, rgmanager (disappears service's info) Rebooting disapears this problem (displays same info in both nodes) for a few weeks. After that it appears again. Do you know what's going on? Thanks, -- ...................................................................... __ / / Jordi Prats Catal? C E / S / C A Departament de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... From filipe.miranda at gmail.com Sat Sep 2 14:28:29 2006 From: filipe.miranda at gmail.com (Filipe Miranda) Date: Sat, 2 Sep 2006 11:28:29 -0300 Subject: [Linux-cluster] clustat problem In-Reply-To: <44F96728.8090902@cesca.es> References: <44F96728.8090902@cesca.es> Message-ID: Hi there, I'm having the same problem! I'm using RHEL3.8 for Itanium and RedHat Cluster Suite U8. The cluster is composed of 2 HP 4CPUs servers and we are using an EMC ClarionCX700 to hold the quorum partitions and data partitions. One more thing that I noticed, eventhough the members are shown ative on both nodes, any action on the node that shows the active service does not get propagated to the other member. I already checked the configuration of the rawdevices, and I also used the shutil utility and it reported no problems with the quorum partitions. Does anybody have any suggestions? 
Thank you, On 9/2/06, Jordi Prats Catal? wrote: > > Hi, > I'm getting different outputs of clustat utility on each node: > > node1: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, Local, rgmanager > node2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > ptoheczas node2 started > xoqil node2 started > ymsgh node1 started > vofcvhas node2 started > > node2: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, rgmanager > node2 Online, Local, rgmanager > > > (disappears service's info) > > Rebooting disapears this problem (displays same info in both nodes) for > a few weeks. After that it appears again. > > Do you know what's going on? > > Thanks, > > -- > ...................................................................... > __ > / / Jordi Prats Catal? > C E / S / C A Departament de Sistemes > /_/ Centre de Supercomputaci? de Catalunya > > Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona > T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es > ...................................................................... > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jprats at cesca.es Sat Sep 2 14:50:42 2006 From: jprats at cesca.es (=?ISO-8859-1?Q?Jordi_Prats_Catal=E0?=) Date: Sat, 02 Sep 2006 16:50:42 +0200 Subject: [Linux-cluster] clustat problem In-Reply-To: References: <44F96728.8090902@cesca.es> Message-ID: <44F99A42.4000402@cesca.es> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi, My software versions are: # clustat -v clustat version 1.9.43 Connected via: CMAN/SM Plugin v1.1.4 # cat /etc/redhat-release Red Hat Enterprise Linux ES release 4 (Nahant Update 2) My cluster is composed of 2 HP ProLiant DL360 G4p: 4 Xeon processors each node also. Filipe Miranda wrote: > Hi there, > > I'm having the same problem! > I'm using RHEL3.8 for Itanium and RedHat Cluster Suite U8. The cluster > is composed of 2 HP 4CPUs servers and we are using an EMC ClarionCX700 > to hold the quorum partitions and data partitions. > One more thing that I noticed, eventhough the members are shown ative on > both nodes, any action on the node that shows the active service does > not get propagated to the other member. > > I already checked the configuration of the rawdevices, and I also used > the shutil utility and it reported no problems with the quorum partitions. > > Does anybody have any suggestions? > > Thank you, > > > On 9/2/06, *Jordi Prats Catal?* > wrote: > > Hi, > I'm getting different outputs of clustat utility on each node: > > node1: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, Local, rgmanager > node2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > ptoheczas node2 started > xoqil node2 started > ymsgh node1 started > vofcvhas node2 started > > node2: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, rgmanager > node2 Online, Local, rgmanager > > > (disappears service's info) > > Rebooting disapears this problem (displays same info in both nodes) for > a few weeks. After that it appears again. > > Do you know what's going on? > > Thanks, > > -- > ...................................................................... > __ > / / Jordi Prats Catal? 
> C E / S / C A Departament de Sistemes > /_/ Centre de Supercomputaci? de Catalunya > > Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona > T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es > > ...................................................................... > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster - -- ...................................................................... __ / / Jordi Prats Catal? C E / S / C A Departament de Sistemes /_/ Centre de Supercomputaci? de Catalunya Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es ...................................................................... pgp:0x5D0D1321 ...................................................................... -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFE+ZpCHGTYFl0NEyERAiwrAJ47HGNxNQ6D5PcKPHXszw1JWenILwCbBB9m T6KJAv7tjOoJ6A6XGswECs0= =o8m8 -----END PGP SIGNATURE----- From eric at nitrate.nl Mon Sep 4 09:17:01 2006 From: eric at nitrate.nl (E. de Ruiter) Date: Mon, 04 Sep 2006 11:17:01 +0200 Subject: [Linux-cluster] economy filesystem cluster Message-ID: <44FBEF0D.6090409@nitrate.nl> Hi, I'm planning to build a webserver cluster, and as part of that I'm looking for solutions that allows every node in the cluster to access the same filesystem. The most easy way would be via nfs, but my requirements state that there should be no single point of failure (ofcourse not completely possible but the cluster should not be affected by the downtime of 1 machine). A san or other some other piece of extra hardware is currently not possible within the current budget. The system will have a low number of writes (only some uploaded files and some generated templates but the majority of the load will be reads) but a rsync solution or something like that is not feasible since loadbalancing needs the file to be directly available on all nodes. What I have: - 1 loadbalancing machine - 1 database server - 2 webfrontends - 1 management server (slave db / backup load balancer etc) In the future I plan on adding some extra database servers + webfrontends All machines are very similar and have (dual) xeon processors. The requirements are that all machines have access to the filesystem, and no single machine may affect the availability of (a part of) the filesystem. Searching the internet resulted in some possible solutions: - GFS with only gnbd (http://gfs.wikidev.net/GNBD_installation). This only exports the specified partitions over the network and has (in my mind) no advantages over using plain nfs (it adds no redundancy) - GFS with gnbd in combination with drbd (mentioned a few times on the mailing list). This looks promising but I couldn't find a definitive answer to the questions raised here on the mailinglist: - drbd 0.7 only allows 1 node to have write-access. Is it possible to construct a simple failover scenario without serious risks of corruption when drbd has "failed-over" but gfs has not. - drbd 0.8 seems to have support for multi(2)-master configuration, but is it stable enough for a production environment and can it work together with gfs - GFS in combination with clvm (network raid?). 
Mentioned a few times here on the mailinglist but most posts claim it is not stable enough, and documentation seems completely missing. - economy configuration from the GFS Administrator's Guide (http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-ov-perform.html#S2-OV-ECONOMY) The problem with this is: - is there a need to have separate gnbd servers? Or can the gnbd servers be run on the application servers. - it is not documented how to configure this, and it is not clear whether this configuration gives me the redundancy I want. What I was thinking of is the following: - One node acts as a gnbd server - Each node has his own disk - Each node mounts a gnbd device. - Each node creates a raid-1 (own disk + gnbd device) - GFS is run on top of the raid-1 But it is not clear to me if this is feasible since I rely on a single gnbd server. Maybe I can have 2 gnbd servers where the disks are synced with drbd (0.8?), but that creates issues with fencing (according to some posts here). And also the raid-1 should read only from it's local disk and only if that fails it should read from the gnbd device, but I don't know if that is possible. Or maybe clvm (network raid?) would be an option but I couldn't find any documentation for that. Can this be done with gfs / clvm / drbd or are there other solutions more appropriate for this case? (other filesystems I've seen, like pvfs2/intermezzo/lustre, are either not production ready, abandoned or don't have support for redundancy) Thanks, Eric de Ruiter From sdake at redhat.com Tue Sep 5 06:39:34 2006 From: sdake at redhat.com (Steven Dake) Date: Mon, 04 Sep 2006 23:39:34 -0700 Subject: [Linux-cluster] cman and bond isssue In-Reply-To: <44F4519E.1030305@atichile.com> References: <44F4519E.1030305@atichile.com> Message-ID: <1157438374.12305.46.camel@shih.broked.org> One possible problem is that your switch doesn't properly support multicast or jumbo frames. I suggest ensuring your MTU on all machines is 1500 to test the jumbo frames possibility. I have seen many switches advertised as jumbo frames which fail to operate properly in multicast or heavy broadcast environments. Regards -steve On Tue, 2006-08-29 at 10:39 -0400, Luis Godoy Gonzalez wrote: > Hello > > I have a problem with the installation of the Cluster Suite. I've > configured 2 nodes cluster, add some services to test de installation > and this worked OK. > > But when I configured "bond" for ethernet interfaces, the communication > between the Cluster nodes doesn't work well. Although networking at IP > level works fine, when I reboot one of the nodes the other one goes to > kernel panic. > > I've lost a lot of time debbuging this problem and I finally decide to > replace one switch ( D-Link DGS-1016D Gigabit Switch ) putting another > from other installation ( D-Link 10/100 ) and the cluster Works Fine now. > > But Now, I'm not sure if the problem is with the hardware switch ( > D-Link ) or with the software. > Any ideas ? > > I have RHE4 U2 and Cluster Suite 4 U2 using HP DL380 and external RAID. > > Some error messages are below > > --------------------------------------------------------- > # SM: 03000002 process_stop_request: uevent already set > > SM: Assertion failed on line 106 of file > /usr/src/build/615121-i686/BUILD/cman- > kernel-2.6.9-39/smp/src/sm_membership.c > SM: assertion: "node" > SM: time = 256523 > nodeid=1 > > Kernel panic - not syncing: SM: Record message above and reboot. 
> ---------------------------------------------------------------------- > > Thanks in advance for any help. > Luis G. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From damian.osullivan at hp.com Tue Sep 5 12:50:23 2006 From: damian.osullivan at hp.com (O'Sullivan, Damian) Date: Tue, 5 Sep 2006 13:50:23 +0100 Subject: [Linux-cluster] CMAN and interface Message-ID: <644A0966265D9D40AC7584FCE956111302EDD634@dubexc01.emea.cpqcorp.net> Hi, How do I ensure that CMAN uses a specific interface? I have a 2 node cluster with 6 ethernet interfaces. I have a cross over cable beween the 2 eth0 interfaces on both nodes. All other interfaces are connected to a common switch with VLANs for each interface. When this switch is reloaded/rebooted the nodes try to fence each other and soon as the switch comes back each node is shutdown by the fencing agent. I see there is a way with multicast but is that the only way and how does one set up addresses for this? Thanks, D. From mwill at penguincomputing.com Tue Sep 5 16:06:56 2006 From: mwill at penguincomputing.com (Michael Will) Date: Tue, 5 Sep 2006 09:06:56 -0700 Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? Message-ID: <433093DF7AD7444DA65EFAFE3987879C245253@jellyfish.highlyscyld.com> My number one reason for using a partition table under lvm is avoiding to place filesystem data where it could be damaged by accidentally installing a bootblock or partitiontable on the wrong device. Michael -----Original Message----- From: David Sparks [mailto:daves at ActiveState.com] Sent: Fri Sep 01 16:40:47 2006 To: linux clustering Subject: Re: [Linux-cluster] is necesary to to build GFS on top of LVM ? > I was wondering why in the docs and examples the GFS > filesystem is build on top of a lv "partition" ? > I can understand that if I build the GFS in a direct > scsi attached storage because is not easy to grow the > "device" without destroy the data but the same apply > in an SAN enviroment? > We have here a EMC SAN, where is relative easy to grow > a LUN, so can we skip the LVM layer and build the GFS > filesystem directly over the emcpower device ? A variation of this question, what about creating GFS directly on the block device (ie /dev/sdb) instead of creating partitions (ie /dev/sdb1)? When increasing a filesystem, this removes the step of increasing the partition size, which is usually the scariest part (because you are usually deleting the partition table, and recreating it with the same starting layout, hoping that your existing filesystem will be intact). Does parted support GFS? It doesn't support XFS which is another FS I am using. So I asked myself, why bother creating a partition table at all? I have been running the fs directly on the block device for some time now without issue (XFS, haven't tried GFS). A setup like this has a weakness in that people who aren't familiar with it may come along with fdisk and corrupt the disk by creating a partition table on it. You might rename fdisk as a basic preventative. ds > > there is any advantage of using LVM in this scenario? > > thanks in advance > roger -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lhh at redhat.com Tue Sep 5 16:44:40 2006 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 05 Sep 2006 12:44:40 -0400 Subject: [Linux-cluster] clustat problem In-Reply-To: <44F99A42.4000402@cesca.es> References: <44F96728.8090902@cesca.es> <44F99A42.4000402@cesca.es> Message-ID: <1157474680.3610.35.camel@rei.boston.devel.redhat.com> On Sat, 2006-09-02 at 16:50 +0200, Jordi Prats Catal? wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi, > My software versions are: > > # clustat -v > clustat version 1.9.43 It should be fixed in the U4 release. There were cases where the main thread would block, causing problems with the CMAN service manager (which then caused the cluster to cease to function normally). Additionally, there is an issue with the DLM which has been worked around by switching the way locks are taken. Either one of these problems causes clustat to hang and/or produce no output. Versions: rgmanager-1.9.53 magma-plugins-1.0.9 magma-1.0.6 -- Lon From ivanp at yu.net Wed Sep 6 09:32:58 2006 From: ivanp at yu.net (Ivan Pantovic) Date: Wed, 06 Sep 2006 11:32:58 +0200 Subject: [Linux-cluster] is necesary to to build GFS on top of LVM ? In-Reply-To: <20060902022528.92665.qmail@web50608.mail.yahoo.com> References: <20060902022528.92665.qmail@web50608.mail.yahoo.com> Message-ID: <44FE95CA.1050202@yu.net> You can use udev with scsi_id to map that lun always on the same place instead using lvm to find volumes. There is another thing you should consider. It is cLVM not LVM. Roger PeXa Escobio wrote: > > --- "Matthew B. Brookover" wrote: > > >>I have an iscsi scan that would not work with out >>LVM. As with your EMC >>SAN I can expand a volume and expand a GFS file >>system within it. Where >>I get into trouble is identifying the volumes after >>a reboot. What >>was /dev/sdb may be /dev/sdc next time. LVM allows >>you to name your >>volumes and helps to track them down when the system >>is restarted. >>There are similar problems when SCSI ID numbers get >>swapped around. >> > > yes, I know what you mean > I was looking for something like ext{2,3} label for > the filesystem but I could'n find anything for gfs :-( > > so I am hopping that PowerPath kernel module always > identify the LUN with the same emcpower device :-) > if that is not true I will be forced to move to LVM > under GFS :-) > > > thanks > roger > > > > > __________________________________________ > RedHat Certified Engineer ( RHCE ) > Cisco Certified Network Associate ( CCNA ) > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Ivan Pantovic, System Engineer ----- YUnet International http://www.eunet.yu Dubrovacka 35/III, 11000 Belgrade Tel: +381 11 311 9901; Fax: +381 11 311 9901; Mob: +381 63 302 288 ----- This e-mail is confidential and intended only for the recipient. Unauthorized distribution, modification or disclosure of its contents is prohibited. If you have received this e-mail in error, please notify the sender by telephone +381 11 311 9901. 
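For anyone following the device-naming side of this thread later, a minimal sketch of Ivan's udev + scsi_id suggestion, assuming a RHEL4-era udev: the WWID, rule file name and symlink below are made-up examples, and the exact rule syntax (including whether matches use '=' or '==') varies between udev releases, so treat it as a starting point rather than a recipe.

  # 1. ask scsi_id for the LUN's persistent identifier
  /sbin/scsi_id -g -u -s /block/sdb

  # 2. /etc/udev/rules.d/60-san.rules -- give that identifier a stable name
  #    (replace the RESULT value with the id printed above)
  KERNEL=="sd*", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -s /block/%k", RESULT=="360060160abc0123456789abc0123456", SYMLINK="san/gfs0"

Pointing fstab or the cluster service at /dev/san/gfs0 instead of /dev/sdb (or the bare emcpower device) keeps the name stable across reboots and SCSI renumbering without having to put LVM underneath GFS.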
----- From riaan at obsidian.co.za Wed Sep 6 15:09:54 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Wed, 06 Sep 2006 17:09:54 +0200 Subject: [Linux-cluster] data journaling for increased performance Message-ID: <44FEE4C2.1070205@obsidian.co.za> has anyone been able to use GFS data journaling to get any measurable performance boost? For those unfamiliar: http://www.redhat.com/docs/manuals/csgfs/browse/rh-gfs-en/s1-manage-data-journal.html We have a 2.6 TB maildir mail store (e.g. lots of small files) and think of implementing it (we will take any performance increase we can get as long as it does not impact reliability), and even though it will only apply to new files. Also, is it possible to check if the inherit_jdata (for directories) or jdata (for files) flag has been set? Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From celso at webbertek.com.br Thu Sep 7 03:59:46 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 07 Sep 2006 00:59:46 -0300 Subject: [Linux-cluster] Is IPMI fencing considered certified by Red Hat? Message-ID: <44FF9932.9020508@webbertek.com.br> Hello friends, Regarding Red Hat Cluster Suite and/or GFS, could someone from Red Hat please tell me if the use of IPMI embedded devices from the servers' motherboards is officially certified by Red Hat? I'd like to have this information so that we can recommend (or not) to customers the use of IPMI as a secure form of fencing. We had some bad experiences recently on some servers where only one of the onboard NICs listened to the IPMI over LAN packets, so it appeared to us that sometimes IPMI is not that safe as a fence device. Of course the Cluster software will assume nothing when the fencing fails, but the bad thing is that there is no automatic failover on this situation. Thank you all, Celso. -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. From celso at webbertek.com.br Thu Sep 7 04:24:27 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 07 Sep 2006 01:24:27 -0300 Subject: [Linux-cluster] Write log messages to a different file In-Reply-To: <1156958955.4501.245.camel@rei.boston.devel.redhat.com> References: <5F08B160555AC946B5AB743B85FF406D05ABF26F@ex2k.bankofamerica.com> <1156958955.4501.245.camel@rei.boston.devel.redhat.com> Message-ID: <44FF9EFB.5090007@webbertek.com.br> Hello, Are there plans to implement those loggin facilities to the other daemons? It'd be very interesting to have the fence messages and related stuff to a separete file. Thanks, Celso. Lon Hohberger escreveu: > On Wed, 2006-08-30 at 11:15 -0400, Brown, Rodrick R wrote: >> You need to modify /etc/syslog.conf >> local4.* /var/log/cluster.log > > I think it's daemon.*, not local4.*, by default. You can make rgmanager > use local4 by tweaking the tag, though: > > > > ... but this doesn't change CMAN, CCS, GuLM, etc. > > -- Lon > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. 
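For reference, a hedged sketch of how this logging split is usually wired up, assuming sysklogd and the RHEL4-era rgmanager; the log_facility attribute and the file names below are assumptions to verify against your own versions.

  # /etc/syslog.conf -- cman, ccsd and fenced log to the daemon facility
  # by default (older syslogd wants a tab between selector and file)
  daemon.*          /var/log/cluster.log

  # if rgmanager is switched to local4, roughly as Lon describes above:
  #   <rm log_facility="local4"> ... </rm>     (attribute in cluster.conf)
  local4.*          /var/log/rgmanager.log

  # pick up the change
  service syslog restart

This only redirects daemons that already talk to syslog; it does not give the cluster its own facility, which is the gap Matthew's follow-up below touches on.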
From celso at webbertek.com.br Thu Sep 7 04:28:32 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 07 Sep 2006 01:28:32 -0300 Subject: [Linux-cluster] clustat problem In-Reply-To: References: <44F96728.8090902@cesca.es> Message-ID: <44FF9FF0.1000404@webbertek.com.br> Hi Filipe! I think your case is a little bit different from Jordi's case, since you are using Cluster Suite v3 and he is using v4. From my own experience, under CSv3 I had this kind o problem when using high latency quorum devices. So I had to change from disk tiebraker to network tiebraker. I imagine you're using disk tiebraker, aren't you? Please, would someone please confirm that Filipe's case could be solved by changing the heartbeat method? It worked for me in the past, but I'm not pretty sure that this was the actual solution. Thanks, Celso. Filipe Miranda escreveu: > Hi there, > > I'm having the same problem! > I'm using RHEL3.8 for Itanium and RedHat Cluster Suite U8. The cluster > is composed of 2 HP 4CPUs servers and we are using an EMC ClarionCX700 > to hold the quorum partitions and data partitions. > One more thing that I noticed, eventhough the members are shown ative on > both nodes, any action on the node that shows the active service does > not get propagated to the other member. > > I already checked the configuration of the rawdevices, and I also used > the shutil utility and it reported no problems with the quorum partitions. > > Does anybody have any suggestions? > > Thank you, > > > On 9/2/06, *Jordi Prats Catal?* > wrote: > > Hi, > I'm getting different outputs of clustat utility on each node: > > node1: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, Local, rgmanager > node2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > ptoheczas node2 started > xoqil node2 started > ymsgh node1 started > vofcvhas node2 started > > node2: > # clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > node1 Online, rgmanager > node2 Online, Local, rgmanager > > > (disappears service's info) > > Rebooting disapears this problem (displays same info in both nodes) for > a few weeks. After that it appears again. > > Do you know what's going on? > > Thanks, > > -- > ...................................................................... > __ > / / Jordi Prats Catal? > C E / S / C A Departament de Sistemes > /_/ Centre de Supercomputaci? de Catalunya > > Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona > T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es > > ...................................................................... > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > Esta mensagem foi verificada pelo sistema de antiv?rus e > acredita-se estar livre de perigo. > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. 
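A small check that follows from Lon's note earlier in the thread (the clustat hang was fixed around U4): before tuning tiebreakers or timers, confirm both nodes are running matching, recent packages. Package names below are the stock ones for RHCS v4 and v3; adjust for your channels.

  # RHCS v4 (Jordi's case) -- Lon cites rgmanager-1.9.53, magma-1.0.6
  # and magma-plugins-1.0.9 as containing the fixes
  clustat -v
  rpm -q rgmanager magma magma-plugins

  # RHCS v3 / clumanager (Filipe's case)
  rpm -q clumanager redhat-config-cluster

A version mismatch between the two nodes, or anything older than the versions above, is worth ruling out before reaching for cludb tuning parameters.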
From Matthew.Patton.ctr at osd.mil Thu Sep 7 13:19:07 2006 From: Matthew.Patton.ctr at osd.mil (Patton, Matthew F, CTR, OSD-PA&E) Date: Thu, 7 Sep 2006 09:19:07 -0400 Subject: [Linux-cluster] Write log messages to a different file Message-ID: Classification: UNCLASSIFIED > Are there plans to implement those loggin facilities to the other > daemons? unfortunately Redhat probably can't get away with defining a new facility: "cluster" but it would be nice if they'd settle on a localN. daemon sorta fits but then it would pollute the regular daemon stream with all it's noise. I can't stand the stock RH syslog.conf. But hey, that's my perogative. Every daemon should have an option to specify the facility. But this is unix - nobody does anything in a consistant manner. Shoot, even the LVM tools aren't consistant with each other. While I'm on my rant, please stop using XML to configure daemons. I don't mean eg. the cluster configuration itself, but like settings for rgmanager. What facility it uses does NOT belong anywhere but in /etc/sysconfig. I'm all for new stuff and fixing new stuff but I wish the larger Linux/unix community would spend some time fixing all the garbage that's been around for 30+ years. -------------- next part -------------- An HTML attachment was scrubbed... URL: From teigland at redhat.com Thu Sep 7 14:10:37 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 7 Sep 2006 09:10:37 -0500 Subject: [Linux-cluster] CMAN and interface In-Reply-To: <644A0966265D9D40AC7584FCE956111302EDD634@dubexc01.emea.cpqcorp.net> References: <644A0966265D9D40AC7584FCE956111302EDD634@dubexc01.emea.cpqcorp.net> Message-ID: <20060907141037.GB7775@redhat.com> On Tue, Sep 05, 2006 at 01:50:23PM +0100, O'Sullivan, Damian wrote: > Hi, > > How do I ensure that CMAN uses a specific interface? I have a 2 node > cluster with 6 ethernet interfaces. I have a cross over cable beween the > 2 eth0 interfaces on both nodes. All other interfaces are connected to a > common switch with VLANs for each interface. When this switch is > reloaded/rebooted the nodes try to fence each other and soon as the > switch comes back each node is shutdown by the fencing agent. > > I see there is a way with multicast but is that the only way and how > does one set up addresses for this? The node names in cluster.conf should be the name assigned to the interface you want cman/dlm to use for heartbeating/locking. So, in your case it sounds like you should use the name of the address on eth0 in cluster.conf (I think you can use IP addresses as node names, too, but I'm not certain.) Dave From teigland at redhat.com Thu Sep 7 14:21:27 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 7 Sep 2006 09:21:27 -0500 Subject: [Linux-cluster] data journaling for increased performance In-Reply-To: <44FEE4C2.1070205@obsidian.co.za> References: <44FEE4C2.1070205@obsidian.co.za> Message-ID: <20060907142127.GC7775@redhat.com> On Wed, Sep 06, 2006 at 05:09:54PM +0200, Riaan van Niekerk wrote: > has anyone been able to use GFS data journaling to get any measurable > performance boost? For those unfamiliar: > > http://www.redhat.com/docs/manuals/csgfs/browse/rh-gfs-en/s1-manage-data-journal.html > > We have a 2.6 TB maildir mail store (e.g. lots of small files) and think > of implementing it (we will take any performance increase we can get as > long as it does not impact reliability), and even though it will only > apply to new files. 
> > Also, is it possible to check if the inherit_jdata (for directories) or > jdata (for files) flag has been set? 'gfs_tool stat' will show the gfs-specific flags on files or directories. Dave From damian.osullivan at hp.com Thu Sep 7 14:24:57 2006 From: damian.osullivan at hp.com (O'Sullivan, Damian) Date: Thu, 7 Sep 2006 15:24:57 +0100 Subject: [Linux-cluster] CMAN and interface In-Reply-To: <20060907141037.GB7775@redhat.com> Message-ID: <644A0966265D9D40AC7584FCE956111302F17C0C@dubexc01.emea.cpqcorp.net> > -----Original Message----- > From: David Teigland [mailto:teigland at redhat.com] > Sent: 07 September 2006 15:11 > To: O'Sullivan, Damian > Cc: Linux-cluster at redhat.com > Subject: Re: [Linux-cluster] CMAN and interface > The node names in cluster.conf should be the name assigned to > the interface you want cman/dlm to use for > heartbeating/locking. So, in your case it sounds like you > should use the name of the address on eth0 in cluster.conf (I > think you can use IP addresses as node names, too, but I'm > not certain.) > > Dave > Thanks Dave, I assume it is no problem to change the node names in the cluster.conf file on a running cluster? D. From teigland at redhat.com Thu Sep 7 14:30:31 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 7 Sep 2006 09:30:31 -0500 Subject: [Linux-cluster] CMAN and interface In-Reply-To: <644A0966265D9D40AC7584FCE956111302F17C0C@dubexc01.emea.cpqcorp.net> References: <20060907141037.GB7775@redhat.com> <644A0966265D9D40AC7584FCE956111302F17C0C@dubexc01.emea.cpqcorp.net> Message-ID: <20060907143031.GD7775@redhat.com> On Thu, Sep 07, 2006 at 03:24:57PM +0100, O'Sullivan, Damian wrote: > > -----Original Message----- > > From: David Teigland [mailto:teigland at redhat.com] > > Sent: 07 September 2006 15:11 > > To: O'Sullivan, Damian > > Cc: Linux-cluster at redhat.com > > Subject: Re: [Linux-cluster] CMAN and interface > > > The node names in cluster.conf should be the name assigned to > > the interface you want cman/dlm to use for > > heartbeating/locking. So, in your case it sounds like you > > should use the name of the address on eth0 in cluster.conf (I > > think you can use IP addresses as node names, too, but I'm > > not certain.) > > > > Dave > > > > Thanks Dave, > > I assume it is no problem to change the node names in the cluster.conf > file on a running cluster? I think it would be a problem, although I can't say exactly what would break or how badly. One thing that would break is fencing, since fenced would be given the old name to fence and wouldn't be able to find it in cluster.conf when looking up fencing paramters. You probably need to stop both nodes, change the names, then have them rejoin the cluster. Dave From titi.titi75 at caramail.com Thu Sep 7 17:03:15 2006 From: titi.titi75 at caramail.com (titi.titi75) Date: Thu Sep 07 17:03:15 GMT+00:00 2006 Subject: [Linux-cluster] lock_nolock to lock_dlm trouble Message-ID: <10949343554084@lycos-europe.com> An HTML attachment was scrubbed... 
URL: From filipe.miranda at gmail.com Thu Sep 7 17:16:12 2006 From: filipe.miranda at gmail.com (Filipe Miranda) Date: Thu, 7 Sep 2006 14:16:12 -0300 Subject: [Linux-cluster] clustat problem In-Reply-To: <44FF9FF0.1000404@webbertek.com.br> References: <44F96728.8090902@cesca.es> <44FF9FF0.1000404@webbertek.com.br> Message-ID: Hello Celso, Well, your suggestion might be the solution to the problem, but since I think its a quorum latency problem, would the parameters "cludb -p clumemb%rtp 50" and "cludb -p cluquorumd%rtp 50" help on this this issue? I was digging into the Cluster Suite documentation and I found these parameters. Would those help on this issue without changing the heartbeat method? Also take a look in this Kbase bellow, it has some interesting tunning parameters for Red Hat's Clsuter Suite v3: http://kbase.redhat.com/faq/FAQ_79_7722.shtm Regards, Filipe Miranda On 9/7/06, Celso K. Webber wrote: > > Hi Filipe! > > I think your case is a little bit different from Jordi's case, since you > are using Cluster Suite v3 and he is using v4. > > From my own experience, under CSv3 I had this kind o problem when using > high latency quorum devices. So I had to change from disk tiebraker to > network tiebraker. I imagine you're using disk tiebraker, aren't you? > > Please, would someone please confirm that Filipe's case could be solved > by changing the heartbeat method? It worked for me in the past, but I'm > not pretty sure that this was the actual solution. > > Thanks, > > Celso. > > Filipe Miranda escreveu: > > Hi there, > > > > I'm having the same problem! > > I'm using RHEL3.8 for Itanium and RedHat Cluster Suite U8. The cluster > > is composed of 2 HP 4CPUs servers and we are using an EMC ClarionCX700 > > to hold the quorum partitions and data partitions. > > One more thing that I noticed, eventhough the members are shown ative on > > both nodes, any action on the node that shows the active service does > > not get propagated to the other member. > > > > I already checked the configuration of the rawdevices, and I also used > > the shutil utility and it reported no problems with the quorum > partitions. > > > > Does anybody have any suggestions? > > > > Thank you, > > > > > > On 9/2/06, *Jordi Prats Catal?* > > wrote: > > > > Hi, > > I'm getting different outputs of clustat utility on each node: > > > > node1: > > # clustat > > Member Status: Quorate > > > > Member Name Status > > ------ ---- ------ > > node1 Online, Local, rgmanager > > node2 Online, rgmanager > > > > Service Name Owner (Last) State > > ------- ---- ----- ------ ----- > > ptoheczas node2 started > > xoqil node2 started > > ymsgh node1 started > > vofcvhas node2 started > > > > node2: > > # clustat > > Member Status: Quorate > > > > Member Name Status > > ------ ---- ------ > > node1 Online, rgmanager > > node2 Online, Local, rgmanager > > > > > > (disappears service's info) > > > > Rebooting disapears this problem (displays same info in both nodes) > for > > a few weeks. After that it appears again. > > > > Do you know what's going on? > > > > Thanks, > > > > -- > > > ...................................................................... > > __ > > / / Jordi Prats Catal? > > C E / S / C A Departament de Sistemes > > /_/ Centre de Supercomputaci? de Catalunya > > > > Gran Capit?, 2-4 (Edifici Nexus) ? 08034 Barcelona > > T. 93 205 6464 ? F. 93 205 6979 ? jprats at cesca.es > > > > > ...................................................................... 
> > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > -- > > Esta mensagem foi verificada pelo sistema de antiv?rus e > > acredita-se estar livre de perigo. > > > > > > ------------------------------------------------------------------------ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > *Celso Kopp Webber* > > celso at webbertek.com.br > > *Webbertek - Opensource Knowledge* > (41) 8813-1919 > (41) 3284-3035 > > > -- > Esta mensagem foi verificada pelo sistema de antiv?rus e > acredita-se estar livre de perigo. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- --- Filipe T Miranda Red Hat Certified Engineer -------------- next part -------------- An HTML attachment was scrubbed... URL: From teigland at redhat.com Thu Sep 7 17:53:03 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 7 Sep 2006 12:53:03 -0500 Subject: [Linux-cluster] lock_nolock to lock_dlm trouble In-Reply-To: <10949343554084@lycos-europe.com> References: <10949343554084@lycos-europe.com> Message-ID: <20060907175303.GI7775@redhat.com> > I had to remove a storage system from my cluster. It was formatted using > lock_dlm before being removed. An then, it was plugged on a single > server, using the "lockproto=lock_nolock" option. > > Now, I put it back in the cluster, but I can 't mount it with the > standard lock_dlm (but it's ok with the lock_nolock option, but it of > course prevents me to share it). The error is > > GFS: Trying to join cluster "lock_dlm", "alpha_cluster:vol001" > lock_dlm: new lockspace error -17 > GFS: can't mount proto = lock_dlm, table = alpha_cluster:vol001, hostdata = -17 is EEXIST, meaning a dlm lockspace with the name "vol001" already exists. "cman_tool services" should display it. Do you have another fs with the same name? You may need to reboot the system to get it back in shape. Dave PS. please send text instead of html mail From lhh at redhat.com Thu Sep 7 21:50:14 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 07 Sep 2006 17:50:14 -0400 Subject: [Linux-cluster] Is IPMI fencing considered certified by Red Hat? In-Reply-To: <44FF9932.9020508@webbertek.com.br> References: <44FF9932.9020508@webbertek.com.br> Message-ID: <1157665814.3610.251.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-07 at 00:59 -0300, Celso K. Webber wrote: > Hello friends, > > Regarding Red Hat Cluster Suite and/or GFS, could someone from Red Hat > please tell me if the use of IPMI embedded devices from the servers' > motherboards is officially certified by Red Hat? > > I'd like to have this information so that we can recommend (or not) to > customers the use of IPMI as a secure form of fencing. > > We had some bad experiences recently on some servers where only one of > the onboard NICs listened to the IPMI over LAN packets, so it appeared > to us that sometimes IPMI is not that safe as a fence device. Of course > the Cluster software will assume nothing when the fencing fails, but the > bad thing is that there is no automatic failover on this situation. It's supported, but there are a couple of caveats that you should be aware of: (a) You should, if possible, use the IPMI-enabled NIC only for IPMI traffic. At least, you should not use it for cluster communication traffic - though it is fine for service-related (e.g. rgmanager, etc.) 
and other traffic. That way, the IPMI-enabled port can't become a single point of failure. Here's why: If IPMI and cluster traffic are using the same NIC, then that NIC failing (or becoming disconnected) will cause the node to be evicted -- but prevent fencing, because the IPMI host will be unreachable. Similarly, on a machine with a single power supply + IPMI fencing in a cluster, the power cord becomes a SPF - if you pull the power, the host is dead and fencing cannot complete (because IPMI does not have power either!), which leads to... (b) If you do not have *both* dual power supplies and dual NICs, you need something else (in addition to IPMI) if NSPF is a requirement for your particular installation. For example, what one linux-cluster user did was add their fiber channel switch as a secondary fence device (in its own fence level). His cluster tries to fence using IPMI. Failing that, the cluster falls back to fencing via the fiber switch. (c) You often need to disable ACPI on hardware which has IPMI if you intend to use IPMI for fencing. This can vary on a per-machine basis, so you should check first. If a host does a "graceful shutdown" when you fence it via IPMI, you need to disable ACPI on that host (e.g. boot with acpi=off). The server should turn off immediately (or within 4-5 seconds, like when holding an ATX power button in to force a machine off). Hope that helps! -- Lon From peter.huesser at psi.ch Thu Sep 7 22:49:27 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 00:49:27 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connect to cluster infrastructure Message-ID: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> Hello I try to make a two node cluster run. Unfortunately if I run the "/etc/init.d/cman start" command I get in "/var/log/messages" entries like: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Initial status:: Inquorate Cluster manager shutdown. Attemping to reconnect... Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Initial status:: Inquorate Cluster manager shutdown. Attemping to reconnect... "dmesg" shows: CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: forming a new cluster CMAN: quorum regained, resuming activity CMAN: sendmsg failed: -13 CMAN: No functional network interfaces, leaving cluster CMAN: sendmsg failed: -13 CMAN: we are leaving the cluster. CMAN: Waiting to join or form a Linux-cluster CMAN: sendmsg failed: -13 .... The "/etc/hosts" file is correctly set up. "iptables" are disabled ("service iptables stop"). The "cluster.conf" file looks like: I found a thread about this topic in June and August but these did not help me. Any ideas what could be wrong. Sorry, it is possible that I make a complete stupid error (this is my first cluster I set up). Thanks' for any help Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From orkcu at yahoo.com Thu Sep 7 22:59:18 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Thu, 7 Sep 2006 15:59:18 -0700 (PDT) Subject: [Linux-cluster] Two node cluster: node cannot connect to cluster infrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> Message-ID: <20060907225918.76101.qmail@web50612.mail.yahoo.com> --- Huesser Peter wrote: > Hello > > > > I try to make a two node cluster run. 
Unfortunately > if I run the > "/etc/init.d/cman start" command I get in > "/var/log/messages" entries > like: > > > > Connected to cluster infrastruture via: CMAN/SM > Plugin v1.1.5 > > Initial status:: Inquorate > > Cluster manager shutdown. Attemping to > reconnect... > > Connected to cluster infrastruture via: CMAN/SM > Plugin v1.1.5 > > Initial status:: Inquorate > > Cluster manager shutdown. Attemping to > reconnect... > > > > "dmesg" shows: > > > > CMAN: sendmsg failed: -13 > > CMAN: sendmsg failed: -13 > > CMAN: sendmsg failed: -13 > > CMAN: forming a new cluster > > CMAN: quorum regained, resuming activity > > CMAN: sendmsg failed: -13 > > CMAN: No functional network interfaces, leaving > cluster > > CMAN: sendmsg failed: -13 > > CMAN: we are leaving the cluster. > > CMAN: Waiting to join or form a Linux-cluster > > CMAN: sendmsg failed: -13 > > .... shutdown the iptables just to see if anything change cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From rodgersr at yahoo.com Thu Sep 7 23:07:28 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Thu, 7 Sep 2006 16:07:28 -0700 (PDT) Subject: [Linux-cluster] Issue stonith commands to the failing node twice?? Message-ID: <20060907230728.27221.qmail@web34207.mail.mud.yahoo.com> I am using an older version of clumanger (about 2 yrs old) and I notice that when the active node goes down the back will actually issue stonith commands twice. They are about 60 seconds apart. Does this happen to anyone else?? -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Fri Sep 8 07:44:50 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 09:44:50 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connect to clusterinfrastructure In-Reply-To: <20060907225918.76101.qmail@web50612.mail.yahoo.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19E37@MAILBOX0A.psi.ch> Hello Roger > > shutdown the iptables just to see if anything change > Thanks' for the answer but I already tried this. No effect. Pedro From titi.titi75 at caramail.com Fri Sep 8 08:45:08 2006 From: titi.titi75 at caramail.com (titi.titi75) Date: Fri Sep 08 08:45:08 GMT+00:00 2006 Subject: [Linux-cluster] Re: lock_nolock to lock_dlm trouble - solved Message-ID: <16513365574571@lycos-europe.com> Hello, Ignore my precedent post. I made a mistake. The problem wasn't because of a lock manager modification, but because of a conflict with the FSName. I solved my problem with a 'gfs_tool sb /dev/sanstock3/vol001 table alpha_cluster:vol003' command Thank's Jerome From peter.huesser at psi.ch Fri Sep 8 09:25:30 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 11:25:30 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connect to clusterinfrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> Message-ID: <8E2924888511274B95014C2DD906E58AD19E58@MAILBOX0A.psi.ch> I forgot to mention, that I first execute "/etc/init.d/ccsd start" on all servers and afterwards "/etc/init.d/cman start". In the "/var/log/messages" file I see (after some time) a line like "Cluster is quorate. Allowing connections" which sounds interesting but already on the next line I see "Cluster manager shutdown. Attempting to reconnect...". 
Later I only have the entries you see below. Pedro Hello I try to make a two node cluster run. Unfortunately if I run the "/etc/init.d/cman start" command I get in "/var/log/messages" entries like: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Initial status:: Inquorate Cluster manager shutdown. Attemping to reconnect... Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Initial status:: Inquorate Cluster manager shutdown. Attemping to reconnect... "dmesg" shows: CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: forming a new cluster CMAN: quorum regained, resuming activity CMAN: sendmsg failed: -13 CMAN: No functional network interfaces, leaving cluster CMAN: sendmsg failed: -13 CMAN: we are leaving the cluster. CMAN: Waiting to join or form a Linux-cluster CMAN: sendmsg failed: -13 .... The "/etc/hosts" file is correctly set up. "iptables" are disabled ("service iptables stop"). The "cluster.conf" file looks like: I found a thread about this topic in June and August but these did not help me. Any ideas what could be wrong. Sorry, it is possible that I make a complete stupid error (this is my first cluster I set up). Thanks' for any help Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From orkcu at yahoo.com Fri Sep 8 13:08:23 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 8 Sep 2006 06:08:23 -0700 (PDT) Subject: [Linux-cluster] Two node cluster: node cannot connect to clusterinfrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E37@MAILBOX0A.psi.ch> Message-ID: <20060908130823.8334.qmail@web50610.mail.yahoo.com> --- Huesser Peter wrote: > Hello Roger > > > > > shutdown the iptables just to see if anything > change > > > > Thanks' for the answer but I already tried this. No > effect. in both nodes? even before the start the ccsd daemon? ok, that was a guess sorry if it not help you :-( cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From peter.huesser at psi.ch Fri Sep 8 13:17:46 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 15:17:46 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connect toclusterinfrastructure In-Reply-To: <20060908130823.8334.qmail@web50610.mail.yahoo.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19E73@MAILBOX0A.psi.ch> > > in both nodes? > even before the start the ccsd daemon? > Yes (unfortunately). > > ok, that was a guess > sorry if it not help you :-( > I am glad for any answer. I am looking after the problem for quit a long time now and do not see a solution. Pedro From m.catanese at kinetikon.com Fri Sep 8 12:58:03 2006 From: m.catanese at kinetikon.com (Matteo Catanese) Date: Fri, 8 Sep 2006 14:58:03 +0200 Subject: [Linux-cluster] system-config-cluster problem Message-ID: I've setup a cluster some month ago. Cluster is working , but still not in production. Today, after summer break, i did all the updates for my rhat and CS First i disabled all services, then i patched one machine and rebooted, then the other one and rebooted. 
Cluster works perfectly: [root at lvzbe1 ~]# uname -a Linux lvzbe1.lavazza.it 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux [root at lvzbe1 ~]# clustat -v clustat version 1.9.53 Connected via: CMAN/SM Plugin v1.1.7.1 [root at lvzbe1 ~]# clustat Member Status: Quorate Member Name Status ------ ---- ------ lvzbe1 Online, Local, rgmanager lvzbe2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- oracle lvzbe1 started [root at lvzbe1 ~]# But when i try to run system-config-cluster,it pops out: Poorly Formed XML error A problem was encoutered while reading configuration file /etc/ cluster/clluster.conf. Details or the error appear below. Click the "New" button to create a new configuration file. To continue anyway(Not Recommended!), click the "ok" button. Relax-NG validity error : Extra element rm in interleave /etc/cluster/cluster.conf:35: element rm: Relax-NG validity error : Element cluster failed to validate content /etc/cluster/cluster.conf fails to validate I clicked the "cancel" button, to not to damage all. Conf file is immutated since Jul 13 2006 Matteo -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 2192 bytes Desc: not available URL: From orkcu at yahoo.com Fri Sep 8 13:55:50 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 8 Sep 2006 06:55:50 -0700 (PDT) Subject: [Linux-cluster] system-config-cluster problem In-Reply-To: Message-ID: <20060908135552.37715.qmail@web50611.mail.yahoo.com> > But when i try to run system-config-cluster,it pops > out: > > Poorly Formed XML error > A problem was encoutered while reading configuration > file /etc/ > cluster/clluster.conf. ^^^^^^^^ this extra 'l' is a tipo error or is actually there ? cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From orkcu at yahoo.com Fri Sep 8 14:10:43 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Fri, 8 Sep 2006 07:10:43 -0700 (PDT) Subject: [Linux-cluster] Two node cluster: node cannot connect toclusterinfrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E73@MAILBOX0A.psi.ch> Message-ID: <20060908141043.44667.qmail@web50611.mail.yahoo.com> --- Huesser Peter wrote: > > > > in both nodes? > > even before the start the ccsd daemon? > > > Yes (unfortunately). > > > > ok, that was a guess > > sorry if it not help you :-( > > > I am glad for any answer. I am looking after the > problem for quit a long > time now and do not see a solution. > my fisrt time with rhcs I had something like that, I did a lot of things, but the last one before the node join the cluster was a : cman_tooy join I did that in the first node without configure the second node, I didn't had to do that for the second node After that first experience I had others rhcs installations from scratch and never I had to do the "cman_tool join" again ... but maybe you need it ... cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! 
Mail has the best spam protection around http://mail.yahoo.com From jparsons at redhat.com Fri Sep 8 13:44:44 2006 From: jparsons at redhat.com (Jim Parsons) Date: Fri, 08 Sep 2006 09:44:44 -0400 Subject: [Linux-cluster] system-config-cluster problem References: Message-ID: <450173CC.6070401@redhat.com> Hi Matteo, Sorry for the scary warning. I will look at this issue this morning. Before the s-c-cluster app reads in a cluster.conf, it runs the file against 'xmllint --relaxng' and checks for errors. A bad cluster.conf file could wreak havoc in the GUI. Sometimes errors creep in from hand editing, but they can also occur if we have missed an xml construct we use in the schema file. I'll let you know what is up. Unfortunately, the relaxNG error messages are not very descriptive, but they improve with every release of the validation checker. Thanks for sending your conf file. -J Matteo Catanese wrote: > I've setup a cluster some month ago. > > Cluster is working , but still not in production. > > Today, after summer break, i did all the updates for my rhat and CS > > First i disabled all services, then i patched one machine and > rebooted, then the other one and rebooted. > > > Cluster works perfectly: > > > [root at lvzbe1 ~]# uname -a > Linux lvzbe1.lavazza.it 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 > EDT 2006 i686 i686 i386 GNU/Linux > [root at lvzbe1 ~]# clustat -v > clustat version 1.9.53 > Connected via: CMAN/SM Plugin v1.1.7.1 > [root at lvzbe1 ~]# clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > lvzbe1 Online, Local, rgmanager > lvzbe2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > oracle lvzbe1 started > [root at lvzbe1 ~]# > > > But when i try to run system-config-cluster,it pops out: > > Poorly Formed XML error > A problem was encoutered while reading configuration file /etc/ > cluster/clluster.conf. > Details or the error appear below. Click the "New" button to create a > new configuration file. > To continue anyway(Not Recommended!), click the "ok" button. > > > Relax-NG validity error : Extra element rm in interleave > /etc/cluster/cluster.conf:35: element rm: Relax-NG validity error : > Element cluster failed to validate content > /etc/cluster/cluster.conf fails to validate > > > > > I clicked the "cancel" button, to not to damage all. > > Conf file is immutated since Jul 13 2006 > > Matteo > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > From peter.huesser at psi.ch Fri Sep 8 16:22:38 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 8 Sep 2006 18:22:38 +0200 Subject: [Linux-cluster] Two node cluster: node cannot connecttoclusterinfrastructure In-Reply-To: <20060908141043.44667.qmail@web50611.mail.yahoo.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19E87@MAILBOX0A.psi.ch> > my fisrt time with rhcs I had something like that, I > did a lot of things, but the last one before the node > join the cluster was a : > cman_tooy join > > I did that in the first node without configure the > second node, I didn't had to do that for the second > node Did not help either. What I do not understand is, that in some situations the node gets quorated but immediately afterwards is shutdowned ??? > > After that first experience I had others rhcs > installations from scratch and never I had to do the > "cman_tool join" again ... 
> What do you mean with installation from scratch. Did you recompile the packages by yourself ? Pedro From orkcu at yahoo.com Fri Sep 8 16:30:21 2006 From: orkcu at yahoo.com (Roger Peña Escobio) Date: Fri, 8 Sep 2006 09:30:21 -0700 (PDT) Subject: RE: [Linux-cluster] Two node cluster: node cannot connecttoclusterinfrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E87@MAILBOX0A.psi.ch> Message-ID: <20060908163021.99026.qmail@web50610.mail.yahoo.com> > > After that first experience I had others rhcs > > installations from scratch and never I had to do > the > > "cman_tool join" again ... > > > What do you mean with installation from scratch. Did > you recompile the > packages by yourself ? > not so "from scratch" ;-) I use the CentOS 4 recompilation of rhcs and rhgfs. What I meant was a complete installation of the cluster, including the operating system, so no previous conf files, no cache, and no other files taken from a previous working system. cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ From Darrell.Frazier at crc.army.mil Fri Sep 8 17:14:39 2006 From: Darrell.Frazier at crc.army.mil (Frazier, Darrell USA CRC (Contractor)) Date: Fri, 8 Sep 2006 12:14:39 -0500 Subject: [Linux-cluster] Odd/Even Nodes for RHCS/GFS Message-ID: Hello, I have heard that RHCS may have an issue with an odd number of nodes vs. an even number of nodes. Has anyone heard of this? Thanx. Darrell J. Frazier Unix System Administrator US Army Combat Readiness Center Fort Rucker, Alabama 36362 CAUTION: This electronic transmission may contain information protected by deliberative process or other privilege, which is protected from disclosure under the Freedom of Information Act, 5 U.S.C. § 552. The information is intended for the use of the individual or agency to which it was sent. If you are not the intended recipient, be aware that any disclosure, distribution or use of the contents of this information is prohibited. Do not release outside of DoD channels without prior authorization from the sender. The sender provides no assurance as to the integrity of the content of this electronic transmission after it has been sent and received by the intended email recipient. From rodgersr at yahoo.com Fri Sep 8 18:16:36 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Fri, 8 Sep 2006 11:16:36 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas Message-ID: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> Does anyone know of a good solution for providing failover for something like a Dell 1850? The issue here is that the power source plug in the back provides power for both the internal power controller and the node itself. So if you pull the cord it will not fail over, because it cannot Stonith the failed node (power controller is down also). -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From busyadmin at gmail.com Fri Sep 8 23:18:18 2006 From: busyadmin at gmail.com (Ken Johnson) Date: Fri, 8 Sep 2006 17:18:18 -0600 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> Message-ID: <200609081718.18842.ken@novell.com> On Fri, 8 Sep 2006 11:16:36 -0700, Rick Rodgers wrote: > Does anyone know of a good solution to providing good failover > for somthing like a Dell 1850? The issue here is that the power > souce plug in the back provides power for both the internal power > controller and the node itself. So if you pull the cord it will not > failover because it can not Stonith the failed node (power controller is > down also). I've used the fence_ipmi and fence_drac agents for these systems successfully. - Ken From eric at bootseg.com Sat Sep 9 00:10:57 2006 From: eric at bootseg.com (Eric Kerin) Date: Fri, 08 Sep 2006 20:10:57 -0400 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> Message-ID: <1157760657.16147.7.camel@mechanism.localnet> On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: > Does anyone know of a good solution to providing good failover > for somthing like a Dell 1850? The issue here is that the power > souce plug in the back provides power for both the internal power > controller > and the node itself. So if you pull the cord it will not failover > because > it can not Stonith the failed node (power controller is down also). > While you can't eliminate your chances of that happening while using the internal fence device, you can reduce the chance by using dual power supplies. Obviously if both power supplies go to the same PDU then you only buy so much. For my cluster, I use two external power controllers (APC 7900's) to fence my nodes, two to provide redundant power paths and no single point of failure for power. While I could use the built in RIB card (HP Servers) this method reduces the possible failure points. Thanks, Eric Kerin eric at bootseg.com From danwest at comcast.net Tue Sep 5 11:43:42 2006 From: danwest at comcast.net (danwest) Date: Tue, 05 Sep 2006 07:43:42 -0400 Subject: [Linux-cluster] 2-node fencing question (IPMI/ACPI question) In-Reply-To: <1154633146.28677.70.camel@ayanami.boston.redhat.com> References: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> <1154633146.28677.70.camel@ayanami.boston.redhat.com> Message-ID: <1157456622.4378.7.camel@belmont.site> What happens if the servers you are using require ACPI=on in order to boot. For instance IBM X366 servers need ACPI set in order to boot. With ACPI=on both nodes reboot when a fence occurs(see "both nodes off problem" in thread below). This is not desirable, especially with active/active clusters. Thanks, dan > Sorry I didn't see this earlier! > > On Wed, 2006-08-02 at 15:50 +0000, danwest at comcast.net wrote: > > It seems like a significant problem to have fence_ipmilan issue a power-off followed by a power-on with a 2 node cluster. > > Generally, the chances of this occurring are very, very small, though > not impossible. > > However, it could very well be that IPMI hardware modules are slow > enough at processing requests that this could pose a problem. What > hardware has this happened on? 
Was ACPI disabled on boot in the host OS > (it should be; see below)? > > > > This seems to make a 2-node cluster with ipmi fencing pointless. > > I'm pretty sure that 'both-nodes-off problem' can only occur if all of > the following criteria are met: > > (a) while using a separate NICs for IPMI and cluster traffic (the > recommended configuration), > > (b) in the event of a network partition, such that both nodes can not > see each other but can see each other's IPMI port, and > > (c) if both nodes send their power-off packets at or near the exact same > time. > > The time window for (c) increases significantly (5+ seconds) if the > cluster nodes are enabling ACPI power events on boot. This is one of > the reasons why booting with acpi=off is required when using IPMI, iLO, > or other integrated power management solutions. > > If booting with acpi=off, does the problem persist? > > > It looks like fence_ipmilan needs to support sending a cycle instead of a poweroff than a poweron? > > The reason fence_ipmilan functions this way (off, status, on) is because > that we require a confirmation that the node has lost power. I am not > sure that it is possible to confirm the node has rebooted using IPMI. > > Arguably, it also might not be necessary to make such a confirmation in > this particular case. > > > According to fence_ipmilan.c it looks like cycle is not an option although it is an option for ipmitool. (ipmitool -H -U -P chassis power cycle) > > Looks like you're on the right track. > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From marcelosoaressouza at gmail.com Fri Sep 8 14:18:52 2006 From: marcelosoaressouza at gmail.com (Marcelo Souza) Date: Fri, 8 Sep 2006 10:18:52 -0400 Subject: [Linux-cluster] Slackware Package for openmpi 1.1.1 and mpich2 1.0.4p1 Message-ID: <12c9ca330609080718m4e15793fle202afa21a5b3227@mail.gmail.com> If interest anyone i make Slackware packages, i486, for openmpi 1.1.1 and mpich2 1.0.4p1 TGZ http://www.cebacad.net/slackware/openmpi-1.1.1-i486-1goa.tgz http://www.cebacad.net/slackware/mpich2-1.0.4p1-i486-1goa.tgz signed with my pgp key http://www.cebacad.net/slackware/openmpi-1.1.1-i486-1goa.tgz.asc http://www.cebacad.net/slackware/mpich2-1.0.4p1-i486-1goa.tgz.asc MD5 http://www.cebacad.net/slackware/openmpi-1.1.1-i486-1goa.tgz.md5 http://www.cebacad.net/slackware/mpich2-1.0.4p1-i486-1goa.tgz.md5 see ya Marcelo Souza (marcelo at cebacad.net) http://marcelo.cebacad.net http://slackbeowulf.cebacad.net From rodgersr at yahoo.com Mon Sep 11 02:02:03 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 19:02:03 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <200609081718.18842.ken@novell.com> Message-ID: <20060911020203.67886.qmail@web34206.mail.mud.yahoo.com> How does this help? The power controller is still down ----- Original Message ---- From: Ken Johnson To: linux-cluster at redhat.com Sent: Friday, September 8, 2006 4:18:18 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Fri, 8 Sep 2006 11:16:36 -0700, Rick Rodgers wrote: > Does anyone know of a good solution to providing good failover > for somthing like a Dell 1850? The issue here is that the power > souce plug in the back provides power for both the internal power > controller and the node itself. 
So if you pull the cord it will not > failover because it can not Stonith the failed node (power controller is > down also). I've used the fence_ipmi and fence_drac agents for these systems successfully. - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From busyadmin at gmail.com Mon Sep 11 04:02:19 2006 From: busyadmin at gmail.com (Ken Johnson) Date: Sun, 10 Sep 2006 22:02:19 -0600 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060911020203.67886.qmail@web34206.mail.mud.yahoo.com> References: <200609081718.18842.ken@novell.com> <20060911020203.67886.qmail@web34206.mail.mud.yahoo.com> Message-ID: <1c0e77670609102102w79125384xd648b11b3e3dc889@mail.gmail.com> On Sun, 10 Sep 2006 at 19:02 -0700, Rick Rodgers wrote: > How does this help? The power controller is still down Sorry, I obviously don't understand your question. I thought you were looking for fencing solutions for these devices (1850's). - Ken From rodgersr at yahoo.com Mon Sep 11 04:14:11 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 21:14:11 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1c0e77670609102102w79125384xd648b11b3e3dc889@mail.gmail.com> Message-ID: <20060911041411.22029.qmail@web34202.mail.mud.yahoo.com> Yes I was, but if the power controller is down (unreachable) and the system (node) is hung how can these fence anything? By pulling the plug you loose both and you can not be sure of anything since you can not successfully issue a power cycle command. Thanks for your input though. ----- Original Message ---- From: Ken Johnson To: linux clustering Sent: Sunday, September 10, 2006 9:02:19 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Sun, 10 Sep 2006 at 19:02 -0700, Rick Rodgers wrote: > How does this help? The power controller is still down Sorry, I obviously don't understand your question. I thought you were looking for fencing solutions for these devices (1850's). - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Mon Sep 11 04:15:16 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 21:15:16 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1c0e77670609102102w79125384xd648b11b3e3dc889@mail.gmail.com> Message-ID: <20060911041516.22450.qmail@web34202.mail.mud.yahoo.com> can these agents do anything if the power controller is unaccessable? ----- Original Message ---- From: Ken Johnson To: linux clustering Sent: Sunday, September 10, 2006 9:02:19 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Sun, 10 Sep 2006 at 19:02 -0700, Rick Rodgers wrote: > How does this help? The power controller is still down Sorry, I obviously don't understand your question. I thought you were looking for fencing solutions for these devices (1850's). - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From busyadmin at gmail.com Mon Sep 11 04:42:16 2006 From: busyadmin at gmail.com (Ken Johnson) Date: Sun, 10 Sep 2006 22:42:16 -0600 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060911041411.22029.qmail@web34202.mail.mud.yahoo.com> References: <1c0e77670609102102w79125384xd648b11b3e3dc889@mail.gmail.com> <20060911041411.22029.qmail@web34202.mail.mud.yahoo.com> Message-ID: <1c0e77670609102142t267518e3l99a3a2a6eb5fd498@mail.gmail.com> On Sun, 10 Sep 2006 at 21:14 -0700, Rick Rodgers wrote: > Yes I was, but if the power controller is down (unreachable) > and the system (node) is hung how can these fence anything? > By pulling the plug you loose both and you can not be sure of anything > since you can not successfully issue a power cycle command. I'm not sure I understand what you mean by "if the power controller is down". These systems can be configured with redundant power supplies and if both power supplies fail then there's not anything you can do to fence a system. > Thanks for your input though. sure, np - Ken From rodgersr at yahoo.com Mon Sep 11 06:19:12 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 23:19:12 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1c0e77670609102142t267518e3l99a3a2a6eb5fd498@mail.gmail.com> Message-ID: <20060911061912.39375.qmail@web34204.mail.mud.yahoo.com> Yes that is what my point is. These systems use the same power cord for the powercontroller and system power. If you pull the plug then no failover can happen because the backup node can not shoot the active node because it can not talk to the active nodes power controller. This means a pull of the plug and no failover. Seem like we really should havea way to failover. ----- Original Message ---- From: Ken Johnson To: linux clustering Sent: Sunday, September 10, 2006 9:42:16 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Sun, 10 Sep 2006 at 21:14 -0700, Rick Rodgers wrote: > Yes I was, but if the power controller is down (unreachable) > and the system (node) is hung how can these fence anything? > By pulling the plug you loose both and you can not be sure of anything > since you can not successfully issue a power cycle command. I'm not sure I understand what you mean by "if the power controller is down". These systems can be configured with redundant power supplies and if both power supplies fail then there's not anything you can do to fence a system. > Thanks for your input though. sure, np - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Mon Sep 11 06:23:33 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Sun, 10 Sep 2006 23:23:33 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1c0e77670609102142t267518e3l99a3a2a6eb5fd498@mail.gmail.com> Message-ID: <20060911062333.48552.qmail@web34210.mail.mud.yahoo.com> When you say redundant power supply, do you mean they have the same IP address?. If not, how does Clumanger handle talking to two power supplys? And if one goes down how does it know to talk to the other? Is there a configuration in cluster.xml? 
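[Aside for readers of the archive: Lon answers this further down the thread - clumanager 1.2 has no real notion of backup power controllers (multiple devices are "all or nothing"), while the CS4-generation cluster.conf does, via ordered fence "methods" (levels) per node. As a rough sketch only, with made-up device names, addresses and passwords, a node with its IPMI controller as the first level and an external APC switch as the backup level would look something like:

    <clusternode name="node1" votes="1">
      <fence>
        <method name="1">
          <device name="node1-ipmi"/>
        </method>
        <method name="2">
          <device name="apc-pdu" port="1"/>
        </method>
      </fence>
    </clusternode>
    ...
    <fencedevices>
      <fencedevice name="node1-ipmi" agent="fence_ipmilan" ipaddr="10.0.0.11" login="admin" passwd="secret"/>
      <fencedevice name="apc-pdu" agent="fence_apc" ipaddr="10.0.0.20" login="apc" passwd="apc"/>
    </fencedevices>

fenced tries method "1" first and only moves on to method "2" if every device in the first level fails; the exact attribute names should be checked against the fence agent man pages for the installed release.]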
----- Original Message ---- From: Ken Johnson To: linux clustering Sent: Sunday, September 10, 2006 9:42:16 PM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas On Sun, 10 Sep 2006 at 21:14 -0700, Rick Rodgers wrote: > Yes I was, but if the power controller is down (unreachable) > and the system (node) is hung how can these fence anything? > By pulling the plug you loose both and you can not be sure of anything > since you can not successfully issue a power cycle command. I'm not sure I understand what you mean by "if the power controller is down". These systems can be configured with redundant power supplies and if both power supplies fail then there's not anything you can do to fence a system. > Thanks for your input though. sure, np - Ken -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Mon Sep 11 07:16:51 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Mon, 11 Sep 2006 09:16:51 +0200 Subject: [Linux-cluster] CS4 Update 2 & Patch watchdog on rgmanager process Message-ID: <45050D63.2040507@bull.net> Hi I tried to apply the watchdog path on CS4 U2 , which normally should launch a reboot if the process clurmgrd disappears for any reason, but it seems not to work on Update 2 ... We have now two clurmgrd processes launched at rgmanager start, and I tried to kill it about 10 times, but it leads to a reboot of the node only once. Any idea ? Which is exactly the expected behavior with the watchdog patch ? Thanks Alain Moull? From jos at xos.nl Mon Sep 11 07:59:44 2006 From: jos at xos.nl (Jos Vos) Date: Mon, 11 Sep 2006 09:59:44 +0200 Subject: [Linux-cluster] GFS and (missing) filesystem labels Message-ID: <200609110759.k8B7xij01034@xos037.xos.nl> Hi, It seems that you can not add a filesystem label to a GFS filesystem. Especially when using iSCSI, it would be handy to have a method to be sure that you mount the right SCSI device, in case the device name has changed due to a failure of another (i)SCSI device. Is there a good solution for this? Thanks, -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From riaan at obsidian.co.za Mon Sep 11 09:19:21 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Mon, 11 Sep 2006 11:19:21 +0200 Subject: [Linux-cluster] GFS and (missing) filesystem labels In-Reply-To: <200609110759.k8B7xij01034@xos037.xos.nl> References: <200609110759.k8B7xij01034@xos037.xos.nl> Message-ID: <45052A19.1000804@obsidian.co.za> Jos Vos wrote: > Hi, > > It seems that you can not add a filesystem label to a GFS filesystem. > > Especially when using iSCSI, it would be handy to have a method to be > sure that you mount the right SCSI device, in case the device name > has changed due to a failure of another (i)SCSI device. > > Is there a good solution for this? > > Thanks, > hi Jos a) use LVM. it does not care what the underlying physical volume names are, it will do the "right thing" w.r.t. volume groups and logical volumes names b) your multipathing solution (e.g. EMC PowerPath with its persistent mapping functionality of paths to for example /dev/emcpowera1) might also solve this problem, if you want to avoid using LVM. 
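[Aside: to make (a) and (b) a bit more concrete - LVM identifies physical volumes by the UUID in their metadata rather than by the /dev/sd* name, so the /dev/<vg>/<lv> path stays the same across reboots no matter which SCSI name the LUN comes up with. A rough sketch, with example volume names:

    # the VG/LV path is stable even if the PV moved from /dev/sdb to /dev/sdc
    pvs -o pv_name,pv_uuid,vg_name
    lvdisplay /dev/san_vg/gfs01

    # with EMC PowerPath it is common to point LVM at the emcpower devices
    # only, in /etc/lvm/lvm.conf, so each underlying path is not scanned twice:
    filter = [ "a|^/dev/emcpower.*|", "r|^/dev/sd.*|" ]

The filter line is an illustration rather than a tested PowerPath configuration - check it against your multipath documentation.]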
note - on SANs with multiple paths to the same LUN/partition, using labels to mount does not work (you will get an error message about duplicate labels), which is probably why the functionality is not there to begin with, and probably will not be either. HTH Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From peter.huesser at psi.ch Mon Sep 11 09:54:46 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Mon, 11 Sep 2006 11:54:46 +0200 Subject: [Linux-cluster] Immediate shutdown after getting quorate of two node cluster Message-ID: <8E2924888511274B95014C2DD906E58AD19EDE@MAILBOX0A.psi.ch> Hello I try to run a two node cluster. Starting ccsd on both servers is no problem. But if I try to start cman I get the following lines in my "/var/log/messages" file: Sep 11 11:44:26 server01 ccsd[24972]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Sep 11 11:44:26 server01 ccsd[24972]: Initial status:: Inquorate Sep 11 11:44:57 server01 ccsd[24972]: Cluster is quorate. Allowing connections. Sep 11 11:44:57 server01 ccsd[24972]: Cluster manager shutdown. Attemping to reconnect... Sep 11 11:44:58 server01 ccsd[24972]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Sep 11 11:44:58 server01 ccsd[24972]: Initial status:: Inquorate Sep 11 11:45:29 server01 ccsd[24972]: Cluster is quorate. Allowing connections. Sep 11 11:45:29 server01 ccsd[24972]: Cluster manager shutdown. Attemping to reconnect... ... Why is the daemon shutdown after getting quorated ? Any ideas ? Thanks' Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From sara_sodagar at yahoo.com Mon Sep 11 11:26:31 2006 From: sara_sodagar at yahoo.com (sara sodagar) Date: Mon, 11 Sep 2006 04:26:31 -0700 (PDT) Subject: [Linux-cluster] Question about using Lock manager Message-ID: <20060911112631.98356.qmail@web31801.mail.mud.yahoo.com> hi I am new in GFS concept and planning to use RHEL4 GFS to immplement clustering.My SAN is HDS 9585 and I have 4 HS-20 web servers and 2 IBM HS20 ftp servers. My question is about the place of lock amanger in this configuration. Should I set up lock manager on a separate host or would it be possible to have a node with both roles of lock manager and apache ? please let me know the impact of having a node with both roles ? If I should set up Lock manager and its RLM on different nodes please let me know the best configuration . I would be greatful if any one can help me regarding this matter. --Best regards. Sara __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From lists at brimer.org Mon Sep 11 12:47:14 2006 From: lists at brimer.org (Barry Brimer) Date: Mon, 11 Sep 2006 07:47:14 -0500 (CDT) Subject: [Linux-cluster] Question about using Lock manager In-Reply-To: <20060911112631.98356.qmail@web31801.mail.mud.yahoo.com> References: <20060911112631.98356.qmail@web31801.mail.mud.yahoo.com> Message-ID: > hi > I am new in GFS concept and planning to use RHEL4 GFS > to immplement clustering.My SAN is HDS 9585 and I have > 4 HS-20 web servers and 2 IBM HS20 ftp servers. > My question is about the place of lock amanger in this > configuration. > Should I set up lock manager on a separate host or > would it be possible to have a node with both roles of > lock manager and apache ? please let me know the > impact of having a node with both roles ? 
> If I should set up Lock manager and its RLM on > different nodes please let me know the best > configuration . I would recommend using lock_dlm. With lock_dlm, each node manages locks for the files it uses. Hope this helps. Barry From lhh at redhat.com Mon Sep 11 13:55:19 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 09:55:19 -0400 Subject: [Linux-cluster] Issue stonith commands to the failing node twice?? In-Reply-To: <20060907230728.27221.qmail@web34207.mail.mud.yahoo.com> References: <20060907230728.27221.qmail@web34207.mail.mud.yahoo.com> Message-ID: <1157982919.3610.274.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-07 at 16:07 -0700, Rick Rodgers wrote: > I am using an older version of clumanger (about 2 yrs old) and I > notice > that when the active node goes down the back will actually issue > stonith commands twice. They are about 60 seconds apart. Does this > happen to anyone else?? It's "normal" if you're using the disk tiebreaker. That is, it's been around for so long that people are used to it ;) Basically, both membership transitions and quorum disk transitions are causing full recovery (including STONITH). However, only one should cause a STONITH event -- the one that happens last. There is a switch which should fix it in 1.2.34, but it has to be enabled manually ('cludb -p cluquorumd%disk_quorum 1'). -- Lon -------------- next part -------------- A non-text attachment was scrubbed... Name: clumanager-1.2.31-179363.patch Type: text/x-patch Size: 5553 bytes Desc: not available URL: From lhh at redhat.com Mon Sep 11 13:58:40 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 09:58:40 -0400 Subject: [Linux-cluster] Odd/Even Nodes for RHCS/GFS In-Reply-To: References: Message-ID: <1157983120.3610.279.camel@rei.boston.devel.redhat.com> On Fri, 2006-09-08 at 12:14 -0500, Frazier, Darrell USA CRC (Contractor) wrote: > Hello, > > > > I have heard that RHCS may have an issue with odd number of nodes vs > even number of nodes. Has anyone heard of this? Thanx. The only special considerations are with two-node clusters, because there is no easy way to declare a majority in two node clusters. So, there has to be a way to decide which node is "alive" and which one is "dead" in the case of a network partition. There are several ways to do this. Otherwise, even vs. odd should not matter. If there are any issues WRT even vs. odd, it's probably a bug. -- Lon From lhh at redhat.com Mon Sep 11 14:07:27 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 10:07:27 -0400 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> Message-ID: <1157983647.3610.287.camel@rei.boston.devel.redhat.com> On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: > Does anyone know of a good solution to providing good failover > for somthing like a Dell 1850? The issue here is that the power > souce plug in the back provides power for both the internal power > controller > and the node itself. So if you pull the cord it will not failover > because > it can not Stonith the failed node (power controller is down also). Generally, you can't handle this without external fencing. 
https://www.redhat.com/archives/linux-cluster/2006-September/msg00026.html -- Lon From jparsons at redhat.com Mon Sep 11 14:40:12 2006 From: jparsons at redhat.com (James Parsons) Date: Mon, 11 Sep 2006 10:40:12 -0400 Subject: [Linux-cluster] system-config-cluster problem In-Reply-To: References: Message-ID: <4505754C.5020806@redhat.com> Matteo Catanese wrote: > I've setup a cluster some month ago. > > Cluster is working , but still not in production. > > Today, after summer break, i did all the updates for my rhat and CS > > First i disabled all services, then i patched one machine and > rebooted, then the other one and rebooted. > > > Cluster works perfectly: > > > [root at lvzbe1 ~]# uname -a > Linux lvzbe1.lavazza.it 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 > EDT 2006 i686 i686 i386 GNU/Linux > [root at lvzbe1 ~]# clustat -v > clustat version 1.9.53 > Connected via: CMAN/SM Plugin v1.1.7.1 > [root at lvzbe1 ~]# clustat > Member Status: Quorate > > Member Name Status > ------ ---- ------ > lvzbe1 Online, Local, rgmanager > lvzbe2 Online, rgmanager > > Service Name Owner (Last) State > ------- ---- ----- ------ ----- > oracle lvzbe1 started > [root at lvzbe1 ~]# > > > But when i try to run system-config-cluster,it pops out: > > Poorly Formed XML error > A problem was encoutered while reading configuration file /etc/ > cluster/clluster.conf. > Details or the error appear below. Click the "New" button to create a > new configuration file. > To continue anyway(Not Recommended!), click the "ok" button. > > > Relax-NG validity error : Extra element rm in interleave > /etc/cluster/cluster.conf:35: element rm: Relax-NG validity error : > Element cluster failed to validate content > /etc/cluster/cluster.conf fails to validate > Hi Matteo, Here is why the conf file is failing validation: In your conf lines specifying your two FS's, you have an fstype attribute but no fsid attribute. I spoke with Lon, who is the Grand Resource Guru, and he says that the two should be exclusive, that is, an fsid should not be necessary just because you are specifying an fstype. So this is a bug in the relaxNG schema validation file. A fix for this will be in the next update, and until then, using the conf file that you attached, please just disregard the warning message. For completeness sake, I am attaching a fixed version of the relaxNG file that you can drop into /usr/share/system-config-cluster/misc, if you want. Thanks for finding this issue. -Jim -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cluster.ng URL: From riaan at obsidian.co.za Mon Sep 11 15:31:14 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Mon, 11 Sep 2006 17:31:14 +0200 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <1157983647.3610.287.camel@rei.boston.devel.redhat.com> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> <1157983647.3610.287.camel@rei.boston.devel.redhat.com> Message-ID: <45058142.3040901@obsidian.co.za> Lon Hohberger wrote: > On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: >> Does anyone know of a good solution to providing good failover >> for somthing like a Dell 1850? The issue here is that the power >> souce plug in the back provides power for both the internal power >> controller >> and the node itself. So if you pull the cord it will not failover >> because >> it can not Stonith the failed node (power controller is down also). 
> > Generally, you can't handle this without external fencing. > > https://www.redhat.com/archives/linux-cluster/2006-September/msg00026.html > > -- Lon > Lon - having reread that previous posting of yours, and esp the last paragraph: +++ (c) ... If a host does a "graceful shutdown" when you fence it via IPMI, you need to disable ACPI on that host (e.g. boot with acpi=off). The server should turn off immediately (or within 4-5 seconds, like when holding an ATX power button in to force a machine off). ++++ Just so I am absolutely sure about this: Is the above the only scenario when would have to disable ACPI? e.g. a graceful shutdown is easy to spot. If I don't see one in the logs, that means I can leave ACPI on? Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From lhh at redhat.com Mon Sep 11 15:32:18 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 11:32:18 -0400 Subject: [Linux-cluster] 2-node fencing question (IPMI/ACPI question) In-Reply-To: <1157456622.4378.7.camel@belmont.site> References: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> <1154633146.28677.70.camel@ayanami.boston.redhat.com> <1157456622.4378.7.camel@belmont.site> Message-ID: <1157988738.3610.367.camel@rei.boston.devel.redhat.com> On Tue, 2006-09-05 at 07:43 -0400, danwest wrote: > What happens if the servers you are using require ACPI=on in order to > boot. For instance IBM X366 servers need ACPI set in order to boot. > With ACPI=on both nodes reboot when a fence occurs(see "both nodes off > problem" in thread below). This is not desirable, especially with > active/active clusters. Hopefully, the X366 either turns off immediately or can be configured to do so upon getting the "power off" command with ACPI enabled. If it does not, then you will need remote power control or fabric-level fencing. Here is some relevant background information. If you look at the IPMI v1.5 and v2 specifications, the instruction 0 for power control is supposed force the system to S4/S5 (soft-off) state immediately (for use in emergency situations). If you then look at the ipmitool source code, you will find that it uses the 0 instruction when you do a 'chassis power off' command. (quote, source = http://www.intel.com/design/servers/ipmi/pdf/IPMIv2_0_rev1_0_E3_markup.pdf - page 403): [3:0] - chassis control 0h = power down. Force system into soft off (S4/S45) state. This is for `emergency' management power down actions. The command does not initiate a clean shut-down of the operating system prior to powering down the system. (/quote) The reason linux-cluster often needs ACPI disabled with IPMI is because in many cases, machines which receive this "emergency power off" instruction do not appear to operate as what is stated in the IPMI specification. That is, some do a full, complete, clean shutdown when ACPI is enabled. If the shutdown never completes, fencing will never complete and the cluster will never recover. Now, not all machines behave this way. If your machine powers off immediately with ACPI enabled, then you do not need to disable ACPI. (Note: cheating by switching the acpid event for power button presses to /sbin/poweroff -fn does *not* count!) It is possible that some machines are - quite simply - twiddling the motherboard's soft power button. 
In that case, it is possible that those machines can also be configured to do an immediate-off in the BIOS when the power button is pressed, thereby alleviating the need for booting with ACPI disabled. There may be other ways to work around the ACPI/IPMI problem on your specific hardware; this is just an example. Booting with ACPI disabled is the general "quick fix", which works immediately for the majority of machines with IPMI - and does not require hardware-specific configuration. Booting with ACPI disabled also works for other types of integrated power management (iLO, RSA, DRAC, etc.) which often suffer the same problems. As noted by others in separate emails to this list, it would be nice if we could use the reboot operations more often - rather than "off, on" cycles in all cases. Most fencing solutions can not (as far as I know) confirm that a machine has rebooted the way it can confirm that a machine is "off" or "on". Of course, "reboot" does not suffer the theoretical "everyone off at once" problem, and it should eliminate the need boot with ACPI disabled. -- Lon From rodgersr at yahoo.com Mon Sep 11 15:52:37 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Mon, 11 Sep 2006 08:52:37 -0700 (PDT) Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <45058142.3040901@obsidian.co.za> Message-ID: <20060911155237.47711.qmail@web34215.mail.mud.yahoo.com> Graceful shutdown? The question I also have is: In a two node cluster when you shoutdown (shutdown/reboot command) the active node should this cause a failover? ----- Original Message ---- From: Riaan van Niekerk To: linux clustering Sent: Monday, September 11, 2006 8:31:14 AM Subject: Re: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas Lon Hohberger wrote: > On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: >> Does anyone know of a good solution to providing good failover >> for somthing like a Dell 1850? The issue here is that the power >> souce plug in the back provides power for both the internal power >> controller >> and the node itself. So if you pull the cord it will not failover >> because >> it can not Stonith the failed node (power controller is down also). > > Generally, you can't handle this without external fencing. > > https://www.redhat.com/archives/linux-cluster/2006-September/msg00026.html > > -- Lon > Lon - having reread that previous posting of yours, and esp the last paragraph: +++ (c) ... If a host does a "graceful shutdown" when you fence it via IPMI, you need to disable ACPI on that host (e.g. boot with acpi=off). The server should turn off immediately (or within 4-5 seconds, like when holding an ATX power button in to force a machine off). ++++ Just so I am absolutely sure about this: Is the above the only scenario when would have to disable ACPI? e.g. a graceful shutdown is easy to spot. If I don't see one in the logs, that means I can leave ACPI on? Riaan begin:vcard fn:Riaan van Niekerk n:van Niekerk;Riaan org:Obsidian Systems;Obsidian Red Hat Consulting email;internet:riaan at obsidian.co.za title:Systems Architect tel;work:+27 11 792 6500 tel;fax:+27 11 792 6522 tel;cell:+27 82 921 8768 x-mozilla-html:FALSE url:http://www.obsidian.co.za version:2.1 end:vcard -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rodgersr at yahoo.com Mon Sep 11 16:24:50 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Mon, 11 Sep 2006 09:24:50 -0700 (PDT) Subject: [Linux-cluster] Clumanger reboots to same node Message-ID: <20060911162450.93432.qmail@web34204.mail.mud.yahoo.com> Somtimes during testing when you use the powerctroller to reboot the active node, clumanger will not fail over but instead restart the services on the same node. Has anyone seen this? -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Mon Sep 11 16:33:20 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 12:33:20 -0400 Subject: [Linux-cluster] Issue stonith commands to the failing node twice?? In-Reply-To: <20060911161153.82338.qmail@web34201.mail.mud.yahoo.com> References: <20060911161153.82338.qmail@web34201.mail.mud.yahoo.com> Message-ID: <1157992400.3610.393.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-11 at 09:11 -0700, Rick Rodgers wrote: > Thanks. > I woas wondering if Clumanager can work with dual power controllers? > So if one controller goes down and it needs to shoot the node it can > use the other > controller to shoot the node. If so how does that get configured into > clumanager? Clumanager 1.2.x's use of multiple power controllers is basically "all or nothing". That is, if you have two power controllers listed, both must succeed or STONITH fails. There is no equivalent (in clumanager 1.2.x) to RHCS4's / RHGFS6.0's / RHGFS6.1's "fence level" construct, which allows you to configure backup fencing. Each fence level is tried in sequence (each fence level may have one or more devices to try). The first level which fully succeeds ends the fencing operation (successfully). If no level succeeds, fencing fails (and is retried on RHCS4). -- Lon From venilton.junior at sercompe.com.br Mon Sep 11 19:29:16 2006 From: venilton.junior at sercompe.com.br (Venilton Junior) Date: Mon, 11 Sep 2006 16:29:16 -0300 Subject: [Linux-cluster] GFS questions Message-ID: Hi, I'm wondering if I could deploy a cluster solution with 3 nodes accessing the same storage area without using GFS. Are there any other solutions that allow me to access the same file system without using GFS? I have 3 nodes running RHAS4 and they're accessing an EVA4000. I'd like to run a Huge SMTP server on this infrastructure and I'm seeking for all possibilities to do that. Does anyone have an idea? Best regards, Venilton C. Junior -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Mon Sep 11 20:32:34 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 16:32:34 -0400 Subject: [Linux-cluster] power controller is interal/loss of pwer prevents failover: any ideas In-Reply-To: <45058142.3040901@obsidian.co.za> References: <20060908181636.5993.qmail@web34207.mail.mud.yahoo.com> <1157983647.3610.287.camel@rei.boston.devel.redhat.com> <45058142.3040901@obsidian.co.za> Message-ID: <1158006754.3610.406.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-11 at 17:31 +0200, Riaan van Niekerk wrote: > Lon Hohberger wrote: > > On Fri, 2006-09-08 at 11:16 -0700, Rick Rodgers wrote: > >> Does anyone know of a good solution to providing good failover > >> for somthing like a Dell 1850? The issue here is that the power > >> souce plug in the back provides power for both the internal power > >> controller > >> and the node itself. 
So if you pull the cord it will not failover > >> because > >> it can not Stonith the failed node (power controller is down also). > > > > Generally, you can't handle this without external fencing. > > > > https://www.redhat.com/archives/linux-cluster/2006-September/msg00026.html > > > > -- Lon > > > > Lon - having reread that previous posting of yours, and esp the last > paragraph: > > +++ > (c) ... If a host does a "graceful shutdown" when > you fence it via IPMI, you need to disable ACPI on that host (e.g. boot > with acpi=off). The server should turn off immediately (or within 4-5 > seconds, like when holding an ATX power button in to force a machine > off). > ++++ > > Just so I am absolutely sure about this: Is the above the only scenario > when would have to disable ACPI? e.g. a graceful shutdown is easy to > spot. If I don't see one in the logs, that means I can leave ACPI on? Basically, yes. If you want to be sure, watch the machine's console while you perform a power off using the integrated power management. If the machine shuts off immediately (while ACPI is enabled) then leaving it enabled should not cause any problems with the cluster. Note: Setting acpid to do /sbin/poweroff or its likeness does not count as an "instant off"... Don't cheat :) -- Lon From lhh at redhat.com Mon Sep 11 20:33:27 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 16:33:27 -0400 Subject: [Linux-cluster] Immediate shutdown after getting quorate of two node cluster In-Reply-To: <8E2924888511274B95014C2DD906E58AD19EDE@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD19EDE@MAILBOX0A.psi.ch> Message-ID: <1158006807.3610.408.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-11 at 11:54 +0200, Huesser Peter wrote: > Hello > > > > I try to run a two node cluster. Starting ccsd on both servers is no > problem. But if I try to start cman I get the following lines in my > ?/var/log/messages? file: > > > > Sep 11 11:44:26 server01 ccsd[24972]: Connected to cluster > infrastruture via: CMAN/SM Plugin v1.1.5 > > Sep 11 11:44:26 server01 ccsd[24972]: Initial status:: Inquorate > > Sep 11 11:44:57 server01 ccsd[24972]: Cluster is quorate. Allowing > connections. > > Sep 11 11:44:57 server01 ccsd[24972]: Cluster manager shutdown. > Attemping to reconnect... > > Sep 11 11:44:58 server01 ccsd[24972]: Connected to cluster > infrastruture via: CMAN/SM Plugin v1.1.5 > > Sep 11 11:44:58 server01 ccsd[24972]: Initial status:: Inquorate > > Sep 11 11:45:29 server01 ccsd[24972]: Cluster is quorate. Allowing > connections. > > Sep 11 11:45:29 server01 ccsd[24972]: Cluster manager shutdown. > Attemping to reconnect... > > ? > > > > Why is the daemon shutdown after getting quorated ? Any ideas ? What does the dmesg output look like? -- Lon From lhh at redhat.com Mon Sep 11 21:10:19 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 11 Sep 2006 17:10:19 -0400 Subject: [Linux-cluster] CS4 Update 2 & Patch watchdog on rgmanager process In-Reply-To: <45050D63.2040507@bull.net> References: <45050D63.2040507@bull.net> Message-ID: <1158009019.3610.418.camel@rei.boston.devel.redhat.com> Hi, The self-watchdog patch adds a process which monitors the "real" clurgmgrd. The monitoring process should be the lower-numbered PID (it's the parent of the one doing the work). The monitoring process watches for crash signals (SIGBUS, SIGSEGV, etc.), and will simply exit if you kill the child with SIGKILL. 
So, basically, killing the higher-numbered PID with something like SIGSEGV should cause the node to reboot. -- Lon From rico_tsang at macroview.com Tue Sep 12 03:09:31 2006 From: rico_tsang at macroview.com (Rico Tsang) Date: Tue, 12 Sep 2006 11:09:31 +0800 Subject: [Linux-cluster] GFS questions Message-ID: <61E6BBD96354E1419428314BA80EA8B9750A2D@exchsvr.macroview.com> Dear Venilton, You may want to take a look at the list of shared file systems in Wiki: http://en.wikipedia.org/wiki/List_of_file_systems#Shared_disk_file_syste ms I think that IBM GPFS or Polyserve are some of the well-known SAN file systems that you can check. Regards, Rico _____ From: Venilton Junior [mailto:venilton.junior at sercompe.com.br] Sent: Tuesday, September 12, 2006 3:29 AM To: linux-cluster at redhat.com Subject: [Linux-cluster] GFS questions Hi, I'm wondering if I could deploy a cluster solution with 3 nodes accessing the same storage area without using GFS. Are there any other solutions that allow me to access the same file system without using GFS? I have 3 nodes running RHAS4 and they're accessing an EVA4000. I'd like to run a Huge SMTP server on this infrastructure and I'm seeking for all possibilities to do that. Does anyone have an idea? Best regards, Venilton C. Junior -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Tue Sep 12 05:10:49 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Tue, 12 Sep 2006 07:10:49 +0200 Subject: [Linux-cluster] Immediate shutdown after getting quorate oftwo node cluster In-Reply-To: <1158006807.3610.408.camel@rei.boston.devel.redhat.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19F1A@MAILBOX0A.psi.ch> > What does the dmesg output look like? It looks the following CMAN: forming a new cluster CMAN: quorum regained, resuming activity CMAN: sendmsg failed: -13 CMAN: No functional network interfaces, leaving cluster CMAN: sendmsg failed: -13 CMAN: we are leaving the cluster. CMAN: Waiting to join or form a Linux-cluster CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: sendmsg failed: -13 CMAN: forming a new cluster CMAN: quorum regained, resuming activity CMAN: sendmsg failed: -13 CMAN: No functional network interfaces, leaving cluster CMAN: sendmsg failed: -13 CMAN: we are leaving the cluster. CMAN: Waiting to join or form a Linux-cluster CMAN: sendmsg failed: -13 Can't interpret the "No functional network interface". No firewall is running on the system. /etc/hosts.{allow,deny} makes no restriction. The /etc/hosts file is set up correctly. Pedro From dan.hawker at astrium.eads.net Tue Sep 12 08:25:33 2006 From: dan.hawker at astrium.eads.net (HAWKER, Dan) Date: Tue, 12 Sep 2006 09:25:33 +0100 Subject: [Linux-cluster] CLVMD - Do I need it??? Message-ID: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> Hi All, Have an EMC SAN unit on the way. I plan to use it as the central store for a couple of servers setup as a cluster, using GFS. As the SAN unit can handle all of its own Logical Volume management natively, I presume I don't have to use/implement CLVMD and hence can cut one layer of complexity in the disk structure away. Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its configuration??? TIA Dan -- Dan Hawker Linux System Administrator Astrium -- This email is for the intended addressee only. If you have received it in error then you must not use, retain, disseminate or otherwise deal with it. 
Please notify the sender by return email. The views of the author may not necessarily constitute the views of Astrium Limited. Nothing in this email shall bind Astrium Limited in any contract or obligation. Astrium Limited, Registered in England and Wales No. 2449259 Registered Office: Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England From m.catanese at kinetikon.com Tue Sep 12 08:45:30 2006 From: m.catanese at kinetikon.com (Matteo Catanese) Date: Tue, 12 Sep 2006 10:45:30 +0200 Subject: [Linux-cluster] system-config-cluster problem Message-ID: <6B5FFF19-58EF-42B0-81E1-98D280314168@kinetikon.com> Thx a lot James and Lon, i feel more relaxed now :-) I will disregard that warning message and wait until next patch. Ciao Matteo From pcaulfie at redhat.com Tue Sep 12 09:33:24 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 12 Sep 2006 10:33:24 +0100 Subject: [Linux-cluster] Two node cluster: node cannot connect to cluster infrastructure In-Reply-To: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD19E1E@MAILBOX0A.psi.ch> Message-ID: <45067EE4.6010503@redhat.com> Huesser Peter wrote: > Hello > > > > I try to make a two node cluster run. Unfortunately if I run the > ?/etc/init.d/cman start? command I get in ?/var/log/messages? entries like: > > > > Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 > > Initial status:: Inquorate > > Cluster manager shutdown. Attemping to reconnect... > > Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 > > Initial status:: Inquorate > > Cluster manager shutdown. Attemping to reconnect... > > > > ?dmesg? shows: > > > > CMAN: sendmsg failed: -13 > That's a kernel/userspace mismatch. Upgrade the userspace cman tools. -- patrick From pcaulfie at redhat.com Tue Sep 12 09:37:58 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 12 Sep 2006 10:37:58 +0100 Subject: [Linux-cluster] Immediate shutdown after getting quorate oftwo node cluster In-Reply-To: <8E2924888511274B95014C2DD906E58AD19F1A@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD19F1A@MAILBOX0A.psi.ch> Message-ID: <45067FF6.1010605@redhat.com> Huesser Peter wrote: >> What does the dmesg output look like? > > It looks the following > > CMAN: forming a new cluster > CMAN: quorum regained, resuming activity > CMAN: sendmsg failed: -13 > CMAN: No functional network interfaces, leaving cluster > CMAN: sendmsg failed: -13 > CMAN: we are leaving the cluster. > CMAN: Waiting to join or form a Linux-cluster > CMAN: sendmsg failed: -13 > CMAN: sendmsg failed: -13 > CMAN: sendmsg failed: -13 > CMAN: sendmsg failed: -13 > CMAN: sendmsg failed: -13 That's a kernel/userspace mismatch Upgrade the cman user tools. (I'm going to put that text on a macro key!) -- patrick From riaan at obsidian.co.za Tue Sep 12 10:10:13 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Tue, 12 Sep 2006 12:10:13 +0200 Subject: [Linux-cluster] post_fail_delay versus deadnode_timeout Message-ID: <45068785.9070404@obsidian.co.za> hi We are trying to capture diskdumps when a lock_dlm kernel panic happens and need to increase either post_fail_delay or deadnode_timeout to prevent the dumping node from being fenced. Is there any advantages or disadvantages to using either? Which is recommended? 
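[Aside: the two knobs live in different places and do different things. deadnode_timeout is how long cman waits for heartbeats before declaring a node dead; post_fail_delay is how long fenced then waits after the failure before actually fencing. A sketch of where each is set on RHEL4-era clusters (the values are examples, and the /proc path may differ between releases):

    # cluster.conf - wait 2 minutes after a node is declared dead before fencing it
    <fence_daemon post_fail_delay="120" post_join_delay="3"/>

    # cman runtime tunable - give a silent node longer before it is declared dead
    echo 120 > /proc/cluster/config/cman/deadnode_timeout
    cat /proc/cluster/config/cman/deadnode_timeout

Raising deadnode_timeout delays failure detection for every kind of failure, while post_fail_delay only postpones the fencing step once a failure has already been detected.]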
post_fail_delay and diskdump has come up previously, with some good answers from David http://www.redhat.com/archives/linux-cluster/2006-June/msg00037.html note: for capturing a "sysrq t", we manually increase deadnode_timeout, and decrease it back again, but don't have this luxury with a kernel panic (which can happen at any time). Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From lhh at redhat.com Tue Sep 12 14:04:19 2006 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 12 Sep 2006 10:04:19 -0400 Subject: [Linux-cluster] CLVMD - Do I need it??? In-Reply-To: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> References: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> Message-ID: <1158069859.3610.437.camel@rei.boston.devel.redhat.com> On Tue, 2006-09-12 at 09:25 +0100, HAWKER, Dan wrote: > > Hi All, > > Have an EMC SAN unit on the way. I plan to use it as the central store for a > couple of servers setup as a cluster, using GFS. As the SAN unit can handle > all of its own Logical Volume management natively, I presume I don't have to > use/implement CLVMD and hence can cut one layer of complexity in the disk > structure away. > Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its > configuration??? You don't need CLVM if you intend to use the internal array tools, but it's a "nice to have" thing. After all, we've had GFS (and simple failover, for that matter) for a few years -- while CLVM is a relatively new technology. Some SANs can do this internally too, of course. For example, if you had CLVM and you add another array, I'm pretty sure you could use CLVM to extend an existing logical volume on to the second array while the cluster is running. -- Lon From lists at brimer.org Tue Sep 12 14:04:38 2006 From: lists at brimer.org (Barry Brimer) Date: Tue, 12 Sep 2006 09:04:38 -0500 (CDT) Subject: [Linux-cluster] CLVMD - Do I need it??? In-Reply-To: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> References: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> Message-ID: > Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its > configuration??? My understanding is that you will continue to need clvmd. From peter.huesser at psi.ch Tue Sep 12 14:20:30 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Tue, 12 Sep 2006 16:20:30 +0200 Subject: [Linux-cluster] Immediate shutdown after getting quorate oftwonode cluster In-Reply-To: <45067FF6.1010605@redhat.com> Message-ID: <8E2924888511274B95014C2DD906E58AD19F7C@MAILBOX0A.psi.ch> > > > That's a kernel/userspace mismatch > > Upgrade the cman user tools. > Thanks' and sorry if you had to repeat some stuff again. In fact I had the newest versions installed. What I did now was to recompile all the packages for the clustersuite and install these packages. After this the "quorated" problem was solved. One node was now a member of the cluster but I still could not get the other one to be a member. After reboot of both systems both nodes were clustermembers so this works now (I got another message from Jari who told me that something could be wrong with my fence domain and I should reboot it). At the moment it looks much better than this morning. Something with my fencing is not correctly set up and services are not correctly working. 
Maybe I have to contact the mailing list later but for the moment thanks' to all who gave an answer. Pedro From danwest at comcast.net Tue Sep 12 16:12:30 2006 From: danwest at comcast.net (danwest at comcast.net) Date: Tue, 12 Sep 2006 16:12:30 +0000 Subject: [Linux-cluster] qdiskd not properly failing nodes?? Message-ID: <091220061612.9370.4506DC6E0006D5FF0000249A22007481849B9C0A99020E0B@comcast.net> Below is the qdisk configuration for a simple 2 node cluster with a webserver services. The service is configured with 3 heuristics below. # cat /tmp/qdisk_status Node ID: 1 Score (current / min req. / max allowed): 4 / 2 / 4 Current state: Master Current disk state: None Visible Set: { 1 2 } Master Node ID: 1 Quorate Set: { 1 2 } Causing the last 2 heuristics to fail causes the score to fall below ? and in theory should reboot the node. So far I get confirmation in /var/log/messages but no actual reboot ( See below ). The service (webserver) also remains on the node that dropped below ?. # cat /tmp/qdisk_status Node ID: 1 Score (current / min req. / max allowed): 1 / 2 / 4 Current state: None Current disk state: None Visible Set: { 1 2 } Master Node ID: 2 Quorate Set: { } /var/log/messages Sep 12 11:34:02 SERVER1 qdiskd[7495]: Score insufficient for master operation (1/2; max=4); downgrading Sep 12 11:34:04 SERVER1 qdiskd[7495]: Node 2 is the master Sep 12 11:34:02 SERVER2 qdiskd[9780]: Node 1 shutdown Sep 12 11:34:02 SERVER2 qdiskd[9780]: Making bid for master Sep 12 11:34:03 SERVER2 qdiskd[9780]: Assuming master role Any idea why the server is not getting rebooted/fenced? Thanks, Dan From jbrassow at redhat.com Tue Sep 12 17:41:11 2006 From: jbrassow at redhat.com (Jonathan Brassow) Date: Tue, 12 Sep 2006 12:41:11 -0500 Subject: [Linux-cluster] CLVMD - Do I need it??? In-Reply-To: <1158069859.3610.437.camel@rei.boston.devel.redhat.com> References: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> <1158069859.3610.437.camel@rei.boston.devel.redhat.com> Message-ID: <1158082871.988.4.camel@hydrogen.msp.redhat.com> On Tue, 2006-09-12 at 10:04 -0400, Lon Hohberger wrote: > On Tue, 2006-09-12 at 09:25 +0100, HAWKER, Dan wrote: > > > > Hi All, > > > > Have an EMC SAN unit on the way. I plan to use it as the central store for a > > couple of servers setup as a cluster, using GFS. As the SAN unit can handle > > all of its own Logical Volume management natively, I presume I don't have to > > use/implement CLVMD and hence can cut one layer of complexity in the disk > > structure away. > > > Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its > > configuration??? > > You don't need CLVM if you intend to use the internal array tools, but > it's a "nice to have" thing. After all, we've had GFS (and simple > failover, for that matter) for a few years -- while CLVM is a relatively > new technology. Some SANs can do this internally too, of course. > > For example, if you had CLVM and you add another array, I'm pretty sure > you could use CLVM to extend an existing logical volume on to the second > array while the cluster is running. I think one of the big things is naming - ensuring that the device name is always the same on all nodes in the cluster - regardless of any devices added/changed/removed. If you can do that, in addition to storage management, then there is probably no need to involve LVM (cluster or not). If you plan to use LVM on top of the storage device, then you must use clvmd. 
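For anyone wondering what "you must use clvmd" means in practice, a minimal sketch for RHEL4-era lvm2-cluster follows; the volume group name is a placeholder and the exact service names may differ elsewhere:

    # /etc/lvm/lvm.conf on every node: switch LVM to cluster-wide locking
    locking_type = 3

    # start the cluster LVM daemon once cman and fenced are up
    service clvmd start

    # mark an existing volume group as clustered (vg_shared is a placeholder)
    vgchange -c y vg_shared

With that in place, an lvcreate or lvextend issued on one node is propagated to the others, which is what keeps the device names consistent across the cluster.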
brassow From DylanV at semaphore.com Wed Sep 13 04:44:27 2006 From: DylanV at semaphore.com (Dylan Vanderhoof) Date: Tue, 12 Sep 2006 21:44:27 -0700 Subject: [Linux-cluster] Some newbie questions Message-ID: I'm getting ready to start using GFS for a project at my company and believe I have a sane migration path, but I wanted to ask for a sanity check from people who are using it first. The eventual architecture will be multiple iSCSI targets as part of a single GFS filesystem using CLVM, primarily so adding more disk is fairly seamless, and if I understand correctly, can be done without any downtime. (Is there downtime required for the fs grow step?) This also will allow multipath io for some extra redundancy in the future. (Obviously, the iSCSI targets are SPOFs, but that's unavoidable). This points me towards using DLM, of course, but in the initial install I only have a single node and will be adding other nodes in the fairly near future. Can I transition from nolock to using DLM? I would assume so, but I haven't seen anything indicating how that would be done. Other than those couple questions, I believe everything to be fairly straightforward. Looking forward to trying GFS out! Thanks, Dylan Vanderhoof Sr. Software Developer Semaphore Corporation From jos at xos.nl Wed Sep 13 06:34:52 2006 From: jos at xos.nl (Jos Vos) Date: Wed, 13 Sep 2006 08:34:52 +0200 Subject: [Linux-cluster] Some newbie questions In-Reply-To: ; from DylanV@semaphore.com on Tue, Sep 12, 2006 at 09:44:27PM -0700 References: Message-ID: <20060913083452.B14844@xos037.xos.nl> On Tue, Sep 12, 2006 at 09:44:27PM -0700, Dylan Vanderhoof wrote: > This points me towards using DLM, of course, but in the initial install > I only have a single node and will be adding other nodes in the fairly > near future. Can I transition from nolock to using DLM? I would assume > so, but I haven't seen anything indicating how that would be done. Yes, this can be done (on an unmounted fs) using: gfs_tool sb proto lock_dlm Note that you better should add enough journals to the filesystem when creating it. You can add journals later, but only if there is (enough) space left on the device after the filesystem, which is normally not the case (if your filesystem occupies the whole device). -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From DylanV at semaphore.com Wed Sep 13 06:57:04 2006 From: DylanV at semaphore.com (Dylan Vanderhoof) Date: Tue, 12 Sep 2006 23:57:04 -0700 Subject: [Linux-cluster] Some newbie questions Message-ID: > -----Original Message----- > From: Jos Vos [mailto:jos at xos.nl] > Sent: Tuesday, September 12, 2006 11:35 PM > To: linux clustering > Subject: Re: [Linux-cluster] Some newbie questions > > > Yes, this can be done (on an unmounted fs) using: > > gfs_tool sb proto lock_dlm > > Note that you better should add enough journals to the filesystem > when creating it. You can add journals later, but only if there > is (enough) space left on the device after the filesystem, which > is normally not the case (if your filesystem occupies the whole > device). Interesting. I hadn't considered that. Is there a document somewhere that shows how large a journal is? Or rather, is there a cost to adding more than I will likely need to be safe? If the fs is grown onto additional iSCSI targets, can journals be added at that point as well utilizing the additional space on those devices? 
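A hedged sketch of the operations being discussed; the device names and mount points are placeholders, and the journal figures are from memory, so check gfs_mkfs(8), gfs_jadd(8) and gfs_grow(8) before relying on them:

    # switch an unmounted filesystem from lock_nolock to lock_dlm
    gfs_tool sb /dev/vg0/gfs0 proto lock_dlm

    # at creation time: one journal per node that will mount the filesystem;
    # journals default to roughly 128 MB each (-J changes the size)
    gfs_mkfs -p lock_dlm -t mycluster:gfs0 -j 4 /dev/vg0/gfs0

    # after extending the underlying volume, add journals first, then grow,
    # because gfs_jadd needs free space that gfs_grow would otherwise consume
    gfs_jadd -j 2 /mnt/gfs0
    gfs_grow /mnt/gfs0

The cost of extra journals is essentially the disk space they occupy, so over-provisioning a few at mkfs time is usually cheaper than trying to add them later.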
Thanks, Dylan From Alain.Moulle at bull.net Wed Sep 13 07:51:31 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Wed, 13 Sep 2006 09:51:31 +0200 Subject: [Linux-cluster] CS4 Update 2 & Patch watchdog on Message-ID: <4507B883.8060400@bull.net> >> The self-watchdog patch adds a process which monitors the "real" >> clurgmgrd. The monitoring process should be the lower-numbered PID >> (it's the parent of the one doing the work). >> The monitoring process watches for crash signals (SIGBUS, SIGSEGV, >> etc.), and will simply exit if you kill the child with SIGKILL. >> So, basically, killing the higher-numbered PID with something like >> SIGSEGV should cause the node to reboot. >> -- Lon Thanks Lon, I understand. And if I kill -9 (SIGKILL) the higher-numbered PID at test purpose, is it expected to reboot or not ? I see in code : case SIGCHLD: case SIGILL: case SIGFPE: case SIGSEGV: case SIGBUS: setup_signal(i, SIG_DFL); break; default: setup_signal(i, signal_handler); but can't conclude for a SIGKILL on higher-numbered PID process ... Thanks again Alain Moull? From dan.hawker at astrium.eads.net Wed Sep 13 08:25:08 2006 From: dan.hawker at astrium.eads.net (HAWKER, Dan) Date: Wed, 13 Sep 2006 09:25:08 +0100 Subject: [Linux-cluster] CLVMD - Do I need it??? Message-ID: <7F6B06837A5DBD49AC6E1650EFF5490601223032@auk52177.ukr.astrium.corp> > > Have an EMC SAN unit on the way. I plan to use it as the central store for a > > couple of servers setup as a cluster, using GFS. As the SAN unit can handle > > all of its own Logical Volume management natively, I presume I don't have to > > use/implement CLVMD and hence can cut one layer of complexity in the disk > > structure away. > > > Am I correct in this assumption, or does GFS/RHCS need to use CLVMD in its > > configuration??? > > You don't need CLVM if you intend to use the internal array tools, but > it's a "nice to have" thing. After all, we've had GFS (and simple > failover, for that matter) for a few years -- while CLVM is a relatively > new technology. Some SANs can do this internally too, of course. > > For example, if you had CLVM and you add another array, I'm pretty sure > you could use CLVM to extend an existing logical volume on to the second > array while the cluster is running. >I think one of the big things is naming - ensuring that the device name >is always the same on all nodes in the cluster - regardless of any >devices added/changed/removed. If you can do that, in addition to >storage management, then there is probably no need to involve LVM >(cluster or not). If you plan to use LVM on top of the storage device, >then you must use clvmd. > > brassow Thanks for the replies. So, the decision is purely a matter of policy rather than any technical reasons. Didn't think of the possibility of extending the cluster storage by utilising CLVM. Makes sense, nice feature, that may make me use CLVM anyway. Guess I'll have a think and make a decision. Thanks again Dan -- Dan Hawker Linux System Administrator Astrium -- This email is for the intended addressee only. If you have received it in error then you must not use, retain, disseminate or otherwise deal with it. Please notify the sender by return email. The views of the author may not necessarily constitute the views of Astrium Limited. Nothing in this email shall bind Astrium Limited in any contract or obligation. Astrium Limited, Registered in England and Wales No. 
2449259 Registered Office: Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England From lhh at redhat.com Wed Sep 13 14:07:42 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 10:07:42 -0400 Subject: [Linux-cluster] qdiskd not properly failing nodes?? In-Reply-To: <091220061612.9370.4506DC6E0006D5FF0000249A22007481849B9C0A99020E0B@comcast.net> References: <091220061612.9370.4506DC6E0006D5FF0000249A22007481849B9C0A99020E0B@comcast.net> Message-ID: <1158156462.11241.5.camel@rei.boston.devel.redhat.com> On Tue, 2006-09-12 at 16:12 +0000, danwest at comcast.net wrote: > Any idea why the server is not getting rebooted/fenced? Did you start fenced ? Qdisk doesn't handle fencing; it still relies on CMAN to handle the fencing bit. -- Lon From lhh at redhat.com Wed Sep 13 14:18:10 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 10:18:10 -0400 Subject: [Linux-cluster] CS4 Update 2 & Patch watchdog on In-Reply-To: <4507B883.8060400@bull.net> References: <4507B883.8060400@bull.net> Message-ID: <1158157090.11241.8.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 09:51 +0200, Alain Moulle wrote: > >> The self-watchdog patch adds a process which monitors the "real" > >> clurgmgrd. The monitoring process should be the lower-numbered PID > >> (it's the parent of the one doing the work). > > >> The monitoring process watches for crash signals (SIGBUS, SIGSEGV, > >> etc.), and will simply exit if you kill the child with SIGKILL. > > >> So, basically, killing the higher-numbered PID with something like > >> SIGSEGV should cause the node to reboot. > > >> -- Lon > > Thanks Lon, I understand. > And if I kill -9 (SIGKILL) the higher-numbered PID at test purpose, > is it expected to reboot or not ? > > I see in code : > case SIGCHLD: > case SIGILL: > case SIGFPE: > case SIGSEGV: > case SIGBUS: > setup_signal(i, SIG_DFL); > break; > default: > setup_signal(i, signal_handler); > but can't conclude for a SIGKILL on higher-numbered PID process ... No, sigkill will just cause the watchdog to commit suicide: if (waitpid(child, &status, 0) <= 0) continue; if (WIFEXITED(status)) exit(WEXITSTATUS(status)); if (WIFSIGNALED(status)) { if (WTERMSIG(status) == SIGKILL) { clulog(LOG_CRIT, "Watchdog: Daemon killed, exiting\n"); raise(SIGKILL); Use something like SIGSEGV (e.g. to simulate a crash) and the nanny/watchdog process should reboot the node. -- Lon From lhh at redhat.com Wed Sep 13 14:22:30 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 10:22:30 -0400 Subject: [Linux-cluster] qdiskd not properly failing nodes?? In-Reply-To: <1158156462.11241.5.camel@rei.boston.devel.redhat.com> References: <091220061612.9370.4506DC6E0006D5FF0000249A22007481849B9C0A99020E0B@comcast.net> <1158156462.11241.5.camel@rei.boston.devel.redhat.com> Message-ID: <1158157350.11241.14.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 10:07 -0400, Lon Hohberger wrote: > On Tue, 2006-09-12 at 16:12 +0000, danwest at comcast.net wrote: > > > Any idea why the server is not getting rebooted/fenced? > > Did you start fenced ? Qdisk doesn't handle fencing; it still relies on > CMAN to handle the fencing bit. If you want, I could add something to cause the node to reboot itself on the down-transition when it detects its score is insufficient to continue as part of the master partition. 
E.g., right here: Sep 12 11:34:02 SERVER1 qdiskd[7495]: Score insufficient for master operation (1/2; max=4); downgrading [ reboot(RB_AUTOBOOT); /* if new configuration thing is set */ ] -- Lon From isplist at logicore.net Wed Sep 13 14:40:12 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 09:40:12 -0500 Subject: [Linux-cluster] Can't mount multiple GFS volumes? Message-ID: <200691394012.985163@leena> I have a need for non contiguous storage and wish to mount multiple GFS logical volumes. However, I cannot seem to get past this following error and others related. -Command # mount -t gfs /dev/vgcomp/str1 /lvstr1 mount: File exists [root at dev new]# -Error Log Sep 12 16:22:22 dev kernel: GFS: Trying to join cluster "lock_dlm", "vgcomp:gfscomp" Sep 12 16:22:22 dev kernel: dlm: gfscomp: lockspace already in use Sep 12 16:22:22 dev kernel: lock_dlm: new lockspace error -17 Sep 12 16:22:22 dev kernel: GFS: can't mount proto = lock_dlm, table = vgcomp:gfscomp, hostdata = Sep 12 16:22:23 dev hald[2168]: Timed out waiting for hotplug event 395. Rebasing to 396 There are two physical drives attached to a FC network. I would like to have access to each on their own, not as part of a single volume group of storage. Anyone have some ideas, things I can try, to start getting closer to something that works? I've tried all I can think of. Running RHEL4 with all latest updates. Let me know what info you need and I'll be happy to provide it of course. Thank you. Mike From isplist at logicore.net Wed Sep 13 14:44:18 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 09:44:18 -0500 Subject: [Linux-cluster] Cluster.conf documentation? Message-ID: <200691394418.802318@leena> I've looked but cannot seem to find good documentation on the cluster.conf file itself. Is there documentation somewhere which clearly talks about only the cluster.conf options, how to best build the file, available options, etc. Thanks. Mike From isplist at logicore.net Wed Sep 13 14:45:58 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 09:45:58 -0500 Subject: [Linux-cluster] Fencing using brocade Message-ID: <200691394558.252508@leena> I want to use my brocade switch as the fencing device for my cluster. I cannot find any documentation showing what I need to set up on the brocade itself and within the cluster.conf file as well to make this work. My cluster works fine... until a node dies of course or other problems come up. Thanks in advance for any help. Mike From jparsons at redhat.com Wed Sep 13 14:51:00 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 13 Sep 2006 10:51:00 -0400 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <200691394558.252508@leena> References: <200691394558.252508@leena> Message-ID: <45081AD4.5050801@redhat.com> isplist at logicore.net wrote: >I want to use my brocade switch as the fencing device for my cluster. I cannot >find any documentation showing what I need to set up on the brocade itself and >within the cluster.conf file as well to make this work. > >My cluster works fine... until a node dies of course or other problems come >up. > >Thanks in advance for any help. > >Mike > > The system-config-cluster application supports brocade fencing. It is a two part process - first you define the switch as a fence device; type brocade, then you select a node an click "Manage fencing for this node" and declare a fence instance. 
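For the command-line case, the relevant conf fragments usually look something like the sketch below; the device name, address, credentials and port number are placeholders, and each node must be cabled to its own switch port for this to isolate only that node (see fence_brocade(8)):

    <fencedevices>
        <fencedevice agent="fence_brocade" ipaddr="10.0.0.5" login="admin"
                     name="brocade1" passwd="secret"/>
    </fencedevices>

    <clusternode name="node1.example.com" nodeid="1" votes="1">
        <fence>
            <method name="1">
                <device name="brocade1" port="4"/>
            </method>
        </fence>
    </clusternode>

fence_brocade simply logs into the switch and disables the named port, so the port attribute has to be the physical switch port that the node's HBA (not a shared hub uplink) is plugged into.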
-J From isplist at logicore.net Wed Sep 13 15:00:54 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 10:00:54 -0500 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <45081AD4.5050801@redhat.com> Message-ID: <200691310054.237706@leena> >> I want to use my brocade switch as the fencing device for my cluster. I >> cannot find any documentation showing what I need to set up on the brocade >> itself and within the cluster.conf file as well to make this work. > The system-config-cluster application supports brocade fencing. It is a > two part process - first you define the switch as a fence device; type > brocade, then you select a node an click "Manage fencing for this node" > and declare a fence instance. Ah, I'm at the command line :). So, there is nothing I need to do on the brocade itself then? The cluster ports aren't connected directly, they are connected into a compaq hub, then the hub is connected into the brocade. The brocade seems to know about the external ports however since they are listed when I look on the switch. As for the conf file, I've not found enough information on how to build a good conf file so know this one is probably not even complete. Been working on other parts of the problems then wanting to get to this. From jparsons at redhat.com Wed Sep 13 15:08:07 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 13 Sep 2006 11:08:07 -0400 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <200691310054.237706@leena> References: <200691310054.237706@leena> Message-ID: <45081ED7.9060505@redhat.com> isplist at logicore.net wrote: >>>I want to use my brocade switch as the fencing device for my cluster. I >>>cannot find any documentation showing what I need to set up on the brocade >>>itself and within the cluster.conf file as well to make this work. >>> >>> > > > >>The system-config-cluster application supports brocade fencing. It is a >>two part process - first you define the switch as a fence device; type >>brocade, then you select a node an click "Manage fencing for this node" >>and declare a fence instance. >> >> > >Ah, I'm at the command line :). > >So, there is nothing I need to do on the brocade itself then? The cluster >ports aren't connected directly, they are connected into a compaq hub, then >the hub is connected into the brocade. The brocade seems to know about the >external ports however since they are listed when I look on the switch. > >As for the conf file, I've not found enough information on how to build a good >conf file so know this one is probably not even complete. Been working on >other parts of the problems then wanting to get to this. > Why build it yourself? The app will do it for you, and not make a typo that could cost you valuable time. If you don't have X running on your nodes, no problem - just install the s-c-cluster app anywhere...it will let you configure a cluster and then save out the conf file which you can propogate to the cluster yourself, if you want. -J > > > > > > > > > > > > > > > > > name="brocade" passwd="xxx"/> > > > > > > > > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > From cjk at techma.com Wed Sep 13 15:26:22 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Wed, 13 Sep 2006 11:26:22 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... Message-ID: Good morning.. 
Some oddness regarding clusterring on RHEL5beta1 (could be me) I have a two node cluster and the cluster components installed. I have two nics in each node, the second of which I want to use for openais. I have my cluster.conf pointing to the primary nic and I have openais pointing to 192.168.0.0 (my second nics are on 192.168.0.1 and 2) Things seem to start ok on both nodes but they don't appear to be talking to eachother. For instance, clustat on the first node shows both nodes active even if node2 is down. Actually, openais seems to be doing fine, but cman looks to be acting up. This config was created using s-c-cluster and indeed it looks good. Am I missing some new fundemental thing with the new cluster versions? I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but this is my first attempt at the new (openais based) clusterring. Any thoughts? Corey -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Wed Sep 13 15:28:34 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 11:28:34 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... In-Reply-To: References: Message-ID: <1158161314.11241.16.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 11:26 -0400, Kovacs, Corey J. wrote: > Good morning.. > > Some oddness regarding clusterring on RHEL5beta1 (could be me) > > I have a two node cluster and the cluster components installed. > I have two nics in each node, the second of which I want to use for > openais. > > I have my cluster.conf pointing to the primary nic and I have openais > pointing to > 192.168.0.0 (my second nics are on 192.168.0.1 and 2) > > Things seem to start ok on both nodes but they don't appear to be > talking to eachother. > For instance, clustat on the first node shows both nodes active even > if node2 is down. > > Actually, openais seems to be doing fine, but cman looks to be acting > up. > > This config was created using s-c-cluster and indeed it looks good. Am > I missing some > new fundemental thing with the new cluster versions? > > I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but > this is my first > attempt at the new (openais based) clusterring. > > > Any thoughts? For a start, you can always try cman_tool status / cman_tool nodes, just to take clustat out of the picture. -- Lon From cjk at techma.com Wed Sep 13 15:39:38 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Wed, 13 Sep 2006 11:39:38 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_RHEL5_cluster_problem...?= In-Reply-To: <1158161314.11241.16.camel@rei.boston.devel.redhat.com> Message-ID: Ok, that looks better. Both nodes show up as being memebers using cman_tool status and cman_tool nodes. Also, seems I forgot to start rgmanager. Once I started it, the "test" service I configured started up. stopping rgmanager, openais, cman in that order on the second node, caused node1 to fence node2. clustat still doesn't report correct status for me, but at least I am getting some status back. Thanks Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Wednesday, September 13, 2006 11:29 AM To: linux clustering Subject: Re: [Linux-cluster] RHEL5 cluster problem... On Wed, 2006-09-13 at 11:26 -0400, Kovacs, Corey J. wrote: > Good morning.. > > Some oddness regarding clusterring on RHEL5beta1 (could be me) > > I have a two node cluster and the cluster components installed. 
> I have two nics in each node, the second of which I want to use for > openais. > > I have my cluster.conf pointing to the primary nic and I have openais > pointing to 192.168.0.0 (my second nics are on 192.168.0.1 and 2) > > Things seem to start ok on both nodes but they don't appear to be > talking to eachother. > For instance, clustat on the first node shows both nodes active even > if node2 is down. > > Actually, openais seems to be doing fine, but cman looks to be acting > up. > > This config was created using s-c-cluster and indeed it looks good. Am > I missing some new fundemental thing with the new cluster versions? > > I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but > this is my first attempt at the new (openais based) clusterring. > > > Any thoughts? For a start, you can always try cman_tool status / cman_tool nodes, just to take clustat out of the picture. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From frank at opticalart.de Wed Sep 13 15:42:39 2006 From: frank at opticalart.de (Frank Hellmann) Date: Wed, 13 Sep 2006 17:42:39 +0200 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <200691310054.237706@leena> References: <200691310054.237706@leena> Message-ID: <450826EF.90203@opticalart.de> Hi! I can only recommend the system-config-cluster GUI, but if you feel brave enough you can do it by hand This example is for a sanbox2, but it should get you going: ... .... ... And don't forget to check the fence_brocade manpage for your brocade switch for further options... Cheers, Frank... isplist at logicore.net wrote: >>> I want to use my brocade switch as the fencing device for my cluster. I >>> cannot find any documentation showing what I need to set up on the brocade >>> itself and within the cluster.conf file as well to make this work. >>> > > >> The system-config-cluster application supports brocade fencing. It is a >> two part process - first you define the switch as a fence device; type >> brocade, then you select a node an click "Manage fencing for this node" >> and declare a fence instance. >> > > Ah, I'm at the command line :). > > So, there is nothing I need to do on the brocade itself then? The cluster > ports aren't connected directly, they are connected into a compaq hub, then > the hub is connected into the brocade. The brocade seems to know about the > external ports however since they are listed when I look on the switch. > > As for the conf file, I've not found enough information on how to build a good > conf file so know this one is probably not even complete. Been working on > other parts of the problems then wanting to get to this. > > > > > > > > > > > > > > > > > name="brocade" passwd="xxx"/> > > > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- -------------------------------------------------------------------------- Frank Hellmann Optical Art GmbH Waterloohain 7a DI Supervisor http://www.opticalart.de 22769 Hamburg frank at opticalart.de Tel: ++49 40 5111051 Fax: ++49 40 43169199 -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Wed Sep 13 16:10:35 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 12:10:35 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... 
In-Reply-To: References: Message-ID: <1158163835.11241.20.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 11:39 -0400, Kovacs, Corey J. wrote: > Ok, that looks better. Both nodes show up as being memebers > using cman_tool status and cman_tool nodes. Also, seems I > forgot to start rgmanager. Once I started it, the "test" > service I configured started up. stopping rgmanager, openais, cman > in that order on the second node, caused node1 to fence node2. rgmanager caused a node to get fenced? :o I know there have been some pretty big rgmanager bugs fixed since B1 freeze, but that one is news to me. Let me see if there are any newer rgmanager packages available. -- Lon From jparsons at redhat.com Wed Sep 13 16:14:13 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 13 Sep 2006 12:14:13 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... In-Reply-To: References: Message-ID: <45082E55.1080506@redhat.com> Kovacs, Corey J. wrote: > Good morning.. > > Some oddness regarding clusterring on RHEL5beta1 (could be me) > > I have a two node cluster and the cluster components installed. > I have two nics in each node, the second of which I want to use for > openais. > > I have my cluster.conf pointing to the primary nic and I have openais > pointing to > 192.168.0.0 (my second nics are on 192.168.0.1 and 2) > > Things seem to start ok on both nodes but they don't appear to be > talking to eachother. > For instance, clustat on the first node shows both nodes active even > if node2 is down. > > Actually, openais seems to be doing fine, but cman looks to be acting up. > > This config was created using s-c-cluster and indeed it looks good. Am > I missing some > new fundemental thing with the new cluster versions? > > I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but > this is my first > attempt at the new (openais based) clusterring. > > > Any thoughts? > EEk. s-c-cluster is NOT updated completely for rhel5 cluster in the beta 1 release - Sorry, Corey. Are you aware that each node needs an explicit 'nodeid' attribute value in the conf file, in addition to the name attribute? This just needs to be a unique integer value...a simple enumeration of the nodes, 1, 2, 3... -J From cjk at techma.com Wed Sep 13 16:31:02 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Wed, 13 Sep 2006 12:31:02 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_RHEL5_cluster_problem...?= In-Reply-To: <45082E55.1080506@redhat.com> Message-ID: James, didn't know that s-c-cluster wasn't updated, good to know, not a problem as I like the command line anyway :) I did know about the nodeid so that's good to go. Thanks for the heads up on s-c-cluster Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of James Parsons Sent: Wednesday, September 13, 2006 12:14 PM To: linux clustering Subject: Re: [Linux-cluster] RHEL5 cluster problem... Kovacs, Corey J. wrote: > Good morning.. > > Some oddness regarding clusterring on RHEL5beta1 (could be me) > > I have a two node cluster and the cluster components installed. > I have two nics in each node, the second of which I want to use for > openais. > > I have my cluster.conf pointing to the primary nic and I have openais > pointing to 192.168.0.0 (my second nics are on 192.168.0.1 and 2) > > Things seem to start ok on both nodes but they don't appear to be > talking to eachother. > For instance, clustat on the first node shows both nodes active even > if node2 is down. 
> > Actually, openais seems to be doing fine, but cman looks to be acting up. > > This config was created using s-c-cluster and indeed it looks good. Am > I missing some new fundemental thing with the new cluster versions? > > I've been running RHCS/GFS on RHEL3 and RHEL4 for some time now but > this is my first attempt at the new (openais based) clusterring. > > > Any thoughts? > EEk. s-c-cluster is NOT updated completely for rhel5 cluster in the beta 1 release - Sorry, Corey. Are you aware that each node needs an explicit 'nodeid' attribute value in the conf file, in addition to the name attribute? This just needs to be a unique integer value...a simple enumeration of the nodes, 1, 2, 3... -J -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From cjk at techma.com Wed Sep 13 16:34:25 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Wed, 13 Sep 2006 12:34:25 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_RHEL5_cluster_problem...?= In-Reply-To: <1158163835.11241.20.camel@rei.boston.devel.redhat.com> Message-ID: Lon, no I don't believe rgmanager is the culprit rather it happened when rgmanager was not running. I went shutdown cman on node2 so in the beginning of my playing, I stopped openais, then cman (rgmamanager came later) and the node was fenced. I did this in the simplest way of course... service openais stop service cman stop anyway, I'll keep an eye on it. Thanks for your suggestions/help Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Wednesday, September 13, 2006 12:11 PM To: linux clustering Subject: RE: [Linux-cluster] RHEL5 cluster problem... On Wed, 2006-09-13 at 11:39 -0400, Kovacs, Corey J. wrote: > Ok, that looks better. Both nodes show up as being memebers using > cman_tool status and cman_tool nodes. Also, seems I forgot to start > rgmanager. Once I started it, the "test" > service I configured started up. stopping rgmanager, openais, cman in > that order on the second node, caused node1 to fence node2. rgmanager caused a node to get fenced? :o I know there have been some pretty big rgmanager bugs fixed since B1 freeze, but that one is news to me. Let me see if there are any newer rgmanager packages available. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Wed Sep 13 16:56:57 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 12:56:57 -0400 Subject: [Linux-cluster] RHEL5 cluster problem... In-Reply-To: References: Message-ID: <1158166617.11241.22.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 12:34 -0400, Kovacs, Corey J. wrote: > Lon, no I don't believe rgmanager is the culprit rather it happened when > rgmanager was not running. I went shutdown cman on node2 so in the beginning > of my playing, I stopped openais, then cman (rgmamanager came later) and the > node was fenced. I did this in the simplest way of course... > > service openais stop > service cman stop > > anyway, I'll keep an eye on it. Yeah, I read that last message wrong. -- Lon From lhh at redhat.com Wed Sep 13 21:46:30 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 17:46:30 -0400 Subject: [Linux-cluster] qdiskd not properly failing nodes?? 
In-Reply-To: <2d4e61a8f96d5bf89f1d86611e4712d3@comcast.net> References: <2d4e61a8f96d5bf89f1d86611e4712d3@comcast.net> Message-ID: <1158183990.11241.65.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 15:40 -0400, Andrea Westervelt wrote: > > > ______________________________________________________________________ > > Lon, > > fenced is running and based on the manpage it seems like dropping > below a score of ? should cause a reboot? It currently expects the quorate partition (remember, this node is no longer quorate) to fence the node rather than taking action itself. > I guess I am a little confused on what the heuristics/scoring are > meant to do. Can you explain the role of the master partition and > what the expected outcome of an insufficient score should be? The master node is a node with sufficient score to declare itself online according to the heuristics that you supply in the qdisk configuration. Assuming it maintains its score, it arbitrates what other nodes join the "master" partition. If a node becomes part of the master partition, the node advertises quorum device votes to CMAN. Insufficient scores should cause a node to remove itself from the master partition and tell CMAN that the quorum device is offline. This should cause CMAN on a node in the qdisk master partition to fence the node (assuming that this causes the node to transition from quorate->inquorate). I'm guessing what is happening here in your case is that CMAN is still seeing the node - even though it's inquorate - and it's not fencing it -- is that right? A transition from quorate->inquorate should cause the node to get fenced. That sounds like a bug (pretty easy to fix, too). -- Lon From lhh at redhat.com Wed Sep 13 21:58:59 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 17:58:59 -0400 Subject: [Linux-cluster] qdiskd not properly failing nodes?? In-Reply-To: <1158183990.11241.65.camel@rei.boston.devel.redhat.com> References: <2d4e61a8f96d5bf89f1d86611e4712d3@comcast.net> <1158183990.11241.65.camel@rei.boston.devel.redhat.com> Message-ID: <1158184739.11241.73.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-13 at 17:46 -0400, Lon Hohberger wrote: > I'm guessing what is happening here in your case is that CMAN is still > seeing the node - even though it's inquorate - and it's not fencing it > -- is that right? A transition from quorate->inquorate should cause the > node to get fenced. > > That sounds like a bug (pretty easy to fix, too). The easiest fix is to make it reboot on the S_RUN->S_NONE transition like it says in the man page (but allow a configuration parameter to override it). This would make it work exactly stated, and wouldn't require any changes to your configuration. -- Lon From lhh at redhat.com Wed Sep 13 22:24:43 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 13 Sep 2006 18:24:43 -0400 Subject: [Linux-cluster] [PATCH] reboot flag + score fix In-Reply-To: <1158184739.11241.73.camel@rei.boston.devel.redhat.com> References: <2d4e61a8f96d5bf89f1d86611e4712d3@comcast.net> <1158183990.11241.65.camel@rei.boston.devel.redhat.com> <1158184739.11241.73.camel@rei.boston.devel.redhat.com> Message-ID: <1158186283.11241.82.camel@rei.boston.devel.redhat.com> This implements a reboot flag which must be explicitly disabled. Upon a transition from majority score to less than majority score, a node will reboot unless the reboot flag is explicitly set to 0 in the cluster configuration. This makes qdiskd operate consistently with section 2.2 in the manual page. 
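For reference, the kind of configuration this change applies to is sketched below; the scores, intervals and heuristic programs are placeholders, and the reboot attribute name is an assumption based on the flag this patch introduces, so check qdisk(5) for the final spelling:

    <quorumd interval="1" tko="10" votes="1" label="qdisk" min_score="2" reboot="1">
        <heuristic program="ping -c1 -t1 10.0.0.254" score="2" interval="2"/>
        <heuristic program="/usr/local/bin/check_app.sh" score="1" interval="2"/>
    </quorumd>

With min_score set to a majority of the maximum score, losing enough heuristics drops the node below the threshold, and with the new flag left at its default the node reboots itself on that downward transition instead of waiting to be fenced.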
-- Lon -------------- next part -------------- A non-text attachment was scrubbed... Name: qdisk-transition.patch Type: text/x-patch Size: 2919 bytes Desc: not available URL: From isplist at logicore.net Thu Sep 14 02:17:25 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 21:17:25 -0500 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <450826EF.90203@opticalart.de> Message-ID: <2006913211725.520516@leena> In my case, the nodes are connected to a hub, which is in turn connected to the brocade. Do I just use the brocade's port still? I have not been able to find clear information on building a proper cluster.conf file either so have bits of this and that. This is what I've got... you're sample and the bits and pieces I've been using. Mike On Wed, 13 Sep 2006 17:42:39 +0200, Frank Hellmann wrote: > Hi! > > I can only recommend the system-config-cluster GUI, but if you feel brave > enough you can do it by hand > > This example is for a sanbox2, but it should get you going: > > ... > > > > > > > > > > > > > > > > > .... > > > login="username" name="sanbox" passwd="password"/> > > ... > > And don't forget to check the fence_brocade manpage for your brocade switch > for further options... > > Cheers, > > Frank... > > isplist at logicore.net wrote: > > > I want to use my brocade switch as the > fencing device for my cluster. I cannot find any documentation showing what > I need to set up on the brocade itself and within the cluster.conf file as > well to make this work. > >>> The system-config-cluster application supports brocade fencing. It is a >>> two part process - first you define the switch as a fence device; type >>> brocade, then you select a node an click "Manage fencing for this node" >>> and declare a fence instance. >> Ah, I'm at the command line :). So, there is nothing I need to do on the >> brocade itself then? The cluster ports aren't connected directly, they >> are connected into a compaq hub, then the hub is connected into the >> brocade. The brocade seems to know about the external ports however since >> they are listed when I look on the switch. As for the conf file, I've not >> found enough information on how to build a good conf file so know this >> one is probably not even complete. Been working on other parts of the >> problems then wanting to get to this. 
> config_version="40" name="vgcomp"> > post_fail_delay="0" post_join_delay="3"/> > name="cweb92.companions.com" nodeid="92" votes="1"/> > name="cweb93.companions.com" nodeid="93" votes="1"/> > name="cweb94.companions.com" nodeid="94" votes="1"/> > name="dev.companions.com" nodeid="99" votes="1"/> > name="qm247.companions.com" nodeid="247" votes="1"/> > name="qm248.companions.com" nodeid="248" votes="1"/> > name="qm249.companions.com" nodeid="249" votes="1"/> > name="qm250.companions.com" nodeid="250" votes="1"/> >> > ipaddr="x.x.x.x" login="xxx" name="brocade" passwd="xxx"/> >> -- >> Linux-cluster mailing list Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- ------------------------------------------------------------------------- > - Frank Hellmann Optical Art GmbH Waterloohain 7a DI Supervisor > http://www.opticalart.de 22769 Hamburg frank at opticalart.de Tel: ++49 40 > 5111051 Fax: ++49 40 43169199 From isplist at logicore.net Thu Sep 14 02:34:12 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 21:34:12 -0500 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <450826EF.90203@opticalart.de> Message-ID: <2006913213412.698710@leena> Anyone have any thoughts on this config? Make sense, not, needs work? Can do the job but not the best? Etc. Thanks. The nodes are all connected into a compaq FC hub. That hub is then connected into a brocade switch. I'd like to use the brocade switch as the fencing device. From eric at bootseg.com Thu Sep 14 02:48:28 2006 From: eric at bootseg.com (Eric Kerin) Date: Wed, 13 Sep 2006 22:48:28 -0400 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <2006913213412.698710@leena> References: <2006913213412.698710@leena> Message-ID: <1158202108.2411.4.camel@mechanism.localnet> On Wed, 2006-09-13 at 21:34 -0500, isplist at logicore.net wrote: > The nodes are all connected into a compaq FC hub. That hub is then connected > into a brocade switch. I'd like to use the brocade switch as the fencing > device. > Sadly, that won't work. The fence script for brocade instructs it to turn off a specified port. All your machines hook up to the switch through a single port. So when a node acts up, you disconnect ALL of your nodes nodes from the storage at the same time, since they all are connected to port 0 (through the hub). For SAN fabric fencing to work, each server needs to be connected to the brocade switch on it's own switch port. Thanks, Eric From isplist at logicore.net Thu Sep 14 02:50:49 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 21:50:49 -0500 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <1158202108.2411.4.camel@mechanism.localnet> Message-ID: <2006913215049.413248@leena> > For SAN fabric fencing to work, each server needs to be connected to the > brocade switch on it's own switch port. Darn, thought that would be the case :). Well, I'm looking at a large McData switch and from what I've seen, those are also supported so, guess that's the next way to go. Thanks! Mike From isplist at logicore.net Thu Sep 14 03:19:35 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 13 Sep 2006 22:19:35 -0500 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <1158202108.2411.4.camel@mechanism.localnet> Message-ID: <2006913221935.494886@leena> > Sadly, that won't work. The fence script for brocade instructs it to > turn off a specified port. 
All your machines hook up to the switch > through a single port. So when a node acts up, you disconnect ALL of > your nodes nodes from the storage at the same time, since they all are > connected to port 0 (through the hub). Since hubs are much cheaper than switches, and from the brocade's point of view, it can see unique ports even on the hub... would it not be worth adding this functionality to the fencing functions? Mike From erling.nygaard at gmail.com Thu Sep 14 06:59:59 2006 From: erling.nygaard at gmail.com (Erling Nygaard) Date: Thu, 14 Sep 2006 08:59:59 +0200 Subject: [Linux-cluster] Fencing using brocade In-Reply-To: <2006913211725.520516@leena> References: <450826EF.90203@opticalart.de> <2006913211725.520516@leena> Message-ID: Mike If I understand your description correctly, you have all your nodes connected into a FC hub. This hub is then connected to one port of the Brocade FC switch. So all the nodes are on a single public Arbitrated Loop. I assume that all the FC-connected storage is on another port on the Brocade? I can see one potential problem with this setup. If the fencing is done by disabling the port on the Brocade the entire loop will be disconnected from the switch. So instead of fencing one node the entire loop (containing all nodes) will be fenced. (Cut off from the storage) Only way I can see this work is to configure the fencing work with the wwnn/wwpn of the nodes instead of the port on the Brocade. Instead of having a fencing operation block all traffic on a given Brocade port you need to have the Brocade block traffic to a given wwnn/wwpn (the wwnn/wwpn of the FC-HBA of the node to be fenced) I have not played with such a setup for a number of years, so I can't really tell you how this should be done. And of course, if you have the storage connected to the same FC-hub, this won't work at all. In that case the traffic between the storage and the nodes would not be controlled by the Brocade at all... This should at least point out a potential problem :-) Erling On 9/14/06, isplist at logicore.net wrote: > In my case, the nodes are connected to a hub, which is in turn connected to > the brocade. Do I just use the brocade's port still? > > I have not been able to find clear information on building a proper > cluster.conf file either so have bits of this and that. > > This is what I've got... you're sample and the bits and pieces I've been > using. > > > > > > > > > > > > > > > > > name="brocade" passwd="xxx"/> > > > > > > > > Mike > > > On Wed, 13 Sep 2006 17:42:39 +0200, Frank Hellmann wrote: > > Hi! > > > > I can only recommend the system-config-cluster GUI, but if you feel brave > > enough you can do it by hand > > > > This example is for a sanbox2, but it should get you going: > > > > ... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > login="username" name="sanbox" passwd="password"/> > > > > ... > > > > And don't forget to check the fence_brocade manpage for your brocade switch > > for further options... > > > > Cheers, > > > > Frank... > > > > isplist at logicore.net wrote: > > > I want to use my brocade switch as the > > fencing device for my cluster. I cannot find any documentation showing what > > I need to set up on the brocade itself and within the cluster.conf file as > > well to make this work. > > > >>> The system-config-cluster application supports brocade fencing. 
It is a > >>> two part process - first you define the switch as a fence device; type > >>> brocade, then you select a node an click "Manage fencing for this node" > >>> and declare a fence instance. > >> Ah, I'm at the command line :). So, there is nothing I need to do on the > >> brocade itself then? The cluster ports aren't connected directly, they > >> are connected into a compaq hub, then the hub is connected into the > >> brocade. The brocade seems to know about the external ports however since > >> they are listed when I look on the switch. As for the conf file, I've not > >> found enough information on how to build a good conf file so know this > >> one is probably not even complete. Been working on other parts of the > >> problems then wanting to get to this. >> config_version="40" name="vgcomp"> >> post_fail_delay="0" post_join_delay="3"/> >> name="cweb92.companions.com" nodeid="92" votes="1"/> >> name="cweb93.companions.com" nodeid="93" votes="1"/> >> name="cweb94.companions.com" nodeid="94" votes="1"/> >> name="dev.companions.com" nodeid="99" votes="1"/> >> name="qm247.companions.com" nodeid="247" votes="1"/> >> name="qm248.companions.com" nodeid="248" votes="1"/> >> name="qm249.companions.com" nodeid="249" votes="1"/> >> name="qm250.companions.com" nodeid="250" votes="1"/> > >> >> ipaddr="x.x.x.x" login="xxx" name="brocade" passwd="xxx"/> > >> -- > >> Linux-cluster mailing list Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > -- ------------------------------------------------------------------------- > > - Frank Hellmann Optical Art GmbH Waterloohain 7a DI Supervisor > > http://www.opticalart.de 22769 Hamburg frank at opticalart.de Tel: ++49 40 > > 5111051 Fax: ++49 40 43169199 > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- - Mac OS X. Because making Unix user-friendly is easier than debugging Windows From pcaulfie at redhat.com Thu Sep 14 07:51:14 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 14 Sep 2006 08:51:14 +0100 Subject: [Linux-cluster] Can't mount multiple GFS volumes? In-Reply-To: <200691394012.985163@leena> References: <200691394012.985163@leena> Message-ID: <450909F2.7070106@redhat.com> isplist at logicore.net wrote: > I have a need for non contiguous storage and wish to mount multiple GFS > logical volumes. However, I cannot seem to get past this following error and > others related. > > -Command > # mount -t gfs /dev/vgcomp/str1 /lvstr1 > mount: File exists > [root at dev new]# > > -Error Log > Sep 12 16:22:22 dev kernel: GFS: Trying to join cluster "lock_dlm", > "vgcomp:gfscomp" > Sep 12 16:22:22 dev kernel: dlm: gfscomp: lockspace already in use > Sep 12 16:22:22 dev kernel: lock_dlm: new lockspace error -17 When you created the GFS volumes using gfs_mkfs did you give them different names ? All filesystems in a cluster must have unique names. -- patrick From frank at opticalart.de Thu Sep 14 08:05:01 2006 From: frank at opticalart.de (Frank Hellmann) Date: Thu, 14 Sep 2006 10:05:01 +0200 Subject: [Linux-cluster] cluster.conf using brocade In-Reply-To: <2006913221935.494886@leena> References: <2006913221935.494886@leena> Message-ID: <45090D2D.30606@opticalart.de> isplist at logicore.net wrote: >> Sadly, that won't work. The fence script for brocade instructs it to >> turn off a specified port. All your machines hook up to the switch >> through a single port. 
So when a node acts up, you disconnect ALL of >> your nodes nodes from the storage at the same time, since they all are >> connected to port 0 (through the hub). >> > > Since hubs are much cheaper than switches, and from the brocade's point of > view, it can see unique ports even on the hub... would it not be worth adding > this functionality to the fencing functions? > > Mike > > Can you try to turn off a single port of that hub via the brocade switch? I doubt that there is any method in the brocade switch to do that, but I could be wrong here. Also if the hub is manageable there might be a way to disable certain ports directly at the hub. If neither works, you'll need to think of a different setup for SAN fencing, like putting the nodes onto their own FC-switch, or consider power fencing via a network manageable pdu or ups. Cheers, Frank... -- -------------------------------------------------------------------------- Frank Hellmann Optical Art GmbH Waterloohain 7a DI Supervisor http://www.opticalart.de 22769 Hamburg frank at opticalart.de Tel: ++49 40 5111051 Fax: ++49 40 43169199 From chekov at ucla.edu Thu Sep 14 10:01:50 2006 From: chekov at ucla.edu (Alan Wood) Date: Thu, 14 Sep 2006 03:01:50 -0700 (PDT) Subject: [Linux-cluster] GFS and the Dell pv220s or iSCSI In-Reply-To: <20060816123335.74AB373340@hormel.redhat.com> References: <20060816123335.74AB373340@hormel.redhat.com> Message-ID: sorry I've been away from the list and only getting to this 1-month old thread now... Brendan, I have a pv220s which I used for a GFS cluster last year with disasterous consequences. Performance was terrible for multiple concurrent users (one of the chief thing you are worried about in selecting SCSI over SATA in the first place). In addition, the support I got from Dell, while attentive, ended after 3 months with "we do not support using the pv220s in an active-active linux cluster". This is after I had reverted out of GFS and was using linux-ha to do failover and getting SCSI reservation errors which led to data loss... I have since moved on to iSCSI as a few people on the list suggested you do. Instead of the Dell/EMC box most people were talking about I went with a Promise vtrak M300i. as far as I could see there were only a couple of minor differences and the promise box was less than half the price when fully stocked with SATA drives (because Dell totally rips you off on the price of the drives). It supports SATA II and NCQ. so far performance has been just as good as with the pv220s in clustered config (I am only using 10K drives in the 220 though). I have just bought a couple of Qlogic HBAs (in the US $500 instead of the $2K someone mentioned in Brazil) but have yet to test them. I also bought a second enclosure and am hoping to use lvm mirroring and multipathing as soon as its good to go in order to have full redundancy: http://www.redhat.com/f/summitfiles/presentation/May31/Clustering%20and%20Storage/StorageUninterrupted.pdf btw, HP does offer an iSCSI head unit that you can then daisy-chain SCSI or SATA enclosures off of -- so if you really want SCSI disks that would be an option: http://h18006.www1.hp.com/products/storageworks/msa1510i/index.html I haven't tested it (and last I heard HP only officially supported it in Windows) but if anyone else on the list has experience with it I'd be curious to hear it. now that I have a 10gig switch available to me I'm also curious to try out a 10gig iSCSI enclosure but haven't seen any on the market... 
-alan > ------------------------------ > > Message: 5 > Date: Tue, 15 Aug 2006 21:29:59 +0100 > From: Brendan Heading > Subject: [Linux-cluster] Setting up a GFS cluster > To: linux-cluster at redhat.com > Message-ID: <44E22EC7.8020506 at clara.co.uk> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hi all, > > I'm planning to build a cluster using a pair of PE1950s, using RHEL 3 > (or 4) with RHCS. Plan at the moment is to use GFS. Most of our stuff is > Dell, therefore the obvious choice is to use a Dell PowerVault 220S as > the shared storage device. > > Before I kick off with this idea I'd be interested to hear if anyone had > any issues with this kind of setup, or if there were any general > performance problems. Are there other SCSI enclosures which might be > better or more appropriate for these purposes ? > > Regards > > Brendan > > > > ------------------------------ > > Message: 7 > Date: Tue, 15 Aug 2006 23:23:40 -0300 > From: "Celso K. Webber" > Subject: Re: [Linux-cluster] Setting up a GFS cluster > To: linux clustering > Message-ID: <44E281AC.4010608 at webbertek.com.br> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hello Brendan, > > Although Dell hardware is an excellent choice for Linux, the PV220S > solution is terrible at performance under a cluster environment. > > The reason is that the PV220S itself does not manage RAID devices, it is > in fact a JBOD (Just a Bunch Of Disks). The RAID management is done by > the SCSI controllers within the servers (PERC 3/DC or PERC 4/DC). > > Since there is a possibility of one of the machines going down, together > with data in the controller's write cache, this solution automatically > disable the write cache (write through mode) when you set the > controllers in "cluster mode". > > The end result is very poor performance, specially on write operations. > It's not uncommon that Dell provides the PV-220S with 15K RPM disks to > compensate this performance penalty due to lack of write cache. > > As far as I can tell, Red Hat did support the PV220S solution in the > past, during the RHEL 2.1 era, but it is not supported anymore as > certified shared storage for cluster solutions (RHCS or RHGFS). > > If you still plan to go on, be warned that the PV220S performs better in > Cluster Mode if you set up the data transfer rate to 160 MB/s instead of > 320 MB/s (the PERC 3/DC supports transfer rates of up to 160 MB/s while > the PERC 4/DC supports up to 320 MB/s). This is a known issue at Dell > support queues. > > As an extra information, there were too many problems about reliability > with the PV220S when used in Cluster Mode, this can be seen by the large > amount of firmware updates for the PERC 3/DC and 4/DC (LSI Logic based > chipset, megaraid driver on Linux). More recent firmware versions seem > to have corrected most logical drive corruption problems I've > experienced, so I believe the PV220S is still worth a try if you can > live with the poor performance issue. > > Maybe a Dell|EMC AX-100 using iSCSI could a better choice with a not so > high price tag. > > Sorry for the long message, I believe this information can be useful to > others. > > Best regards, > > Celso. > > Brendan Heading escreveu: >> Hi all, >> >> I'm planning to build a cluster using a pair of PE1950s, using RHEL 3 >> (or 4) with RHCS. Plan at the moment is to use GFS. Most of our stuff is >> Dell, therefore the obvious choice is to use a Dell PowerVault 220S as >> the shared storage device. 
>> >> Before I kick off with this idea I'd be interested to hear if anyone had >> any issues with this kind of setup, or if there were any general >> performance problems. Are there other SCSI enclosures which might be >> better or more appropriate for these purposes ? >> >> Regards >> >> Brendan >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > From bosse at klykken.com Thu Sep 14 11:20:43 2006 From: bosse at klykken.com (Bosse Klykken) Date: Thu, 14 Sep 2006 13:20:43 +0200 Subject: [Linux-cluster] Cluster node won't rejoin cluster after fencing, stops at cman Message-ID: <45093B0B.10709@klykken.com> Hi. I'm having some issues with a two-node failover cluster on RHEL4/U3 with kernel 2.6.9-34.0.1.ELsmp, ccs-1.0.3-0, cman-1.0.4-0, fence-1.32.18-0 and rgmanager-1.9.46-0. After a mishap where I accidentaly caused a failover of services with power fencing of server01, the system will not rejoin the cluster after boot. I have tried using both the init.d scripts and starting the daemons manually to troubleshoot this further, to no avail. I'm able to start ccsd properly (although it logs the cluster as inquorate) but it fails completely on cman, claiming that connection is refused. If anyone could help me by giving me some tips, directing me to the proper documentation addressing this issue or downright pointing out my problem, I would be most grateful. [server01] # service ccsd start Starting ccsd: [ OK ] ---8<--- /var/log/messages Sep 14 00:33:28 server01 ccsd[30227]: Starting ccsd 1.0.3: Sep 14 00:33:28 server01 ccsd[30227]: Built: Jan 25 2006 16:54:43 Sep 14 00:33:28 server01 ccsd[30227]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. Sep 14 00:33:28 server01 ccsd[30227]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5 Sep 14 00:33:28 server01 ccsd[30227]: Initial status:: Inquorate Sep 14 00:33:29 server01 ccsd: startup succeeded ---8<--- [server01] # service cman start Starting cman: [FAILED] ---8<--- /var/log/messages Sep 14 00:39:07 server01 ccsd[31417]: Cluster is not quorate. Refusing connection. Sep 14 00:39:07 server01 ccsd[31417]: Error while processing connect: Connection refused Sep 14 00:39:07 server01 ccsd[31417]: cluster.conf (cluster name = something_cluster, version = 46) found. Sep 14 00:39:07 server01 ccsd[31417]: Remote copy of cluster.conf is from quorate node. 
Sep 14 00:39:07 server01 ccsd[31417]: Local version # : 46 Sep 14 00:39:07 server01 ccsd[31417]: Remote version #: 46 Sep 14 00:39:07 server01 cman: cman_tool: Node is already active failed Sep 14 00:39:12 server01 kernel: CMAN: sending membership request ---8<--- [server01] # cat /proc/cluster/status Protocol version: 5.0.1 Config version: 46 Cluster name: something_cluster Cluster ID: 47540 Cluster Member: No Membership state: Joining [server01] # cat /proc/cluster/nodes Node Votes Exp Sts Name [server02] # cat /proc/cluster/status Protocol version: 5.0.1 Config version: 46 Cluster name: something_cluster Cluster ID: 47540 Cluster Member: Yes Membership state: Cluster-Member Nodes: 1 Expected_votes: 1 Total_votes: 1 Quorum: 1 Active subsystems: 4 Node name: server02 Node addresses: xx.xx.xx.134 [server02] # cat /proc/cluster/nodes Node Votes Exp Sts Name 1 1 1 X server01 2 1 1 M server02 [server01] # cat /etc/cluster/cluster.conf ---8<--- > > > > > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > From carlopmart at gmail.com Wed Sep 20 14:03:42 2006 From: carlopmart at gmail.com (carlopmart) Date: Wed, 20 Sep 2006 16:03:42 +0200 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <45114037.1040108@redhat.com> References: <4510FA68.2000403@gmail.com> <45114037.1040108@redhat.com> Message-ID: <45114A3E.2080202@gmail.com> Thanks Jim, but If i change iLO fence for a GNBD fence, results are the same for my three questions, or do I need to configure one gnbd fence for each node??? Jim Parsons wrote: > carlopmart wrote: > >> Hi all, >> >> Sorry for this toppic, but i have serious doubts about using cluster >> suite under some deployments. My questions: >> >> a) How can I configure status check on a service script? for exmaple: >> I have two nodes with CS U4 with postfix service running on two nodes >> and using DLM as a lock manager. If I stop postfix from the script and >> I wait status check, nothing happens and rgmanager returns an ok for >> the service, but this service is stopped !!!. >> >> b) is it posible to startup only one node on a two-node cluster? i >> have tested this feature, but this node doesn't startup ( i am using >> iLO as a fencing method, but I have tested gnbd too and the result is >> the same). >> >> c) why relocate service doesn't works?? I have attached my config. >> For example, if I reboot one node, all services go to the second. This >> is ok, but when this primary node is up, services continue getting up >> in the previous node and they don't migrate towards the other node. >> >> >> I suppose that I am doing something wrong but i don't know what. >> Somebody can helps me?? >> >> many thanks. > > Below in the conf file, you have one ilo device declared under the > fencedevice section, and both nodes are using it. This would mean that > if one node were ever fenced, then both nodes would be fenced. ilo is a > per node fence type - they are rarely shared. I think you should have a > fencedevice block of type ilo for each node, and then the fence section > under each node should ref the appropriate device....that is, node1 > should use it's built-in ilo and node2 should use its own. 
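(For illustration, the per-node layout described above would look roughly like the excerpt below -- the hostnames, logins and device names are placeholders of mine, not the real config, and the exact attribute names are worth checking against the fence_ilo man page:

<fencedevices>
  <fencedevice agent="fence_ilo" hostname="ilo-node1.example.com" login="Administrator" passwd="xxx" name="ilo_node1"/>
  <fencedevice agent="fence_ilo" hostname="ilo-node2.example.com" login="Administrator" passwd="xxx" name="ilo_node2"/>
</fencedevices>

and inside each clusternode block a fence section that references only that node's own iLO, e.g. for node1:

<fence>
  <method name="1">
    <device name="ilo_node1"/>
  </method>
</fence>

That way fencing node1 acts only on node1 instead of taking both nodes down.)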
> > -J > >> >> >> >> ------------------------------------------------------------------------ >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > login="Administrator" name="fence_iLO" passwd="fenceilo"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > name="dbserver" recovery="relocate"> >> >> >> >> >> >> >> >> >> ------------------------------------------------------------------------ >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- CL Martinez carlopmart {at} gmail {d0t} com From lhh at redhat.com Wed Sep 20 18:19:21 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 20 Sep 2006 14:19:21 -0400 Subject: [Linux-cluster] clmrmtabd not running. Can anyone fill me in? In-Reply-To: <20060919001307.32980.qmail@web34201.mail.mud.yahoo.com> References: <20060919001307.32980.qmail@web34201.mail.mud.yahoo.com> Message-ID: <1158776361.7388.21.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-18 at 17:13 -0700, Rick Rodgers wrote: > I am using Clumanager version 1.2.24. I notice that clurmtabd is not > running for my services. IS this correct? If so does anyone know why? > > Also if it is dupposed to be running what should the cluster.xml look > lilke to > make that happen? It only runs if an NFS service is running. -- Lon From lhh at redhat.com Wed Sep 20 18:26:15 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 20 Sep 2006 14:26:15 -0400 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <4510FA68.2000403@gmail.com> References: <4510FA68.2000403@gmail.com> Message-ID: <1158776775.7388.29.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-20 at 10:23 +0200, carlopmart wrote: > Hi all, > > Sorry for this toppic, but i have serious doubts about using cluster > suite under some deployments. My questions: > > a) How can I configure status check on a service script? for exmaple: > I have two nodes with CS U4 with postfix service running on two nodes > and using DLM as a lock manager. If I stop postfix from the script and I > wait status check, nothing happens and rgmanager returns an ok for the > service, but this service is stopped !!!. The status check in the postfix script must return nonzero if the service is stopped. > b) is it posible to startup only one node on a two-node cluster? i > have tested this feature, but this node doesn't startup ( i am using iLO > as a fencing method, but I have tested gnbd too and the result is the same). Yes, but the other node must be fenced first. > c) why relocate service doesn't works?? I have attached my config. For > example, if I reboot one node, all services go to the second. This is > ok, but when this primary node is up, services continue getting up in > the previous node and they don't migrate towards the other node. Kill the 'exclusive' attribute. It doesn't do what you think it does (and is probably the source of your problem). -- Lon From jab at ufba.br Wed Sep 20 20:20:31 2006 From: jab at ufba.br (Jeronimo Bezerra) Date: Wed, 20 Sep 2006 17:20:31 -0300 Subject: [Linux-cluster] Troubles to install GFS on Debian Message-ID: <1158783631.28886.8.camel@localhost.localdomain> Hello All. I'm having a big trouble here to compile the gfs on Debian. 
I downloaded from CVS, and did the follow: cd /usr/src ln -s linux-source-2.6.16 linux-2.6 cd cluster ./configure make After that, the make command returns the follow: make[2]: Entering directory `/usr/src/cluster/group/daemon (...) gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o joinleave.o joinleave.c joinleave.c: In function `do_leave': joinleave.c:129: warning: long long unsigned int format, uint64_t arg (arg 7) joinleave.c:136: warning: long long unsigned int format, uint64_t arg (arg 7) gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o main.o main.c main.c: In function `app_deliver': main.c:180: warning: int format, different type arg (arg 6) gcc -L//usr/src/cluster/group/../cman/lib -L//usr/lib64/openais -L//usr/lib64 -o groupd app.o cpg.o cman.o joinleave.o main.o -lcman -lcpg /usr/bin/ld: cannot find -lcpg collect2: ld returned 1 exit status make[2]: *** [groupd] Error 1 make[2]: Leaving directory `/usr/src/cluster/group/daemon' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/cluster/group' make: *** [all] Error 2 In the directory /usr/src/cluster/group/daemon : app.c app.o cman.c cman.o cpg.c cpg.o CVS gd_internal.h groupd.h joinleave.c joinleave.o main.c main.o Makefile ie, the gcc loads the cman (-lcman) but doesn't with cgp (-lcpg). Why this happen? Is there a best howto to install gfs on Debian? or another way?? Thanks a lot Jeronimo Bezerra From rpeterso at redhat.com Wed Sep 20 20:38:36 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 20 Sep 2006 15:38:36 -0500 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <1158783631.28886.8.camel@localhost.localdomain> References: <1158783631.28886.8.camel@localhost.localdomain> Message-ID: <4511A6CC.4080301@redhat.com> Jeronimo Bezerra wrote: > Hello All. > > I'm having a big trouble here to compile the gfs on Debian. I downloaded > from CVS, and did the follow: > > cd /usr/src > ln -s linux-source-2.6.16 linux-2.6 > cd cluster > ./configure > make > > After that, the make command returns the follow: > > make[2]: Entering directory `/usr/src/cluster/group/daemon > (...) > gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o > joinleave.o joinleave.c > joinleave.c: In function `do_leave': > joinleave.c:129: warning: long long unsigned int format, uint64_t arg > (arg 7) > joinleave.c:136: warning: long long unsigned int format, uint64_t arg > (arg 7) > gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o main.o > main.c > main.c: In function `app_deliver': > main.c:180: warning: int format, different type arg (arg 6) > gcc -L//usr/src/cluster/group/../cman/lib -L//usr/lib64/openais > -L//usr/lib64 -o groupd app.o cpg.o cman.o joinleave.o main.o -lcman > -lcpg > /usr/bin/ld: cannot find -lcpg > collect2: ld returned 1 exit status > make[2]: *** [groupd] Error 1 > make[2]: Leaving directory `/usr/src/cluster/group/daemon' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/usr/src/cluster/group' > make: *** [all] Error 2 > > In the directory /usr/src/cluster/group/daemon : > > app.c app.o cman.c cman.o cpg.c cpg.o CVS gd_internal.h groupd.h > joinleave.c joinleave.o main.c main.o Makefile > > ie, the gcc loads the cman (-lcman) but doesn't with cgp (-lcpg). > > Why this happen? Is there a best howto to install gfs on Debian? or > another way?? > > Thanks a lot > > Jeronimo Bezerra Hi Jeronimo, libcpg is part of openais, so I suspect you're missing openais. 
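A quick way to confirm that is to check whether the library is anywhere the linker will look, and if it isn't, build and install openais before re-running make in the cluster tree. Rough sketch only -- the paths are just the usual defaults, adjust them to wherever your openais checkout and libraries actually live:

# is libcpg present at all?
ls /usr/lib/libcpg* /usr/lib64/libcpg* /usr/local/lib/libcpg* 2>/dev/null

# if nothing turns up, build and install openais first
cd /usr/src/openais
make
make install    # depending on the openais Makefile you may need DESTDIR=... here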
Here is a usage.txt file that is a good resource for building the development tree of cluster/GFS from source. http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/doc/usage.txt?cvsroot=cluster Regards, Bob Peterson Red Hat Cluster Suite From jab at ufba.br Wed Sep 20 20:59:45 2006 From: jab at ufba.br (Jeronimo Bezerra) Date: Wed, 20 Sep 2006 17:59:45 -0300 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <4511A6CC.4080301@redhat.com> References: <1158783631.28886.8.camel@localhost.localdomain> <4511A6CC.4080301@redhat.com> Message-ID: <1158785985.31019.5.camel@localhost.localdomain> Hello Bob, thanks :) I installed openais but I didn't see that was in /usr/local/usr/include/openais, and in the Debian the default location is /usr/include. I fix it. After that, I received another error: make[2]: Leaving directory `/usr/src/cluster/group/tool' make -C dlm_controld all make[2]: Entering directory `/usr/src/cluster/group/dlm_controld' gcc -Wall -g -I//usr/include -I../config -idirafter /include/linux -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c -o main.o main.c main.c: In function `setup_uevent': main.c:183: error: `NETLINK_KOBJECT_UEVENT' undeclared (first use in this function) main.c:183: error: (Each undeclared identifier is reported only once main.c:183: error: for each function it appears in.) make[2]: *** [main.o] Error 1 make[2]: Leaving directory `/usr/src/cluster/group/dlm_controld' make[1]: *** [all] Error 2 make[1]: Leaving directory `/usr/src/cluster/group' make: *** [all] Error 2 I will try to resolve it tonight. But if you would like to help, please! :) Thank you again! Jeronimo Em Qua, 2006-09-20 ?s 15:38 -0500, Robert Peterson escreveu: > Jeronimo Bezerra wrote: > > Hello All. > > > > I'm having a big trouble here to compile the gfs on Debian. I downloaded > > from CVS, and did the follow: > > > > cd /usr/src > > ln -s linux-source-2.6.16 linux-2.6 > > cd cluster > > ./configure > > make > > > > After that, the make command returns the follow: > > > > make[2]: Entering directory `/usr/src/cluster/group/daemon > > (...) > > gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o > > joinleave.o joinleave.c > > joinleave.c: In function `do_leave': > > joinleave.c:129: warning: long long unsigned int format, uint64_t arg > > (arg 7) > > joinleave.c:136: warning: long long unsigned int format, uint64_t arg > > (arg 7) > > gcc -Wall -g -I. -I../include/ -I../../cman/lib/ -I../lib/ -c -o main.o > > main.c > > main.c: In function `app_deliver': > > main.c:180: warning: int format, different type arg (arg 6) > > gcc -L//usr/src/cluster/group/../cman/lib -L//usr/lib64/openais > > -L//usr/lib64 -o groupd app.o cpg.o cman.o joinleave.o main.o -lcman > > -lcpg > > /usr/bin/ld: cannot find -lcpg > > collect2: ld returned 1 exit status > > make[2]: *** [groupd] Error 1 > > make[2]: Leaving directory `/usr/src/cluster/group/daemon' > > make[1]: *** [all] Error 2 > > make[1]: Leaving directory `/usr/src/cluster/group' > > make: *** [all] Error 2 > > > > In the directory /usr/src/cluster/group/daemon : > > > > app.c app.o cman.c cman.o cpg.c cpg.o CVS gd_internal.h groupd.h > > joinleave.c joinleave.o main.c main.o Makefile > > > > ie, the gcc loads the cman (-lcman) but doesn't with cgp (-lcpg). > > > > Why this happen? Is there a best howto to install gfs on Debian? or > > another way?? > > > > Thanks a lot > > > > Jeronimo Bezerra > Hi Jeronimo, > > libcpg is part of openais, so I suspect you're missing openais. 
> Here is a usage.txt file that is a good resource for building the > development > tree of cluster/GFS from source. > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/doc/usage.txt?cvsroot=cluster > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rpeterso at redhat.com Wed Sep 20 21:07:10 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 20 Sep 2006 16:07:10 -0500 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <1158785985.31019.5.camel@localhost.localdomain> References: <1158783631.28886.8.camel@localhost.localdomain> <4511A6CC.4080301@redhat.com> <1158785985.31019.5.camel@localhost.localdomain> Message-ID: <4511AD7E.6020302@redhat.com> Jeronimo Bezerra wrote: > Hello Bob, thanks :) > > I installed openais but I didn't see that was > in /usr/local/usr/include/openais, and in the Debian the default > location is /usr/include. I fix it. After that, I received another > error: > > make[2]: Leaving directory `/usr/src/cluster/group/tool' > make -C dlm_controld all > make[2]: Entering directory `/usr/src/cluster/group/dlm_controld' > gcc -Wall -g -I//usr/include -I../config -idirafter /include/linux > -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c > -o main.o main.c > main.c: In function `setup_uevent': > main.c:183: error: `NETLINK_KOBJECT_UEVENT' undeclared (first use in > this function) > main.c:183: error: (Each undeclared identifier is reported only once > main.c:183: error: for each function it appears in.) > make[2]: *** [main.o] Error 1 > make[2]: Leaving directory `/usr/src/cluster/group/dlm_controld' > make[1]: *** [all] Error 2 > make[1]: Leaving directory `/usr/src/cluster/group' > make: *** [all] Error 2 > > I will try to resolve it tonight. But if you would like to help, > please! :) > > Thank you again! > > Jeronimo > Sounds like you're not picking up netlink.h. Regards, Bob Peterson Red Hat Cluster Suite From sdake at redhat.com Wed Sep 20 23:22:16 2006 From: sdake at redhat.com (Steven Dake) Date: Wed, 20 Sep 2006 16:22:16 -0700 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <4511AD7E.6020302@redhat.com> References: <1158783631.28886.8.camel@localhost.localdomain> <4511A6CC.4080301@redhat.com> <1158785985.31019.5.camel@localhost.localdomain> <4511AD7E.6020302@redhat.com> Message-ID: <1158794536.20300.7.camel@shih.broked.org> Jeronimo, I suspect you have old kernel include headers which did not support the uevent mechanism. For example on my 2.6.9 kernel I am using with 2.6.9 include headers, there is no support for uevents. You can work around this problem by defining NETLINK_KOBJECT_UEVENT to be whatever value is in your kernel (found in include/linux/netlink.h in your new kernel sources you are installing) in main.c. Alternatively you could upgrade your kernel include headers. You didn't state which version of debian you are using, but updating the kernel headers could cause problems, so I'd stick with the workaround above. You should also verify your kernel supports uevents. This can be done by checking for a copy of the file lib/kobject_uevent.c within the kernel source tree. It also needs to be enabled in the kernel configuration options. I would expect debian unstable and also FC6 to support these features. 
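Concretely, something like the following should show the value to use for the define and whether the uevent code is present at all -- the path assumes the 2.6.16 source tree mentioned earlier in the thread, substitute your own:

# the value to put in the NETLINK_KOBJECT_UEVENT define in group/dlm_controld/main.c
grep NETLINK_KOBJECT_UEVENT /usr/src/linux-source-2.6.16/include/linux/netlink.h

# the uevent support mentioned above lives here
ls /usr/src/linux-source-2.6.16/lib/kobject_uevent.c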
Regards -steve On Wed, 2006-09-20 at 16:07 -0500, Robert Peterson wrote: > Jeronimo Bezerra wrote: > > Hello Bob, thanks :) > > > > I installed openais but I didn't see that was > > in /usr/local/usr/include/openais, and in the Debian the default > > location is /usr/include. I fix it. After that, I received another > > error: > > > > make[2]: Leaving directory `/usr/src/cluster/group/tool' > > make -C dlm_controld all > > make[2]: Entering directory `/usr/src/cluster/group/dlm_controld' > > gcc -Wall -g -I//usr/include -I../config -idirafter /include/linux > > -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c > > -o main.o main.c > > main.c: In function `setup_uevent': > > main.c:183: error: `NETLINK_KOBJECT_UEVENT' undeclared (first use in > > this function) > > main.c:183: error: (Each undeclared identifier is reported only once > > main.c:183: error: for each function it appears in.) > > make[2]: *** [main.o] Error 1 > > make[2]: Leaving directory `/usr/src/cluster/group/dlm_controld' > > make[1]: *** [all] Error 2 > > make[1]: Leaving directory `/usr/src/cluster/group' > > make: *** [all] Error 2 > > > > I will try to resolve it tonight. But if you would like to help, > > please! :) > > > > Thank you again! > > > > Jeronimo > > > Sounds like you're not picking up netlink.h. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rodgersr at yahoo.com Wed Sep 20 23:45:21 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Wed, 20 Sep 2006 16:45:21 -0700 (PDT) Subject: [Linux-cluster] Clurmtabd needs to be run manually? Message-ID: <20060920234521.37344.qmail@web34214.mail.mud.yahoo.com> When I start clumanager I notice that clumrmtabd does not start up. The man page says the service manager daemon will automatically start is for each mount point. But this does not seem to happen. I can start it manually. Does anyone know if this is the expected behavior? Is it accepatable to manually start it? Below is my cluster.xml file. -------------- next part -------------- An HTML attachment was scrubbed... URL: From RMoody at mweb.com Thu Sep 21 08:54:17 2006 From: RMoody at mweb.com (Robert Moody - MWEB) Date: Thu, 21 Sep 2006 10:54:17 +0200 Subject: [Linux-cluster] Testing a ipmi fence. Message-ID: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com> Hi all, I have configured this before but now that I want to show someone how it is working do you think it will work. Anyway here is my problem. I have 3 dell 2850's with an onboard ipmi interface on. I have configured these to work on the lan on a private network. Now there was a command that I have used before to manually fence a node to test if the fencing is working. For the life of me I can not remember what it was. Anyone done this recently? Thanks, Robert. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jab at ufba.br Thu Sep 21 11:04:38 2006 From: jab at ufba.br (Jeronimo Bezerra) Date: Thu, 21 Sep 2006 08:04:38 -0300 Subject: [Linux-cluster] Troubles to install GFS on Debian In-Reply-To: <1158794536.20300.7.camel@shih.broked.org> References: <1158783631.28886.8.camel@localhost.localdomain> <4511A6CC.4080301@redhat.com> <1158785985.31019.5.camel@localhost.localdomain> <4511AD7E.6020302@redhat.com> <1158794536.20300.7.camel@shih.broked.org> Message-ID: <1158836678.31019.10.camel@localhost.localdomain> Thanks Steve! I'll try to upgrade my kernel-headers. 
This box is just for tests. My debian is 3.1 Sarge, and my kernel is 2.6.16. Thanks again Jeronimo Em Qua, 2006-09-20 ?s 16:22 -0700, Steven Dake escreveu: > Jeronimo, > > I suspect you have old kernel include headers which did not support the > uevent mechanism. For example on my 2.6.9 kernel I am using with 2.6.9 > include headers, there is no support for uevents. You can work around > this problem by defining NETLINK_KOBJECT_UEVENT to be whatever value is > in your kernel (found in include/linux/netlink.h in your new kernel > sources you are installing) in main.c. Alternatively you could upgrade > your kernel include headers. You didn't state which version of debian > you are using, but updating the kernel headers could cause problems, so > I'd stick with the workaround above. > > You should also verify your kernel supports uevents. This can be done > by checking for a copy of the file lib/kobject_uevent.c within the > kernel source tree. It also needs to be enabled in the kernel > configuration options. I would expect debian unstable and also FC6 to > support these features. > > Regards > -steve > > On Wed, 2006-09-20 at 16:07 -0500, Robert Peterson wrote: > > Jeronimo Bezerra wrote: > > > Hello Bob, thanks :) > > > > > > I installed openais but I didn't see that was > > > in /usr/local/usr/include/openais, and in the Debian the default > > > location is /usr/include. I fix it. After that, I received another > > > error: > > > > > > make[2]: Leaving directory `/usr/src/cluster/group/tool' > > > make -C dlm_controld all > > > make[2]: Entering directory `/usr/src/cluster/group/dlm_controld' > > > gcc -Wall -g -I//usr/include -I../config -idirafter /include/linux > > > -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c > > > -o main.o main.c > > > main.c: In function `setup_uevent': > > > main.c:183: error: `NETLINK_KOBJECT_UEVENT' undeclared (first use in > > > this function) > > > main.c:183: error: (Each undeclared identifier is reported only once > > > main.c:183: error: for each function it appears in.) > > > make[2]: *** [main.o] Error 1 > > > make[2]: Leaving directory `/usr/src/cluster/group/dlm_controld' > > > make[1]: *** [all] Error 2 > > > make[1]: Leaving directory `/usr/src/cluster/group' > > > make: *** [all] Error 2 > > > > > > I will try to resolve it tonight. But if you would like to help, > > > please! :) > > > > > > Thank you again! > > > > > > Jeronimo > > > > > Sounds like you're not picking up netlink.h. > > > > Regards, > > > > Bob Peterson > > Red Hat Cluster Suite > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From carlopmart at gmail.com Thu Sep 21 11:12:53 2006 From: carlopmart at gmail.com (carlopmart) Date: Thu, 21 Sep 2006 13:12:53 +0200 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <1158776775.7388.29.camel@rei.boston.devel.redhat.com> References: <4510FA68.2000403@gmail.com> <1158776775.7388.29.camel@rei.boston.devel.redhat.com> Message-ID: <451273B5.3090804@gmail.com> Lon Hohberger wrote: > On Wed, 2006-09-20 at 10:23 +0200, carlopmart wrote: >> Hi all, >> >> Sorry for this toppic, but i have serious doubts about using cluster >> suite under some deployments. My questions: >> >> a) How can I configure status check on a service script? 
for exmaple: >> I have two nodes with CS U4 with postfix service running on two nodes >> and using DLM as a lock manager. If I stop postfix from the script and I >> wait status check, nothing happens and rgmanager returns an ok for the >> service, but this service is stopped !!!. > > The status check in the postfix script must return nonzero if the > service is stopped. Lon, I use original's postfix script and returns this if postfix is up: "master (pid 957) is running..." when postfix isn't up, script returns: "master is stopped". Do I need to change this message to "0" for status check works ok?? > >> b) is it posible to startup only one node on a two-node cluster? i >> have tested this feature, but this node doesn't startup ( i am using iLO >> as a fencing method, but I have tested gnbd too and the result is the same). > > Yes, but the other node must be fenced first. Then, I can't startup only one node, when both are stopped, right?? > >> c) why relocate service doesn't works?? I have attached my config. For >> example, if I reboot one node, all services go to the second. This is >> ok, but when this primary node is up, services continue getting up in >> the previous node and they don't migrate towards the other node. > > Kill the 'exclusive' attribute. It doesn't do what you think it does > (and is probably the source of your problem). Thanks, now works ok. > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- CL Martinez carlopmart {at} gmail {d0t} com From gforte at leopard.us.udel.edu Thu Sep 21 12:11:53 2006 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Thu, 21 Sep 2006 08:11:53 -0400 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <451273B5.3090804@gmail.com> References: <4510FA68.2000403@gmail.com> <1158776775.7388.29.camel@rei.boston.devel.redhat.com> <451273B5.3090804@gmail.com> Message-ID: <45128189.3060108@leopard.us.udel.edu> > Lon, I use original's postfix script and returns this if postfix is up: > "master (pid 957) is running..." when postfix isn't up, script returns: > "master is stopped". Do I need to change this message to "0" for status > check works ok?? That's just a message that's printed. return status is the value given in a statement of the form 'return X', or 0 if no such statement is explicitly reached. All executables return a status value to the shell, where 0 is taken to mean "OK", and non-zero means "something bad happened". The postfix script appears to return the correct values in each case. My guess would be that it's cluster configuration problem, but I didn't see anything about postfix in the conf that you pasted ... > Then, I can't startup only one node, when both are stopped, right?? No, you definitely can do this, if the cluster is configured correctly. The problem may be in your fencing method - the first thing the booted node will do when cman starts is to try to contact the other node. When it times out, it'll try to fence the other node and won't continue until it does. If the fence process fails, it'll hang there, which I'm guessing is what you're seeing. So the problem is most likely that fencing is failing, either due to misconfiguration or because the other node is powered off and so its iLo agent isn't responding. 
Since iLo is supposed to be able to power-up a switched-off server, my guess is there's a problem with your fencing configuration - did you fix it so that you have a separate fencedevice entry for each node? -g From Alain.Moulle at bull.net Thu Sep 21 13:23:42 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Thu, 21 Sep 2006 15:23:42 +0200 Subject: [Linux-cluster] CS4 U2 / clustat not responding Message-ID: <4512925E.1090007@bull.net> Hi Could you give me all patches (or defect) numbers available to avoid clustat stalled on CS4 U2 ? Thanks Alain From f.hackenberger at mediatransfer.com Thu Sep 21 13:57:59 2006 From: f.hackenberger at mediatransfer.com (Falk Hackenberger - MediaTransfer AG Netresearch & Consulting) Date: Thu, 21 Sep 2006 15:57:59 +0200 Subject: [Linux-cluster] search experiences RedHat CS and lvm2 snapshots on both nodes Message-ID: <45129A67.6080007@mediatransfer.com> Hello all, I have running 2 nodes (active-pasive) on one san. Because the san have no snapshot functionality I use lvm2 snapshots. the disks on the san are one Volume group with many Logical volumes. have you experiences with setups wich are: service1 runs on node1 an need Logical volume1 service2 runs on node2 an need Logical volume2 it is posible to say in such a setup snapshot1 on on node1 on Logical volume1 snapshot2 on on node2 on Logical volume2 remember both Logical volumes are on one Volume group. experiences, recommendations? thanks, falk From lhh at redhat.com Thu Sep 21 14:07:20 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 21 Sep 2006 10:07:20 -0400 Subject: [Linux-cluster] Testing a ipmi fence. In-Reply-To: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com> References: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com> Message-ID: <1158847640.7388.83.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-21 at 10:54 +0200, Robert Moody - MWEB wrote: > Hi all, > > I have configured this before but now that I want to show someone how > it is working do you think it will work. > > Anyway here is my problem. I have 3 dell 2850's with an onboard ipmi > interface on. I have configured these to work on the lan on a private > network. > > Now there was a command that I have used before to manually fence a > node to test if the fencing is working. For the life of me I can not > remember what it was. > > Anyone done this recently? fence_node -n ? From lhh at redhat.com Thu Sep 21 14:11:03 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 21 Sep 2006 10:11:03 -0400 Subject: [Linux-cluster] Testing a ipmi fence. In-Reply-To: <1158847640.7388.83.camel@rei.boston.devel.redhat.com> References: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com> <1158847640.7388.83.camel@rei.boston.devel.redhat.com> Message-ID: <1158847863.7388.88.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-21 at 10:07 -0400, Lon Hohberger wrote: > On Thu, 2006-09-21 at 10:54 +0200, Robert Moody - MWEB wrote: > > Hi all, > > > > I have configured this before but now that I want to show someone how > > it is working do you think it will work. > > > > Anyway here is my problem. I have 3 dell 2850's with an onboard ipmi > > interface on. I have configured these to work on the lan on a private > > network. > > > > Now there was a command that I have used before to manually fence a > > node to test if the fencing is working. For the life of me I can not > > remember what it was. > > > > Anyone done this recently? 
> > fence_node -n er, fence_node -- Lon From lhh at redhat.com Thu Sep 21 14:18:36 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 21 Sep 2006 10:18:36 -0400 Subject: [Linux-cluster] Clurmtabd needs to be run manually? In-Reply-To: <20060920234521.37344.qmail@web34214.mail.mud.yahoo.com> References: <20060920234521.37344.qmail@web34214.mail.mud.yahoo.com> Message-ID: <1158848316.7388.96.camel@rei.boston.devel.redhat.com> On Wed, 2006-09-20 at 16:45 -0700, Rick Rodgers wrote: > When I start clumanager I notice that clumrmtabd does not start up. > The man > page says the service manager daemon will automatically start is for > each mount > point. But this does not seem to happen. > > I can start it manually. Does anyone know if this is the expected > behavior? Is it accepatable to manually start it? Below is my > cluster.xml file. > > name="service-core" userscript="/etc/init.d/service-core"> > > ipaddress="10.20.70.104" netmask="255.255.255.0"/> > > > options=""/> > > > It looks like there is no clumanager-managed NFS component to the service, which is why it's not being started. If you're not running NFS, then you don't need clurmtabd. You can start it manually, or you can tweak the scripts for your /service mountpoint and make it start if you need to. The easiest thing to do is just add a dummy export entry to the service. -- Lon From RMoody at mweb.com Thu Sep 21 14:54:20 2006 From: RMoody at mweb.com (Robert Moody - MWEB) Date: Thu, 21 Sep 2006 16:54:20 +0200 Subject: [Linux-cluster] Testing a ipmi fence. References: <6586D1F97DDEDE408BEEF44402F37978084467@mwmx4.mweb.com><1158847640.7388.83.camel@rei.boston.devel.redhat.com> <1158847863.7388.88.camel@rei.boston.devel.redhat.com> Message-ID: <6586D1F97DDEDE408BEEF44402F3797808446D@mwmx4.mweb.com> Ok I get the barney award. The one command I did not write down and document cause duh it was so easy is the one that I forget. Thanks guys I feel really clever right now. ;-) (Very sheepishly looks at the ground....) -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Lon Hohberger Sent: Thu 9/21/2006 4:11 PM To: linux clustering Subject: Re: [Linux-cluster] Testing a ipmi fence. On Thu, 2006-09-21 at 10:07 -0400, Lon Hohberger wrote: > On Thu, 2006-09-21 at 10:54 +0200, Robert Moody - MWEB wrote: > > Hi all, > > > > I have configured this before but now that I want to show someone how > > it is working do you think it will work. > > > > Anyway here is my problem. I have 3 dell 2850's with an onboard ipmi > > interface on. I have configured these to work on the lan on a private > > network. > > > > Now there was a command that I have used before to manually fence a > > node to test if the fencing is working. For the life of me I can not > > remember what it was. > > > > Anyone done this recently? > > fence_node -n er, fence_node -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Thu Sep 21 14:57:56 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 21 Sep 2006 10:57:56 -0400 Subject: [Linux-cluster] CS4 U2 / clustat not responding In-Reply-To: <4512925E.1090007@bull.net> References: <4512925E.1090007@bull.net> Message-ID: <1158850676.7388.104.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-21 at 15:23 +0200, Alain Moulle wrote: > Hi > Could you give me all patches (or defect) numbers available to avoid > clustat stalled on CS4 U2 ? 
> Thanks > Alain Please look at errata notes for bugs fixed. Especially look at magma, magma-plugins, and rgmanager bugzillas. Some patches may be incremental (and therefore may not apply to U2). Several bugs may cause this symptom. Here are the errata: https://rhn.redhat.com/errata/RHBA-2006-0557.html https://rhn.redhat.com/errata/RHBA-2006-0552.html https://rhn.redhat.com/errata/RHBA-2006-0551.html https://rhn.redhat.com/errata/RHBA-2006-0241.html https://rhn.redhat.com/errata/RHBA-2006-0240.html https://rhn.redhat.com/errata/RHBA-2006-0239.html If you would like to fork U2, your quickest bet to get something working is to take a diff of the sources between U2 and U4 and pull out things like resource agent changes and such. You can run the U4 version of magma-plugins, magma, and rgmanager on the U2 infrastructure if this makes it your particular environment. That is, you can leave cman[-kernel], dlm[-kernel], ccsd, gfs, and the rest of the system at U2, and just upgrade magma, magma-plugins, and rgmanager (though, you can safely update ccsd too) if you want, and it will probably save you time and effort. -- Lon From dist-list at LEXUM.UMontreal.CA Thu Sep 21 16:03:13 2006 From: dist-list at LEXUM.UMontreal.CA (FM) Date: Thu, 21 Sep 2006 12:03:13 -0400 Subject: [Linux-cluster] Throttling HTTP traffic when using director ? Message-ID: <4512B7C1.5010109@lexum.umontreal.ca> Hello, Our setup : 1 director in front of 4 web servers. (redhat cluster suite). All servers are behind Pix Firewall We need a way to stop abusing users (based on a download limit per day for ex.). The prob is that we need to do it on the director level (so we cannot use apache2 modules) and it has to be dynamic based on the bandwidth usage of the client. Should I use iptable's traffic shaping capabilities for that ? Do you have any advice for this particular situation. Thanks !! F From jab at ufba.br Thu Sep 21 17:25:13 2006 From: jab at ufba.br (Jeronimo Bezerra) Date: Thu, 21 Sep 2006 14:25:13 -0300 Subject: [Linux-cluster] Troubles to install GFS on Debian] Message-ID: <1158859513.31019.37.camel@localhost.localdomain> I forget to send to list :). So, where can I find the lock_dlm_plock.h? I already searched im my linux box and nothing. The openais is installed too. Thank Jeronimo -------------- next part -------------- An embedded message was scrubbed... From: Jeronimo Bezerra Subject: Re: [Linux-cluster] Troubles to install GFS on Debian Date: Thu, 21 Sep 2006 10:35:48 -0300 Size: 4002 URL: From teigland at redhat.com Thu Sep 21 18:03:34 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 21 Sep 2006 13:03:34 -0500 Subject: [Linux-cluster] Troubles to install GFS on Debian] In-Reply-To: <1158859513.31019.37.camel@localhost.localdomain> References: <1158859513.31019.37.camel@localhost.localdomain> Message-ID: <20060921180334.GA24022@redhat.com> On Thu, Sep 21, 2006 at 02:25:13PM -0300, Jeronimo Bezerra wrote: > I forget to send to list :). > > So, where can I find the lock_dlm_plock.h? > > I already searched im my linux box and nothing. 
Code in cvs head needs the kernel in this git tree: git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6.git Dave From Zelikov_Mikhail at emc.com Thu Sep 21 19:12:56 2006 From: Zelikov_Mikhail at emc.com (Zelikov_Mikhail at emc.com) Date: Thu, 21 Sep 2006 15:12:56 -0400 Subject: [Linux-cluster] Unable to lock any resource Message-ID: <9B2FEC4CE7E80B4A965F1D9ADF22B17304A65B92@CORPUSMX40B.corp.emc.com> I am debugging a program that uses DLM (lock_resource()) to lock a resource. If I kill the process within GDB and leave it running for a long time (for example overnight), I am not longer able to lock any resources. I obviously killed gdb and verified that I have no leftovers. To verify that it is not just my resource that I can not lock I use: dlmtest from ...dlm/tests/usertests/ directory to lock any resource: [root at bof227 usertest]# ./dlmtest -m NL TEST locking TEST NL ... lock: Invalid argument The error code returned on the lock_resources is EINVAL (22). I can obviously fix this by rebooting the system, however it is a pain. I tried to fix it by restarting cman and clvmd services - no success. And I can not reload dlm kernel module as it is in use. The content of dlm_stats shows that there is the same number of locks as unlocks: [root at bof227 usertest]# cat /proc/cluster/dlm_stats DLM stats (HZ=1000) Lock operations: 21 Unlock operations: 21 Convert operations: 0 Completion ASTs: 42 Blocking ASTs: 0 Lockqueue num waittime ave WAIT_RSB 19 8 0 Total 19 8 0 I was wondering if anybody could provide an insight on this. I was also wondering if there is a better way to deal with this than just rebooting the system. Thanks, Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Blank Bkgrd.gif Type: image/gif Size: 145 bytes Desc: Blank Bkgrd.gif URL: From phung at cs.columbia.edu Thu Sep 21 22:54:46 2006 From: phung at cs.columbia.edu (Dan B. Phung) Date: Thu, 21 Sep 2006 18:54:46 -0400 Subject: [Linux-cluster] kernel oops on mount and sendmsg failed: -22 Message-ID: <45131836.7010802@ncl.cs.columbia.edu> I have a two node cluster, one node (node A) runs linux kernel 2.6.11.12 while the other (node B) runs 2.6.18. both are running cman_tool version 5.0.1. I first start up node A, then node B joins. node A can mount the GFS file systems, but when node B tries that, it gets a kernel oops, which is pasted at the end of the email (see "KERNEL OOPS output"). So I reboot node B and try to rejoin, but it seems to not be able to communicate with node A correctly, as if the cluster is in some stale state (see "node B rejoin kernel messages"). Upon viewing node A, it seemed to have received the join message, but it looks like it didn't send an ack or something, and then node A simply quits...(see "node A kernel messages"). I think the problem lies in my use of two different cluster software versions (even though --version doesn't say so), but the newest -rSTABLE doesn't compile with 2.6.11.12 anymore. What is the recommended solution for a cluster that must run different kernel versions? 
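(For what it's worth, the userland tools reporting the same version doesn't say much about the kernel side; the quickest comparison I know of is to look at what the cman kernel module itself reports on each node, e.g.

cat /proc/cluster/status | head -3
cat /proc/cluster/nodes

and check whether the protocol versions and member lists agree.)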
tia, dan --- BUG: unable to handle kernel NULL pointer dereference at virtual address 0000001c printing eip: c01825e6 *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: lock_dlm dlm gfs lock_harness cman qla2xxx firmware_class scsi_transport_fc ppdev parport_pc lp parport sg sd_mod scsi_mod ide_generic ide_cd cdrom evdev i2c_piix4 psmouse i2c_core serio_raw sworks_agp agpgart rtc pcspkr ext3 jbd mbcache dm_mirror dm_snapshot dm_mod ide_disk serverworks generic ohci_hcd ide_core usbcore tg3 thermal processor fan unix CPU: 2 EIP: 0060:[] Tainted: GF VLI EFLAGS: 00010293 (2.6.18 #1) EIP is at do_add_mount+0x66/0x130 eax: 0000000c ebx: f3843f24 ecx: c24fbac0 edx: f443f550 esi: df907200 edi: 00000000 ebp: 00000000 esp: f3843df4 ds: 007b es: 007b ss: 0068 Process mount (pid: 14922, ti=f3842000 task=f443f550 task.ti=f3842000) Stack: c0394388 00000000 00000000 f49a1000 f3843f24 00000000 c018321d df907200 f3843f24 00000000 00000000 f49a1000 df907200 c033a5c0 fffffffe 00000000 c0175080 c24fbac0 f3843ef8 00000050 f4998000 dfb98c40 c24fbac0 df98330c Call Trace: [] do_mount+0x33d/0x760 [] link_path_walk+0x80/0x100 [] __handle_mm_fault+0x233/0x980 [] __handle_mm_fault+0x4d6/0x980 [] __alloc_pages+0x4f/0x2f0 [] __get_free_pages+0x2d/0x40 [] copy_mount_options+0x47/0x130 [] sys_mount+0x9d/0xe0 [] syscall_call+0x7/0xb Code: e4 89 e0 8b 4b 04 25 00 e0 ff ff 8b 10 8b 41 64 3b 82 58 04 00 00 0f 85 a1 00 00 00 8b 41 14 3b 46 14 0f 84 ac 00 00 00 8b 46 10 <8b> 40 10 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 55 8b 44 EIP: [] do_add_mount+0x66/0x130 SS:ESP 0068:f3843df4 CMAN: Waiting to join or form a Linux-cluster CMAN: sending membership request (message repeated 30 times) CMAN: Been in JOINWAIT for too long - giving up CMAN: sendmsg failed: -22 CMAN: node blade14 rejoining CMAN: too many transition restarts - will die CMAN: we are leaving the cluster. Inconsistent cluster view From pcaulfie at redhat.com Fri Sep 22 07:36:16 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 22 Sep 2006 08:36:16 +0100 Subject: [Linux-cluster] kernel oops on mount and sendmsg failed: -22 In-Reply-To: <45131836.7010802@ncl.cs.columbia.edu> References: <45131836.7010802@ncl.cs.columbia.edu> Message-ID: <45139270.3040401@redhat.com> Dan B. Phung wrote: > I have a two node cluster, one node (node A) runs linux kernel 2.6.11.12 > while the other (node B) runs 2.6.18. both are running cman_tool > version 5.0.1. I first start up node A, then node B joins. node A can > mount the GFS file systems, but when node B tries that, it gets a kernel > oops, which is pasted at the end of the email (see "KERNEL OOPS output"). > So I reboot node B and try to rejoin, but it seems to not be able to > communicate with node A correctly, as if the cluster is in some stale > state (see "node B rejoin kernel messages"). Upon viewing node A, it > seemed to have received the join message, but it looks like it didn't > send an ack or something, and then node A simply quits...(see "node A > kernel messages"). > > I think the problem lies in my use of two different cluster software > versions (even though --version doesn't say so), but the newest -rSTABLE > doesn't compile with 2.6.11.12 anymore. What is the recommended > solution for a cluster that must run different kernel versions? 
> > tia, > dan > > --- > > > > BUG: unable to handle kernel NULL pointer dereference at virtual > address 0000001c > printing eip: > c01825e6 > *pde = 00000000 > Oops: 0000 [#1] > PREEMPT SMP > Modules linked in: lock_dlm dlm gfs lock_harness cman qla2xxx > firmware_class scsi_transport_fc ppdev parport_pc lp parport sg sd_mod > scsi_mod ide_generic ide_cd cdrom evdev i2c_piix4 psmouse i2c_core > serio_raw sworks_agp agpgart rtc pcspkr ext3 jbd mbcache dm_mirror > dm_snapshot dm_mod ide_disk serverworks generic ohci_hcd ide_core > usbcore tg3 thermal processor fan unix > CPU: 2 > EIP: 0060:[] Tainted: GF VLI > EFLAGS: 00010293 (2.6.18 #1) > EIP is at do_add_mount+0x66/0x130 > eax: 0000000c ebx: f3843f24 ecx: c24fbac0 edx: f443f550 > esi: df907200 edi: 00000000 ebp: 00000000 esp: f3843df4 > ds: 007b es: 007b ss: 0068 > Process mount (pid: 14922, ti=f3842000 task=f443f550 task.ti=f3842000) > Stack: c0394388 00000000 00000000 f49a1000 f3843f24 00000000 c018321d > df907200 > f3843f24 00000000 00000000 f49a1000 df907200 c033a5c0 fffffffe > 00000000 > c0175080 c24fbac0 f3843ef8 00000050 f4998000 dfb98c40 c24fbac0 > df98330c > Call Trace: > [] do_mount+0x33d/0x760 > [] link_path_walk+0x80/0x100 > [] __handle_mm_fault+0x233/0x980 > [] __handle_mm_fault+0x4d6/0x980 > [] __alloc_pages+0x4f/0x2f0 > [] __get_free_pages+0x2d/0x40 > [] copy_mount_options+0x47/0x130 > [] sys_mount+0x9d/0xe0 > [] syscall_call+0x7/0xb > Code: e4 89 e0 8b 4b 04 25 00 e0 ff ff 8b 10 8b 41 64 3b 82 58 04 00 > 00 0f 85 a1 00 00 00 8b 41 14 3b 46 14 0f 84 ac 00 00 00 8b 46 10 <8b> > 40 10 0f b7 40 28 25 00 f0 00 00 3d 00 a0 00 00 74 55 8b 44 > EIP: [] do_add_mount+0x66/0x130 SS:ESP 0068:f3843df4 > > > CMAN: Waiting to join or form a Linux-cluster > CMAN: sending membership request (message repeated 30 times) > CMAN: Been in JOINWAIT for too long - giving up > CMAN: sendmsg failed: -22 > > > CMAN: node blade14 rejoining > CMAN: too many transition restarts - will die > CMAN: we are leaving the cluster. Inconsistent cluster view That's a known bug. Upgrade the kernel component of cman. -- patrick From carlopmart at gmail.com Fri Sep 22 08:07:17 2006 From: carlopmart at gmail.com (carlopmart) Date: Fri, 22 Sep 2006 10:07:17 +0200 Subject: [Linux-cluster] Things that i don't understand about cluster suite In-Reply-To: <45128189.3060108@leopard.us.udel.edu> References: <4510FA68.2000403@gmail.com> <1158776775.7388.29.camel@rei.boston.devel.redhat.com> <451273B5.3090804@gmail.com> <45128189.3060108@leopard.us.udel.edu> Message-ID: <451399B5.20409@gmail.com> Greg Forte wrote: >> Lon, I use original's postfix script and returns this if postfix is >> up: "master (pid 957) is running..." when postfix isn't up, script >> returns: "master is stopped". Do I need to change this message to "0" >> for status check works ok?? > > That's just a message that's printed. return status is the value given > in a statement of the form 'return X', or 0 if no such statement is > explicitly reached. All executables return a status value to the shell, > where 0 is taken to mean "OK", and non-zero means "something bad happened". > > The postfix script appears to return the correct values in each case. > My guess would be that it's cluster configuration problem, but I didn't > see anything about postfix in the conf that you pasted ... Lon, I have attached postfix script > >> Then, I can't startup only one node, when both are stopped, right?? > > No, you definitely can do this, if the cluster is configured correctly. 
> The problem may be in your fencing method - the first thing the booted > node will do when cman starts is to try to contact the other node. When > it times out, it'll try to fence the other node and won't continue until > it does. If the fence process fails, it'll hang there, which I'm > guessing is what you're seeing. So the problem is most likely that > fencing is failing, either due to misconfiguration or because the other > node is powered off and so its iLo agent isn't responding. Since iLo is > supposed to be able to power-up a switched-off server, my guess is > there's a problem with your fencing configuration - did you fix it so > that you have a separate fencedevice entry for each node? I have changed iLO fence for gnbd fence. But can not boot only one node. > > -g > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- CL Martinez carlopmart {at} gmail {d0t} com -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: imss-mta URL: From sandra-llistes at fib.upc.edu Fri Sep 22 10:23:18 2006 From: sandra-llistes at fib.upc.edu (sandra-llistes) Date: Fri, 22 Sep 2006 12:23:18 +0200 Subject: [Linux-cluster] GFS and samba problem Message-ID: <4513B996.8050804@fib.upc.edu> Hello, We have two Fedora 5 Servers clustered with GFS. We installed samba and exported the same shares in both of them. All went fine at first, with people accessing to theirs own files and so, but for some programs (minitab, matlab, ...) people need to access the same file at once. Then samba begins to fail and clients hang. In order to fix samba is necessary to restart the service. We've tried to put the shares in a filesystem without GFS and all goes well, people can access the same file without problems simultaneously. Is a weird behaviour because the shares are exported from the two servers, but we really only access files simoultaneuosly using the first server (the second is used for other linux clients than don't access the same shares), the other server exports the shares too but isn't used by that clients. I don't know how to debug this problem to see what is happening. It seems something related to GFS and Samba. I have seen mails of people with samba+GFS problems, but we aren't using the same configuration, and the GFS rpm are updated: GFS-6.1.5-0.FC5.1 GFS-kernel-2.6.15.1-5.FC5.32 Any help will be greatly apreciated. Thanks, Sandra From peter.huesser at psi.ch Fri Sep 22 15:42:55 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 22 Sep 2006 17:42:55 +0200 Subject: [Linux-cluster] Cannot restart service after "failed" state Message-ID: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch> Hello I have defined a web-services (for testing it contains an IP and two script resources). I sometimes happens that I produce failed state of the cluster. After this I am not able to restart the service anymore. Even after a reboot of all (two) clustermembers it is not possible. Do I have to remove by hand some kind of "lock" file. Greetings Pedro -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jstoner at opsource.net Fri Sep 22 16:27:07 2006 From: jstoner at opsource.net (Jeff Stoner) Date: Fri, 22 Sep 2006 17:27:07 +0100 Subject: [Linux-cluster] Cannot restart service after "failed" state Message-ID: <38A48FA2F0103444906AD22E14F1B5A3042C6E19@mailxchg01.corp.opsource.net> Check for errors in the logs files for the service itself (you didn't say exactly what it is) and in /var/log/message for Cluster-related messages for more specific information about why it won't start. We can't help very much without knowing what is wrong. --Jeff SME - UNIX OpSource Inc. PGP Key ID 0x6CB364CA ________________________________ I have defined a web-services (for testing it contains an IP and two script resources). I sometimes happens that I produce failed state of the cluster. After this I am not able to restart the service anymore. Even after a reboot of all (two) clustermembers it is not possible. Do I have to remove by hand some kind of "lock" file. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Fri Sep 22 20:28:43 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Fri, 22 Sep 2006 13:28:43 -0700 (PDT) Subject: [Linux-cluster] Disk tie breaker -how does it work? Message-ID: <20060922202843.42656.qmail@web34205.mail.mud.yahoo.com> Does anyone know much about the details of how a disk tiebreaker works in a two member node? Or any docs to point to? --------------------------------- Yahoo! Messenger with Voice. Make PC-to-Phone Calls to the US (and 30+ countries) for 2?/min or less. -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Fri Sep 22 20:43:59 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Fri, 22 Sep 2006 22:43:59 +0200 Subject: [Linux-cluster] Cannot restart service after "failed" state In-Reply-To: <38A48FA2F0103444906AD22E14F1B5A3042C6E19@mailxchg01.corp.opsource.net> Message-ID: <8E2924888511274B95014C2DD906E58AD1A316@MAILBOX0A.psi.ch> You were right with the check of the error log. I should have read it more carefully before writing to the group. The problem was with one of the scripts. What I was curious was that after a restart of both of the serves I had the same problem again. But I have to reformulate my problem and want to start a new thread. Thanks Pedro ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Stoner Sent: Freitag, 22. September 2006 18:27 To: linux clustering Subject: RE: [Linux-cluster] Cannot restart service after "failed" state Check for errors in the logs files for the service itself (you didn't say exactly what it is) and in /var/log/message for Cluster-related messages for more specific information about why it won't start. We can't help very much without knowing what is wrong. --Jeff SME - UNIX OpSource Inc. PGP Key ID 0x6CB364CA ________________________________ I have defined a web-services (for testing it contains an IP and two script resources). I sometimes happens that I produce failed state of the cluster. After this I am not able to restart the service anymore. Even after a reboot of all (two) clustermembers it is not possible. Do I have to remove by hand some kind of "lock" file. -------------- next part -------------- An HTML attachment was scrubbed... URL: From celso at webbertek.com.br Sat Sep 23 03:08:14 2006 From: celso at webbertek.com.br (Celso K. 
Webber) Date: Sat, 23 Sep 2006 00:08:14 -0300 Subject: [Linux-cluster] LVM and Multipath with EMC PowerPath (Was: CLVMD - Do I need it) In-Reply-To: <1158082871.988.4.camel@hydrogen.msp.redhat.com> References: <7F6B06837A5DBD49AC6E1650EFF5490601223028@auk52177.ukr.astrium.corp> <1158069859.3610.437.camel@rei.boston.devel.redhat.com> <1158082871.988.4.camel@hydrogen.msp.redhat.com> Message-ID: <4514A51E.1080507@webbertek.com.br> Hello all, After reading a thread on this list (CLVMD - Do I need it), I started playing around with CLVM, just to make sure two problems I had in the past were solved: 1) LVM normally cannot be used on shared disks, because the first server that "sees" the PVs will initialize them, and the other server will see the LVM objects as inactive. This is solved in LVM2 when used together with CLVM, right? I'm not pretty sure about the mecanics of CLVM, but I imagine it shares device UUIDs between the machines. So far, so good. 2) The other problem is not directly related to CLVM, but I found no solution for it (yet). In my setup, I have multiple paths to the same devices in the shared storage (either in a SAN or DAS). Under the EMC solution, we employ PowerPath to solve the multiple devices issue for each LUN. It works quite well. But LVM is not aware of PowerPath's multiple path aggregation, so when it scans the PVs on the LUN's partitions, it "finds" duplicates for the PVs, like this: [root at csumccaixa12 network-scripts]# pvscan Found duplicate PV 7v9XUzPHIRqe6E0fA6hgCR3ybeaJoiWm: using /dev/sdc1 not /dev/emcpowerb1 Found duplicate PV 3eKnMIm00kg6DXn4MW1UX9QCFh96ykwG: using /dev/emcpowerc1 not /dev/sdb1 Found duplicate PV 3T00PR5Ky1XrBesYHRtyowoBQLWDO1kd: using /dev/sdd1 not /dev/emcpowera1 Found duplicate PV 3eKnMIm00kg6DXn4MW1UX9QCFh96ykwG: using /dev/sde1 not /dev/emcpowerc1 Found duplicate PV 7v9XUzPHIRqe6E0fA6hgCR3ybeaJoiWm: using /dev/sdf1 not /dev/sdc1 Found duplicate PV 3T00PR5Ky1XrBesYHRtyowoBQLWDO1kd: using /dev/sdg1 not /dev/sdd1 PV /dev/sda3 VG vg0 lvm2 [59.81 GB / 37.75 GB free] PV /dev/sdg1 lvm2 [127.43 GB] PV /dev/sde1 lvm2 [127.43 GB] PV /dev/sdf1 lvm2 [127.43 GB] Total: 4 [442.10 GB] / in use: 1 [59.81 GB] / in no VG: 3 [382.29 GB] You can see above that the /dev/emcpowerX devices were declined in favor of the real Linux devices. "vg0" is a VG in the internal disks (/dev/sda). The problem I see here is that whenever the specific device that LVM2 chose goes down because of a link failure, LVM will not automatically failover to another device, will it? In my tests it didn't. Another matter is that using the /dev/emcpowerX devices I have also load balancing, so even if LVM2 did failover to the other paths (the other devices), I would loose the load balancing feature I can achieve with PowerPath. Question 1: did anyone solve this problem? Does device-mapper-multipath solve this problem? Question 2: is there a way to "force" which devices LVM should employ when scanning the PVs over the disks Linux recognize? Thank you all for any hints on this. Regards, Celso. -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. From isplist at logicore.net Sat Sep 23 03:14:25 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Fri, 22 Sep 2006 22:14:25 -0500 Subject: [Linux-cluster] Can't mount multiple GFS volumes? 
In-Reply-To: <20060915135516.GA17451@redhat.com> Message-ID: <2006922221425.220068@leena> Sorry about the delay. I trashed my whole setup since I bought new hardware. Once I have it all up and running, I'll see if I'm still having the same problems and post again. Thanks much Dave. Mike > Could you send the output of 'cman_tool services' from all nodes before > and after you try to mount? Thanks > Dave From ben.yarwood at juno.co.uk Sat Sep 23 11:09:07 2006 From: ben.yarwood at juno.co.uk (Ben Yarwood) Date: Sat, 23 Sep 2006 12:09:07 +0100 Subject: [Linux-cluster] LVM and Multipath with EMC PowerPath (Was: CLVMD -Do I need it) In-Reply-To: <4514A51E.1080507@webbertek.com.br> Message-ID: <007f01c6df00$b010e890$3964a8c0@WS076> Good document on emc powerlink site about setting up gfs6.1 and powerpath. https://powerlink.emc.com/nsepn/webapps/btg548664833igtcuup4826/km/live1/en_ US/Offering_Technical/Technical_Documentation/300-003-820_a01_elccnt_0.pdf?m tcs=ZXZlbnRUeXBlPUttQ2xpY2tTZWFyY2hSZXN1bHRzRXZlbnQsZG9jdW1lbnRJZD0wOTAxNDA2 NjgwMTg3YjFhLGRhdGFTb3VyY2U9RENUTV9lbl9VU18w Page 18 I believe has the filtering solution you are after for point 2. Ben > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Celso K. Webber > Sent: 23 September 2006 04:08 > To: linux clustering > Subject: [Linux-cluster] LVM and Multipath with EMC PowerPath > (Was: CLVMD -Do I need it) > > Hello all, > > After reading a thread on this list (CLVMD - Do I need it), I > started playing around with CLVM, just to make sure two > problems I had in the past were solved: > > 1) LVM normally cannot be used on shared disks, because the > first server that "sees" the PVs will initialize them, and > the other server will see the LVM objects as inactive. This > is solved in LVM2 when used together with CLVM, right? I'm > not pretty sure about the mecanics of CLVM, but I imagine it > shares device UUIDs between the machines. So far, so good. > > 2) The other problem is not directly related to CLVM, but I > found no solution for it (yet). In my setup, I have multiple > paths to the same devices in the shared storage (either in a > SAN or DAS). Under the EMC solution, we employ PowerPath to > solve the multiple devices issue for each LUN. It works quite > well. But LVM is not aware of PowerPath's multiple path > aggregation, so when it scans the PVs on the LUN's > partitions, it "finds" duplicates for the PVs, like this: > [root at csumccaixa12 network-scripts]# pvscan > Found duplicate PV 7v9XUzPHIRqe6E0fA6hgCR3ybeaJoiWm: using > /dev/sdc1 not /dev/emcpowerb1 > Found duplicate PV 3eKnMIm00kg6DXn4MW1UX9QCFh96ykwG: using > /dev/emcpowerc1 not /dev/sdb1 > Found duplicate PV 3T00PR5Ky1XrBesYHRtyowoBQLWDO1kd: using > /dev/sdd1 not /dev/emcpowera1 > Found duplicate PV 3eKnMIm00kg6DXn4MW1UX9QCFh96ykwG: using > /dev/sde1 not /dev/emcpowerc1 > Found duplicate PV 7v9XUzPHIRqe6E0fA6hgCR3ybeaJoiWm: using > /dev/sdf1 not /dev/sdc1 > Found duplicate PV 3T00PR5Ky1XrBesYHRtyowoBQLWDO1kd: using > /dev/sdg1 not /dev/sdd1 > PV /dev/sda3 VG vg0 lvm2 [59.81 GB / 37.75 GB free] > PV /dev/sdg1 lvm2 [127.43 GB] > PV /dev/sde1 lvm2 [127.43 GB] > PV /dev/sdf1 lvm2 [127.43 GB] > Total: 4 [442.10 GB] / in use: 1 [59.81 GB] / in no VG: 3 > [382.29 GB] > > You can see above that the /dev/emcpowerX devices were > declined in favor of the real Linux devices. "vg0" is a VG in > the internal disks (/dev/sda). 
> > The problem I see here is that whenever the specific device > that LVM2 chose goes down because of a link failure, LVM will > not automatically failover to another device, will it? In my > tests it didn't. > > Another matter is that using the /dev/emcpowerX devices I > have also load balancing, so even if LVM2 did failover to the > other paths (the other devices), I would loose the load > balancing feature I can achieve with PowerPath. > > > Question 1: did anyone solve this problem? Does > device-mapper-multipath solve this problem? > > Question 2: is there a way to "force" which devices LVM > should employ when scanning the PVs over the disks Linux recognize? > > > Thank you all for any hints on this. > > Regards, > > Celso. > -- > *Celso Kopp Webber* > > celso at webbertek.com.br > > *Webbertek - Opensource Knowledge* > (41) 8813-1919 > (41) 3284-3035 > > > -- > Esta mensagem foi verificada pelo sistema de antiv?rus e > acredita-se estar livre de perigo. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > From bosse at klykken.com Sat Sep 23 16:00:44 2006 From: bosse at klykken.com (Bosse Klykken) Date: Sat, 23 Sep 2006 18:00:44 +0200 Subject: [Linux-cluster] Cannot restart service after "failed" state In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch> Message-ID: <45155A2C.8030806@klykken.com> Huesser Peter wrote: > I have defined a web-services (for testing it contains an IP and two > script resources). I sometimes happens that I produce failed state of > the cluster. After this I am not able to restart the service anymore. > Even after a reboot of all (two) clustermembers it is not possible. Do I > have to remove by hand some kind of ?lock? file. If the problem is that you're unable to restart the service when it is in "failed" modus, you could try this: clusvcadm -d service # disables the failed service clusvcadm -e service # enables/starts the now disabled service .../Bosse From peter.huesser at psi.ch Sat Sep 23 19:26:33 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Sat, 23 Sep 2006 21:26:33 +0200 Subject: [Linux-cluster] Cannot restart service after "failed" state In-Reply-To: <45155A2C.8030806@klykken.com> Message-ID: <8E2924888511274B95014C2DD906E58AD1A318@MAILBOX0A.psi.ch> > > If the problem is that you're unable to restart the service when it is > in "failed" modus, you could try this: > > clusvcadm -d service # disables the failed service > clusvcadm -e service # enables/starts the now disabled service > Thanks for the hint but the problem was that I did (and still do) not understand the concept of the different resource types within a service. There you can choose between "Add a Shared Resource to this service", "Attach a new Private Resource to the Selection" and "Attach a Shared Resource to the selection". I played around a little bit and everything works now as expected. But I have to search the web first before asking some more specific questions. Greetings Pedro From peter.huesser at psi.ch Sat Sep 23 19:53:24 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Sat, 23 Sep 2006 21:53:24 +0200 Subject: [Linux-cluster] When is a failed node fenced Message-ID: <8E2924888511274B95014C2DD906E58AD1A319@MAILBOX0A.psi.ch> Hello Maybe I do not understand the concept of fencing in the right way. I created a two node cluster with a webservice running on it. The failover works fine now. 
I also configured a fencing device (ipmilan). Fencing by hand works also fine (using a command like: "fence_ipmilan -a server_con -l loginname -p my_passwort -o off") but if I initiate a failover on one of the nodes I expect the services to switch to the other node (which works) and to let this node shutdown the failed node. The second does not happen. Any idea? Thanks Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From orkcu at yahoo.com Sat Sep 23 21:16:13 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Sat, 23 Sep 2006 14:16:13 -0700 (PDT) Subject: [Linux-cluster] LVM and Multipath with EMC PowerPath (Was: CLVMD - Do I need it) In-Reply-To: <4514A51E.1080507@webbertek.com.br> Message-ID: <20060923211613.93874.qmail@web50601.mail.yahoo.com> --- "Celso K. Webber" wrote: > Hello all, > > After reading a thread on this list (CLVMD - Do I > need it), I started > playing around with CLVM, just to make sure two > problems I had in the > past were solved: > > 1) LVM normally cannot be used on shared disks, [...] > 2) The other problem is not directly related to > CLVM, but I found no > solution for it (yet). In my setup, I have multiple > paths to the same > devices in the shared storage (either in a SAN or > DAS). Under the EMC > solution, we employ PowerPath to solve the multiple > devices issue for > each LUN. It works quite well. But LVM is not aware > of PowerPath's > multiple path aggregation, so when it scans the PVs > on the LUN's > partitions, it "finds" duplicates for the PVs, like > this: tyhe solution for this is to "filter" the "under-powerpath devices" :-) I mean, to filter to not scan the devices exported by the SAN or DAS, and just use the powerpath devices for the LVM check the file /etc/lvm/lvm.conf > Another matter is that using the /dev/emcpowerX > devices I have also load > balancing, so even if LVM2 did failover to the other > paths (the other > devices), I would loose the load balancing feature I > can achieve with > PowerPath. if you put LVM over powerpath, you are layering the environment, do you? path load balancing and failover are cover by powerpath, and LVM do its own business :-) > Question 2: is there a way to "force" which devices > LVM should employ > when scanning the PVs over the disks Linux > recognize? /etc/lvm/lvm.conf # A filter that tells LVM2 to only use a restricted set of devices. # The filter consists of an array of regular expressions. These # expressions can be delimited by a character of your choice, and # prefixed with either an 'a' (for accept) or 'r' (for reject). # The first expression found to match a device name determines if # the device will be accepted or rejected (ignored). Devices that # don't match any patterns are accepted. cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From peter.huesser at psi.ch Sat Sep 23 22:19:59 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Sun, 24 Sep 2006 00:19:59 +0200 Subject: [Linux-cluster] Problem with loadbalancer Message-ID: <8E2924888511274B95014C2DD906E58AD1A31C@MAILBOX0A.psi.ch> Hello I just started learning how to do loadbalancing using LVS and piranha. As an example I wanted to have a loadbalancer running in front of one webserver (testing). I want to use direct routing. The IP of the loadbalancer is e.g. 
236.25.1.1, that of the webserver is 236.25.1.2 (web01), and the VIP is 236.25.1.3. Here is the lvs.cf file I created with the aid of piranha_gui:

serial_no = 67
primary = 236.25.1.1
service = lvs
backup_active = 0
backup = 0.0.0.0
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 18
network = direct
debug_level = NONE
monitor_links = 0
virtual webserver {
     active = 1
     address = 236.25.1.3 eth0:1
     vip_nmask = 255.255.255.0
     port = 80
     send = "GET / HTTP/1.0\r\n\r\n"
     expect = "HTTP"
     use_regex = 0
     load_monitor = ruptime
     scheduler = wlc
     protocol = tcp
     timeout = 6
     reentry = 15
     quiesce_server = 1
     server web01 {
         address = 236.25.1.2
         active = 1
         weight = 1
     }
}

- The webservice on web01 is running correctly.
- I can ping the VIP 236.25.1.3.
- The output of "/sbin/ip addr" looks fine: the interface eth0 has the right secondary IP.
- If I run "tcpdump host web01" I see that there is some communication (e.g. a "GET / HTTP/1.0") between the loadbalancer and web01.
- The output of "/sbin/sysctl net.ipv4.ip_forward" is "net.ipv4.ip_forward = 1".

But if I try to connect to the VIP on port 80 I get a "connection refused". Something is wrong, but what? In /var/log/messages I have a lot of the following lines:

Sep 24 00:09:35 loadbalancer nanny[3925]: READ to 236.25.1.2:80 timed out
Sep 24 00:09:47 loadbalancer nanny[3925]: READ to 236.25.1.2:80 timed out

Any idea where I should look to solve the problem? By the way: how can I increase the debug level? As far as I have seen, it is not possible with the GUI.

Thanks in advance

Pedro

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rodgersr at yahoo.com Sun Sep 24 00:20:57 2006
From: rodgersr at yahoo.com (Rick Rodgers)
Date: Sat, 23 Sep 2006 17:20:57 -0700 (PDT)
Subject: [Linux-cluster] Workings of Tiebreaker IP (RHCS)
Message-ID: <20060924002057.43217.qmail@web34214.mail.mud.yahoo.com>

I pulled a message from 2005 about tiebreakers. I have some questions, and it does not seem to agree with what I see clumanager do.

>> Hello,
>>
>> To completely understand what the role of a tiebreaker IP within a two
>> or four node RHCS cluster is, I've searched redhat and Google. I can't
>> however find anything describing the precise workings of the
>> tiebreaker IP. I would really like to know exactly what happens when
>> the tiebreaker is used and how (maybe even some kind of flow diagram).
>>
>> Can anyone here maybe explain that to me, or point me in the direction
>> of more specific information regarding the tiebreaker?

>The tiebreaker IP address is used as an additional vote in the event
>that half the nodes become unreachable or dead in a 2 or 4 node cluster
>on RHCS.

>The IP address must reside on the same network as is used for cluster
>communication. To be a little more specific, if your cluster is using
>eth0 for communication, your IP address used for a tiebreaker must be
>reachable only via eth0 (otherwise, you will end up with a split brain).

>When enabled, the nodes ping the given IP address at regular intervals.
>When the IP address is not reachable, the tiebreaker is considered
>"dead". When it is reachable, it is considered "alive".

>It acts as an additional vote (like an extra cluster member), except for
>one key difference: Unless the default configuration is overridden, the

How does this work? Does the node trying to become the active node access the tiebreaker and put a lock on it? How does it reserve it? Just pinging it would not prevent the other node from doing the same.
>IP tiebreaker may not be used to *form* a quorum where one did not >exist >previously. >So, if one node of a two node cluster is online, it will never become >quorate unless the other node comes online (or administrator override, >see man pages for "cluforce" and "cludb"). >So, in a 2 node cluster, if one node fails and the other node is >online >(and the tiebreaker is still "alive" according to that node), the >remaining node considers itself quorate and "shoots" (aka STONITHs, >aka >fences) the dead node and takes over services. >If a network partition occurs such that both nodes see the tiebreaker >but not each other, the first one to fence the other will naturally >win. >Ok, moving on... >The disk tiebreaker works in a similar way, except that it lets the >cluster limp in along in a safe, semi-split-brain (split brain) in a >network outage. What I mean is that because there's state information >written to/read from the shared raw partitions, the nodes can actually >tell via other means whether or not the other node is "alive" or not >as >opposed to relying solely on the network traffic. >Both nodes update state information on the shared partitions. When >one >node detects that the other node has not updated its information for a >period of time, that node is "down" according to the disk subsystem. >If >this coincides with a "down" status from the membership daemon, the >node >is fenced and services are failed over. If the node never goes down >(and keeps updating its information on the shared partitions), then >the I do not use a IP tiebreaker. I have a two nodes system. When the active node shows it is down via memebership but up via disk then Clumanager determines it is in an ?uncertain state? and shoots it. >node is never fenced and services never fail over. -- Lon --------------------------------- Talk is cheap. Use Yahoo! Messenger to make PC-to-Phone calls. Great rates starting at 1?/min. -------------- next part -------------- An HTML attachment was scrubbed... URL: From linuxr at gmail.com Mon Sep 25 04:15:33 2006 From: linuxr at gmail.com (Marc ) Date: Mon, 25 Sep 2006 00:15:33 -0400 Subject: [Linux-cluster] starting point Message-ID: Hi, I am interested in setting up a linux cluster. I have the following hardware: Dell 2950 server x2, each has PERC4 RAID controller and the Dell remote access client (DRAC) 8 GB RAM (2) x 74 GB for the OS (10) x 300 GB for storage Core 2 duo processors (I think) LPE 11000 HBA's x 2 per machine Switch: unknown Fiber connections OS: Red Hat Enterprise Server (AS) 4 EMT64 - latest update (3 I think) SAN: unknown (volumes/LUN's provided for test) enterprise app: Perforce The goal is to cluster the application so that there is no possibility of downtime. This is to be a dev environment that will have quite a lot of users. The organization has already used Perforce and likes that, just wants to migrate to a Linux/SAN/GFS environment. I have worked with clustering and also a lot with linux, but not together, so that is my challenge. I am wondering how things like LVM and NFS come into play with the GFS once it is all up and running. It is a given that SAMBA will probably have to be running on there at some point, not sure how that plays into the mix. Also I am worried about block size. Perforce is a CVS type database that will store code as flat (tiny) text files, even only storing updates. Great for text storage. 
However, this is a multimedia type company, and much of the data may be full multimedia files (jpeg, video, game stuff, music, you name it). Therefore if a developer writes 10 edits to a C++ application in text form, it only stores the changes and not even the whole text file each time. However (how's this for contrast?) ----if a developer edits a video clip and stores it ten times, perforce saves it ten separate times, each at least as big as the first. So although I hear that I should avoid the 64k block size, I don't know what to go to, realistically. If anyone has specifically grappled with this I would love to know more about how/why you decided whatever you did for your situation. Does anyone know of a good starting point, best practices, HOWTO's, etc.? I am reviewing Karl Knopper's book 'Enterprise Linux Cluster'. I have to get this cranked out NOW. I really need some sort of guidelines or outline since I need to set this up as a project. Any information is GREATLY appreciated. Thanks Marc -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Mon Sep 25 06:52:51 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Mon, 25 Sep 2006 08:52:51 +0200 Subject: [Linux-cluster] starting point In-Reply-To: Message-ID: <8E2924888511274B95014C2DD906E58AD1A349@MAILBOX0A.psi.ch> Hi I also read the bock of Knopper which I find a good for understanding. If you want more concrete details about the redhat cluster suite read the RedHat documentation: http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ (you can also get it in pdf format but I did not find the link anymore. If you want it I can send it to you). Some documentation in the same style can be found on Wikipedia: http://gfs.wikidev.net/Installation. Maybe the FAQ can help you too: http://sources.redhat.com/cluster/faq.html. Pedro Does anyone know of a good starting point, best practices, HOWTO's, etc.? I am reviewing Karl Knopper's book 'Enterprise Linux Cluster'. I have to get this cranked out NOW. I really need some sort of guidelines or outline since I need to set this up as a project. Any information is GREATLY appreciated. -------------- next part -------------- An HTML attachment was scrubbed... URL: From redhat at watson-wilson.ca Mon Sep 25 12:30:50 2006 From: redhat at watson-wilson.ca (Neil Watson) Date: Mon, 25 Sep 2006 08:30:50 -0400 Subject: [Linux-cluster] starting point In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A349@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD1A349@MAILBOX0A.psi.ch> Message-ID: <20060925123050.GB31534@ettin> On Mon, Sep 25, 2006 at 08:52:51AM +0200, Huesser Peter wrote: > Does anyone know of a good starting point, best practices, HOWTO's, etc.? http://technocrat.watson-wilson.ca/db2-cluster.pdf -- Neil Watson | Gentoo Linux System Administrator | Uptime 7 days http://watson-wilson.ca | 2.6.17.6 AMD Athlon(tm) MP 2000+ x 2 From jos at xos.nl Mon Sep 25 14:11:42 2006 From: jos at xos.nl (Jos Vos) Date: Mon, 25 Sep 2006 16:11:42 +0200 Subject: [Linux-cluster] IPMI fencing on an IBM x366 Message-ID: <200609251411.k8PEBg406654@xos037.xos.nl> Hi, Is it possible to use the built-in IPMI support of an IBM x366 server with RHEL CS? I think it is not compatible with RSA II, and I also tried IPMI Lan, but none of them seems to work. Any suggestions? 
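If the x366's BMC answers standard IPMI over LAN, one route is to point fence_ipmilan at it directly in cluster.conf. A sketch only - the node name, BMC address and credentials below are placeholders, and the BMC's LAN channel and user have to be enabled on the machine first:

        <fencedevices>
                <fencedevice agent="fence_ipmilan" name="ipmi-x366-1"
                             ipaddr="192.168.1.11" login="admin" passwd="secret"/>
        </fencedevices>

        <clusternode name="x366-1" votes="1">
                <fence>
                        <method name="1">
                                <device name="ipmi-x366-1"/>
                        </method>
                </fence>
        </clusternode>

Running the agent by hand first (fence_ipmilan -a 192.168.1.11 -l admin -p secret -o off, as shown earlier in the thread) helps separate a BMC problem from a cluster.conf problem.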
Thanks, -- -- Jos Vos -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 -- Amsterdam, The Netherlands | Fax: +31 20 6948204 From rpeterso at redhat.com Mon Sep 25 14:13:39 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 25 Sep 2006 09:13:39 -0500 Subject: [Linux-cluster] starting point In-Reply-To: References: Message-ID: <4517E413.9040702@redhat.com> Marc wrote: > Does anyone know of a good starting point, best practices, HOWTO's, > etc.? I am reviewing Karl Knopper's book 'Enterprise Linux Cluster'. > I have to get this cranked out NOW. I really need some sort of > guidelines or outline since I need to set this up as a project. Any > information is GREATLY appreciated. > > Thanks > Marc Hi Marc, I recommend the "Unofficial" NFS/GFS cookbook (but I'm biased): http://sources.redhat.com/cluster/doc/nfscookbook.pdf Regards, Bob Peterson Red Hat Cluster Suite From isplist at logicore.net Mon Sep 25 15:17:40 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 25 Sep 2006 10:17:40 -0500 Subject: [Linux-cluster] General FC Question Message-ID: <2006925101740.355825@leena> After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, etc settings. My initial device now comes up as sdc when it used to be sda. Is there some way of allowing GFS to see the storage in some way that it can know which device is which when I add a new one or remove one, etc? Hard loop ID's on the FC side I think but is there anything on the GFS side? Mike From peter.huesser at psi.ch Mon Sep 25 19:31:42 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Mon, 25 Sep 2006 21:31:42 +0200 Subject: [Linux-cluster] piranha Message-ID: <8E2924888511274B95014C2DD906E58AD1A3A6@MAILBOX0A.psi.ch> Hello I sent a similar question a few days ago and did not get any answer. Maybe the time (Saturday night) was unfavorable or the question was not that clear. So I try it once more: I want to run a loadbalancer in front of two webserver (using direct routing). But if I connect to port 80 of the loadbalancer I get a "connection refused". 1) Did anybody had a similar problem? 2) How can I increase the debuglevel? Thanks' in advance Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.huesser at psi.ch Mon Sep 25 20:53:41 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Mon, 25 Sep 2006 22:53:41 +0200 Subject: [Linux-cluster] piranha In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A3A6@MAILBOX0A.psi.ch> Message-ID: <8E2924888511274B95014C2DD906E58AD1A3A7@MAILBOX0A.psi.ch> By the way: I started the "pulse" daemon in the debug modus ("pulse -v -n") and got the following output: nanny: Opening TCP socket to remote service port 80... nanny: Connecting socket to remote address... nanny: DEBUG -- Posting CONNECT poll() nanny: Sending len=16, text="GET / HTTP/1.0 " nanny: DEBUG -- Posting READ poll() nanny: DEBUG -- READ poll() completed (1,1) nanny: Posting READ I/O; expecting 4 character(s)... nanny: DEBUG -- READ returned 4 nanny: READ expected len=4, text="HTTP" nanny: READ got len=4, text=HTTP nanny: avail: 1 active: 1: count: 13 pulse: DEBUG -- setting SEND_heartbeat timer pulse: DEBUG -- setting SEND_heartbeat timer pulse: DEBUG -- setting NEED_heartbeat timer pulse: DEBUG -- setting SEND_heartbeat timer nanny: Opening TCP socket to remote service port 80... ... For me this looks as if everything is ok. 
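One LVS detail worth bearing in mind here (an illustrative aside, using the addresses from above): the virtual service is installed in the kernel IPVS table rather than opened as a listening socket on the director, and with direct routing each real server must also carry the VIP itself and keep quiet about it in ARP. A quick way to check both sides:

        /sbin/ipvsadm -L -n
                # on the director: the 236.25.1.3:80 virtual service and its
                # real server web01 should be listed here (netstat will not
                # show port 80, since IPVS handles it inside the kernel)

        /sbin/ip addr add 236.25.1.3/32 dev lo label lo:0
        /sbin/sysctl -w net.ipv4.conf.all.arp_ignore=1
        /sbin/sysctl -w net.ipv4.conf.all.arp_announce=2
                # on web01 (sketch, 2.6 kernels): give the real server the VIP
                # on loopback and stop it answering ARP for that address

If the real servers lack the VIP, or nanny has marked them all unavailable, a connect to the VIP is typically refused or dropped even though the health checks themselves look fine.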
"nanny" sends from time to time a "GET / HTTP/1.0" request and the response ("HTTP" only first four letters) correspondence with what is expected. The problem is that pulse is not opening port 80 on the loadbalancer for reveiving http-request. A "netstat -anp" verifies this. Hello I sent a similar question a few days ago and did not get any answer. Maybe the time (Saturday night) was unfavorable or the question was not that clear. So I try it once more: I want to run a loadbalancer in front of two webserver (using direct routing). But if I connect to port 80 of the loadbalancer I get a "connection refused". 1) Did anybody had a similar problem? 2) How can I increase the debuglevel? Thanks' in advance Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From jpenalbae at gmail.com Mon Sep 25 23:11:32 2006 From: jpenalbae at gmail.com (=?ISO-8859-1?Q?Jaime_Pe=F1alba?=) Date: Tue, 26 Sep 2006 01:11:32 +0200 Subject: [Linux-cluster] General FC Question In-Reply-To: <2006925101740.355825@leena> References: <2006925101740.355825@leena> Message-ID: You can try multipath-tools or some other software that will group disks by WWN (World Wide Name). Regards, Jaime. 2006/9/25, isplist at logicore.net : > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. > > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From isplist at logicore.net Mon Sep 25 23:56:47 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 25 Sep 2006 18:56:47 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: Message-ID: <2006925185647.450867@leena> I'll take a look at that, wwn might work so long as all my storage devices supports it. Mike On Tue, 26 Sep 2006 01:11:32 +0200, Jaime Pe?alba wrote: > You can try multipath-tools or some other software that will group > > disks by WWN (World Wide Name). > > Regards, > Jaime. > > > 2006/9/25, isplist at logicore.net : >> After adding storage, my cluster comes up with different /dev/sda, >> /dev/sdb, >> etc settings. My initial device now comes up as sdc when it used to be >> sda. >> >> Is there some way of allowing GFS to see the storage in some way that it >> can >> know which device is which when I add a new one or remove one, etc? >> >> Hard loop ID's on the FC side I think but is there anything on the GFS >> side? >> >> Mike >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster From orkcu at yahoo.com Tue Sep 26 01:05:29 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Mon, 25 Sep 2006 18:05:29 -0700 (PDT) Subject: [Linux-cluster] General FC Question In-Reply-To: <2006925185647.450867@leena> Message-ID: <20060926010530.88046.qmail@web50601.mail.yahoo.com> --- "isplist at logicore.net" wrote: > I'll take a look at that, wwn might work so long as > all my storage devices > supports it. but the LUNs that you export from the same SAN will show the same wwn, do they? the SAN's wwn I guess maybe somre kind of LUNs ID can be mapped with udev so the same name apply to the same LUNs Id, I am just guessing. 
Of course, the other way is to use LVM, LVM can help because it have "IDs" that helps to gruop always the same PV no matter if you add new devices :-) cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From isplist at logicore.net Tue Sep 26 01:29:14 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Mon, 25 Sep 2006 20:29:14 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: <20060926010530.88046.qmail@web50601.mail.yahoo.com> Message-ID: <2006925202914.382809@leena> Problem is that SCSI devices are changing when I add/remove storage devices. For example, a device that was all set up as sda is now sdc upon reboot. Mike On Mon, 25 Sep 2006 18:05:29 -0700 (PDT), Pe?a wrote: > > > --- "isplist at logicore.net" > wrote: > >> I'll take a look at that, wwn might work so long as >> all my storage devices >> supports it. > but the LUNs that you export from the same SAN will > show the same wwn, do they? > the SAN's wwn I guess > > maybe somre kind of LUNs ID can be mapped with udev so > the same name apply to the same LUNs Id, I am just > guessing. Of course, the other way is to use LVM, LVM > can help because it have "IDs" that helps to gruop > always the same PV no matter if you add new devices > :-) > > cu > roger > > __________________________________________ > RedHat Certified Engineer ( RHCE ) > Cisco Certified Network Associate ( CCNA ) > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com From orkcu at yahoo.com Tue Sep 26 01:39:41 2006 From: orkcu at yahoo.com (Roger Peņa Escobio) Date: Mon, 25 Sep 2006 18:39:41 -0700 (PDT) Subject: [Linux-cluster] General FC Question In-Reply-To: <2006925202914.382809@leena> Message-ID: <20060926013941.31337.qmail@web50613.mail.yahoo.com> --- "isplist at logicore.net" wrote: > Problem is that SCSI devices are changing when I > add/remove storage devices. > For example, a device that was all set up as sda is > now sdc upon reboot. > yes, I understand you the same happen when you add LUNs exported from a SAN the OS will see more devices , and maybe something that was sda now is sde :-( with LVM you will not have any problem no matther how the name of the device change (sda -> sdb or sde) with GFS alone I guess you will maybe with the help of multipath or powerpath or any other "device mapper" tool you can map a device to something unique and invariable across addiction of new real scsi devices or LUNs to the system cu roger > Mike > > > On Mon, 25 Sep 2006 18:05:29 -0700 (PDT), Pe?a > wrote: > > > > > > --- "isplist at logicore.net" > > wrote: > > > >> I'll take a look at that, wwn might work so long > as > >> all my storage devices > >> supports it. > > but the LUNs that you export from the same SAN > will > > show the same wwn, do they? > > the SAN's wwn I guess > > > > maybe somre kind of LUNs ID can be mapped with > udev so > > the same name apply to the same LUNs Id, I am just > > guessing. 
Of course, the other way is to use LVM, > LVM > > can help because it have "IDs" that helps to gruop > > always the same PV no matter if you add new > devices > > :-) > > > > cu > > roger > > > > __________________________________________ > > RedHat Certified Engineer ( RHCE ) > > Cisco Certified Network Associate ( CCNA ) > > > > __________________________________________________ > > Do You Yahoo!? > > Tired of spam? Yahoo! Mail has the best spam > protection around > > http://mail.yahoo.com > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From kanderso at redhat.com Tue Sep 26 07:19:45 2006 From: kanderso at redhat.com (Kevin Anderson) Date: Tue, 26 Sep 2006 02:19:45 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: <2006925101740.355825@leena> References: <2006925101740.355825@leena> Message-ID: <1159255185.2997.3.camel@localhost.localdomain> On Mon, 2006-09-25 at 10:17 -0500, isplist at logicore.net wrote: > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. > > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > You should be using lvm2 and lvm2_cluster to handle this issue. LVM2 handles the name changing of the device on reboot. This often happens depending on the scan order for the devices. By using a volume manager, you make these changes transparent. You also have the advantage of not being tied to single devices, but able to concatenate or stripe your filesystem across multiple devices. You must also use lvm2-cluster to ensure any changes you make to the volume information is consistent across the cluster. > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > From cjk at techma.com Tue Sep 26 11:37:59 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Tue, 26 Sep 2006 07:37:59 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_General_FC_Question?= In-Reply-To: <2006925101740.355825@leena> Message-ID: You don't say which FC cards you are using but if it's qlogic, then the driver can be set to combine the devices. Basically whats happened is that your machine is picking up the alternate path to the device, which is a perfectly valid thing to do, it's just not what you need at this point. It may be as simple as your secondary controller actually has the lun you are trying to access. To work around yo might just be able to reset the seconday controller and force the primary to take over the LUN. This happens quite a bit depending on your setup. The Qlogic drivers, when setup for failover, will coelesce the devices into a single device by the WWID of the LUN. If that's not an option, then try the multipath tools support in RHEL4.2 or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be /dev/mpath/mpath0 etc, or whatever you set them to instead. 
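A minimal /etc/multipath.conf for that approach can be as small as one alias per LUN (a sketch - the WWID shown is a placeholder you would read from the LUN itself):

        multipaths {
                multipath {
                        wwid   3600508b4000116370000a00000c00000
                        alias  gfsdisk1
                }
        }

After "service multipathd start", "multipath -l" should list the aliased map, which appears as /dev/mapper/gfsdisk1 and keeps the same name no matter how the underlying sdX devices come up.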
Even without failover, the latest Qlogic drivers will make both paths active so that you never end up with a dead path upon boot up. Hope this helps. Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net Sent: Monday, September 25, 2006 11:18 AM To: linux-cluster Subject: [Linux-cluster] General FC Question After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, etc settings. My initial device now comes up as sdc when it used to be sda. Is there some way of allowing GFS to see the storage in some way that it can know which device is which when I add a new one or remove one, etc? Hard loop ID's on the FC side I think but is there anything on the GFS side? Mike -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From cjk at techma.com Tue Sep 26 11:44:00 2006 From: cjk at techma.com (Kovacs, Corey J.) Date: Tue, 26 Sep 2006 07:44:00 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_General_FC_Question?= In-Reply-To: Message-ID: One more thing, when using more than one path (basically anyu san setup) the device mappings will wrap around for every path, so for two paths... single hba, dual controller.. three disks will look like this... disk1=/dev/sda disk2=/dev/sdb disk3=/dev/sdc disk1=/dev/sdd disk2=/dev/sde disk3=/dev/sde and four like this.. disk1=/dev/sda disk2=/dev/sdb disk3=/dev/sdc disk4=/dev/sdd disk1=/dev/sde disk2=/dev/sde disk3=/dev/sdf disk4=/dev/sdg Or for dual hba, dual controller (4 paths) disk1=/dev/sda disk2=/dev/sdb disk3=/dev/sdc disk4=/dev/sdd disk1=/dev/sde disk2=/dev/sde disk3=/dev/sdf disk4=/dev/sdg disk1=/dev/sdh disk2=/dev/sdi disk3=/dev/sdj disk4=/dev/sdk disk1=/dev/sdl disk2=/dev/sdm disk3=/dev/sdn disk4=/dev/sdo etc... Cheers With the Qlogic drivers in failover mode, you'll get this.. disk1=/dev/sda disk2=/dev/sdb disk3=/dev/sdc disk4=/dev/sdd even though there are multiple paths Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. Sent: Tuesday, September 26, 2006 7:38 AM To: isplist at logicore.net; linux clustering Subject: RE: [Linux-cluster] General FC Question You don't say which FC cards you are using but if it's qlogic, then the driver can be set to combine the devices. Basically whats happened is that your machine is picking up the alternate path to the device, which is a perfectly valid thing to do, it's just not what you need at this point. It may be as simple as your secondary controller actually has the lun you are trying to access. To work around yo might just be able to reset the seconday controller and force the primary to take over the LUN. This happens quite a bit depending on your setup. The Qlogic drivers, when setup for failover, will coelesce the devices into a single device by the WWID of the LUN. If that's not an option, then try the multipath tools support in RHEL4.2 or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be /dev/mpath/mpath0 etc, or whatever you set them to instead. Even without failover, the latest Qlogic drivers will make both paths active so that you never end up with a dead path upon boot up. Hope this helps. 
Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net Sent: Monday, September 25, 2006 11:18 AM To: linux-cluster Subject: [Linux-cluster] General FC Question After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, etc settings. My initial device now comes up as sdc when it used to be sda. Is there some way of allowing GFS to see the storage in some way that it can know which device is which when I add a new one or remove one, etc? Hard loop ID's on the FC side I think but is there anything on the GFS side? Mike -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From isplist at logicore.net Tue Sep 26 13:26:34 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Tue, 26 Sep 2006 08:26:34 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: <1159255185.2997.3.camel@localhost.localdomain> Message-ID: <200692682634.118034@leena> > You should be using lvm2 and lvm2_cluster to handle this issue. LVM2 > handles the name changing of the device on reboot. This often happens > depending on the scan order for the devices. By using a volume manager, Yes, this is what I'm using. I'll reply to the next message about gear as well. Since I've just added the hardware, I guess I've not had enough time to notice how well LVM2 handles this. I just noticed it right off the bat after boot up. The problem came up when I was not able to mount my previously set up GFS FS. Mike From isplist at logicore.net Tue Sep 26 13:32:00 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Tue, 26 Sep 2006 08:32:00 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: Message-ID: <20069268320.047255@leena> > You don't say which FC cards you are using but if it's qlogic, then the > driver can be set to combine the devices. Basically whats happened is that >your machine is picking up the alternate path to the device, which is a >perfectly valid thing to do, it's just not what you need at this point. It >may be as simple as your The cards are all Qlogic, the switch is going to be an ED-5000 next week, the storage is mostly Xyratex chassis. >secondary controller actually has the lun you are trying to access. To work >around yo might just be able to reset the seconday controller and force the >primary to take over the LUN. This happens quite a bit depending on your >setup. I do have options on the Xyratex to combine the two controllers into one number but that's the loop ID. Not sure I've seen anything for LUN control yet. > Even without failover, the latest Qlogic drivers will make both paths active > so that you never end up with a dead path upon boot up. Path's seem fine, I mean, the storage does show up. It's just that my initial device has moved to another /dev/sdx number. In another message, am I to understand that once the physical device is set up and running as a GFS volume, that LVM2 will always see it no matter if the /dev number changes? I'll also have to look into your suggestions. Thank you very much. 
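For reference: LVM2 identifies every PV by the UUID stored in its on-disk label, not by the /dev/sdX name, so a volume group reassembles correctly however the names shuffle. A quick way to see it (illustrative only; "vg_san" is a placeholder):

        pvs -o pv_name,pv_uuid,vg_name    # the UUID column is what LVM matches on
        vgscan                            # rescan devices and rebuild the LVM cache after names change
        vgchange -ay vg_san               # activate the VG from whichever sdX devices its PVs now sit on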
Mike From isplist at logicore.net Tue Sep 26 13:35:36 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Tue, 26 Sep 2006 08:35:36 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: Message-ID: <200692683536.307927@leena> > One more thing, when using more than one path (basically anyu san setup) the > device mappings will wrap around for every path, so for two paths... single > hba, dual controller.. Right, and of course, this is what's happened. New disks have shown up and the old disk now shows up as a new device number. By the way, is there a way to clear everything in the LVM2 cache and setup info? It is now confused and seeing a lot of trashed information. Since the setup is new, I can start from scratch so wish to nuke all the old info. Mike > > three disks will look like this... > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk1=/dev/sdd > disk2=/dev/sde > disk3=/dev/sde > > and four like this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > > > Or for dual hba, dual controller (4 paths) > > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > disk1=/dev/sdh > disk2=/dev/sdi > disk3=/dev/sdj > disk4=/dev/sdk > disk1=/dev/sdl > disk2=/dev/sdm > disk3=/dev/sdn > disk4=/dev/sdo > > etc... > > Cheers > > With the Qlogic drivers in failover mode, you'll get this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > > even though there are multiple paths > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. > Sent: Tuesday, September 26, 2006 7:38 AM > To: isplist at logicore.net; linux clustering > Subject: RE: [Linux-cluster] General FC Question > > You don't say which FC cards you are using but if it's qlogic, then the > driver can be set to combine the devices. Basically whats happened is that > your machine is picking up the alternate path to the device, which is a > perfectly valid thing to do, it's just not what you need at this point. It > may be as simple as your > > secondary controller actually has the lun you are trying to access. To work > around yo might just be able to reset the seconday controller and force the > primary to take over the LUN. This happens quite a bit depending on your > setup. The Qlogic drivers, when setup for failover, will coelesce the > devices > into a single device by the WWID of the LUN. If that's not an option, then > try the multipath tools support in > RHEL4.2 > or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be > /dev/mpath/mpath0 etc, or whatever you set them to instead. > > Even without failover, the latest Qlogic drivers will make both paths active > so that you never end up with a dead path upon boot up. > > > Hope this helps. > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net > Sent: Monday, September 25, 2006 11:18 AM > To: linux-cluster > Subject: [Linux-cluster] General FC Question > > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. 
> > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From isplist at logicore.net Tue Sep 26 13:46:22 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Tue, 26 Sep 2006 08:46:22 -0500 Subject: [Linux-cluster] General FC Question In-Reply-To: Message-ID: <200692684622.663564@leena> PS: Is my problem hard loop ID's or LUN's? Could I achieve what I need either way or is it one or thew other? On Tue, 26 Sep 2006 07:44:00 -0400, Kovacs, Corey J. wrote: > One more thing, when using more than one path (basically anyu san setup) the > > device > mappings will wrap around for every path, so for two paths... single hba, > dual controller.. > > > three disks will look like this... > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk1=/dev/sdd > disk2=/dev/sde > disk3=/dev/sde > > and four like this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > > > Or for dual hba, dual controller (4 paths) > > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > disk1=/dev/sdh > disk2=/dev/sdi > disk3=/dev/sdj > disk4=/dev/sdk > disk1=/dev/sdl > disk2=/dev/sdm > disk3=/dev/sdn > disk4=/dev/sdo > > etc... > > Cheers > > With the Qlogic drivers in failover mode, you'll get this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > > even though there are multiple paths > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. > Sent: Tuesday, September 26, 2006 7:38 AM > To: isplist at logicore.net; linux clustering > Subject: RE: [Linux-cluster] General FC Question > > You don't say which FC cards you are using but if it's qlogic, then the > driver can be set to combine the devices. Basically whats happened is that > your machine is picking up the alternate path to the device, which is a > perfectly valid thing to do, it's just not what you need at this point. It > may be as simple as your > > secondary controller actually has the lun you are trying to access. To work > around yo might just be able to reset the seconday controller and force the > primary to take over the LUN. This happens quite a bit depending on your > setup. The Qlogic drivers, when setup for failover, will coelesce the > devices > into a single device by the WWID of the LUN. If that's not an option, then > try the multipath tools support in > RHEL4.2 > or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be > /dev/mpath/mpath0 etc, or whatever you set them to instead. > > Even without failover, the latest Qlogic drivers will make both paths active > so that you never end up with a dead path upon boot up. > > > Hope this helps. 
> > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net > Sent: Monday, September 25, 2006 11:18 AM > To: linux-cluster > Subject: [Linux-cluster] General FC Question > > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. > > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jaap at sara.nl Tue Sep 26 14:02:11 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Tue, 26 Sep 2006 16:02:11 +0200 Subject: [Linux-cluster] Files are there, but not. Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB016936D4@douwes.ka.sara.nl> Hi All, We are running a GFS / NFS cluster with 5 fileservers. Each server exports the same storage as a different NFS server. nfs1, nfs2....nfs5 On nodes that mount on nfs1 we have the following problem: root# ls -l ls: CHGCAR: No such file or directory ls: CHG: No such file or directory ls: WAVECAR: No such file or directory total 17616 -rw------- 1 xxxxxxxx yyy 1612 Sep 26 15:43 CONTCAR -rw------- 1 xxxxxxxx yyy 167 Sep 26 14:11 DOSCAR Other fileservers display the 3 missing files normally and they are accessible. On the fileservers we also get these kind messages: h_update: A.TCNQ/CHGCAR already up-to-date! fh_update: A.TCNQ/CHGCAR already up-to-date! fh_update: A.TCNQ/CHGCAR already up-to-date! fh_update: A.TCNQ/CHGCAR already up-to-date! fh_update: A.TCNQ/WAVECAR already up-to-date! fh_update: A.TCNQ/WAVECAR already up-to-date! fh_update: A.TCNQ/WAVECAR already up-to-date! fh_update: A+B/CHGCAR already up-to-date! fh_update: A+B/CHGCAR already up-to-date! fh_update: A+B/CHGCAR already up-to-date! fh_update: A+B/WAVECAR already up-to-date! I don't know where to start looking to trac this problem. If i reboot the nfs1 server the problem is gone, but in time the problem comes back with other files, until now on the same fileserver. Maybe someone has seen this problem before? We use GFS version CVS 1.0.3 stable. with kernel 2.6.17.11 Met vriendelijke groet, Kind Regards, Jaap P. Dijkshoorn Systems Programmer mailto:jaap at sara.nl http://home.sara.nl/~jaapd SARA Computing & Networking Services Kruislaan 415 1098 SJ Amsterdam Tel: +31-(0)20-5923000 Fax: +31-(0)20-6683167 http://www.sara.nl From peter.huesser at psi.ch Tue Sep 26 14:26:29 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Tue, 26 Sep 2006 16:26:29 +0200 Subject: [Linux-cluster] Realserver configuration using loadbalancer Message-ID: <8E2924888511274B95014C2DD906E58AD1A3F9@MAILBOX0A.psi.ch> Hello If I run a loadbalancer in front of the webservers (using piranha_gui and pulse) is there anything I have configure on the real webservers? Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From cjk at techma.com Tue Sep 26 15:19:41 2006 From: cjk at techma.com (Kovacs, Corey J.) 
Date: Tue, 26 Sep 2006 11:19:41 -0400 Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_General_FC_Question?= In-Reply-To: <200692684622.663564@leena> Message-ID: I'd say LUN. If you cat out /proc/scsi/scsi you'll see the luns are repeated. The qlogic based failover doesn't have anything to do with settings on the SAN (combining luns etc) it does it at the scsi layer (on the host). sort of like "secure path" from HP. What you are seeing is the presence of both paths by the driver. The RedHat qlogic driver seems a bit crippled since they'd (and the upstream kernel devs) would rather you used the device mapper multipath solution instead. The path of least resistence is to get the qlogic drivers from the qlogic site (not the stock redhat drivers) and install them. A better long term solution is prolly to go ahead and figure out the multipath device mapper stuff. Cheers. Corey -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net Sent: Tuesday, September 26, 2006 9:46 AM To: linux-cluster Subject: RE: [Linux-cluster] General FC Question PS: Is my problem hard loop ID's or LUN's? Could I achieve what I need either way or is it one or thew other? On Tue, 26 Sep 2006 07:44:00 -0400, Kovacs, Corey J. wrote: > One more thing, when using more than one path (basically anyu san setup) the > > device > mappings will wrap around for every path, so for two paths... single hba, > dual controller.. > > > three disks will look like this... > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk1=/dev/sdd > disk2=/dev/sde > disk3=/dev/sde > > and four like this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > > > Or for dual hba, dual controller (4 paths) > > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > disk1=/dev/sde > disk2=/dev/sde > disk3=/dev/sdf > disk4=/dev/sdg > disk1=/dev/sdh > disk2=/dev/sdi > disk3=/dev/sdj > disk4=/dev/sdk > disk1=/dev/sdl > disk2=/dev/sdm > disk3=/dev/sdn > disk4=/dev/sdo > > etc... > > Cheers > > With the Qlogic drivers in failover mode, you'll get this.. > > disk1=/dev/sda > disk2=/dev/sdb > disk3=/dev/sdc > disk4=/dev/sdd > > even though there are multiple paths > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. > Sent: Tuesday, September 26, 2006 7:38 AM > To: isplist at logicore.net; linux clustering > Subject: RE: [Linux-cluster] General FC Question > > You don't say which FC cards you are using but if it's qlogic, then the > driver can be set to combine the devices. Basically whats happened is that > your machine is picking up the alternate path to the device, which is a > perfectly valid thing to do, it's just not what you need at this point. It > may be as simple as your > > secondary controller actually has the lun you are trying to access. To work > around yo might just be able to reset the seconday controller and force the > primary to take over the LUN. This happens quite a bit depending on your > setup. The Qlogic drivers, when setup for failover, will coelesce the > devices > into a single device by the WWID of the LUN. If that's not an option, then > try the multipath tools support in > RHEL4.2 > or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be > /dev/mpath/mpath0 etc, or whatever you set them to instead. 
> > Even without failover, the latest Qlogic drivers will make both paths active > so that you never end up with a dead path upon boot up. > > > Hope this helps. > > > Corey > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net > Sent: Monday, September 25, 2006 11:18 AM > To: linux-cluster > Subject: [Linux-cluster] General FC Question > > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > etc settings. My initial device now comes up as sdc when it used to be sda. > > Is there some way of allowing GFS to see the storage in some way that it can > know which device is which when I add a new one or remove one, etc? > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > Mike > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From wcheng at redhat.com Tue Sep 26 16:02:58 2006 From: wcheng at redhat.com (Wendy Cheng) Date: Tue, 26 Sep 2006 12:02:58 -0400 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB016936D4@douwes.ka.sara.nl> References: <339554D0FE9DD94A8E5ACE4403676CEB016936D4@douwes.ka.sara.nl> Message-ID: <45194F32.2040403@redhat.com> Jaap Dijkshoorn wrote: >Hi All, > >We are running a GFS / NFS cluster with 5 fileservers. Each server >exports the same storage as a different NFS server. nfs1, nfs2....nfs5 > >On nodes that mount on nfs1 we have the following problem: > >root# ls -l >ls: CHGCAR: No such file or directory >ls: CHG: No such file or directory >ls: WAVECAR: No such file or directory >total 17616 >-rw------- 1 xxxxxxxx yyy 1612 Sep 26 15:43 CONTCAR >-rw------- 1 xxxxxxxx yyy 167 Sep 26 14:11 DOSCAR > >Other fileservers display the 3 missing files normally and they are >accessible. On the fileservers we also get these kind messages: > > Look like NFS client side caching issue. What's the kernel version you have in the nfs client machine (do a "uname -a")? -- Wendy >h_update: A.TCNQ/CHGCAR already up-to-date! >fh_update: A.TCNQ/CHGCAR already up-to-date! >fh_update: A.TCNQ/CHGCAR already up-to-date! >fh_update: A.TCNQ/CHGCAR already up-to-date! >fh_update: A.TCNQ/WAVECAR already up-to-date! >fh_update: A.TCNQ/WAVECAR already up-to-date! >fh_update: A.TCNQ/WAVECAR already up-to-date! >fh_update: A+B/CHGCAR already up-to-date! >fh_update: A+B/CHGCAR already up-to-date! >fh_update: A+B/CHGCAR already up-to-date! >fh_update: A+B/WAVECAR already up-to-date! > >I don't know where to start looking to trac this problem. If i reboot >the nfs1 server the problem is gone, but in time the problem comes back >with other files, until now on the same fileserver. > >Maybe someone has seen this problem before? > >We use GFS version CVS 1.0.3 stable. with kernel 2.6.17.11 > > > > >Met vriendelijke groet, Kind Regards, > >Jaap P. Dijkshoorn >Systems Programmer >mailto:jaap at sara.nl http://home.sara.nl/~jaapd > >SARA Computing & Networking Services >Kruislaan 415 1098 SJ Amsterdam >Tel: +31-(0)20-5923000 >Fax: +31-(0)20-6683167 >http://www.sara.nl > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > -- S. 
Wendy Cheng wcheng at redhat.com From jpenalbae at gmail.com Tue Sep 26 18:05:45 2006 From: jpenalbae at gmail.com (=?ISO-8859-1?Q?Jaime_Pe=F1alba?=) Date: Tue, 26 Sep 2006 20:05:45 +0200 Subject: [Linux-cluster] General FC Question In-Reply-To: <200692684622.663564@leena> References: <200692684622.663564@leena> Message-ID: Hi, I will recommend you again using multipath-tools which uses device-mapper. Here is an example. # cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: HP Model: HSV100 Rev: 3025 Type: RAID ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 00 Lun: 01 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 00 Lun: 02 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 01 Lun: 00 Vendor: HP Model: HSV100 Rev: 3025 Type: RAID ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 01 Lun: 01 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi0 Channel: 00 Id: 01 Lun: 02 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 00 Lun: 00 Vendor: HP Model: HSV100 Rev: 3025 Type: RAID ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 00 Lun: 01 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 00 Lun: 02 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 01 Lun: 00 Vendor: HP Model: HSV100 Rev: 3025 Type: RAID ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 01 Lun: 01 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 Host: scsi1 Channel: 00 Id: 01 Lun: 02 Vendor: HP Model: HSV100 Rev: 3025 Type: Direct-Access ANSI SCSI revision: 02 That output wont help you to identify each disk. Here is the multipath-tools output from those disks: # multipath -v3 ..... truncated output ....... 3600508b4000116370000a00000c00000 0:0:0:1 sda 8:0 [ready] 3600508b40001168a0000e00000090000 0:0:0:2 sdb 8:16 [ready] 3600508b4000116370000a00000c00000 0:0:1:1 sdc 8:32 [faulty] 3600508b40001168a0000e00000090000 0:0:1:2 sdd 8:48 [faulty] 3600508b4000116370000a00000c00000 1:0:0:1 sde 8:64 [ready] 3600508b40001168a0000e00000090000 1:0:0:2 sdf 8:80 [ready] 3600508b4000116370000a00000c00000 1:0:1:1 sdg 8:96 [faulty] 3600508b40001168a0000e00000090000 1:0:1:2 sdh 8:112 [faulty] ..... truncated output ....... It finds each disk WWN and groups them. # multipath -l storage.old () [size=100 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active] \_ 0:0:0:1 sda 8:0 [active] \_ 0:0:1:1 sdc 8:32 [active] \_ 1:0:0:1 sde 8:64 [active] \_ 1:0:1:1 sdg 8:96 [failed] storage () [size=100 GB][features="0"][hwhandler="0"] \_ round-robin 0 [active] \_ 0:0:0:2 sdb 8:16 [active] \_ 0:0:1:2 sdd 8:48 [failed] \_ 1:0:0:2 sdf 8:80 [active] \_ 1:0:1:2 sdh 8:112 [failed] And WWNs are aliased to names in the /etc/multipath.conf file, example: ---------------------- multipath { wwid 3600508b4000116370000a00000c00000 alias storage.old } multipath { wwid 3600508b40001168a0000e00000090000 alias storage } --------------------- So it will create /dev/mapper/storage and /dev/mapper/storage.old # dmsetup ls storage.old (253, 4) storage3 (253, 3) storage2 (253, 2) storage1 (253, 1) storage (253, 0) storage.old2 (253, 6) storage.old1 (253, 5) My devices are partiotioned so, direct access for each partition is automatically created. 
/dev/mapper/storage (hole disk) /dev/mapper/storage1 (first partition) /dev/mapper/storage2 (second) /dev/mapper/storage3 (third) This way you can tell gfs to access those mapper devices which dont care about the order of found disks, just WWNs. About the device-mapper question, you can clean all devices by doing # dmsetup remove_all Or just remove one device # dmsetup remove storage I hope this helps you. Regards, Jaime. 2006/9/26, isplist at logicore.net : > PS: Is my problem hard loop ID's or LUN's? Could I achieve what I need either > way or is it one or thew other? > > > On Tue, 26 Sep 2006 07:44:00 -0400, Kovacs, Corey J. wrote: > > One more thing, when using more than one path (basically anyu san setup) the > > > > device > > mappings will wrap around for every path, so for two paths... single hba, > > dual controller.. > > > > > > three disks will look like this... > > > > disk1=/dev/sda > > disk2=/dev/sdb > > disk3=/dev/sdc > > disk1=/dev/sdd > > disk2=/dev/sde > > disk3=/dev/sde > > > > and four like this.. > > > > disk1=/dev/sda > > disk2=/dev/sdb > > disk3=/dev/sdc > > disk4=/dev/sdd > > disk1=/dev/sde > > disk2=/dev/sde > > disk3=/dev/sdf > > disk4=/dev/sdg > > > > > > Or for dual hba, dual controller (4 paths) > > > > > > disk1=/dev/sda > > disk2=/dev/sdb > > disk3=/dev/sdc > > disk4=/dev/sdd > > disk1=/dev/sde > > disk2=/dev/sde > > disk3=/dev/sdf > > disk4=/dev/sdg > > disk1=/dev/sdh > > disk2=/dev/sdi > > disk3=/dev/sdj > > disk4=/dev/sdk > > disk1=/dev/sdl > > disk2=/dev/sdm > > disk3=/dev/sdn > > disk4=/dev/sdo > > > > etc... > > > > Cheers > > > > With the Qlogic drivers in failover mode, you'll get this.. > > > > disk1=/dev/sda > > disk2=/dev/sdb > > disk3=/dev/sdc > > disk4=/dev/sdd > > > > even though there are multiple paths > > > > > > Corey > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kovacs, Corey J. > > Sent: Tuesday, September 26, 2006 7:38 AM > > To: isplist at logicore.net; linux clustering > > Subject: RE: [Linux-cluster] General FC Question > > > > You don't say which FC cards you are using but if it's qlogic, then the > > driver can be set to combine the devices. Basically whats happened is that > > your machine is picking up the alternate path to the device, which is a > > perfectly valid thing to do, it's just not what you need at this point. It > > may be as simple as your > > > > secondary controller actually has the lun you are trying to access. To work > > around yo might just be able to reset the seconday controller and force the > > primary to take over the LUN. This happens quite a bit depending on your > > setup. The Qlogic drivers, when setup for failover, will coelesce the > > devices > > into a single device by the WWID of the LUN. If that's not an option, then > > try the multipath tools support in > > RHEL4.2 > > or above. You won't be using the /dev/sd{a,b,c,...} devices, rather it'll be > > /dev/mpath/mpath0 etc, or whatever you set them to instead. > > > > Even without failover, the latest Qlogic drivers will make both paths active > > so that you never end up with a dead path upon boot up. > > > > > > Hope this helps. 
> > > > > > Corey > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of isplist at logicore.net > > Sent: Monday, September 25, 2006 11:18 AM > > To: linux-cluster > > Subject: [Linux-cluster] General FC Question > > > > After adding storage, my cluster comes up with different /dev/sda, /dev/sdb, > > etc settings. My initial device now comes up as sdc when it used to be sda. > > > > Is there some way of allowing GFS to see the storage in some way that it can > > know which device is which when I add a new one or remove one, etc? > > > > Hard loop ID's on the FC side I think but is there anything on the GFS side? > > > > Mike > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From jaap at sara.nl Wed Sep 27 06:56:59 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Wed, 27 Sep 2006 08:56:59 +0200 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <45194F32.2040403@redhat.com> Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB016936E0@douwes.ka.sara.nl> Hi Wendy, > > > > > Look like NFS client side caching issue. What's the kernel > version you > have in the nfs client machine (do a "uname -a")? > It is the same as our NFS server, kernel 2.6.17.11 Regards, Jaap From jaap at sara.nl Wed Sep 27 08:44:10 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Wed, 27 Sep 2006 10:44:10 +0200 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB016936E0@douwes.ka.sara.nl> Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB016936E9@douwes.ka.sara.nl> Wendy, Just to be complete. We see the problems occure on one of our GFS fileservers that is acting as a NFS fileserver. So on both server and clients connected to that GFS/NFS server, the files are missing. On the other GFS/NFS fileservers and clients connected to those servers the files are still available. So the same ls command on fileserver 2,3,4,5 gives a normal view of all the files. Regards, Jaap > > > Hi Wendy, > > > > > > > > > Look like NFS client side caching issue. What's the kernel > > version you > > have in the nfs client machine (do a "uname -a")? > > > > It is the same as our NFS server, kernel 2.6.17.11 > > Regards, > Jaap > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From rpeterso at redhat.com Wed Sep 27 14:21:44 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 27 Sep 2006 09:21:44 -0500 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB016936E9@douwes.ka.sara.nl> References: <339554D0FE9DD94A8E5ACE4403676CEB016936E9@douwes.ka.sara.nl> Message-ID: <451A88F8.4010503@redhat.com> Jaap Dijkshoorn wrote: > Wendy, > > Just to be complete. We see the problems occure on one of our GFS > fileservers that is acting as a NFS fileserver. So on both server and > clients connected to that GFS/NFS server, the files are missing. On the > other GFS/NFS fileservers and clients connected to those servers the > files are still available. > > So the same ls command on fileserver 2,3,4,5 gives a normal view of all > the files. 
> > Regards, > Jaap > Hi Jaap, Your problem may very well be the same as bugzilla bz 190756: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=190756 I've been trying unsuccessfully to recreate this problem at RHEL4U4. I need to reproduce the problem in our lab in order to debug the problem. I suspect that NFS changes made in U4 may have changed the timing to make the problem much less likely to occur. I may need to go back to U3 to recreate it. If you can give me any information that can help me reproduce the problem, I would be grateful. Regards, Bob Peterson Red Hat Cluster Suite From jaap at sara.nl Wed Sep 27 15:03:45 2006 From: jaap at sara.nl (Jaap Dijkshoorn) Date: Wed, 27 Sep 2006 17:03:45 +0200 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <451A88F8.4010503@redhat.com> Message-ID: <339554D0FE9DD94A8E5ACE4403676CEB016936FE@douwes.ka.sara.nl> Hi Bob, > > > Hi Jaap, > > Your problem may very well be the same as bugzilla bz 190756: > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=190756 It looks like it! > > I've been trying unsuccessfully to recreate this problem at RHEL4U4. > I need to reproduce the problem in our lab in order to debug > the problem. > I suspect that NFS changes made in U4 may have changed the timing > to make the problem much less likely to occur. I may need to go back > to U3 to recreate it. > > If you can give me any information that can help me reproduce the > problem, I would be grateful. I have asked the user who is having this problem what exactly is happening with those files during his job. I hope this will give us a clue in what ways those files are touched and/or deleted etc. All files are read/write by the users through NFS. But the strange thing is that on 4 of the 5 servers the files are still available, on GFS as well on the clients through NFS. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Thanks already for the effort. I hope we can tackle this bug! Best Regards, Jaap From peter.huesser at psi.ch Wed Sep 27 16:13:54 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Wed, 27 Sep 2006 18:13:54 +0200 Subject: [Linux-cluster] Realserver configuration using loadbalancer In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A3F9@MAILBOX0A.psi.ch> Message-ID: <8E2924888511274B95014C2DD906E58AD1A460@MAILBOX0A.psi.ch> I found the solution. One also has to manipulate the real webservers.
This is not described in the official "Red Hat Cluster Suite" documentation. Pedro ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Huesser Peter Sent: Montag, 25. September 2006 22:54 To: linux clustering Subject: RE: [Linux-cluster] piranha By the way: I started the "pulse" daemon in the debug modus ("pulse -v -n") and got the following output: nanny: Opening TCP socket to remote service port 80... nanny: Connecting socket to remote address... nanny: DEBUG -- Posting CONNECT poll() nanny: Sending len=16, text="GET / HTTP/1.0 " nanny: DEBUG -- Posting READ poll() nanny: DEBUG -- READ poll() completed (1,1) nanny: Posting READ I/O; expecting 4 character(s)... nanny: DEBUG -- READ returned 4 nanny: READ expected len=4, text="HTTP" nanny: READ got len=4, text=HTTP nanny: avail: 1 active: 1: count: 13 pulse: DEBUG -- setting SEND_heartbeat timer pulse: DEBUG -- setting SEND_heartbeat timer pulse: DEBUG -- setting NEED_heartbeat timer pulse: DEBUG -- setting SEND_heartbeat timer nanny: Opening TCP socket to remote service port 80... ... For me this looks as if everything is ok. "nanny" sends from time to time a "GET / HTTP/1.0" request and the response ("HTTP" only first four letters) correspondence with what is expected. The problem is that pulse is not opening port 80 on the loadbalancer for reveiving http-request. A "netstat -anp" verifies this. Hello I sent a similar question a few days ago and did not get any answer. Maybe the time (Saturday night) was unfavorable or the question was not that clear. So I try it once more: I want to run a loadbalancer in front of two webserver (using direct routing). But if I connect to port 80 of the loadbalancer I get a "connection refused". 1) Did anybody had a similar problem? 2) How can I increase the debuglevel? Thanks' in advance Pedro -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Wed Sep 27 16:59:56 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 27 Sep 2006 11:59:56 -0500 Subject: [Linux-cluster] Files are there, but not. In-Reply-To: <339554D0FE9DD94A8E5ACE4403676CEB016936FE@douwes.ka.sara.nl> References: <339554D0FE9DD94A8E5ACE4403676CEB016936FE@douwes.ka.sara.nl> Message-ID: <451AAE0C.70702@redhat.com> Jaap Dijkshoorn wrote: > It looks like it! > > I have aksed the user who is having this problem, what exactly is > happening with those files during his job. I hope this will give us a > clue in what ways those files are touched and/or deleted etc. > > All files are read/write by the users through NFS. But that strange > thing is that on 4 of the 5 servers the files are still available, on > GFS as well on the clients through NFS. > > thanks already for the effort. I hope we can tackle this bug! > > Best Regards, > Jaap > Hi Jaap, Soon after I sent the last email, I did recreate the problem here in our lab, though it was after several days of trying. That's good: It means the U4 is very stable, and it means I can probably work on the problem without the need for further information from people in the field. I did just update the bugzilla, but here's what I know so far: This is hard to explain, so let me simplify by calling "A" the cluster node that shows the files correctly, and "B" the cluster node that say the files are missing. Let's further say that an example "missing" file is: /mnt/gfs/subdir/xyz. 
So "ls /mnt/gfs/subdir/xyz" from "A" shows the file correctly, while the same command from "B" produces "No such file or directory". The biggest clue I've found today is this: It looks as if "B" somehow seems to have the wrong inode cached for "subdir". In other words, a stat command run on the directory "/mnt/gfs/subdir" shows the wrong directory inode (possibly a deleted subdirectory?) on "B" whereas "A" has the correct inode for "subdir" with the same stat command. I'm not sure yet if this incorrect cached inode is coming from GFS, or whether it's in the Linux vfs. I'm still investigating. Please update the bugzilla if you get more information. In the meanwhile, I'll continue working on the problem and I'll keep the bugzilla up to date when I find out more. Regards, Bob Peterson Red Hat Cluster Suite From tmornini at engineyard.com Wed Sep 27 20:29:49 2006 From: tmornini at engineyard.com (Tom Mornini) Date: Wed, 27 Sep 2006 13:29:49 -0700 Subject: [Linux-cluster] Re: [Xen-users] what do you recommend for cluster fs ?? In-Reply-To: References: <68729346-BD22-4B3D-84B1-948F79D72CDA@engineyard.com> Message-ID: <738F62ED-86F1-4DEA-9F9D-A97B09327137@engineyard.com> On Sep 27, 2006, at 12:56 PM, Anand Gupta wrote: > Hello Tom > > Thanks for the response. You're welcome. > We use CLVM. > > Would you mind sharing howto / documentation on how to get CLVM and > GFS setup ? I found a consultant to help out. It's a difficult configuration and requires a lot of time, patience, and expertise. We liked him so much he's now available for consulting rates through these other companies that I'm involved in: www.engineyard.com www.qualityhumans.com -- -- Tom Mornini From celso at webbertek.com.br Thu Sep 28 04:36:16 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 28 Sep 2006 01:36:16 -0300 Subject: [Linux-cluster] IPMI fencing on an IBM x366 In-Reply-To: <200609251411.k8PEBg406654@xos037.xos.nl> References: <200609251411.k8PEBg406654@xos037.xos.nl> Message-ID: <451B5140.90404@webbertek.com.br> Hi Jos, I've configured a pair of x366s in the past successfully to use the builtin IPMI device as a fence device, these systems run RHELv3u6 and Cluster Suite v3u6. They seem to work quite well until today. Since the systems are into production in a client, we didn't upgrade anything since then but it seems to work ok. I've not tried anything with IBM servers under Cluster Suite v4, though, but I imagine it works ok too. The IPMI device on the x366 servers was configured in the traditional way, login+password with admin rights, IP+netmask configured and everything worked ok. IBM's implementation of IPMI over LAN respond to "pings" (ICMP echos), while Dell's don't. But you will not be able to connect to the IPMI device from the machine itself, you have to try from an outside machine, ok? Hope this gives you some light. Best regards, Celso. Jos Vos escreveu: > Hi, > > Is it possible to use the built-in IPMI support of an IBM x366 server > with RHEL CS? > > I think it is not compatible with RSA II, and I also tried IPMI Lan, > but none of them seems to work. > > Any suggestions? > Thanks, > > -- > -- Jos Vos > -- X/OS Experts in Open Systems BV | Phone: +31 20 6938364 > -- Amsterdam, The Netherlands | Fax: +31 20 6948204 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- *Celso Kopp Webber* -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. 
From celso at webbertek.com.br Thu Sep 28 04:36:30 2006 From: celso at webbertek.com.br (Celso K. Webber) Date: Thu, 28 Sep 2006 01:36:30 -0300 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? Message-ID: <451B514E.4000607@webbertek.com.br> Hello all, I'm having a strange problem. Here is the scenario: * 2-node GFS cluster on 2 Dell PE-2900 servers; * 1 Dell|EMC CX300 storage, with servers direct attached using two HBAs each; * RHEL AS 4 Update 4, no updates applied; * Red Hat Cluster Suite v4 Update 4, no updates applied; * Red Hat GFS Update 4, no updates applied; * Using IPMI over LAN fencing. The Cluster was configured quite straight forward, the GFS filesystems worked fine. Since the Dell PowerEdge x9xx series now support IPMI on both LOMs (onboard NICs) as a configurable failover option, we decided to "channel bond" eth0 and eth1 (onboard NICs) together to have both the normal network traffic and also the heartbeat traffic over a redundant channel (bond0). Since IPMI works over both NICs, fencing is expected to work even if one of the NICs/cables goes down. Now the problem: whenever I pull both cables from one server, the servers almost simultaneously detect each other as offline (the logs show "serverX lost too many heartbeats, removing it from the Cluster"). A few seconds later and one server fences the other, at the same time!!! As far as I can tell, there is some delay between the sending of the "power off" IPMI command and the real poweroff from the IPMI embedded controller. By the way, there is no "normal shutdown" caused by ACPI or APM, these are both turned off in the servers. So it seems that when the first server kills the other, there is enough time to the second server to send the IPMI command to kill the first server also, and a few seconds later both are turned off, so my redundant environment goes down alltogether. Question: does someone is aware of a solution for this? Is there a way a server can notify the other that it is removing it from the cluster? Maybe using a shared disk? By the way, I didn't experimented with the new shared disk feature under CS v4, only with CS v3. Thank you all in advance. Regards, Celso. -- *Celso Kopp Webber* -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. From riaan at obsidian.co.za Thu Sep 28 08:58:13 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Thu, 28 Sep 2006 10:58:13 +0200 Subject: [Linux-cluster] piranha In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A464@MAILBOX0A.psi.ch> References: <8E2924888511274B95014C2DD906E58AD1A464@MAILBOX0A.psi.ch> Message-ID: <451B8EA5.5070202@obsidian.co.za> hi Pedro Care to tell us what you did to the real servers? If this is an omission in the documentation, please file a bugzilla against the RHCS manual. tnx Riaan Huesser Peter wrote: > I found the solution. One also has to manipulate the real webservers. > This is not described in the official ?Red Hat Cluster Suite? documentation. > > > > Pedro > > > > ------------------------------------------------------------------------ > > *From:* linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] *On Behalf Of *Huesser Peter > *Sent:* Montag, 25. September 2006 22:54 > *To:* linux clustering > *Subject:* RE: [Linux-cluster] piranha > > > > By the way: I started the ?pulse? daemon in the debug modus (?pulse ?v > ?n?) and got the following output: > > > > nanny: Opening TCP socket to remote service port 80... 
> > nanny: Connecting socket to remote address... > > nanny: DEBUG -- Posting CONNECT poll() > > nanny: Sending len=16, text="GET / HTTP/1.0 > > > > " > > nanny: DEBUG -- Posting READ poll() > > nanny: DEBUG -- READ poll() completed (1,1) > > nanny: Posting READ I/O; expecting 4 character(s)... > > nanny: DEBUG -- READ returned 4 > > nanny: READ expected len=4, text="HTTP" > > nanny: READ got len=4, text=HTTP > > nanny: avail: 1 active: 1: count: 13 > > pulse: DEBUG -- setting SEND_heartbeat timer > > pulse: DEBUG -- setting SEND_heartbeat timer > > pulse: DEBUG -- setting NEED_heartbeat timer > > pulse: DEBUG -- setting SEND_heartbeat timer > > nanny: Opening TCP socket to remote service port 80... > > ? > > > > For me this looks as if everything is ok. ?nanny? sends from time to > time a ?GET / HTTP/1.0? request and the response (?HTTP? only first four > letters) correspondence with what is expected. The problem is that pulse > is not opening port 80 on the loadbalancer for reveiving http-request. A > ?netstat ?anp? verifies this. > > > > > > Hello > > > > I sent a similar question a few days ago and did not get any answer. > Maybe the time (Saturday night) was unfavorable or the question was not > that clear. So I try it once more: > > > > I want to run a loadbalancer in front of two webserver (using direct > routing). But if I connect to port 80 of the loadbalancer I get a > ?connection refused?. > > > > 1) Did anybody had a similar problem? > > 2) How can I increase the debuglevel? > > > > Thanks? in advance > > > > Pedro > > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From isplist at logicore.net Thu Sep 28 11:21:33 2006 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 28 Sep 2006 06:21:33 -0500 Subject: [Linux-cluster] McDara Message-ID: <200692862133.433361@leena> Anyone using a McData ED-5000 or ED-6064 as their fence/hub who might be able to help me? Mike From csim at ices.utexas.edu Thu Sep 28 12:56:15 2006 From: csim at ices.utexas.edu (Chris Simmons) Date: Thu, 28 Sep 2006 07:56:15 -0500 Subject: [Linux-cluster] piranha In-Reply-To: <451B8EA5.5070202@obsidian.co.za> References: <8E2924888511274B95014C2DD906E58AD1A464@MAILBOX0A.psi.ch> <451B8EA5.5070202@obsidian.co.za> Message-ID: <20060928125615.GA21742@ices.utexas.edu> I imagine he had to add an iptables rule to his real servers to utilize direct routing. Older documentation contains direct routing examples, however, the latest incarnation does not. It only contains examples for NAT. Something like the following should work: iptables -t nat -A PREROUTING -p tcp -d VIP --dport 80 -j REDIRECT Chris * Riaan van Niekerk [2006-09-28 10:58:13 +0200]: > hi Pedro > > Care to tell us what you did to the real servers? > > If this is an omission in the documentation, please file a bugzilla > against the RHCS manual. > > tnx > Riaan > > Huesser Peter wrote: > >I found the solution. One also has to manipulate the real webservers. > >This is not described in the official ?Red Hat Cluster Suite? > >documentation. 
> > > > > > > > Pedro > > > > > > > >------------------------------------------------------------------------ > > > >*From:* linux-cluster-bounces at redhat.com > >[mailto:linux-cluster-bounces at redhat.com] *On Behalf Of *Huesser Peter > >*Sent:* Montag, 25. September 2006 22:54 > >*To:* linux clustering > >*Subject:* RE: [Linux-cluster] piranha > > > > > > > >By the way: I started the ?pulse? daemon in the debug modus (?pulse > >?v ?n?) and got the following output: > > > > > > > >nanny: Opening TCP socket to remote service port 80... > > > >nanny: Connecting socket to remote address... > > > >nanny: DEBUG -- Posting CONNECT poll() > > > >nanny: Sending len=16, text="GET / HTTP/1.0 > > > > > > > >" > > > >nanny: DEBUG -- Posting READ poll() > > > >nanny: DEBUG -- READ poll() completed (1,1) > > > >nanny: Posting READ I/O; expecting 4 character(s)... > > > >nanny: DEBUG -- READ returned 4 > > > >nanny: READ expected len=4, text="HTTP" > > > >nanny: READ got len=4, text=HTTP > > > >nanny: avail: 1 active: 1: count: 13 > > > >pulse: DEBUG -- setting SEND_heartbeat timer > > > >pulse: DEBUG -- setting SEND_heartbeat timer > > > >pulse: DEBUG -- setting NEED_heartbeat timer > > > >pulse: DEBUG -- setting SEND_heartbeat timer > > > >nanny: Opening TCP socket to remote service port 80... > > > >? > > > > > > > >For me this looks as if everything is ok. ?nanny? sends from time to > >time a ?GET / HTTP/1.0? request and the response (?HTTP? only > >first four letters) correspondence with what is expected. The problem is > >that pulse is not opening port 80 on the loadbalancer for reveiving > >http-request. A ?netstat ?anp? verifies this. > > > > > > > > > > > >Hello > > > > > > > >I sent a similar question a few days ago and did not get any answer. > >Maybe the time (Saturday night) was unfavorable or the question was not > >that clear. So I try it once more: > > > > > > > >I want to run a loadbalancer in front of two webserver (using direct > >routing). But if I connect to port 80 of the loadbalancer I get a > >?connection refused?. > > > > > > > >1) Did anybody had a similar problem? > > > >2) How can I increase the debuglevel? > > > > > > > >Thanks? in advance > > > > > > > > Pedro > > > > > > > > > >------------------------------------------------------------------------ > > > >-- > >Linux-cluster mailing list > >Linux-cluster at redhat.com > >https://www.redhat.com/mailman/listinfo/linux-cluster > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From peter.huesser at psi.ch Thu Sep 28 14:01:00 2006 From: peter.huesser at psi.ch (Huesser Peter) Date: Thu, 28 Sep 2006 16:01:00 +0200 Subject: [Linux-cluster] piranha In-Reply-To: <451B8EA5.5070202@obsidian.co.za> Message-ID: <8E2924888511274B95014C2DD906E58A011078DF@MAILBOX0A.psi.ch> > > hi Pedro > > Care to tell us what you did to the real servers? > I found a solution in LVS-HowTo in chapter 5.7. Here is the link: http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.arp_problem.html# 2.6_arp I had to change the /etc/sysctl.conf and let the lo:1 listen to the VIP without responding to arp requests. Pedro From dbrieck at gmail.com Thu Sep 28 16:15:43 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 28 Sep 2006 12:15:43 -0400 Subject: [Linux-cluster] Hard lockups during file transfer to GNBD/GFS device Message-ID: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> Here is our setup: 2 GNBD servers attached to a shared SCSI array. 
Each (of 9) nodes uses multipath to import the shared device from both servers. We are also using GFS on to of that for our shared storage. What is happening is that I need to transfer a large number of files (about 1.5 million) from a nodes local storage to the network storage. I'm using rsync locally to move all the files. Orginally my problem was that the oom killer would start running partway through the transfer and the machine would then be unusable (however it was still up enough that it wasn't fenced). Here is that log: Sep 27 12:21:43 db2 kernel: oom-killer: gfp_mask=0xd0 Sep 27 12:21:43 db2 kernel: Mem-info: Sep 27 12:21:43 db2 kernel: DMA per-cpu: Sep 27 12:21:43 db2 kernel: cpu 0 hot: low 2, high 6, batch 1 Sep 27 12:21:43 db2 kernel: cpu 0 cold: low 0, high 2, batch 1 Sep 27 12:21:43 db2 kernel: cpu 1 hot: low 2, high 6, batch 1 Sep 27 12:21:43 db2 kernel: cpu 1 cold: low 0, high 2, batch 1 Sep 27 12:21:43 db2 kernel: cpu 2 hot: low 2, high 6, batch 1 Sep 27 12:21:43 db2 kernel: cpu 2 cold: low 0, high 2, batch 1 Sep 27 12:21:43 db2 kernel: cpu 3 hot: low 2, high 6, batch 1 Sep 27 12:21:43 db2 kernel: cpu 3 cold: low 0, high 2, batch 1 Sep 27 12:21:43 db2 kernel: cpu 4 hot: low 2, high 6, batch 1 Sep 27 12:21:44 db2 kernel: cpu 4 cold: low 0, high 2, batch 1 Sep 27 12:21:53 db2 in[15473]: 1159374113||chericee at herr-sacco.com |2852|timeout|1 Sep 27 12:21:54 db2 kernel: cpu 5 hot: low 2, high 6, batch 1 Sep 27 12:21:54 db2 kernel: cpu 5 cold: low 0, high 2, batch 1 Sep 27 12:21:54 db2 kernel: cpu 6 hot: low 2, high 6, batch 1 Sep 27 12:21:54 db2 kernel: cpu 6 cold: low 0, high 2, batch 1 Sep 27 12:21:54 db2 kernel: cpu 7 hot: low 2, high 6, batch 1 Sep 27 12:21:54 db2 kernel: cpu 7 cold: low 0, high 2, batch 1 Sep 27 12:21:54 db2 kernel: Normal per-cpu: Sep 27 12:21:54 db2 kernel: cpu 0 hot: low 32, high 96, batch 16 Sep 27 12:21:54 db2 kernel: cpu 0 cold: low 0, high 32, batch 16 Sep 27 12:21:54 db2 kernel: cpu 1 hot: low 32, high 96, batch 16 Sep 27 12:21:54 db2 kernel: cpu 1 cold: low 0, high 32, batch 16 Sep 27 12:21:54 db2 kernel: cpu 2 hot: low 32, high 96, batch 16 Sep 27 12:27:59 db2 syslogd 1.4.1: restart. Sep 27 12:27:59 db2 syslog: syslogd startup succeeded Sep 27 12:27:59 db2 kernel: klogd 1.4.1, log source = /proc/kmsg started. Sep 27 12:27:59 db2 kernel: Linux version 2.6.9-42.0.2.ELsmp ( buildsvn at build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Wed Aug 23 00:17:26 CDT 2006 I found a few postings saying that using the hugemem kernel would solve the problems (they claimed it was a known SMP bug by redhat) so all my systems are now running on that kernel. It did solve the out of memory problem, but it seems to have introduced some new ones. 
Here are the logs from the most recent crashes: Sep 28 11:15:05 db2 kernel: do_IRQ: stack overflow: 412 Sep 28 11:15:05 db2 kernel: [<02107c6b>] do_IRQ+0x49/0x1ae<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000 Sep 28 11:15:05 db2 kernel: printing eip: Sep 28 11:15:05 db2 kernel: 0212928c Sep 28 11:15:05 db2 kernel: *pde = 00004001 Sep 28 11:15:05 db2 kernel: Oops: 0002 [#1] Sep 28 11:15:05 db2 kernel: SMP Sep 28 11:15:05 db2 kernel: Modules linked in: mptctl mptbase dell_rbu nfsd exportfs lockd nfs_acl parport_pc lp parport autofs4 i 2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dm_round_robin gnbd(U) dlm(U) cman(U) sunrpc ipmi_devintf ipmi_si ipmi_msghandl er iptable_filter iptable_mangle iptable_nat ip_conntrack ip_tables md5 ipv6 dm_multipath joydev button battery ac uhci_hcd ehci_h cd hw_random e1000 bonding(U) floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod Sep 28 11:15:05 db2 kernel: CPU: 1548750336 Sep 28 11:15:05 db2 kernel: EIP: 0060:[<0212928c>] Not tainted VLI Sep 28 11:15:05 db2 kernel: EFLAGS: 00010002 (2.6.9-42.0.2.ELhugemem) Sep 28 11:15:05 db2 kernel: EIP is at internal_add_timer+0x84/0x8c Sep 28 11:15:05 db2 kernel: eax: 00000000 ebx: 023b7900 ecx: 023b8680 edx: 02447620 Sep 28 11:15:05 db2 kernel: esi: 00000000 edi: 023b7900 ebp: 02ee0c94 esp: 48552fb4 Sep 28 11:15:05 db2 kernel: ds: 007b es: 007b ss: 0068 Sep 28 11:15:05 db2 kernel: Process (pid: 1, threadinfo=48552000 task=6d641a00) Sep 28 11:17:54 db2 syslogd 1.4.1: restart. Sep 28 11:17:54 db2 syslog: syslogd startup succeeded Sep 28 11:17:54 db2 kernel: klogd 1.4.1, log source = /proc/kmsg started. Sep 28 11:17:54 db2 syslog: klogd startup succeeded Sep 28 11:17:54 db2 kernel: Linux version 2.6.9-42.0.2.ELhugemem ( buildsvn at build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6- 3)) #1 SMP Wed Aug 23 00:38:38 CDT 2006 The GNBD servers stay online and don't have any problems, it's just the client where all the trouble is coming from. Is this a bug or is something not setup right? If you need more info I'll be happy to provide it. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rodgersr at yahoo.com Thu Sep 28 17:09:17 2006 From: rodgersr at yahoo.com (Rick Rodgers) Date: Thu, 28 Sep 2006 10:09:17 -0700 (PDT) Subject: [Linux-cluster] clurmtabd Message-ID: <20060928170917.70680.qmail@web34214.mail.mud.yahoo.com> Is there anyway to have clurmtabd monitor all the subdirectories of a mount point. (ie. specify a parent directory but have nodes mounting off some of the subdirectories) Or do you always have to have a clurmtabd running for each subdirectory mount point --------------------------------- Talk is cheap. Use Yahoo! Messenger to make PC-to-Phone calls. Great rates starting at 1?/min. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dbrieck at gmail.com Thu Sep 28 19:08:58 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 28 Sep 2006 15:08:58 -0400 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> References: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> Message-ID: <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> On 9/28/06, David Brieck Jr. wrote: > Here is our setup: 2 GNBD servers attached to a shared SCSI array. Each (of 9) nodes uses multipath to import the shared device from both servers. 
We are also using GFS on to of that for our shared storage. > > What is happening is that I need to transfer a large number of files (about 1.5 million) from a nodes local storage to the network storage. I'm using rsync locally to move all the files. Orginally my problem was that the oom killer would start running partway through the transfer and the machine would then be unusable (however it was still up enough that it wasn't fenced). Here is that log: > > > > I found a few postings saying that using the hugemem kernel would solve the problems (they claimed it was a known SMP bug by redhat) so all my systems are now running on that kernel. It did solve the out of memory problem, but it seems to have introduced some new ones. Here are the logs from the most recent crashes: > > > > > The GNBD servers stay online and don't have any problems, it's just the client where all the trouble is coming from. Is this a bug or is something not setup right? > > If you need more info I'll be happy to provide it. > > Thanks. I just tried to more the same data by tar-ing it up to the network, same result. Again, this is about 94GB and 1.5 million files that I seem to be unable to move from local storage to shared. Anyone have any suggestions? From dbrieck at gmail.com Thu Sep 28 19:27:20 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Thu, 28 Sep 2006 15:27:20 -0400 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> References: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> Message-ID: <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> On 9/28/06, David Brieck Jr. wrote: > On 9/28/06, David Brieck Jr. wrote: > > Here is our setup: 2 GNBD servers attached to a shared SCSI array. Each (of 9) nodes uses multipath to import the shared device from both servers. We are also using GFS on to of that for our shared storage. > > > > What is happening is that I need to transfer a large number of files (about 1.5 million) from a nodes local storage to the network storage. I'm using rsync locally to move all the files. Orginally my problem was that the oom killer would start running partway through the transfer and the machine would then be unusable (however it was still up enough that it wasn't fenced). Here is that log: > > > > > > > > I found a few postings saying that using the hugemem kernel would solve the problems (they claimed it was a known SMP bug by redhat) so all my systems are now running on that kernel. It did solve the out of memory problem, but it seems to have introduced some new ones. Here are the logs from the most recent crashes: > > > > > > > > > > The GNBD servers stay online and don't have any problems, it's just the client where all the trouble is coming from. Is this a bug or is something not setup right? > > > > If you need more info I'll be happy to provide it. > > > > Thanks. > > > I just tried to more the same data by tar-ing it up to the network, > same result. Again, this is about 94GB and 1.5 million files that I > seem to be unable to move from local storage to shared. Anyone have > any suggestions? 
> I forgot to include the kernel message, see below: Sep 28 15:01:56 db2 kernel: do_IRQ: stack overflow: 460 Sep 28 15:01:56 db2 kernel: [<02107c6b>] do_IRQ+0x49/0x1ae Sep 28 15:01:56 db2 kernel: [] tcp_in_window+0x1c6/0x3ad [ip_conntrack] Sep 28 15:01:56 db2 kernel: [] tcp_packet+0x338/0x412 [ip_conntrack] Sep 28 15:01:56 db2 kernel: [] __ip_conntrack_find+0xf/0xa1 [ip_conntrack] Sep 28 15:01:56 db2 kernel: [] ip_conntrack_in+0x1dc/0x2a6 [ip_conntrack] Sep 28 15:01:56 db2 kernel: [<0228227b>] nf_iterate+0x40/0x81 Sep 28 15:01:56 db2 kernel: [<022927d8>] dst_output+0x0/0x1a Sep 28 15:01:56 db2 kernel: [<02282581>] nf_hook_slow+0x47/0xbc Sep 28 15:01:56 db2 kernel: [<022927d8>] dst_output+0x0/0x1a Sep 28 15:01:56 db2 kernel: [<02293093>] ip_queue_xmit+0x395/0x3f9 Sep 28 15:04:39 db2 syslogd 1.4.1: restart. From teigland at redhat.com Thu Sep 28 19:58:44 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 28 Sep 2006 14:58:44 -0500 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> References: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> Message-ID: <20060928195844.GB25242@redhat.com> On Thu, Sep 28, 2006 at 03:27:20PM -0400, David Brieck Jr. wrote: > I forgot to include the kernel message, see below: > > Sep 28 15:01:56 db2 kernel: do_IRQ: stack overflow: 460 > Sep 28 15:01:56 db2 kernel: [<02107c6b>] do_IRQ+0x49/0x1ae > Sep 28 15:01:56 db2 kernel: [] tcp_in_window+0x1c6/0x3ad > [ip_conntrack] > Sep 28 15:01:56 db2 kernel: [] tcp_packet+0x338/0x412 > [ip_conntrack] > Sep 28 15:01:56 db2 kernel: [] __ip_conntrack_find+0xf/0xa1 > [ip_conntrack] > Sep 28 15:01:56 db2 kernel: [] ip_conntrack_in+0x1dc/0x2a6 > [ip_conntrack] > Sep 28 15:01:56 db2 kernel: [<0228227b>] nf_iterate+0x40/0x81 > Sep 28 15:01:56 db2 kernel: [<022927d8>] dst_output+0x0/0x1a > Sep 28 15:01:56 db2 kernel: [<02282581>] nf_hook_slow+0x47/0xbc > Sep 28 15:01:56 db2 kernel: [<022927d8>] dst_output+0x0/0x1a > Sep 28 15:01:56 db2 kernel: [<02293093>] ip_queue_xmit+0x395/0x3f9 > Sep 28 15:04:39 db2 syslogd 1.4.1: restart. Could you try it without multipath? You have quite a few layers there. Dave From teigland at redhat.com Thu Sep 28 20:01:00 2006 From: teigland at redhat.com (David Teigland) Date: Thu, 28 Sep 2006 15:01:00 -0500 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? In-Reply-To: <451B514E.4000607@webbertek.com.br> References: <451B514E.4000607@webbertek.com.br> Message-ID: <20060928200100.GC25242@redhat.com> On Thu, Sep 28, 2006 at 01:36:30AM -0300, Celso K. Webber wrote: > So it seems that when the first server kills the other, there is enough > time to the second server to send the IPMI command to kill the first > server also, and a few seconds later both are turned off, so my > redundant environment goes down alltogether. > > Question: does someone is aware of a solution for this? Is there a way a > server can notify the other that it is removing it from the cluster? > Maybe using a shared disk? By the way, I didn't experimented with the > new shared disk feature under CS v4, only with CS v3. The new qdisk should be a good way to solve this. Dave From celso at webbertek.com.br Fri Sep 29 13:15:37 2006 From: celso at webbertek.com.br (Celso K. 
Webber) Date: Fri, 29 Sep 2006 10:15:37 -0300 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? In-Reply-To: <20060928200100.GC25242@redhat.com> References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> Message-ID: <451D1C79.3050601@webbertek.com.br> Hello David, Do you know (or someone else) where can I find documentation about the new qdisk mechanism? I imagine I should configure it by editing cluster.conf directly, isn't it? The GUI does not mantion the "shared state" configuration as it did under Cluster Suite v3. Thank you all. Celso. David Teigland escreveu: > On Thu, Sep 28, 2006 at 01:36:30AM -0300, Celso K. Webber wrote: >> Question: does someone is aware of a solution for this? Is there a way a >> server can notify the other that it is removing it from the cluster? >> Maybe using a shared disk? By the way, I didn't experimented with the >> new shared disk feature under CS v4, only with CS v3. > > The new qdisk should be a good way to solve this. > Dave > > -- *Celso Kopp Webber* celso at webbertek.com.br *Webbertek - Opensource Knowledge* (41) 8813-1919 (41) 3284-3035 -- Esta mensagem foi verificada pelo sistema de antiv?rus e acredita-se estar livre de perigo. From jparsons at redhat.com Fri Sep 29 13:27:12 2006 From: jparsons at redhat.com (James Parsons) Date: Fri, 29 Sep 2006 09:27:12 -0400 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? In-Reply-To: <451D1C79.3050601@webbertek.com.br> References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> <451D1C79.3050601@webbertek.com.br> Message-ID: <451D1F30.5030807@redhat.com> Celso K. Webber wrote: > Hello David, > > Do you know (or someone else) where can I find documentation about the > new qdisk mechanism? > > I imagine I should configure it by editing cluster.conf directly, isn't > it? The GUI does not mantion the "shared state" configuration as it did > under Cluster Suite v3. The GUI will support it in 4U5 and in RHEL5. -J > > > Thank you all. > > Celso. > > David Teigland escreveu: > >> On Thu, Sep 28, 2006 at 01:36:30AM -0300, Celso K. Webber wrote: >> >>> Question: does someone is aware of a solution for this? Is there a >>> way a server can notify the other that it is removing it from the >>> cluster? Maybe using a shared disk? By the way, I didn't >>> experimented with the new shared disk feature under CS v4, only with >>> CS v3. >> >> >> The new qdisk should be a good way to solve this. >> Dave >> >> > From lhh at redhat.com Fri Sep 29 13:40:25 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 29 Sep 2006 09:40:25 -0400 Subject: [Linux-cluster] clurmtabd In-Reply-To: <20060928170917.70680.qmail@web34214.mail.mud.yahoo.com> References: <20060928170917.70680.qmail@web34214.mail.mud.yahoo.com> Message-ID: <1159537225.27578.9.camel@rei.boston.devel.redhat.com> On Thu, 2006-09-28 at 10:09 -0700, Rick Rodgers wrote: > Is there anyway to have clurmtabd monitor all the subdirectories > of a mount point. (ie. specify a parent directory but have nodes > mounting off some of the subdirectories) Or do you always have to have > a clurmtabd running for each subdirectory mount point It matches based on the parent mount point, and should sync all subdirectories present in /var/lib/nfs/rmtab... e.g. clurmtabd /foo Clients which mount /foo/bar, /foo/bar/1, etc. 
should all have entries in /foo/.clumanager/rmtab -- Lon From riaan at obsidian.co.za Fri Sep 29 13:41:51 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Fri, 29 Sep 2006 15:41:51 +0200 Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve? In-Reply-To: <451D1C79.3050601@webbertek.com.br> References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> <451D1C79.3050601@webbertek.com.br> Message-ID: <451D229F.80005@obsidian.co.za> qdisk is part of newer versions of cman. "man qdisk" is the best source of information (that I am aware of) for the new quorum disk functionality. Unfortunately the Cluster Suite docs have not been updated with the qdisk subsystem. However, the CS update 4 release notes mention it. http://www.redhat.com/docs/manuals/csgfs/ slightly off-topic rant: unfortunately it is very difficult to tell when Red Hat update their documentation. At the above link, there is a red "(Updated)" if something is updated recently, but "recently" is a very vague term. I have even submitted Bugzilla Bug 195890: RFE "(Updated)" and "(New)" labels in documentation should have dates without success. Riaan Celso K. Webber wrote: > Hello David, > > Do you know (or someone else) where can I find documentation about the > new qdisk mechanism? > > I imagine I should configure it by editing cluster.conf directly, isn't > it? The GUI does not mantion the "shared state" configuration as it did > under Cluster Suite v3. > > Thank you all. > > Celso. > > David Teigland escreveu: >> On Thu, Sep 28, 2006 at 01:36:30AM -0300, Celso K. Webber wrote: >>> Question: does someone is aware of a solution for this? Is there a >>> way a server can notify the other that it is removing it from the >>> cluster? Maybe using a shared disk? By the way, I didn't experimented >>> with the new shared disk feature under CS v4, only with CS v3. >> >> The new qdisk should be a good way to solve this. >> Dave >> >> > -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From dbrieck at gmail.com Fri Sep 29 13:51:08 2006 From: dbrieck at gmail.com (David Brieck Jr.) Date: Fri, 29 Sep 2006 09:51:08 -0400 Subject: [Linux-cluster] Re: Hard lockups during file transfer to GNBD/GFS device In-Reply-To: <20060928195844.GB25242@redhat.com> References: <8c1094290609280915o6b6b4962ud0d090e58e5d7fc6@mail.gmail.com> <8c1094290609281208i6a5eaf8br70697c6b5d085cf@mail.gmail.com> <8c1094290609281227j1303ec11u300932ab8d4953ab@mail.gmail.com> <20060928195844.GB25242@redhat.com> Message-ID: <8c1094290609290651r62cec5f9n28278d6a81c3e6ef@mail.gmail.com> On 9/28/06, David Teigland wrote: > > Could you try it without multipath? You have quite a few layers there. > Dave > > Thanks for the response. I unloaded gfs, clvm, gnbd and multipath, then reloaded gnbd, clvm and gfs. It was only talking to one of the gnbd servers and without multipath. Here's the log from this crash. It seems to have more info in it. I'm kinda confused why it still has references to multipath though. I unloaded the multipath module so I'm not sure why it's still in there.
Sep 29 09:39:26 db2 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000 Sep 29 09:39:26 db2 kernel: printing eip: Sep 29 09:39:26 db2 kernel: f882d427 Sep 29 09:39:26 db2 kernel: *pde = 00004001 Sep 29 09:39:26 db2 kernel: Oops: 0000 [#1] Sep 29 09:39:26 db2 kernel: SMP Sep 29 09:39:26 db2 kernel: Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) gnbd(U) mptctl mptbase dell_rbu nfsd exportfs lockd nfs_acl parport_pc lp p arport autofs4 i2c_dev i2c_core dm_round_robin dlm(U) cman(U) sunrpc ipmi_devintf ipmi_si ipmi_msghandler iptable_filter iptable_mangle iptable_nat ip_conntr ack ip_tables md5 ipv6 dm_multipath joydev button battery ac uhci_hcd ehci_hcd hw_random e1000 bonding(U) floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm _mod megaraid_mbox megaraid_mm sd_mod scsi_mod Sep 29 09:39:26 db2 kernel: CPU: 5 Sep 29 09:39:26 db2 kernel: EIP: 0060:[] Not tainted VLI Sep 29 09:39:26 db2 kernel: EFLAGS: 00010286 (2.6.9-42.0.2.ELhugemem) Sep 29 09:39:26 db2 kernel: EIP is at journal_start+0x23/0x9e [jbd] Sep 29 09:39:26 db2 kernel: eax: 00000000 ebx: 8ca9b300 ecx: e1f0b400 edx: 00000042 Sep 29 09:39:26 db2 kernel: esi: e1f0bc00 edi: 1ef03000 ebp: 02325e78 esp: 1ef03bc0 Sep 29 09:39:26 db2 kernel: ds: 007b es: 007b ss: 0068 Sep 29 09:39:26 db2 kernel: Process rsync (pid: 20038, threadinfo=1ef03000 task=d9f178b0) Sep 29 09:39:26 db2 kernel: Stack: d406cde8 1ef03c00 00000031 f88a8c55 d406cde8 1ef03c00 0216fc5c d406cde8 Sep 29 09:39:26 db2 kernel: 0216fcf1 3d38f768 3d38f770 0000000a 02170076 00000080 00000080 00000080 Sep 29 09:39:26 db2 kernel: bf756da8 8b255598 00000000 00000086 00000000 39ffe980 021700e3 02148548 Sep 29 09:39:26 db2 kernel: Call Trace: Sep 29 09:39:26 db2 kernel: [] ext3_dquot_drop+0x14/0x3b [ext3] Sep 29 09:39:26 db2 kernel: [<0216fc5c>] clear_inode+0xb4/0x102 Sep 29 09:39:26 db2 kernel: [<0216fcf1>] dispose_list+0x47/0x6d Sep 29 09:39:26 db2 kernel: [<02170076>] prune_icache+0x193/0x1ec Sep 29 09:39:26 db2 kernel: [<021700e3>] shrink_icache_memory+0x14/0x2b Sep 29 09:39:26 db2 kernel: [<02148548>] shrink_slab+0xf8/0x161 Sep 29 09:39:26 db2 kernel: [<0214952c>] try_to_free_pages+0xd1/0x1a7 Sep 29 09:39:26 db2 kernel: [<02142f1d>] __alloc_pages+0x1b5/0x29d Sep 29 09:39:26 db2 kernel: [<02140e51>] generic_file_buffered_write+0x1a1/0x533 Sep 29 09:39:26 db2 kernel: [<0214156c>] __generic_file_aio_write_nolock+0x389/0x3b7 Sep 29 09:39:26 db2 kernel: [<021415d3>] generic_file_aio_write_nolock+0x39/0x7f Sep 29 09:39:26 db2 kernel: [<02141736>] generic_file_write_nolock+0x84/0x99 Sep 29 09:39:26 db2 kernel: [] gfs_glock_nq+0xe3/0x116 [gfs] Sep 29 09:39:26 db2 kernel: [<021204e9>] autoremove_wake_function+0x0/0x2d Sep 29 09:39:26 db2 kernel: [] gfs_trans_begin_i+0xfd/0x15a [gfs] Sep 29 09:39:26 db2 kernel: [] do_do_write_buf+0x2a6/0x452 [gfs] Sep 29 09:39:26 db2 kernel: [] do_write_buf+0x11b/0x15e [gfs] Sep 29 09:39:26 db2 kernel: [] walk_vm+0xd7/0x100 [gfs] Sep 29 09:39:26 db2 kernel: [] __gfs_write+0xa1/0xbb [gfs] Sep 29 09:39:26 db2 kernel: [] do_write_buf+0x0/0x15e [gfs] Sep 29 09:39:26 db2 kernel: [] gfs_write+0xb/0xe [gfs] Sep 29 09:39:26 db2 kernel: [<0215a52f>] vfs_write+0xb6/0xe2 Sep 29 09:39:26 db2 kernel: [<0215a5f9>] sys_write+0x3c/0x62 Sep 29 09:39:26 db2 kernel: Code: <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43 Sep 29 09:39:26 db2 kernel: in_atomic():0[expected: 0], irqs_disabled():1 Sep 29 09:39:26 db2 kernel: [<02120209>] __might_sleep+0x7d/0x88 Sep 29 09:39:26 db2 kernel: [<0215537c>] 
rw_vm+0xe4/0x29c Sep 29 09:39:26 db2 kernel: [] new_handle+0x38/0x40 [jbd] Sep 29 09:39:26 db2 kernel: [] new_handle+0x38/0x40 [jbd] Sep 29 09:39:26 db2 kernel: [<021557f3>] get_user_size+0x30/0x57 Sep 29 09:39:26 db2 kernel: [] new_handle+0x38/0x40 [jbd] Sep 29 09:39:26 db2 kernel: [<021061bb>] show_registers+0x115/0x16c Sep 29 09:39:26 db2 kernel: [<02106352>] die+0xdb/0x16b Sep 29 09:39:26 db2 kernel: [<02122a14>] vprintk+0x136/0x14a Sep 29 09:39:26 db2 kernel: [<0211b236>] do_page_fault+0x421/0x5f7 Sep 29 09:39:26 db2 kernel: [] journal_start+0x23/0x9e [jbd] Sep 29 09:39:26 db2 kernel: [<0211cec9>] activate_task+0x88/0x95 Sep 29 09:39:26 db2 kernel: [<0211d3f4>] try_to_wake_up+0x28e/0x299 Sep 29 09:39:26 db2 kernel: [<0211ae15>] do_page_fault+0x0/0x5f7 Sep 29 09:39:26 db2 kernel: [] journal_start+0x23/0x9e [jbd] Sep 29 09:39:26 db2 kernel: [] ext3_dquot_drop+0x14/0x3b [ext3] Sep 29 09:39:26 db2 kernel: [<0216fc5c>] clear_inode+0xb4/0x102 Sep 29 09:39:26 db2 kernel: [<0216fcf1>] dispose_list+0x47/0x6d Sep 29 09:39:26 db2 kernel: [<02170076>] prune_icache+0x193/0x1ec Sep 29 09:39:26 db2 kernel: [<021700e3>] shrink_icache_memory+0x14/0x2b Sep 29 09:39:26 db2 kernel: [<02148548>] shrink_slab+0xf8/0x161 Sep 29 09:39:26 db2 kernel: [<0214952c>] try_to_free_pages+0xd1/0x1a7 Sep 29 09:39:26 db2 kernel: [<02142f1d>] __alloc_pages+0x1b5/0x29d Sep 29 09:39:26 db2 kernel: [<02140e51>] generic_file_buffered_write+0x1a1/0x533 Sep 29 09:39:26 db2 kernel: [<0214156c>] __generic_file_aio_write_nolock+0x389/0x3b7 Sep 29 09:39:26 db2 kernel: [<021415d3>] generic_file_aio_write_nolock+0x39/0x7f Sep 29 09:39:26 db2 kernel: [<02141736>] generic_file_write_nolock+0x84/0x99 Sep 29 09:39:26 db2 kernel: [] gfs_glock_nq+0xe3/0x116 [gfs] Sep 29 09:39:26 db2 kernel: [<021204e9>] autoremove_wake_function+0x0/0x2d Sep 29 09:39:26 db2 kernel: [] gfs_trans_begin_i+0xfd/0x15a [gfs] Sep 29 09:39:26 db2 kernel: [] do_do_write_buf+0x2a6/0x452 [gfs] Sep 29 09:39:26 db2 kernel: [] do_write_buf+0x11b/0x15e [gfs] Sep 29 09:39:26 db2 kernel: [] walk_vm+0xd7/0x100 [gfs] Sep 29 09:39:26 db2 kernel: [] __gfs_write+0xa1/0xbb [gfs] Sep 29 09:39:26 db2 kernel: [] do_write_buf+0x0/0x15e [gfs] Sep 29 09:39:26 db2 kernel: [] gfs_write+0xb/0xe [gfs] Sep 29 09:39:26 db2 kernel: [<0215a52f>] vfs_write+0xb6/0xe2 Sep 29 09:39:26 db2 kernel: [<0215a5f9>] sys_write+0x3c/0x62 Sep 29 09:39:26 db2 kernel: Bad EIP value. Sep 29 09:39:26 db2 kernel: <0>Fatal exception: panic in 5 seconds Sep 29 09:42:17 db2 syslogd 1.4.1: restart. Thanks again for your help. From lhh at redhat.com Fri Sep 29 13:58:21 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 29 Sep 2006 09:58:21 -0400 Subject: [Linux-cluster] IPMI fencing on an IBM x366 In-Reply-To: <200609251411.k8PEBg406654@xos037.xos.nl> References: <200609251411.k8PEBg406654@xos037.xos.nl> Message-ID: <1159538301.27578.13.camel@rei.boston.devel.redhat.com> On Mon, 2006-09-25 at 16:11 +0200, Jos Vos wrote: > Hi, > > Is it possible to use the built-in IPMI support of an IBM x366 server > with RHEL CS? > > I think it is not compatible with RSA II, and I also tried IPMI Lan, > but none of them seems to work. Maybe this patch would help? http://bugzilla.redhat.com/bugzilla/attachment.cgi?id=135803 It enables IPMI Lan+ operation; you'll need to add lanplus=1 to the fence device definition. 
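
To illustrate what that change amounts to, here is a minimal sketch of a fence
device entry in /etc/cluster/cluster.conf with the lanplus attribute added.
The device name, IP address, and credentials below are placeholders, not
values from this thread, and the exact attribute set depends on the
fence_ipmilan agent version you end up with:

    <fencedevices>
            <!-- lanplus="1" enables IPMI v2 "lanplus" operation -->
            <fencedevice agent="fence_ipmilan" name="ipmi-node1"
                         ipaddr="10.0.0.50" login="admin" passwd="secret"
                         lanplus="1"/>
    </fencedevices>

The node's <fence>/<method> block then references the device by name as
before; lanplus="1" is the only new piece.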
-- Lon

From lhh at redhat.com  Fri Sep 29 13:59:18 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 09:59:18 -0400
Subject: [Linux-cluster] Realserver configuration using loadbalancer
In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A3F9@MAILBOX0A.psi.ch>
References: <8E2924888511274B95014C2DD906E58AD1A3F9@MAILBOX0A.psi.ch>
Message-ID: <1159538358.27578.15.camel@rei.boston.devel.redhat.com>

On Tue, 2006-09-26 at 16:26 +0200, Huesser Peter wrote:
> Hello
>
> If I run a loadbalancer in front of the webservers (using piranha_gui
> and pulse) is there anything I have to configure on the real webservers?

Not usually, unless you're trying to do direct routing.

-- Lon

From lhh at redhat.com  Fri Sep 29 14:05:08 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 10:05:08 -0400
Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve?
In-Reply-To: <451D1C79.3050601@webbertek.com.br>
References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> <451D1C79.3050601@webbertek.com.br>
Message-ID: <1159538708.27578.21.camel@rei.boston.devel.redhat.com>

On Fri, 2006-09-29 at 10:15 -0300, Celso K. Webber wrote:
> Hello David,
>
> Do you (or someone else) know where I can find documentation about the
> new qdisk mechanism?

I'll assist you -- most of the documentation is in the manual pages:

man qdisk

The only "difficult" part is getting the heuristics right. In your case,
you'd want heuristics which monitor network connectivity, so that when you
pull the cables, the node (which is still alive, despite the fact that it
has lost network connectivity) will remove itself from the cluster, and the
other node will fence it.

The example in the manual page for pinging a router should be of some use,
but you may very well have a better method of determining network
connectivity.

Oh, and to note something which isn't actually mentioned in the manual page
-- the qdisk partition should be around 10 MB. :)

-- Lon

From lhh at redhat.com  Fri Sep 29 14:07:14 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 10:07:14 -0400
Subject: [Linux-cluster] Fencing deadlock under Cluster Suite v4, how to solve?
In-Reply-To: <1159538708.27578.21.camel@rei.boston.devel.redhat.com>
References: <451B514E.4000607@webbertek.com.br> <20060928200100.GC25242@redhat.com> <451D1C79.3050601@webbertek.com.br> <1159538708.27578.21.camel@rei.boston.devel.redhat.com>
Message-ID: <1159538834.27578.24.camel@rei.boston.devel.redhat.com>

On Fri, 2006-09-29 at 10:05 -0400, Lon Hohberger wrote:
> On Fri, 2006-09-29 at 10:15 -0300, Celso K. Webber wrote:
> > Hello David,
> >
> > Do you (or someone else) know where I can find documentation about the
> > new qdisk mechanism?
>
> I'll assist you -- most of the documentation is in the manual pages:

Note: Please keep it on-list for posterity.

-- Lon
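
To make the heuristics advice above concrete, here is a minimal cluster.conf
sketch of a <quorumd> section with a single router-ping heuristic in the
spirit of the man page example. The device path, label, ping target, and
score values are placeholder assumptions, not tested values; man qdisk
remains the authoritative reference:

    <!-- placeholder device/label/router values, assuming the partition was
         initialised for qdisk beforehand -->
    <quorumd interval="1" tko="10" votes="1" min_score="1"
             label="cluster_qdisk" device="/dev/sdb1">
            <heuristic program="ping -c1 -w1 192.168.1.254"
                       score="1" interval="2"/>
    </quorumd>

With this setup, a node that can no longer reach the router fails the
heuristic, drops below min_score, and removes itself from the cluster, which
is the behaviour described above. The partition named in device= only needs
to be on the order of 10 MB.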
From jos at xos.nl  Fri Sep 29 16:30:39 2006
From: jos at xos.nl (Jos Vos)
Date: Fri, 29 Sep 2006 18:30:39 +0200
Subject: [Linux-cluster] IPMI fencing on an IBM x366
In-Reply-To: <1159538301.27578.13.camel@rei.boston.devel.redhat.com>; from lhh@redhat.com on Fri, Sep 29, 2006 at 09:58:21AM -0400
References: <200609251411.k8PEBg406654@xos037.xos.nl> <1159538301.27578.13.camel@rei.boston.devel.redhat.com>
Message-ID: <20060929183039.A8483@xos037.xos.nl>

On Fri, Sep 29, 2006 at 09:58:21AM -0400, Lon Hohberger wrote:
> Maybe this patch would help?
>
> http://bugzilla.redhat.com/bugzilla/attachment.cgi?id=135803
>
> It enables IPMI Lan+ operation; you'll need to add lanplus=1 to the
> fence device definition.

In the meantime I solved the problem. It was a password problem
(PASSW*O*RD vs. PASSW*0*RD :-( ).

--
-- Jos Vos
-- X/OS Experts in Open Systems BV | Phone: +31 20 6938364
-- Amsterdam, The Netherlands      | Fax:   +31 20 6948204

From lhh at redhat.com  Fri Sep 29 20:30:13 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 16:30:13 -0400
Subject: [Linux-cluster] Cannot restart service after "failed" state
In-Reply-To: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch>
References: <8E2924888511274B95014C2DD906E58AD1A30E@MAILBOX0A.psi.ch>
Message-ID: <1159561813.30820.0.camel@rei.boston.devel.redhat.com>

On Fri, 2006-09-22 at 17:42 +0200, Huesser Peter wrote:
> Hello
>
> I have defined a web service (for testing it contains an IP and two
> script resources). It sometimes happens that I produce a failed state
> of the cluster. After this I am not able to restart the service anymore,
> even after a reboot of both cluster members. Do I have to remove some
> kind of "lock" file by hand?

This sounds like bug 208011, FYI.

-- Lon

From lhh at redhat.com  Fri Sep 29 20:42:50 2006
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Sep 2006 16:42:50 -0400
Subject: [Linux-cluster] Disk tie breaker -how does it work?
In-Reply-To: <20060922202843.42656.qmail@web34205.mail.mud.yahoo.com>
References: <20060922202843.42656.qmail@web34205.mail.mud.yahoo.com>
Message-ID: <1159562570.31184.2.camel@rei.boston.devel.redhat.com>

On Fri, 2006-09-22 at 13:28 -0700, Rick Rodgers wrote:
> Does anyone know much about the details of how a disk tiebreaker
> works in a two-member cluster? Or any docs to point to?

http://people.redhat.com/lhh/rhcm-3-internals.odt

(Note: OpenOffice 2.0 format)

It's mostly up to date.

-- Lon

From rodgersr at yahoo.com  Fri Sep 29 21:37:41 2006
From: rodgersr at yahoo.com (Rick Rodgers)
Date: Fri, 29 Sep 2006 14:37:41 -0700 (PDT)
Subject: [Linux-cluster] clurmtabd
In-Reply-To: <1159537225.27578.9.camel@rei.boston.devel.redhat.com>
Message-ID: <20060929213741.97281.qmail@web34212.mail.mud.yahoo.com>

It does not seem to work that way. I tested it and it only got what was
mounted on the specified directory, not the subdirectories. Has this changed
recently (in the last 2 years)?

Lon Hohberger wrote:

On Thu, 2006-09-28 at 10:09 -0700, Rick Rodgers wrote:
> Is there any way to have clurmtabd monitor all the subdirectories
> of a mount point (i.e. specify a parent directory but have nodes
> mounting off some of the subdirectories)? Or do you always have to have
> a clurmtabd running for each subdirectory mount point?

It matches based on the parent mount point, and should sync all
subdirectories present in /var/lib/nfs/rmtab... e.g.

clurmtabd /foo

Clients which mount /foo/bar, /foo/bar/1, etc. should all have entries in
/foo/.clumanager/rmtab

-- Lon

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

---------------------------------
Get your own web address for just $1.99/1st yr. We'll help. Yahoo! Small Business.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rodgersr at yahoo.com  Fri Sep 29 21:39:59 2006
From: rodgersr at yahoo.com (Rick Rodgers)
Date: Fri, 29 Sep 2006 14:39:59 -0700 (PDT)
Subject: [Linux-cluster] Disk tie breaker -how does it work?
In-Reply-To: <1159562570.31184.2.camel@rei.boston.devel.redhat.com>
Message-ID: <20060929213959.98975.qmail@web34209.mail.mud.yahoo.com>

The link to the page cannot be found.

Lon Hohberger wrote:

On Fri, 2006-09-22 at 13:28 -0700, Rick Rodgers wrote:
> Does anyone know much about the details of how a disk tiebreaker
> works in a two-member cluster? Or any docs to point to?

http://people.redhat.com/lhh/rhcm-3-internals.odt

(Note: OpenOffice 2.0 format)

It's mostly up to date.

-- Lon

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

---------------------------------
Do you Yahoo!? Everyone is raving about the all-new Yahoo! Mail.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jotheswaran at renaissance-it.com  Sat Sep 30 06:43:33 2006
From: jotheswaran at renaissance-it.com (Jotheswaran M)
Date: Sat, 30 Sep 2006 12:13:33 +0530
Subject: [Linux-cluster] Red Hat Linux AS 4 U3 Clustering
Message-ID: <7BED60E643BD1C4F8A84E3F0B411C14A0F3F31@srit_mail.renaissance-it.com>

Hi All,

I am new to this forum. I have a problem with Red Hat Linux AS 4 U3
clustering. I have used IBM xSeries 366 servers with two HBAs and DS4300 SAN
storage. I have installed and configured the OS and the clustering without
any issues. I am running Oracle 9i as the database; it has been configured in
the cluster, it works fine, and I can also fail it over and it works fine.

The problem is that if I shut down one server or remove the power cord of one
server, the cluster doesn't switch over, but if I go through a normal
shutdown, the cluster switches over. Can you guys help me to resolve this,
please?

Regards,
Jotheswaran M
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From peter.huesser at psi.ch  Sat Sep 30 07:09:51 2006
From: peter.huesser at psi.ch (Huesser Peter)
Date: Sat, 30 Sep 2006 09:09:51 +0200
Subject: [Linux-cluster] Realserver configuration using loadbalancer
In-Reply-To: <1159538358.27578.15.camel@rei.boston.devel.redhat.com>
Message-ID: <8E2924888511274B95014C2DD906E58A01107956@MAILBOX0A.psi.ch>

> > If I run a loadbalancer in front of the webservers (using piranha_gui
> > and pulse) is there anything I have to configure on the real webservers?
>
> Not usually, unless you're trying to do direct routing.
>

Thanks for your answer. I found the solution and posted it a few days ago.

Pedro