From Alain.Moulle at bull.net Tue Aug 1 06:24:29 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 01 Aug 2006 08:24:29 +0200 Subject: [Linux-cluster] 2-node fencing question Message-ID: <44CEF39D.8000001@bull.net> > Also is there a way to configure fence_ipmilan in cluster.xml to reboot > rather than stop the server? fence_ipmilan by itself takes the -o > option (on,off,reboot) I use fence_ipmilan (with CS4 Update 2) it does at first poweroff AND then poweron ... except if it does not get the off status after the poweroff. (check agent ipmilan.c) Alain Moullé From zachacker at ibh.de Tue Aug 1 06:38:18 2006 From: zachacker at ibh.de (Zachacker, Maik) Date: Tue, 1 Aug 2006 08:38:18 +0200 Subject: [Linux-cluster] 2-node fencing question Message-ID: <0DDD325898FC3C4C88393E540965B37F1E59DB@dcdwa.ibh> >> Also is there a way to configure fence_ipmilan in cluster.xml to reboot >> rather than stop the server? fence_ipmilan by itself takes the -o >> option (on,off,reboot) > > I use fence_ipmilan (with CS4 Update 2) it does at > first poweroff AND then poweron ... except if it does not get > the off status after the poweroff. (check agent ipmilan.c) I use fence_ilo and fence_apc (CS4U3) - both first poweroff and then poweron too. This is only a problem in a two node configuration because both nodes send the poweroff command and none of them can send the poweron command because both are down. Most fence devices have an option or action tag that is not available via the cluster configuration tool. They can be used to force a reboot (default) or a poweroff. Maik Zachacker -- Maik Zachacker IBH Prof. Dr. Horn GmbH, Dresden, Germany From zhiwei at linuxone.myftp.org Tue Aug 1 08:02:17 2006 From: zhiwei at linuxone.myftp.org (zhiwei) Date: Tue, 01 Aug 2006 16:02:17 +0800 Subject: [Linux-cluster] Re: lvm2 liblvm2clusterlock.so on fc5 (Jeff Hardy) In-Reply-To: <20060728141302.E2A9573A0E@hormel.redhat.com> References: <20060728141302.E2A9573A0E@hormel.redhat.com> Message-ID: <1154419337.6987.13.camel@alpha01.mcs.com> > Message: 4 > Date: Thu, 27 Jul 2006 13:52:43 -0400 > From: Jeff Hardy > Subject: [Linux-cluster] lvm2 liblvm2clusterlock.so on fc5 > To: linux clustering > Message-ID: <1154022763.2789.120.camel at fritzdesk.potsdam.edu> > Content-Type: text/plain > > I apologize if this has been answered already or appeared in release > notes somewhere, but I cannot find it. FC4 had the lvm2-cluster package > to provide the clvm locking library. This was removed in FC5 (as > indicated in the release notes). > > Is this still necessary for a clvm setup: > > In /etc/lvm/lvm.conf: > locking_type = 2 > locking_library = "/lib/liblvm2clusterlock.so" > > And if so, where does one find this now? > You can obtain the lvm2 source code from RHCS and recompile it to enable the clvmd option. Clvmd is needed to manage the shared storage and share the LVM information among the cluster members. Zhiwei From stephen.willey at framestore-cfc.com Tue Aug 1 10:40:20 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Tue, 01 Aug 2006 11:40:20 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem Message-ID: <44CF2F94.4000003@framestore-cfc.com> We fscked the filesystem because we'd started seeing the following errors following a power failure.
GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block GFS: fsid=nearlineA:gfs1.0: bh = 2644310219 (type: exp=4, found=5) GFS: fsid=nearlineA:gfs1.0: function = gfs_get_meta_buffer GFS: fsid=nearlineA:gfs1.0: file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dio.c, line = 1223 GFS: fsid=nearlineA:gfs1.0: time = 1154425344 GFS: fsid=nearlineA:gfs1.0: about to withdraw from the cluster GFS: fsid=nearlineA:gfs1.0: waiting for outstanding I/O GFS: fsid=nearlineA:gfs1.0: telling LM to withdraw lock_dlm: withdraw abandoned memory GFS: fsid=nearlineA:gfs1.0: withdrawn And another instance: GFS: fsid=nearlineA:gfs1.1: fatal: filesystem consistency error GFS: fsid=nearlineA:gfs1.1: inode = 2384574146/2384574146 GFS: fsid=nearlineA:gfs1.1: function = dir_e_del GFS: fsid=nearlineA:gfs1.1: file = /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dir.c, line = 1495 GFS: fsid=nearlineA:gfs1.1: time = 1154393717 GFS: fsid=nearlineA:gfs1.1: about to withdraw from the cluster GFS: fsid=nearlineA:gfs1.1: waiting for outstanding I/O GFS: fsid=nearlineA:gfs1.1: telling LM to withdraw lock_dlm: withdraw abandoned memory GFS: fsid=nearlineA:gfs1.1: withdrawn Running gfs_fsck -vvv -y /dev/gfs1_vg/gfs1_lv Returns the following after chewing all the physical and swap RAM. The machines have 4Gb of RAM and 2Gb of swap. We can increase the swap size, but is this just gonna keep running out of RAM? We're running on x86_64 so it can use as much memory as it likes. The filesystem is roughly 45Tb. Initializing fsck Initializing lists... Initializing special inodes... Setting block ranges... Creating a block list of size 11105160192... Unable to allocate bitmap of size 1388145025 Segmentation fault [root at ns1a ~]# gfs_fsck -vvv -y /dev/gfs1_vg/gfs1_lv Initializing fsck Initializing lists... (bio.c:140) Writing to 65536 - 16 4096 Initializing special inodes... (file.c:45) readi: Offset (640) is >= the file size (640). (super.c:208) 8 journals found. (file.c:45) readi: Offset (7116576) is >= the file size (7116576). (super.c:265) 74131 resource groups found. Setting block ranges... Creating a block list of size 11105160192... (bitmap.c:68) Allocated bitmap of size 5552580097 with 2 chunks per byte Unable to allocate bitmap of size 1388145025 (block_list.c:72) - block_list_create() Segmentation fault -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From Alain.Moulle at bull.net Tue Aug 1 11:06:54 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 01 Aug 2006 13:06:54 +0200 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? Message-ID: <44CF35CE.1060700@bull.net> Hi We are facing a big problem of split-brain, due to the fact that the clurgmgrd daemon from RedHat Cluster-Suite unexpectedly disappeared (still for an unknown reason ...) on one node of the HA pair. This caused the other clurgmgrd on the other node to become aware of this and then simply re-start the application service without effective fencing/migration. It seems to be an abnormal behavior, isn't it ? Is there already a fix available in a more recent Update ? Have you any suggestion about this ? Thanks a lot Alain Moullé From kent2004 at gmail.com Tue Aug 1 13:23:47 2006 From: kent2004 at gmail.com (Kent Chen) Date: Tue, 1 Aug 2006 21:23:47 +0800 Subject: [Linux-cluster] hung when 3rd nodes mounting the gfs using dlm Message-ID: I connect 4 SUN x4100 (2 AMD dual core, 2G RAM ) to a SUN Storage 3510 with a silkworm 200e FC switch.
the OS is RHEL 4 U3 for X86_64. I make 2 GFS FS, one called Alpha:gfs1, another called Alpha:gfs2 All things seems good when only 2 nodes mount GFS. Once 3rd node mount the GFS, the command hang. Is there anyone who encounted the similar problem? Is it a bug of GFS? -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Tue Aug 1 16:38:27 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Tue, 01 Aug 2006 11:38:27 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF2F94.4000003@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> Message-ID: <44CF8383.3040208@redhat.com> Stephen Willey wrote: > We fscked the filesystem because we'd started seeing the following > errors following a power failure. > (snip) > We're running on x86_64 so it can use as much memory as it likes. The > filesystem is roughly 45Tb. > Hi Stephen, Yes, this is a problem with gfs_fsck. The problem is, it tries to allocate memory for bitmaps based on the size of the file system. The bitmap structures are used throughout the code, so they're not optional. I'll have to figure out how to do this a better way. Thanks for opening the bugzilla (200883). I'll work on it. Regards, Bob Peterson Red Hat Cluster Suite From stephen.willey at framestore-cfc.com Tue Aug 1 16:32:45 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Tue, 01 Aug 2006 17:32:45 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF8383.3040208@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> Message-ID: <44CF822D.7070705@framestore-cfc.com> Robert Peterson wrote: > > Hi Stephen, > > Yes, this is a problem with gfs_fsck. The problem is, it tries to > allocate memory > for bitmaps based on the size of the file system. The bitmap structures > are used > throughout the code, so they're not optional. I'll have to figure out > how to > do this a better way. Thanks for opening the bugzilla (200883). I'll > work on it. > > Regards, > > Bob Peterson > Red Hat Cluster Suite The fsck is now running after we added the 137Gb swap drive. It appears to consistently chew about 4Gb of RAM (sometimes higher) but it is working (for now). Any ballpark idea of how long it'll take to fsck a 45Tb FS? I know that's a "how long is a piece of string" question, but are we talking hours/days/weeks?
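As a rough back-of-the-envelope check on the memory numbers reported in this thread (assuming the default 4 KB GFS block size; the figures below are derived only from the fsck output quoted above, not from the gfs_fsck source):

   ~45 TB / 4 KB per block                   = about 11.1 billion blocks (the reported block list size of 11105160192)
   11105160192 blocks at 4 bits per block    = about 5.5 GB (the 5552580097-byte bitmap, "2 chunks per byte")
   11105160192 blocks at 1 bit per block     = about 1.4 GB (the 1388145025-byte allocation that failed)

so the in-memory block maps alone need roughly 7 GB, more than the 4 GB of RAM plus 2 GB of swap originally available on these machines.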
Stephen From teigland at redhat.com Tue Aug 1 16:35:57 2006 From: teigland at redhat.com (David Teigland) Date: Tue, 1 Aug 2006 11:35:57 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF2F94.4000003@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> Message-ID: <20060801163557.GD5976@redhat.com> On Tue, Aug 01, 2006 at 11:40:20AM +0100, Stephen Willey wrote: > We fscked the filesystem because we'd started seeing the following > errors following a power failure. > > GFS: fsid=nearlineA:gfs1.0: fatal: invalid metadata block > GFS: fsid=nearlineA:gfs1.0: bh = 2644310219 (type: exp=4, found=5) > GFS: fsid=nearlineA:gfs1.0: function = gfs_get_meta_buffer > GFS: fsid=nearlineA:gfs1.0: file = > /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dio.c, line = 1223 > GFS: fsid=nearlineA:gfs1.0: time = 1154425344 > GFS: fsid=nearlineA:gfs1.0: about to withdraw from the cluster > GFS: fsid=nearlineA:gfs1.0: waiting for outstanding I/O > GFS: fsid=nearlineA:gfs1.0: telling LM to withdraw > lock_dlm: withdraw abandoned memory > GFS: fsid=nearlineA:gfs1.0: withdrawn > > And another instance: > > GFS: fsid=nearlineA:gfs1.1: fatal: filesystem consistency error > GFS: fsid=nearlineA:gfs1.1: inode = 2384574146/2384574146 > GFS: fsid=nearlineA:gfs1.1: function = dir_e_del > GFS: fsid=nearlineA:gfs1.1: file = > /usr/src/redhat/BUILD/gfs-kernel-2.6.9-49/smp/src/gfs/dir.c, line = 1495 > GFS: fsid=nearlineA:gfs1.1: time = 1154393717 > GFS: fsid=nearlineA:gfs1.1: about to withdraw from the cluster > GFS: fsid=nearlineA:gfs1.1: waiting for outstanding I/O > GFS: fsid=nearlineA:gfs1.1: telling LM to withdraw > lock_dlm: withdraw abandoned memory > GFS: fsid=nearlineA:gfs1.1: withdrawn What kind of fencing are you using in the cluster? Trying to understand how this might have happened. Dave From stephen.willey at framestore-cfc.com Tue Aug 1 16:40:48 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Tue, 01 Aug 2006 17:40:48 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <20060801163557.GD5976@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <20060801163557.GD5976@redhat.com> Message-ID: <44CF8410.7040507@framestore-cfc.com> David Teigland wrote: > > What kind of fencing are you using in the cluster? Trying to understand > how this might have happened. > > Dave > We're using STONITH through HP/Compaq ILO. We believe that the corruption was almost certainly caused during a building-wide power failure though. That'll teach us to double-check the UPS setup. -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From rpeterso at redhat.com Tue Aug 1 17:53:02 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Tue, 01 Aug 2006 12:53:02 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF822D.7070705@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> Message-ID: <44CF94FE.3070407@redhat.com> Stephen Willey wrote: > The fsck is now running after we added the 137Gb swap drive. It appears > to consistently chew about 4Gb of RAM (sometimes higher) but it is > working (for now). > > Any ballpark idea of how long it'll take to fsck a 45Tb FS? I know > that's a "how long is a piece of string" question, but are we talking > hours/days/weeks? 
> > Stephen > Hi Stephen, I don't know how long it will take to fsck a 45TB fs, but it wouldn't surprise me if it took several days. It also varies because of hardware differences, and of course if you're going to swap, that might slow it down too. Any way you look at it, 45TB is a lot of data to go through with a fine-tooth comb like gfs_fsck does. The latest RHEL4 U3 version (and up) and recent STABLE and HEAD versions (in CVS) now give you a percent complete number every second during the more lengthy passes, such as pass5. When it finishes, can you post something on the list to let us know? We've tried to kick around ideas on how to improve the speed, such as (1) adding an option to only focus on areas where the journals are dirty, (2) introducing multiple threads to process the different RGs, and even (3) trying to get multiple nodes in the cluster to team up and do different areas of the file system. None of these have been implemented yet because of higher priorities. Since this is an open-source project, anyone could step in and do these. Volunteers? Regards, Bob Peterson Red Hat Cluster Suite From Leonardo.Mello at planejamento.gov.br Tue Aug 1 16:24:01 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Tue, 1 Aug 2006 13:24:01 -0300 Subject: [Linux-cluster] hung when 3rd nodes mounting the gfs using dlm Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B55@corp-bsa-mp01.planejamento.gov.br> What command had you used to create the gfs filesystem ? GFS need one journal for each server that mount the filesystem. If you have created the filesystem only with 2 journals, you won't be able to use more than two machines. The is used among other things to restore the filesystem in case of server failure. If gfs in the current architeture permits you to mount in more servers than the number of journals you have, will damage the filesystem, maybe this is one of the reasons for gfs block access from more servers than the number of journals. 01 - To specify the number of journals at filesystem creation: with the option -j number at mkfs.gfs where number is the number of machines. for 4 machines the option will be: -j 4 02 - To increase the number of journals in filesystem that has been already created: (possibly that is your case) for this task exist the tool gfs_jadd. see it manpage to use this tool, you need to mount the gfs filesystem in the machine that will increase the number of journal. gfs_jadd -j number_to_increase /gfs/filesystem/mount/point number_to_increase must be how many journals you want to add to that filesystem, by default this number is 1. in your case with four servers: (you already have 2 journals, could be like: gfs_jadd -j 2 /gfs/filesystem/mount/point some times gfs_jadd doesnt work, because there isnt space in the disk for the journal creation. in that case the best solution i know is to format the filesystem specifying the correct number of journals. Best Regards Leonardo Rodrigues de Mello -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Kent Chen Sent: ter 1/8/2006 10:23 To: linux-cluster at redhat.com Cc: Subject: [Linux-cluster] hung when 3rd nodes mounting the gfs using dlm I connect 4 SUN x4100 (2 AMD dual core, 2G RAM ) to a SUN Storage 3510 with a silkworm 200e FC switch. the OS is RHEL 4 U3 for X86_64. I make 2 GFS FS,one called Alpha:gfs1, another called Alpha:gfs2 All things seems good when only 2 nodes mount GFS. Once 3rd node mount the GFS, the command hang. 
Is there anyone who encounted the similar problem? Is it a bug of GFS? -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3755 bytes Desc: not available URL: From mykleb at no.ibm.com Tue Aug 1 20:16:42 2006 From: mykleb at no.ibm.com (Jan-Frode Myklebust) Date: Tue, 1 Aug 2006 22:16:42 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster References: <44CE15B1.9010603@fiocruz.br> Message-ID: On 2006-07-31, Nicholas Anderson wrote: > I'm new to clustering and was wondering what would be the best solution > when clustering an email server. > > Today we've 1 server with a storage where all mailboxes (mbox format) For clustering, I think it would be better to use Maildir-format for the mailboxes. Then you'll avoid any locking problems on the mailboxes. New messages can be delivered on one machine while other messages in the same mail-folder is being deleted on another machine. If your users are only accessing their email by pop/imap, moving to Maildir shouldn't be any issue. > and home dirs are stored. > I'm planning to use 3 or 4 nodes running imap, pop and smtp, all of them > sharing users' data. > > Should I use NFS or GFS? NFS is very single-point-of-failure.. so definately a clusterfs/GFS. If you can move to Maildir, you should be able to run any number of servers where each server is running all services (imap, pop and smtp), and incoming traffic is routed to a random server trough f.ex. round robin dns. To handle single-node downtime/crash, you'll just need to move the ip-address to an available node. Easily achivable trough f.ex. heartbeat from linux-ha.org, and probably also RH Cluster Suite.. -jf From hyperbaba at neobee.net Wed Aug 2 06:49:24 2006 From: hyperbaba at neobee.net (Vladimir Grujic) Date: Wed, 2 Aug 2006 08:49:24 +0200 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF822D.7070705@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> Message-ID: <200608020849.25035.hyperbaba@neobee.net> On Tuesday 01 August 2006 18:32, Stephen Willey wrote: > Robert Peterson wrote: > > Hi Stephen, > > > > Yes, this is a problem with gfs_fsck. The problem is, it tries to > > allocate memory > > for bitmaps based on the size of the file system. The bitmap structures > > are used > > throughout the code, so they're not optional. I'll have to figure out > > how to > > do this a better way. Thanks for opening the bugzilla (200883). I'll > > work on it. > > > > Regards, > > > > Bob Peterson > > Red Hat Cluster Suite > > The fsck is now running after we added the 137Gb swap drive. It appears > to consistently chew about 4Gb of RAM (sometimes higher) but it is > working (for now). > > Any ballpark idea of how long it'll take to fsck a 45Tb FS? I know > that's a "how long is a piece of string" question, but are we talking > hours/days/weeks? > > Stephen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster It took 55 hours for all 7 passes on my 1TB partition (with alot of files on it) . partition resided on raid 10 sata storage. Does anyone else have execution times for gfs_gsck ? 
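To make the journal-count procedure Leonardo described a few messages up concrete, here is a minimal command sketch (the device path and mount point are hypothetical; adjust them to your own volume layout):

   # create the filesystem with one journal per node that will mount it (4 nodes here)
   mkfs.gfs -p lock_dlm -t Alpha:gfs1 -j 4 /dev/vg01/gfs1_lv

   # or add two more journals to an existing filesystem, run on a node where it is mounted
   gfs_jadd -j 2 /mnt/gfs1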
From stephen.willey at framestore-cfc.com Wed Aug 2 09:51:05 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Wed, 02 Aug 2006 10:51:05 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <200608020849.25035.hyperbaba@neobee.net> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <200608020849.25035.hyperbaba@neobee.net> Message-ID: <44D07589.7090507@framestore-cfc.com> > On Tuesday 01 August 2006 18:32, Stephen Willey wrote: >> The fsck is now running after we added the 137Gb swap drive. It appears >> to consistently chew about 4Gb of RAM (sometimes higher) but it is >> working (for now). >> >> Any ballpark idea of how long it'll take to fsck a 45Tb FS? I know >> that's a "how long is a piece of string" question, but are we talking >> hours/days/weeks? >> >> Stephen >> Is there any way we can determine the progress during all passes? At the moment all we're seeing is lines like the following: (pass1.c:213) Setting 557096777 to data block Is this representative of simply the numbers of blocks in the filesystem? If so, how do we get the numbers of blocks in the filesystem while the fsck is running? We use this FS for backups and we're currently determining whether we'd be better off just wiping it and re-syncing all our data (which would take a couple of days, not several) so unless we can get a reliable indication of how long this will take, we probably won't finish it. -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From f.hackenberger at mediatransfer.com Wed Aug 2 11:03:15 2006 From: f.hackenberger at mediatransfer.com (Falk Hackenberger - MediaTransfer AG Netresearch & Consulting) Date: Wed, 02 Aug 2006 13:03:15 +0200 Subject: [Linux-cluster] clurgmgrd stops service without reason Message-ID: <44D08673.3010207@mediatransfer.com> Hello, we have running cs4 config with 2 nodes. for debuging one node is offline. so it is running only on 1 node. Now we have the problem that clurgmgrd stops the services wich he provides without recognizable reason. we have log_level 7 on the cman and on rm -lines in cluster.conf but the reason of stoping the service is not recognizable. I see in the logfile entries as: --snip-- Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing /exports/imap/checkimapstartup.sh status Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing /exports/subversion/etc/rc.d/init.d/svnserver status Aug 1 17:31:28 kain clurgmgrd: [4780]: Checking 192.168.0.223, Level 0 Aug 1 17:31:28 kain clurgmgrd: [4780]: 192.168.0.223 present on eth0 Aug 1 17:31:28 kain clurgmgrd: [4780]: Link for eth0: Detected Aug 1 17:31:28 kain clurgmgrd: [4780]: Link detected on eth0 Aug 1 17:31:37 kain clurgmgrd[4780]: Stopping service storage --snap-- how to say to clurgmgrd, that he should log the reason for stoping the service? any other hints ? thanks falk From Michael.Roethlein at ri-solution.com Wed Aug 2 12:09:29 2006 From: Michael.Roethlein at ri-solution.com (=?iso-8859-1?Q?R=F6thlein_Michael_=28RI-Solution=29?=) Date: Wed, 2 Aug 2006 14:09:29 +0200 Subject: [Linux-cluster] Tracing gfs problems Message-ID: <992633B6A0E42B49BC5A41C10A8C841B030E29B8@MUCEX004.root.local> Hello, In the past there occured hangs resulting in reboots of our 4 node cluster. The real problem is: there aren't any traces in the log files of the nodes. Is there a possibilty to raise the verbosity of gfs? 
Thanks Michael From rpeterso at redhat.com Wed Aug 2 14:27:32 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 02 Aug 2006 09:27:32 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44D07589.7090507@framestore-cfc.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <200608020849.25035.hyperbaba@neobee.net> <44D07589.7090507@framestore-cfc.com> Message-ID: <44D0B654.5010508@redhat.com> Stephen Willey wrote: > Is there any way we can determine the progress during all passes? At > the moment all we're seeing is lines like the following: > > (pass1.c:213) Setting 557096777 to data block > > Is this representative of simply the numbers of blocks in the > filesystem? If so, how do we get the numbers of blocks in the > filesystem while the fsck is running? > > We use this FS for backups and we're currently determining whether we'd > be better off just wiping it and re-syncing all our data (which would > take a couple of days, not several) so unless we can get a reliable > indication of how long this will take, we probably won't finish it. Hi Stephen, The latest gfs_fsck will report the percent complete for passes 1 and 5, which take the longest. It sounds like you're running it in verbose mode (i.e. with -v) which is going to do a lot of unnecessary I/O to stdout and will slow it down considerably. If you're redirecting stdout, you can do a 'grep "percent complete" /your/stdout | tail' or something similar to figure out how far along it is with that pass. Only passes 1 and 5 go block-by-block and therefore it's easy to figure out how far they've gotten. For the other passes, it would be difficult to estimate their progress, and probably not worth the overhead in terms of time the computer would have to spend figuring it out. You can get it to go faster by restarting it without the -v, but then it will have to re-do all the work it's already done to this point. Based on what you've told me, it probably will take longer to fsck than you're willing to wait. Regards, Bob Peterson Red Hat Cluster Suite From stephen.willey at framestore-cfc.com Wed Aug 2 14:19:56 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Wed, 02 Aug 2006 15:19:56 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44D0B654.5010508@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <200608020849.25035.hyperbaba@neobee.net> <44D07589.7090507@framestore-cfc.com> <44D0B654.5010508@redhat.com> Message-ID: <44D0B48C.4080004@framestore-cfc.com> Robert Peterson wrote: > Hi Stephen, > > The latest gfs_fsck will report the percent complete for passes 1 and 5, > which take the longest. It sounds like you're running it in verbose mode > (i.e. with -v) which is going to do a lot of unnecessary I/O to stdout and > will slow it down considerably. If you're redirecting stdout, you can > do a 'grep "percent complete" /your/stdout | tail' or something similar to > figure out how far along it is with that pass. > > Only passes 1 and 5 go block-by-block and therefore it's easy to figure > out how far they've gotten. For the other passes, it would be difficult to > estimate their progress, and probably not worth the overhead in terms > of time the computer would have to spend figuring it out. 
> > You can get it to go faster by restarting it without the -v, but then it > will > have to re-do all the work it's already done to this point. > > Based on what you've told me, it probably will take longer to fsck than > you're willing to wait. > Regards, We have restarted it without the -vs and it does appear to be progressing much faster. We'll give it a while... -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From danwest at comcast.net Wed Aug 2 15:50:16 2006 From: danwest at comcast.net (danwest at comcast.net) Date: Wed, 02 Aug 2006 15:50:16 +0000 Subject: [Linux-cluster] 2-node fencing question Message-ID: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> It seems like a significant problem to have fence_ipmilan issue a power-off followed by a power-on with a 2 node cluster. As described both nodes power-off and are then unable to issue the required power-on. Does anyone know a solution to this? This seems to make a 2-node cluster with ipmi fencing pointless. It looks like fence_ipmilan needs to support sending a cycle instead of a poweroff than a poweron? According to fence_ipmilan.c it looks like cycle is not an option although it is an option for ipmitool. (ipmitool -H -U -P chassis power cycle) >From fence_ipmilan.c: switch(op) { case ST_POWERON: snprintf(arg, sizeof(arg), "%s chassis power on", cmd); break; case ST_POWEROFF: snprintf(arg, sizeof(arg), "%s chassis power off", cmd); break; case ST_STATUS: snprintf(arg, sizeof(arg), "%s chassis power status", cmd); break; } Thanks, Dan -------------- Original message ---------------------- From: "Zachacker, Maik" > >> Also is there a way to configure fence_ipmilan in cluster.xml to > reboot > >> rather than stop the server? fence_ipmilan by itself takes the -o > >> option (on,off,reboot) > > > > I use fence_ipmilan (with CS4 Update 2) it does at > > first poweroff AND then poweron ... except if it does not get > > the off status after the poweroff. (check agent ipmilan.c) > > I use fence_ilo and fence_apc (CS4U3) - both first poweroff and then > poweron too. This is only a problem in a two node configuration because > both nodes send the poweroff command and non of them can send the > poweron command because both are down. > > The most fence-devices have an option or action tag, that is not > available via the cluster configuration tool. They can be used to force > a reboot (default) or an poweroff. > > > > > Maik Zachacker > -- > Maik Zachacker > IBH Prof. Dr. Horn GmbH, Dresden, Germany > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From nicholas at fiocruz.br Wed Aug 2 18:43:47 2006 From: nicholas at fiocruz.br (Nicholas Anderson) Date: Wed, 02 Aug 2006 15:43:47 -0300 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: References: <44CE15B1.9010603@fiocruz.br> Message-ID: <44D0F263.509@fiocruz.br> Hi Jan, I'm searching in google how to convert from mbox to maildir using sendmail/procmail .... i have 3000+ users and something like 70GB of emails and I'll have to test it very well before doing in the production server .... As soon as i get this things working fine, i'll try gfs and the other cluster stuff ..... I'm thinking on doing something like you said ... 3 nodes running imap/pop/smtp sharing one filesystem probably with gfs where user data will be stored..... 
I was running slackware but now I'm thinking about something like redhat or centos (will depend on our budget :-) ) to the nodes .... It'll be easier to keep them up2dated :-) Any new tips are welcome :-) thanks, Nick Jan-Frode Myklebust wrote: > On 2006-07-31, Nicholas Anderson wrote: > > For clustering, I think it would be better to use Maildir-format > for the mailboxes. Then you'll avoid any locking problems on the > mailboxes. New messages can be delivered on one machine while other > messages in the same mail-folder is being deleted on another machine. > > If your users are only accessing their email by pop/imap, moving to > Maildir shouldn't be any issue. > > NFS is very single-point-of-failure.. so definately a clusterfs/GFS. > If you can move to Maildir, you should be able to run any number of > servers where each server is running all services (imap, pop and smtp), > and incoming traffic is routed to a random server trough f.ex. round > robin dns. > > To handle single-node downtime/crash, you'll just need to move the > ip-address to an available node. Easily achivable trough f.ex. > heartbeat from linux-ha.org, and probably also RH Cluster Suite.. > > > -jf > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Nicholas Anderson Administrador de Sistemas Unix LPIC-1 Certified Rede Fiocruz e-mail: nicholas at fiocruz.br From mykleb at no.ibm.com Wed Aug 2 19:43:33 2006 From: mykleb at no.ibm.com (Jan-Frode Myklebust) Date: Wed, 2 Aug 2006 21:43:33 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> Message-ID: On 2006-08-02, Nicholas Anderson wrote: > > I'm searching in google how to convert from mbox to maildir using > sendmail/procmail .... At my previous job we changed from exim/uw-imap on mbox, to exim/docevot on maildir a couple of years ago. Didn't use a cluster-fs, only SCSI-based disk failover. For about 500-users. Right now I'm setting up a similar solution to your... trying to support up to 200.000 users on a 5 node cluster, using IBM GPFS. If sendmail is using procmail to do final mailbox-delivery, I think the configuration change should be primarily putting a '/' at the end of the path, as that should instruct procmail to do maildir-style delivery. At least that's how I've been doing it in my ~/.procmailrc. Ref. 'man procmailrc'. > i have 3000+ users and something like 70GB of > emails and I'll have to test it very well before doing in the > production server .... Sure.. There are a few mbox2maildir converters.. You should probably try a few of them and verify that they all give the same result. Another thing to check is that your cluster-fs handles your load well. My main consern would be how well GFS performs on maildir-style folders, as most cluster-fs's I've seen are optimized for large file streaming I/O. If possible, try to keep a lot of file-metadata in cache so that you don't have to go to disk every time someone check their maildir for new messages. -jf From riaan at obsidian.co.za Wed Aug 2 22:27:32 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Thu, 03 Aug 2006 00:27:32 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> Message-ID: <44D126D4.2080303@obsidian.co.za> Jan-Frode Myklebust wrote: > On 2006-08-02, Nicholas Anderson wrote: >> I'm searching in google how to convert from mbox to maildir using >> sendmail/procmail .... 
> > At my previous job we changed from exim/uw-imap on mbox, > to exim/docevot on maildir a couple of years ago. Didn't use > a cluster-fs, only SCSI-based disk failover. For about 500-users. > > Right now I'm setting up a similar solution to your... trying > to support up to 200.000 users on a 5 node cluster, using IBM GPFS. > > If sendmail is using procmail to do final mailbox-delivery, I > think the configuration change should be primarily putting a '/' > at the end of the path, as that should instruct procmail to > do maildir-style delivery. At least that's how I've been doing > it in my ~/.procmailrc. Ref. 'man procmailrc'. > >> i have 3000+ users and something like 70GB of >> emails and I'll have to test it very well before doing in the >> production server .... > > Sure.. There are a few mbox2maildir converters.. You should probably > try a few of them and verify that they all give the same result. > > Another thing to check is that your cluster-fs handles your load > well. My main consern would be how well GFS performs on > maildir-style folders, as most cluster-fs's I've seen are optimized > for large file streaming I/O. If possible, try to keep a lot of > file-metadata in cache so that you don't have to go to disk every > time someone check their maildir for new messages. > We are running 700 000 users on a 2.5 GFS, 4 nodes, with POP, IMAP (direct access and SquirrelmMail) and SMTP. To make things worse, we use NFS between our GFS nodes and our mail servers. We initially had huge performance problems in our setup, which I wrote in this message: http://www.redhat.com/archives/linux-cluster/2006-July/msg00136.html We ended up bumping the spindle count from 36 to 60 and then to 114, without it making a noticeable difference. Our main killer was Squirrelmail over IMAP (the solution is primarily a webmail-based one) Our performance problems were solved by the following: - removing the folder-size plugin (built-in) and mail quota plugin (3rd party) reduced the traffic between IMAP servers and storage backend by 40%. - Implement imap proxy (www.imapproxy.org). This is giving us a 1 to 14 hit ratio. This storage which could not keep up previously, is now humming along fine. Our initial mistake was to try and optimise on the FS layer (there werent any real performance optimizations in our setup to be made) and throw hardware at the problem, instead of suspecting and optimizing our application. Despite GFS not being designed for lots of small files, and not recommended for use with NFS, with the above changes, it performs more than adequately. We hope to see another performance gain once we get rid of the NFS and have our mail servers access the GFS directly. Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From Hansjoerg.Maurer at dlr.de Thu Aug 3 07:17:31 2006 From: Hansjoerg.Maurer at dlr.de (=?ISO-8859-15?Q?Hansj=F6rg_Maurer?=) Date: Thu, 03 Aug 2006 09:17:31 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> Message-ID: <44D1A30B.4060200@dlr.de> Hi we had some problems with cyrus-imap on top of gpfs a year ago concerning mmap files in a simple failover environment. (see the gpfs-mailinglist) But with recent versions it works. Greetings Hansjoerg > >Right now I'm setting up a similar solution to your... trying >to support up to 200.000 users on a 5 node cluster, using IBM GPFS. 
> >If sendmail is using procmail to do final mailbox-delivery, I >think the configuration change should be primarily putting a '/' >at the end of the path, as that should instruct procmail to >do maildir-style delivery. At least that's how I've been doing >it in my ~/.procmailrc. Ref. 'man procmailrc'. > > > >>i have 3000+ users and something like 70GB of >>emails and I'll have to test it very well before doing in the >>production server .... >> >> > >Sure.. There are a few mbox2maildir converters.. You should probably >try a few of them and verify that they all give the same result. > >Another thing to check is that your cluster-fs handles your load >well. My main consern would be how well GFS performs on >maildir-style folders, as most cluster-fs's I've seen are optimized >for large file streaming I/O. If possible, try to keep a lot of >file-metadata in cache so that you don't have to go to disk every >time someone check their maildir for new messages. > > > -jf > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > -- _________________________________________________________________ Dr. Hansjoerg Maurer | LAN- & System-Manager | Deutsches Zentrum | DLR Oberpfaffenhofen f. Luft- und Raumfahrt e.V. | Institut f. Robotik | Postfach 1116 | Muenchner Strasse 20 82230 Wessling | 82234 Wessling Germany | | Tel: 08153/28-2431 | E-mail: Hansjoerg.Maurer at dlr.de Fax: 08153/28-1134 | WWW: http://www.robotic.dlr.de/ __________________________________________________________________ There are 10 types of people in this world, those who understand binary and those who don't. From mykleb at no.ibm.com Thu Aug 3 11:42:45 2006 From: mykleb at no.ibm.com (Jan-Frode Myklebust) Date: Thu, 3 Aug 2006 13:42:45 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> <44D1A30B.4060200@dlr.de> Message-ID: On 2006-08-03, Hansj?rg Maurer wrote: > > we had some problems with cyrus-imap on top of gpfs a year ago > concerning mmap files in a simple failover environment. > (see the gpfs-mailinglist) > But with recent versions it works. Yes, I remember your posting.. and AFAICT you solved it by turning off mmap in Cyrus. https://lists.sdsc.edu/mailman/private.cgi/gpfs-general/2005q4/000040.html Did you consider GFS for this project ? Or are you now looking at GFS for the same project ? We're using courier-imap, which as far as I can tell doesn't use mmap, so we shouldn't hit this problem. Otherwise there is always the mmap-invalidate patch that should solve this... -jf From singh.rajeshwar at gmail.com Thu Aug 3 12:41:08 2006 From: singh.rajeshwar at gmail.com (Rajesh singh) Date: Thu, 3 Aug 2006 18:11:08 +0530 Subject: [Linux-cluster] fencing agent Message-ID: Hi all, We are in process of procurring a fencing device and we have been suggested by our hardware vendor to use fencing device as mentioned in below URL. http://www.supermicro.com/products/accessories/addon/AOC-IPMI20-E.cfm My setup is that I am using 2 node AMD servers on rhel4 u2 in clustered mode. I am not using gfs, but i am putting fencing device. My querry is that, can I use the *AOC-IPMI20-E card as an fencing device. regards * -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From riaan at obsidian.co.za Thu Aug 3 14:21:28 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Thu, 03 Aug 2006 16:21:28 +0200 Subject: [Linux-cluster] fencing agent In-Reply-To: References: Message-ID: <44D20668.6070901@obsidian.co.za> Rajesh singh wrote: > Hi all, > We are in process of procurring a fencing device and we have been > suggested by our hardware vendor to use fencing device as mentioned in > below URL. > http://www.supermicro.com/products/accessories/addon/AOC-IPMI20-E.cfm > My setup is that I am using 2 node AMD servers on rhel4 u2 in > clustered mode. > I am not using gfs, but i am putting fencing device. > My querry is that, can I use the *AOC-IPMI20-E card as an fencing device. > > regards According to the above URL, this card supports IPMI 2, which means that it should work with the fence_ipmilan fencing module in RHCS 4. Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From mwill at penguincomputing.com Thu Aug 3 15:29:36 2006 From: mwill at penguincomputing.com (Michael Will) Date: Thu, 3 Aug 2006 08:29:36 -0700 Subject: [Linux-cluster] fencing agent Message-ID: <433093DF7AD7444DA65EFAFE3987879C125E83@jellyfish.highlyscyld.com> Or you buy systems that come with ipmi on the mainboard. -----Original Message----- From: Rajesh singh [mailto:singh.rajeshwar at gmail.com] Sent: Thu Aug 03 07:01:25 2006 To: linux-cluster at redhat.com Subject: [Linux-cluster] fencing agent Hi all, We are in process of procurring a fencing device and we have been suggested by our hardware vendor to use fencing device as mentioned in below URL. http://www.supermicro.com/products/accessories/addon/AOC-IPMI20-E.cfm My setup is that I am using 2 node AMD servers on rhel4 u2 in clustered mode. I am not using gfs, but i am putting fencing device. My querry is that, can I use the *AOC-IPMI20-E card as an fencing device. regards * -------------- next part -------------- An HTML attachment was scrubbed... URL: From raycharles_man at yahoo.com Thu Aug 3 15:55:04 2006 From: raycharles_man at yahoo.com (Ray Charles) Date: Thu, 3 Aug 2006 08:55:04 -0700 (PDT) Subject: [Linux-cluster] Logging for cluster errors. Message-ID: <20060803155505.26305.qmail@web32108.mail.mud.yahoo.com> Hi, Easy question. When I run system-config-cluster I am able to see the gui. But in the event there's an error while using the gui where does that get logged? I've seen an error from the gui and it says check logging. I didn't see a /var/log/cluster/ -TIA __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From rpeterso at redhat.com Thu Aug 3 16:20:03 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Thu, 03 Aug 2006 11:20:03 -0500 Subject: [Linux-cluster] Logging for cluster errors. In-Reply-To: <20060803155505.26305.qmail@web32108.mail.mud.yahoo.com> References: <20060803155505.26305.qmail@web32108.mail.mud.yahoo.com> Message-ID: <44D22233.5030403@redhat.com> Ray Charles wrote: > > Hi, > > Easy question. > > When I run system-config-cluster I am able to see the > gui. But in the event there's an error while using > the gui where does that get logged? > > I've seen an error from the gui and it says check > logging. I didn't see a /var/log/cluster/ > > -TIA > Hi Ray, Usually the messages go into /var/log/messages. 
Many of the messages can be redirected to other places by changing the cluster.conf file, so the code won't tell you specifically where to look. Regards, Bob peterson Red Hat Cluster Suite From jparsons at redhat.com Thu Aug 3 17:06:26 2006 From: jparsons at redhat.com (James Parsons) Date: Thu, 03 Aug 2006 13:06:26 -0400 Subject: [Linux-cluster] Logging for cluster errors. In-Reply-To: <44D22233.5030403@redhat.com> References: <20060803155505.26305.qmail@web32108.mail.mud.yahoo.com> <44D22233.5030403@redhat.com> Message-ID: <44D22D12.2060501@redhat.com> Robert Peterson wrote: > Ray Charles wrote: > >> >> Hi, >> >> Easy question. >> >> When I run system-config-cluster I am able to see the >> gui. But in the event there's an error while using >> the gui where does that get logged? >> I've seen an error from the gui and it says check >> logging. I didn't see a /var/log/cluster/ >> >> -TIA >> > > Hi Ray, > > Usually the messages go into /var/log/messages. > Many of the messages can be redirected to other places by > changing the cluster.conf file, so the code won't tell you > specifically where to look. > > Regards, > > Bob peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Ray, What is the nature of the error that you are seeing? -J From rpeterso at redhat.com Thu Aug 3 18:44:31 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Thu, 03 Aug 2006 13:44:31 -0500 Subject: [Linux-cluster] Tracing gfs problems In-Reply-To: <992633B6A0E42B49BC5A41C10A8C841B030E29B8@MUCEX004.root.local> References: <992633B6A0E42B49BC5A41C10A8C841B030E29B8@MUCEX004.root.local> Message-ID: <44D2440F.40408@redhat.com> R?thlein Michael (RI-Solution) wrote: > Hello, > > In the past there occured hangs resulting in reboots of our 4 node cluster. The real problem is: there aren't any traces in the log files of the nodes. > > Is there a possibilty to raise the verbosity of gfs? > > Thanks > > Michael > Hi Michael, Right now, there's no way to increase the level of verbosity or logging in the gfs kernel code, but I'm not sure that would help you anyway. The lockup could be in any part of the kernel: GFS, The DLM/Gulm locking infrastructure, or any other part for that matter. It could also be hardware related or running out of memory, etc. Your best bet may be to temporarily disable fencing so that the hung node(s) don't get fenced as soon as it happens, for example by changing it to manual fencing, and then when it hangs, check for dmesgs on the console, syslog messages in /var/log/messages and if you can't get a command prompt, use the "magic sysreq" key to dump out what each module, thread and process is doing. If that doesn't tell you where the problem is, you can send the info to this list or create a bugzilla for the problem and attach the output from the sysrq, along with details on what release of code you're using, your cluster.conf, etc. Here are simple instructions for using the "magic sysrq" in case you're unfamiliar: 1. Turn it on by doing: echo "1" > /proc/sys/kernel/sysrq 2. Recreate your kernel hang 3. 
If you're at the system console with a keyboard, do alt-sysrq t (task list) If you have a telnet console instead, do ctrl-] to get telnet> prompt telnet> send brk (send a break char) t (task list) If you don't have a keyboard or telnet, but do have a shell: echo "t" > /proc/sysrq-trigger If you're doing it from a minicom, use: f followed by t (For other types of serial consoles, you have to get it to send a break, then letter t) 4. The task info will be dumped to the console, so hopefully you have a way to save that off. Regards, Bob Peterson Red Hat Cluster Suite From lhh at redhat.com Thu Aug 3 18:36:39 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 03 Aug 2006 14:36:39 -0400 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <44CF35CE.1060700@bull.net> References: <44CF35CE.1060700@bull.net> Message-ID: <1154630199.28677.18.camel@ayanami.boston.redhat.com> On Tue, 2006-08-01 at 13:06 +0200, Alain Moulle wrote: > Hi > > We are facing a big problem of split-brain, due to the fact > that the process and Clurgmgrd daemon from RedHat Cluster-Suite unexpectedly > disappeared (still for an unknown reason ...) on one of the HA-Nodes pair. This > caused the other Clurgmgrd on the other Node to be aware of this and then simply > to re-start the application service without effective fenceing/migration. > > It seems to be an abnormal behavior, isn't it ? > > Is there a already a fix available in more recent Update ? Fixed in U4 beta; there were two problems: (a) a segfault, and (b) missing inclusion of Stanko Kupcevic's self-monitoring in clurgmgrd. -- Lon From lhh at redhat.com Thu Aug 3 18:38:44 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 03 Aug 2006 14:38:44 -0400 Subject: [Linux-cluster] clurgmgrd stops service without reason In-Reply-To: <44D08673.3010207@mediatransfer.com> References: <44D08673.3010207@mediatransfer.com> Message-ID: <1154630324.28677.21.camel@ayanami.boston.redhat.com> On Wed, 2006-08-02 at 13:03 +0200, Falk Hackenberger - MediaTransfer AG Netresearch & Consulting wrote: > --snip-- > Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing > /exports/imap/checkimapstartup.sh status > Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing > /exports/subversion/etc/rc.d/init.d/svnserver status > Aug 1 17:31:28 kain clurgmgrd: [4780]: Checking 192.168.0.223, > Level 0 > Aug 1 17:31:28 kain clurgmgrd: [4780]: 192.168.0.223 present on > eth0 > Aug 1 17:31:28 kain clurgmgrd: [4780]: Link for eth0: Detected > Aug 1 17:31:28 kain clurgmgrd: [4780]: Link detected on eth0 > Aug 1 17:31:37 kain clurgmgrd[4780]: Stopping service storage > --snap-- > > how to say to clurgmgrd, that he should log the reason for stoping the > service? Something must be returning an error code where it should not be; can you post your service XML blob? -- Lon From lhh at redhat.com Thu Aug 3 18:39:29 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 03 Aug 2006 14:39:29 -0400 Subject: [Linux-cluster] fencing agent In-Reply-To: <44D20668.6070901@obsidian.co.za> References: <44D20668.6070901@obsidian.co.za> Message-ID: <1154630369.28677.23.camel@ayanami.boston.redhat.com> On Thu, 2006-08-03 at 16:21 +0200, Riaan van Niekerk wrote: > According to the above URL, this card supports IPMI 2, which means that > it should work with the fence_ipmilan fencing module in RHCS 4. It should work with RHCS4, since we're just calling ipmitool. 
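For anyone wanting to sanity-check such a card by hand before wiring it into cluster.conf, the calls fence_ipmilan makes boil down to plain ipmitool commands along these lines (the BMC address and credentials here are made up):

   # what the agent does today: off, confirm, then on
   ipmitool -I lan -H 10.0.0.42 -U admin -P secret chassis power off
   ipmitool -I lan -H 10.0.0.42 -U admin -P secret chassis power status
   ipmitool -I lan -H 10.0.0.42 -U admin -P secret chassis power on

   # the single-shot reboot variant discussed in the 2-node fencing thread
   ipmitool -I lan -H 10.0.0.42 -U admin -P secret chassis power cycle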
-- Lon From lhh at redhat.com Thu Aug 3 19:25:46 2006 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 03 Aug 2006 15:25:46 -0400 Subject: [Linux-cluster] 2-node fencing question In-Reply-To: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> References: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> Message-ID: <1154633146.28677.70.camel@ayanami.boston.redhat.com> Sorry I didn't see this earlier! On Wed, 2006-08-02 at 15:50 +0000, danwest at comcast.net wrote: > It seems like a significant problem to have fence_ipmilan issue a power-off followed by a power-on with a 2 node cluster. Generally, the chances of this occurring are very, very small, though not impossible. However, it could very well be that IPMI hardware modules are slow enough at processing requests that this could pose a problem. What hardware has this happened on? Was ACPI disabled on boot in the host OS (it should be; see below)? > This seems to make a 2-node cluster with ipmi fencing pointless. I'm pretty sure that 'both-nodes-off problem' can only occur if all of the following criteria are met: (a) while using a separate NICs for IPMI and cluster traffic (the recommended configuration), (b) in the event of a network partition, such that both nodes can not see each other but can see each other's IPMI port, and (c) if both nodes send their power-off packets at or near the exact same time. The time window for (c) increases significantly (5+ seconds) if the cluster nodes are enabling ACPI power events on boot. This is one of the reasons why booting with acpi=off is required when using IPMI, iLO, or other integrated power management solutions. If booting with acpi=off, does the problem persist? > It looks like fence_ipmilan needs to support sending a cycle instead of a poweroff than a poweron? The reason fence_ipmilan functions this way (off, status, on) is because that we require a confirmation that the node has lost power. I am not sure that it is possible to confirm the node has rebooted using IPMI. Arguably, it also might not be necessary to make such a confirmation in this particular case. > According to fence_ipmilan.c it looks like cycle is not an option although it is an option for ipmitool. (ipmitool -H -U -P chassis power cycle) Looks like you're on the right track. -- Lon From nicholas at fiocruz.br Thu Aug 3 20:27:03 2006 From: nicholas at fiocruz.br (Nicholas Anderson) Date: Thu, 03 Aug 2006 17:27:03 -0300 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: <44D126D4.2080303@obsidian.co.za> References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> <44D126D4.2080303@obsidian.co.za> Message-ID: <44D25C17.8000004@fiocruz.br> Hi again all ..... I guess i'm starting to understand how the things should work .... I was reading about GFS and all the documents that i found suppose that you have a storage with a SAN and 2 or more machines connected through FC to the SAN. Well, it seems to me that in this case the storage or the SAN switch still being one single-point-of-failure right? If the storage or SAN goes down, the whole service will be offline right ? I thought that with GFS i could do something like a "Parallel FS" where 2 (or more) machines would have the same data in their disks, but this data would be synchronized in realtime .... am i totally noob or there really has a way to make FS's work in parallel, synchronizing in realtime? 
I'd like to do this without having a SAN (cause i don't have one :-) and i have only 1 storage ) and without leaving a single-point-of-failure. Let me try to explain exactly what I'm thinking ... 3 servers, each one with a 300GB SCSI disk (local, no FC) to be synchronized with the others (through GFS?? mounted and shared as a /data f.ex.), and one 36GB disk only for the SO. All the servers would have smtp(sendmail with spamassassin and clamav), imap and pop3 services running, and probably a squirrelmail. Is it possible to do this? Is it possible to get this data synchronized in realtime ? Thanks again for your really really important answers, and sorry for asking so much noob questions :-) Nick Riaan van Niekerk wrote: > > We are running 700 000 users on a 2.5 GFS, 4 nodes, with POP, IMAP > (direct access and SquirrelmMail) and SMTP. To make things worse, we > use NFS between our GFS nodes and our mail servers. > > We initially had huge performance problems in our setup, which I wrote > in this message: > http://www.redhat.com/archives/linux-cluster/2006-July/msg00136.html > > We ended up bumping the spindle count from 36 to 60 and then to 114, > without it making a noticeable difference. > > Our main killer was Squirrelmail over IMAP (the solution is primarily > a webmail-based one) > Our performance problems were solved by the following: > - removing the folder-size plugin (built-in) and mail quota plugin > (3rd party) reduced the traffic between IMAP servers and storage > backend by 40%. > - Implement imap proxy (www.imapproxy.org). This is giving us a 1 to > 14 hit ratio. This storage which could not keep up previously, is now > humming along fine. > > Our initial mistake was to try and optimise on the FS layer (there > werent any real performance optimizations in our setup to be made) and > throw hardware at the problem, instead of suspecting and optimizing > our application. Despite GFS not being designed for lots of small > files, and not recommended for use with NFS, with the above changes, > it performs more than adequately. We hope to see another performance > gain once we get rid of the NFS and have our mail servers access the > GFS directly. > > Riaan > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Nicholas Anderson Administrador de Sistemas Unix LPIC-1 Certified Rede Fiocruz e-mail: nicholas at fiocruz.br From rainer at ultra-secure.de Thu Aug 3 22:53:51 2006 From: rainer at ultra-secure.de (Rainer Duffner) Date: Fri, 04 Aug 2006 00:53:51 +0200 Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: <44D25C17.8000004@fiocruz.br> References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> <44D126D4.2080303@obsidian.co.za> <44D25C17.8000004@fiocruz.br> Message-ID: <44D27E7F.6000706@ultra-secure.de> Nicholas Anderson wrote: > Hi again all ..... > > I guess i'm starting to understand how the things should work .... > > I was reading about GFS and all the documents that i found suppose > that you have a storage with a SAN and 2 or more machines connected > through FC to the SAN. > Well, it seems to me that in this case the storage or the SAN switch > still being one single-point-of-failure right? If the storage or SAN > goes down, the whole service will be offline right ? First of all, you (should) have redundant FC-switches (mulipathing). Then, your storage has (should have) multiple controllers. Eg. HP EVA series. 
If that isn't enough, there are solution to mirror the storage at the hardware-level. Usually, this is in the "if-you-have-to-ask-it's-probably-too-expensive-for-you-anyway"-pricerange and thus only used where the (lack of) downtime is worth the investment. > > I thought that with GFS i could do something like a "Parallel FS" > where 2 (or more) machines would have the same data in their disks, > but this data would be synchronized in realtime .... > am i totally noob or there really has a way to make FS's work in > parallel, synchronizing in realtime? > I'd like to do this without having a SAN (cause i don't have one :-) > and i have only 1 storage ) and without leaving a > single-point-of-failure. > > Let me try to explain exactly what I'm thinking ... > > 3 servers, each one with a 300GB SCSI disk (local, no FC) to be > synchronized with the others (through GFS?? mounted and shared as a > /data f.ex.), and one 36GB disk only for the SO. > All the servers would have smtp(sendmail with spamassassin and > clamav), imap and pop3 services running, and probably a squirrelmail. > You can have a master/slave solution with DRBD. > Is it possible to do this? Is it possible to get this data > synchronized in realtime ? I don't think so. Well, Google has sort-of a solution via their "Google Filesystem". But not for you or me. :-( > > Thanks again for your really really important answers, and sorry for > asking so much noob questions :-) > IMO, hardware is very reliable these days (if you choose wisely). Things like DRBD seem (to me) only useful in very special cases - and I would fear that DRBD might create more problems than it solves. In your special case (email), if you can't afford a SAN, get a used NetApp and store the maildirs there (qmail-style maildirs). Then NFS-mount them on the "cluster-nodes". The NetApp is reliable enough for these scenarios and depending on the exact model, already contains a lot of redundancy in itself. cheers, Rainer From nanfang.xun at sunnexchina.com Fri Aug 4 00:37:21 2006 From: nanfang.xun at sunnexchina.com (Nanfang.Xun) Date: Fri, 04 Aug 2006 08:37:21 +0800 Subject: [Linux-cluster] Linux-cluster mailing list submissions Message-ID: <1154651841.3512.20.camel@ns.xunting.net> From yfttyfs at gmail.com Fri Aug 4 02:32:07 2006 From: yfttyfs at gmail.com (y f) Date: Fri, 4 Aug 2006 10:32:07 +0800 Subject: [Linux-cluster] Linux-cluster mailing list submissions In-Reply-To: <1154651841.3512.20.camel@ns.xunting.net> References: <1154651841.3512.20.camel@ns.xunting.net> Message-ID: <78fcc84a0608031932j7522df50wf6cd36b28a81ff67@mail.gmail.com> Hi, Xun, Do you also like Cluster as a guy of metal products company ? Wish you a nice day ! /yf On 8/4/06, Nanfang.Xun wrote: > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nicholas at fiocruz.br Fri Aug 4 02:34:30 2006 From: nicholas at fiocruz.br (Nicholas Anderson) Date: Thu, 3 Aug 2006 23:34:30 -0300 (BRT) Subject: [Linux-cluster] Re: E-Mail Cluster In-Reply-To: <44D27E7F.6000706@ultra-secure.de> References: <44CE15B1.9010603@fiocruz.br> <44D0F263.509@fiocruz.br> <44D126D4.2080303@obsidian.co.za> <44D25C17.8000004@fiocruz.br> <44D27E7F.6000706@ultra-secure.de> Message-ID: <61582.201.51.123.23.1154658870.squirrel@www.redefiocruz.fiocruz.br> > First of all, you (should) have redundant FC-switches (mulipathing). 
> Then, your storage has (should have) multiple controllers. Eg. HP EVA > series. > If that isn't enough, there are solution to mirror the storage at the > hardware-level. > Usually, this is in the > "if-you-have-to-ask-it's-probably-too-expensive-for-you-anyway"-pricerange > and thus only used where the (lack of) downtime is worth the investment. oops, money is the problem :-P i work for a government institution ..... in Brazil :-P > IMO, hardware is very reliable these days (if you choose wisely). Things > like DRBD seem (to me) only useful in very special cases - and I would > fear that DRBD might create more problems than it solves. > In your special case (email), if you can't afford a SAN, get a used > NetApp and store the maildirs there (qmail-style maildirs). Then > NFS-mount them on the "cluster-nodes". > The NetApp is reliable enough for these scenarios and depending on the > exact model, already contains a lot of redundancy in itself. i already thought about this ..... its a possibility .... thanks for the answer ... cheers Nick -- Nicholas Anderson Administrador de Sistemas Unix LPIC-1 Certified Rede Fiocruz e-mail: nicholas at fiocruz.br From Leonardo.Mello at planejamento.gov.br Fri Aug 4 11:31:01 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Fri, 4 Aug 2006 08:31:01 -0300 Subject: [Linux-cluster] gfs support for extended security attributes Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B5E@corp-bsa-mp01.planejamento.gov.br> The gfs doesn't support SELinux attributes, currently you MUST DISABLE SELinux to use GFS+Cluster Suite. I don't know if there is any plan to support it. maybe one developer or someone from redhat can anwser you better. :-D Best Regards Leonardo Rodrigues de Mello -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of David Caplan Sent: sex 21/7/2006 17:16 To: linux-cluster at redhat.com Cc: Subject: [Linux-cluster] gfs support for extended security attributes Does the current release of GFS support extended security attributes for use with SELinux? If not, are there any plans for support? Thanks, David -- __________________________________ David Caplan 410 290 1411 x105 dac at tresys.com Tresys Technology, LLC 8840 Stanford Blvd., Suite 2100 Columbia, MD 21045 -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3061 bytes Desc: not available URL: From kanderso at redhat.com Fri Aug 4 14:12:49 2006 From: kanderso at redhat.com (Kevin Anderson) Date: Fri, 04 Aug 2006 09:12:49 -0500 Subject: [Linux-cluster] gfs support for extended security attributes In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B5E@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B5E@corp-bsa-mp01.planejamento.gov.br> Message-ID: <1154700769.2783.3.camel@dhcp80-204.msp.redhat.com> The gfs-kernel code in the HEAD of the cvs tree now has SELinux extended attribute support integrated into the code. Also, the upstream gfs2 code that is in the -mm kernel also has the SELinux support as well. The gfs code in HEAD is targeted at the Fedora Core 6 and RHEL5 releases. Kevin On Fri, 2006-08-04 at 08:31 -0300, Leonardo Rodrigues de Mello wrote: > The gfs doesn't support SELinux attributes, currently you MUST DISABLE SELinux to use GFS+Cluster Suite. > > I don't know if there is any plan to support it. 
maybe one developer or someone from redhat can anwser you better. :-D > > Best Regards > Leonardo Rodrigues de Mello > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com on behalf of David Caplan > Sent: sex 21/7/2006 17:16 > To: linux-cluster at redhat.com > Cc: > Subject: [Linux-cluster] gfs support for extended security attributes > > Does the current release of GFS support extended security attributes for > use with SELinux? If not, are there any plans for support? > > Thanks, > David > > -- > __________________________________ > > David Caplan 410 290 1411 x105 > dac at tresys.com > Tresys Technology, LLC > 8840 Stanford Blvd., Suite 2100 > Columbia, MD 21045 > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gforte at leopard.us.udel.edu Fri Aug 4 14:32:25 2006 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Fri, 04 Aug 2006 10:32:25 -0400 Subject: [Linux-cluster] what causes "magma send einval to ..."? Message-ID: <44D35A79.700@leopard.us.udel.edu> I had a cluster node chugging along seemingly fine last night, then the following two lines appear in /var/log/messages: Aug 3 22:20:07 hostname kernel: al to 1 Aug 3 22:20:07 hostname kernel: Magma send einval to 1 And about 20 seconds later the other node fenced this one. I'm guessing that that fragmented message means that there's some sort of kernel flakiness going on, or that the box got overloaded (no way to tell, unfortunately - any recommendations on monitoring tools to track and log load level?), but that's just a guess. -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From lhh at redhat.com Fri Aug 4 15:11:12 2006 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 04 Aug 2006 11:11:12 -0400 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <44D3251A.5080001@bull.net> References: <44D3251A.5080001@bull.net> Message-ID: <1154704272.28677.90.camel@ayanami.boston.redhat.com> On Fri, 2006-08-04 at 12:44 +0200, Alain Moulle wrote: > Hi Ron, > > could you provide me the defects numbers and/or linked patches ? Here's the current list of pending fixes: http://bugzilla.redhat.com/bugzilla/buglist.cgi?component=rgmanager&bug_status=MODIFIED&bug_status=FAILS_QA&bug_status=ON_QA The patch for internal self-monitoring was simply a backport from the HEAD branch. I've attached a hand-edited patch which should enable the self-monitoring bit. Additionally, there was a segfault fixed in U3. Here's the errata advisory, which contains links to bugzillas: https://rhn.redhat.com/errata/RHBA-2006-0241.html -- Lon -------------- next part -------------- A non-text attachment was scrubbed... Name: watchdog.diff Type: text/x-patch Size: 4064 bytes Desc: not available URL: From raycharles_man at yahoo.com Fri Aug 4 15:43:41 2006 From: raycharles_man at yahoo.com (Ray Charles) Date: Fri, 4 Aug 2006 08:43:41 -0700 (PDT) Subject: [Linux-cluster] Logging for cluster errors. In-Reply-To: <44D22D12.2060501@redhat.com> Message-ID: <20060804154341.47391.qmail@web32114.mail.mud.yahoo.com> Yes, The error pops up say when I want to disable a service and re-enable. At the moment its a failed service so when I go to disable it in the gui i get the error and it directs me to check the logs. 
The Error is not explicit at all just ERROR and a directive to check the logs. -Ray --- James Parsons wrote: > Robert Peterson wrote: > > > Ray Charles wrote: > > > >> > >> Hi, > >> > >> Easy question. > >> > >> When I run system-config-cluster I am able to see > the > >> gui. But in the event there's an error while > using > >> the gui where does that get logged? > >> I've seen an error from the gui and it says check > >> logging. I didn't see a /var/log/cluster/ > >> > >> -TIA > >> > > > > Hi Ray, > > > > Usually the messages go into /var/log/messages. > > Many of the messages can be redirected to other > places by > > changing the cluster.conf file, so the code won't > tell you > > specifically where to look. > > > > Regards, > > > > Bob peterson > > Red Hat Cluster Suite > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > Ray, > > What is the nature of the error that you are seeing? > > -J > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From Leonardo.Mello at planejamento.gov.br Fri Aug 4 17:20:48 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Fri, 4 Aug 2006 14:20:48 -0300 Subject: [Linux-cluster] what causes "magma send einval to ..."? Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B61@corp-bsa-mp01.planejamento.gov.br> There was some discussions about this on the list in october 2005. http://www.google.com/search?q=%22Magma+send+einval+to%22&hl=en&lr=&filter=0 One entry in bugzilla relative this: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169693 Best Regards Leonardo Rodrigues de Mello -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Greg Forte Sent: sex 4/8/2006 11:32 To: linux clustering Cc: Subject: [Linux-cluster] what causes "magma send einval to ..."? I had a cluster node chugging along seemingly fine last night, then the following two lines appear in /var/log/messages: Aug 3 22:20:07 hostname kernel: al to 1 Aug 3 22:20:07 hostname kernel: Magma send einval to 1 And about 20 seconds later the other node fenced this one. I'm guessing that that fragmented message means that there's some sort of kernel flakiness going on, or that the box got overloaded (no way to tell, unfortunately - any recommendations on monitoring tools to track and log load level?), but that's just a guess. -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3388 bytes Desc: not available URL: From rohara at redhat.com Fri Aug 4 21:57:16 2006 From: rohara at redhat.com (Ryan O'Hara) Date: Fri, 04 Aug 2006 16:57:16 -0500 Subject: [Linux-cluster] gfs support for extended security attributes In-Reply-To: <6FE441CD9F0C0C479F2D88F959B01588298BCA@exchange.columbia.tresys.com> References: <6FE441CD9F0C0C479F2D88F959B01588298BCA@exchange.columbia.tresys.com> Message-ID: <44D3C2BC.2070904@redhat.com> David, Sorry for the delay. The current release of GFS (in RHEL3 and RHEL4) does not support SELinux extended attributes. 
The code for GFS(1) in cvs HEAD does have support of SELinux. I added this code recently. This should make its way into our released version of GFS in the near future. GFS2, which is currently in development and being pushed upstream, also has SELinux extended attribute support. So to answer your questions... No, our current release does not support SELinux. Yes, we do plan to support it and the code is in-place. Note that anyone who wanted to try using GFS/GFS2 with SELinux attributes may need to make relevant changes to the policy. With that said, I do know for certain that the Rawhide packages do have a policy that define gfs and gfs2 as supported filesystems. Ryan David Caplan wrote: > > Does the current release of GFS support extended security attributes for > use with SELinux? If not, are there any plans for support? > > Thanks, > David > > -- > __________________________________ > > David Caplan 410 290 1411 x105 > dac at tresys.com > Tresys Technology, LLC > 8840 Stanford Blvd., Suite 2100 > Columbia, MD 21045 > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From riaan at obsidian.co.za Sat Aug 5 22:19:56 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Sun, 06 Aug 2006 00:19:56 +0200 Subject: [Linux-cluster] 2-node fencing question In-Reply-To: <1154633146.28677.70.camel@ayanami.boston.redhat.com> References: <080220061550.6837.44D0C9B800021AD200001AB522007347489B9C0A99020E0B@comcast.net> <1154633146.28677.70.camel@ayanami.boston.redhat.com> Message-ID: <44D5198C.2090603@obsidian.co.za> > However, it could very well be that IPMI hardware modules are slow > enough at processing requests that this could pose a problem. What > hardware has this happened on? Was ACPI disabled on boot in the host OS > (it should be; see below)? > > snip > > The time window for (c) increases significantly (5+ seconds) if the > cluster nodes are enabling ACPI power events on boot. This is one of > the reasons why booting with acpi=off is required when using IPMI, iLO, > or other integrated power management solutions. > > If booting with acpi=off, does the problem persist? > Lon - is the requirement for disabling acpi when using integrated fence devices documented anywhere? I have searched far and wide on the nature of acpi=off (if it is good or bad, recommended by Red Hat or anyone out there). Yours is the strongest against acpi enabled I have found, but not for reasons I would have expected. My impression of acpi=off is it borders on a magical cure-all for boot/installation problems (in part due to bad acpi by server/firmware vendors), but that it also acts as some kind of safe mode (e.g. ht is disabled, does things to IRQ routing, etc) which may have an adverse effect on system performance. Are you aware of any negative effects, performance or otherwise, which acpi=off will cause. E.g. if the only adverse effect of acpi=off is hyperthreading being disabled, users that want it back, can so using acpi=ht Riaan note: IMHO, a Knowledge Base article on the use of acpi=off (and its variants), for general RHEL installations, and pertaining to RHCS/GFS implementations would be very welcome. -------------- next part -------------- A non-text attachment was scrubbed... 
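For what it's worth, the "acpi=off" being discussed in this thread is just a kernel command-line switch; on a stock RHEL 4 box it would typically be appended to the kernel line in /boot/grub/grub.conf, roughly like the sketch below (kernel version and root device are illustrative, not taken from any poster's setup):

  title Red Hat Enterprise Linux AS (2.6.9-42.ELsmp)
          root (hd0,0)
          kernel /vmlinuz-2.6.9-42.ELsmp ro root=/dev/VolGroup00/LogVol00 acpi=off
          initrd /initrd-2.6.9-42.ELsmp.img

If hyperthreading is the only ACPI-dependent feature you want back, acpi=ht can be substituted for acpi=off, as mentioned above.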
Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From riaan at obsidian.co.za Sat Aug 5 22:53:13 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Sun, 06 Aug 2006 00:53:13 +0200 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <1154704272.28677.90.camel@ayanami.boston.redhat.com> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> Message-ID: <44D52159.8090004@obsidian.co.za> Lon Hohberger wrote: > On Fri, 2006-08-04 at 12:44 +0200, Alain Moulle wrote: >> Hi Ron, >> >> could you provide me the defects numbers and/or linked patches ? > > Here's the current list of pending fixes: > > http://bugzilla.redhat.com/bugzilla/buglist.cgi?component=rgmanager&bug_status=MODIFIED&bug_status=FAILS_QA&bug_status=ON_QA > Lon With RHEL 4 update 4 just around the corner, what is the planned release schedule for RHCS 4 update 4 / GFS 6.1 update 4? Since these are not even in beta yet, does that mean that CS/GFS customers will have to wait for the CS/GFS versions of update 4 before they can go to RHEL 4 update 4? tnx Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From f.hackenberger at mediatransfer.de Mon Aug 7 07:16:56 2006 From: f.hackenberger at mediatransfer.de (Falk Hackenberger - MediaTransfer AG Netresearch & Consulting) Date: Mon, 07 Aug 2006 09:16:56 +0200 Subject: [Linux-cluster] clurgmgrd stops service without reason In-Reply-To: <1154630324.28677.21.camel@ayanami.boston.redhat.com> References: <44D08673.3010207@mediatransfer.com> <1154630324.28677.21.camel@ayanami.boston.redhat.com> Message-ID: <44D6E8E8.4090903@mediatransfer.de> Lon Hohberger wrote: > On Wed, 2006-08-02 at 13:03 +0200, Falk Hackenberger - MediaTransfer AG > Netresearch & Consulting wrote: > >>--snip-- >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing >>/exports/imap/checkimapstartup.sh status >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing >>/exports/subversion/etc/rc.d/init.d/svnserver status >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Checking 192.168.0.223, >>Level 0 >>Aug 1 17:31:28 kain clurgmgrd: [4780]: 192.168.0.223 present on >>eth0 >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Link for eth0: Detected >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Link detected on eth0 >>Aug 1 17:31:37 kain clurgmgrd[4780]: Stopping service storage >>--snap-- >> >>how to say to clurgmgrd, that he should log the reason for stoping the >>service? > > Something must be returning an error code where it should not be; can > you post your service XML blob? it is very long and a little bit complex as i know... 
;-) From f.hackenberger at mediatransfer.com Mon Aug 7 07:17:26 2006 From: f.hackenberger at mediatransfer.com (Falk Hackenberger - MediaTransfer AG Netresearch & Consulting) Date: Mon, 07 Aug 2006 09:17:26 +0200 Subject: [Linux-cluster] clurgmgrd stops service without reason In-Reply-To: <1154630324.28677.21.camel@ayanami.boston.redhat.com> References: <44D08673.3010207@mediatransfer.com> <1154630324.28677.21.camel@ayanami.boston.redhat.com> Message-ID: <44D6E906.5060006@mediatransfer.com> Lon Hohberger wrote: > On Wed, 2006-08-02 at 13:03 +0200, Falk Hackenberger - MediaTransfer AG > Netresearch & Consulting wrote: > >>--snip-- >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing >>/exports/imap/checkimapstartup.sh status >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Executing >>/exports/subversion/etc/rc.d/init.d/svnserver status >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Checking 192.168.0.223, >>Level 0 >>Aug 1 17:31:28 kain clurgmgrd: [4780]: 192.168.0.223 present on >>eth0 >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Link for eth0: Detected >>Aug 1 17:31:28 kain clurgmgrd: [4780]: Link detected on eth0 >>Aug 1 17:31:37 kain clurgmgrd[4780]: Stopping service storage >>--snap-- >> >>how to say to clurgmgrd, that he should log the reason for stoping the >>service? > > Something must be returning an error code where it should not be; can > you post your service XML blob? it is very long and a little bit complex as i know... ;-) From neohill at gmail.com Mon Aug 7 07:50:07 2006 From: neohill at gmail.com (Neo Hill) Date: Mon, 7 Aug 2006 09:50:07 +0200 Subject: [Linux-cluster] DRBD in Active-active mode Message-ID: Hi everybody, I am still looking on information or documents regarding DRBD in active-active mode. Does anyone could help me ? Thanks a lot. Neo hill -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Mon Aug 7 09:29:33 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Mon, 07 Aug 2006 11:29:33 +0200 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <1154704272.28677.90.camel@ayanami.boston.redhat.com> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> Message-ID: <44D707FD.6080602@bull.net> Hi Lon I've tried to patch the U2 version with this patch, but it requires a nodeevent.c which apparently did not exist in CS4 U2 (that Makefile patch adds a nodeevent.o as well as the watchdog.o) . Does that mean that this patch can definetly not be applied on rgmanager (1.9.38) from CS4 U2 ? Thanks Alain Moull? Lon Hohberger wrote: > On Fri, 2006-08-04 at 12:44 +0200, Alain Moulle wrote: > >>Hi Ron, >> >>could you provide me the defects numbers and/or linked patches ? > > > Here's the current list of pending fixes: > > http://bugzilla.redhat.com/bugzilla/buglist.cgi?component=rgmanager&bug_status=MODIFIED&bug_status=FAILS_QA&bug_status=ON_QA > > The patch for internal self-monitoring was simply a backport from the > HEAD branch. I've attached a hand-edited patch which should enable the > self-monitoring bit. > > Additionally, there was a segfault fixed in U3. 
Here's the errata > advisory, which contains links to bugzillas: > > https://rhn.redhat.com/errata/RHBA-2006-0241.html > > -- Lon From joe.devman at yahoo.fr Mon Aug 7 12:08:00 2006 From: joe.devman at yahoo.fr (Joe) Date: Mon, 07 Aug 2006 14:08:00 +0200 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44CF94FE.3070407@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <44CF94FE.3070407@redhat.com> Message-ID: <44D72D20.2060705@yahoo.fr> Robert Peterson wrote: > We've tried to kick around ideas on how to improve the speed, such as > (1) adding an option to only focus on areas where the journals are dirty, > (2) introducing multiple threads to process the different RGs, and even > (3) trying to get multiple nodes in the cluster to team up and do > different > areas of the file system. None of these have been implemented yet > because of higher priorities. Since this is an open-source project, > anyone > could step in and do these. Volunteers? > I've tried to look at the code many times. But, as a clustered file system is a complex thing, it gets hard to figure out what it's all about. I tried to find a "big picture" documentation, at least for on-disk layout. The only nearest thing i've found is : http://opengfs.sourceforge.net/docs.php , which is the documentation written at the time OpenGFS forked from Cistina's code. Although principles may still be the sames (or not ?), the code has obviously changed and on-disk layout may not be the same, too. So, is there some sort of documentation about the principles found in GFS (not a design doc, i've read /usr/src/linux/Documentation/stable_api_nonsense.txt) ? This would much help anybody who wishes to enter the code to do it more efficientely... Thanks ! From lhh at redhat.com Mon Aug 7 14:27:27 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 07 Aug 2006 10:27:27 -0400 Subject: [Linux-cluster] DRBD in Active-active mode In-Reply-To: References: Message-ID: <1154960847.21204.35.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: > Hi everybody, > > I am still looking on information or documents regarding DRBD in > active-active mode. > > Does anyone could help me ? > > Thanks a lot. Fairly certain this is not possible, unless something has changed recently. That is, you cannot use DRBD as a distributed concurrently writable mirror; only one node can be the master of a DRBD device at a time. You can do this with GNBD + Cluster Mirroring, though. -- Lon From lhh at redhat.com Mon Aug 7 14:30:41 2006 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 07 Aug 2006 10:30:41 -0400 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <44D707FD.6080602@bull.net> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> <44D707FD.6080602@bull.net> Message-ID: <1154961041.21204.40.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 11:29 +0200, Alain Moulle wrote: > Hi Lon > > I've tried to patch the U2 version with this patch, but it requires > a nodeevent.c which apparently did not exist in CS4 U2 (that Makefile patch > adds a nodeevent.o as well as the watchdog.o) . > Does that mean that this patch can definetly not be applied > on rgmanager (1.9.38) from CS4 U2 ? Take it out of the patched Makefile. Nodeevent.c shouldn't be required to make the watchdog work. 
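In other words, when applying the hand-edited patch on top of rgmanager 1.9.38, keep the hunk that adds watchdog.o to the object list and drop the one that adds nodeevent.o. The resulting object list would look roughly like this (purely illustrative -- the real variable name and file list in the rgmanager Makefile differ):

  OBJS = main.o rg_state.o ... \
         watchdog.o
  # nodeevent.o deliberately left out -- nodeevent.c is not present in 1.9.38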
-- Lon From chawkins at bplinux.com Mon Aug 7 14:33:31 2006 From: chawkins at bplinux.com (Christopher Hawkins) Date: Mon, 7 Aug 2006 10:33:31 -0400 Subject: [Linux-cluster] DRBD in Active-active mode In-Reply-To: <1154960847.21204.35.camel@ayanami.boston.redhat.com> Message-ID: <200608071418.k77EIq1X000664@mail2.ontariocreditcorp.com> On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: >> Hi everybody, >> >> I am still looking on information or documents regarding DRBD in >> active-active mode. >> >> Does anyone could help me ? >> >> Thanks a lot. >Fairly certain this is not possible, unless something has changed recently. >That is, you cannot use DRBD as adistributed concurrently writable mirror; >only one node can be the master of a DRBD device at a time. >You can do this with GNBD + Cluster Mirroring, though. >-- Lon Lon, GNBD + Cluster Mirroring? Are you referring to clvm2, or is there another package out there I haven't heard of? Thanks, Chris From riaan at obsidian.co.za Mon Aug 7 15:03:32 2006 From: riaan at obsidian.co.za (Riaan van Niekerk) Date: Mon, 07 Aug 2006 17:03:32 +0200 Subject: [Linux-cluster] DRBD in Active-active mode In-Reply-To: <1154960847.21204.35.camel@ayanami.boston.redhat.com> References: <1154960847.21204.35.camel@ayanami.boston.redhat.com> Message-ID: <44D75644.8000901@obsidian.co.za> Lon Hohberger wrote: > On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: >> Hi everybody, >> >> I am still looking on information or documents regarding DRBD in >> active-active mode. >> >> Does anyone could help me ? >> >> Thanks a lot. > > Fairly certain this is not possible, unless something has changed > recently. That is, you cannot use DRBD as a distributed concurrently > writable mirror; only one node can be the master of a DRBD device at a > time. > > You can do this with GNBD + Cluster Mirroring, though. > > -- Lon According to the DRBD FAQ: http://www.linux-ha.org/DRBD/FAQ#head-ec4ab5a57e15232e9ac4e12775de5a1b328aeff5 Why does DRBD not allow concurrent access from all nodes? I'd like to use it with GFS/OCFS2... Actually, DRBD v8 (which is still in pre-release state at the time of this writing) supports this. You need to net { allow-two-primaries; } ... I have not tried this myself though, but would like to hear the experiences of anyone who has tried this. Riaan -------------- next part -------------- A non-text attachment was scrubbed... Name: riaan.vcf Type: text/x-vcard Size: 310 bytes Desc: not available URL: From Alain.Moulle at bull.net Mon Aug 7 15:07:40 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Mon, 07 Aug 2006 17:07:40 +0200 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <1154961041.21204.40.camel@ayanami.boston.redhat.com> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> <44D707FD.6080602@bull.net> <1154961041.21204.40.camel@ayanami.boston.redhat.com> Message-ID: <44D7573C.5020209@bull.net> Lon Hohberger wrote: > On Mon, 2006-08-07 at 11:29 +0200, Alain Moulle wrote: > >>Hi Lon >> >>I've tried to patch the U2 version with this patch, but it requires >>a nodeevent.c which apparently did not exist in CS4 U2 (that Makefile patch >>adds a nodeevent.o as well as the watchdog.o) . >>Does that mean that this patch can definetly not be applied >>on rgmanager (1.9.38) from CS4 U2 ? > > > Take it out of the patched Makefile. Nodeevent.c shouldn't be required > to make the watchdog work. > > -- Lon Build ok. Thanks. 
Could you explain exactly the benefit of this watchdog work ? Thanks Alain > > From rpeterso at redhat.com Mon Aug 7 15:55:10 2006 From: rpeterso at redhat.com (Robert Peterson) Date: Mon, 07 Aug 2006 10:55:10 -0500 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44D72D20.2060705@yahoo.fr> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <44CF94FE.3070407@redhat.com> <44D72D20.2060705@yahoo.fr> Message-ID: <44D7625E.9090305@redhat.com> Joe wrote: > I've tried to look at the code many times. But, as a clustered file > system is a complex thing, it gets hard to figure out what it's all > about. I tried to find a "big picture" documentation, at least for > on-disk layout. The only nearest thing i've found is : > http://opengfs.sourceforge.net/docs.php , which is the documentation > written at the time OpenGFS forked from Cistina's code. Although > principles may still be the sames (or not ?), the code has obviously > changed and on-disk layout may not be the same, too. > So, is there some sort of documentation about the principles found in > GFS (not a design doc, i've read > /usr/src/linux/Documentation/stable_api_nonsense.txt) ? This would > much help anybody who wishes to enter the code to do it more > efficientely... > > Thanks ! Hi Joe, I agree that there isn't much good design information out there regarding GFS. That might be because it started out as a proprietary product before Red Hat open-sourced it. There are some comments in the kernel's gfs_ondisk.h include: http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/gfs/gfs_ondisk.h?cvsroot=cluster Perhaps I'll start working on a GFS1/2 design whitepaper based on some of the information I've gathered. Regards, Bob Peterson Red Hat Cluster Suite From hardyjm at potsdam.edu Mon Aug 7 18:11:06 2006 From: hardyjm at potsdam.edu (Jeff Hardy) Date: Mon, 07 Aug 2006 14:11:06 -0400 Subject: [Linux-cluster] lvm2 liblvm2clusterlock.so on fc5 In-Reply-To: <1154022763.2789.120.camel@fritzdesk.potsdam.edu> References: <1154022763.2789.120.camel@fritzdesk.potsdam.edu> Message-ID: <1154974266.1797.124.camel@fritzdesk.potsdam.edu> On Thu, 2006-07-27 at 13:52 -0400, Jeff Hardy wrote: > I apologize if this has been answered already or appeared in release > notes somewhere, but I cannot find it. FC4 had the lvm2-cluster package > to provide the clvm locking library. This was removed in FC5 (as > indicated in the release notes). > > Is this still necessary for a clvm setup: > > In /etc/lvm/lvm.conf: > locking_type = 2 > locking_library = "/lib/liblvm2clusterlock.so" > > And if so, where does one find this now? > > Thank you. > > Well, though absent in FC5, I just recently saw a message somewhere indicating the lvm2-cluster package was back in FC6 testing. Anyone have any idea why this was dropped for FC5? I built off of the lvm2 source rpm, using a modified lvm2-cluster spec file from FC4. Looks ok. If anyone has reason to believe this is a really bad idea, or wants the srpm or rpm, feel free to drop me a line. 
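For anyone wanting to reproduce the rebuild, a rough sketch of the steps (package version, architecture and paths are illustrative, and the FC4 lvm2-cluster spec file will need its version/BuildRequires fields adjusted for FC5):

  # install the FC5 lvm2 source rpm and the modified spec
  rpm -ivh lvm2-2.02.xx-x.src.rpm
  cp lvm2-cluster.spec /usr/src/redhat/SPECS/

  # build and install the cluster locking library package
  rpmbuild -bb /usr/src/redhat/SPECS/lvm2-cluster.spec
  rpm -Uvh /usr/src/redhat/RPMS/x86_64/lvm2-cluster-*.rpm

Then point /etc/lvm/lvm.conf at the library as shown above (locking_type = 2, locking_library = "/lib/liblvm2clusterlock.so") and restart clvmd.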
-- Jeff Hardy Systems Analyst hardyjm at potsdam.edu From agk at redhat.com Mon Aug 7 18:28:24 2006 From: agk at redhat.com (Alasdair G Kergon) Date: Mon, 7 Aug 2006 19:28:24 +0100 Subject: [Linux-cluster] lvm2 liblvm2clusterlock.so on fc5 In-Reply-To: <1154974266.1797.124.camel@fritzdesk.potsdam.edu> References: <1154022763.2789.120.camel@fritzdesk.potsdam.edu> <1154974266.1797.124.camel@fritzdesk.potsdam.edu> Message-ID: <20060807182824.GP18633@agk.surrey.redhat.com> On Mon, Aug 07, 2006 at 02:11:06PM -0400, Jeff Hardy wrote: > have any idea why this was dropped for FC5? It got disabled early on because it wouldn't build (depended on cluster infrastructure that wasn't there) and when that got resolved, nobody remembered to reenable it. As you noticed, we've got it back into fc6/devel and we're trying to to get it approved for fc5 updates. Alasdair -- agk at redhat.com From Leonardo.Mello at planejamento.gov.br Mon Aug 7 19:09:00 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Mon, 7 Aug 2006 16:09:00 -0300 Subject: RES: [Linux-cluster] DRBD in Active-active mode Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B6E@corp-bsa-mp01.planejamento.gov.br> Hi everyone, Lon you are right for the stable version of DRBD the version 0.7. But DRBD actualy has support for active-active setup in the development version 0.8. There is significant changes between this versions, the entire roadmap can be read at: http://svn.drbd.org/drbd/trunk/ROADMAP I have done some investigations and tests with DRBD 0.8 in active-active setup with two nodes and OCFS2 and with GFS. This was for one project i was doing related to oracle Rac 10g. I have produced one documentation in portuguese that shows how to setup and use drbd in active-active with ocfs2. the link is: http://guialivre.governoeletronico.gov.br/seminario/index.php/DocumentacaoTecnologiasDRBDOCFS2 I have discovered in my investigations that ocfs2 is more unstable that GFS. I have a several kernel panics with ocfs2 under high loads on the machine, but no one with GFS. I have the instalation of GFS documented at, one performance test i have done some time ago: http://guialivre.governoeletronico.gov.br/mediawiki/index.php/TestesGFS (here i use clvm, and gnbd) The problem of drbd is that actualy you can use just two machines, if you want to use more you need to use the commercial version drbd+. Best Regards Leonardo Rodrigues de Mello -----Mensagem original----- De: linux-cluster-bounces at redhat.com em nome de Lon Hohberger Enviada: seg 7/8/2006 11:27 Para: linux clustering Cc: Assunto: Re: [Linux-cluster] DRBD in Active-active mode On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: > Hi everybody, > > I am still looking on information or documents regarding DRBD in > active-active mode. > > Does anyone could help me ? > > Thanks a lot. Fairly certain this is not possible, unless something has changed recently. That is, you cannot use DRBD as a distributed concurrently writable mirror; only one node can be the master of a DRBD device at a time. You can do this with GNBD + Cluster Mirroring, though. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... 
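To make the active-active DRBD 0.8 setup described above more concrete, here is a minimal sketch of a resource definition (host names, backing devices and addresses are placeholders; the GFS/cluster configuration still has to be layered on top of /dev/drbd0 as with any other shared block device):

  resource r0 {
    protocol C;
    net {
      allow-two-primaries;   # DRBD 0.8+ only
    }
    on node1 {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   192.168.0.1:7788;
      meta-disk internal;
    }
    on node2 {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   192.168.0.2:7788;
      meta-disk internal;
    }
  }

After the initial sync, both nodes can be promoted with "drbdadm primary r0" and the GFS filesystem created on /dev/drbd0.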
Name: winmail.dat Type: application/ms-tnef Size: 3708 bytes Desc: not available URL: From Leonardo.Mello at planejamento.gov.br Mon Aug 7 19:14:19 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Mon, 7 Aug 2006 16:14:19 -0300 Subject: RES: [Linux-cluster] DRBD in Active-active mode Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B6F@corp-bsa-mp01.planejamento.gov.br> sorry for the typos and english errors in the last message. I believe need more coffee. Leonardo Rodrigues de Mello -----Mensagem original----- De: Leonardo Rodrigues de Mello em nome de Leonardo Rodrigues de Mello Enviada: seg 7/8/2006 16:09 Para: linux clustering Cc: Assunto: RES: [Linux-cluster] DRBD in Active-active mode Hi everyone, Lon you are right for the stable version of DRBD the version 0.7. But DRBD actualy has support for active-active setup in the development version 0.8. There is significant changes between this versions, the entire roadmap can be read at: http://svn.drbd.org/drbd/trunk/ROADMAP I have done some investigations and tests with DRBD 0.8 in active-active setup with two nodes and OCFS2 and with GFS. This was for one project i was doing related to oracle Rac 10g. I have produced one documentation in portuguese that shows how to setup and use drbd in active-active with ocfs2. the link is: http://guialivre.governoeletronico.gov.br/seminario/index.php/DocumentacaoTecnologiasDRBDOCFS2 I have discovered in my investigations that ocfs2 is more unstable that GFS. I have a several kernel panics with ocfs2 under high loads on the machine, but no one with GFS. I have the instalation of GFS documented at, one performance test i have done some time ago: http://guialivre.governoeletronico.gov.br/mediawiki/index.php/TestesGFS (here i use clvm, and gnbd) The problem of drbd is that actualy you can use just two machines, if you want to use more you need to use the commercial version drbd+. Best Regards Leonardo Rodrigues de Mello -----Mensagem original----- De: linux-cluster-bounces at redhat.com em nome de Lon Hohberger Enviada: seg 7/8/2006 11:27 Para: linux clustering Cc: Assunto: Re: [Linux-cluster] DRBD in Active-active mode On Mon, 2006-08-07 at 09:50 +0200, Neo Hill wrote: > Hi everybody, > > I am still looking on information or documents regarding DRBD in > active-active mode. > > Does anyone could help me ? > > Thanks a lot. Fairly certain this is not possible, unless something has changed recently. That is, you cannot use DRBD as a distributed concurrently writable mirror; only one node can be the master of a DRBD device at a time. You can do this with GNBD + Cluster Mirroring, though. -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3932 bytes Desc: not available URL: From haiwu.us at gmail.com Mon Aug 7 20:21:39 2006 From: haiwu.us at gmail.com (hai wu) Date: Mon, 7 Aug 2006 15:21:39 -0500 Subject: [Linux-cluster] 2-node cluster and fence_drac Message-ID: Hi, For a 2-node cluster (RHEL4), does it require the use of power switch or fence_drac would be good enough for the setup? Would fence_drac work properly in a 2-node cluster? Thanks, Hai -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From brad at seatab.com Mon Aug 7 22:07:29 2006 From: brad at seatab.com (Brad Dameron) Date: Mon, 07 Aug 2006 15:07:29 -0700 Subject: [Linux-cluster] GFS 6.1 kernel warning. Message-ID: <1154988449.19157.20.camel@serpent.office.seatab.com> We have two rather large server's running GFS in a production environment and have been getting errors since the start. First our configuration. 2 - Quad 880 Opteron Servers with 64GB RAM 1 - Infortrend 2GB SAN OS - SuSe 10.0 Professional (Kernel 2.6.13-15.8-smp x86_64) Cluster network is on GigE connection. This link is shared and used for other purposes but not much traffic. Here is the error message: Aug 7 14:40:45 CServer01 kernel: GFS: fsid=Cluster01:gfs1.1: warning: assertion "gfs_glock_is_locked_by_me(ip->i_gl)" failed Aug 7 14:40:45 CServer01 kernel: GFS: fsid=Cluster01:gfs1.1: function = gfs_readpage Aug 7 14:40:45 CServer01 kernel: GFS: fsid=Cluster01:gfs1.1: file = /usr/src/gfs/src/cluster-1.02.00/gfs-kernel/src/gfs/ops_address.c, line = 283 Aug 7 14:40:45 CServer01 kernel: GFS: fsid=Cluster01:gfs1.1: time = 1154986845 This appears to occur when both machines try to access the same files/directory. They happen at a rate of about 10-15 an hour. Anyone know if this is critical or a way to turn these off if they are not an issue? There is definitely a big performance issue when using GFS on very CPU intense applications. When the first server is using all 8 CPU core's doing processing the second server's IO response slows to a crawl. Any sysctl tweaks to help improve the performance appreciated. Thanks, Brad Dameron SeaTab Software www.seatab.com From Alain.Moulle at bull.net Tue Aug 8 07:53:27 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 08 Aug 2006 09:53:27 +0200 Subject: [Linux-cluster] CS4 Update 4 / two questions Message-ID: <44D842F7.7080805@bull.net> Hi 1/ About the return of quorum disk functionnality : is it mandatory to configure it, or is it possible to run the CS4 U4 without it in a first step ? (this question only to know how to manage eventual update from U2 (currently in production without quorum disk configured) to U4 ) 2/ is there a beta documentation on CS4 U4 download-able somewhere ? Thanks Alain From pcaulfie at redhat.com Tue Aug 8 08:04:37 2006 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Tue, 08 Aug 2006 09:04:37 +0100 Subject: [Linux-cluster] CS4 Update 4 / two questions In-Reply-To: <44D842F7.7080805@bull.net> References: <44D842F7.7080805@bull.net> Message-ID: <44D84595.3000304@redhat.com> Alain Moulle wrote: > Hi > > 1/ About the return of quorum disk functionnality : is it mandatory > to configure it, or is it possible to run the CS4 U4 without it in > a first step ? > > (this question only to know how to manage eventual update from U2 (currently in > production without quorum disk configured) to U4 ) > Quorum disk is completely optional, even in a two-node system. 
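For reference, a two-node cluster runs without a quorum disk by using CMAN's special two-node mode. A minimal sketch of the relevant cluster.conf fragment (node names and fencing are placeholders, not from this thread):

  <cluster name="mycluster" config_version="1">
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="node1" votes="1">
        <fence/>
      </clusternode>
      <clusternode name="node2" votes="1">
        <fence/>
      </clusternode>
    </clusternodes>
    <fencedevices/>
    <rm/>
  </cluster>

With two_node="1" and expected_votes="1", either node alone keeps quorum; a <quorumd> section only needs to be added later if and when the quorum disk feature is wanted.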
-- patrick From Alain.Moulle at bull.net Tue Aug 8 13:08:38 2006 From: Alain.Moulle at bull.net (Alain Moulle) Date: Tue, 08 Aug 2006 15:08:38 +0200 Subject: [Linux-cluster] CS4 Update 4/ about __NR_gettid and syscall Message-ID: <44D88CD6.7030800@bull.net> Hi In CS4 Update 4 , there are several places where a syscall call is dependant on NR_gettid set or not , for example in qdisk/gettid.c : #include #include #include #include /* Patch from Adam Conrad / Ubuntu: Don't use _syscall macro */ #ifdef __NR_gettid pid_t gettid (void) { return syscall(__NR_gettid); } #else #warn "gettid not available -- substituting with pthread_self()" #include pid_t gettid (void) { return (pid_t)pthread_self(); } #endif and also in : magma-plugins-1.0.9/gulm/gulm.c rgmanager-1.9.52/src/clulib/gettid And in fact, I have compilation error if the syscall is choosen by the ifdef , so I wonder what to do about that , what does __NR_gettid means ,etc. Any piece of advise ? Thanks Alain From Leonardo.Mello at planejamento.gov.br Wed Aug 9 15:04:58 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Wed, 9 Aug 2006 12:04:58 -0300 Subject: [Linux-cluster] cs-deploy-gfs Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B78@corp-bsa-mp01.planejamento.gov.br> Hi everyone, Does anyone know what happened with the development of cs-deploy-gfs ? The version in cvs is the lastest version ? There is any improvements since the initial version ? This software was abandoned ? I don't have much time but I want help in the development of it, do the things in TODO, and others like porting it to systems debian-like or better, port it to use smartpm (http://labix.org/smart), help with the internacionalization, and translation to portuguese brazil. Best Regards Leonardo Rodrigues de Mello -------------- next part -------------- An HTML attachment was scrubbed... URL: From Leonardo.Mello at planejamento.gov.br Wed Aug 9 15:16:16 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Wed, 9 Aug 2006 12:16:16 -0300 Subject: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B79@corp-bsa-mp01.planejamento.gov.br> sorry, the application name is cs-deploy-tool, not cs-deploy-gfs. Best Regards Leonardo Rodrigues de Mello -----Mensagem original----- De: linux-cluster-bounces at redhat.com em nome de Leonardo Rodrigues de Mello Enviada: qua 9/8/2006 12:04 Para: linux-cluster at redhat.com Cc: Assunto: [Linux-cluster] cs-deploy-gfs Hi everyone, Does anyone know what happened with the development of cs-deploy-gfs ? The version in cvs is the lastest version ? There is any improvements since the initial version ? This software was abandoned ? I don't have much time but I want help in the development of it, do the things in TODO, and others like porting it to systems debian-like or better, port it to use smartpm (http://labix.org/smart), help with the internacionalization, and translation to portuguese brazil. Best Regards Leonardo Rodrigues de Mello -------------- next part -------------- A non-text attachment was scrubbed... 
Name: winmail.dat Type: application/ms-tnef Size: 3004 bytes Desc: not available URL: From jparsons at redhat.com Wed Aug 9 15:16:41 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 09 Aug 2006 11:16:41 -0400 Subject: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B79@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B79@corp-bsa-mp01.planejamento.gov.br> Message-ID: <44D9FC59.5050408@redhat.com> Leonardo Rodrigues de Mello wrote: >sorry, >the application name is cs-deploy-tool, not cs-deploy-gfs. > >Best Regards >Leonardo Rodrigues de Mello > > >-----Mensagem original----- >De: linux-cluster-bounces at redhat.com em nome de Leonardo Rodrigues de Mello >Enviada: qua 9/8/2006 12:04 >Para: linux-cluster at redhat.com >Cc: >Assunto: [Linux-cluster] cs-deploy-gfs > >Hi everyone, > >Does anyone know what happened with the development of cs-deploy-gfs ? > >The version in cvs is the lastest version ? > >There is any improvements since the initial version ? > >This software was abandoned ? > > The functionality available in cs-deploy-tool will be available in a new management interface for clusters and storage called Conga, targetted for RHEL5 and (hopefully) RHEL4.5 -J From stephen.willey at framestore-cfc.com Wed Aug 9 15:33:29 2006 From: stephen.willey at framestore-cfc.com (Stephen Willey) Date: Wed, 09 Aug 2006 16:33:29 +0100 Subject: [Linux-cluster] gfs_fsck fails on large filesystem In-Reply-To: <44D7625E.9090305@redhat.com> References: <44CF2F94.4000003@framestore-cfc.com> <44CF8383.3040208@redhat.com> <44CF822D.7070705@framestore-cfc.com> <44CF94FE.3070407@redhat.com> <44D72D20.2060705@yahoo.fr> <44D7625E.9090305@redhat.com> Message-ID: <44DA0049.8030505@framestore-cfc.com> So ya know... Once we'd added a 137Gb swap drive, it took 48 hours to run all stages of the gfs_fsck on a 42Tb filesystem without any -v options That was on a dual Opteron 275 (4Gb RAM) with 4Gb FC to 6 SATA RAIDs in CLVM. -- Stephen Willey Senior Systems Engineer, Framestore-CFC +44 (0)207 344 8000 http://www.framestore-cfc.com From lhh at redhat.com Wed Aug 9 15:34:39 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:34:39 -0400 Subject: [Linux-cluster] CS4 Update 4 / two questions In-Reply-To: <44D842F7.7080805@bull.net> References: <44D842F7.7080805@bull.net> Message-ID: <1155137679.21204.144.camel@ayanami.boston.redhat.com> On Tue, 2006-08-08 at 09:53 +0200, Alain Moulle wrote: > Hi > > 1/ About the return of quorum disk functionnality : is it mandatory > to configure it, Not required in the least. -- Lon From lhh at redhat.com Wed Aug 9 15:35:16 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:35:16 -0400 Subject: [Linux-cluster] CS4 Update 2 / is this problem fix more recent update ? In-Reply-To: <44D7573C.5020209@bull.net> References: <44D3251A.5080001@bull.net> <1154704272.28677.90.camel@ayanami.boston.redhat.com> <44D707FD.6080602@bull.net> <1154961041.21204.40.camel@ayanami.boston.redhat.com> <44D7573C.5020209@bull.net> Message-ID: <1155137716.21204.146.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 17:07 +0200, Alain Moulle wrote: > Build ok. Thanks. > Could you explain exactly the benefit of this watchdog work ? > Thanks > Alain If rgmanager crashes, the node gets rebooted. 
-- Lon From lhh at redhat.com Wed Aug 9 15:37:07 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:37:07 -0400 Subject: RES: [Linux-cluster] DRBD in Active-active mode In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B6E@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B6E@corp-bsa-mp01.planejamento.gov.br> Message-ID: <1155137827.21204.149.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 16:09 -0300, Leonardo Rodrigues de Mello wrote: > Hi everyone, > > Lon you are right for the stable version of DRBD the version 0.7. But DRBD actualy has support for active-active setup in the development version 0.8. There is significant changes between this versions, the entire roadmap can be read at: > http://svn.drbd.org/drbd/trunk/ROADMAP Awesome. :) > > I have done some investigations and tests with DRBD 0.8 in active-active setup with two nodes and OCFS2 and with GFS. This was for one project i was doing related to oracle Rac 10g. > > I have produced one documentation in portuguese that shows how to setup and use drbd in active-active with ocfs2. the link is: > http://guialivre.governoeletronico.gov.br/seminario/index.php/DocumentacaoTecnologiasDRBDOCFS2 > > I have discovered in my investigations that ocfs2 is more unstable that GFS. I have a several kernel panics with ocfs2 under high loads on the machine, but no one with GFS. > > I have the instalation of GFS documented at, one performance test i have done some time ago: > http://guialivre.governoeletronico.gov.br/mediawiki/index.php/TestesGFS > (here i use clvm, and gnbd) > > > The problem of drbd is that actualy you can use just two machines, if you want to use more you need to use the commercial version drbd+. Wow, great information. Thanks! -- Lon From lhh at redhat.com Wed Aug 9 15:46:12 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:46:12 -0400 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: References: Message-ID: <1155138372.21204.158.camel@ayanami.boston.redhat.com> On Mon, 2006-08-07 at 15:21 -0500, hai wu wrote: > Hi, > For a 2-node cluster (RHEL4), does it require the use of power switch > or fence_drac would be good enough for the setup? Would fence_drac > work properly in a 2-node cluster? > Thanks, > Hai fence_drac would be fine, but you need to understand that with DRAC (or any integrated power management which receives power from the machine) that if host power is completely lost, fencing will fail - causing the cluster to stop. This failure is indistinguishable from DRAC + host losing network at the same time (ex: the ethernet switch fails). Generally, these machines have redundant power, so losing power all at once is less likely. So, DRAC is fine, but there are failure cases where it is less than optimal, particularly in machines without redundant power supplies. -- Lon From lhh at redhat.com Wed Aug 9 15:46:38 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 11:46:38 -0400 Subject: [Linux-cluster] cs-deploy-gfs In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B78@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B78@corp-bsa-mp01.planejamento.gov.br> Message-ID: <1155138398.21204.160.camel@ayanami.boston.redhat.com> On Wed, 2006-08-09 at 12:04 -0300, Leonardo Rodrigues de Mello wrote: > Hi everyone, > > Does anyone know what happened with the development of cs-deploy-gfs ? I think that it's being replaced with Conga. 
-- Lon From Leonardo.Mello at planejamento.gov.br Wed Aug 9 16:54:03 2006 From: Leonardo.Mello at planejamento.gov.br (Leonardo Rodrigues de Mello) Date: Wed, 9 Aug 2006 13:54:03 -0300 Subject: RES: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) Message-ID: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B7A@corp-bsa-mp01.planejamento.gov.br> Thanks for the anwsers :-D But I believe beside the fact that conga and cs-deploy-tool share some things in common. cs-deploy-tool has it place for simple and no pain instalation of cluster suite with basic services in one network environment. The point that count for me is that i can get anywhere with my laptop and just with the knowledge of IP numbers and root passwords setup one cluster in 10 minutes or less. To use conga i need go configure and install zope in one machine, install the agents in the servers that will be in the cluster, configure the zope to see the agents, its more complicated and demands more work for the simple task of cluster instalation and basic initial configuration. Conga is one great initiative and complex initiative for managing, deploy, administration, and others thinks for production cluster enviroments. if i need to choose one tool to just deploy cluster suite. i will choose cs-deploy-tool. If i need to manage, and be the administrator of a cluster, of course i will need the power of conga. :-D this long message is just to ask: Can I implement the changes i had proposed in the first message ? if, yes to who i will send they ? best regards Leonardo Rodrigues de Mello -----Mensagem original----- De: linux-cluster-bounces at redhat.com em nome de James Parsons Enviada: qua 9/8/2006 12:16 Para: linux clustering Cc: Assunto: Re: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) Leonardo Rodrigues de Mello wrote: >sorry, >the application name is cs-deploy-tool, not cs-deploy-gfs. > >Best Regards >Leonardo Rodrigues de Mello > > >-----Mensagem original----- >De: linux-cluster-bounces at redhat.com em nome de Leonardo Rodrigues de Mello >Enviada: qua 9/8/2006 12:04 >Para: linux-cluster at redhat.com >Cc: >Assunto: [Linux-cluster] cs-deploy-gfs > >Hi everyone, > >Does anyone know what happened with the development of cs-deploy-gfs ? > >The version in cvs is the lastest version ? > >There is any improvements since the initial version ? > >This software was abandoned ? > > The functionality available in cs-deploy-tool will be available in a new management interface for clusters and storage called Conga, targetted for RHEL5 and (hopefully) RHEL4.5 -J -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 3761 bytes Desc: not available URL: From jparsons at redhat.com Wed Aug 9 18:17:44 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 09 Aug 2006 14:17:44 -0400 Subject: RES: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) In-Reply-To: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B7A@corp-bsa-mp01.planejamento.gov.br> References: <1DDCE5B29CB5BC42BC2BFC39E3F1C8A3255B7A@corp-bsa-mp01.planejamento.gov.br> Message-ID: <44DA26C8.6020701@redhat.com> Leonardo Rodrigues de Mello wrote: >Thanks for the anwsers :-D > >But I believe beside the fact that conga and cs-deploy-tool share some things in common. cs-deploy-tool has it place for simple and no pain instalation of cluster suite with basic services in one network environment. 
> >The point that count for me is that i can get anywhere with my laptop and just with the knowledge of IP numbers and root passwords setup one cluster in 10 minutes or less. > >To use conga i need go configure and install zope in one machine, install the agents in the servers that will be in the cluster, configure the zope to see the agents, its more complicated and demands more work for the simple task of cluster instalation and basic initial configuration. > HOLD IT HOLD IT! I have to make an urgent correction to your response above :) Conga requires absolutely NO configuration of zope. In fact, zope is so far beneath the sheets that you will be able to install your OWN default instance of zope, and there will be no interaction with Conga. After installing the Conga server component, the admin enters the IP addresses of the machines/cluster nodes to be managed just like you do with cs-deploy-tool. There is no other configuration work necessary. It is true that you will need the agent installed on the machines that you wish to manage. cs-deploy-tool does not use an agent, but rather logs in through an ssh session with the user-provided root password in order to get things set up. cs-deploy-tool is not going away, and your patches are welcome. You can send them to me and I will forward them to Stan Kupcevic who wrote and maintains cs-deploy-tool. Thanks for your involvement, Leonardo. -J > > > >Conga is one great initiative and complex initiative for managing, deploy, administration, and others thinks for production cluster enviroments. if i need to choose one tool to just deploy cluster suite. i will choose cs-deploy-tool. > >If i need to manage, and be the administrator of a cluster, of course i will need the power of conga. :-D > >this long message is just to ask: Can I implement the changes i had proposed in the first message ? if, yes to who i will send they ? > > >best regards >Leonardo Rodrigues de Mello >-----Mensagem original----- >De: linux-cluster-bounces at redhat.com em nome de James Parsons >Enviada: qua 9/8/2006 12:16 >Para: linux clustering >Cc: >Assunto: Re: cs-deploy-tool (WAS: [Linux-cluster] cs-deploy-gfs) > >Leonardo Rodrigues de Mello wrote: > > > >>sorry, >>the application name is cs-deploy-tool, not cs-deploy-gfs. >> >>Best Regards >>Leonardo Rodrigues de Mello >> >> >>-----Mensagem original----- >>De: linux-cluster-bounces at redhat.com em nome de Leonardo Rodrigues de Mello >>Enviada: qua 9/8/2006 12:04 >>Para: linux-cluster at redhat.com >>Cc: >>Assunto: [Linux-cluster] cs-deploy-gfs >> >>Hi everyone, >> >>Does anyone know what happened with the development of cs-deploy-gfs ? >> >>The version in cvs is the lastest version ? >> >>There is any improvements since the initial version ? >> >>This software was abandoned ? 
>> >> >> >> >The functionality available in cs-deploy-tool will be available in a new >management interface for clusters and storage called Conga, targetted >for RHEL5 and (hopefully) RHEL4.5 > >-J > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > From haiwu.us at gmail.com Wed Aug 9 18:44:39 2006 From: haiwu.us at gmail.com (hai wu) Date: Wed, 9 Aug 2006 13:44:39 -0500 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: <1155138372.21204.158.camel@ayanami.boston.redhat.com> References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> Message-ID: Thanks Lon. We got redundant power here. How can I test this fence_drac? How to simulate a failure on one node and know for sure that it does kick in and restarts the failed node in the cluster? Thanks, Hai On 8/9/06, Lon Hohberger wrote: > > On Mon, 2006-08-07 at 15:21 -0500, hai wu wrote: > > Hi, > > For a 2-node cluster (RHEL4), does it require the use of power switch > > or fence_drac would be good enough for the setup? Would fence_drac > > work properly in a 2-node cluster? > > Thanks, > > Hai > > fence_drac would be fine, but you need to understand that with DRAC (or > any integrated power management which receives power from the machine) > that if host power is completely lost, fencing will fail - causing the > cluster to stop. > > This failure is indistinguishable from DRAC + host losing network at the > same time (ex: the ethernet switch fails). > > Generally, these machines have redundant power, so losing power all at > once is less likely. > > So, DRAC is fine, but there are failure cases where it is less than > optimal, particularly in machines without redundant power supplies. > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Wed Aug 9 19:05:32 2006 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 09 Aug 2006 15:05:32 -0400 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> Message-ID: <1155150332.21204.202.camel@ayanami.boston.redhat.com> On Wed, 2006-08-09 at 13:44 -0500, hai wu wrote: > Thanks Lon. We got redundant power here. > > How can I test this fence_drac? How to simulate a failure on one node > and know for sure that it does kick in and restarts the failed node in > the cluster? After both nodes join the cluster, try doing 'reboot -fn' on the node. Oh, also, you should be booting with acpi=off when using integrated power management. -- Lon From haiwu.us at gmail.com Wed Aug 9 20:39:50 2006 From: haiwu.us at gmail.com (hai wu) Date: Wed, 9 Aug 2006 15:39:50 -0500 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: <1155150332.21204.202.camel@ayanami.boston.redhat.com> References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> <1155150332.21204.202.camel@ayanami.boston.redhat.com> Message-ID: I got the following errors after "reboot -fn" on erd-tt-eproof1, which script do I need to change? 
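The log below shows what the agent reported, but before editing any script it is usually worth driving the fence agent by hand, outside the cluster, to prove the DRAC is reachable and the credentials work. This is only a sketch: it assumes fence_drac takes the usual -a/-l/-p/-o flags (check fence_drac -h for the real option names), and the address, credentials and grub entry below are placeholders.

    # Fence the peer manually; if this cannot log in or read the power
    # state, automatic fencing will fail the same way.
    fence_drac -a 10.0.0.12 -l root -p calvin -o reboot

    # Boot the nodes with ACPI disabled so a fence power-off is immediate
    # rather than a graceful shutdown request: append acpi=off to the
    # kernel line in /boot/grub/grub.conf, e.g.
    #   kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/VolGroup00/LogVol00 acpi=off

    # Then crash one node and watch /var/log/messages on the survivor for
    # the "fencing node ..." entries:
    reboot -fn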
Aug 9 15:35:40 erd-tt-eproof2 kernel: CMAN: removing node erd-tt-eproof1 from t he cluster : Missed too many heartbeats Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: erd-tt-eproof1 not a cluster member after 0 sec post_fail_delay Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: fencing node "erd-tt-eproof1" Aug 9 15:35:42 erd-tt-eproof2 fenced[3437]: agent "fence_drac" reports: WARNING : unable to detect DRAC version ' Dell Embedded Remote Access Controller (ERA) F irmware Version 3.31 (Build 07.15) ' WARNING: unsupported DRAC version '__unknow n__' failed: unable to determine power state This is DRAC on Dell PE2650. Thanks, Hai On 8/9/06, Lon Hohberger wrote: > > On Wed, 2006-08-09 at 13:44 -0500, hai wu wrote: > > Thanks Lon. We got redundant power here. > > > > How can I test this fence_drac? How to simulate a failure on one node > > and know for sure that it does kick in and restarts the failed node in > > the cluster? > > After both nodes join the cluster, try doing 'reboot -fn' on the node. > > Oh, also, you should be booting with acpi=off when using integrated > power management. > > -- Lon > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jparsons at redhat.com Wed Aug 9 20:54:36 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 09 Aug 2006 16:54:36 -0400 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> <1155150332.21204.202.camel@ayanami.boston.redhat.com> Message-ID: <44DA4B8C.1050602@redhat.com> hai wu wrote: > I got the following errors after "reboot -fn" on erd-tt-eproof1, which > script do I need to change? > > Aug 9 15:35:40 erd-tt-eproof2 kernel: CMAN: removing node > erd-tt-eproof1 from t > he cluster : Missed too many heartbeats > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: erd-tt-eproof1 not a > cluster member > after 0 sec post_fail_delay > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: fencing node "erd-tt-eproof1" > Aug 9 15:35:42 erd-tt-eproof2 fenced[3437]: agent "fence_drac" > reports: WARNING > : unable to detect DRAC version ' Dell Embedded Remote Access > Controller (ERA) F > irmware Version 3.31 (Build 07.15) ' WARNING: unsupported DRAC version > '__unknow > n__' failed: unable to determine power state > > This is DRAC on Dell PE2650. > Thanks, > Hai Do you know what DRAC version you are using? Can you please telnet into the drac port and find out what it says when it starts your session? Thanks, -J From haiwu.us at gmail.com Wed Aug 9 21:04:40 2006 From: haiwu.us at gmail.com (hai wu) Date: Wed, 9 Aug 2006 16:04:40 -0500 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: <44DA4B8C.1050602@redhat.com> References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> <1155150332.21204.202.camel@ayanami.boston.redhat.com> <44DA4B8C.1050602@redhat.com> Message-ID: I got the following prompts after telneting to the drac port, maybe a simple upgrade for the firmware would fix this issue: Dell Embedded Remote Access Controller (ERA) Firmware Version 3.31 (Build 07.15) Login: Thanks, Hai On 8/9/06, James Parsons wrote: > > hai wu wrote: > > > I got the following errors after "reboot -fn" on erd-tt-eproof1, which > > script do I need to change? 
> > > > Aug 9 15:35:40 erd-tt-eproof2 kernel: CMAN: removing node > > erd-tt-eproof1 from t > > he cluster : Missed too many heartbeats > > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: erd-tt-eproof1 not a > > cluster member > > after 0 sec post_fail_delay > > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: fencing node > "erd-tt-eproof1" > > Aug 9 15:35:42 erd-tt-eproof2 fenced[3437]: agent "fence_drac" > > reports: WARNING > > : unable to detect DRAC version ' Dell Embedded Remote Access > > Controller (ERA) F > > irmware Version 3.31 (Build 07.15) ' WARNING: unsupported DRAC version > > '__unknow > > n__' failed: unable to determine power state > > > > This is DRAC on Dell PE2650. > > Thanks, > > Hai > > Do you know what DRAC version you are using? Can you please telnet into > the drac port and find out what it says when it starts your session? > > Thanks, > > -J > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jparsons at redhat.com Wed Aug 9 22:47:45 2006 From: jparsons at redhat.com (James Parsons) Date: Wed, 09 Aug 2006 18:47:45 -0400 Subject: [Linux-cluster] 2-node cluster and fence_drac In-Reply-To: References: <1155138372.21204.158.camel@ayanami.boston.redhat.com> <1155150332.21204.202.camel@ayanami.boston.redhat.com> <44DA4B8C.1050602@redhat.com> Message-ID: <44DA6611.6080602@redhat.com> hai wu wrote: > I got the following prompts after telneting to the drac port, maybe a > simple upgrade for the firmware would fix this issue: > > Dell Embedded Remote Access Controller (ERA) > Firmware Version 3.31 (Build 07.15) > Login: > > Thanks, > Hai Oops. Sorry. Unsupported version. If you want, you could hack the agent script (it is in perl) and get it to accept that version and just *see* if it works -- it might. I tried looking for documentation for that firmware rev and couldn't google any. If you know of some, drop me a line and maybe we can get something working - or at least know if it *will ever* work. :) -J BTW, the agent supports DRAC III/XT, DRAC MC, and DRAC 4/I. > > On 8/9/06, *James Parsons* > wrote: > > hai wu wrote: > > > I got the following errors after "reboot -fn" on erd-tt-eproof1, > which > > script do I need to change? > > > > Aug 9 15:35:40 erd-tt-eproof2 kernel: CMAN: removing node > > erd-tt-eproof1 from t > > he cluster : Missed too many heartbeats > > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: erd-tt-eproof1 not a > > cluster member > > after 0 sec post_fail_delay > > Aug 9 15:35:40 erd-tt-eproof2 fenced[3437]: fencing node > "erd-tt-eproof1" > > Aug 9 15:35:42 erd-tt-eproof2 fenced[3437]: agent "fence_drac" > > reports: WARNING > > : unable to detect DRAC version ' Dell Embedded Remote Access > > Controller (ERA) F > > irmware Version 3.31 (Build 07.15) ' WARNING: unsupported DRAC > version > > '__unknow > > n__' failed: unable to determine power state > > > > This is DRAC on Dell PE2650. > > Thanks, > > Hai > > Do you know what DRAC version you are using? Can you please telnet > into > the drac port and find out what it says when it starts your session? 
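If you do take the route of hacking the agent, the first step is simply finding where it matches the login banner so the 3.31 firmware string can be added to the accepted patterns. The commands below are a sketch only -- the agent's location and the exact string it searches for are assumptions, so adjust them to your installation before relying on them.

    # Confirm which package owns the agent and where it lives.
    rpm -qf `which fence_drac`

    # Locate the banner-matching code inside the Perl script.
    grep -n "Firmware Version" `which fence_drac`

    # After editing the script to accept the 3.31 banner, prove the change
    # by fencing the peer manually (address and credentials are placeholders).
    fence_drac -a 10.0.0.12 -l root -p calvin -o reboot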
> > Thanks, > > -J > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > >------------------------------------------------------------------------ > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster > From joni at philox.eu Thu Aug 10 07:04:31 2006 From: joni at philox.eu (Jonathan Salomon) Date: Thu, 10 Aug 2006 09:04:31 +0200 Subject: [Linux-cluster] patch 2.6 kernel without modules Message-ID: <44DADA7F.8040505@philox.eu> Hi all, I want to use GFS for a webcluster with shared data through a iSCSI SAN. The cluster nodes are diskless and boot through PXE by downloading a kernel and rootfs that is stored in RAM. I have built a custom minimal Linux system with LFS (http://linuxfromscratch.org) to keep the image as small as possible (the smallest Fedora/RedHat I could get by stripping RPMs was still 650MB). I would like to work without kernel modules and therefore I would like to know whether it is possible to patch the 2.6 kernel to include GFS 'statically' (i.e. no kernel modules). As far as I cann tell the cluster-1.02.00 package I downloaded builds kernel modules. In addition I would like to know what minimal requirements I need to use GFS. The load balancing itself will be done on other machines with a different setup. Hence I would like to refrain from installing any of that functionality on the cluster nodes. From reading the docs I get the impression GFS needs a whole lot of clustering packages. Thanks! Jonathan From sbhagat at redhat.com Thu Aug 10 08:11:21 2006 From: sbhagat at redhat.com (Subodh Bhagat) Date: Thu, 10 Aug 2006 13:41:21 +0530 Subject: [Linux-cluster] Red Hat Cluster Service and Informix with 1.5 GB memory allocation Message-ID: <44DAEA29.1060903@redhat.com> Dear all, This issue is with one of our major customers, IBM Global Services. They are implementing a 3 node cluster and configuring Informix database for failover. The specifications of the three nodes are as follows: ADBM01 2.4.21-40.ELsmp i686 AS release 3 (Taroon Update 8) clumanager-1.2.31-1-i386 ADBM02 2.4.21-40.ELsmp i686 AS release 3 (Taroon Update 8) clumanager-1.2.31-1-i386 ADBM03 2.4.21-40.ELhugemem i686 AS release 3 (Taroon Update 8) clumanager-1.2.26.1-1-i386 Informix version: IBM Informix Dynamic Server 10.00.UC4 On Linux Intel Informix runs with over 1.8GB MEM allocated to it on the server when the clustering agents are turned off. Also it works with Mem allocation of less that 1.5 GB in cluster environment. But when in cluster environment, the node is rebooted if >=1.5 GB is allocated. At Informix end, the SHMBASE parameter would help only if there was a memory allocation issue between Linux and Informix. But as Informix runs with over 1.8GB MEM allocated to it on the server when the clustering agents are turned off, altering SHMBASE may not help resolving this issue. The issue most definitely be between the Red Hat Cluster Service and Informix with a high mem allocation. * We have suggested the customer to setup all the nodes in cluster as identical with respect to the kernel version and cluster suite versions and OS versions. * Any idea, what else can be done here? Please suggest. -- Regards, Subodh Bhagat, Technical Engineer, Red Hat India Pvt. Ltd. 
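One routine check when a service that allocates a large shared-memory segment behaves differently under the cluster than outside it -- and it is only an assumption that this is related to the reboots described above -- is that the kernel shared-memory limits are identical on every node the service can fail over to. A minimal sketch, with placeholder values:

    # Compare these on all three nodes; they should match if the nodes are
    # meant to be interchangeable.
    sysctl kernel.shmmax kernel.shmall

    # Example of raising shmmax to 2 GB and making it persistent (size the
    # value to the Informix segment, not to this placeholder).
    sysctl -w kernel.shmmax=2147483648
    echo "kernel.shmmax = 2147483648" >> /etc/sysctl.conf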
1st Floor, 'C' Wing, Fortune 2000,
Bandra Kurla Complex, Bandra (East),
Mumbai 400051
----------------------------
Mobile: +91-9323968930
Technical Support: +91-9322952612
Tel: +91-22-39878888 (Board Line)
Fax: +91-22-39878899
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at sparkyone.com  Thu Aug 10 11:35:13 2006
From: mark at sparkyone.com (Mark Reynolds)
Date: Thu, 10 Aug 2006 12:35:13 +0100 (BST)
Subject: [Linux-cluster] clurgmgrd stops service without reason
Message-ID: <35159.82.70.162.86.1155209713.squirrel@www.easilymail.co.uk>

Hi,

Have you been able to resolve this issue? I have the exact same symptoms on a
RedHat cluster (rgmanager version 1.9.46). I receive a message "stopping
service fileserver" and the node shuts down and ends up rebooting as it can't
unmount a partition. What worries me is that this has happened 3 times in 2
weeks with no obvious reason, as the server is working fine up until that
point.

The relevant section of my cluster.conf is