From raziebe at gmail.com Thu Dec 1 11:05:57 2005 From: raziebe at gmail.com (Raz Ben-Jehuda(caro)) Date: Thu, 1 Dec 2005 13:05:57 +0200 Subject: [Linux-cluster] redundancy in redhat clusters In-Reply-To: <7c1e2e6e97d94de9571ee838d2d9a677@redhat.com> References: <5d96567b0511290719u3b88bf01w7a275b704f4eb813@mail.gmail.com> <7c1e2e6e97d94de9571ee838d2d9a677@redhat.com> Message-ID: <5d96567b0512010305t17f35c56la306d97b84de9925@mail.gmail.com> got it. good point. why do you think raid5 would give poor performance ? as long as it is not in degredation mode the performance scales to n-1 disks. thanks raz. On 11/30/05, Jonathan E Brassow wrote: > > > On Nov 29, 2005, at 9:19 AM, Raz Ben-Jehuda(caro) wrote: > > > Question: > > I need to add to a clsutered environment redundancy. > > > > Since the native linux raid 5 is not clustered awared, > > what would make it aware to the cluster ? > > What does it lack ? > > > > Clustered file systems and applications will ensure that they are not > doing simultaneous writes to the same [meta-]data. However, they have > no way to tell that a write to one area will conflict with the write to > another because of the stripe width and parity calculation of the RAID > device. This will lead to parity block corruption. > > To solve this problem, the RAID 5 implementation must be cluster aware > and take out single-writer/multiple-reader locks on the stripes - > ensuring that multiple machines are not writing to the same stripe at > the same time. > > The performance of a cluster-aware software RAID 5 is likely to be > abysmal, and will probably not rank very high on anyone's priority > list. > > A mirroring solution is in the works, and later, dd-raid may become a > reality. > > brassow > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Raz -------------- next part -------------- An HTML attachment was scrubbed... URL: From danwest at comcast.net Thu Dec 1 15:17:45 2005 From: danwest at comcast.net (danwest at comcast.net) Date: Thu, 01 Dec 2005 15:17:45 +0000 Subject: [Linux-cluster] recovery= options implemented? Message-ID: <120120051517.12633.438F141900033D5E0000315922007354469B9C0A99020E0B@comcast.net> Does anyone know if the below options are actually implemented/working? Thanks, Dan This currently has three possible options: "restart" tries to restart failed parts of this resource group locally before attempting to relocate (default); "relocate" does not bother trying to restart the service locally; "disable" disables the resource group if any component fails. Note that any resource with a valid "recover" operation which can be recovered without a restart will be. Failure recovery policy From lhh at redhat.com Thu Dec 1 23:22:20 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 01 Dec 2005 18:22:20 -0500 Subject: [Linux-cluster] recovery= options implemented? In-Reply-To: <120120051517.12633.438F141900033D5E0000315922007354469B9C0A99020E0B@comcast.net> References: <120120051517.12633.438F141900033D5E0000315922007354469B9C0A99020E0B@comcast.net> Message-ID: <1133479340.11030.152.camel@ayanami.boston.redhat.com> On Thu, 2005-12-01 at 15:17 +0000, danwest at comcast.net wrote: > Does anyone know if the below options are actually implemented/working? 
> > Thanks, > Dan > > > > This currently has three possible options: "restart" tries > to restart failed parts of this resource group locally before > attempting to relocate (default); "relocate" does not bother > trying to restart the service locally; "disable" disables > the resource group if any component fails. Note that > any resource with a valid "recover" operation which can be > recovered without a restart will be. > > > Failure recovery policy > > > They should be, but they're not in the GUI currently... -- Lon From bmarzins at redhat.com Fri Dec 2 22:51:08 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Fri, 2 Dec 2005 16:51:08 -0600 Subject: [Linux-cluster] help In-Reply-To: <5d96567b0511282201y3a616e11t332b16039b0ce2bb@mail.gmail.com> References: <5d96567b0511220738l7fb3d5c9u9df6dc12d2fd3bef@mail.gmail.com> <20051128161723.GC27662@redhat.com> <5d96567b0511281007j60a7dca1v9bf3d252a920a66e@mail.gmail.com> <20051128181357.GG27662@redhat.com> <5d96567b0511282146k4d1f839dq76b42a11e566c0d9@mail.gmail.com> <5d96567b0511282201y3a616e11t332b16039b0ce2bb@mail.gmail.com> Message-ID: <20051202225107.GB14768@phlogiston.msp.redhat.com> On Tue, Nov 29, 2005 at 08:01:20AM +0200, Raz Ben-Jehuda(caro) wrote: > sorry, i maanage to make it work only when cache enabled. > Is it possible to do it with no cache ? The only difference in letting you export between cached and uncached, is that uncached requires the server to be a member of a quorate cluster. Could you start up gnbd_serv with the -v option, try to export an uncached device, and mail me what the messages you get back, both from the command and from the logs. -Ben > On 11/29/05, Raz Ben-Jehuda(caro) wrote: > > been there. > > if i would load gnbd_serv with no cluster i would failed to export > > any devices with gnbd_export. > > if i join with: "cman_tool -X -e 2 join -c gamma -m 224.0.0.1 -i eth1" > > and then gnbd_export hangs and dmeg reports > > CMAN: Waiting to join or form a Linux-cluster > > CMAN: forming a new cluster > > CMANsendmsg failed: -22. > > sometimes gnbd_export just says that ERROR create request failed : > > Operation not supported: > > So again i am stuck. > > > > On 11/28/05, David Teigland wrote: > > > On Mon, Nov 28, 2005 at 08:07:47PM +0200, Raz Ben-Jehuda(caro) wrote: > > > > tried it. > > > > According to the min-gfs.txt at the GNBD server the only thing i have to do > > > > is simly run gnbd_serv. but looking at the code i learned that i need > > > > to load cman. > > > > yet this is not enough. gnbd_serv fails to load with : > > > > > > > > gnbd_serv: ERROR cannot get node name : No such process > > > > gnbd_serv: ERROR No cluster manager is running > > > > gnbd_serv: ERROR If you are not planning to use a cluster manager, use -n > > > > > > > > does gnbd_serv depends in a cluster manager? > > > > What is my mistake ? this is not part of the cluster. > > > > > > min-gfs.txt is wrong, I'll fix it. You need to use gnbd_serv -n. > > > Then gnbd_serv will ignore all clustering stuff which is what you want. > > > > > > Dave > > > > > > > > > > > > -- > > Raz > > > > > -- > Raz > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From bmarzins at redhat.com Fri Dec 2 23:16:45 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Fri, 2 Dec 2005 17:16:45 -0600 Subject: [Linux-cluster] Attn Patrick. 
In-Reply-To: <013d01c5f42d$c49a0120$6401a8c0@fileserver> References: <013d01c5f42d$c49a0120$6401a8c0@fileserver> Message-ID: <20051202231645.GC14768@phlogiston.msp.redhat.com> On Mon, Nov 28, 2005 at 11:09:45PM +0800, James Davis wrote: > Hi Patrick, > I'm hoping you can explain to me, and if I have the right idea > about GFS/GNBD > > I'm referring to this document http://gfs.wikidev.net/GNBD_installation for > the purposes of config > > What I'm trying to do is setup GNBD on 2 machines.. I'm wondering if its > possible for the two mahines to cluster the data on the local hdd's rather > than using an external storage array.. No. Not without cluster aware mirror software, which isn't available just yet. Otherwise, you run into the problem where If a node crashes, it might have written to one device and not the other. So your mirror can be out of sync, and the other machine will never know. -Ben > i.e machine raid0. basically replicated volumes... > > Also if this IS possible, am I right in assuming if one machine goes down > the second one will take over the primaries role... > > On the client machine how do you set it to connect to the cluster? > > The documentation seems to be very lacking and I'm somewhat stressed out > from work wondering if what I'm trying to do is possible? > > If you need more clarification on what I'm trying to do please ask. > > Sorry if this comes across as newbish > > Regards > James > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From elmar at pruesse.net Sat Dec 3 22:49:15 2005 From: elmar at pruesse.net (Elmar Pruesse) Date: Sat, 03 Dec 2005 23:49:15 +0100 Subject: [Linux-cluster] New (small) cluster; What filesystem? GFS? Message-ID: <439220EB.80901@pruesse.net> hi! We're getting a new cluster for Christmas this year, and I am wondering whether there are better options than the nfs-server setup used on our old one. Unfortunately we don't have any people who have had the chance to try things out, and we won't have much time either. I'd really appreciate some comments on what would make sense for us and what would not. (I did read up some, but everyone claims to be best...) We've got: Five nodes 16GB/dual 275 opteron, one 32GB DB-host and one 8GB fileserver. The latter two have external SCSI-Raids. All are interconnected via Infiniband. We will expand to 12 or 16 nodes with the next round of money. The old cluster (as provided by our IBM vendor) uses a NFS-server connected to a fibre-channel raid. Filesystem performance is a [expletive deleted] major problem. We will use the cluster for serveral different bioinformatics tools, some of which I've been told produce directories with many thousands of files. Does iSCSI+GFS make sense? And more so than NFS? Would you route it via the Infiniband network or via GbE? How about Lustre, PVFS2, OCFS? regards, Elmar ps: Please do tell me the combination of hardware makes at least a little sense. The infiniband was something of an afterthought, since it was unexpectetly cheap. 
From raziebe at gmail.com Sun Dec 4 11:07:58 2005 From: raziebe at gmail.com (Raz Ben-Jehuda(caro)) Date: Sun, 4 Dec 2005 13:07:58 +0200 Subject: [Linux-cluster] help In-Reply-To: <20051202225107.GB14768@phlogiston.msp.redhat.com> References: <5d96567b0511220738l7fb3d5c9u9df6dc12d2fd3bef@mail.gmail.com> <20051128161723.GC27662@redhat.com> <5d96567b0511281007j60a7dca1v9bf3d252a920a66e@mail.gmail.com> <20051128181357.GG27662@redhat.com> <5d96567b0511282146k4d1f839dq76b42a11e566c0d9@mail.gmail.com> <5d96567b0511282201y3a616e11t332b16039b0ce2bb@mail.gmail.com> <20051202225107.GB14768@phlogiston.msp.redhat.com> Message-ID: <5d96567b0512040307v52782e16teb34800b9f6e0cef@mail.gmail.com> gnbd_serv cannot load without -n flag. They fixed it in min-gfs.txt document. So, I have a little problem with it. On 12/3/05, Benjamin Marzinski wrote: > > On Tue, Nov 29, 2005 at 08:01:20AM +0200, Raz Ben-Jehuda(caro) wrote: > > sorry, i maanage to make it work only when cache enabled. > > Is it possible to do it with no cache ? > > The only difference in letting you export between cached and uncached, is > that > uncached requires the server to be a member of a quorate cluster. Could > you > start up gnbd_serv with the -v option, try to export an uncached device, > and > mail me what the messages you get back, both from the command and from the > logs. > > -Ben > > > On 11/29/05, Raz Ben-Jehuda(caro) wrote: > > > been there. > > > if i would load gnbd_serv with no cluster i would failed to export > > > any devices with gnbd_export. > > > if i join with: "cman_tool -X -e 2 join -c gamma -m 224.0.0.1 -i eth1" > > > and then gnbd_export hangs and dmeg reports > > > CMAN: Waiting to join or form a Linux-cluster > > > CMAN: forming a new cluster > > > CMANsendmsg failed: -22. > > > sometimes gnbd_export just says that ERROR create request failed : > > > Operation not supported: > > > So again i am stuck. > > > > > > On 11/28/05, David Teigland wrote: > > > > On Mon, Nov 28, 2005 at 08:07:47PM +0200, Raz Ben-Jehuda(caro) > wrote: > > > > > tried it. > > > > > According to the min-gfs.txt at the GNBD server the only thing i > have to do > > > > > is simly run gnbd_serv. but looking at the code i learned that i > need > > > > > to load cman. > > > > > yet this is not enough. gnbd_serv fails to load with : > > > > > > > > > > gnbd_serv: ERROR cannot get node name : No such process > > > > > gnbd_serv: ERROR No cluster manager is running > > > > > gnbd_serv: ERROR If you are not planning to use a cluster manager, > use -n > > > > > > > > > > does gnbd_serv depends in a cluster manager? > > > > > What is my mistake ? this is not part of the cluster. > > > > > > > > min-gfs.txt is wrong, I'll fix it. You need to use gnbd_serv -n. > > > > Then gnbd_serv will ignore all clustering stuff which is what you > want. > > > > > > > > Dave > > > > > > > > > > > > > > > > > -- > > > Raz > > > > > > > > > -- > > Raz > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Raz -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From raziebe at gmail.com Mon Dec 5 13:04:05 2005 From: raziebe at gmail.com (Raz Ben-Jehuda(caro)) Date: Mon, 5 Dec 2005 05:04:05 -0800 Subject: [Linux-cluster] question : the flow of adding a GNBD with new storage Message-ID: <5d96567b0512050504p2f478965x75a49eaa6ceb6a82@mail.gmail.com> when i am adding a new GNBD to the cluster with an additional storage. obvioulsy i must add it both to volume and the file system. question : does clustered linux migrate data to balance cluster ? If so , how ? -- thank you Raz -------------- next part -------------- An HTML attachment was scrubbed... URL: From gforte at leopard.us.udel.edu Tue Dec 6 00:40:16 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Mon, 05 Dec 2005 19:40:16 -0500 Subject: [Linux-cluster] two fencing problems Message-ID: <4394DDF0.4080603@leopard.us.udel.edu> two (probably related) questions concerning fencing and APC AP7900 units: 1) fence_apc doesn't appear to be compatible with these units - when I run: sudo /sbin/fence_apc -a -l -p -n1 -T -v it comes back with: failed: unrecognised menu response The output file shows that it's getting as far as the "Outlet Control/Configuration" menu, but never selects the specified port. This is on RHEL ES4 update 2 with fence-1.32.6-0 installed. Does anyone have this working with AP7900s, and if so did you have to hack the fence_apc script or is there just something I'm missing? 2) in the cluster configuration tool (GUI), there's no place to specify the port to cycle for an "APC Power Device". I tried adding "port=#" to the tags in the cluster.conf file, but the cluster configuration tool didn't like that. And of course, I was unable to test if this actually works anyway because of problem #1 :-( Anyway, assuming I get fence_apc to work, how do I specify ports in the cluster configuration tool? or is this not supported? In which case can I add the port option in the cluster.conf like I'm trying to do and have it work? I have system-config-cluster-1.0.16-1.0 installed. -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From busyadmin at gmail.com Tue Dec 6 02:04:38 2005 From: busyadmin at gmail.com (busy admin) Date: Mon, 5 Dec 2005 19:04:38 -0700 Subject: [Linux-cluster] manual fencing not working in RHEL4 branch In-Reply-To: <20051130164839.GB23663@redhat.com> References: <1c0e77670511281307i75bc26a4pc5bbcd3d152a8c8e@mail.gmail.com> <20051128212731.GK27662@redhat.com> <1c0e77670511291853i2603f61ayf2eae51903032ebd@mail.gmail.com> <20051130164839.GB23663@redhat.com> Message-ID: <1c0e77670512051804l5cc38edfy1d2b87e8fd71cf2e@mail.gmail.com> David, I have tried the same init scripts with both ipmi and drac fencing, no problems. When I try manual fencing (it seems) that fence_manual introduces some strangeness such that I run into my problem. What is the problem: When running manual fencing and doing failover testing, my secondary node takes over the service without waiting for a fence_ack_manual. This all works perfectly with automatic fencing (ipmi, drac). I have the same problem (most of the time) when I run this whole thing by hand: 1. nodeA: ccsd 2. nodeB: ccsd 3. nodeA: cman_tool join -w 4. nodeB: cman_tool join -w 5. nodeA: fence_tool join -w 6. nodeB: fence_tool join -w When I start to see the problem, on the next reboot of both the systems I can replace steps 5 & 6 with 'fenced -D'. Now if I try to failover a machine then manual fencing works perfectly (meaning forces me to do a fence_ack_manual before a service fails over). 
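For reference, the by-hand sequence being described here boils down to something like the following sketch; the ordering notes are the important part, and the placeholder node name is illustrative rather than taken from this setup:

    # run on each node, keeping the two machines roughly in step
    ccsd
    cman_tool join -w          # wait until the node is a cluster member
    cat /proc/cluster/status   # confirm "Cluster-Member" and quorum
    fence_tool join -w         # only then join the fence domain

    # after a real failure, manual fencing blocks recovery until an
    # operator power-cycles the dead node and acknowledges it:
    fence_ack_manual -n <failed-node>

Joining the fence domain only after both nodes are members is the same point Dave makes in the reply quoted below: if membership is already complete when the fence domain forms, nothing should get fenced at startup.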
Next, I can go in and change 'fenced -D' back to 'fence_tool join -w' and things still work (forces me to run fence_ack_manual). Next, if I replace the manual steps above with the init scripts then manual fencing breaks all over again until I repeat the above steps. Sounds like a timing issue around fence_manual? Let me know if you want me to try anything different. Thanks for all your help. On 11/30/05, David Teigland wrote: > On Tue, Nov 29, 2005 at 07:53:09PM -0700, busy admin wrote: > > Here's a quick summary of what I've done and the results... to > > simplify the config I've just been running ccsd and cman via init > > scripts during boot and then manual executing 'fenced' or 'fence_tool' > > or the fenced init script. The results I see are random success's and > > failures! > > > > Initial test - reboot both systems and then, on both, executed 'fenced > > -D' both systems joined the cluster and it was quorate. Rebooted one > > node and to my surprise manual fencing worked, meaning > > /tmp/fence_manual.fifo was created and I had to run 'fence_ack_manual' > > on the other node. Tried again when the first node came back up and > > again everything worked as expected. > > > > Additional testing - reboot both system and then, on both, executed > > 'fence_tool join -w', both systems joined the cluster and it was > > quorate. Rebooted one node and no fencing was done (nothing logged in > > /var/log/messages). > > > > rebooted both systems again and this time executed 'fenced -D' on both > > nodes... rebooted a node and fencing worked, was logged in > > /var/log/messages and I had to manual run 'fence_ack_manual -n x64-5'. > > when that node came back up again I again manually executed 'fenced > > -D' on it and the cluster was quorate. I then rebooted the other node > > and again fencing worked! > > > > so again I rebooted both nodes and executed 'fence_tool join -w' on > > each... I again rebooted a node and fencing worked this time. fenced > > msgs were logged to /var/log/messages, /tmp/fence_manual.fifo was > > created and I had to execute 'fence_ack_manual -n x64-4' to recover. > > > > ... more testing w/mixed results ... > > > > modified fenced init script to execute 'fenced -D &' instead of > > 'fence_tool join -w' and used chkconfig to turn it on on both systems > > and rebooted them. both system restarted and joined the cluster. once > > again I rebooted one node (x64-4) and fencing didn't work... nothing > > was logged in /var/log/messages from fenced. see corresponding > > /var/log/messages, fenced -D output and cluster.conf below. > > It's not clear what you're trying to test or what you expect to happen. > Here's the optimal way to start up a cluster from a newly rebooted state: > > 1. nodeA: ccsd > 2. nodeB: ccsd > 3. nodeA: cman_tool join -w > 4. nodeB: cman_tool join -w > 5. nodeA: fence_tool join > 6. nodeB: fence_tool join > > It's best if steps 5 & 6 only happen after both nodes are members of > the cluster (see 'cman_tool nodes'). If this is the case, then no > nodes should be fenced when starting up. > > If you use the init scripts you may loose a little control and certainty > about what happens when, so I'd suggest using the commands directly until > you know that things are running correctly, then try the init scripts. > > If, from the state above, nodeB fails, then nodeA should always fence > nodeB. With manual fencing, this means that a message should appear in > nodeA's /var/log/messages telling you to reboot nodeB and run > fence_ack_manual. 
If, by chance, nodeB reboots and rejoins the cluster > before you get to running fence_ack_manual, the fencing system on nodeA > will just complete the fencing operation itself and you don't need to run > fence_ack_manual (and if you try, the fence_ack_manual command will report > an error.) > > Dave > > From bmarzins at redhat.com Tue Dec 6 02:49:20 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Mon, 5 Dec 2005 20:49:20 -0600 Subject: [Linux-cluster] help In-Reply-To: <5d96567b0512040307v52782e16teb34800b9f6e0cef@mail.gmail.com> References: <5d96567b0511220738l7fb3d5c9u9df6dc12d2fd3bef@mail.gmail.com> <20051128161723.GC27662@redhat.com> <5d96567b0511281007j60a7dca1v9bf3d252a920a66e@mail.gmail.com> <20051128181357.GG27662@redhat.com> <5d96567b0511282146k4d1f839dq76b42a11e566c0d9@mail.gmail.com> <5d96567b0511282201y3a616e11t332b16039b0ce2bb@mail.gmail.com> <20051202225107.GB14768@phlogiston.msp.redhat.com> <5d96567b0512040307v52782e16teb34800b9f6e0cef@mail.gmail.com> Message-ID: <20051206024920.GC30722@phlogiston.msp.redhat.com> On Sun, Dec 04, 2005 at 01:07:58PM +0200, Raz Ben-Jehuda(caro) wrote: > gnbd_serv cannot load without -n flag. > They fixed it in min-gfs.txt document. > So, I have a little problem with it. I have had no problem starting gnbd_serv without the -n option. If you try to start gnbd_serv without a cluster manager running on the node, you will receive a message like this on the command line gnbd_serv: ERROR cannot get node name : No such process gnbd_serv: ERROR No cluster manager is running gnbd_serv: ERROR If you are not planning to use a cluster manager, use -n and like this in syslog ERROR [gnbd_serv.c:389] If you are not planning to use a cluster manager, use -n This is not a gnbd error. This means that the cluster has not been properly started. Not only must there be a working cluster, but the gnbd server node must be a cluster member. In the min-gfs.txt document, the gnbd server node is not a cluster member, so you will not be able to export uncached gnbds with this setup. If you believe that the gnbd server node is a member of a quorate cluster, you can do this check. run # ccsd # cman_tool join on all cluster nodes. Then run # cat /proc/cluster/status For "Membership state:" you should see "Cluster-Member" For "Nodes:" you should see the number of nodes you expect there to be. If the cluster has not started up, you will not see a "Nodes:" section, and "Membership state:" will say something like "Starting" or "Not-in-Cluster". If you are a cluster member, and you still cannot start up gnbd_serv, please run it with the -v option, and send me the command and log output, along with the output from # cat /proc/cluster/status. Thanks, Ben > On 12/3/05, Benjamin Marzinski wrote: > > On Tue, Nov 29, 2005 at 08:01:20AM +0200, Raz Ben-Jehuda(caro) wrote: > > sorry, i maanage to make it work only when cache enabled. > > Is it possible to do it with no cache ? > > The only difference in letting you export between cached and uncached, > is that > uncached requires the server to be a member of a quorate cluster. Could > you > start up gnbd_serv with the -v option, try to export an uncached device, > and > mail me what the messages you get back, both from the command and from > the logs. > > -Ben > > > On 11/29/05, Raz Ben-Jehuda(caro) wrote: > > > been there. > > > if i would load gnbd_serv with no cluster i would failed to export > > > any devices with gnbd_export. 
> > > if i join with: "cman_tool -X -e 2 join -c gamma -m 224.0.0.1 -i > eth1" > > > and then gnbd_export hangs and dmeg reports > > > CMAN: Waiting to join or form a Linux-cluster > > > CMAN: forming a new cluster > > > CMANsendmsg failed: -22. > > > sometimes gnbd_export just says that ERROR create request failed : > > > Operation not supported: > > > So again i am stuck. > > > > > > On 11/28/05, David Teigland < teigland at redhat.com> wrote: > > > > On Mon, Nov 28, 2005 at 08:07:47PM +0200, Raz Ben-Jehuda(caro) > wrote: > > > > > tried it. > > > > > According to the min-gfs.txt at the GNBD server the only thing > i have to do > > > > > is simly run gnbd_serv. but looking at the code i learned that i > need > > > > > to load cman. > > > > > yet this is not enough. gnbd_serv fails to load with : > > > > > > > > > > gnbd_serv: ERROR cannot get node name : No such process > > > > > gnbd_serv: ERROR No cluster manager is running > > > > > gnbd_serv: ERROR If you are not planning to use a cluster > manager, use -n > > > > > > > > > > does gnbd_serv depends in a cluster manager? > > > > > What is my mistake ? this is not part of the cluster. > > > > > > > > min-gfs.txt is wrong, I'll fix it. You need to use gnbd_serv -n. > > > > Then gnbd_serv will ignore all clustering stuff which is what you > want. > > > > > > > > Dave > > > > > > > > > > > > > > > > > -- > > > Raz > > > > > > > > > -- > > Raz > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Raz From raziebe at gmail.com Tue Dec 6 10:21:09 2005 From: raziebe at gmail.com (Raz Ben-Jehuda(caro)) Date: Tue, 6 Dec 2005 02:21:09 -0800 Subject: [Linux-cluster] no a storage question Message-ID: <5d96567b0512060221u8a92e5al44fdaa7a401a1466@mail.gmail.com> i know this not the place, but since they so many kernel developers here... i have a different question regarding locking in the kernel. In the last issue of linux magazine there was an article about locking. it presented the follwing scenario. spin_lock(lock) ... spin_unlock(lock). In uni processot mahcine with preemption enabled. spin_lock saves flags and cli(). spin_unlock() push flags out ( for nesting interrupts) they said that this code : Does not protect preemption. no protection from preemption ? How ? How can process B get some cpu while process A had disabled interrupts, no scheduling (unwillingly ) can be made.Prior to the preemption the kernel scehdular must run and set Process B as the one to run , and to the best my knowledge, a schedular runs only in timer interrupt. Or is it possible that the schduler timer routine isn't running in interrupt context ? thank you -- Raz -------------- next part -------------- An HTML attachment was scrubbed... URL: From adingman at cookgroup.com Tue Dec 6 15:29:23 2005 From: adingman at cookgroup.com (Andrew C. Dingman) Date: Tue, 06 Dec 2005 10:29:23 -0500 Subject: [Linux-cluster] two fencing problems In-Reply-To: <4394DDF0.4080603@leopard.us.udel.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> Message-ID: <1133882963.7571.43.camel@adingman.cin.cook> At least in RHEL3, fence_apc only works with a very particular configuration on the power strip. 
In particular, the numbers for menu items change depending on whether the connecting user has permission to do various things, so revoking permission to use outlets that aren't part of your cluster will break the fencing agent. I ended up writing my own fencing agent that was able to deal with at least a few more configurations, though it's still not as flexible as I'd like. My fencing agent is attached. I wrote it for RHEL3 and APC AP7901 power strips, configured the way I wanted them, so it may or may not work for you. Read the code. Test it somewhere it can't do any significant damage. Don't come crying to me or my employer if it breaks. I disclaim any responsibility for anything it might do, however heinous. Read and understand the code before you use it. It's at least a start. It's not as general as I'd like it to be, but since it works in the clusters I wrote it for, I haven't been motivated to change it. It's derived from the fencing agents Red Hat distributes, and therefore also under the GPL. Once you've got that working for GFS, you can then set cluster suite to use the Stonith bridge to fence through GFS, so you don't need to explicitly configure the fencing device in system-config-cluster. Hope that helps. -Andrew C. Dingman Unix Administrator Cook Incorporated On Mon, 2005-12-05 at 19:40 -0500, Greg Forte wrote: > two (probably related) questions concerning fencing and APC AP7900 units: > > 1) fence_apc doesn't appear to be compatible with these units - when I run: > > sudo /sbin/fence_apc -a -l -p -n1 -T -v > > it comes back with: > > failed: unrecognised menu response > > The output file shows that it's getting as far as the "Outlet > Control/Configuration" menu, but never selects the specified port. > > This is on RHEL ES4 update 2 with fence-1.32.6-0 installed. > > Does anyone have this working with AP7900s, and if so did you have to > hack the fence_apc script or is there just something I'm missing? > > 2) in the cluster configuration tool (GUI), there's no place to > specify the port to cycle for an "APC Power Device". I tried adding > "port=#" to the tags in the cluster.conf file, but the > cluster configuration tool didn't like that. And of course, I was > unable to test if this actually works anyway because of problem #1 :-( > > Anyway, assuming I get fence_apc to work, how do I specify ports in the > cluster configuration tool? or is this not supported? In which case > can I add the port option in the cluster.conf like I'm trying to do and > have it work? I have system-config-cluster-1.0.16-1.0 installed. > > -g > > Greg Forte > gforte at udel.edu > IT - User Services > University of Delaware > 302-831-1982 > Newark, DE > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gforte at leopard.us.udel.edu Tue Dec 6 16:26:34 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Tue, 06 Dec 2005 11:26:34 -0500 Subject: [Linux-cluster] two fencing problems In-Reply-To: <20051206034238.GA3226@rover.pcbi.upenn.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> Message-ID: <4395BBBA.2090404@leopard.us.udel.edu> Bryan Cardillo wrote: > I'm in the process of testing the attached patch, basically > just had to remove a portion of the match for the `Control > Outlet' option. Interesting ... 
I see you were getting hung on the menu after where I was - looks like my problem was that the author didn't expect anyone to rename their outlets to something more useful than "Outlet 1", "Outlet 2", etc. The same problem plagues the next menu, because it was looking to match the "----- Outlet # -------" banner, but the assigned name shows up there instead. The following patch (against the "original") seemingly fixes both of these problems generally (incorporating Bryan's fix as well). --- /sbin/fence_apc 2005-08-01 19:01:17.000000000 -0400 +++ fence_apc 2005-12-06 09:09:55.000000000 -0500 @@ -244,10 +244,10 @@ /--\s*device manager.*(\d+)\s*-\s*Outlet Control/is || # "Device Manager", "1- Cluster Node 0 ON" - /--\s*Outlet Control.*(\d+)\s*-\s+Outlet\s+$opt_n\D[^\n]*\s(?-i:ON|OFF)\*?\s/ism || + /--\s*Outlet Control.*($opt_n)\s*-[^\n]+\s(?-i:ON|OFF)\*?\s/ism || # Administrator Outlet Control menu - /--\s*Outlet $opt_n\D.*(\d+)\s*-\s*control outlet\s+$opt_n\D/ism + /Outlet\s+:\s*$opt_n\D.*(\d+)\s*-\s*control outlet/ism ) { $t->print($1); next; > here is the clusternode elem I'm using, with the port > specified, and seems to work so far. as far as I know, this > must be specified in the cluster.conf manually. > > > > > > > > Ah, I see I was confusing with - it looks like it is configurable in the configuration tool afterall, under "manage fencing for this node". Here's what I got after setting it up with my two cross-wired PDUs (the nodes have redundant power, so node 1 is plugged into outlet 1 on each pdu, and node 2 to outlet 2 on each pdu): Except then when I stopped the configurator and started it again it complained about the "switch=" options that it put there itself! removing them by hand seems to have fixed it. *sigh* And it still doesn't appear to work ... I can turn the outlets on and off from the command line, but if I down the interface on a node, the other node reports that it's removing the "failed" node from the cluster, and that it's fencing the "failed" node, but the "failed" node never gets shut down. Does this get logged somewhere besides /var/log/messages, or is there a way to force it to be more verbose? If I could see what command fenced is actually invoking that might help ... -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From jeff at jettis.com Tue Dec 6 17:12:34 2005 From: jeff at jettis.com (Jeff Dinisco) Date: Tue, 6 Dec 2005 09:12:34 -0800 Subject: [Linux-cluster] custom fence agent Message-ID: Matt, This script is great. I just finished hacking on it for my own purposes and it's working well from the command line. Could you pass along your fencing section from cluster.conf as well? Thanks a million. - Jeff DiNisco _____ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Matt Brookover Sent: Wednesday, November 30, 2005 5:34 PM To: linux clustering Subject: Re: [Linux-cluster] custom fence agent I took the fence_apc and hacked it to do what I needed. The fence agents are perl scripts and can easily be modified to fit most any SAN. Matt On Wed, 2005-11-30 at 13:58, Jeff Dinisco wrote: Could someone outline the rules for creating your own fencing agent and how they're applied in cluster.conf? Or just point me to a doc? Thanks - Jeff _____ -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mbrookov at mines.edu Tue Dec 6 18:00:12 2005 From: mbrookov at mines.edu (Matt Brookover) Date: Tue, 06 Dec 2005 11:00:12 -0700 Subject: [Linux-cluster] custom fence agent In-Reply-To: References: Message-ID: <1133892012.4169.5.camel@merlin.Mines.EDU> I am currently using GFS 6.0, so there is no cluster.conf. I have included all three files, cluster.css, fence.css and nodes.css. [root at imagine CSM_ACN]# more * :::::::::::::: cluster.ccs :::::::::::::: cluster { name = "CSM_ACN" lock_gulm { servers = ["imagine.Mines.EDU","illuminate.Mines.EDU","illusion.Mines.EDU"] heartbeat_rate = 3.0 allowed_misses = 5 } } :::::::::::::: fence.ccs :::::::::::::: fence_devices { CSMACN_fence { agent = "fence_cisco" } } :::::::::::::: nodes.ccs :::::::::::::: nodes { imagine.Mines.EDU { ip_interfaces { eth0 = "138.67.130.1" } fence { snmpfence { CSMACN_fence { port="imagine" } } } } illuminate.Mines.EDU { ip_interfaces { eth0 = "138.67.130.2" } fence { snmpfence { CSMACN_fence { port="illuminate" } } } } illusion.Mines.EDU { ip_interfaces { eth0 = "138.67.130.3" } fence { snmpfence { CSMACN_fence { port="illusion" } } } } inspire.Mines.EDU { ip_interfaces { eth0 = "138.67.130.5" } fence { snmpfence { CSMACN_fence { port="inspire" } } } } inception.Mines.EDU { ip_interfaces { eth0 = "138.67.130.4" } fence { snmpfence { CSMACN_fence { port="inception" } } } } incantation.Mines.EDU { ip_interfaces { eth0 = "138.67.130.6" } fence { snmpfence { CSMACN_fence { port="incantation" } } } } } [root at imagine CSM_ACN]# On Tue, 2005-12-06 at 10:12, Jeff Dinisco wrote: > Matt, > > This script is great. I just finished hacking on it for my own > purposes and it's working well from the command line. Could you pass > along your fencing section from cluster.conf as well? Thanks a > million. > > - Jeff DiNisco > > ______________________________________________________________________ > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Matt Brookover > Sent: Wednesday, November 30, 2005 5:34 PM > To: linux clustering > Subject: Re: [Linux-cluster] custom fence agent > > > I took the fence_apc and hacked it to do what I needed. The fence > agents are perl scripts and can easily be modified to fit most any > SAN. > > Matt > > On Wed, 2005-11-30 at 13:58, Jeff Dinisco wrote: > > > Could someone outline the rules for creating your own fencing agent > > and how they're applied in cluster.conf? Or just point me to a > > doc? Thanks > > > > - Jeff > > > > > > > > ____________________________________________________________________ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From canfield at uindy.edu Tue Dec 6 21:07:58 2005 From: canfield at uindy.edu (D Canfield) Date: Tue, 06 Dec 2005 16:07:58 -0500 Subject: [Linux-cluster] CLVM & Partition Mounting Message-ID: <4395FDAE.6020900@uindy.edu> I'm trying to build my first GFS cluster (2-node on a SAN) on RHEL4, and I can get things up and running manually, but I'm having some trouble getting the process to automate smoothly. 
The first issue is that after I install the lvm2-cluster RPM, I can no longer boot the machine cleanly because my /var/log partition is on a separate LVM VolumeGroup (It's still a standard ext3 partition, I just keep all my logs on a RAID10 array in a different area of the SAN for performance) and the presence of clvm library seems to prevent vgchange from running at boot time since clvmd isn't yet running. This part I'm assuming I'm just missing something obvious, but I have no idea what. The second issue is that GFS doesn't seem to allow an automatic way to actually mount the GFS partitions once clvmd is started. This is a bit of an issue since the partition I am going to want to mount in most cases is /home, and even if I put a mount line in /etc/rc.local, that means services like imap (this cluster) or samba (on the next one) will be up and trying to serve items out of the home directories before the directories exist. Sorry if I'm being brain dead on this, the fact that I couldn't any reference to it anywhere else suggests I probably am. Can anyone offer any hints? Thanks DC From pcaulfie at redhat.com Wed Dec 7 08:40:24 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 07 Dec 2005 08:40:24 +0000 Subject: [Linux-cluster] CLVM & Partition Mounting In-Reply-To: <4395FDAE.6020900@uindy.edu> References: <4395FDAE.6020900@uindy.edu> Message-ID: <43969FF8.2000507@redhat.com> D Canfield wrote: > I'm trying to build my first GFS cluster (2-node on a SAN) on RHEL4, and > I can get things up and running manually, but I'm having some trouble > getting the process to automate smoothly. > > The first issue is that after I install the lvm2-cluster RPM, I can no > longer boot the machine cleanly because my /var/log partition is on a > separate LVM VolumeGroup (It's still a standard ext3 partition, I just > keep all my logs on a RAID10 array in a different area of the SAN for > performance) and the presence of clvm library seems to prevent vgchange > from running at boot time since clvmd isn't yet running. This part I'm > assuming I'm just missing something obvious, but I have no idea what. You need to mark cluster VGs as clustered (vgchange -cy) and non-clustered VGs as non-clustered (vgchange -cn). You can't have non-clustered LVs in a clustered VG (though it doesn't look like you're doing that). The activation for local VGs should then have the --ignorelockingfailure flag passed to the LVM commands, which should also only be activating the local VG) so it will carry on even if the cluster locking attempt fails. > The second issue is that GFS doesn't seem to allow an automatic way to > actually mount the GFS partitions once clvmd is started. This is a bit > of an issue since the partition I am going to want to mount in most > cases is /home, and even if I put a mount line in /etc/rc.local, that > means services like imap (this cluster) or samba (on the next one) will > be up and trying to serve items out of the home directories before the > directories exist. > > Sorry if I'm being brain dead on this, the fact that I couldn't any > reference to it anywhere else suggests I probably am. Can anyone offer > any hints? 
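Coming back to the first question, the clustered/non-clustered split described above comes down to something like this sketch; the volume group names are placeholders, not taken from this setup:

    # the VG carrying the shared GFS logical volumes needs clvmd
    vgchange -cy <shared_vg>
    # a VG used by only one node (e.g. the one holding /var/log)
    # should not be marked clustered
    vgchange -cn <local_vg>
    # boot-time activation of the local VG can then carry on even
    # though clvmd is not running yet
    vgchange -a y --ignorelockingfailure <local_vg>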
> -- patrick From dillo+cluster at seas.upenn.edu Tue Dec 6 03:42:38 2005 From: dillo+cluster at seas.upenn.edu (Bryan Cardillo) Date: Mon, 5 Dec 2005 22:42:38 -0500 Subject: [Linux-cluster] two fencing problems In-Reply-To: <4394DDF0.4080603@leopard.us.udel.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> Message-ID: <20051206034238.GA3226@rover.pcbi.upenn.edu> On Mon, Dec 05, 2005 at 07:40:16PM -0500, Greg Forte wrote: > two (probably related) questions concerning fencing and APC AP7900 units: > > 1) fence_apc doesn't appear to be compatible with these units - when I run: > > sudo /sbin/fence_apc -a -l -p -n1 -T -v > > it comes back with: > > failed: unrecognised menu response > > The output file shows that it's getting as far as the "Outlet > Control/Configuration" menu, but never selects the specified port. > > This is on RHEL ES4 update 2 with fence-1.32.6-0 installed. > > Does anyone have this working with AP7900s, and if so did you have to > hack the fence_apc script or is there just something I'm missing? I'm in the process of testing the attached patch, basically just had to remove a portion of the match for the `Control Outlet' option. > 2) in the cluster configuration tool (GUI), there's no place to > specify the port to cycle for an "APC Power Device". I tried adding > "port=#" to the tags in the cluster.conf file, but the > cluster configuration tool didn't like that. And of course, I was > unable to test if this actually works anyway because of problem #1 :-( > > Anyway, assuming I get fence_apc to work, how do I specify ports in the > cluster configuration tool? or is this not supported? In which case > can I add the port option in the cluster.conf like I'm trying to do and > have it work? I have system-config-cluster-1.0.16-1.0 installed. here is the clusternode elem I'm using, with the port specified, and seems to work so far. as far as I know, this must be specified in the cluster.conf manually. hope this helps. Cheers, Bryan Cardillo Penn Bioinformatics Core University of Pennsylvania -------------- next part -------------- --- /sbin/fence_apc 2005-10-27 16:12:19.000000000 -0400 +++ fence_apc 2005-12-05 22:33:04.000000000 -0500 @@ -247,7 +247,7 @@ /--\s*Outlet Control.*(\d+)\s*-\s+Outlet\s+$opt_n\D[^\n]*\s(?-i:ON|OFF)\*?\s/ism || # Administrator Outlet Control menu - /--\s*Outlet $opt_n\D.*(\d+)\s*-\s*control outlet\s+$opt_n\D/ism + /--\s*Outlet $opt_n\D.*(\d+)\s*-\s*control outlet/ism ) { $t->print($1); next; From canfield at uindy.edu Wed Dec 7 15:01:05 2005 From: canfield at uindy.edu (D Canfield) Date: Wed, 07 Dec 2005 10:01:05 -0500 Subject: [Linux-cluster] CLVM & Partition Mounting In-Reply-To: <43969FF8.2000507@redhat.com> References: <4395FDAE.6020900@uindy.edu> <43969FF8.2000507@redhat.com> Message-ID: <4396F931.3000802@uindy.edu> Patrick Caulfield wrote: >D Canfield wrote: > > >>I'm trying to build my first GFS cluster (2-node on a SAN) on RHEL4, and >>I can get things up and running manually, but I'm having some trouble >>getting the process to automate smoothly. >> >>The first issue is that after I install the lvm2-cluster RPM, I can no >>longer boot the machine cleanly because my /var/log partition is on a >>separate LVM VolumeGroup (It's still a standard ext3 partition, I just >>keep all my logs on a RAID10 array in a different area of the SAN for >>performance) and the presence of clvm library seems to prevent vgchange >>from running at boot time since clvmd isn't yet running. 
This part I'm >>assuming I'm just missing something obvious, but I have no idea what. >> >> > >You need to mark cluster VGs as clustered (vgchange -cy) and non-clustered VGs >as non-clustered (vgchange -cn). You can't have non-clustered LVs in a >clustered VG (though it doesn't look like you're doing that). > >The activation for local VGs should then have the --ignorelockingfailure flag >passed to the LVM commands, which should also only be activating the local VG) >so it will carry on even if the cluster locking attempt fails. > > > I see that the ignorelockingfailure flag was already in the initscripts of RHEL4, and a bit more testing got me some different information. If I have lvm2-cluster installed, the process will error out to the maintenance shell when it tries to fsck my /var/log partition. If I look in /dev/mapper VolGroup01 has not been activated (though if I look higher up in the boot log, vgscan did see it). But from the maintenance shell, I can go ahead and run vgchange -a y --ignorelockingfailure (just like the rc.sysinit does 2-3 times by the time it gets to the fsck), and the VolGroup01 is activated just fine. If I remove the lvm2-cluster RPM, the machine boots up fine. Also, if I leave the lvm2-cluster RPM installed but change the mount options from "defaults 0 2" ro "defaults 0 0", it will skip the fsck, and by the time the machine is booted, the /var/log partition has indeed been mounted (I think it gets mounted after clvmd starts). I've checked that -c n is set on this local volumegroup, but that doesn't seem to make a difference. I've listed a few outputs below. Any other thoughts? Thanks much. # vgdisplay --- Volume group --- VG Name VolGroupMailGFS System ID Format lvm2 Metadata Areas 1 Metadata Sequence No 3 VG Access read/write VG Status resizable Clustered yes Shared no MAX LV 0 Cur LV 1 Open LV 0 Max PV 0 Cur PV 1 Act PV 1 VG Size 341.62 GB PE Size 16.00 MB Total PE 21864 Alloc PE / Size 21864 / 341.62 GB Free PE / Size 0 / 0 VG UUID ehOhtR-cYE8-xjls-Qle0-eT71-DmZO-p5ur6v --- Volume group --- VG Name VolGroup01 System ID Format lvm2 Metadata Areas 1 Metadata Sequence No 2 VG Access read/write VG Status resizable MAX LV 0 Cur LV 1 Open LV 1 Max PV 0 Cur PV 1 Act PV 1 VG Size 4.98 GB PE Size 16.00 MB Total PE 319 Alloc PE / Size 318 / 4.97 GB Free PE / Size 1 / 16.00 MB VG UUID 3Xuzas-tiX2-DgPG-71JH-dB2O-U1qH-SCdgGD --- Volume group --- VG Name VolGroup00 System ID Format lvm2 Metadata Areas 1 Metadata Sequence No 3 VG Access read/write VG Status resizable MAX LV 0 Cur LV 2 Open LV 2 Max PV 0 Cur PV 1 Act PV 1 VG Size 7.89 GB PE Size 16.00 MB Total PE 505 Alloc PE / Size 504 / 7.88 GB Free PE / Size 1 / 16.00 MB VG UUID cYiUzS-QlnZ-PF50-0kAO-kYL0-V3Yw-dXwBIe # pvdisplay --- Physical volume --- PV Name /dev/sdc VG Name VolGroupMailGFS PV Size 341.62 GB / not usable 0 Allocatable yes (but full) PE Size (KByte) 16384 Total PE 21864 Free PE 0 Allocated PE 21864 PV UUID NYWZVb-yKBl-o7dR-Xq9s-0z3A-VFS0-wxzwc1 --- Physical volume --- PV Name /dev/sdb1 VG Name VolGroup01 PV Size 4.98 GB / not usable 0 Allocatable yes PE Size (KByte) 16384 Total PE 319 Free PE 1 Allocated PE 318 PV UUID EFIqWw-SvP6-OWGV-u350-mwyx-5lJQ-29ksqz --- Physical volume --- PV Name /dev/sda2 VG Name VolGroup00 PV Size 7.89 GB / not usable 0 Allocatable yes PE Size (KByte) 16384 Total PE 505 Free PE 1 Allocated PE 504 PV UUID qR2QxR-KuPF-Wsvc-w0yv-d7rK-3NlY-wLRREb # lvdisplay --- Logical volume --- LV Name /dev/VolGroupMailGFS/LogVolHome VG Name VolGroupMailGFS LV UUID 
7bE2Zt-27A2-OHga-qFDI-QnNc-m21r-LUaXEm LV Write Access read/write LV Status NOT available LV Size 341.62 GB Current LE 21864 Segments 1 Allocation inherit Read ahead sectors 0 --- Logical volume --- LV Name /dev/VolGroup01/LogVolLogs VG Name VolGroup01 LV UUID 01rj7U-809c-jHmg-n6y7-md6Z-yYlF-NYMxCi LV Write Access read/write LV Status available # open 1 LV Size 4.97 GB Current LE 318 Segments 1 Allocation inherit Read ahead sectors 0 Block device 253:2 --- Logical volume --- LV Name /dev/VolGroup00/LogVolRoot VG Name VolGroup00 LV UUID YFunW2-SKSz-T6pZ-7Agf-AFvO-W411-bfX3Q1 LV Write Access read/write LV Status available # open 1 LV Size 6.88 GB Current LE 440 Segments 1 Allocation inherit Read ahead sectors 0 Block device 253:0 --- Logical volume --- LV Name /dev/VolGroup00/LogVolSwap VG Name VolGroup00 LV UUID uvuww5-PzDY-79pc-hxtk-33Rl-L2tI-Kp9IDb LV Write Access read/write LV Status available # open 1 LV Size 1.00 GB Current LE 64 Segments 1 Allocation inherit Read ahead sectors 0 Block device 253:1 From gforte at leopard.us.udel.edu Wed Dec 7 15:08:20 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Wed, 07 Dec 2005 10:08:20 -0500 Subject: [Linux-cluster] two fencing problems In-Reply-To: <4395BBBA.2090404@leopard.us.udel.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> Message-ID: <4396FAE4.80006@leopard.us.udel.edu> Greg Forte wrote: > And it still doesn't appear to work ... I can turn the outlets on and > off from the command line, but if I down the interface on a node, the > other node reports that it's removing the "failed" node from the > cluster, and that it's fencing the "failed" node, but the "failed" node > never gets shut down. Does this get logged somewhere besides > /var/log/messages, or is there a way to force it to be more verbose? If > I could see what command fenced is actually invoking that might help ... Well, in case anyone is interested, I got fed up with having no decent logging from any of these components, so I finally used tcpdump to monitor the telnet connection between the non-failed node and the PDUs as it tried to fence them ... and it turns out that fence_apc was trying to turn each port ON twice, instead of OFF and then ON like it's supposed to according to my configuration. The fault apparently lies somewhere in ccsd or fenced, because the fence_apc script definitely responds properly to the on|off|reboot options, both on the command line and in the stdin like fenced uses. I changed my cluster.conf so that it uses 'reboot' instead of 'off' and 'on' (e.g. the old conf looked like this: and the new one looks like this: and increased the reboot wait time on the PDUs to make sure it'd wait long enough, and that SEEMS to work (once I remembered to turn off ccsd before updating my cluster.conf by hand so that it didn't end up replacing it with the old one immediately ;-) Of course, I can't bring up any of the per-node fencing configuration items in system-config-cluster anymore, but I think I mentioned that previously - when I set them up through the gui it put "switch=" options in each tag, and then when I shut down and restarted the gui it complained that the file was formatted improperly. I removed those options by hand, and then the gui worked again, but ever since the fencing info hasn't been available ... Any developers care to comment on any of this? I'm finding it really tough to believe that this is a supported RedHat "product". 
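For anyone trying to reproduce the observation above, the agent can be driven by hand over stdin the way fenced drives it, while watching the telnet session from a second shell. This is only a sketch with placeholder values, and it assumes the agent accepts the same attribute names that appear in cluster.conf:

    # hand the agent the name=value pairs fenced would pass it
    fence_apc <<EOF
    ipaddr=<pdu-address>
    login=<user>
    passwd=<password>
    port=1
    option=off
    EOF

    # in a second shell, watch what actually goes over the wire
    tcpdump -A -i eth0 host <pdu-address> and tcp port 23

Watching the wire during a real fence event is how the double "on" showed up here.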
-g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From eric at bootseg.com Wed Dec 7 16:16:29 2005 From: eric at bootseg.com (Eric Kerin) Date: Wed, 07 Dec 2005 11:16:29 -0500 Subject: [Linux-cluster] two fencing problems In-Reply-To: <4396FAE4.80006@leopard.us.udel.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> Message-ID: <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> Greg, I'm using the fence_apc agent on my cluster with APC 7900s, and fencing is working perfectly for me, and has for more than 6 months now. Only thing I had to do was modify the fence_apc agent to allow for the renamed ports I setup (got rid of the outlet X names, and put in descriptive server names) and add in the port groups feature I'm using. One of these days I'll get a few spare minutes to whip up a correct patch to the agent that can be submitted to the tree. One that will work in both the "Outlet X" naming method, and the descriptive port method. My device entry inside of fence for a node looks like this: You can test that the cluster is configured correctly to fence a node by running "fence_node " This will use the cluster's config file to fence the node, ensuring that all config settings are correct. > once I remembered to turn off ccsd > before updating my cluster.conf by hand so that it didn't end up > replacing it with the old one immediately ;-) > When updating the cluster.conf file by hand, you are updating the config_version attribute of the cluster node, right? I do updates to my cluster.conf file by hand pretty much exclusively, while the cluster is running, and with no problems whatsoever. Changes propagate as expected after running "ccs_tool update "and "cman_tool version -r " Thanks, Eric Kerin eric at bootseg.com From gforte at leopard.us.udel.edu Wed Dec 7 19:32:37 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Wed, 07 Dec 2005 14:32:37 -0500 Subject: [Linux-cluster] two fencing problems In-Reply-To: <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> Message-ID: <439738D5.5000808@leopard.us.udel.edu> Eric Kerin wrote: > Greg, > > I'm using the fence_apc agent on my cluster with APC 7900s, and fencing > is working perfectly for me, and has for more than 6 months now. Thanks, Eric, but the fence_apc script is definitely not the issue - I had to make a couple of minor changes to fence_apc's regexps, and it now works both with command-line options and passing arguments through stdin. This doesn't explain why the cluster conf doesn't work when it has "off" and then "on" as set up by system-config-cluster (and it did that itself, all I did was configure the ip address and login for the fence devices, and tell it which ports to use), but it does work when I make the change to 'reboot' as described in my previous message (this is the default option, anyway, which I assume is why yours works with no "option=" option). > You can test that the cluster is configured correctly to fence a node by > running "fence_node " This will use the cluster's config file > to fence the node, ensuring that all config settings are correct. 
Actually, that doesn't seem to work for me - no matter what nodename I specify, and regardless of whether I run it on the node I'm trying to fence or the other node (it's a two-node cluster), it comes back with "Fence of 'hostname' was unsuccessful." I suspect this is because it's a two-node cluster so fenced doesn't want to let me kick out a node that's still active ... or maybe it's a just host name problem. Regardless, it _does_ work correctly if I simulate a real failure, after I made the aforementioned cluster.conf change, so I'm confident that I've got it configured correctly. My gripe is that (a) the gui tool can't seem to generate even the most simple conf correctly, and (b) there's apparently a bug in fenced where it passes an "option=on" to the fence_apc agent, when it clearly should be "option = off". Or else ccsd is misparsing the cluster.conf file. I don't see how else to explain that the conf file said "off", then "on", but the daemon did "on", "on". > When updating the cluster.conf file by hand, you are updating the > config_version attribute of the cluster node, right? I do updates to my > cluster.conf file by hand pretty much exclusively, while the cluster is > running, and with no problems whatsoever. Changes propagate as expected > after running "ccs_tool update "and "cman_tool > version -r " Hmmm ... nope, but I will do so in the future. ;-) Thanks. -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From eric at bootseg.com Wed Dec 7 19:48:26 2005 From: eric at bootseg.com (Eric Kerin) Date: Wed, 07 Dec 2005 14:48:26 -0500 Subject: [Linux-cluster] two fencing problems In-Reply-To: <439738D5.5000808@leopard.us.udel.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> <439738D5.5000808@leopard.us.udel.edu> Message-ID: <1133984906.5344.14.camel@auh5-0479.corp.jabil.org> On Wed, 2005-12-07 at 14:32 -0500, Greg Forte wrote: > I suspect this is because it's > a two-node cluster so fenced doesn't want to let me kick out a node > that's still active ... or maybe it's a just host name problem. > Regardless, it _does_ work correctly if I simulate a real failure, after > I made the aforementioned cluster.conf change, so I'm confident that > I've got it configured correctly. It's most likely a host name problem, because I run a two node cluster, and I used fence_node while testing everything. If you post the relevant sections of your cluster.conf file (the clusternodes, and fencedevices sections are the important ones) We might be able to help you figure out why it's not working right though. But mainly, check that the names you use in the clusternode name attributes are resolvable on both nodes, and they resolve to the same IP address on both nodes. Thanks, Eric Kerin eric at bootseg.com > My gripe is that (a) the gui tool > can't seem to generate even the most simple conf correctly, and (b) > there's apparently a bug in fenced where it passes an "option=on" to the > fence_apc agent, when it clearly should be "option = off". Or else ccsd > is misparsing the cluster.conf file. I don't see how else to explain > that the conf file said "off", then "on", but the daemon did "on", "on". Hmm, I'll see if I can replicate this on my testing cluster. Although I don't think it's designed to work the way you're expecting it to from your config. 
Of course, I haven't played with multiple fence devices in my configs before, so I could be mistaken. Eric From teigland at redhat.com Wed Dec 7 19:54:26 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 7 Dec 2005 13:54:26 -0600 Subject: [Linux-cluster] two fencing problems In-Reply-To: <439738D5.5000808@leopard.us.udel.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> <439738D5.5000808@leopard.us.udel.edu> Message-ID: <20051207195426.GA29230@redhat.com> On Wed, Dec 07, 2005 at 02:32:37PM -0500, Greg Forte wrote: > there's apparently a bug in fenced where it passes an "option=on" to the > fence_apc agent, when it clearly should be "option = off". Or else ccsd > is misparsing the cluster.conf file. I don't see how else to explain > that the conf file said "off", then "on", but the daemon did "on", "on". This may be the bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=172401 Dave From gforte at leopard.us.udel.edu Wed Dec 7 20:15:25 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Wed, 07 Dec 2005 15:15:25 -0500 Subject: [Linux-cluster] two fencing problems In-Reply-To: <20051207195426.GA29230@redhat.com> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> <439738D5.5000808@leopard.us.udel.edu> <20051207195426.GA29230@redhat.com> Message-ID: <439742DD.4020403@leopard.us.udel.edu> David Teigland wrote: > On Wed, Dec 07, 2005 at 02:32:37PM -0500, Greg Forte wrote: > >>there's apparently a bug in fenced where it passes an "option=on" to the >>fence_apc agent, when it clearly should be "option = off". Or else ccsd >>is misparsing the cluster.conf file. I don't see how else to explain >>that the conf file said "off", then "on", but the daemon did "on", "on". > > > This may be the bug > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=172401 That certainly appears to be it! Thanks. Now I don't suppose there's one for system-config-cluster not being able to read the configuration file it just wrote after adding a fence method to a node ... I'm not finding one, but apparently my luck/skill with bugzilla is pretty poor. ;-) -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From gforte at leopard.us.udel.edu Wed Dec 7 20:34:22 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Wed, 07 Dec 2005 15:34:22 -0500 Subject: [Linux-cluster] two fencing problems In-Reply-To: <1133984906.5344.14.camel@auh5-0479.corp.jabil.org> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> <439738D5.5000808@leopard.us.udel.edu> <1133984906.5344.14.camel@auh5-0479.corp.jabil.org> Message-ID: <4397474E.7000705@leopard.us.udel.edu> Eric Kerin wrote: > But mainly, check that the names you use in the clusternode name > attributes are resolvable on both nodes, and they resolve to the same IP > address on both nodes. They do resolve, and to the same IP address. 
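For reference, the clusternodes and fencedevices sections being asked for above generally take the following shape when an APC power switch does the fencing. This is only an illustrative sketch - every node name, device name, address, login and outlet number below is a placeholder, not a value from either of the clusters in this thread:

   <clusternodes>
     <clusternode name="node1.example.com" votes="1">
       <fence>
         <method name="1">
           <!-- "apc" must match a fencedevice name below; port is the outlet feeding this node -->
           <!-- no option attribute: the agent's default action (reboot) is used, as discussed above -->
           <device name="apc" port="1"/>
         </method>
       </fence>
     </clusternode>
     <clusternode name="node2.example.com" votes="1">
       <fence>
         <method name="1">
           <device name="apc" port="2"/>
         </method>
       </fence>
     </clusternode>
   </clusternodes>
   <fencedevices>
     <!-- placeholder management address and credentials for the power switch -->
     <fencedevice name="apc" agent="fence_apc" ipaddr="10.0.0.50" login="apc" passwd="apc"/>
   </fencedevices>

The clusternode name attributes here are the same names that need to resolve to the same address on every node, as noted above.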
Interestingly, if I stop fenced on the "good" node and run it manually as 'fenced -D' to monitor the debugging output, and then run 'fence_node hostname', no activity shows up - but if I do my simulated failure on the "bad" node (drop the interface), then it starts spewing debugging output (though the fencing fails for some other unknown reason ... but killing that and restarting fenced properly fixes it). I kind of give up at this point - fencing now works, and I can always force a node by dropping its interface (or yanking the network cable) - it's dirty, but it works. -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From gforte at leopard.us.udel.edu Wed Dec 7 20:42:17 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Wed, 07 Dec 2005 15:42:17 -0500 Subject: [Linux-cluster] failover domain ip address hidden? In-Reply-To: <4397474E.7000705@leopard.us.udel.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> <439738D5.5000808@leopard.us.udel.edu> <1133984906.5344.14.camel@auh5-0479.corp.jabil.org> <4397474E.7000705@leopard.us.udel.edu> Message-ID: <43974929.8090106@leopard.us.udel.edu> Can anyone explain to me how failover ip addresses are bound to interfaces in the kernel, or why they don't seem to show up in 'ifconfig' output? I've got one configured and it worked like a charm first try (unlike my fencing setup, heh), I'm just confused as to why it doesn't appear in ifconfig. -g Greg Forte gforte at udel.edu IT - User Services University of Delaware 302-831-1982 Newark, DE From fajar at telkom.co.id Thu Dec 8 07:03:04 2005 From: fajar at telkom.co.id (Fajar A. Nugraha) Date: Thu, 08 Dec 2005 14:03:04 +0700 Subject: [Linux-cluster] failover domain ip address hidden? In-Reply-To: <43974929.8090106@leopard.us.udel.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> <439738D5.5000808@leopard.us.udel.edu> <1133984906.5344.14.camel@auh5-0479.corp.jabil.org> <4397474E.7000705@leopard.us.udel.edu> <43974929.8090106@leopard.us.udel.edu> Message-ID: <4397DAA8.1010103@telkom.co.id> Greg Forte wrote: > Can anyone explain to me how failover ip addresses are bound to > interfaces in the kernel, or why they don't seem to show up in > 'ifconfig' output? I've got one configured and it worked like a charm > first try (unlike my fencing setup, heh), I'm just confused as to why > it doesn't appear in ifconfig. > try running "ip addr list" -- Fajar From lhh at redhat.com Thu Dec 8 15:18:06 2005 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 08 Dec 2005 10:18:06 -0500 Subject: [Linux-cluster] Fencing problems In-Reply-To: References: Message-ID: <1134055086.28864.8.camel@ayanami.boston.redhat.com> On Wed, 2005-11-30 at 15:22 +0200, Jari J. Taskinen wrote: > Hi there! > > > I'm running RHEL v3 and GFS-6.0.2.20-2 with it and having problems with manual > fencing. I'm planning to use other fencing methods, but will they work if > manual doesn't? By trying to fence a node (fence_node test3) I only get this > in /etc/log/messages: As long as you do not intend to run manual fencing in production, see below. Otherwise, disregard this email... 
> nodes.ccs > > nodes { > test1 { > ip_interfaces { > eth0 = "10.0.0.1" > } > fence { > human { > t1 { > ipaddr = "10.0.0.1" Should be 'nodename="test1"', not ipaddr=xxx I think. -- Lon From gforte at leopard.us.udel.edu Thu Dec 8 15:22:02 2005 From: gforte at leopard.us.udel.edu (Greg Forte) Date: Thu, 08 Dec 2005 10:22:02 -0500 Subject: [Linux-cluster] failover domain ip address hidden? In-Reply-To: <4397DAA8.1010103@telkom.co.id> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> <439738D5.5000808@leopard.us.udel.edu> <1133984906.5344.14.camel@auh5-0479.corp.jabil.org> <4397474E.7000705@leopard.us.udel.edu> <43974929.8090106@leopard.us.udel.edu> <4397DAA8.1010103@telkom.co.id> Message-ID: <43984F9A.10605@leopard.us.udel.edu> Interesting, thanks. I didn't know you _could_ set multiple addresses for an interface without using a separate label for each. I don't suppose there's a way to configure cman so that it _does_ use labels? It would seem a tad more convenient to have this show up in ifconfig, I can never remember the ip syntax. -g Fajar A. Nugraha wrote: > Greg Forte wrote: > >> Can anyone explain to me how failover ip addresses are bound to >> interfaces in the kernel, or why they don't seem to show up in >> 'ifconfig' output? I've got one configured and it worked like a charm >> first try (unlike my fencing setup, heh), I'm just confused as to why >> it doesn't appear in ifconfig. >> > try running "ip addr list" > From pcaulfie at redhat.com Thu Dec 8 15:36:39 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Thu, 08 Dec 2005 15:36:39 +0000 Subject: [Linux-cluster] failover domain ip address hidden? In-Reply-To: <43984F9A.10605@leopard.us.udel.edu> References: <4394DDF0.4080603@leopard.us.udel.edu> <20051206034238.GA3226@rover.pcbi.upenn.edu> <4395BBBA.2090404@leopard.us.udel.edu> <4396FAE4.80006@leopard.us.udel.edu> <1133972189.3454.25.camel@auh5-0479.corp.jabil.org> <439738D5.5000808@leopard.us.udel.edu> <1133984906.5344.14.camel@auh5-0479.corp.jabil.org> <4397474E.7000705@leopard.us.udel.edu> <43974929.8090106@leopard.us.udel.edu> <4397DAA8.1010103@telkom.co.id> <43984F9A.10605@leopard.us.udel.edu> Message-ID: <43985307.4060803@redhat.com> Greg Forte wrote: > I don't > suppose there's a way to configure cman so that it _does_ use labels? No, it uses node names/addresses and it's not likely to change in a hurry either, Sorry. -- patrick From teigland at redhat.com Thu Dec 8 17:33:15 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 8 Dec 2005 11:33:15 -0600 Subject: [Linux-cluster] manual fencing not working in RHEL4 branch In-Reply-To: <1c0e77670512051804l5cc38edfy1d2b87e8fd71cf2e@mail.gmail.com> References: <1c0e77670511281307i75bc26a4pc5bbcd3d152a8c8e@mail.gmail.com> <20051128212731.GK27662@redhat.com> <1c0e77670511291853i2603f61ayf2eae51903032ebd@mail.gmail.com> <20051130164839.GB23663@redhat.com> <1c0e77670512051804l5cc38edfy1d2b87e8fd71cf2e@mail.gmail.com> Message-ID: <20051208173315.GD10340@redhat.com> On Mon, Dec 05, 2005 at 07:04:38PM -0700, busy admin wrote: > What is the problem: > When running manual fencing and doing failover testing, my secondary > node takes over the service without waiting for a fence_ack_manual. > This all works perfectly with automatic fencing (ipmi, drac). We're going to try this here. Just to be clear, we expect: 1. 
A and B in cluster, in fence domain and running rgmanager 2. kill B 3. A should start fence_manual and print message in /var/log/messages 4. admin should run fence_ack_manual on A 5. services from B should fail over to A The problem is you're seeing 5 happen before 4. What version of the code are you using (cluster-1.01.00 ?) Thanks Dave From busyadmin at gmail.com Thu Dec 8 18:19:38 2005 From: busyadmin at gmail.com (busy admin) Date: Thu, 8 Dec 2005 11:19:38 -0700 Subject: [Linux-cluster] manual fencing not working in RHEL4 branch In-Reply-To: <20051208173315.GD10340@redhat.com> References: <1c0e77670511281307i75bc26a4pc5bbcd3d152a8c8e@mail.gmail.com> <20051128212731.GK27662@redhat.com> <1c0e77670511291853i2603f61ayf2eae51903032ebd@mail.gmail.com> <20051130164839.GB23663@redhat.com> <1c0e77670512051804l5cc38edfy1d2b87e8fd71cf2e@mail.gmail.com> <20051208173315.GD10340@redhat.com> Message-ID: <1c0e77670512081019g16971027l9ae856866ccbcc05@mail.gmail.com> David, We are using cluster-1.00 code. I didn't see modifications under 1.01 that would have an impact, but maybe I missed something. You are right, I see step 5 happen before step 4 (but remember, sometimes it works fine specially after I run with 'fenced -D'). And I have never seen any of these problems when I use IPMI or DRAC. BTW, for simplicity sake, I wasn't even running rgmanager. Just ccsd, cman and fenced. Thanks, Ken On 12/8/05, David Teigland wrote: > On Mon, Dec 05, 2005 at 07:04:38PM -0700, busy admin wrote: > > What is the problem: > > When running manual fencing and doing failover testing, my secondary > > node takes over the service without waiting for a fence_ack_manual. > > This all works perfectly with automatic fencing (ipmi, drac). > > We're going to try this here. Just to be clear, we expect: > > 1. A and B in cluster, in fence domain and running rgmanager > 2. kill B > 3. A should start fence_manual and print message in /var/log/messages > 4. admin should run fence_ack_manual on A > 5. services from B should fail over to A > > The problem is you're seeing 5 happen before 4. What version of the > code are you using (cluster-1.01.00 ?) > > Thanks > Dave > > From teigland at redhat.com Thu Dec 8 19:08:16 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 8 Dec 2005 13:08:16 -0600 Subject: [Linux-cluster] manual fencing not working in RHEL4 branch In-Reply-To: <1c0e77670512081019g16971027l9ae856866ccbcc05@mail.gmail.com> References: <1c0e77670511281307i75bc26a4pc5bbcd3d152a8c8e@mail.gmail.com> <20051128212731.GK27662@redhat.com> <1c0e77670511291853i2603f61ayf2eae51903032ebd@mail.gmail.com> <20051130164839.GB23663@redhat.com> <1c0e77670512051804l5cc38edfy1d2b87e8fd71cf2e@mail.gmail.com> <20051208173315.GD10340@redhat.com> <1c0e77670512081019g16971027l9ae856866ccbcc05@mail.gmail.com> Message-ID: <20051208190816.GE10340@redhat.com> On Thu, Dec 08, 2005 at 11:19:38AM -0700, busy admin wrote: > David, > > We are using cluster-1.00 code. I didn't see modifications under 1.01 > that would have an impact, but maybe I missed something. > > You are right, I see step 5 happen before step 4 (but remember, > sometimes it works fine specially after I run with 'fenced -D'). And > I have never seen any of these problems when I use IPMI or DRAC. > > BTW, for simplicity sake, I wasn't even running rgmanager. Just ccsd, > cman and fenced. Then I'm confused; I thought we defined the problem as step 5 (services starting on A) happening before step 4 (admin running fence_ack_manual). With no step 5, what's the problem? 
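For context, the manual fencing being exercised in steps 3 and 4 is normally wired up in cluster.conf roughly as follows; the node and device names here are illustrative placeholders, not the poster's actual configuration:

   <clusternode name="nodeB" votes="1">
     <fence>
       <method name="1">
         <!-- fence_manual waits for nodeB to rejoin or for "fence_ack_manual -n nodeB" on the surviving node -->
         <device name="human" nodename="nodeB"/>
       </method>
     </fence>
   </clusternode>
   ...
   <fencedevices>
     <fencedevice name="human" agent="fence_manual"/>
   </fencedevices>

With an entry like this, recovery (step 5) is not expected to start until the acknowledgement in step 4 has been given, which is exactly the ordering being tested here.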
> On 12/8/05, David Teigland wrote: > > On Mon, Dec 05, 2005 at 07:04:38PM -0700, busy admin wrote: > > > What is the problem: > > > When running manual fencing and doing failover testing, my secondary > > > node takes over the service without waiting for a fence_ack_manual. > > > This all works perfectly with automatic fencing (ipmi, drac). > > > > We're going to try this here. Just to be clear, we expect: > > > > 1. A and B in cluster, in fence domain and running rgmanager > > 2. kill B > > 3. A should start fence_manual and print message in /var/log/messages > > 4. admin should run fence_ack_manual on A > > 5. services from B should fail over to A > > > > The problem is you're seeing 5 happen before 4. What version of the > > code are you using (cluster-1.01.00 ?) > > > > Thanks > > Dave > > > > From busyadmin at gmail.com Thu Dec 8 19:18:01 2005 From: busyadmin at gmail.com (busy admin) Date: Thu, 8 Dec 2005 12:18:01 -0700 Subject: [Linux-cluster] manual fencing not working in RHEL4 branch In-Reply-To: <20051208190816.GE10340@redhat.com> References: <1c0e77670511281307i75bc26a4pc5bbcd3d152a8c8e@mail.gmail.com> <20051128212731.GK27662@redhat.com> <1c0e77670511291853i2603f61ayf2eae51903032ebd@mail.gmail.com> <20051130164839.GB23663@redhat.com> <1c0e77670512051804l5cc38edfy1d2b87e8fd71cf2e@mail.gmail.com> <20051208173315.GD10340@redhat.com> <1c0e77670512081019g16971027l9ae856866ccbcc05@mail.gmail.com> <20051208190816.GE10340@redhat.com> Message-ID: <1c0e77670512081118o67e2b030mb877bbce4fa9a97c@mail.gmail.com> You are exactly right, step 5 happens before step 4. On 12/8/05, David Teigland wrote: > On Thu, Dec 08, 2005 at 11:19:38AM -0700, busy admin wrote: > > David, > > > > We are using cluster-1.00 code. I didn't see modifications under 1.01 > > that would have an impact, but maybe I missed something. > > > > You are right, I see step 5 happen before step 4 (but remember, > > sometimes it works fine specially after I run with 'fenced -D'). And > > I have never seen any of these problems when I use IPMI or DRAC. > > > > BTW, for simplicity sake, I wasn't even running rgmanager. Just ccsd, > > cman and fenced. > > Then I'm confused; I thought we defined the problem as step 5 (services > starting on A) happening before step 4 (admin running fence_ack_manual). > With no step 5, what's the problem? > > > > On 12/8/05, David Teigland wrote: > > > On Mon, Dec 05, 2005 at 07:04:38PM -0700, busy admin wrote: > > > > What is the problem: > > > > When running manual fencing and doing failover testing, my secondary > > > > node takes over the service without waiting for a fence_ack_manual. > > > > This all works perfectly with automatic fencing (ipmi, drac). > > > > > > We're going to try this here. Just to be clear, we expect: > > > > > > 1. A and B in cluster, in fence domain and running rgmanager > > > 2. kill B > > > 3. A should start fence_manual and print message in /var/log/messages > > > 4. admin should run fence_ack_manual on A > > > 5. services from B should fail over to A > > > > > > The problem is you're seeing 5 happen before 4. What version of the > > > code are you using (cluster-1.01.00 ?) > > > > > > Thanks > > > Dave > > > > > > > From jeff at jettis.com Thu Dec 8 22:01:50 2005 From: jeff at jettis.com (Jeff Dinisco) Date: Thu, 8 Dec 2005 14:01:50 -0800 Subject: [Linux-cluster] corrupted gfs filesystem Message-ID: I'm testing gfs 6.1 (lock dlm) in a 2 node cluster on FC4. I took both nodes out of the cluster manually, then added node01 back in. As expected, it fenced node02. 
Fencing was done by shutting down a network port on a switch so iscsi could not access the storage devices. However, the device files still existed. Just to see how the cluster would react, I started up ccsd, cman, and fenced on node02. It joined the cluster w/ out issue. Even though I knew iscsi was unable to get to the storage devices, I started the gfs init script which attempted to mount the filesystem. Looks like it trashed it. Output from gfs_fsck... # gfs_fsck /dev/iscsi/laxrifa01/lun0 Initializing fsck Buffer #150609096 (1 of 5) is neither GFS_METATYPE_RB nor GFS_METATYPE_RG. Resource group is corrupted. Unable to read in rgrp descriptor. Unable to fill in resource group information. Is this expected behavior or is it possible that I'm missing something in my configuration that allowed this to happen? Thanks. - Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From elmar at pruesse.net Fri Dec 9 12:03:05 2005 From: elmar at pruesse.net (Elmar Pruesse) Date: Fri, 09 Dec 2005 13:03:05 +0100 Subject: [Linux-cluster] New (small) cluster; What filesystem? GFS? In-Reply-To: <439220EB.80901@pruesse.net> References: <439220EB.80901@pruesse.net> Message-ID: <43997279.40809@pruesse.net> Since I got no response from you guys, I guess I'm off topic. I apologize for that. Can you point me anywhere to ask my question? We have no one with experience in this area and I'm having a really hard time finding material to base a decision on. If I had the time, I'd just try them all, but as usual, I don't... regards, Elmar From jeff at jettis.com Fri Dec 9 14:33:57 2005 From: jeff at jettis.com (Jeff Dinisco) Date: Fri, 9 Dec 2005 06:33:57 -0800 Subject: [Linux-cluster] corrupted gfs filesystem Message-ID: Also, does the output from gfs_fsck indicate that the filesystem is beyond repair? If not, what steps could I take to fix it? _____ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Dinisco Sent: Thursday, December 08, 2005 5:02 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] corrupted gfs filesystem I'm testing gfs 6.1 (lock dlm) in a 2 node cluster on FC4. I took both nodes out of the cluster manually, then added node01 back in. As expected, it fenced node02. Fencing was done by shutting down a network port on a switch so iscsi could not access the storage devices. However, the device files still existed. Just to see how the cluster would react, I started up ccsd, cman, and fenced on node02. It joined the cluster w/ out issue. Even though I knew iscsi was unable to get to the storage devices, I started the gfs init script which attempted to mount the filesystem. Looks like it trashed it. Output from gfs_fsck... # gfs_fsck /dev/iscsi/laxrifa01/lun0 Initializing fsck Buffer #150609096 (1 of 5) is neither GFS_METATYPE_RB nor GFS_METATYPE_RG. Resource group is corrupted. Unable to read in rgrp descriptor. Unable to fill in resource group information. Is this expected behavior or is it possible that I'm missing something in my configuration that allowed this to happen? Thanks. - Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.yarwood at juno.co.uk Fri Dec 9 15:44:45 2005 From: ben.yarwood at juno.co.uk (Ben Yarwood) Date: Fri, 9 Dec 2005 15:44:45 -0000 Subject: [Linux-cluster] Question about fencing using wti power switch Message-ID: <03af01c5fcd7$7adbf1d0$3964a8c0@WS076> I am running FC4, with gfs and clustering and am trying to test fencing. 
My cluster.conf file is shown at the bottom. Whenever I disable the network port for one of the cluster boxes I would expect the device to be fenced using the wti power switch; however, I just get the following messages in the log saying the device needs to be fenced manually. Dec 9 15:36:01 jrmedia-a fenced[2066]: fencing node "jrmedia-c" Dec 9 15:36:01 jrmedia-a fence_manual: Node jrmedia-c needs to be reset before recovery can procede. Waiting for jrmedia-c to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n jrmedia-c) I have tested that the device can be fenced by using fence_wti directly and it works correctly, power cycling the plug. Can someone tell me if I have made a mistake in the cluster.conf file, or how I can get more debugging information? Thanks Ben Ben Yarwood Technical Director Juno Records t - 020 7424 2804 m - 07930 922 333 e - ben.yarwood at juno.co.uk From teigland at redhat.com Fri Dec 9 16:07:55 2005 From: teigland at redhat.com (David Teigland) Date: Fri, 9 Dec 2005 10:07:55 -0600 Subject: [Linux-cluster] corrupted gfs filesystem In-Reply-To: References: Message-ID: <20051209160755.GA30517@redhat.com> On Thu, Dec 08, 2005 at 02:01:50PM -0800, Jeff Dinisco wrote: > I'm testing gfs 6.1 (lock dlm) in a 2 node cluster on FC4. I took both > nodes out of the cluster manually, then added node01 back in. As > expected, it fenced node02. Fencing was done by shutting down a network > port on a switch so iscsi could not access the storage devices. > However, the device files still existed. > > Just to see how the cluster would react, I started up ccsd, cman, and > fenced on node02. It joined the cluster w/ out issue. Even though I > knew iscsi was unable to get to the storage devices, I started the gfs > init script which attempted to mount the filesystem. Looks like it > trashed it. But node02 couldn't reach the storage, how could it trash it? If node02 _could_ reach the storage, it would have just mounted the fs normally. > Output from gfs_fsck... When and where did you run fsck? Not while either node had the fs mounted, I trust. Dave > > # gfs_fsck /dev/iscsi/laxrifa01/lun0 > Initializing fsck > Buffer #150609096 (1 of 5) is neither GFS_METATYPE_RB nor > GFS_METATYPE_RG. > Resource group is corrupted. > Unable to read in rgrp descriptor. > Unable to fill in resource group information. > > Is this expected behavior or is it possible that I'm missing something > in my configuration that allowed this to happen? Thanks. From jeff at jettis.com Fri Dec 9 16:20:52 2005 From: jeff at jettis.com (Jeff Dinisco) Date: Fri, 9 Dec 2005 08:20:52 -0800 Subject: [Linux-cluster] corrupted gfs filesystem Message-ID: nope, the fs was unmounted on both nodes. I ran it from node01 after I was unable to mount it and had to reboot the node because the mount command hung the system. The latest output from gfs_fsck... Initializing fsck fs_compute_bitstructs: # of blks in rgrp do not equal # of blks represented in bitmap. bi_start = 134230407 bi_len = 17 GFS_NBBY = 4 ri_data = 8 Unable to fill in resource group information. The only thing that has changed is I tried to mount it a 2nd time and again couldn't kill mount and was forced to reboot.
-----Original Message----- From: David Teigland [mailto:teigland at redhat.com] Sent: Friday, December 09, 2005 11:08 AM To: Jeff Dinisco Cc: linux-cluster at redhat.com Subject: Re: [Linux-cluster] corrupted gfs filesystem On Thu, Dec 08, 2005 at 02:01:50PM -0800, Jeff Dinisco wrote: > I'm testing gfs 6.1 (lock dlm) in a 2 node cluster on FC4. I took both > nodes out of the cluster manually, then added node01 back in. As > expected, it fenced node02. Fencing was done by shutting down a network > port on a switch so iscsi could not access the storage devices. > However, the device files still existed. > > Just to see how the cluster would react, I started up ccsd, cman, and > fenced on node02. It joined the cluster w/ out issue. Even though I > knew iscsi was unable to get to the storage devices, I started the gfs > init script which attempted to mount the filesystem. Looks like it > trashed it. But node02 couldn't reach the storage, how could it trash it? If node02 _could_ reach the storage, it would have just mounted the fs normally. > Output from gfs_fsck... When and where did you run fsck? Not while either node had the fs mounted I trust. Dave > > # gfs_fsck /dev/iscsi/laxrifa01/lun0 > Initializing fsck > Buffer #150609096 (1 of 5) is neither GFS_METATYPE_RB nor > GFS_METATYPE_RG. > Resource group is corrupted. > Unable to read in rgrp descriptor. > Unable to fill in resource group information. > > Is this expected behavior or is it possible that I'm missing something > in my configuration that allowed this to happen? Thanks. From thomsonr at ucalgary.ca Fri Dec 9 23:38:06 2005 From: thomsonr at ucalgary.ca (Ryan Thomson) Date: Fri, 9 Dec 2005 16:38:06 -0700 (MST) Subject: [Linux-cluster] rgmanager causing hard lock ups Message-ID: <50712.136.159.234.21.1134171486.squirrel@136.159.234.21> Hi List, I have an RHCS cluster with four nodes on RHEL4U2 using the RHN RPMs and GFS CVS (RHEL4) and LVM2 (clvmd) from source tarball (2.2.01.09). I'm seeing some rather disturbing behavior from my cluster. I can get all the nodes to join, fence each other properly, etc. I also have some services setup, mainly GFS mounts and NFS exports. However, now if I bring up the cluster and start rgmanager, the node that tries to start one or more of the services (I can't tell which service but I suspect the NFS export service) will hard lock with the caps lock and scroll lock lights blinking and the rest of the cluster is useless: services don't start and rgmanager won't stop or reload or do anything... on all the nodes. Also, I have all but one of my services set to NOT autostart, yet when I start rgmanager, they begin starting anyways... Here is my cluster.conf file, I suspect the problem is with my NFS export service as that is the only one I've changed since I started seeing this behavior:
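Purely for illustration, a GFS-mount-plus-NFS-export service of the kind described above is usually structured along these lines; every name, path, address and option below is a generic placeholder rather than the actual configuration in question:

   <rm>
     <failoverdomains>
       <failoverdomain name="nfs-domain" ordered="0" restricted="0">
         <failoverdomainnode name="node1" priority="1"/>
         <failoverdomainnode name="node2" priority="1"/>
       </failoverdomain>
     </failoverdomains>
     <resources>
       <!-- the shared GFS filesystem exported by the service -->
       <clusterfs name="gfsdata" fstype="gfs" device="/dev/vg00/gfslv" mountpoint="/export/data" force_unmount="0"/>
       <ip address="192.168.0.100" monitor_link="1"/>
     </resources>
     <service name="nfs-export" domain="nfs-domain" autostart="0">
       <clusterfs ref="gfsdata">
         <nfsexport name="exports">
           <nfsclient name="clients" target="*" options="rw"/>
         </nfsexport>
       </clusterfs>
       <ip ref="192.168.0.100"/>
     </service>
   </rm>

In a layout like this the nfsexport/nfsclient pair sits underneath the filesystem it exports, and autostart is set per service; checking both of those against the intended behavior is a reasonable first pass before chasing the hard lock itself.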