From jpalmae at gmail.com  Sat Mar  1 00:12:46 2008
From: jpalmae at gmail.com (Jorge Palma)
Date: Fri, 29 Feb 2008 21:12:46 -0300
Subject: [Linux-cluster] HA LVM
In-Reply-To: <9CAA1922-44FD-4244-9B45-C6E0B51A1E67@redhat.com>
References: <5b65f1b10802271447s69f736a2gec497abd1ec06317@mail.gmail.com>
	<541A6F37-F850-42D8-B030-9C694453413E@redhat.com>
	<5b65f1b10802281745n748afc21u99c93f9069f1239c@mail.gmail.com>
	<9CAA1922-44FD-4244-9B45-C6E0B51A1E67@redhat.com>
Message-ID: <5b65f1b10802291612r1ae1a433s9c4965313009a5cd@mail.gmail.com>

On Fri, Feb 29, 2008 at 5:56 PM, Jonathan Brassow wrote:
>
> On Feb 28, 2008, at 7:45 PM, Jorge Palma wrote:
> > Of course, in fact we found other problems, such as the lvm agent
> > reading the parameter name and the name of the volume wrongly
> >
> > In any event, the most common clustering services are working properly
>
> What happens if you simply do 'clusvcadm -e <service>'?  Does it
> start-up then?  And if you use clusvcadm to relocate the service to the
> other node?
>
> brassow
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

It reacts the same way in both cases. It smells like a bug to me.

Greetings

--
Jorge Palma Escobar
Ingeniero de Sistemas
Red Hat Linux Certified Engineer
Certificate N° 804005089418233

From oblivion78 at web.de  Sat Mar  1 21:39:09 2008
From: oblivion78 at web.de (oblivion78 at web.de)
Date: Sat, 01 Mar 2008 22:39:09 +0100
Subject: [Linux-cluster] kernel panic and no failover
Message-ID: <1391832836@web.de>

Hi there,

I am running RHEL 5.1 on 4 cluster nodes. One of these nodes had a kernel
panic today, and no services were switched to another node; clustat even
said that the node was still available. The cluster wasn't able to relocate
the services to another node until I rebooted the crashed node. It also
wasn't able to fence this node! This is really strange, does anybody have
some advice?

From balajisundar at midascomm.com  Mon Mar  3 07:18:45 2008
From: balajisundar at midascomm.com (Balaji)
Date: Mon, 03 Mar 2008 12:48:45 +0530
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 46, Issue 48
In-Reply-To: <20080229170012.BFD7C618D51@hormel.redhat.com>
References: <20080229170012.BFD7C618D51@hormel.redhat.com>
Message-ID: <47CBA655.3000404@midascomm.com>

linux-cluster-request at redhat.com wrote:

>Send Linux-cluster mailing list submissions to
>	linux-cluster at redhat.com
>
>To subscribe or unsubscribe via the World Wide Web, visit
>	https://www.redhat.com/mailman/listinfo/linux-cluster
>or, via email, send a message with subject or body 'help' to
>	linux-cluster-request at redhat.com
>
>You can reach the person managing the list at
>	linux-cluster-owner at redhat.com
>
>When replying, please edit your Subject line so it is more specific
>than "Re: Contents of Linux-cluster digest..."
>
>
>Today's Topics:
>
>   1. RE: is clvm a must all the time in a cluster ? (Roger Peña)
>   2. Re: gfs_tool: Cannot allocate memory (Bob Peterson)
>   3. Ethernet Channel Bonding Configuration in RHEL Cluster Suite
>      Setup (Balaji)
>   4. R: [Linux-cluster] Ethernet Channel Bonding Configuration in
>      RHELCluster Suite Setup (Leandro Dardini)
>   5. probe a special port and fence according to the result (Peter)
>   6. Re: probe a special port and fence according to the result
>      (Lon Hohberger)
>   7. 
Re: Ethernet Channel Bonding Configuration in RHEL Cluster > Suite Setup (Lon Hohberger) > 8. Re: probe a special port and fence according to the result > (Brian Kroth) > 9. Re: probe a special port and fence according to the result (Peter) > 10. Re: Redundant (multipath)/Replicating NFS (isplist at logicore.net) > 11. Re: Redundant (multipath)/Replicating NFS (gordan at bobich.net) > 12. Re: Redundant (multipath)/Replicating NFS (isplist at logicore.net) > 13. Re: gfs_tool: Cannot allocate memory (rhurst at bidmc.harvard.edu) > > >---------------------------------------------------------------------- > >Message: 1 >Date: Thu, 28 Feb 2008 22:00:04 -0500 (EST) >From: Roger Pe?a >Subject: RE: [Linux-cluster] is clvm a must all the time in a cluster > ? >To: linux clustering >Message-ID: <233538.63310.qm at web50602.mail.re2.yahoo.com> >Content-Type: text/plain; charset=iso-8859-1 > > >--- Roger Pe?a wrote: > > > >>--- Steffen Plotner wrote: >> >> >> >>>Hi, >>> >>>You asked below: can I run the cluster with GFS >>>without clvmd? The answer is yes. I believe in >>>having the least number of components running, and >>>found that clvmd had start up problems (and then >>>refuses to stop) after doing an update of RHEL4 >>>during December. >>> >>> >>> >>I apply those updates this month so I guess I am >>seeing the same as you >> >> >> >>>It is clearly possible to use GFS directly on a >>> >>> >>SAN >> >> >>>based LUN. >>> >>> >>I know that but, as you already said, the problem is >>"having a uniq name for the filesystem to all the >>nodes" >> >> >> >>>The trick of course is how to deal with >>>the /dev/sdb reference which will probably not be >>>the same on all hosts. To fix that use udev rules >>>that provide a symlink (using the serial number of >>>the LUN, for example) by which the GFS file system >>>can be referred to in /etc/fstab. >>> >>> >>I am sure udev rules works but definitly seting that >>enviroment is more complex that use a LV :-) >>so, what happen is I use the shared LV just as a >>local >>LV? >>each node will treated the same way as it treat the >>LV >>from the local disks. I guess that will not be a >>problem as far as I do not work with the VG >>metadata, >>am I right? >> >> >> >> >> >> >>>We have converted 2 clusters in the past few >>> >>> >>months >> >> >>>because we have had real problems with clvmd >>>misbehaving itself during startup. At this point >>> >>> >>it >> >> >>>is a pleasure to let the cluster boot by itself >>> >>> >>and >> >> >>>not to have to worry about GFS file systems not >>>being mounted (ccsd, cman, fenced, iscsi, gfs). >>> >>> >>not activating LV even when clvmd is running? it >>happen to me several times in the last month >>;-) >>that is why I want to get rid of lvm :-) >> >> > ^^^ >I mean clvm > >sorry, I want to use lvm, but I dont want to use clvm >all the time, just when, very rarely, need to create >or resize shared lv > >cu >roger > >__________________________________________ >RedHat Certified ( RHCE ) >Cisco Certified ( CCNA & CCDA ) > > > Get a sneak peak at messages with a handy reading pane with All new Yahoo! 
Mail: http://ca.promos.yahoo.com/newmail/overview2/ > > > >------------------------------ > >Message: 2 >Date: Thu, 28 Feb 2008 21:03:36 -0600 >From: Bob Peterson >Subject: Re: [Linux-cluster] gfs_tool: Cannot allocate memory >To: linux clustering >Message-ID: <1204254216.3272.15.camel at technetium.msp.redhat.com> >Content-Type: text/plain > >On Thu, 2008-02-28 at 10:28 -0500, rhurst at bidmc.harvard.edu wrote: > > >>ioctl(3, 0x472d, 0x7fbfffe300) = -1 ENOMEM (Cannot allocate >> >> > >Hi Robert, > >The gfs_tool does most of its work using ioctl calls to the gfs kernel >module. Often, it tries to allocate and pass in a huge buffer to make >sure it doesn't ask for more than the kernel needs to respond with. > >In some cases, it doesn't need to allocate such a big buffer. >I fixed "gfs_tool counters" for a similar ENOMEM problem with >bugzilla record 229461 about a year ago. (I don't know if that bug >record is public or locked so you may not be able to access it, which >is out of my control--sorry). > >I should probably go through all the other gfs_tool functions, including >the two you mentioned, and figure out their minimum memory requirements >and change the code so it doesn't ask for so much memory. > >Perhaps you can open up a bugzilla record so I can schedule this work? >Also, you didn't say whether you're on RHEL4/Centos4/similar, or >RHEL5/Centos5/similar. > >Regards, > >Bob Peterson >Red Hat GFS > > > > >------------------------------ > >Message: 3 >Date: Fri, 29 Feb 2008 17:58:00 +0530 >From: Balaji >Subject: [Linux-cluster] Ethernet Channel Bonding Configuration in > RHEL Cluster Suite Setup >To: ClusterGrp >Message-ID: <47C7FA50.8080202 at midascomm.com> >Content-Type: text/plain; charset=ISO-8859-1; format=flowed > >Dear All, > > I am new in RHEL Cluster Suite. > > I have Configure Cluster and Rebooted the system and then cluster >become active in primary node and other node as passive and member >status becomes Online for both the cluster nodes > > In Cluster Suite i am monitoring the resources as scripts files and >ipaddress and During network failure one of the node or both the nodes >are removed from the cluster member and All the resources are stopped >and then rebooted the system only both the system are joining into the >cluster member. > > I have followed the RHEL Cluster Suite Configuration document >"rh-cs-en-4.pdf" and I have found out Ethernet Channel Bonding in Each >Cluster Nodes will avoid the network single point failure in cluster >system. > > I have configured the Ethernet Channel Bonding with mode as >active-backup without fence device. > > Ethernet Channel Bonding Configuration Details are > 1. In " /etc/modprobe.conf" file added the following bonding driver >support > alias bond0 bonding > options bonding miimon=100 mode=1 > 2. Edited the "/etc/sysconfig/network-scripts/ifcfg-eth0" file added >the following configuration > DEVICE=eth0 > USERCTL=no > ONBOOT=yes > MASTER=bond0 > SLAVE=yes > BOOTPROTO=none > 3. Edited the "/etc/sysconfig/network-scripts/ifcfg-eth1" file added >the following configuration > DEVICE=eth1 > USERCTL=no > ONBOOT=yes > MASTER=bond0 > SLAVE=yes > BOOTPROTO=none > 4. Created a network script for the bonding device >"/etc/sysconfig/network-scripts/ifcfg-bond0" > DEVICE=bond0 > USERCTL=no > ONBOOT=yes > IPADDR=192.168.13.109 > NETMASK=255.255.255.0 > GATEWAY=192.168.13.1 > 5. Reboot the system for the changes to take effect. > 6. Configure Ethernet Channel Bonding > 7. 
Rebooted the system and then cluster services are active on both >the nodes and member status of current node is Online and other node as >Offline > > I need clarification about Ethernet Channel Bonding will work with >Fence Device or not. > > I am not sure why this is happening. Can some one throw light on this. > >Regards >-S.Balaji > > > > > > > > > > > > > > > > > > > > > > > > > >------------------------------ > >Message: 4 >Date: Fri, 29 Feb 2008 13:51:05 +0100 >From: "Leandro Dardini" >Subject: R: [Linux-cluster] Ethernet Channel Bonding Configuration in > RHELCluster Suite Setup >To: "linux clustering" >Message-ID: > <6F861500A5092B4C8CD653DE20A4AA0D511B91 at exchange3.comune.prato.local> >Content-Type: text/plain; charset="iso-8859-1" > > > > > >>-----Messaggio originale----- >>Da: linux-cluster-bounces at redhat.com >>[mailto:linux-cluster-bounces at redhat.com] Per conto di Balaji >>Inviato: venerd? 29 febbraio 2008 13.28 >>A: ClusterGrp >>Oggetto: [Linux-cluster] Ethernet Channel Bonding >>Configuration in RHELCluster Suite Setup >> >>Dear All, >> >> I am new in RHEL Cluster Suite. >> >> I have Configure Cluster and Rebooted the system and then >>cluster become active in primary node and other node as >>passive and member status becomes Online for both the cluster nodes >> >> In Cluster Suite i am monitoring the resources as scripts >>files and ipaddress and During network failure one of the >>node or both the nodes are removed from the cluster member >>and All the resources are stopped and then rebooted the >>system only both the system are joining into the cluster member. >> >> I have followed the RHEL Cluster Suite Configuration >>document "rh-cs-en-4.pdf" and I have found out Ethernet >>Channel Bonding in Each Cluster Nodes will avoid the network >>single point failure in cluster system. >> >> I have configured the Ethernet Channel Bonding with mode >>as active-backup without fence device. >> >> Ethernet Channel Bonding Configuration Details are >> 1. In " /etc/modprobe.conf" file added the following >>bonding driver support >> alias bond0 bonding >> options bonding miimon=100 mode=1 >> 2. Edited the "/etc/sysconfig/network-scripts/ifcfg-eth0" >>file added the following configuration >> DEVICE=eth0 >> USERCTL=no >> ONBOOT=yes >> MASTER=bond0 >> SLAVE=yes >> BOOTPROTO=none >> 3. Edited the "/etc/sysconfig/network-scripts/ifcfg-eth1" >>file added the following configuration >> DEVICE=eth1 >> USERCTL=no >> ONBOOT=yes >> MASTER=bond0 >> SLAVE=yes >> BOOTPROTO=none >> 4. Created a network script for the bonding device >>"/etc/sysconfig/network-scripts/ifcfg-bond0" >> DEVICE=bond0 >> USERCTL=no >> ONBOOT=yes >> IPADDR=192.168.13.109 >> NETMASK=255.255.255.0 >> GATEWAY=192.168.13.1 >> 5. Reboot the system for the changes to take effect. >> 6. Configure Ethernet Channel Bonding >> 7. Rebooted the system and then cluster services are >>active on both the nodes and member status of current node is >>Online and other node as Offline >> >> I need clarification about Ethernet Channel Bonding will >>work with Fence Device or not. >> >> I am not sure why this is happening. Can some one throw >>light on this. >> >>Regards >>-S.Balaji >> >> >> > >When there is a network failure, each member cannot reach the other one. Each member trigger the fence script to turn off the other one. Unfortunately the network is off, so the low level network interface take the fence script connection request in mind and wait for the network to come up. 
When the network is again available, each member can reach the fencing device and turn off the other. There is no simple way to avoid this. You can make network near 100% fault proof using bond device or you can use a STONITH fencing device, so the first member who regain network prevents the other to fence, or you can use three members. In a network failure no one have the quorum to fence the others. > >Leandro > > > >------------------------------ > >Message: 5 >Date: Fri, 29 Feb 2008 15:05:23 +0100 >From: Peter >Subject: [Linux-cluster] probe a special port and fence according to > the result >To: linux clustering >Message-ID: <99FC135E-6367-41B5-B6B5-AA1FEC5D9833 at gmx.de> >Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes > >Hi! > >I am planning to test a service for availability and if the service is >not available anymore, fence the node and start the service on another >node. Therefore, i need a clue how to probe a port with system tools. >Sure, with telnet there is a possibility to check. But i have no idea >to avoid the interactivity. So it is too difficult for me to implement >and usually not allowed in this environment. > >On Solaris, there is the "ping -p" command to test the availability of >a application waiting on a port. But for Redhat, i have no clue. > >Could you please help? > > >Peter > > > >------------------------------ > >Message: 6 >Date: Fri, 29 Feb 2008 09:20:23 -0500 >From: Lon Hohberger >Subject: Re: [Linux-cluster] probe a special port and fence according > to the result >To: linux clustering >Message-ID: <20080229142023.GD6571 at redhat.com> >Content-Type: text/plain; charset=us-ascii > >On Fri, Feb 29, 2008 at 03:05:23PM +0100, Peter wrote: > > >>Hi! >> >>I am planning to test a service for availability and if the service is not >>available anymore, fence the node and start the service on another node. >>Therefore, i need a clue how to probe a port with system tools. >>Sure, with telnet there is a possibility to check. But i have no idea to >>avoid the interactivity. So it is too difficult for me to implement and >>usually not allowed in this environment. >> >>On Solaris, there is the "ping -p" command to test the availability of a >>application waiting on a port. But for Redhat, i have no clue. >> >>Could you please help? >> >> > >Net Cat? (man nc ) ? > > > Dear All, Ping is working but the members are not in cluster and its forming a new cluster at both the nodes The following Messages are logged in "/var/log/messages" CMAN: forming a new cluster CMAN: quorum regained, resuming activity Regards -S.Balaji then cluster services are active on both the nodes and member status of current node is Online and other node as Offline -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Mon Mar 3 08:36:31 2008 From: Alain.Moulle at bull.net (Alain Moulle) Date: Mon, 03 Mar 2008 09:36:31 +0100 Subject: [Linux-cluster] CS5 / magma_tools ? Message-ID: <47CBB88F.8020907@bull.net> Hi It seems that there is no more magma rpm in RHEL5, therefore no more magma_tool ... is there an equivalent ? or a new rpm to be installed to recover the magma_tool ? Thanks Regards Alain Moull? From pep at belnet.be Mon Mar 3 10:22:16 2008 From: pep at belnet.be (PeP) Date: Mon, 03 Mar 2008 11:22:16 +0100 Subject: [Linux-cluster] cannot fsck on our 16 TB gfs2 volume... 
Message-ID: <47CBD158.9000803@belnet.be> Hello, We have created a 16 TB GFS2 cluster lvm on our iSCSI SAN using following commands : - mkfs.gfs2 -p lock_dlm -r 2048 -J 16 -t ftpcluster:ftpdata /dev/ftpdata/ftp When running a fsck command on it we obtain following : hydra11 openais # fsck.gfs2 -v /dev/ftpdata/ftp Initializing fsck Initializing lists... Initializing special inodes... Validating Resource Group index. Level 1 RG check. (level 1 passed) 8001 resource groups found. Setting block ranges... This system doesn't have enough memory + swap space to fsck this file system. Additional memory needed is approximately: 6000MB Please increase your swap space by that amount and run gfs_fsck again. Freeing buffers. hydra11 openais # This looks weird for us. Our current config is : - Dual Core Intel Xeon + 4 GB RAM + 200 GB swap - Gentoo Linux 2.6.24-gentoo-r3 (GFS2 kernel module) - redhat-cluster-suite-2.01.00 Can you help us diagnose what's wrong with our setup ? PeP -- Pascal Panneels - BELNET, the Belgian Research and Education Network From gordan at bobich.net Mon Mar 3 11:06:45 2008 From: gordan at bobich.net (gordan at bobich.net) Date: Mon, 3 Mar 2008 11:06:45 +0000 (GMT) Subject: [Linux-cluster] cannot fsck on our 16 TB gfs2 volume... In-Reply-To: <47CBD158.9000803@belnet.be> References: <47CBD158.9000803@belnet.be> Message-ID: On Mon, 3 Mar 2008, PeP wrote: > We have created a 16 TB GFS2 cluster lvm on our iSCSI SAN using following > commands : > - mkfs.gfs2 -p lock_dlm -r 2048 -J 16 -t ftpcluster:ftpdata /dev/ftpdata/ftp > > When running a fsck command on it we obtain following : > > hydra11 openais # fsck.gfs2 -v /dev/ftpdata/ftp > Initializing fsck > Initializing lists... > Initializing special inodes... > Validating Resource Group index. > Level 1 RG check. > (level 1 passed) > 8001 resource groups found. > Setting block ranges... > This system doesn't have enough memory + swap space to fsck this file system. > Additional memory needed is approximately: 6000MB > Please increase your swap space by that amount and run gfs_fsck again. > Freeing buffers. > hydra11 openais # > > This looks weird for us. > > Our current config is : > > - Dual Core Intel Xeon + 4 GB RAM + 200 GB swap > - Gentoo Linux 2.6.24-gentoo-r3 (GFS2 kernel module) > - redhat-cluster-suite-2.01.00 > > Can you help us diagnose what's wrong with our setup ? Other than using Gentoo? ;-) Have you tried with GFS1 rather than GFS2? Gordan From gordan at bobich.net Mon Mar 3 11:23:36 2008 From: gordan at bobich.net (gordan at bobich.net) Date: Mon, 3 Mar 2008 11:23:36 +0000 (GMT) Subject: [Linux-cluster] GFS + DRBD Problems Message-ID: Hi, I'm appear to be a experiencing a strange compound problem with this, that is proving rather difficult to troubleshoot, so I'm hoping someone here can spot a problem I hadn't. I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single node mounts GFS OK and works, but after a while seems to just block for disk. Very much as if it started trying to fence the other node and is waiting for acknowledgement. There are no fence devices defined (so this could be a possibility), but the other node was never powered up in the first place, so it is somewhat beyond me why it might suddenly decide to try to fence it. This usually happens after a period of idleness. If the node is used, this doesn't seem to happen, but leaving it along for half an hour causes it to block for disk I/O. Unfortunately, it doesn't end there. 
When an attempt is made to dual-mount the GFS file system before the secondary is fully up to date (but is connected and syncing), the 2nd node to join notices an inconsistency, and withdraws from the cluster. In the process, GFS gets corrupted, and the only way to get it to mount again on either node is to repair it with fsck. I'm not sure if this is a problem with my cluster setup or not, but I cannot see that the nodes would fail to find each other and get DLM working. Console logs seem to indicate that everything is in fact OK, and the nodes are connected directly via a cross-over cable. If the nodes are in sync by the time GFS tries to mount, the mount succeeds, but everything grinds to a halt shortly afterwards - so much so that the only way to get things moving again is to hard-reset one of the nodes, preferably the 2nd one to join. Here is where the second thing that seems wrong happend - the first node doesn't just lock-up at this point, as one might expect (when a connected node disappears, e.g. due to a hard reset, cluster is supposed to try to fence it until it cleanly rejoins - and it can't possibly fence the other node since I haven't configured any fencing devices yet). This doesn't seem to happen. The first node seems to continue like nothing happened. This is possibly connected to the fact that by this point, GFS is corrupted and has to be fsck-ed at next boot. This part may be a cluster setup issue, so I'll raise that on the cluster list, although it seems to be a DRBD specific peculiarity - using a SAN doesn't have this issue with a nearly identical cluster.conf (only difference being the block device specification). The cluster.conf is as follows: Getting to the logs can be a bit difficult with OSR (they get reset on reboot, and it's rather difficult getting to them when the node stops responding without rebooting it), so I don't have those at the moment. Any suggestions would be welcome at this point. TIA. Gordan From rpeterso at redhat.com Mon Mar 3 15:03:45 2008 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 03 Mar 2008 09:03:45 -0600 Subject: [Linux-cluster] cannot fsck on our 16 TB gfs2 volume... In-Reply-To: <47CBD158.9000803@belnet.be> References: <47CBD158.9000803@belnet.be> Message-ID: <1204556625.2873.5.camel@technetium.msp.redhat.com> On Mon, 2008-03-03 at 11:22 +0100, PeP wrote: > hydra11 openais # fsck.gfs2 -v /dev/ftpdata/ftp > Initializing fsck > Initializing lists... > Initializing special inodes... > Validating Resource Group index. > Level 1 RG check. > (level 1 passed) > 8001 resource groups found. > Setting block ranges... > This system doesn't have enough memory + swap space to fsck this file > system. > Additional memory needed is approximately: 6000MB > Please increase your swap space by that amount and run gfs_fsck again. Hi PeP, gfs2_fsck only gives this message if it tries to allocate memory (i.e. malloc) for an in-core bitmap and the malloc fails. Are you sure that swap is turned on? On several occasions I've tried to figure out ways to reduce the memory requirements of gfs_fsck and gfs2_fsck, but I haven't found a way yet. Perhaps the problem is that it's just trying to allocate too big of memory chunks for bitmaps because of your 2G RG size. You could try specifying -r 1024 on mkfs.gfs2 and see if that helps. 
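(For reference, re-creating the filesystem with the smaller resource group
size would just mean repeating the mkfs command quoted earlier in this
thread with -r 1024 instead of -r 2048; mkfs.gfs2 wipes the existing
filesystem, so this is only an option while there is no live data on the
volume:)

    mkfs.gfs2 -p lock_dlm -r 1024 -J 16 -t ftpcluster:ftpdata /dev/ftpdata/ftp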
Regards, Bob Peterson Red Hat GFS From lhh at redhat.com Mon Mar 3 15:58:55 2008 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 03 Mar 2008 10:58:55 -0500 Subject: [Linux-cluster] Heuristics without quorum disk In-Reply-To: <762AE60C-0850-4E41-B847-29848C38F3BB@gmx.de> References: <762AE60C-0850-4E41-B847-29848C38F3BB@gmx.de> Message-ID: <1204559935.31133.46.camel@ayanami.boston.devel.redhat.com> On Fri, 2008-02-29 at 20:47 +0100, Peter wrote: > Hi! > > I try to involve another failure detection method than heartbeat for > failover. For example "nc -z -w3 " as a check wether the > service behind the port is available. > > The only thing i found is "heuristic program" from QDisk/quorumd. As i > do not have a qourum disk, i am a little bit confused how to implement > such a detection. > > Can you please help me? mon ? http://mon.wiki.kernel.org/index.php/Main_Page -- Lon From lhh at redhat.com Mon Mar 3 16:01:31 2008 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 03 Mar 2008 11:01:31 -0500 Subject: [Linux-cluster] CS5 / magma_tools ? In-Reply-To: <47CBB88F.8020907@bull.net> References: <47CBB88F.8020907@bull.net> Message-ID: <1204560091.31133.50.camel@ayanami.boston.devel.redhat.com> On Mon, 2008-03-03 at 09:36 +0100, Alain Moulle wrote: > Hi > > It seems that there is no more magma rpm in RHEL5, therefore > no more magma_tool ... is there an equivalent ? or a new rpm > to be installed to recover the magma_tool ? Magma was basically just a simple cluster-API abstraction layer designed to work with gulm or cman. Since we don't have gulm anymore, we don't have magma. cman_tool + some sed work is about your best bet in RHEL5. -- Lon From lhh at redhat.com Mon Mar 3 16:14:39 2008 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 03 Mar 2008 11:14:39 -0500 Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: References: Message-ID: <1204560879.31133.64.camel@ayanami.boston.devel.redhat.com> On Mon, 2008-03-03 at 11:23 +0000, gordan at bobich.net wrote: > I have a 2-node cluster with Open Shared Root on GFS on DRBD. Last week, I saw a car with a license plate from 'Wyoming'. Now, someone's running GFS on shared root DRBD. My world's turning upside down. > A single > node mounts GFS OK and works, but after a while seems to just block for > disk. Very much as if it started trying to fence the other node and is > waiting for acknowledgement. If CMAN was trying to fence, you'd see it in /var/log/messages. I'm not sure about DRBD. > There are no fence devices defined (so this > could be a possibility), Unlikely. Even if this was the cause, you'd still see it (and you could work around it). > Unfortunately, it doesn't end there. When an attempt is made to dual-mount > the GFS file system before the secondary is fully up to date (but is > connected and syncing), the 2nd node to join notices an inconsistency, and > withdraws from the cluster. In the process, GFS gets corrupted, and the > only way to get it to mount again on either node is to repair it with > fsck. Off the top of my head, this sounds like a DRBD thing. If sync's completed, it works, right? 
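(A crude guard against mounting before the resync has finished would be to
wait until /proc/drbd reports both disks UpToDate before issuing the mount.
A rough sketch, using the /dev/drbd1 device from the posted configuration;
the mount point is only an illustration, since in the shared-root case the
mount is actually done by the initrd:)

    while ! grep -q 'ds:UpToDate/UpToDate' /proc/drbd; do sleep 5; done
    mount -t gfs /dev/drbd1 /mnt/gfs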
-- Lon From lhh at redhat.com Mon Mar 3 16:15:33 2008 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 03 Mar 2008 11:15:33 -0500 Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 46, Issue 48 In-Reply-To: <47CBA655.3000404@midascomm.com> References: <20080229170012.BFD7C618D51@hormel.redhat.com> <47CBA655.3000404@midascomm.com> Message-ID: <1204560933.31133.67.camel@ayanami.boston.devel.redhat.com> On Mon, 2008-03-03 at 12:48 +0530, Balaji wrote: > > > > > > > Net Cat? (man nc ) ? > > > > > Dear All, > Ping is working but the members are not in cluster and its forming a > new cluster at both the nodes > The following Messages are logged in "/var/log/messages" > CMAN: forming a new cluster > CMAN: quorum regained, resuming activity > > then cluster services are active on both > the nodes and member status of current node is Online and other node as > Offline One of the nodes would have had to fence the other before starting services, unless no fencing is configured (which is a bad idea). Anyway, your bonding configuration looks correct, but we really need to see cluster.conf to have a complete picture. Maybe you posted it and I missed it? -- Lon From gordan at bobich.net Mon Mar 3 16:22:51 2008 From: gordan at bobich.net (gordan at bobich.net) Date: Mon, 3 Mar 2008 16:22:51 +0000 (GMT) Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: <1204560879.31133.64.camel@ayanami.boston.devel.redhat.com> References: <1204560879.31133.64.camel@ayanami.boston.devel.redhat.com> Message-ID: On Mon, 3 Mar 2008, Lon Hohberger wrote: >> I have a 2-node cluster with Open Shared Root on GFS on DRBD. > > Last week, I saw a car with a license plate from 'Wyoming'. Now, > someone's running GFS on shared root DRBD. My world's turning upside > down. LOL! We live in interesting times. :) And anyway, what's wrong with GFS shared root on DRBD? :) >> A single >> node mounts GFS OK and works, but after a while seems to just block for >> disk. Very much as if it started trying to fence the other node and is >> waiting for acknowledgement. > > If CMAN was trying to fence, you'd see it in /var/log/messages. I'm not > sure about DRBD. I can't see any evidence of that, and I'd expect to see something on the console about it, too. I'll set up a remote syslog to double-check. >> There are no fence devices defined (so this >> could be a possibility), > > Unlikely. Even if this was the cause, you'd still see it (and you could > work around it). > > >> Unfortunately, it doesn't end there. When an attempt is made to dual-mount >> the GFS file system before the secondary is fully up to date (but is >> connected and syncing), the 2nd node to join notices an inconsistency, and >> withdraws from the cluster. In the process, GFS gets corrupted, and the >> only way to get it to mount again on either node is to repair it with >> fsck. > > Off the top of my head, this sounds like a DRBD thing. If sync's > completed, it works, right? Not quite - it works in as far as it gets as far as mounting the file system without noticing it to be inconsistent (presumably because it isn't changing underneath it). But the FS gets corrupted. I cannot be sure right now, but I have a suspicion that both machines might be trying to mount the FS with the same journal. I could be mis-remembering and/or mis-interpreting what mount output says when it's connecting, though. I'll check it via the remote console in a bit and paste the output from each node. 
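(As a quick check, the journal id each node actually acquired can be read
back from the kernel log after the mount, since GFS logs it, e.g.:)

    dmesg | grep 'jid='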
Gordan From gordan at bobich.net Mon Mar 3 17:06:19 2008 From: gordan at bobich.net (gordan at bobich.net) Date: Mon, 3 Mar 2008 17:06:19 +0000 (GMT) Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: References: Message-ID: On Mon, 3 Mar 2008, gordan at bobich.net wrote: > I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single node > mounts GFS OK and works, but after a while seems to just block for disk. [...] > This usually happens after a period of idleness. If the node is used, this > doesn't seem to happen, but leaving it alone for half an hour causes it > to block for disk I/O. I've done a bit more digging, and the processes that hang seem to do so, as expected, in disk sleep state. For example, when trying to log in, sshd hangs. It's status (from /proc) is: Name: sshd State: D (disk sleep) SleepAVG: 97% [...] The only open file handles it has are: # ls -la /proc/9643/fd/ total 0 dr-x------ 2 root root 0 Mar 3 16:41 . dr-xr-xr-x 5 root root 0 Mar 3 16:41 .. lrwx------ 1 root root 64 Mar 3 16:42 0 -> /dev/null lrwx------ 1 root root 64 Mar 3 16:42 1 -> /dev/null lrwx------ 1 root root 64 Mar 3 16:42 2 -> /dev/null lrwx------ 1 root root 64 Mar 3 16:42 3 -> socket:[118904] lrwx------ 1 root root 64 Mar 3 16:42 4 -> /cdsl.local/var/run/utmp I am guessing that it's the utmp that is blocking things, but I'm not sure. I can read-write the /var/run/utmp file just fine (/var/run is symlinked to /cdsl.local/var/run). The socked is a TCP socket, so I cannot see that being a disk block issue. As for /dev/null, I didn't think that could be flock-ed... Looking at cman_tool status and /proc/drbd, both seem to be in order and saying everything is working. Any ideas as to what could be causing these bogus disk-sleep lock-ups? Gordan From breeves at redhat.com Mon Mar 3 17:21:42 2008 From: breeves at redhat.com (Bryn M. Reeves) Date: Mon, 03 Mar 2008 17:21:42 +0000 Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: References: Message-ID: <47CC33A6.3020207@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 gordan at bobich.net wrote: > On Mon, 3 Mar 2008, gordan at bobich.net wrote: > >> I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single >> node mounts GFS OK and works, but after a while seems to just block >> for disk. > [...] > State: D (disk sleep) > SleepAVG: 97% > [...] You can find out what it's sleeping on either by via a sysrq or by getting ps to display the wchan field of the stat data, e.g.: ps ac -opid,comm,wchan And see what symbol appears in the 3rd field. For sysrq, just echo t > /proc/sysrq-trigger Then look in dmesg & find the stacktrace for the process you're interested in. Regards, Bryn. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org iD8DBQFHzDOm6YSQoMYUY94RAv4EAJ9ETV417mVlw3DLcil+zbKC3IIccgCgnOns QAKLelOI8BMcjKMfpbW69xg= =R9XQ -----END PGP SIGNATURE----- From lhh at redhat.com Mon Mar 3 18:28:43 2008 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 03 Mar 2008 13:28:43 -0500 Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: References: <1204560879.31133.64.camel@ayanami.boston.devel.redhat.com> Message-ID: <1204568923.31133.85.camel@ayanami.boston.devel.redhat.com> On Mon, 2008-03-03 at 16:22 +0000, gordan at bobich.net wrote: > On Mon, 3 Mar 2008, Lon Hohberger wrote: > > >> I have a 2-node cluster with Open Shared Root on GFS on DRBD. > > > > Last week, I saw a car with a license plate from 'Wyoming'. 
Now, > > someone's running GFS on shared root DRBD. My world's turning upside > > down. > > LOL! We live in interesting times. :) > And anyway, what's wrong with GFS shared root on DRBD? :) (Note: That was not a stab at Wyoming or its residents; it was a joke referencing the Wyoming Conspiracy Theory which asserts that Wyoming does not exist.) > I can't see any evidence of that, and I'd expect to see something on the > console about it, too. I'll set up a remote syslog to double-check. CMAN doesn't log to the console for this sort of thing, it goes to syslog... but of course, if root's on GFS and access to that GFS is blocked, it might be missed in /var/log/messages. > > Off the top of my head, this sounds like a DRBD thing. If sync's > > completed, it works, right? > > Not quite - it works in as far as it gets as far as mounting the file > system without noticing it to be inconsistent (presumably because it isn't > changing underneath it). But the FS gets corrupted. > > I cannot be sure right now, but I have a suspicion that both machines > might be trying to mount the FS with the same journal. I could be > mis-remembering and/or mis-interpreting what mount output says when it's > connecting, though. I'll check it via the remote console in a bit and > paste the output from each node. Well, after fencing completes, one node replays the other node's journal, but then it stops using it AFAIK. When the other node boots, it uses its own journal (or should be). -- Lon From grimme at atix.de Mon Mar 3 19:42:48 2008 From: grimme at atix.de (Marc Grimme) Date: Mon, 3 Mar 2008 20:42:48 +0100 Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: References: Message-ID: <200803032042.48222.grimme@atix.de> On Monday 03 March 2008 12:23:36 gordan at bobich.net wrote: > Hi, > > I'm appear to be a experiencing a strange compound problem with this, that > is proving rather difficult to troubleshoot, so I'm hoping someone here > can spot a problem I hadn't. > > I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single > node mounts GFS OK and works, but after a while seems to just block for > disk. Very much as if it started trying to fence the other node and is > waiting for acknowledgement. There are no fence devices defined (so this > could be a possibility), but the other node was never powered up in the > first place, so it is somewhat beyond me why it might suddenly decide to > try to fence it. This usually happens after a period of idleness. If the > node is used, this doesn't seem to happen, but leaving it along for half > an hour causes it to block for disk I/O. As I cannot help you too much with DBRB problems here some infos to help you debugging them at least ;-) . Regarding OSR being stuck (manual fencing): You should try using the fenceacksv. As far as I am informed of your configuration it also is configured: Now you could do a telnet on the hung node on port 12242 login and should automatically see, if it is in manual fencing state or not. If you also install comoonics-fenceacksv-plugins-py you will be able to trigger sysrqs via the fenceacksv. Hope that helps with debugging. Marc. > > Unfortunately, it doesn't end there. When an attempt is made to dual-mount > the GFS file system before the secondary is fully up to date (but is > connected and syncing), the 2nd node to join notices an inconsistency, and > withdraws from the cluster. In the process, GFS gets corrupted, and the > only way to get it to mount again on either node is to repair it with > fsck. 
> > I'm not sure if this is a problem with my cluster setup or not, but I > cannot see that the nodes would fail to find each other and get DLM > working. Console logs seem to indicate that everything is in fact OK, and > the nodes are connected directly via a cross-over cable. > > If the nodes are in sync by the time GFS tries to mount, the mount > succeeds, but everything grinds to a halt shortly afterwards - so much so > that the only way to get things moving again is to hard-reset one of the > nodes, preferably the 2nd one to join. > > Here is where the second thing that seems wrong happend - the first node > doesn't just lock-up at this point, as one might expect (when a connected > node disappears, e.g. due to a hard reset, cluster is supposed to try to > fence it until it cleanly rejoins - and it can't possibly fence the other > node since I haven't configured any fencing devices yet). This doesn't seem > to happen. The first node seems to continue like nothing happened. This is > possibly connected to the fact that by this point, GFS is corrupted and has > to be fsck-ed at next boot. This part may be a cluster setup issue, so I'll > raise that on the cluster list, although it seems to be a DRBD specific > peculiarity - using a SAN doesn't have this issue with a nearly identical > cluster.conf (only difference being the block device specification). > > The cluster.conf is as follows: > > > > > > > > > > > "/dev/drbd1" mountopts = "noatime,nodiratime,noquota" /> > ip = "10.0.0.1" > mac = "00:0B:DB:92:C5:E1" > mask = "255.255.255.0" > gateway = "" > /> > passwd = "secret" > /> > > > > > > > > > > > "/dev/drbd1" mountopts = "noatime,nodiratime,noquota" /> > ip = "10.0.0.2" > mac = "00:0B:DB:90:4E:1B" > mask = "255.255.255.0" > gateway = "" > /> > passwd = "secret" > /> > > > > > > > > > > > > > > > Getting to the logs can be a bit difficult with OSR (they get reset on > reboot, and it's rather difficult getting to them when the node stops > responding without rebooting it), so I don't have those at the moment. > > Any suggestions would be welcome at this point. > > TIA. > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Gruss / Regards, Marc Grimme Phone: +49-89 452 3538-14 http://www.atix.de/ http://www.open-sharedroot.org/ ** ATIX Informationstechnologie und Consulting AG Einsteinstr. 10 85716 Unterschleissheim Deutschland/Germany Phone: +49-89 452 3538-0 Fax: +49-89 990 1766-0 Registergericht: Amtsgericht Muenchen Registernummer: HRB 168930 USt.-Id.: DE209485962 Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) Vorsitzender des Aufsichtsrats: Dr. Martin Buss From gordan at bobich.net Mon Mar 3 21:37:44 2008 From: gordan at bobich.net (Gordan Bobic) Date: Mon, 03 Mar 2008 21:37:44 +0000 Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: <47CC33A6.3020207@redhat.com> References: <47CC33A6.3020207@redhat.com> Message-ID: <47CC6FA8.7080607@bobich.net> Bryn M. Reeves wrote: >>> I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single >>> node mounts GFS OK and works, but after a while seems to just block >>> for disk. >> [...] >> State: D (disk sleep) >> SleepAVG: 97% >> [...] > > You can find out what it's sleeping on either by via a sysrq or by > getting ps to display the wchan field of the stat data, e.g.: > > ps ac -opid,comm,wchan > > And see what symbol appears in the 3rd field. 
And this is what comes out for all the stuck processes: ps ax -opid,comm,wchan | grep ssh 9250 sshd - 9507 sshd gdlm_plock 9642 sshd gdlm_plock They are all stuck in the gdlm_plock function. I figured it'd be something like this. How do I debug this further? Gordan From gordan at bobich.net Mon Mar 3 21:44:56 2008 From: gordan at bobich.net (Gordan Bobic) Date: Mon, 03 Mar 2008 21:44:56 +0000 Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: <200803032042.48222.grimme@atix.de> References: <200803032042.48222.grimme@atix.de> Message-ID: <47CC7158.5050401@bobich.net> Marc Grimme wrote: > As I cannot help you too much with DBRB problems here some infos to help you > debugging them at least ;-) . > > Regarding OSR being stuck (manual fencing): > You should try using the fenceacksv. As far as I am informed of your > configuration it also is configured: [...] > passwd = "password" > /> [...] > Now you could do a telnet on the hung node on port 12242 login and should > automatically see, if it is in manual fencing state or not. Easier said than done. My cluster IP is on a separate subnet that nothing else is plugged into, and I suspect the fenceackserver is only listening on that interface (it doesn't seem to respond on the other interfaces). I'll see if I can get a laptop on it and see what it says... > If you also install comoonics-fenceacksv-plugins-py you will be able to > trigger sysrqs via the fenceacksv. > > Hope that helps with debugging. Thanks, I appreciate it. :-) Gordan From gordan at bobich.net Mon Mar 3 21:57:41 2008 From: gordan at bobich.net (Gordan Bobic) Date: Mon, 03 Mar 2008 21:57:41 +0000 Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: <200803032042.48222.grimme@atix.de> References: <200803032042.48222.grimme@atix.de> Message-ID: <47CC7455.9090600@bobich.net> Marc Grimme wrote: > passwd = "password" > /> [...] > Now you could do a telnet on the hung node on port 12242 login and should > automatically see, if it is in manual fencing state or not. Hmm, this doesn't seem to be responding. Is there a separate package that needs to be installed to add this feature? I've not seen mkinitrd moan about missing files, so I just assumed it was all there. Gordan From superjunk at 126.com Tue Mar 4 05:28:15 2008 From: superjunk at 126.com (Superjunk) Date: Tue, 4 Mar 2008 13:28:15 +0800 Subject: [Linux-cluster] Where can I find old RHCS version References: <47CBD158.9000803@belnet.be> Message-ID: <010f01c87db8$8c9c3100$9665a8c0@dyxnet537f1698> Hello As title, Just like cman-kernel-2.6.9-45.4.centos4.i686.rpm etc.... From grimme at atix.de Tue Mar 4 08:26:24 2008 From: grimme at atix.de (Marc Grimme) Date: Tue, 4 Mar 2008 09:26:24 +0100 Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: <47CC7455.9090600@bobich.net> References: <200803032042.48222.grimme@atix.de> <47CC7455.9090600@bobich.net> Message-ID: <200803040926.25052.grimme@atix.de> On Monday 03 March 2008 22:57:41 Gordan Bobic wrote: > Marc Grimme wrote: > > > passwd = "password" > > /> > > [...] > > > Now you could do a telnet on the hung node on port 12242 login and should > > automatically see, if it is in manual fencing state or not. > > Hmm, this doesn't seem to be responding. Is there a separate package > that needs to be installed to add this feature? I've not seen mkinitrd > moan about missing files, so I just assumed it was all there. Check for comoonics-bootimage-fenceacksv that's the software that must be installed and rebuild an initrd. 
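(Once the rebuilt initrd is in place, the check described earlier is just a
telnet to the fenceacksv port from a working node, for example — using the
port from the posted cluster.conf and one of the cluster addresses as an
illustration:)

    telnet 10.0.0.1 12242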
Marc > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Gruss / Regards, Marc Grimme Phone: +49-89 452 3538-14 http://www.atix.de/ http://www.open-sharedroot.org/ ** ATIX Informationstechnologie und Consulting AG Einsteinstr. 10 85716 Unterschleissheim Deutschland/Germany Phone: +49-89 452 3538-0 Fax: +49-89 990 1766-0 Registergericht: Amtsgericht Muenchen Registernummer: HRB 168930 USt.-Id.: DE209485962 Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) Vorsitzender des Aufsichtsrats: Dr. Martin Buss From kadlec at sunserv.kfki.hu Tue Mar 4 08:36:10 2008 From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsef) Date: Tue, 4 Mar 2008 09:36:10 +0100 (CET) Subject: [Linux-cluster] Strange 'lock out all' Message-ID: Hi, A strange situation happened here in our five-node GFS cluster. Two of the nodes (lxserv1-gfs and web1-gfs) were fenced off due to an administrator error. We still should have been able to run smoothly but two nodes got disallowed(?!) and so only one vote remained. From the last standing node: root at saturn:~# cman_tool status Version: 6.0.1 Config Version: 6 Cluster Name: kfki Cluster Id: 1583 Cluster Member: Yes Cluster Generation: 2332 Membership state: Cluster-Member Nodes: 4 Expected votes: 5 Total votes: 1 Quorum: 3 Active subsystems: 7 Flags: DisallowedNodes Ports Bound: 0 11 Node name: saturn-gfs Node ID: 5 Multicast addresses: 224.0.0.3 Node addresses: 192.168.192.18 Disallowed nodes: lxserv0-gfs web0-gfs Why and how does a node get disallowed? How could we prevent it to happen in the future? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From balajisundar at midascomm.com Tue Mar 4 09:39:39 2008 From: balajisundar at midascomm.com (Balaji) Date: Tue, 04 Mar 2008 15:09:39 +0530 Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 47, Issue 4 In-Reply-To: <20080303170008.7B39F618C7D@hormel.redhat.com> References: <20080303170008.7B39F618C7D@hormel.redhat.com> Message-ID: <47CD18DB.7080001@midascomm.com> Dear All, I have missed my cluster.conf complete picture and my cluster configurations are like -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Wed Mar 5 09:09:31 2008 From: Alain.Moulle at bull.net (Alain Moulle) Date: Wed, 05 Mar 2008 10:09:31 +0100 Subject: [Linux-cluster] CS4 U4 / timers tuning Message-ID: <47CE634B.6020603@bull.net> Hi is there a rule to follow between the DLM lock_timeout and the deadnode_timeout value ? Meaning for example that the first one must be always lesser than the second one ? And if so, could we have a deadnode_timeout=60s and the /proc/cluster/config/dlm/lock_timeout at 70s ? or are there some upper limits not to exceed ? Another question : is there somewhere a list of all parameters around CS4 that we are allowed to tune ? Thanks Regards Alain Moull? From ccaulfie at redhat.com Wed Mar 5 09:51:58 2008 From: ccaulfie at redhat.com (Christine Caulfield) Date: Wed, 05 Mar 2008 09:51:58 +0000 Subject: [Linux-cluster] CS4 U4 / timers tuning In-Reply-To: <47CE634B.6020603@bull.net> References: <47CE634B.6020603@bull.net> Message-ID: <47CE6D3E.3030904@redhat.com> Alain Moulle wrote: > Hi > > is there a rule to follow between the DLM lock_timeout > and the deadnode_timeout value ? 
> Meaning for example that the first one must be always lesser than > the second one ? > > And if so, could we have a deadnode_timeout=60s and the > /proc/cluster/config/dlm/lock_timeout at 70s ? or are > there some upper limits not to exceed ? The DLM's lock_timeout should always be greater than cman's deadnode_timeout. A sensible minimum is about 1.5 times the cman value, but it can go as high as you like. > Another question : > is there somewhere a list of all parameters around CS4 that > we are allowed to tune ? This isn't exactly 'documentation' but it does briefly describe the cman tunables: http://sourceware.org/git/?p=cluster.git;a=blob_plain;f=cman-kernel/src/config.c;hb=RHEL4 -- Chrissie From abhramica at gmail.com Wed Mar 5 12:12:13 2008 From: abhramica at gmail.com (Abhra Paul) Date: Wed, 5 Mar 2008 17:42:13 +0530 Subject: [Linux-cluster] Problem in running a job in linux cluster Message-ID: <8e3fbac10803050412u7e247d62v518174362a5591d7@mail.gmail.com> Hi, When I am submitting a job it is immediately killed. I have checked the /var/spool/mail directory. There is a message like this: PBS Job Id: 1626.cluster1.iacs.res.in Job Name: test An error has occurred processing your job, see below. Post job file processing error; job 1626.cluster1.iacs.res.in on host node7.iacs.res.in/3+node7.iacs.res.in/2+node7.iacs.res.in/1+node7.iacs.res.in/0+node8.iacs.res.in/3+node8.iacs.res.in/2+node8.iacs.res.in/1+node8.iacs.res.in/0+node9.iacs.res.in/3+node9.iacs.res.in/2+node9.iacs.res.in/1+node9.iacs.res.in/0+node10.iacs.res.in/3+node10.iacs.res.in/2+node10.iacs.res.in/1+node10.iacs.res.in/0Unknown resource type REJHOST=node7.iacs.res.in MSG=invalid home directory '/home/pcpkp/scratch2/CsH' specified, errno=2 (No such file or directory) Actually to run the job home directory(/home/pcpkp) is unavailable. I have checked /etc/exports, /auto.master this two configuration file. All those things are ok. I am unable to find out the actual problem. Please help regarding this matter. With regards Abhra From kadlec at sunserv.kfki.hu Wed Mar 5 12:36:34 2008 From: kadlec at sunserv.kfki.hu (Kadlecsik Jozsef) Date: Wed, 5 Mar 2008 13:36:34 +0100 (CET) Subject: R: [Linux-cluster] Performance tuning help sought In-Reply-To: <47CD7F43.1080408@nasa.gov> References: <6F861500A5092B4C8CD653DE20A4AA0D511CBC@exchange3.comune.prato.local> <47CD7F43.1080408@nasa.gov> Message-ID: Hi, On Tue, 4 Mar 2008, Bill Sellers wrote: > > There are some TCP tuning parameters you can change in the /etc/sysctl.conf. > Adding these made a noticeable improvement for us. Also, if you use any NFS, > set your rsize and wsize = 32768 (NFS mount options) in a Linux 2.6 kernel. > The noatime mount option appears to work well. We have abandoned ext3 for > filesystems > 1Tb in favor of reiserfs and jfs. These other filesytems > appear to give better performance in a cluster environment. If you use gfs as > the filesystem, then the above point is moot. > > http://dsd.lbl.gov/TCP-tuning/linux.html > > http://www.nren.nasa.gov/tcp_tuning.html > > http://nfs.sourceforge.net/nfs-howto/ar01s05.html Thank you the tips, to all of you, NFS will be tuned for sure! Digging into the mailing list archive brought out some GFS tuning options too. 
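(For reference, the GFS tunables listed below are applied per mount point
with gfs_tool settune and do not survive a remount, so they are usually
re-applied from an init script. A rough sketch — the mount point /gfs is
only a placeholder:)

    gfs_tool settune /gfs statfs_fast 1
    gfs_tool settune /gfs glock_purge 50
    gfs_tool settune /gfs demote_secs 150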
What is already done: - statfs_slots set to 128 - statfs_fast set to 1 What we are going to try is to set - glock_purge to 50 - demote_secs to 150 Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From s.vinod79 at gmail.com Wed Mar 5 13:14:01 2008 From: s.vinod79 at gmail.com (Vinod Kumar) Date: Wed, 5 Mar 2008 18:44:01 +0530 Subject: [Linux-cluster] Where can I find old RHCS version In-Reply-To: <010f01c87db8$8c9c3100$9665a8c0@dyxnet537f1698> References: <47CBD158.9000803@belnet.be> <010f01c87db8$8c9c3100$9665a8c0@dyxnet537f1698> Message-ID: Hi, You can find some of the old rpms for RHEL4/CentOS4 here http://vault.centos.org/4.4/apt/i386/RPMS.csgfs/ Vinod On 3/4/08, Superjunk wrote: > > Hello > > As title, Just like cman-kernel-2.6.9-45.4.centos4.i686.rpm etc.... > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Wed Mar 5 13:25:42 2008 From: Alain.Moulle at bull.net (Alain Moulle) Date: Wed, 05 Mar 2008 14:25:42 +0100 Subject: [Linux-cluster] CS4 U4 / timers tuning Message-ID: <47CE9F56.9080600@bull.net> Thanks Chrissie And is there a way to change the dlm lock_timeout with a parameter in cluster.conf ? Thanks Regards Alain Alain Moulle wrote: >> Hi >> >> is there a rule to follow between the DLM lock_timeout >> and the deadnode_timeout value ? >> Meaning for example that the first one must be always lesser than >> the second one ? >> >> And if so, could we have a deadnode_timeout=60s and the >> /proc/cluster/config/dlm/lock_timeout at 70s ? or are >> there some upper limits not to exceed ? The DLM's lock_timeout should always be greater than cman's deadnode_timeout. A sensible minimum is about 1.5 times the cman value, but it can go as high as you like. >> Another question : >> is there somewhere a list of all parameters around CS4 that >> we are allowed to tune ? This isn't exactly 'documentation' but it does briefly describe the cman tunables: http://sourceware.org/git/?p=cluster.git;a=blob_plain;f=cman-kernel/src/config.c;hb=RHEL4 -- Chrissie From ccaulfie at redhat.com Wed Mar 5 13:44:06 2008 From: ccaulfie at redhat.com (Christine Caulfield) Date: Wed, 05 Mar 2008 13:44:06 +0000 Subject: [Linux-cluster] CS4 U4 / timers tuning In-Reply-To: <47CE9F56.9080600@bull.net> References: <47CE9F56.9080600@bull.net> Message-ID: <47CEA3A6.4010607@redhat.com> Alain Moulle wrote: > Thanks Chrissie > > And is there a way to change the dlm lock_timeout with > a parameter in cluster.conf ? > Unfortunately not. Only the cman timers can be changed in cluster.conf. Because the DLM has no userspace portion in RHEL4 there's nothing that can read cluster.conf for it. > Alain Moulle wrote: > >>> Hi >>> >>> is there a rule to follow between the DLM lock_timeout >>> and the deadnode_timeout value ? >>> Meaning for example that the first one must be always lesser than >>> the second one ? >>> >>> And if so, could we have a deadnode_timeout=60s and the >>> /proc/cluster/config/dlm/lock_timeout at 70s ? or are >>> there some upper limits not to exceed ? > > > The DLM's lock_timeout should always be greater than cman's > deadnode_timeout. A sensible minimum is about 1.5 times the cman value, > but it can go as high as you like. 
> > >>> Another question : >>> is there somewhere a list of all parameters around CS4 that >>> we are allowed to tune ? > > > This isn't exactly 'documentation' but it does briefly describe the cman > tunables: > > http://sourceware.org/git/?p=cluster.git;a=blob_plain;f=cman-kernel/src/config.c;hb=RHEL4 > > -- Chrissie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Chrissie From nerix at free.fr Wed Mar 5 16:10:05 2008 From: nerix at free.fr (nerix) Date: Wed, 05 Mar 2008 17:10:05 +0100 Subject: [Linux-cluster] qdisk master Message-ID: <1204733405.47cec5dd7fd07@imp.free.fr> Hi list, I have troubles in a samba 2 nodes cluster using qdisk (shared iscsi disk). I did some "crash test" of nodes (one by one of course), unpluging the network cable. When the service provided (samba) and the qdisk master are located on the same node, the node remaining is unable to get the qdisk master role and the cluster dead. When the qdisk master is located on the node which remain before simulate of crash on the node which has the service. All is working. The service is relocated on the node remaining. When I do a test with cman_tool kill -n node, all is ok. Does the transfert of the qdisk master role need network connection ? Thanks for reading ! Derrick. From nerix at free.fr Wed Mar 5 17:55:13 2008 From: nerix at free.fr (nerix) Date: Wed, 05 Mar 2008 18:55:13 +0100 Subject: [Linux-cluster] qdisk master In-Reply-To: <1204733405.47cec5dd7fd07@imp.free.fr> References: <1204733405.47cec5dd7fd07@imp.free.fr> Message-ID: <1204739713.47cede813deb1@imp.free.fr> Resolved by https://bugzilla.redhat.com/show_bug.cgi?id=429927 and https://bugzilla.redhat.com/attachment.cgi?id=292708 Thanks to lon from #linux-cluster on freenode Quoting nerix : > > Hi list, > > I have troubles in a samba 2 nodes cluster using qdisk (shared iscsi disk). > > I did some "crash test" of nodes (one by one of course), unpluging the > network > cable. > When the service provided (samba) and the qdisk master are located on the > same > node, the node remaining is unable to get the qdisk master role and the > cluster > dead. > When the qdisk master is located on the node which remain before simulate of > crash on the node which has the service. All is working. The service is > relocated on the node remaining. > When I do a test with cman_tool kill -n node, all is ok. > > Does the transfert of the qdisk master role need network connection ? > > Thanks for reading ! > Derrick. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From grimme at atix.de Wed Mar 5 17:10:20 2008 From: grimme at atix.de (Marc Grimme) Date: Wed, 5 Mar 2008 18:10:20 +0100 Subject: [Linux-cluster] GFS + DRBD Problems In-Reply-To: <47CDCEA9.9020801@bobich.net> References: <47CDC9B1.4060907@bobich.net> <47CDCEA9.9020801@bobich.net> Message-ID: <200803051810.21020.grimme@atix.de> You're sure the times are equal on both nodes even written back to hwclock? ntpdate && hwclock --systohc Cause I had exactly the same behaviour last week where only the times between the nodes were different. They would not get fenced, imediately after the second node joined the cluster the first one "lost connection to node 1" and all cluster services just vanished on that node, the filesystem was still mounted (on both nodes) and so on. After setting times to normal on both nodes everything was working as expected. Marc. 
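(A quick way to confirm the clocks on both nodes really agree before
starting the cluster stack is to compare them side by side; the second node
name here is only an example:)

    date; ssh sentinel2c date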
On Tuesday 04 March 2008 23:35:21 Gordan Bobic wrote: > Gordan Bobic wrote: > > As I thought, the problem I'm seeing is indeed rather multi-part. The > > first part is now resolved - large time-skips due to the system clock > > being out of date until ntpd syncs it up. It seems that large time jumps > > made dlm choke. > > > > Now for part 2: > > > > The two nodes connect - certainly enough to sync up DRBD. That stage > > goes through fine. They start cman and other cluster components, but it > > would appear then never actually find each other. > > > > When mounting the shared file system: > > > > Node 1: > > GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock... > > GFS: fsid=sentinel:root.0: jid=0: Looking at journal... > > GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock... > > GFS: fsid=sentinel:root.0: jid=0: Replaying journal... > > GFS: fsid=sentinel:root.0: jid=0: Replayed 54 of 197 blocks > > GFS: fsid=sentinel:root.0: jid=0: replays = 54, skips = 36, sames = 107 > > GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s > > GFS: fsid=sentinel:root.0: jid=0: Done > > GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock... > > GFS: fsid=sentinel:root.0: jid=1: Looking at journal... > > GFS: fsid=sentinel:root.0: jid=1: Done > > GFS: fsid=sentinel:root.0: Scanning for log elements... > > GFS: fsid=sentinel:root.0: Found 0 unlinked inodes > > GFS: fsid=sentinel:root.0: Found quota changes for 7 IDs > > GFS: fsid=sentinel:root.0: Done > > > > > > Node 2: > > GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock... > > GFS: fsid=sentinel:root.0: jid=0: Looking at journal... > > GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock... > > GFS: fsid=sentinel:root.0: jid=0: Replaying journal... > > GFS: fsid=sentinel:root.0: jid=0: Replayed 6 of 6 blocks > > GFS: fsid=sentinel:root.0: jid=0: replays = 6, skips = 0, sames = 0 > > GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s > > GFS: fsid=sentinel:root.0: jid=0: Done > > GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock... > > GFS: fsid=sentinel:root.0: jid=1: Looking at journal... > > GFS: fsid=sentinel:root.0: jid=1: Done > > GFS: fsid=sentinel:root.0: Scanning for log elements... > > GFS: fsid=sentinel:root.0: Found 0 unlinked inodes > > GFS: fsid=sentinel:root.0: Found quota changes for 2 IDs > > GFS: fsid=sentinel:root.0: Done > > > > Unless I'm reading this wrong, they are both trying to use JID 0. > > > > The second node to join generally chokes at some point during the boot, > > but AFTER it mounted the GFS volume. On the booted node, cman_tool > > status says: > > > > # cman_tool status > > Version: 6.0.1 > > Config Version: 20 > > Cluster Name: sentinel > > Cluster Id: 28150 > > Cluster Member: Yes > > Cluster Generation: 4 > > Membership state: Cluster-Member > > Nodes: 1 > > Expected votes: 1 > > Total votes: 1 > > Quorum: 1 > > Active subsystems: 6 > > Flags: 2node > > Ports Bound: 0 > > Node name: sentinel1c > > Node ID: 1 > > Multicast addresses: 239.192.109.100 > > Node addresses: 10.0.0.1 > > > > So the second node never joined. > > I know for a fact that the network connection between them is working, > > as they sync DRBD. 
> > > > cluster.conf is here: > > > > > > > > > > > > > > > > > > > > > > > > > "/dev/drbd1" > > mountopts = > > "defaults,noatime,nodiratime,noquota" > > /> > > > ip = "10.0.0.1" > > mac = "00:0B:DB:92:C5:E1" > > mask = "255.255.255.0" > > gateway = "" > > /> > > > passwd = "password" > > /> > > > > > > > > > > > > > > > > > > > > > > > > > > > "/dev/drbd1" > > mountopts = > > "defaults,noatime,nodiratime,noquota" > > /> > > > ip = "10.0.0.2" > > mac = "00:0B:DB:90:4E:1B" > > mask = "255.255.255.0" > > gateway = "" > > /> > > > passwd = "password" > > /> > > > > > > > > > > > > > > > > > > > > > > > login="root" name="sentinel1d" passwd="password"/> > > > login="root" name="sentinel2d" passwd="password"/> > > > > > > > > > > > > > > > > What could be causing the nodes to not join in the cluster? > > A bit of additional information. When both nodes come up at the same > time, they actually sort out the journals between them correctly. One > gets 0, the other 1. > > But almost immediately afterwards, this happens on the 2nd node: > dlm: closing connection to node 1 > dlm: connect from non cluster node > > shortly followed by DRBD keeling over: > > drbd1: Handshake successful: DRBD Network Protocol version 86 > drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC > drbd1: conn( WFConnection -> WFReportParams ) > drbd1: Discard younger/older primary did not found a decision > Using discard-least-changes instead > drbd1: State change failed: Device is held open by someone > drbd1: state = { cs:WFReportParams st:Primary/Unknown > ds:UpToDate/DUnknown r-- > - } > drbd1: wanted = { cs:WFReportParams st:Secondary/Unknown > ds:UpToDate/DUnknown r > --- } > drbd1: helper command: /sbin/drbdadm pri-lost-after-sb > drbd1: Split-Brain detected, dropping connection! > drbd1: self > 866625728B4E10B9:E4C3366683AFBC6B:ED24F75CC7B3F4A5:EFFAB6EF6A3CC469 > drbd1: peer > 572F799325FDF21D:E4C3366683AFBC6B:ED24F75CC7B3F4A4:EFFAB6EF6A3CC469 > drbd1: conn( WFReportParams -> Disconnecting ) > drbd1: helper command: /sbin/drbdadm split-brain > drbd1: error receiving ReportState, l: 4! > drbd1: asender terminated > drbd1: tl_clear() > drbd1: Connection closed > drbd1: conn( Disconnecting -> StandAlone ) > drbd1: receiver terminated > > At this point the 1st node seems to lock up, but despite fencing being > set up, the 2nd node doesn't get powered down. The fencing device is a > DRAC III ERA/O. Rebooting the 2nd node makes things revert back to it > trying to use JID 0, which is already used by the 1st node, and things > go wrong again. > > I'm sure I must be missing something obvious here, but for the life of > me I cannot see what. > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Gruss / Regards, Marc Grimme Phone: +49-89 452 3538-14 http://www.atix.de/ http://www.open-sharedroot.org/ ** ATIX Informationstechnologie und Consulting AG Einsteinstr. 10 85716 Unterschleissheim Deutschland/Germany Phone: +49-89 452 3538-0 Fax: +49-89 990 1766-0 Registergericht: Amtsgericht Muenchen Registernummer: HRB 168930 USt.-Id.: DE209485962 Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.) Vorsitzender des Aufsichtsrats: Dr. 
Martin Buss

From jamesc at exa.com Wed Mar 5 20:46:34 2008
From: jamesc at exa.com (James Chamberlain)
Date: Wed, 5 Mar 2008 15:46:34 -0500 (EST)
Subject: [Linux-cluster] Kernel panic
Message-ID: 

Two of the three nodes in my CS/GFS cluster just crashed, which dissolved
quorum and allowed me to finally capture part of the kernel panic.  Here is
what was displayed on the screen:

 [] :gfs:gfs_write+0x0/0x8
 [] :gfs:gfs_glock_d1+0x15c/0x16c
 [] :gfs:gfs_open+0x12c/0x15e
 [] :nfsd:nfsd_vfs_write+0xf2/0x2e1
 [] :gfs:gfs_open+0x0/0x15e
 [] __dentry_open+0x101/0x1dc
 [] :nfsd:nfsd_write+0xb5/0xd5
 [] :nfsd:nfsd3_proc_write+0xea/0x109
 [] :nfsd:nfsd_dispatch+0xd7/0x198
 [] :sunrpc:svc_process+0x44d/0x70b
 [] __down_read+0x12/0x92
 [] :nfsd:nfsd+0x0/0x2db
 [] :nfsd:nfsd+0x1ae/0x2db
 [] child_rip+0xa/0x11
 [] :nfsd:nfsd+0x0/0x2db
 [] :nfsd:nfsd+0x0/0x2db
 [] child_rip+0x0/0x11

Code: Bad RIP value.
RIP  [<0000000000000000>] _stext+0x7fff000/0x1000
 RSP 
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Fatal exception

Is this enough to figure out what happened, and how can I prevent this from
happening in the future?  I suspect that all the instability I have had with
my CS/GFS cluster is related to this sort of crash.

I am using the following on all three nodes:

cman-2.0.73-1.el5_1.1
openais-0.80.3-7.el5
rgmanager-2.0.31-1.el5.centos
lvm2-cluster-2.02.26-1.el5
luci-0.10.0-6.el5.centos.1
ricci-0.10.0-6.el5.centos.1
kernel-2.6.18-53.1.4.el5
gfs-utils-0.1.12-1.el5
kmod-gfs-0.1.19-7.el5_1.1

Thanks,
James

From qqlka at nask.pl Thu Mar 6 14:15:03 2008
From: qqlka at nask.pl (=?iso-8859-2?Q?Agnieszka_Kuka=B3owicz?=)
Date: Thu, 6 Mar 2008 15:15:03 +0100
Subject: [Linux-cluster] Clusvcadm doesn't behave as it should
Message-ID: <063d01c87f94$789558c0$0777b5c2@gda07ak>

Hi,

I have a problem with the clusvcadm command. In some cases it doesn't behave
as it should.

My cluster has 3 nodes: w11.local, w12.local, w21.local.

Member Name                        ID   Status
------ ----                        ---- ------
w11.local.polska.pl                 1   Online, Local, rgmanager
w12.local.polska.pl                 2   Online, rgmanager
w21.local.polska.pl                 4   Online, rgmanager
/dev/xvdd1                          0   Online, Quorum Disk

I configured 2 simple httpd services in a restricted failover domain. The
/etc/cluster/cluster.conf file is something like that:

Any enlightenment would be much appreciated :)

Regards,

Arjuna Christensen | Systems Engineer
Maximum Internet Ltd
7a Parkhead Pl, Albany, North Shore, 0632 | PO Box 8006, Auckland, 1150, NZ
DDI: + 64 9 913 9683 | Ph: +64 9 915 1825 | Fax: +64 9 300 7227
arjuna.christensen at maxnet.co.nz | www.maxnet.co.nz
________________________________
Maxnet | mission critical internet
________________________________
This email (including any attachments) is confidential and intended only for
the person to whom it is addressed. If you have received this email in error,
please notify the sender immediately and erase all copies of this message and
attachments. The views expressed in this email do not necessarily reflect
those held by Maxnet.
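A side note on James's truncated panic report above: on RHEL5 the usual way to capture the whole trace (and a vmcore to analyse afterwards) is a serial console plus kdump. A rough sketch with illustrative values, since the crashkernel reservation in particular depends on memory size and should be checked against the RHEL5 kdump documentation:

    # grub.conf kernel line: add a serial console and reserve memory for the
    # crash kernel (values are examples only)
    kernel /vmlinuz-2.6.18-53.1.4.el5 ro root=... console=tty0 console=ttyS0,115200 crashkernel=128M@16M

    # then enable kdump so a vmcore is written on the next panic
    yum install kexec-tools
    chkconfig kdump on
    service kdump start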
From arjuna.christensen at maxnet.co.nz Fri Mar 28 01:01:50 2008 From: arjuna.christensen at maxnet.co.nz (Arjuna Christensen) Date: Fri, 28 Mar 2008 14:01:50 +1300 Subject: [Linux-cluster] RHCS / DRBD / MYSQL In-Reply-To: <6DD7CC182D1E154E9F5FF6301B077EFE72EAA0@exchange01.office.maxnet.co.nz> References: <6DD7CC182D1E154E9F5FF6301B077EFE72EAA0@exchange01.office.maxnet.co.nz> Message-ID: <6DD7CC182D1E154E9F5FF6301B077EFE72EAA1@exchange01.office.maxnet.co.nz> Of note, testing with rg_tool yields the correct sequence (or so I'd expect), yet, I don't see the mysql/asterisk scripts called at all... root at asterisktest01:/etc/cluster# rg_test noop cluster.conf start service asterisk Running in test mode. Starting asterisk... [start] service:asterisk [start] ip:192.168.111.1 [start] script:drbdcontrol [start] fs:mysql_disk [start] script:mysql [start] script:asterisk Start of asterisk complete +++ Memory table dump +++ 0xb74d6d10 (32 bytes) allocation trace: 0x804bc04 --- End Memory table dump --- root at asterisktest01:/etc/cluster# rg_test noop cluster.conf stop service asterisk Running in test mode. Stopping asterisk... [stop] script:asterisk [stop] script:mysql [stop] fs:mysql_disk [stop] script:drbdcontrol [stop] ip:192.168.111.1 [stop] service:asterisk Stop of asterisk complete +++ Memory table dump +++ 0xb74c5d10 (32 bytes) allocation trace: 0x804bc04 --- End Memory table dump --- Arjuna Christensen?|?Systems Engineer? Maximum Internet Ltd 7a Parkhead Pl, Albany, North Shore, 0632 | PO Box 8006, Auckland, 1150, NZ DDI: + 64 9?913 9683 | Ph: +64 9 915 1825 | Fax:: +64 9 300 7227 arjuna.christensen at maxnet.co.nz| www.maxnet.co.nz ________________________________ Maxnet | mission critical internet ________________________________ This email (including any attachments) is confidential and intended only for the person to whom it is addressed. If you have received this email in error, please notify the sender immediately and erase all copies of this message and attachments. The views expressed in this email do not necessarily reflect those held by Maxnet. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Arjuna Christensen Sent: Friday, 28 March 2008 1:54 p.m. To: linux-cluster at redhat.com Subject: [Linux-cluster] RHCS / DRBD / MYSQL Hiyas, I'm having some slight issues getting RHCS to work with drbd/asterisk/mysql. Basically I've got a service configured which (as far as I know) should bring up an IP address, set itself to the DRBD primary, mount the MySQL DRBD partition and then proceed to start up mysql/asterisk. I can see it bringing the IP up, setting itself as the DRBD primary and even mounting the partition, yet it fails to bring up asterisk/mysql. Could anyone take a look at my clusterconf - see below. Any enlightenment would be much appreciated :) Regards, Arjuna Christensen?|?Systems Engineer? Maximum Internet Ltd 7a Parkhead Pl, Albany, North Shore, 0632 | PO Box 8006, Auckland, 1150, NZ DDI: + 64 9?913 9683 | Ph: +64 9 915 1825 | Fax:: +64 9 300 7227 arjuna.christensen at maxnet.co.nz| www.maxnet.co.nz ________________________________ Maxnet | mission critical internet ________________________________ This email (including any attachments) is confidential and intended only for the person to whom it is addressed. If you have received this email in error, please notify the sender immediately and erase all copies of this message and attachments. 
The views expressed in this email do not necessarily reflect those held by Maxnet. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From arjuna.christensen at maxnet.co.nz Fri Mar 28 04:24:16 2008 From: arjuna.christensen at maxnet.co.nz (Arjuna Christensen) Date: Fri, 28 Mar 2008 17:24:16 +1300 Subject: [Linux-cluster] RHCS / DRBD / MYSQL In-Reply-To: <6DD7CC182D1E154E9F5FF6301B077EFE72EAA1@exchange01.office.maxnet.co.nz> References: <6DD7CC182D1E154E9F5FF6301B077EFE72EAA0@exchange01.office.maxnet.co.nz> <6DD7CC182D1E154E9F5FF6301B077EFE72EAA1@exchange01.office.maxnet.co.nz> Message-ID: <6DD7CC182D1E154E9F5FF6301B077EFE72EAA4@exchange01.office.maxnet.co.nz> Sorry to bump my own post again (again), but I've loaded my config up in system-config-cluster and rebuilt it to the best of my knowledge, and am still experiencing the same issue (my childest scripts aren't being launched). See rg_tool: Running in test mode. Loaded 17 resource rules === Resources List === Resource type: fs Instances: 1/1 Agent: fs.sh Attributes: name = mysqlfs [ primary ] mountpoint = /var/lib/mysql [ unique required ] device = /dev/drbd0 [ unique required ] fstype = ext3 force_unmount = 0 self_fence = 1 nfslock [ inherit("nfslock") ] fsid = 11607 force_fsck = 0 options = Resource type: ip Instances: 1/1 Agent: ip.sh Attributes: address = 192.168.111.1 [ primary unique ] monitor_link = 1 nfslock [ inherit("service%nfslock") ] Resource type: script Agent: script.sh Attributes: name = drbdcontrol [ primary unique ] file = /etc/init.d/drbdcontrol [ unique required ] service_name [ inherit("service%name") ] Resource type: script Agent: script.sh Attributes: name = asterisk [ primary unique ] file = /etc/init.d/asterisk [ unique required ] service_name [ inherit("service%name") ] Resource type: script Agent: script.sh Attributes: name = mysql [ primary unique ] file = /etc/init.d/mysql [ unique required ] service_name [ inherit("service%name") ] Resource type: service [INLINE] Instances: 1/1 Agent: service.sh Attributes: name = asteriskcluster [ primary unique required ] domain = asterisk autostart = 1 === Resource Tree === service { name = "asteriskcluster"; domain = "asterisk"; autostart = "1"; ip { address = "192.168.111.1"; monitor_link = "1"; script { name = "drbdcontrol"; file = "/etc/init.d/drbdcontrol"; fs { name = "mysqlfs"; mountpoint = "/var/lib/mysql"; device = "/dev/drbd0"; fstype = "ext3"; force_unmount = "0"; self_fence = "1"; fsid = "11607"; force_fsck = "0"; options = ""; script { name = "asterisk"; file = "/etc/init.d/asterisk"; } script { name = "mysql"; file = "/etc/init.d/mysql"; } } } } } === Failover Domains === Failover domain: asterisk Flags: Restricted Node asterisktest01 priority 1 nodeid 1 Node asterisktest02 priority 1 nodeid 2 +++ Memory table dump +++ 0xb749be48 (16 bytes) allocation trace: 0x804bc04 0xb748b3a8 (24 bytes) allocation trace: 0x804bc04 --- End Memory table dump --- And my revised cluster.conf: As before, I'm keen for any enlightenment :( Regards, Arjuna Christensen?|?Systems Engineer? Maximum Internet Ltd DDI: + 64 9?913 9683 | Ph: +64 9 915 1825 | Fax:: +64 9 300 7227 arjuna.christensen at maxnet.co.nz| www.maxnet.co.nz -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Arjuna Christensen Sent: Friday, 28 March 2008 2:02 p.m. 
To: linux-cluster at redhat.com Subject: RE: [Linux-cluster] RHCS / DRBD / MYSQL Of note, testing with rg_tool yields the correct sequence (or so I'd expect), yet, I don't see the mysql/asterisk scripts called at all... root at asterisktest01:/etc/cluster# rg_test noop cluster.conf start service asterisk Running in test mode. Starting asterisk... [start] service:asterisk [start] ip:192.168.111.1 [start] script:drbdcontrol [start] fs:mysql_disk [start] script:mysql [start] script:asterisk Start of asterisk complete +++ Memory table dump +++ 0xb74d6d10 (32 bytes) allocation trace: 0x804bc04 --- End Memory table dump --- root at asterisktest01:/etc/cluster# rg_test noop cluster.conf stop service asterisk Running in test mode. Stopping asterisk... [stop] script:asterisk [stop] script:mysql [stop] fs:mysql_disk [stop] script:drbdcontrol [stop] ip:192.168.111.1 [stop] service:asterisk Stop of asterisk complete +++ Memory table dump +++ 0xb74c5d10 (32 bytes) allocation trace: 0x804bc04 --- End Memory table dump --- Arjuna Christensen?|?Systems Engineer? Maximum Internet Ltd 7a Parkhead Pl, Albany, North Shore, 0632 | PO Box 8006, Auckland, 1150, NZ DDI: + 64 9?913 9683 | Ph: +64 9 915 1825 | Fax:: +64 9 300 7227 arjuna.christensen at maxnet.co.nz| www.maxnet.co.nz ________________________________ Maxnet | mission critical internet ________________________________ This email (including any attachments) is confidential and intended only for the person to whom it is addressed. If you have received this email in error, please notify the sender immediately and erase all copies of this message and attachments. The views expressed in this email do not necessarily reflect those held by Maxnet. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Arjuna Christensen Sent: Friday, 28 March 2008 1:54 p.m. To: linux-cluster at redhat.com Subject: [Linux-cluster] RHCS / DRBD / MYSQL Hiyas, I'm having some slight issues getting RHCS to work with drbd/asterisk/mysql. Basically I've got a service configured which (as far as I know) should bring up an IP address, set itself to the DRBD primary, mount the MySQL DRBD partition and then proceed to start up mysql/asterisk. I can see it bringing the IP up, setting itself as the DRBD primary and even mounting the partition, yet it fails to bring up asterisk/mysql. Could anyone take a look at my clusterconf - see below. Any enlightenment would be much appreciated :) Regards, Arjuna Christensen?|?Systems Engineer? Maximum Internet Ltd 7a Parkhead Pl, Albany, North Shore, 0632 | PO Box 8006, Auckland, 1150, NZ DDI: + 64 9?913 9683 | Ph: +64 9 915 1825 | Fax:: +64 9 300 7227 arjuna.christensen at maxnet.co.nz| www.maxnet.co.nz ________________________________ Maxnet | mission critical internet ________________________________ This email (including any attachments) is confidential and intended only for the person to whom it is addressed. If you have received this email in error, please notify the sender immediately and erase all copies of this message and attachments. The views expressed in this email do not necessarily reflect those held by Maxnet. 
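For anyone reading this thread later: the start/stop sequence rg_test prints above falls directly out of how the resources are nested in cluster.conf, since rgmanager starts a parent before its children and stops them in the reverse order. A stripped-down sketch of that nesting, using the resource names and paths from the revised config earlier in the thread (it only illustrates the ordering, and is not a claim about why the child scripts end up being skipped):

    <service autostart="1" domain="asterisk" name="asteriskcluster">
      <ip address="192.168.111.1" monitor_link="1">
        <script file="/etc/init.d/drbdcontrol" name="drbdcontrol">
          <fs device="/dev/drbd0" fstype="ext3" mountpoint="/var/lib/mysql"
              name="mysqlfs" self_fence="1">
            <script file="/etc/init.d/mysql" name="mysql"/>
            <script file="/etc/init.d/asterisk" name="asterisk"/>
          </fs>
        </script>
      </ip>
    </service>

With this layout the expected order is ip, drbdcontrol, fs, mysql, asterisk on start and the reverse on stop, which matches the [start]/[stop] lines rg_test printed.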
--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From denisb+gmane at gmail.com Fri Mar 28 09:25:00 2008
From: denisb+gmane at gmail.com (denis)
Date: Fri, 28 Mar 2008 10:25:00 +0100
Subject: [Linux-cluster] Re: fence_manual missing /tmp/fence_manual.fifo
In-Reply-To: <47EBCEE8.7090905@artegence.com>
References: <47EBCEE8.7090905@artegence.com>
Message-ID: 

Maciej Bogucki wrote:
>> Are you certain you want to continue? [yN] y
>> can't open /tmp/fence_manual.fifo: No such file or directory
> Hello,
>
> Do You have the latest version of fence package?
> How does look like Your cluster.conf?

Hi,

I have installed the patched cman package from Lon H. to fix the broken
rgmanager communications bug :
https://bugzilla.redhat.com/show_bug.cgi?id=327721

Version     : 2.0.73                            Vendor: Red Hat, Inc.
Release     : 1.6.el5.test.bz327721             Build Date: Mon 26 Nov 2007 07:22:55 PM CET
Install Date: Thu 27 Mar 2008 03:28:20 PM CET   Build Host: hs20-bc1-6.build.redhat.com
Group       : System Environment/Base           Source RPM: cman-2.0.73-1.6.el5.test.bz327721.src.rpm
Size        : 1164641                           License: GPL
Signature   : (none)
Packager    : Red Hat, Inc.
URL         : http://sources.redhat.com/cluster/
Summary     : cman - The Cluster Manager

cluster.conf @ pastebin : http://pastebin.com/m5f8787a2

cluster.conf :

Apparently it didn't like the