From isplist at logicore.net Thu Mar 1 01:16:36 2007 From: isplist at logicore.net (isplist at logicore.net) Date: Wed, 28 Feb 2007 19:16:36 -0600 Subject: [Linux-cluster] RHEL Problem: Can't access LUNS Message-ID: <2007228191636.191879@leena> Is there anyone on this list who can make a suggestion to RH??? I've been asking about this for some time now and have no idea where to turn. I've been working with Qlogic on my LUN problem for a couple of weeks or so. I posted asking about this here but it seems that there aren't any answers here either. The problem is that I need to install RHEL on fibre channel storage volumes. I cannot find a way of doing this from the installer. Once the OS is installed I can see them but that's of little use. In order to see the volumes which I need to gain access to, I need to install RHEL and then do the following; Edit /etc/modprobe.conf options scsi_mod max_luns=256 dev_flags=INLINE:TF200:0x242 mkinitrd -f /boot/initrd-`uname -r`.img `uname -r` Reboot; Then all volumes on the storage device (32 of them) can be seen by the server. The problem is with the Red Hat scsi layer, not with the Qlogic driver. That is apparent by the fact that the devices can be seen by the HBA BIOS. Ultimately this needs to be answered by Red Hat - how to pass "max_luns=xxx" to the scsi layer during the OS installation. Or, if someone could explain how I might be able to build a custom install disk, that might work also. Perhaps I can pre-install the code needed so that anaconda can see all of the volumes? Mike From tmornini at engineyard.com Thu Mar 1 03:08:42 2007 From: tmornini at engineyard.com (Tom Mornini) Date: Wed, 28 Feb 2007 19:08:42 -0800 Subject: [Linux-cluster] cmirror status? Message-ID: <3E495EB1-CB02-4634-98BA-6608F9D1A21E@engineyard.com> Hello all. What is the status of cmirror package? I see that there have been code changes as late as this month. We're extraordinarily interested in using it! -- -- Tom Mornini, CTO -- Engine Yard, Ruby on Rails Hosting -- Reliability, Ease of Use, Scalability -- (866) 518-YARD (9273) From Alain.Moulle at bull.net Thu Mar 1 06:57:50 2007 From: Alain.Moulle at bull.net (Alain Moulle) Date: Thu, 01 Mar 2007 07:57:50 +0100 Subject: [Linux-cluster] Re: CS4 Update 4 / Oops in dlm module (Alain Moulle) Message-ID: <45E6796E.5030809@bull.net> Hi We test it : 1/ it seems that the services stuck in stopping state is fixed 2/ about DLM Oops, we have not reproduced it but it happens only once with former rpm version, so ... wait & see ... 3/ we have a problem just after the boot in clurgmgrd, don't know if it is due to this new rpm or not, but we never had this problem with former rpm version; syslog gives : clurgmgrd[7069]: Services Initialized clurgmgrd[7069]: #10: Couldn't set up listen socket and CS4 is stalled on the machine. Any idea ? Thanks a lot Alain >> Could you install the current rgmanager test RPM: >> >> http://people.redhat.com/lhh/rgmanager-1.9.54-3.228823test.i386.rpm >> >> ...and see if it goes away? The above RPM is the same as 1.9.54, but >> includes fix for an assertion failure, a way to fix services stuck in >> the stopping state, and (the important one for you) a fix for an >> intermittent DLM lock leak. >> >> ia64/x86_64/srpms here: http://people.redhat.com/lhh/packages.html >>(Lon Hohberger) -- mailto:Alain.Moulle at bull.net +------------------------------+--------------------------------+ | Alain Moull? 
| from France : 04 76 29 75 99 | | | FAX number : 04 76 29 72 49 | | Bull SA | | | 1, Rue de Provence | Adr : FREC B1-041 | | B.P. 208 | | | 38432 Echirolles - CEDEX | Email: Alain.Moulle at bull.net | | France | BCOM : 229 7599 | +-------------------------------+-------------------------------+ From jbrassow at redhat.com Thu Mar 1 21:19:29 2007 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Thu, 1 Mar 2007 15:19:29 -0600 Subject: [Linux-cluster] cmirror status? In-Reply-To: <3E495EB1-CB02-4634-98BA-6608F9D1A21E@engineyard.com> References: <3E495EB1-CB02-4634-98BA-6608F9D1A21E@engineyard.com> Message-ID: <7b40c10fa6df9aa866b93ca230ddf733@redhat.com> There are a handful of bugs that need to be cleaned up for cmirror before RHEL4.5. The speed at which that happens will depend on their repeatability. The version which will eventually go upstream is still in an initial phase. brassow On Feb 28, 2007, at 9:08 PM, Tom Mornini wrote: > Hello all. > > What is the status of cmirror package? > > I see that there have been code changes as late as this month. > > We're extraordinarily interested in using it! > > -- > -- Tom Mornini, CTO > -- Engine Yard, Ruby on Rails Hosting > -- Reliability, Ease of Use, Scalability > -- (866) 518-YARD (9273) > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From tmornini at engineyard.com Thu Mar 1 21:40:42 2007 From: tmornini at engineyard.com (Tom Mornini) Date: Thu, 1 Mar 2007 13:40:42 -0800 Subject: [Linux-cluster] cmirror status? In-Reply-To: <7b40c10fa6df9aa866b93ca230ddf733@redhat.com> References: <3E495EB1-CB02-4634-98BA-6608F9D1A21E@engineyard.com> <7b40c10fa6df9aa866b93ca230ddf733@redhat.com> Message-ID: <26F144F3-F48E-49A0-AE66-97D18141EF80@engineyard.com> Thanks! -- -- Tom Mornini, CTO -- Engine Yard, Ruby on Rails Hosting -- Reliability, Ease of Use, Scalability -- (866) 518-YARD (9273) On Mar 1, 2007, at 1:19 PM, Jonathan E Brassow wrote: > There are a handful of bugs that need to be cleaned up for cmirror > before RHEL4.5. The speed at which that happens will depend on > their repeatability. > > The version which will eventually go upstream is still in an > initial phase. > > brassow > > On Feb 28, 2007, at 9:08 PM, Tom Mornini wrote: > >> Hello all. >> >> What is the status of cmirror package? >> >> I see that there have been code changes as late as this month. >> >> We're extraordinarily interested in using it! >> >> -- >> -- Tom Mornini, CTO >> -- Engine Yard, Ruby on Rails Hosting >> -- Reliability, Ease of Use, Scalability >> -- (866) 518-YARD (9273) >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From orkcu at yahoo.com Thu Mar 1 23:55:14 2007 From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=) Date: Thu, 1 Mar 2007 15:55:14 -0800 (PST) Subject: [Linux-cluster] RHEL Problem: Can't access LUNS In-Reply-To: <2007228191636.191879@leena> Message-ID: <732955.27236.qm@web50607.mail.yahoo.com> --- "isplist at logicore.net" wrote: > Is there anyone on this list who can make a > suggestion to RH??? I've been > asking about this for some time now and have no idea > where to turn. > > I've been working with Qlogic on my LUN problem for > a couple of weeks or so. I > posted asking about this here but it seems that > there aren't any answers here > either. 
> > The problem is that I need to install RHEL on fibre > channel storage volumes. I > cannot find a way of doing this from the installer. > Once the OS is installed I > can see them but that's of little use. > > In order to see the volumes which I need to gain > access to, I need to install > RHEL and then do the following; > > Edit /etc/modprobe.conf > options scsi_mod max_luns=256 > dev_flags=INLINE:TF200:0x242 > mkinitrd -f /boot/initrd-`uname -r`.img `uname -r` > Reboot; > I insist that you can get the expected result _if_ you type _exactly_ the _right_ line in the boot promp line and according to this document: http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf what you have to writte is a litle modification of what you type in the modprobe.conf file maybe something like: scsi_mod.max_luns=256 scsi_mod.scsi_dev_flags=INLINE:TF200:0x242 man, you have to test it, try and error approach maybe cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) ____________________________________________________________________________________ TV dinner still cooling? Check out "Tonight's Picks" on Yahoo! TV. http://tv.yahoo.com/ From isplist at logicore.net Fri Mar 2 00:05:47 2007 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 1 Mar 2007 18:05:47 -0600 Subject: [Linux-cluster] RHEL Problem: Can't access LUNS In-Reply-To: <732955.27236.qm@web50607.mail.yahoo.com> Message-ID: <20073118547.393207@leena> I'm not sure what you're telling me here. Since this isn't something I knew how to do, I've been asking for help. The things I know how to do now, which still don't work, are all things which have been suggested by people who are trying to help. From all of this, the one thing I've learned is that it does not matter what I try to enter at the install command line, the problem is the redhat installer, there is no way of passing this information. I need to modify the RHEL4 install CD's initrd.img in order for anaconda to see this at install time... and, I don't know how to do this but am reading up on it. > http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf Thanks, I'll check this out. > what you have to writte is a litle modification of > what you type in the modprobe.conf file > maybe something like: > scsi_mod.max_luns=256 > scsi_mod.scsi_dev_flags=INLINE:TF200:0x242 I don't have access to modprobe.conf at install time, that's the point of my problem. I can modify it AFTER install and see all of the storage but I need to see the storage AT install time so that I can install TO the storage. > man, you have to test it, try and error approach maybe I've been at trial and error for a couple of weeks now. Mike From isplist at logicore.net Fri Mar 2 03:05:12 2007 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 1 Mar 2007 21:05:12 -0600 Subject: [Linux-cluster] RHEL Problem: Can't access LUNS In-Reply-To: Message-ID: <20073121512.491314@leena> Interesting! The option is scsi_dev_flags, no dev_flags as I've always seen it. Yet another thing to try before moving on :). On Thu, 1 Mar 2007 22:10:40 -0300, Filipe Miranda wrote: > Mike, > > I got this from the URL: http://www.redhat.com/docs/manuals/enterprise/RHEL- > 4-Manual/ref-guide/ch-modules.html > > "During installation, Red Hat Enterprise Linux uses a limited subset of > device drivers to create a stable installation environment. 
Although the > installation program supports installation on many different types of > hardware, some drivers (including those for SCSI adapters and network > adapters) are not included in the installation kernel." > > Which means that if the installation kernel that you are using does have > the SCSI modules, you should just load the parameters during the process of > loading anaconda. > > When the RHEL loads, the first phase of the installation, you should use > the SCSI parameters there! > If nothing happens, this means that the SCSI drivers that you need are not > in the installation kernel. > > I recommend using the latest release of the version of the RHEL you are > trying to install. > > Also check page number 106 (20-30) of the pdf that Roger posted, on SCSI > options. > > I hope these guidelines will help you solve the problem. > > Regards, > > Filipe Miranda > > On 3/1/07, isplist at logicore.net < isplist at logicore.net> wrote:> I'm not > sure what you're telling me here. > >> Since this isn't something I knew how to do, I've been asking for help. >> The >> things I know how to do now, which still don't work, are all things which >> have >> been suggested by people who are trying to help. >> >> From all of this, the one thing I've learned is that it does not matter >> what I >> try to enter at the install command line, the problem is the redhat >> installer, >> there is no way of passing this information. >> >> I need to modify the RHEL4 install CD's initrd.img in order for anaconda >> to >> see this at install time... and, I don't know how to do this but am >> reading up >> on it. >> >>> http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf >>> >> Thanks, I'll check this out. >> >>> what you have to writte is a litle modification of >>> what you type in the modprobe.conf file >>> maybe something like: >>> scsi_mod.max_luns=256 >>> scsi_mod.scsi_dev_flags=INLINE:TF200:0x242 >>> >> I don't have access to modprobe.conf at install time, that's the point of >> my >> problem. I can modify it AFTER install and see all of the storage but I >> need >> to see the storage AT install time so that I can install TO the storage. >> >>> man, you have to test it, try and error approach maybe >>> >> I've been at trial and error for a couple of weeks now. >> >> Mike >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster From isplist at logicore.net Fri Mar 2 02:59:57 2007 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 1 Mar 2007 20:59:57 -0600 Subject: [Linux-cluster] RHEL Problem: Can't access LUNS In-Reply-To: Message-ID: <200731205957.022213@leena> I have tried all of the options I can find or have been given during installation. So at this stage, I feel that I should be looking at building a custom initrd.img file in order to get this problem resolved. Mike On Thu, 1 Mar 2007 22:10:40 -0300, Filipe Miranda wrote: > Mike, > > I got this from the URL: http://www.redhat.com/docs/manuals/enterprise/RHEL- > 4-Manual/ref-guide/ch-modules.html > > "During installation, Red Hat Enterprise Linux uses a limited subset of > device drivers to create a stable installation environment. Although the > installation program supports installation on many different types of > hardware, some drivers (including those for SCSI adapters and network > adapters) are not included in the installation kernel." 
> > Which means that if the installation kernel that you are using does have > the SCSI modules, you should just load the parameters during the process of > loading anaconda. > > When the RHEL loads, the first phase of the installation, you should use > the SCSI parameters there! > If nothing happens, this means that the SCSI drivers that you need are not > in the installation kernel. > > I recommend using the latest release of the version of the RHEL you are > trying to install. > > Also check page number 106 (20-30) of the pdf that Roger posted, on SCSI > options. > > I hope these guidelines will help you solve the problem. > > Regards, > > Filipe Miranda > > On 3/1/07, isplist at logicore.net < isplist at logicore.net> wrote:> I'm not > sure what you're telling me here. > >> Since this isn't something I knew how to do, I've been asking for help. >> The >> things I know how to do now, which still don't work, are all things which >> have >> been suggested by people who are trying to help. >> >> From all of this, the one thing I've learned is that it does not matter >> what I >> try to enter at the install command line, the problem is the redhat >> installer, >> there is no way of passing this information. >> >> I need to modify the RHEL4 install CD's initrd.img in order for anaconda >> to >> see this at install time... and, I don't know how to do this but am >> reading up >> on it. >> >>> http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf >>> >> Thanks, I'll check this out. >> >>> what you have to writte is a litle modification of >>> what you type in the modprobe.conf file >>> maybe something like: >>> scsi_mod.max_luns=256 >>> scsi_mod.scsi_dev_flags=INLINE:TF200:0x242 >>> >> I don't have access to modprobe.conf at install time, that's the point of >> my >> problem. I can modify it AFTER install and see all of the storage but I >> need >> to see the storage AT install time so that I can install TO the storage. >> >>> man, you have to test it, try and error approach maybe >>> >> I've been at trial and error for a couple of weeks now. >> >> Mike >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster From isplist at logicore.net Fri Mar 2 03:23:51 2007 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 1 Mar 2007 21:23:51 -0600 Subject: [Linux-cluster] RHEL Problem: Can't access LUNS In-Reply-To: <732955.27236.qm@web50607.mail.yahoo.com> Message-ID: <200731212351.418306@leena> > I insist that you can get the expected result _if_ you > type _exactly_ the _right_ line in the boot promp line > and according to this document: Didn't work. I entered it just as it is here; scsi_mod.max_luns=256 scsi_mod.scsi_dev_flags=INLINE:TF200:0x242 Mike From filipe.miranda at gmail.com Fri Mar 2 01:10:40 2007 From: filipe.miranda at gmail.com (Filipe Miranda) Date: Thu, 1 Mar 2007 22:10:40 -0300 Subject: [Linux-cluster] RHEL Problem: Can't access LUNS In-Reply-To: <20073118547.393207@leena> References: <732955.27236.qm@web50607.mail.yahoo.com> <20073118547.393207@leena> Message-ID: Mike, I got this from the URL: http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/ref-guide/ch-modules.html "During installation, Red Hat Enterprise Linux uses a limited subset of device drivers to create a stable installation environment. 
Although the installation program supports installation on many different types of hardware, some drivers (including those for SCSI adapters and network adapters) are not included in the installation kernel." Which means that if the installation kernel that you are using does have the SCSI modules, you should just load the parameters during the process of loading anaconda. When the RHEL loads, the first phase of the installation, you should use the SCSI parameters there! If nothing happens, this means that the SCSI drivers that you need are not in the installation kernel. I recommend using the latest release of the version of the RHEL you are trying to install. Also check page number 106 (20-30) of the pdf that Roger posted, on SCSI options. I hope these guidelines will help you solve the problem. Regards, Filipe Miranda On 3/1/07, isplist at logicore.net wrote: > > I'm not sure what you're telling me here. > > Since this isn't something I knew how to do, I've been asking for help. > The > things I know how to do now, which still don't work, are all things which > have > been suggested by people who are trying to help. > > From all of this, the one thing I've learned is that it does not matter > what I > try to enter at the install command line, the problem is the redhat > installer, > there is no way of passing this information. > > I need to modify the RHEL4 install CD's initrd.img in order for anaconda > to > see this at install time... and, I don't know how to do this but am > reading up > on it. > > > > http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf > > Thanks, I'll check this out. > > > what you have to writte is a litle modification of > > what you type in the modprobe.conf file > > maybe something like: > > scsi_mod.max_luns=256 > > scsi_mod.scsi_dev_flags=INLINE:TF200:0x242 > > I don't have access to modprobe.conf at install time, that's the point of > my > problem. I can modify it AFTER install and see all of the storage but I > need > to see the storage AT install time so that I can install TO the storage. > > > man, you have to test it, try and error approach maybe > > I've been at trial and error for a couple of weeks now. > > Mike > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From isplist at logicore.net Fri Mar 2 03:07:51 2007 From: isplist at logicore.net (isplist at logicore.net) Date: Thu, 1 Mar 2007 21:07:51 -0600 Subject: [Linux-cluster] RHEL Problem: Can't access LUNS In-Reply-To: <732955.27236.qm@web50607.mail.yahoo.com> Message-ID: <20073121751.180050@leena> I see now Roger... I had the wrong option wording. I never saw it this way so will give it a try and post my findings, thanks. On Thu, 1 Mar 2007 15:55:14 -0800 (PST), Roger Pe?a wrote: > > > --- "isplist at logicore.net" > wrote: > >> Is there anyone on this list who can make a >> suggestion to RH??? I've been >> asking about this for some time now and have no idea >> where to turn. >> >> I've been working with Qlogic on my LUN problem for >> a couple of weeks or so. I >> posted asking about this here but it seems that >> there aren't any answers here >> either. >> >> The problem is that I need to install RHEL on fibre >> channel storage volumes. I >> cannot find a way of doing this from the installer. >> Once the OS is installed I >> can see them but that's of little use. 
>> >> In order to see the volumes which I need to gain >> access to, I need to install >> RHEL and then do the following; >> >> Edit /etc/modprobe.conf >> options scsi_mod max_luns=256 >> dev_flags=INLINE:TF200:0x242 >> mkinitrd -f /boot/initrd-`uname -r`.img `uname -r` >> Reboot; >> > > I insist that you can get the expected result _if_ you > type _exactly_ the _right_ line in the boot promp line > and according to this document: > > http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf > > what you have to writte is a litle modification of > what you type in the modprobe.conf file > maybe something like: > scsi_mod.max_luns=256 > scsi_mod.scsi_dev_flags=INLINE:TF200:0x242 > > > man, you have to test it, try and error approach maybe > > > cu > roger > > __________________________________________ > RedHat Certified Engineer ( RHCE ) > Cisco Certified Network Associate ( CCNA ) > > > ___________________________________________________________________ ___________ > ______ > TV dinner still cooling? > Check out "Tonight's Picks" on Yahoo! TV. > http://tv.yahoo.com/ From jose.dr.g at gmail.com Fri Mar 2 05:01:50 2007 From: jose.dr.g at gmail.com (Jose Guevarra) Date: Thu, 1 Mar 2007 21:01:50 -0800 Subject: [Linux-cluster] can't start GFS on Fedora Message-ID: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com> Hi, I'm trying to get GFS installed and running on FC4 on a Poweredge 4400 dual processor. Take note that I'm a total newbie at GFS. I've installed GFS-6.1 GFS-kernel-smp lvm2-cluster ccs cman-kernel-smp magma fence gnbd-kernel-smp gnbd I have a volume group that I want to mount w/ GFS /dev/mapper/VolGroup00-LogVol02 I was able to create a GFS file system w/ this command... # gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6 /dev/mapper/VolGroup00-LogVol02 Now. when I try to start ccsd it fails. so none of the other daemons start either. /var/log/messages doesn't say anything about the start failure. How can I troubleshoot this more? What are the required daemons that need to start? -------------- next part -------------- An HTML attachment was scrubbed... URL: From orkcu at yahoo.com Fri Mar 2 13:34:18 2007 From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=) Date: Fri, 2 Mar 2007 05:34:18 -0800 (PST) Subject: [Linux-cluster] RHEL Problem: Can't access LUNS In-Reply-To: <20073121751.180050@leena> Message-ID: <378534.10487.qm@web50607.mail.yahoo.com> --- "isplist at logicore.net" wrote: > I see now Roger... I had the wrong option wording. I > never saw it this way so > will give it a try and post my findings, thanks. that was exactly what I like to point if you mistake anything, the kernel just drop what you type, so you have to be very carefull about what to write. there is also the way to give more than one option to the same kernel module, usually the examples just show howto give only one option to the modules but in your case you have to pass _two_ options to the same kernel module, how do you do that? I guess I give you the way it should be done, but I am not sure at all :-( in past times I recall that you can pass several options with a ',' separator but I couldn't fine if it is the same with kernels 2.6.x .... scsi_mod.max_luns=256 scsi_mod.scsi_dev_flags=INLINE:TF200:0x242 see the '.' between the module name and its option? 
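for example, the whole thing at the installer "boot:" prompt would go
on a single line, something like this (just a sketch, I have not tested
it on your hardware):

    linux scsi_mod.max_luns=256 scsi_mod.scsi_dev_flags=INLINE:TF200:0x242

then from the second console (Alt-F2) you can do 'cat /proc/cmdline' to
check that the options really made it onto the kernel command line.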
maybe, and just maybe, it make the trick there is also the posibility that, because the scsi_mod maybe be loaded as a module (not be an in-line module), that any of the options you pass at boot time has any use, so, why not to switch to the console F2 and unload scsi_mod and then loaded with the right options? if that do not works, then, my next try would be to install from network (PXE boot) so I can create-modify whatever file I like and add whatever option I like to the kernel at boot time without recreate the installer CD > > > On Thu, 1 Mar 2007 15:55:14 -0800 (PST), Roger Pe?a > wrote: > > > > > > --- "isplist at logicore.net" > > wrote: > > > >> Is there anyone on this list who can make a > >> suggestion to RH??? I've been > >> asking about this for some time now and have no > idea > >> where to turn. > >> > >> I've been working with Qlogic on my LUN problem > for > >> a couple of weeks or so. I > >> posted asking about this here but it seems that > >> there aren't any answers here > >> either. > >> In order to see the volumes which I need to gain > >> access to, I need to install > >> RHEL and then do the following; > >> > >> Edit /etc/modprobe.conf > >> options scsi_mod max_luns=256 > >> dev_flags=INLINE:TF200:0x242 > >> mkinitrd -f /boot/initrd-`uname -r`.img `uname > -r` > >> Reboot; > > maybe something like: > > scsi_mod.max_luns=256 > > scsi_mod.scsi_dev_flags=INLINE:TF200:0x242 cu roger __________________________________________ RedHat Certified Engineer ( RHCE ) Cisco Certified Network Associate ( CCNA ) ____________________________________________________________________________________ Have a burning question? Go to www.Answers.yahoo.com and get answers from real people who know. From isplist at logicore.net Fri Mar 2 14:47:51 2007 From: isplist at logicore.net (isplist at logicore.net) Date: Fri, 2 Mar 2007 08:47:51 -0600 Subject: [Linux-cluster] RHEL Problem: Can't access LUNS In-Reply-To: <378534.10487.qm@web50607.mail.yahoo.com> Message-ID: <20073284751.501337@leena> > if that do not works, then, my next try would be to > install from network (PXE boot) so I can create-modify > whatever file I like and add whatever option I like to > the kernel at boot time without recreate the installer > CD You're right, this would be the best way. Plus, I can keep various versions handy too. This is exactly how I'd like to do this but I've not learned how to modify the files I need in this case yet. Mike From rpeterso at redhat.com Fri Mar 2 16:33:02 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Fri, 02 Mar 2007 10:33:02 -0600 Subject: [Linux-cluster] can't start GFS on Fedora In-Reply-To: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com> References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com> Message-ID: <45E851BE.4090300@redhat.com> Jose Guevarra wrote: > I have a volume group that I want to mount w/ GFS > /dev/mapper/VolGroup00-LogVol02 > > I was able to create a GFS file system w/ this command... > > # gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6 /dev/mapper/VolGroup00-LogVol02 > > Now. when I try to start ccsd it fails. so none of the other daemons > start > either. /var/log/messages doesn't say anything about the start failure. > > How can I troubleshoot this more? What are the required daemons that > need to > start? Hi Jose, I have a couple of suggestions. First of all, you need to determine if you're planning to use GFS in a cluster (i.e. on shared storage like a SAN) or stand-alone (and share that with us if you want help.) 
Your use of "lock_dlm" and a cluster name makes it sound like you want it in a cluster, but the VolGroup00-LogVol02 makes it sound like a local hard disk and not any kind of shared storage. If you're using it stand-alone, you don't need ccsd, since ccsd is part of the cluster infrastructure. If stand-alone, you also would want to use lock_nolock rather than lock_dlm. Now about ccsd: If you're using the cluster code shipped with FC6, that's the "new" infrastructure code. With the "new" stuff, you don't need to start ccsd with a separate script like in RHEL4. Everything should be handled by doing: "service cman start." The ccsd daemon is started by the init script. I apologize if you already knew this. It's just that I couldn't tell how you were starting ccsd. You say that ccsd fails, but you didn't say much about how it fails or what error message it gives you. I guess the bottom line is that you didn't give us enough information to help you. Also, if this storage is shared in the cluster, you need to do "service clvmd start" as well, and you may want to change locking_type = 3 in your /etc/lvm/lvm.conf before starting clvmd. If you're using it on shared storage in a cluster, you should probably post your cluster.conf file, which might tell us why ccsd is having issues. Also, the gfs_mkfs command is typically used on the logical volume, not on the /dev/mapper device. So something like: # gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6 /dev/VolGroup00/LogVol02 I hope this helps. Regards, Bob Peterson Red Hat Cluster Suite From jose.dr.g at gmail.com Fri Mar 2 19:36:23 2007 From: jose.dr.g at gmail.com (Jose Guevarra) Date: Fri, 2 Mar 2007 11:36:23 -0800 Subject: [Linux-cluster] can't start GFS on Fedora In-Reply-To: <45E851BE.4090300@redhat.com> References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com> <45E851BE.4090300@redhat.com> Message-ID: <3837d8af0703021136t54fb61bch2cf6f87a3d7ca6c4@mail.gmail.com> yes, I'm trying to get a test HPC cluster going with GFS to be used as a SAN and shared among several nodes. I'm currently using Fedora Core 4 which was the first version to come with GFS. As you say that there is now a "new" infrast. would you recommend that I simply upgrade to Fedora Core 6? In terms of CCSD, 'service ccsd start' simply returns [Failed]. the logs show .. Mar 2 11:28:09 IQCD1 ccsd[8651]: Starting ccsd 1.0.0: Mar 2 11:28:09 IQCD1 ccsd[8651]: Built: Jun 16 2005 10:45:39 Mar 2 11:28:09 IQCD1 ccsd[8651]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. That's it. I've now discovered that cluster.conf is nowhere to be found on my system. The probably explains CCSD failing. ccs-1.0 is installed. What package installed a default cluster.conf file? Thanks. On 3/2/07, Robert Peterson wrote: > > Jose Guevarra wrote: > > I have a volume group that I want to mount w/ GFS > > /dev/mapper/VolGroup00-LogVol02 > > > > I was able to create a GFS file system w/ this command... > > > > # gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6 > /dev/mapper/VolGroup00-LogVol02 > > > > Now. when I try to start ccsd it fails. so none of the other daemons > > start > > either. /var/log/messages doesn't say anything about the start failure. > > > > How can I troubleshoot this more? What are the required daemons that > > need to > > start? > Hi Jose, > > I have a couple of suggestions. First of all, you need to determine > if you're planning to use GFS in a cluster (i.e. on shared storage like > a SAN) or > stand-alone (and share that with us if you want help.) 
> > Your use of "lock_dlm" and a cluster name makes it sound like you want > it in a cluster, but the VolGroup00-LogVol02 makes it sound like a local > hard disk and not any kind of shared storage. > > If you're using it stand-alone, you don't need ccsd, since ccsd is part > of the > cluster infrastructure. If stand-alone, you also would want to use > lock_nolock > rather than lock_dlm. > > Now about ccsd: If you're using the cluster code shipped with FC6, > that's the "new" infrastructure code. With the "new" stuff, you don't > need to start ccsd with a separate script like in RHEL4. Everything > should > be handled by doing: "service cman start." The ccsd daemon is started > by the init script. I apologize if you already knew this. It's just that > I couldn't tell how you were starting ccsd. > > You say that ccsd fails, but you didn't say much about how it fails or > what error message it gives you. > > I guess the bottom line is that you didn't give us enough information to > help you. > > Also, if this storage is shared in the cluster, you need to do > "service clvmd start" as well, and you may want to > change locking_type = 3 in your /etc/lvm/lvm.conf before starting clvmd. > > If you're using it on shared storage in a cluster, you should probably > post your cluster.conf file, which might tell us why ccsd is having > issues. > > Also, the gfs_mkfs command is typically used on the logical volume, not on > the /dev/mapper device. So something like: > > # gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6 /dev/VolGroup00/LogVol02 > > I hope this helps. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Sat Mar 3 00:28:01 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Fri, 02 Mar 2007 18:28:01 -0600 Subject: [Linux-cluster] can't start GFS on Fedora In-Reply-To: <3837d8af0703021136t54fb61bch2cf6f87a3d7ca6c4@mail.gmail.com> References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com> <45E851BE.4090300@redhat.com> <3837d8af0703021136t54fb61bch2cf6f87a3d7ca6c4@mail.gmail.com> Message-ID: <45E8C111.2060801@redhat.com> Jose Guevarra wrote: > yes, I'm trying to get a test HPC cluster going with GFS to be used as a > SAN and shared among several nodes. > > I'm currently using Fedora Core 4 which was the first version to come > with > GFS. As you say that there is now > a "new" infrast. would you recommend that I simply upgrade to Fedora > Core > 6? > > In terms of CCSD, 'service ccsd start' simply returns [Failed]. the logs > show .. > > Mar 2 11:28:09 IQCD1 ccsd[8651]: Starting ccsd 1.0.0: > Mar 2 11:28:09 IQCD1 ccsd[8651]: Built: Jun 16 2005 10:45:39 > Mar 2 11:28:09 IQCD1 ccsd[8651]: Copyright (C) Red Hat, Inc. 2004 All > rights reserved. > > That's it. I've now discovered that cluster.conf is nowhere to be > found on > my system. The probably > explains CCSD failing. ccs-1.0 is installed. What package installed a > default cluster.conf file? > > Thanks. Hi Jose, Well, I can't tell you to go to FC6, but I can tell you this much: I like FC6 a lot better than FC4, plus all the cluster code has had a lot of bug fixes and improvements. The current CVS development tree is geared toward newer (upstream) kernels, so FC6 will get you closer if you want to build it from source. There is no default cluster.conf file. 
The cluster.conf file defines your cluster: what computers ("nodes") are in your cluster, what fencing device(s) you are using, and the services you want for High Availability. There's no way any of that can be determined by default. That's determined by the boxes in your network. There is, however, a couple of GUIs that may make your life easier. The first one is called Conga, and it's web based. The second one is called system-config-cluster, but it's not as user friendly as Conga. I don't think they'll work on FC4 though. If you haven't already done so, you might want to check out my "NFS/GFS Cookbook" that will walk you through the process of setting up and configuring a cluster, although that's geared more toward RHEL4 (not the new infrastructure). http://sources.redhat.com/cluster/doc/nfscookbook.pdf I recently posted a link to a quick install guide to getting the STABLE cvs branch working on an upstream kernel too. The STABLE branch is much like the RHEL4, in that it uses the old infrastructure. That link is: https://rpeterso.108.redhat.com/files/documents/98/247/STABLE.txt The advantage of doing this is that more people on this list are familiar with that infrastructure and can therefore answer questions. It should work for FC6. Hopefully this gets you going. Learning how to set up and manage a cluster can be a frustrating and confusing learning experience. At least it was for me! But once you get going you'll be alright. You may have a lot of questions, and perhaps the cluster FAQ can help with some of those: http://sources.redhat.com/cluster/faq.html Otherwise, the people on this list are pretty friendly and helpful. Regards, Bob Peterson Red Hat Cluster Suite From shailesh at verismonetworks.com Mon Mar 5 05:33:13 2007 From: shailesh at verismonetworks.com (Shailesh) Date: Mon, 05 Mar 2007 11:03:13 +0530 Subject: [Linux-cluster] Clustering questions Message-ID: <1173072793.15762.37.camel@shailesh> Hi All, I am designing a low cost storage for file serving, which will contain servers with directly attached storage (NO common storage). The requirement here is that all the servers nodes should be able to access EACH OTHERS directly attached storage. I have some of questions, your answers will be helpful? - Is a file system like GFS useful in this scenario? If not which would be optimum based on performance? - Can I use a ToE card on each server node to for the storage access? This is for both getting access to other servers storage and for giving access to other servers for it's own storage. - How safe is it to put all the storage (of individual servers) into a single volume group? And then make logical volume group My Intentions is to just add low-cost PC to this network and provide both strorage and load-handling scalability. Thanks & Regards Shailesh From lhh at redhat.com Mon Mar 5 16:11:57 2007 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 05 Mar 2007 11:11:57 -0500 Subject: [Linux-cluster] Power Switch recommendation In-Reply-To: <453D02254A9EBC45866DBF28FECEA46FBADBED@ILEX01.corp.diligent.com> References: <453D02254A9EBC45866DBF28FECEA46FBADBED@ILEX01.corp.diligent.com> Message-ID: <1173111118.17223.6.camel@asuka.boston.devel.redhat.com> On Sat, 2007-02-24 at 10:23 +0200, Krikler, Samuel wrote: > Hi, > > > > I want to purchase a power switch to be used as fence device. > > Since I don?t any experience with this kind of devices: > > Can someone recommend me a specific power switch model of those > supported by GFS2? Much the same as GFS1. 
Any WTI IPS800, IPS1600 or NBB series: http://www.wti.com Most APC 79xx series: http://www.apcc.com Older WTI models (e.g. NPS-115 or NPS-230) and older APC models (9225 +AP9606 web/snmp card comes to mind) are often available on eBay, but the manufacturer may have stopped supporting the units. As such, I'd recommend one of the newer ones for production. Currently, we don't have much else in the way of supported remote power controller vendors. Black Box seems to sell re-branded WTI devices, so it might be possible to tweak the fence_wti agent to work with those. -- Lon From lhh at redhat.com Mon Mar 5 16:26:35 2007 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 05 Mar 2007 11:26:35 -0500 Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer In-Reply-To: <1638.1172372671@sss.pgh.pa.us> References: <1638.1172372671@sss.pgh.pa.us> Message-ID: <1173111995.17223.18.camel@asuka.boston.devel.redhat.com> On Sat, 2007-02-24 at 22:04 -0500, Tom Lane wrote: > Can someone help out this questioner? I know zip about Cluster. > I looked at the FAQ for a bit and thought that what he wants is > probably doable, but I couldn't tell if it would be easy or > painful to do load-balancing in this particular way. (And I'm not > qualified to say if what he wants is a sensible approach, either.) The short answer is "yes, sort of". With all the data on separate places on the SAN, you can certainly spawn as many instances of MySQL as you want, and have them fail over. There, however, is currently no way to make linux-cluster figure out where to place new instances of MySQL based on the number of instances of MySQL which are currently running. Now, you can set sort of an affinity for specific nodes, and manually have the instances of MySQL set up, say, like this: node 1 -> runs instances 1 and 5 node 2 -> runs instances 2 and 6 node 3 -> runs instances 3 and 7 node 4 -> runs instances 4 and 8 You can make it decide to split the load, for example, set the preferred list for instance 1 to: {1 2 3} While setting instance 5 to: {1 4 2} If node 1 fails, instance 1 will start on node 2, and instance 5 will start on node 4. With enough thought, you probably could get it so that the instances will be equally distributed regardless of the failure model. Something like this for 4 nodes + 8 instances (did not check for correctness): Inst. Node list (e.g. ordered/unrestricted failover domain) 1 {1 2 3 4} 2 {2 3 4 1} 3 {3 4 1 2} 4 {4 1 2 3} 5 {1 4 3 2} 6 {2 1 4 3} 7 {3 2 1 4} 8 {4 3 2 1} -- Lon From lhh at redhat.com Mon Mar 5 16:29:48 2007 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 05 Mar 2007 11:29:48 -0500 Subject: [Linux-cluster] Question about Cluster Service In-Reply-To: <767095.27017.qm@web31809.mail.mud.yahoo.com> References: <767095.27017.qm@web31809.mail.mud.yahoo.com> Message-ID: <1173112189.17223.21.camel@asuka.boston.devel.redhat.com> On Sun, 2007-02-25 at 05:10 -0800, sara sodagar wrote: > Hi > I would be grateful if anyone could tell me if this > solution works or not? It looks like it will work fine. > As I have only 1 passive server , I should create 2 > fail over domain . > > Node A ,C (cluster service 1) > Node B , C (cluster service 2) > Node c : (Failover domain 1 : service 1, failover > domain2: service 2) > Each Cluster service comprises : ip address resource , > web serviver init script,file > system resource (gfs) Yup, you certainly can do that. 
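As a rough sketch (the node and domain names below are made up, adjust
them to match your cluster.conf), the two failover domains could look
something like this:

    <failoverdomains>
      <failoverdomain name="fd_svc1" ordered="1" restricted="1">
        <failoverdomainnode name="nodeA" priority="1"/>
        <failoverdomainnode name="nodeC" priority="2"/>
      </failoverdomain>
      <failoverdomain name="fd_svc2" ordered="1" restricted="1">
        <failoverdomainnode name="nodeB" priority="1"/>
        <failoverdomainnode name="nodeC" priority="2"/>
      </failoverdomain>
    </failoverdomains>

...then point cluster service 1 at fd_svc1 and service 2 at fd_svc2 via
the domain attribute of each <service>, so node C only picks a service
up when its primary node fails.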
> Also I would like to know what are the advantages of > using gfs in this solution over > other types of files systems (like ext3) , as there > are no 2 active servers writing on the same area at > the > same time. Note that multiple "readers" from a single EXT3 file system will not be reliable, either - so if you intend to mount on multiple servers (at all - not just read-only), then you should use GFS. If you are not trying to do the above, then the only practical advantage GFS gives you is the potential for slightly faster recovery (due to the fs already being mounted). For most people in this case, ext3 is fine. -- Lon From lhh at redhat.com Mon Mar 5 16:31:10 2007 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 05 Mar 2007 11:31:10 -0500 Subject: [Linux-cluster] clurgmgrd[6147]: Starving for lock usrm::rg="SDA database" In-Reply-To: <49900.193.133.138.40.1172572976.squirrel@lapthorn.biz> References: <49900.193.133.138.40.1172572976.squirrel@lapthorn.biz> Message-ID: <1173112270.17223.24.camel@asuka.boston.devel.redhat.com> On Tue, 2007-02-27 at 10:42 +0000, James Lapthorn wrote: > Hi Guys, > > I have a 4 node cluster running RH Cluster Suite 4. I have just added a > DB2 service to one of the nodes and have starting gettingerrors relating > to locks ion the system log. I plan to restart this node at luch time > today to see if this fixes the problem. > > Is there anyone who can explain what these errors relate to so that I can > understand the problem better. I have checked RHN, Cluster Project > website and Google and I cant find anything? > > Its worth mentioning that the service is running fine. cat /proc/slabinfo | grep dlm If you see a big number, try the rgmanager packages from here - they should fix it: http://people.redhat.com/lhh/packages.html -- Lon From lhh at redhat.com Mon Mar 5 16:32:21 2007 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 05 Mar 2007 11:32:21 -0500 Subject: [Linux-cluster] Typo in Makefile In-Reply-To: <45E58780.8060901@seanodes.com> References: <45E58780.8060901@seanodes.com> Message-ID: <1173112341.17223.26.camel@asuka.boston.devel.redhat.com> On Wed, 2007-02-28 at 14:45 +0100, Erwan Velu wrote: > I found many line like this one in the Makefile or rgmanager. > > rgmanager/src/utils/Makefile: $(CC) -o $@ $^ $(INLUDE) $(CFLAGS) $(LDFLAGS) > > Looks like INLUDE is a typo ;) Yes, it does. -- Lon From lhh at redhat.com Mon Mar 5 16:39:46 2007 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 05 Mar 2007 11:39:46 -0500 Subject: [Linux-cluster] can't start GFS on Fedora In-Reply-To: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com> References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com> Message-ID: <1173112787.17223.34.camel@asuka.boston.devel.redhat.com> On Thu, 2007-03-01 at 21:01 -0800, Jose Guevarra wrote: > Hi, > > I'm trying to get GFS installed and running on FC4 on a Poweredge 4400 > dual processor. Take note that I'm a total newbie at GFS. > > I've installed > > GFS-6.1 > GFS-kernel-smp > lvm2-cluster > ccs > cman-kernel-smp > magma > fence > gnbd-kernel-smp > gnbd > How can I troubleshoot this more? What are the required daemons that > need to start? 
Missing magma-plugins rpm -- Lon From erwan at seanodes.com Mon Mar 5 16:42:54 2007 From: erwan at seanodes.com (Erwan Velu) Date: Mon, 05 Mar 2007 17:42:54 +0100 Subject: [Linux-cluster] can't start GFS on Fedora In-Reply-To: <1173112787.17223.34.camel@asuka.boston.devel.redhat.com> References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com> <1173112787.17223.34.camel@asuka.boston.devel.redhat.com> Message-ID: <45EC488E.6000908@seanodes.com> Lon Hohberger wrote: >> How can I troubleshoot this more? What are the required daemons that >> need to start? >> > > Missing magma-plugins rpm > It seems there is a missing dependencies from one rpm isn't it ? From ROBERTO.RAMIREZ at hitachigst.com Tue Mar 6 00:41:00 2007 From: ROBERTO.RAMIREZ at hitachigst.com (ROBERTO.RAMIREZ at hitachigst.com) Date: Mon, 5 Mar 2007 16:41:00 -0800 Subject: [Linux-cluster] ipmi fence device config Message-ID: Hello i am trying to setup ipmi fence device on a RHEL AS V4 Cluster Suite i have 2 ibm 3950 servers the BMC device is setup with an ip address on both servers when i add the ipmi device to the cluster and test , the fence fails have somebody configure ipmi on Cluster Suite with IBM servers that can help me also if you can tell me some basics about the ipmi fence method i would apriciate i have only do fence on balde centers regards thank -------------- next part -------------- An HTML attachment was scrubbed... URL: From Britt.Treece at savvis.net Tue Mar 6 04:30:53 2007 From: Britt.Treece at savvis.net (Treece, Britt) Date: Mon, 5 Mar 2007 22:30:53 -0600 Subject: [Linux-cluster] RE: Errors trying to login to LT000: ... 1006:Not Allowed Message-ID: All, After much further investigation I found /etc/hosts is off by one for these 3 client nodes on all 3 lock servers. Having fixed the typo's is it safe to assume that the root of the problem trying to login to LTPX is that /etc/hosts on the lock servers was wrong for these nodes? If yes, why would these 3 clients be allowed into the cluster when it was originally started being that they had incorrect entries in /etc/hosts? Regards, Britt Treece -------------- next part -------------- An HTML attachment was scrubbed... URL: From britt.treece at savvis.net Tue Mar 6 04:51:26 2007 From: britt.treece at savvis.net (Britt Treece) Date: Mon, 05 Mar 2007 22:51:26 -0600 Subject: [Linux-cluster] RE: Errors trying to login to LT000: ... 1006:Not Allowed In-Reply-To: Message-ID: Not sure why my first post didn?t, but here it is... --- I am running a 13 node GFS (6.0.2.33) cluster with 10 mounting clients and 3 dedicated lock servers. The master lock server was rebooted and the next slave in the voting order took over. At that time 3 of the client nodes started receiving login errors for the ltpx server Mar 4 00:05:52 lock1 lock_gulmd_core[3798]: Master Node Is Logging Out NOW! ... Mar 4 00:05:52 lock2 lock_gulmd_core[24627]: Master Node has logged out. Mar 4 00:05:54 lock2 lock_gulmd_core[24627]: I see no Masters, So I am Arbitrating until enough Slaves talk to me. Mar 4 00:05:54 lock2 lock_gulmd_LTPX[24638]: New Master at lock2 :192.168.1.3 Mar 4 00:05:56 lock2 lock_gulmd_core[24627]: Now have Slave quorum, going full Master. Mar 4 00:11:39 lock2 lock_gulmd_core[24627]: Master Node Is Logging Out NOW! ? Mar 4 00:05:52 client1 kernel: lock_gulm: Checking for journals for node "lock1 " Mar 4 00:05:52 client1 lock_gulmd_core[9383]: Master Node has logged out. 
Mar 4 00:05:52 client1 kernel: lock_gulm: Checking for journals for node "lock1 " Mar 4 00:05:56 client1 lock_gulmd_core[9383]: Found Master at lock2 , so I'm a Client. Mar 4 00:05:56 client1 lock_gulmd_core[9383]: Failed to receive a timely heartbeat reply from Master. (t:1172988356370685 mb:1) Mar 4 00:05:56 client1 lock_gulmd_LTPX[9390]: New Master at lock2 :192.168.1.3 Mar 4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT002: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT000: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT000: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT002: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT004: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT001: (lock2 :192.168.1.3) 1006:Not Allowed --- Britt On 3/5/07 10:30 PM, "Treece, Britt" wrote: > All, > > After much further investigation I found /etc/hosts is off by one for these 3 > client nodes on all 3 lock servers. Having fixed the typo's is it safe to > assume that the root of the problem trying to login to LTPX is that /etc/hosts > on the lock servers was wrong for these nodes? If yes, why would these 3 > clients be allowed into the cluster when it was originally started being that > they had incorrect entries in /etc/hosts? > > Regards, > > Britt Treece > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew at arts.usyd.edu.au Tue Mar 6 09:25:49 2007 From: matthew at arts.usyd.edu.au (Matthew Geier) Date: Tue, 06 Mar 2007 20:25:49 +1100 Subject: [Linux-cluster] RHEL4 cluster NFS Message-ID: <45ED339D.7000504@arts.usyd.edu.au> I'm beating my head against the wall trying to figure out how to do this properly. I have a active/passive file server running Samba (and NetAtalk) all running fine. I'm using ext3 file systems on an EMC san and have all the failover stuff mostly sorted. Only I need to export one of the filesystems with NFS as well. Not finding an obvious way to do an NFS export in system-config-cluster (no where to enter the file system I want to share), I put the export in /etc/exports. Only now the 'file service' will not shutdown cleanly as it can't stop NFS on the exported volume thus can't unmount it... I put in a support call with Redhat (my cluster file service will not shutdown cleanly) and after a little to-and-fro they have said I have NFS incorrectly configured. But after much searching the web, i'm even more confused. Many articles on Redhat's site (and others) are for RHEL 3 and not RHEL4. Things they say to do I can't, or the config file format has changed, etc. Any one have a concise example on how to NFS export an ext3 filesystem on RHEL U4 cluster suite. ? Thanks. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/x-pkcs7-signature Size: 3415 bytes Desc: S/MIME Cryptographic Signature URL: From lhh at redhat.com Tue Mar 6 15:05:52 2007 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 06 Mar 2007 10:05:52 -0500 Subject: [Linux-cluster] RHEL4 cluster NFS In-Reply-To: <45ED339D.7000504@arts.usyd.edu.au> References: <45ED339D.7000504@arts.usyd.edu.au> Message-ID: <1173193552.14390.4.camel@asuka.boston.devel.redhat.com> On Tue, 2007-03-06 at 20:25 +1100, Matthew Geier wrote: > Any one have a concise example on how to NFS export an ext3 filesystem > on RHEL U4 cluster suite. ? Typically, it should look something like this: If you need to export something *other* than the top-level mountpoint, you can add path="" attributes to the nfsclient resources (include the full path; i.e. if mountpoint was /mnt/1 and you want to export /mnt/1/foo, use path="/mnt/1/foo", not "/foo"...). Let's see what you've got now? -- Lon From rpeterso at redhat.com Tue Mar 6 15:09:00 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Tue, 06 Mar 2007 09:09:00 -0600 Subject: [Linux-cluster] RHEL4 cluster NFS In-Reply-To: <45ED339D.7000504@arts.usyd.edu.au> References: <45ED339D.7000504@arts.usyd.edu.au> Message-ID: <45ED840C.3000002@redhat.com> Matthew Geier wrote: > I'm beating my head against the wall trying to figure out how to do > this properly. > > I have a active/passive file server running Samba (and NetAtalk) all > running fine. I'm using ext3 file systems on an EMC san and have all the > failover stuff mostly sorted. > Only I need to export one of the filesystems with NFS as well. > > Not finding an obvious way to do an NFS export in system-config-cluster > (no where to enter the file system I want to share), I put the export in > /etc/exports. > Only now the 'file service' will not shutdown cleanly as it can't stop > NFS on the exported volume thus can't unmount it... > > I put in a support call with Redhat (my cluster file service will not > shutdown cleanly) and after a little to-and-fro they have said I have > NFS incorrectly configured. > > But after much searching the web, i'm even more confused. Many articles > on Redhat's site (and others) are for RHEL 3 and not RHEL4. Things they > say to do I can't, or the config file format has changed, etc. > > > Any one have a concise example on how to NFS export an ext3 filesystem > on RHEL U4 cluster suite. ? > > Thanks. > Hi Matthew, Have you looked at my NFS/GFS cookbook? It's not too different for EXT3, and it has concrete examples. http://sources.redhat.com/cluster/doc/nfscookbook.pdf Regards, Bob Peterson Red Hat Cluster Suite From lhh at redhat.com Tue Mar 6 15:20:46 2007 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 06 Mar 2007 10:20:46 -0500 Subject: [Linux-cluster] ipmi fence device config In-Reply-To: References: Message-ID: <1173194447.14390.14.camel@asuka.boston.devel.redhat.com> On Mon, 2007-03-05 at 16:41 -0800, ROBERTO.RAMIREZ at hitachigst.com wrote: > > Hello > > i am trying to setup ipmi fence device on a RHEL AS V4 Cluster Suite > > i have 2 ibm 3950 servers the BMC device is setup with an ip address > on both servers > > when i add the ipmi device to the cluster and test , the fence fails > > have somebody configure ipmi on Cluster Suite with IBM servers that > can help me > > also if you can tell me some basics about the ipmi fence method i > would apriciate i have only do fence on balde centers Try fence_ipmi from the command line -- e.g. fence_ipmi -a -o off -p (etc.) - see what the output is. 
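Since the fence agent is essentially a wrapper around ipmitool, it is
also worth checking that ipmitool itself can reach the BMC over the
network (the address, user and password below are placeholders for
whatever your 3950 BMCs are actually configured with):

    ipmitool -I lan -H 192.168.0.50 -U USERID -P PASSW0RD chassis power status

If that fails as well, the problem is in the BMC setup, network path or
credentials rather than in the fence agent itself.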
There are two bugs which are fixed in CVS which you might be hitting: (1) fence_ipmi doesn't work with null passwords https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=218974 (2) fence_ipmi doesn't work with lan-plus components (don't know the bugzilla # :( ) Also, fence_ipmi uses ipmitool. If you get an "ipmitool not found" warning, ensure that the OpenIPMI package from RHEL4 U3 or later is installed. The fence package does not require this any other packages at install-time (Note: -not- a bug). -- Lon From lhh at redhat.com Tue Mar 6 15:22:44 2007 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 06 Mar 2007 10:22:44 -0500 Subject: [Linux-cluster] RHEL4 cluster NFS In-Reply-To: <1173193552.14390.4.camel@asuka.boston.devel.redhat.com> References: <45ED339D.7000504@arts.usyd.edu.au> <1173193552.14390.4.camel@asuka.boston.devel.redhat.com> Message-ID: <1173194564.14390.17.camel@asuka.boston.devel.redhat.com> On Tue, 2007-03-06 at 10:05 -0500, Lon Hohberger wrote: > On Tue, 2007-03-06 at 20:25 +1100, Matthew Geier wrote: Note: Bob's NFS cookbook is way better than my example. /me goes to drink more coffee -- Lon From lgodoy at atichile.com Tue Mar 6 22:48:54 2007 From: lgodoy at atichile.com (Luis Godoy Gonzalez) Date: Tue, 06 Mar 2007 19:48:54 -0300 Subject: [Linux-cluster] node fails to join cluster after it was fenced In-Reply-To: <45DC4934.4040504@redhat.com> References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk> <45D31766.3080908@redhat.com> <1171469028.24507.109.camel@pc029.sc.diamond.ac.uk> <45D339CF.7070408@redhat.com> <1171474578.24507.148.camel@pc029.sc.diamond.ac.uk> <45D422B7.30506@redhat.com> <1171539363.24507.210.camel@pc029.sc.diamond.ac.uk> <1172056251.18210.135.camel@pc029.sc.diamond.ac.uk> <45DC2C5E.1040808@redhat.com> <1172059653.18210.166.camel@pc029.sc.diamond.ac.uk> <45DC4934.4040504@redhat.com> Message-ID: <45EDEFD6.8070802@atichile.com> Hi we have the same problem... :| we have RHE4 U2 with cluster suite 4 U2, in our case one node send a fenced to the other node, and we have not succes to rejoining the node to cluster. On logs appeared that node 2 cannot comunicate with node 1, but the network connectivity is working fine In a test we deleted the cluster.conf from node 2 and reboot it. After the reboot the node got the last version of cluster.conf from node 1, but still cannot joining to cluster again. Below of this mail, we attached a little dump from node 1 that were the cluster service is running. Thanks in advanced for any help. Best Regards, Luis.G. ========================================================================= [root at lvs-gt1 ~]# tcpdump -s0 -x port 6809 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on bond0, link-type EN10MB (Ethernet), capture size 65535 bytes 16:45:30.043719 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28 0x0000: 4500 0038 d064 4000 4011 bbea c0a8 9615 E..8.d at .@....... 0x0010: c0a8 96ff 1a99 1a99 0024 f69c 0101 4ed0 .........$....N. 0x0020: 0000 b49d 0000 1900 0100 0000 0000 0000 ................ 0x0030: 0402 0100 0200 0000 ........ 16:45:30.043758 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28 0x0000: 4500 0038 d064 4000 4011 bbea c0a8 9615 E..8.d at .@....... 0x0010: c0a8 96ff 1a99 1a99 0024 f69c 0101 4ed0 .........$....N. 0x0020: 0000 b49d 0000 1900 0100 0000 0000 0000 ................ 0x0030: 0402 0100 0200 0000 ........ 16:45:30.043829 IP lvs-gt2.6809 > lvs-gt1.6809: UDP, length 92 0x0000: 4500 0078 0226 4000 4011 8ad2 c0a8 9616 E..x.&@. at ....... 
0x0010: c0a8 9615 1a99 1a99 0064 1b42 0101 2902 .........d.B..). 0x0020: 0000 b49d 0000 0100 0000 0000 0000 0000 ................ 0x0030: 0201 0100 0100 0000 0000 0000 0500 0000 ................ 0x0040: 0000 0000 0100 0000 0a00 0000 1000 0000 ................ 0x0050: 6c62 5f63 6c75 7374 6572 0000 0000 0000 lb_cluster...... 0x0060: 0200 1a99 c0a8 9616 0000 0000 0000 0000 ................ 0x0070: 6c76 732d 6774 3200 lvs-gt2. 16:45:35.042945 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28 0x0000: 4500 0038 d065 4000 4011 bbe9 c0a8 9615 E..8.e at .@....... 0x0010: c0a8 96ff 1a99 1a99 0024 f59c 0101 4fd0 .........$....O. 0x0020: 0000 b49d 0000 1900 0100 0000 0000 0000 ................ 0x0030: 0402 0100 0200 0000 ........ 16:45:35.042998 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28 0x0000: 4500 0038 d065 4000 4011 bbe9 c0a8 9615 E..8.e at .@....... 0x0010: c0a8 96ff 1a99 1a99 0024 f59c 0101 4fd0 .........$....O. 0x0020: 0000 b49d 0000 1900 0100 0000 0000 0000 ................ 0x0030: 0402 0100 0200 0000 ........ 16:45:35.043075 IP lvs-gt2.6809 > lvs-gt1.6809: UDP, length 92 0x0000: 4500 0078 0227 4000 4011 8ad1 c0a8 9616 E..x.'@. at ....... 0x0010: c0a8 9615 1a99 1a99 0064 1a42 0101 2a02 .........d.B..*. 0x0020: 0000 b49d 0000 0100 0000 0000 0000 0000 ................ 0x0030: 0201 0100 0100 0000 0000 0000 0500 0000 ................ 0x0040: 0000 0000 0100 0000 0a00 0000 1000 0000 ................ 0x0050: 6c62 5f63 6c75 7374 6572 0000 0000 0000 lb_cluster...... 0x0060: 0200 1a99 c0a8 9616 0000 0000 0000 0000 ................ 0x0070: 6c76 732d 6774 3200 lvs-gt2. 6 packets captured 6 packets received by filter 0 packets dropped by kernel ============================================================================================= Patrick Caulfield wrote: > Frederik Ferner wrote: > >> On Wed, 2007-02-21 at 11:26 +0000, Patrick Caulfield wrote: >> >>> Frederik Ferner wrote: >>> >>>> Hi Patrick, All, >>>> >>>> let me give you an update on that problem. >>>> >>>> On Thu, 2007-02-15 at 11:36 +0000, Frederik Ferner wrote: >>>> >>>>> On Thu, 2007-02-15 at 09:07 +0000, Patrick Caulfield wrote: >>>>> >>>> [node not joining cluster] >>>> >>>>>> It would be interesting to know - though you may not want to do it - if the >>>>>> problem persists when the still-running node is rebooted. >>>>>> >>>>> Obviously not at the moment, but I have a maintenance window upcoming >>>>> soon where I might be able to do that. I'll keep you informed about the >>>>> result. >>>>> >>>> Today I had the possibility to reboot the node that was still quorate >>>> (i04-storage1) while the other node (i04-storage2) was still trying to >>>> join. >>>> When i04-storage1 came to the stage where the cluster services are >>>> started, both nodes joined the cluster at the same time. >>>> >>>> With this running cluster, I tried to reproduce the problem by fencing >>>> one node but after rebooting this immediately joined the cluster. >>>> >>> Interesting. it sounds similar to a cman bug that was introduced in U3, but it >>> was fixed in U4 - which you said you were running. >>> >> Let's verify that then. I have the following RHCS related packages >> installed: >> ccs-1.0.7-0 >> rgmanager-1.9.54-1 >> cman-1.0.11-0 >> fence-1.32.25-1 >> cman-kernel-smp-2.6.9-45.8 >> dlm-kernel-smp-2.6.9-44.3 >> dlm-1.0.1-1 >> > > Yes, those look fine. 
> > From ROBERTO.RAMIREZ at hitachigst.com Tue Mar 6 23:17:52 2007 From: ROBERTO.RAMIREZ at hitachigst.com (ROBERTO.RAMIREZ at hitachigst.com) Date: Tue, 6 Mar 2007 15:17:52 -0800 Subject: [Linux-cluster] node fails to join cluster after it was fenced In-Reply-To: <45EDEFD6.8070802@atichile.com> Message-ID: Luis have you check it the iptables are off if they are on try to disable them for a test and try again service iptables stop chkconfig iptables off fence and see if it get back Luis Godoy Gonzalez Sent by: linux-cluster-bounces at redhat.com 03/06/2007 02:48 PM Please respond to linux clustering To linux clustering cc Subject Re: [Linux-cluster] node fails to join cluster after it was fenced Hi we have the same problem... :| we have RHE4 U2 with cluster suite 4 U2, in our case one node send a fenced to the other node, and we have not succes to rejoining the node to cluster. On logs appeared that node 2 cannot comunicate with node 1, but the network connectivity is working fine In a test we deleted the cluster.conf from node 2 and reboot it. After the reboot the node got the last version of cluster.conf from node 1, but still cannot joining to cluster again. Below of this mail, we attached a little dump from node 1 that were the cluster service is running. Thanks in advanced for any help. Best Regards, Luis.G. ========================================================================= [root at lvs-gt1 ~]# tcpdump -s0 -x port 6809 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on bond0, link-type EN10MB (Ethernet), capture size 65535 bytes 16:45:30.043719 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28 0x0000: 4500 0038 d064 4000 4011 bbea c0a8 9615 E..8.d at .@....... 0x0010: c0a8 96ff 1a99 1a99 0024 f69c 0101 4ed0 .........$....N. 0x0020: 0000 b49d 0000 1900 0100 0000 0000 0000 ................ 0x0030: 0402 0100 0200 0000 ........ 16:45:30.043758 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28 0x0000: 4500 0038 d064 4000 4011 bbea c0a8 9615 E..8.d at .@....... 0x0010: c0a8 96ff 1a99 1a99 0024 f69c 0101 4ed0 .........$....N. 0x0020: 0000 b49d 0000 1900 0100 0000 0000 0000 ................ 0x0030: 0402 0100 0200 0000 ........ 16:45:30.043829 IP lvs-gt2.6809 > lvs-gt1.6809: UDP, length 92 0x0000: 4500 0078 0226 4000 4011 8ad2 c0a8 9616 E..x.&@. at ....... 0x0010: c0a8 9615 1a99 1a99 0064 1b42 0101 2902 .........d.B..). 0x0020: 0000 b49d 0000 0100 0000 0000 0000 0000 ................ 0x0030: 0201 0100 0100 0000 0000 0000 0500 0000 ................ 0x0040: 0000 0000 0100 0000 0a00 0000 1000 0000 ................ 0x0050: 6c62 5f63 6c75 7374 6572 0000 0000 0000 lb_cluster...... 0x0060: 0200 1a99 c0a8 9616 0000 0000 0000 0000 ................ 0x0070: 6c76 732d 6774 3200 lvs-gt2. 16:45:35.042945 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28 0x0000: 4500 0038 d065 4000 4011 bbe9 c0a8 9615 E..8.e at .@....... 0x0010: c0a8 96ff 1a99 1a99 0024 f59c 0101 4fd0 .........$....O. 0x0020: 0000 b49d 0000 1900 0100 0000 0000 0000 ................ 0x0030: 0402 0100 0200 0000 ........ 16:45:35.042998 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28 0x0000: 4500 0038 d065 4000 4011 bbe9 c0a8 9615 E..8.e at .@....... 0x0010: c0a8 96ff 1a99 1a99 0024 f59c 0101 4fd0 .........$....O. 0x0020: 0000 b49d 0000 1900 0100 0000 0000 0000 ................ 0x0030: 0402 0100 0200 0000 ........ 16:45:35.043075 IP lvs-gt2.6809 > lvs-gt1.6809: UDP, length 92 0x0000: 4500 0078 0227 4000 4011 8ad1 c0a8 9616 E..x.'@. at ....... 
0x0010: c0a8 9615 1a99 1a99 0064 1a42 0101 2a02 .........d.B..*. 0x0020: 0000 b49d 0000 0100 0000 0000 0000 0000 ................ 0x0030: 0201 0100 0100 0000 0000 0000 0500 0000 ................ 0x0040: 0000 0000 0100 0000 0a00 0000 1000 0000 ................ 0x0050: 6c62 5f63 6c75 7374 6572 0000 0000 0000 lb_cluster...... 0x0060: 0200 1a99 c0a8 9616 0000 0000 0000 0000 ................ 0x0070: 6c76 732d 6774 3200 lvs-gt2. 6 packets captured 6 packets received by filter 0 packets dropped by kernel ============================================================================================= Patrick Caulfield wrote: > Frederik Ferner wrote: > >> On Wed, 2007-02-21 at 11:26 +0000, Patrick Caulfield wrote: >> >>> Frederik Ferner wrote: >>> >>>> Hi Patrick, All, >>>> >>>> let me give you an update on that problem. >>>> >>>> On Thu, 2007-02-15 at 11:36 +0000, Frederik Ferner wrote: >>>> >>>>> On Thu, 2007-02-15 at 09:07 +0000, Patrick Caulfield wrote: >>>>> >>>> [node not joining cluster] >>>> >>>>>> It would be interesting to know - though you may not want to do it - if the >>>>>> problem persists when the still-running node is rebooted. >>>>>> >>>>> Obviously not at the moment, but I have a maintenance window upcoming >>>>> soon where I might be able to do that. I'll keep you informed about the >>>>> result. >>>>> >>>> Today I had the possibility to reboot the node that was still quorate >>>> (i04-storage1) while the other node (i04-storage2) was still trying to >>>> join. >>>> When i04-storage1 came to the stage where the cluster services are >>>> started, both nodes joined the cluster at the same time. >>>> >>>> With this running cluster, I tried to reproduce the problem by fencing >>>> one node but after rebooting this immediately joined the cluster. >>>> >>> Interesting. it sounds similar to a cman bug that was introduced in U3, but it >>> was fixed in U4 - which you said you were running. >>> >> Let's verify that then. I have the following RHCS related packages >> installed: >> ccs-1.0.7-0 >> rgmanager-1.9.54-1 >> cman-1.0.11-0 >> fence-1.32.25-1 >> cman-kernel-smp-2.6.9-45.8 >> dlm-kernel-smp-2.6.9-44.3 >> dlm-1.0.1-1 >> > > Yes, those look fine. > > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthew at arts.usyd.edu.au Tue Mar 6 23:23:51 2007 From: matthew at arts.usyd.edu.au (Matthew Geier) Date: Wed, 07 Mar 2007 10:23:51 +1100 Subject: [Linux-cluster] RHEL4 cluster NFS In-Reply-To: <1173193552.14390.4.camel@asuka.boston.devel.redhat.com> References: <45ED339D.7000504@arts.usyd.edu.au> <1173193552.14390.4.camel@asuka.boston.devel.redhat.com> Message-ID: <45EDF807.3050104@arts.usyd.edu.au> Lon Hohberger wrote: > On Tue, 2007-03-06 at 20:25 +1100, Matthew Geier wrote: > >> Any one have a concise example on how to NFS export an ext3 filesystem >> on RHEL U4 cluster suite. ? Ok, thanks, it's 'clicked' now and I have the idea. Some one emailed me directly a screen grab of the layout in system-config-cluster that showed me the relationship I was missing, however last evening while relaxing in the bath, I had a 'flash of inspiration' on how the relationship between the file systems in the services section and the NFS clients went together and I tried it remotely and it seems to work. all your helpful emails arrived later. 
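In outline, the relationship is: the service holds the fs resource, and the nfsexport with its nfsclient children nests inside that fs. A generic sketch (made-up names, device and target -- not my actual config) looks roughly like:

  <service name="nfssvc">
    <!-- hypothetical names, device and target -->
    <ip address="10.0.0.100"/>
    <fs name="data1" device="/dev/sanvg/data1" fstype="ext3" mountpoint="/mnt/data1">
      <nfsexport name="data1-export">
        <nfsclient name="labnet" target="10.0.0.0/24" options="rw"/>
      </nfsexport>
    </fs>
  </service>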
It's still not perfect, but functional I gather the nfsclient should be a public resource so it can be reused on other file systems. I made it private. Have to wait to my next maintenance window to change it as the resulting service restart will annoy all my Mac users. (Unlike Windows, Mac's don't expect their servers to go down all the time :-) The examples put the nfsexport in the main resources section as well, which is why I couldn't make the connection with the export resource and the file system it exports. They neatly put the resource configuration in order, directly after the file system is is going to reference, which has given me a false impression of how it was supposed to work, as I mistakenly thought the binding must happen in resources (and I couldn't get it to work) when it actually happens in the service section. What does the actual nfsexport directive do ?. It seems to be that adding an nfsclient to a filesystem resource would imply it. Thanks to all those that helped. From srramasw at cisco.com Wed Mar 7 02:03:22 2007 From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw)) Date: Tue, 6 Mar 2007 18:03:22 -0800 Subject: [Linux-cluster] uboot / bootloader support for GFS Message-ID: Has anyone tried to use uboot bootloader to boot off a local GFS filesystem? For that matter any other bootloader? thanks, Sridhar -------------- next part -------------- An HTML attachment was scrubbed... URL: From shailesh at verismonetworks.com Wed Mar 7 05:38:50 2007 From: shailesh at verismonetworks.com (Shailesh) Date: Wed, 07 Mar 2007 11:08:50 +0530 Subject: [Linux-cluster] fence and lvm Message-ID: <1173245930.20588.21.camel@shailesh> Hi, what are the uses of 'fenced' in a clustered environment where I am not using any power device ? Can you elaborate any other uses. Is it possible to use 'lvm' (make logical volumes) in a RAID-5/6 disk array ? If so how are the parity blocks of RAID 5/6 taken care of in the logical volumes. Thanks & Regards Shailesh From shailesh at verismonetworks.com Wed Mar 7 06:52:12 2007 From: shailesh at verismonetworks.com (Shailesh) Date: Wed, 07 Mar 2007 12:22:12 +0530 Subject: [Linux-cluster] Any RH cluster suite MIB Message-ID: <1173250332.20588.25.camel@shailesh> I am looking for private mibs defined for GFS and the other cluster suite. Do you know of any? Thanks & Regards Shailesh From pcaulfie at redhat.com Wed Mar 7 08:53:18 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 07 Mar 2007 08:53:18 +0000 Subject: [Linux-cluster] node fails to join cluster after it was fenced In-Reply-To: <45EDEFD6.8070802@atichile.com> References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk> <45D31766.3080908@redhat.com> <1171469028.24507.109.camel@pc029.sc.diamond.ac.uk> <45D339CF.7070408@redhat.com> <1171474578.24507.148.camel@pc029.sc.diamond.ac.uk> <45D422B7.30506@redhat.com> <1171539363.24507.210.camel@pc029.sc.diamond.ac.uk> <1172056251.18210.135.camel@pc029.sc.diamond.ac.uk> <45DC2C5E.1040808@redhat.com> <1172059653.18210.166.camel@pc029.sc.diamond.ac.uk> <45DC4934.4040504@redhat.com> <45EDEFD6.8070802@atichile.com> Message-ID: <45EE7D7E.6030101@redhat.com> Luis Godoy Gonzalez wrote: > Hi > > we have the same problem... :| we have RHE4 U2 with cluster suite 4 > U2, in our case one node send a fenced to the other node, and we have > not succes to rejoining the node to cluster. 
> On logs appeared that node 2 cannot comunicate with node 1, but the > network connectivity is working fine > In a test we deleted the cluster.conf from node 2 and reboot it. After > the reboot the node got the last version of cluster.conf from node 1, > but still cannot joining to cluster again. > > Below of this mail, we attached a little dump from node 1 that were the > cluster service is running. That's showing the same symptoms. The new node is sending joinreq messages but they are not received by the node that's already in the cluster. If you're running U2 you should upgrade anyway. there are lots of bugs fixed between that and the current U4. -- patrick From lgodoy at atichile.com Wed Mar 7 13:56:55 2007 From: lgodoy at atichile.com (Luis Godoy Gonzalez) Date: Wed, 07 Mar 2007 10:56:55 -0300 Subject: [Linux-cluster] node fails to join cluster after it was fenced In-Reply-To: <45EE7D7E.6030101@redhat.com> References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk> <45D31766.3080908@redhat.com> <1171469028.24507.109.camel@pc029.sc.diamond.ac.uk> <45D339CF.7070408@redhat.com> <1171474578.24507.148.camel@pc029.sc.diamond.ac.uk> <45D422B7.30506@redhat.com> <1171539363.24507.210.camel@pc029.sc.diamond.ac.uk> <1172056251.18210.135.camel@pc029.sc.diamond.ac.uk> <45DC2C5E.1040808@redhat.com> <1172059653.18210.166.camel@pc029.sc.diamond.ac.uk> <45DC4934.4040504@redhat.com> <45EDEFD6.8070802@atichile.com> <45EE7D7E.6030101@redhat.com> Message-ID: <45EEC4A7.8030200@atichile.com> Hi The "IPtable" service is not running on both nodes. We are thinking in update the platform (RHE4 U4 RHCS 4U4) but thid is not easy right now because we have several servers on production. Another reason to not do it the version update is that we are waiting for an update 5 por RHE4 or the production release for RHE5. In this moment we only update "rgmanager" in some sites (we have several issues with the rgmanager of update 2 RHCS4). Thanks again for you reply. Best regards, Luis G. Patrick Caulfield wrote: > Luis Godoy Gonzalez wrote: > >> Hi >> >> we have the same problem... :| we have RHE4 U2 with cluster suite 4 >> U2, in our case one node send a fenced to the other node, and we have >> not succes to rejoining the node to cluster. >> On logs appeared that node 2 cannot comunicate with node 1, but the >> network connectivity is working fine >> In a test we deleted the cluster.conf from node 2 and reboot it. After >> the reboot the node got the last version of cluster.conf from node 1, >> but still cannot joining to cluster again. >> >> Below of this mail, we attached a little dump from node 1 that were the >> cluster service is running. >> > > That's showing the same symptoms. The new node is sending joinreq messages but > they are not received by the node that's already in the cluster. > > If you're running U2 you should upgrade anyway. there are lots of bugs fixed > between that and the current U4. 
> > From pcaulfie at redhat.com Wed Mar 7 14:56:55 2007 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Wed, 07 Mar 2007 14:56:55 +0000 Subject: [Linux-cluster] node fails to join cluster after it was fenced In-Reply-To: <45EEC4A7.8030200@atichile.com> References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk> <45D31766.3080908@redhat.com> <1171469028.24507.109.camel@pc029.sc.diamond.ac.uk> <45D339CF.7070408@redhat.com> <1171474578.24507.148.camel@pc029.sc.diamond.ac.uk> <45D422B7.30506@redhat.com> <1171539363.24507.210.camel@pc029.sc.diamond.ac.uk> <1172056251.18210.135.camel@pc029.sc.diamond.ac.uk> <45DC2C5E.1040808@redhat.com> <1172059653.18210.166.camel@pc029.sc.diamond.ac.uk> <45DC4934.4040504@redhat.com> <45EDEFD6.8070802@atichile.com> <45EE7D7E.6030101@redhat.com> <45EEC4A7.8030200@atichile.com> Message-ID: <45EED2B7.6040002@redhat.com> Luis Godoy Gonzalez wrote: > Hi > > The "IPtable" service is not running on both nodes. > We are thinking in update the platform (RHE4 U4 RHCS 4U4) but thid is > not easy right now because we have several servers on production. > Another reason to not do it the version update is that we are waiting > for an update 5 por RHE4 or the production release for RHE5. > In this moment we only update "rgmanager" in some sites (we have several > issues with the rgmanager of update 2 RHCS4). > It is really rather odd. Node 1 can obviously see the joinreq messages - at least tcpdump can, but cman is either not seeing them or ignoring them. What really bothers me is that this seems to be affecting U2 and U4 - if both of you were using U3 I would think no more of it :) Annoyingly it's hard to debug at this level (you can't strace a kernel thread!). I"m pretty sure that a reboot of node1 would fix the problem but that's hardly helpful. -- patrick From Britt.Treece at savvis.net Wed Mar 7 15:17:17 2007 From: Britt.Treece at savvis.net (Treece, Britt) Date: Wed, 7 Mar 2007 09:17:17 -0600 Subject: [Linux-cluster] RE: Errors trying to login to LT000: ...1006:Not Allowed In-Reply-To: Message-ID: Does anyone have any idea why incorrect entries in /etc/hosts of the lock servers would intermittently cause the "Errors trying to login to LT000: ...1006:Not Allowed?" I would think this would be something that if wrong should *consistently* cause the client not to be allowed into the lockspace. Additionally can anyone explain the fundamentals of GFS 6.0 lock tables and the locking process. A couple specific questions I have... What is the difference between LTPX and the LT000? What is the advantage of having additional lock tables and when would having more be a disadvantage? Is each lock propagated to each locktable or is it held in only one table? Is the highwater mark for each locktable or the sum of locks across all locktables? Regards, Britt Treece ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Britt Treece Sent: Monday, March 05, 2007 10:51 PM To: linux clustering Subject: Re: [Linux-cluster] RE: Errors trying to login to LT000: ...1006:Not Allowed Not sure why my first post didn't, but here it is... --- I am running a 13 node GFS (6.0.2.33) cluster with 10 mounting clients and 3 dedicated lock servers. The master lock server was rebooted and the next slave in the voting order took over. At that time 3 of the client nodes started receiving login errors for the ltpx server Mar 4 00:05:52 lock1 lock_gulmd_core[3798]: Master Node Is Logging Out NOW! ... 
Mar 4 00:05:52 lock2 lock_gulmd_core[24627]: Master Node has logged out. Mar 4 00:05:54 lock2 lock_gulmd_core[24627]: I see no Masters, So I am Arbitrating until enough Slaves talk to me. Mar 4 00:05:54 lock2 lock_gulmd_LTPX[24638]: New Master at lock2 :192.168.1.3 Mar 4 00:05:56 lock2 lock_gulmd_core[24627]: Now have Slave quorum, going full Master. Mar 4 00:11:39 lock2 lock_gulmd_core[24627]: Master Node Is Logging Out NOW! ... Mar 4 00:05:52 client1 kernel: lock_gulm: Checking for journals for node "lock1 " Mar 4 00:05:52 client1 lock_gulmd_core[9383]: Master Node has logged out. Mar 4 00:05:52 client1 kernel: lock_gulm: Checking for journals for node "lock1 " Mar 4 00:05:56 client1 lock_gulmd_core[9383]: Found Master at lock2 , so I'm a Client. Mar 4 00:05:56 client1 lock_gulmd_core[9383]: Failed to receive a timely heartbeat reply from Master. (t:1172988356370685 mb:1) Mar 4 00:05:56 client1 lock_gulmd_LTPX[9390]: New Master at lock2 :192.168.1.3 Mar 4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT002: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT000: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT000: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT002: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT004: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT001: (lock2 :192.168.1.3) 1006:Not Allowed --- Britt On 3/5/07 10:30 PM, "Treece, Britt" wrote: All, After much further investigation I found /etc/hosts is off by one for these 3 client nodes on all 3 lock servers. Having fixed the typo's is it safe to assume that the root of the problem trying to login to LTPX is that /etc/hosts on the lock servers was wrong for these nodes? If yes, why would these 3 clients be allowed into the cluster when it was originally started being that they had incorrect entries in /etc/hosts? Regards, Britt Treece ________________________________ -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Wed Mar 7 16:15:41 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 07 Mar 2007 10:15:41 -0600 Subject: [Linux-cluster] uboot / bootloader support for GFS In-Reply-To: References: Message-ID: <45EEE52D.6070405@redhat.com> Sridhar Ramaswamy (srramasw) wrote: > Has anyone tried to use uboot bootloader to boot off a local GFS > filesystem? For that matter any other bootloader? > > thanks, > Sridhar Hi Sridhar, I'm familiar with uboot, but I haven't tried booting a GFS root from it. I only know about one platform for using GFS as a root partition, and that is open-sharedroot. See: http://sources.redhat.com/cluster/faq.html#gfs_diskless Unfortunately, I haven't played with that either. Regards, Bob Peterson Red Hat Cluster Suite From james.lapthorn at lapthornconsulting.com Wed Mar 7 16:36:14 2007 From: james.lapthorn at lapthornconsulting.com (James Lapthorn) Date: Wed, 7 Mar 2007 16:36:14 -0000 (UTC) Subject: [Linux-cluster] Quorum Disk question Message-ID: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz> Good afternoon, Hopefully somebody can help with a Quorum disk question I have. 
I have a 4 node cluster and have adopted a 'last man standing' approach. Because of this I use a quorum disk which is I have setup on my SAN. I have added the following configuration into my cluster.conf file. Each node has 1 vote and the quorum disk has 4, therefore Quorum should remain if 3 nodes are removed from the cluster. While testing this I have noticed that i get the following error in the log: qdiskd[26676]: Score insufficient for master operation (1/2; max=4); downgrading At this point Activity is blocked. Is this to do with my heuristic programs, should I not ping each memeber node and maybe ping something like 'localhost' PLease help James Lapthorn _________________________________ This email has been ClamScanned ! www.clamav.net From lgodoy at atichile.com Wed Mar 7 17:19:27 2007 From: lgodoy at atichile.com (Luis Godoy Gonzalez) Date: Wed, 07 Mar 2007 14:19:27 -0300 Subject: [Linux-cluster] node fails to join cluster after it was fenced In-Reply-To: <45EED2B7.6040002@redhat.com> References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk> <45D31766.3080908@redhat.com> <1171469028.24507.109.camel@pc029.sc.diamond.ac.uk> <45D339CF.7070408@redhat.com> <1171474578.24507.148.camel@pc029.sc.diamond.ac.uk> <45D422B7.30506@redhat.com> <1171539363.24507.210.camel@pc029.sc.diamond.ac.uk> <1172056251.18210.135.camel@pc029.sc.diamond.ac.uk> <45DC2C5E.1040808@redhat.com> <1172059653.18210.166.camel@pc029.sc.diamond.ac.uk> <45DC4934.4040504@redhat.com> <45EDEFD6.8070802@atichile.com> <45EE7D7E.6030101@redhat.com> <45EEC4A7.8030200@atichile.com> <45EED2B7.6040002@redhat.com> Message-ID: <45EEF41F.1040704@atichile.com> We are programming a Maintenance Window to reboot node 1, bellow you can find more configuration info. Until this moment, we have had two problems that we describe like "big problems". One of them was solved with a rgmanager update, and the other (more extrange) was solved changing a 10/100/1000 switch for a 10/100 switch (that is the used in our producction platforms) . Bellow I athach too a generic diagram of this instalation. This instalation particulary only have one switch (commonly we have two switch for redundant) Thanks & Regards Luis G. 
================================================================================ [root at lvs-gt1 ~]# clustat Member Status: Quorate Member Name Status ------ ---- ------ lvs-gt2 Offline lvs-gt1 Online, Local, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- XXX1 lvs-gt1 started XXX2 lvs-gt1 started [root at lvs-gt1 ~]# cman_tool status Protocol version: 5.0.1 Config version: 10 Cluster name: lb_cluster Cluster ID: 40372 Cluster Member: Yes Membership state: Cluster-Member Nodes: 1 Expected_votes: 1 Total_votes: 1 Quorum: 1 Active subsystems: 4 Node name: lvs-gt1 Node addresses: 192.168.150.21 [root at lvs-gt1 ~]# cman_tool nodes Node Votes Exp Sts Name 1 1 1 M lvs-gt1 2 1 1 X lvs-gt2 [root at lvs-gt1 ~]# cman_tool services Service Name GID LID State Code Fence Domain: "default" 1 2 run - [1] DLM Lock Space: "Magma" 3 4 run - [1] User: "usrm::manager" 2 3 run - [1] [root at lvs-gt1 ~]# uname -a Linux lvs-gt1 2.6.9-22.EL #1 Mon Sep 19 18:20:28 EDT 2005 i686 athlon i386 GNU/Linux [root at lvs-gt1 ~]# rpm -qa | grep cman cman-kernel-2.6.9-39.5 cman-1.0.2-0 cman-kernel-hugemem-2.6.9-39.5 cman-kernheaders-2.6.9-39.5 cman-kernel-smp-2.6.9-39.5 [root at lvs-gt1 ~]# rpm -qa | grep -i ccs ccs-1.0.2-0 [root at lvs-gt1 ~]# rpm -qa | grep -i fence fence-1.32.6-0 [root at lvs-gt1 ~]# rpm -qa | grep -i rgma rgmanager-1.9.53-0 OTHER NODE ========== [root at lvs-gt1 log]# ssh lvs-gt2 Last login: Tue Mar 6 17:57:07 2007 from 172.22.22.52 [root at lvs-gt2 ~]# tail /var/log/messages Mar 7 09:47:06 lvs-gt2 kernel: CMAN: sending membership request Mar 7 09:47:41 lvs-gt2 last message repeated 7 times Mar 7 09:47:56 lvs-gt2 last message repeated 3 times Mar 7 09:47:57 lvs-gt2 sshd(pam_unix)[13006]: session opened for user root by root(uid=0) Mar 7 09:48:01 lvs-gt2 kernel: CMAN: sending membership request Mar 7 09:48:01 lvs-gt2 crond(pam_unix)[12936]: session closed for user root Mar 7 09:48:01 lvs-gt2 crond(pam_unix)[13039]: session opened for user root by (uid=0) Mar 7 09:48:01 lvs-gt2 su(pam_unix)[13044]: session opened for user admin by (uid=0) Mar 7 09:48:01 lvs-gt2 su(pam_unix)[13044]: session closed for user admin Mar 7 09:48:06 lvs-gt2 kernel: CMAN: sending membership request [root at lvs-gt2 ~]# [root at lvs-gt2 ~]# clustat Segmentation fault [root at lvs-gt2 ~]# cman_tool status Protocol version: 5.0.1 Config version: 10 Cluster name: lb_cluster Cluster ID: 40372 Cluster Member: No Membership state: Joining Patrick Caulfield wrote: > Luis Godoy Gonzalez wrote: > >> Hi >> >> The "IPtable" service is not running on both nodes. >> We are thinking in update the platform (RHE4 U4 RHCS 4U4) but thid is >> not easy right now because we have several servers on production. >> Another reason to not do it the version update is that we are waiting >> for an update 5 por RHE4 or the production release for RHE5. >> In this moment we only update "rgmanager" in some sites (we have several >> issues with the rgmanager of update 2 RHCS4). >> >> > > It is really rather odd. Node 1 can obviously see the joinreq messages - at > least tcpdump can, but cman is either not seeing them or ignoring them. > > What really bothers me is that this seems to be affecting U2 and U4 - if both of > you were using U3 I would think no more of it :) > > Annoyingly it's hard to debug at this level (you can't strace a kernel thread!). > I"m pretty sure that a reboot of node1 would fix the problem but that's hardly > helpful. > > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: generico.png Type: image/png Size: 31248 bytes Desc: not available URL: From Bowie_Bailey at BUC.com Wed Mar 7 17:23:15 2007 From: Bowie_Bailey at BUC.com (Bowie Bailey) Date: Wed, 7 Mar 2007 12:23:15 -0500 Subject: [Linux-cluster] Shutdown a cluster Message-ID: <4766EEE585A6D311ADF500E018C154E30268524C@bnifex.cis.buc.com> What is the proper way to shutdown an entire cluster? I have a 3-node cluster. I can shutdown any one node with no problems, but when I try to shutdown the second and third nodes, it locks up trying to stop the cluster processes since it loses quorum at that point. There has to be a way to tell the cluster to do a complete shutdown. Any pointers? Thanks, -- Bowie From jleafey at utmem.edu Wed Mar 7 17:32:20 2007 From: jleafey at utmem.edu (Jay Leafey) Date: Wed, 07 Mar 2007 11:32:20 -0600 Subject: [Linux-cluster] Quorum Disk question In-Reply-To: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz> References: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz> Message-ID: <45EEF724.2050408@utmem.edu> James Lapthorn wrote: > Good afternoon, > > Hopefully somebody can help with a Quorum disk question I have. I have a > 4 node cluster and have adopted a 'last man standing' approach. Because > of this I use a quorum disk which is I have setup on my SAN. > > I have added the following configuration into my cluster.conf file. > > > > > > > > > Each node has 1 vote and the quorum disk has 4, therefore Quorum should > remain if 3 nodes are removed from the cluster. > > While testing this I have noticed that i get the following error in the log: > > qdiskd[26676]: Score insufficient for master operation (1/2; > max=4); downgrading > > At this point Activity is blocked. > > Is this to do with my heuristic programs, should I not ping each memeber > node and maybe ping something like 'localhost' > Instead of pinging the individual nodes in the cluster, how about pinging the default router? If the entire network goes away, all nodes in the cluster should become inquorate and block activity. If an Ethernet drop to a single host goes bad, it will lose quorum but the other hosts should remain quorate. That's the approach we're using and it seems to work OK so far. Hope that helps! -- Jay Leafey - University of Tennessee E-Mail: jleafey at utmem.edu Phone: 901-448-5848 FAX: 901-448-8199 -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5153 bytes Desc: S/MIME Cryptographic Signature URL: From lhh at redhat.com Wed Mar 7 17:54:53 2007 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 07 Mar 2007 12:54:53 -0500 Subject: [Linux-cluster] RHEL4 cluster NFS In-Reply-To: <45EDF807.3050104@arts.usyd.edu.au> References: <45ED339D.7000504@arts.usyd.edu.au> <1173193552.14390.4.camel@asuka.boston.devel.redhat.com> <45EDF807.3050104@arts.usyd.edu.au> Message-ID: <1173290093.12686.16.camel@asuka.boston.devel.redhat.com> On Wed, 2007-03-07 at 10:23 +1100, Matthew Geier wrote: > Lon Hohberger wrote: > > On Tue, 2007-03-06 at 20:25 +1100, Matthew Geier wrote: > > > >> Any one have a concise example on how to NFS export an ext3 filesystem > >> on RHEL U4 cluster suite. ? > > Ok, thanks, it's 'clicked' now and I have the idea. 
Some one emailed > me directly a screen grab of the layout in system-config-cluster that > showed me the relationship I was missing, however last evening while > relaxing in the bath, I had a 'flash of inspiration' on how the > relationship between the file systems in the services section and the > NFS clients went together and I tried it remotely and it seems to work. > all your helpful emails arrived later. > > It's still not perfect, but functional > > > > options="async,rw" target="whitestar.arts.usyd.edu.au"/> > > > > I gather the nfsclient should be a public resource so it can be > reused on other file systems. I made it private. Have to wait to my next > maintenance window to change it as the resulting service restart will > annoy all my Mac users. (Unlike Windows, Mac's don't expect their > servers to go down all the time :-) haha :) > What does the actual nfsexport directive do ?. It seems to be that > adding an nfsclient to a filesystem resource would imply it. It's basically a per-mountpoint script that does NFS cleanups prior to allowing the file systems to be unmounted, and does sanity checks (makes sure nfsd is running prior to trying to call 'exportfs', for example). It's also a placeholder for future, um "special steps", that might need to happen as Linux NFS changes over time. It is designed to inherit everything it needs to know from the parent resource (and its parent service resource), which is why it can be reused. -- Lon From rpeterso at redhat.com Wed Mar 7 17:56:05 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 07 Mar 2007 11:56:05 -0600 Subject: [Linux-cluster] Shutdown a cluster In-Reply-To: <4766EEE585A6D311ADF500E018C154E30268524C@bnifex.cis.buc.com> References: <4766EEE585A6D311ADF500E018C154E30268524C@bnifex.cis.buc.com> Message-ID: <45EEFCB5.4000607@redhat.com> Bowie Bailey wrote: > What is the proper way to shutdown an entire cluster? > > I have a 3-node cluster. I can shutdown any one node with no problems, > but when I try to shutdown the second and third nodes, it locks up > trying to stop the cluster processes since it loses quorum at that > point. There has to be a way to tell the cluster to do a complete > shutdown. > > Any pointers? > > Thanks, > Hi Bowie, http://sources.redhat.com/cluster/faq.html#cman_shutdown Regards, Bob Peterson Red Hat Cluster Suite From lhh at redhat.com Wed Mar 7 18:03:32 2007 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 07 Mar 2007 13:03:32 -0500 Subject: [Linux-cluster] Quorum Disk question In-Reply-To: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz> References: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz> Message-ID: <1173290612.12686.25.camel@asuka.boston.devel.redhat.com> On Wed, 2007-03-07 at 16:36 +0000, James Lapthorn wrote: > Good afternoon, > > Hopefully somebody can help with a Quorum disk question I have. I have a > 4 node cluster and have adopted a 'last man standing' approach. Because > of this I use a quorum disk which is I have setup on my SAN. > > I have added the following configuration into my cluster.conf file. > > > > > > > > > Each node has 1 vote and the quorum disk has 4, therefore Quorum should > remain if 3 nodes are removed from the cluster. > > While testing this I have noticed that i get the following error in the log: > > qdiskd[26676]: Score insufficient for master operation (1/2; > max=4); downgrading > > At this point Activity is blocked. 
> > Is this to do with my heuristic programs, should I not ping each memeber > node and maybe ping something like 'localhost' Yes, that's the problem. Try one heuristic with a big score and pinging the closest router. Note that if you have any unreliable parts of the network with the RHEL4U4 Qdisk, you may experience false 'downgrades' (bad). If you have a current CVS update of RHEL4/STABLE/RHEL5/etc., you can run with no heuristics at all - which will be a "last-man-standing" approach. Additionally, the 'false downgrade' problem will be fixed in RHEL4U5 (and is fixed in CVS) by adding 'tko' counts to the heuristics. I have updated cman / cman-kernel packages which have several qdisk fixes here (including the two above): http://people.redhat.com/lhh/packages.html -- Lon From Bowie_Bailey at BUC.com Wed Mar 7 18:22:25 2007 From: Bowie_Bailey at BUC.com (Bowie Bailey) Date: Wed, 7 Mar 2007 13:22:25 -0500 Subject: [Linux-cluster] Shutdown a cluster Message-ID: <4766EEE585A6D311ADF500E018C154E30268524E@bnifex.cis.buc.com> Robert Peterson wrote: > Bowie Bailey wrote: > > What is the proper way to shutdown an entire cluster? > > > > I have a 3-node cluster. I can shutdown any one node with no > > problems, but when I try to shutdown the second and third nodes, it > > locks up trying to stop the cluster processes since it loses quorum > > at that point. There has to be a way to tell the cluster to do a > > complete shutdown. > > > > Any pointers? > > > > Thanks, > > > Hi Bowie, > > http://sources.redhat.com/cluster/faq.html#cman_shutdown Now why couldn't I find that?? :) Looking at my init scripts, it looks like the system should run this command when I do a normal shutdown: cman_tool -t 60 -w leave remove Should this be sufficient to allow me to shut down the entire cluster by just issuing a "shutdown" command on each node, or is there something I am missing? -- Bowie From rpeterso at redhat.com Wed Mar 7 18:56:10 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Wed, 07 Mar 2007 12:56:10 -0600 Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer In-Reply-To: <1638.1172372671@sss.pgh.pa.us> References: <1638.1172372671@sss.pgh.pa.us> Message-ID: <45EF0ACA.1090900@redhat.com> Tom Lane wrote: > Can someone help out this questioner? I know zip about Cluster. > I looked at the FAQ for a bit and thought that what he wants is > probably doable, but I couldn't tell if it would be easy or > painful to do load-balancing in this particular way. (And I'm not > qualified to say if what he wants is a sensible approach, either.) > > regards, tom lane > > ------- Forwarded Message > > Date: Sat, 24 Feb 2007 15:37:17 +0000 > From: Ivan Zoratti > To: tgl at redhat.com > Subject: Question on RH Cluster from a MySQL Customer > > Dear Tom, > > first of all, let me introduce myself. I am the Sales Engineering > Manager for EMEA at MySQL. Kath O'Neil, our Director of Strategic > Alliances, kindly gave me your name for a technical question related > to the use of Red Hat and MySQL - hopefully leading to the adoption > of RH Cluster. > > Our customer is looking for a solution that could provide high > availability and scalability in a cluster environment based on linux > servers that are connected to a large SAN. Their favourite choice > would be to go with Red Hat. > Each server connected to the SAN would provide resources to host, > let's say, 5 different instances of MySQL (mysqld). Each mysqld will > have its own configuration, datadir, connection port and IP address. 
> The clustering software should be able to load-balance new mysqld > instances on the available servers. For example, considering servers > with same specs and workload, when the first mysqld starts, it will > be placed on Server A, the second one will go on Server B and so on > for C,D and E. The sixth mysqld will then go on A again, then B and > so forth. If one of the server fails, the mysqld(s) is (or are) > "moved" on the other servers, still in a way to guarantee a load- > balance of the whole system. > After my long (and hopefully clear enough) explanation, my quick > question is: does RH Cluster provide this kind of features? I am > mostly interested in the way we can instatiate mysqld and re-launch > them on any other server in the cluster in case of fault. > > I would be very grateful if you could help me or address me to > somebody or something for an answer. > > Thank you in advance for your help. > > Kind Regards, > > Ivan > > > -- > Ivan Zoratti - Sales Engineering Manager EMEA > > MySQL AB - Windsor - UK > Mobile: +44 7866 363 180 > > ivan at mysql.com > http://www.mysql.com Hi Tom, Ivan, and linux-cluster readers, In theory, our Piranha / LVS (Linux Virtual Server) may be used to load-balance the requests to numerous mysql servers in a cluster. Our rgmanager can provide the High Availability to fail over mysql services to other nodes in the cluster if they fail. However, if the mysqld daemons are all running on a SAN and you're mysqld daemons are trying to serve data from the same file system, you probably have a problem. To share the data/database on the SAN in one harmonious file system, you could use the GFS file system, but "regular" mysql is not cluster-aware (to the best of my knowledge). The sum of my understanding about this may be found here: http://sources.redhat.com/cluster/faq.html#gfs_mysql Since Ivan works for mysql, perhaps he can clear this up if it's not accurate. I'd like to know more about "mysql-cluster" and how it's implemented. I'd like to see mysql implemented as a cluster-friendly app using our cluster infrastructure so they can effectively compete against Oracle RAC without reinventing the wheel. I'd even like to be a part of the effort to make this happen. Hope this helps. Regards, Bob Peterson Red Hat Cluster Suite From Britt.Treece at savvis.net Tue Mar 6 03:45:32 2007 From: Britt.Treece at savvis.net (Treece, Britt) Date: Mon, 5 Mar 2007 21:45:32 -0600 Subject: [Linux-cluster] Errors trying to login to LT000: ... 1006:Not Allowed Message-ID: All, I am running a 13 node GFS (6.0.2.33) cluster with 10 mounting clients and 3 dedicated lock servers. The master lock server was rebooted and the next slave in the voting order took over. At that time 3 of the client nodes started receiving login errors for the ltpx server Mar 4 00:05:52 lock1 lock_gulmd_core[3798]: Master Node Is Logging Out NOW! ... Mar 4 00:05:52 lock2 lock_gulmd_core[24627]: Master Node has logged out. Mar 4 00:05:54 lock2 lock_gulmd_core[24627]: I see no Masters, So I am Arbitrating until enough Slaves talk to me. Mar 4 00:05:54 lock2 lock_gulmd_LTPX[24638]: New Master at lock2 :192.168.1.3 Mar 4 00:05:56 lock2 lock_gulmd_core[24627]: Now have Slave quorum, going full Master. Mar 4 00:11:39 lock2 lock_gulmd_core[24627]: Master Node Is Logging Out NOW! ... Mar 4 00:05:52 client1 kernel: lock_gulm: Checking for journals for node "lock1 " Mar 4 00:05:52 client1 lock_gulmd_core[9383]: Master Node has logged out. 
Mar 4 00:05:52 client1 kernel: lock_gulm: Checking for journals for node "lock1 " Mar 4 00:05:56 client1 lock_gulmd_core[9383]: Found Master at lock2 , so I'm a Client. Mar 4 00:05:56 client1 lock_gulmd_core[9383]: Failed to receive a timely heartbeat reply from Master. (t:1172988356370685 mb:1) Mar 4 00:05:56 client1 lock_gulmd_LTPX[9390]: New Master at lock2 :192.168.1.3 Mar 4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT002: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT000: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT000: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT002: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT004: (lock2 :192.168.1.3) 1006:Not Allowed Mar 4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to LT001: (lock2 :192.168.1.3) 1006:Not Allowed Anyone have any idea what might be causing this? Regards, Britt Treece -------------- next part -------------- An HTML attachment was scrubbed... URL: From ivan at mysql.com Thu Mar 8 10:39:36 2007 From: ivan at mysql.com (Ivan Zoratti) Date: Thu, 8 Mar 2007 10:39:36 +0000 Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer In-Reply-To: <45EF0ACA.1090900@redhat.com> References: <1638.1172372671@sss.pgh.pa.us> <45EF0ACA.1090900@redhat.com> Message-ID: <0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com> Hi Robert, First of all, thanks for your time, I really appreciate it. I'd like to reply to two separate topics here: first, the objective of my question and second, the cluster-awareness of MySQL and the use of GFS with MySQL. My original question was mainly related to the use of Piranha to switch over a service (ie, a specific mysql daemon) from one server to another, in case of fault. There should be only one active service in the cluster, therefore no concurrency or locking issues should happen. The ideal system should be able to: - have a list of services to launch on the cluster - identify the node in the cluster suitable to host the service (for example the node with less workload) - check the availability of the service - stop the service on a node (if the service is not already down) and start the service on another node in case of fault Fault tolerance in this case will be provided by the ability to switch the service from one server to another in the cluster. Scalability is not provided within the service, ie the limitation in resources for the service consist of the resources available on that specific server. I understand that your cluster suite can provide this functionality. I am mainly looking for a supported set of features for an enterprise organisation. The second topic is related to the use of MySQL with clusters and specifically with GFS. It is what we use to call MySQL in active- active clustering. I am afraid your documentation is not totally accurate. Unfortunately, information on the Internet (and also on our web site) are often contradictory. 
It is indeed possible to run multiple mysqld services on different cluster nodes, all sharing the same data structure on shared storage, with this configuration: - Only the MyISAM storage engine can be used - Each mysqld service must start with the external-locking parameter on - Each mysqld service hase to have the query cache parameter off (other cache mechanisms remain on, since they are automatically invalidated by external locking) I am afraid this configuration still does not compete against Oracle RAC. MySQL does not provide a solution that can be compared 1:1 with RAC. You may find some MySQL implementations much more effective than RAC for certain environments, as you will certainly find RAC performing better than MySQL on other implementations. Based on the experience of the sales engineering team, customers have never been disappointed by the technology that MySQL can provide as an alternative to RAC. Decisions are based on many other factors, such as the introduction of another (or a different) database, the cost of migrating current applications and compatibility with third party products. You can imagine we are working hard to remove these obstacles. Thanks again for your help, Kind Regards, Ivan -- Ivan Zoratti - Sales Engineering Manager EMEA MySQL AB - Windsor - UK Mobile: +44 7866 363 180 ivan at mysql.com http://www.mysql.com -- On 7 Mar 2007, at 18:56, Robert Peterson wrote: > Tom Lane wrote: >> Can someone help out this questioner? I know zip about Cluster. >> I looked at the FAQ for a bit and thought that what he wants is >> probably doable, but I couldn't tell if it would be easy or >> painful to do load-balancing in this particular way. (And I'm not >> qualified to say if what he wants is a sensible approach, either.) >> regards, tom lane >> ------- Forwarded Message >> Date: Sat, 24 Feb 2007 15:37:17 +0000 >> From: Ivan Zoratti >> To: tgl at redhat.com >> Subject: Question on RH Cluster from a MySQL Customer >> Dear Tom, >> first of all, let me introduce myself. I am the Sales Engineering >> Manager for EMEA at MySQL. Kath O'Neil, our Director of Strategic >> Alliances, kindly gave me your name for a technical question >> related to the use of Red Hat and MySQL - hopefully leading to >> the adoption of RH Cluster. >> Our customer is looking for a solution that could provide high >> availability and scalability in a cluster environment based on >> linux servers that are connected to a large SAN. Their favourite >> choice would be to go with Red Hat. >> Each server connected to the SAN would provide resources to host, >> let's say, 5 different instances of MySQL (mysqld). Each mysqld >> will have its own configuration, datadir, connection port and IP >> address. >> The clustering software should be able to load-balance new mysqld >> instances on the available servers. For example, considering >> servers with same specs and workload, when the first mysqld >> starts, it will be placed on Server A, the second one will go on >> Server B and so on for C,D and E. The sixth mysqld will then go >> on A again, then B and so forth. If one of the server fails, the >> mysqld(s) is (or are) "moved" on the other servers, still in a >> way to guarantee a load- balance of the whole system. >> After my long (and hopefully clear enough) explanation, my quick >> question is: does RH Cluster provide this kind of features? I am >> mostly interested in the way we can instatiate mysqld and re- >> launch them on any other server in the cluster in case of fault. 
>> I would be very grateful if you could help me or address me to >> somebody or something for an answer. >> Thank you in advance for your help. >> Kind Regards, >> Ivan >> -- >> Ivan Zoratti - Sales Engineering Manager EMEA >> MySQL AB - Windsor - UK >> Mobile: +44 7866 363 180 >> ivan at mysql.com >> http://www.mysql.com > > Hi Tom, Ivan, and linux-cluster readers, > > In theory, our Piranha / LVS (Linux Virtual Server) may be used to > load-balance the requests to numerous mysql servers in a cluster. > > Our rgmanager can provide the High Availability to fail over > mysql services to other nodes in the cluster if they fail. > > However, if the mysqld daemons are all running on a SAN and you're > mysqld daemons are trying to serve data from the same file system, you > probably have a problem. To share the data/database on the SAN in > one harmonious file system, you could use the GFS file system, but > "regular" mysql is not cluster-aware (to the best of my > knowledge). The sum of my understanding about this may be found here: > > http://sources.redhat.com/cluster/faq.html#gfs_mysql > > Since Ivan works for mysql, perhaps he can clear this up if > it's not accurate. I'd like to know more about "mysql-cluster" > and how it's implemented. I'd like to see mysql implemented as > a cluster-friendly app using our cluster infrastructure so they > can effectively compete against Oracle RAC without reinventing > the wheel. I'd even like to be a part of the effort to make this > happen. Hope this helps. > > Regards, > > Bob Peterson > Red Hat Cluster Suite From dave at eons.com Thu Mar 8 16:15:32 2007 From: dave at eons.com (Dave Berry) Date: Thu, 08 Mar 2007 11:15:32 -0500 Subject: [Linux-cluster] Failover not working Message-ID: <45F036A4.1090106@eons.com> I have a 3 node GFS cluster sharing 2 virtual IPs as 2 different services. For some reason the failover is not working correctly. The IPs are listed as services in the cluster.conf and the failover is set to use ordered/restricted. Below is the pertinent cluster.conf parts. The IPs failover when the box goes down but does not fail back to the correctly prioritized box when it returns. I have included the error from the log at the end. Thanks. Mar 8 11:03:26 fs101 clurgmgrd[5684]: Relocating group nfs_ip2 to better node fs102 Mar 8 11:03:26 fs101 clurgmgrd[5684]: Event (0:2:1) Processed Mar 8 11:03:26 fs101 clurgmgrd[5684]: Stopping service nfs_ip2 Mar 8 11:03:26 fs101 clurgmgrd[5684]: #52: Failed changing RG status Mar 8 11:03:26 fs101 clurgmgrd[5684]: Handling failure request for RG nfs_ip2 Mar 8 11:03:26 fs101 clurgmgrd[5684]: #57: Failed changing RG status From filipe.miranda at gmail.com Thu Mar 8 17:24:14 2007 From: filipe.miranda at gmail.com (Filipe Miranda) Date: Thu, 8 Mar 2007 14:24:14 -0300 Subject: [Linux-cluster] How GULM works ? Message-ID: When using the DLM lock system, the CCSD pass on the structure information to the CMAN. CMAN will be responsible for the cluster membership, heartbeat, and cluster communication. All other layers above relay on CMAN. If I'm using GULM, CMAN is not installed, how the architecture works in this case? -- --- Filipe T Miranda -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lhh at redhat.com Thu Mar 8 17:34:46 2007 From: lhh at redhat.com (Lon Hohberger) Date: Thu, 08 Mar 2007 12:34:46 -0500 Subject: [Linux-cluster] Failover not working In-Reply-To: <45F036A4.1090106@eons.com> References: <45F036A4.1090106@eons.com> Message-ID: <1173375286.12686.26.camel@asuka.boston.devel.redhat.com> On Thu, 2007-03-08 at 11:15 -0500, Dave Berry wrote: > I have a 3 node GFS cluster sharing 2 virtual IPs as 2 different > services. For some reason the failover is not working correctly. The > IPs are listed as services in the cluster.conf and the failover is set > to use ordered/restricted. Below is the pertinent cluster.conf parts. > The IPs failover when the box goes down but does not fail back to the > correctly prioritized box when it returns. I have included the error > from the log at the end. Thanks. > Mar 8 11:03:26 fs101 clurgmgrd[5684]: Relocating group nfs_ip2 > to better node fs102 > Mar 8 11:03:26 fs101 clurgmgrd[5684]: Event (0:2:1) Processed > Mar 8 11:03:26 fs101 clurgmgrd[5684]: Stopping service nfs_ip2 > Mar 8 11:03:26 fs101 clurgmgrd[5684]: #52: Failed changing RG status > Mar 8 11:03:26 fs101 clurgmgrd[5684]: Handling failure request > for RG nfs_ip2 > Mar 8 11:03:26 fs101 clurgmgrd[5684]: #57: Failed changing RG status That shouldn't happen - what rgmanager RPM do you have? -- Lon From rstevens at vitalstream.com Thu Mar 8 18:07:51 2007 From: rstevens at vitalstream.com (Rick Stevens) Date: Thu, 08 Mar 2007 10:07:51 -0800 Subject: [Linux-cluster] 2.6.20-rc4 gfs2 bug In-Reply-To: <20070125050731.GA23270@chaos.ao.net> References: <20070125050731.GA23270@chaos.ao.net> Message-ID: <1173377271.30562.9.camel@prophead.corp.publichost.com> On Thu, 2007-01-25 at 00:07 -0500, Dan Merillat wrote: > Running 2.6.20-rc4 _WITH_ the following patch: (Shouldn't be the issue, > but just in case, I'm listing it here) Not adding to the thread here, but Dan, check the date on your machine. This just showed up in my mailbox (8 March) and its headers say it was sent on 25 January! > > Date: Fri, 29 Dec 2006 21:03:57 +0100 > From: Ingo Molnar > Subject: [patch] remove MAX_ARG_PAGES > Message-ID: <20061229200357.GA5940 at elte.hu> > > Linux fileserver 2.6.20-rc4MAX_ARGS #4 PREEMPT Fri Jan 12 03:58:25 EST 2007 x86_64 GNU/Linux > > This happened when I started testing gfs2 for the first time. I > installed userspace from CVS, loaded the gfs2/dlm modules, mkfs.gfs2, > then "mount -t gfs2 -v /dev/vg1/gfs2 /mnt/gfs" > > This was the initial mount of the new filesystem. I can create > directories, but attempting a stress-test with bonnie seems to have > deadlocked something. (at "Start 'em", immediately.) > > To clarify: the two oopses happened at first mount. After that, I > created files/directories, then attempted to stress it a bit with > bonnie++. No further oops/dmesg output. > > For the GFS2 folks, latest CVS gfs_tool doesn't have lockdump, is there > any way to examine what I'm stuck on? > > This machine is specifically for testing new things before I put them > into production, so I can leave it hung like this indefinitely for > debugging. 
> > > [845566.571468] GFS2 (built Jan 12 2007 04:02:27) installed > [849416.113382] DLM (built Jan 12 2007 04:01:21) installed > [849416.352219] Lock_DLM (built Jan 12 2007 04:02:46) installed > [850966.368016] GFS2: fsid=: Trying to join cluster "lock_dlm", "internal:gfs-test" > [850971.783223] dlm: gfs-test: recover 1 > [850971.783242] dlm: gfs-test: add member 1 > [850971.783246] dlm: gfs-test: total members 1 error 0 > [850971.783248] dlm: gfs-test: dlm_recover_directory > [850971.783260] dlm: gfs-test: dlm_recover_directory 0 entries > [850971.783270] dlm: gfs-test: recover 1 done: 0 ms > [850971.783454] GFS2: fsid=internal:gfs-test.0: Joined cluster. Now mounting FS... > [850973.409048] GFS2: fsid=internal:gfs-test.0: jid=0, already locked for use > [850973.409135] GFS2: fsid=internal:gfs-test.0: jid=0: Looking at journal... > [850973.504558] GFS2: fsid=internal:gfs-test.0: jid=0: Done > [850973.504653] GFS2: fsid=internal:gfs-test.0: jid=1: Trying to acquire journal lock... > [850973.517086] GFS2: fsid=internal:gfs-test.0: jid=1: Looking at journal... > [850973.691546] GFS2: fsid=internal:gfs-test.0: jid=1: Done > [850973.691635] GFS2: fsid=internal:gfs-test.0: jid=2: Trying to acquire journal lock... > [850973.702646] GFS2: fsid=internal:gfs-test.0: jid=2: Looking at journal... > [850973.846397] GFS2: fsid=internal:gfs-test.0: jid=2: Done > > > [850973.869288] ------------[ cut here ]------------ > [850973.869294] kernel BUG at fs/gfs2/glock.c:738! > [850973.869297] invalid opcode: 0000 [1] PREEMPT > [850973.869300] CPU 0 > [850973.869302] Modules linked in: lock_dlm dlm gfs2 scsi_tgt bttv video_buf firmware_class ir_common compat_ioctl32 btcx_risc tveeprom videodev v4l2_common v4l1_compat radeon nbd eth1394 ohci1394 dm_crypt eeprom w83627hf hwmon_vid i2c_isa i2c_viapro snd_via82xx snd_mpu401_uart snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_device snd_timer snd_page_alloc snd_util_mem snd_hwdep snd soundcore > [850973.869324] Pid: 31076, comm: gfs2_glockd Not tainted 2.6.20-rc4MAX_ARGS #4 > [850973.869327] RIP: 0010:[] [] :gfs2:gfs2_glmutex_unlock+0x2b/0x40 > [850973.869355] RSP: 0018:ffff81001849be70 EFLAGS: 00010282 > [850973.869359] RAX: ffff810023ff4ee0 RBX: ffff810023ff4e68 RCX: ffffffff88185800 > [850973.869363] RDX: 0000000000000000 RSI: ffff810023ff4ec0 RDI: ffff810023ff4e68 > [850973.869366] RBP: ffff810023ff4f38 R08: 0000000000000000 R09: 0000000000006052 > [850973.869370] R10: 0000000000000000 R11: ffffffff8816de60 R12: ffff810023ff4e68 > [850973.869374] R13: ffff810023ff4eb0 R14: ffff81003ffd6850 R15: ffff81003ffd6870 > [850973.869378] FS: 00002aebf51826d0(0000) GS:ffffffff807fb000(0000) knlGS:00000000f72026c0 > [850973.869381] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > [850973.869384] CR2: 00002b9e93097fe0 CR3: 0000000003a79000 CR4: 00000000000006e0 > [850973.869388] Process gfs2_glockd (pid: 31076, threadinfo ffff81001849a000, task ffff810000b82890) > [850973.869390] Stack: ffff810023ff4eb0 ffffffff8816cc08 ffff81001849beb0 ffff810024322000 > [850973.869397] ffff8100243223b8 ffff8100074cf968 ffffffff88163510 ffffffff88163528 > [850973.869402] 0000000000000000 ffff810000b82890 ffffffff8029fe70 ffff81001849bec8 > [850973.869407] Call Trace: > [850973.869421] [] :gfs2:gfs2_reclaim_glock+0x138/0x180 > [850973.869434] [] :gfs2:gfs2_glockd+0x0/0xf0 > [850973.869445] [] :gfs2:gfs2_glockd+0x18/0xf0 > [850973.869453] [] autoremove_wake_function+0x0/0x30 > [850973.869465] [] :gfs2:gfs2_glockd+0x0/0xf0 > [850973.869471] [] 
kthread+0xd3/0x110 > [850973.869476] [] schedule_tail+0x37/0xc0 > [850973.869481] [] keventd_create_kthread+0x0/0xa0 > [850973.869485] [] child_rip+0xa/0x12 > [850973.869490] [] keventd_create_kthread+0x0/0xa0 > [850973.869497] [] kthread+0x0/0x110 > [850973.869501] [] child_rip+0x0/0x12 > [850973.869504] > [850973.869505] > [850973.869506] Code: 0f 0b 66 66 90 eb fe 66 66 66 90 66 66 66 90 66 66 90 66 66 > [850973.869514] RIP [] :gfs2:gfs2_glmutex_unlock+0x2b/0x40 > [850973.869528] RSP > [850973.869530] <6>note: gfs2_glockd[31076] exited with preempt_count 1 > > > [850986.762341] ------------[ cut here ]------------ > [850986.762346] kernel BUG at fs/gfs2/glock.c:738! > [850986.762349] invalid opcode: 0000 [2] PREEMPT > [850986.762351] CPU 0 > [850986.762353] Modules linked in: lock_dlm dlm gfs2 scsi_tgt bttv video_buf firmware_class ir_common compat_ioctl32 btcx_risc tveeprom videodev v4l2_common v4l1_compat radeon nbd eth1394 ohci1394 dm_crypt eeprom w83627hf hwmon_vid i2c_isa i2c_viapro snd_via82xx snd_mpu401_uart snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_device snd_timer snd_page_alloc snd_util_mem snd_hwdep snd soundcore > [850986.762376] Pid: 31075, comm: gfs2_scand Not tainted 2.6.20-rc4MAX_ARGS #4 > [850986.762379] RIP: 0010:[] [] :gfs2:gfs2_glmutex_unlock+0x2b/0x40 > [850986.762405] RSP: 0000:ffff81001f221e80 EFLAGS: 00010286 > [850986.762408] RAX: ffff810023ff4940 RBX: ffff810023ff48c8 RCX: 0000000000000000 > [850986.762412] RDX: 0000000000000146 RSI: ffff810023ff4920 RDI: ffff810023ff48c8 > [850986.762416] RBP: ffff810024322000 R08: ffff81001f220000 R09: 00000000f6e88388 > [850986.762418] R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000000 > [850986.762422] R13: 0000000000000000 R14: ffffffff8816d2f0 R15: ffff81003ffd6870 > [850986.762426] FS: 00002b5212e53ae0(0000) GS:ffffffff807fb000(0000) knlGS:00000000f6e88bb0 > [850986.762429] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > [850986.762432] CR2: 00000000f706f000 CR3: 000000002b96b000 CR4: 00000000000006e0 > [850986.762436] Process gfs2_scand (pid: 31075, threadinfo ffff81001f220000, task ffff810025448850) > [850986.762438] Stack: ffff810023ff48c8 ffffffff8816a86c 0000000000000147 ffff810024322000 > [850986.762445] ffff8100074cf968 ffffffff88163600 ffff81003ffd6850 ffffffff8816ae54 > [850986.762450] ffff81003ffd6850 000000000000000f ffff810024322000 ffffffff88163618 > [850986.762454] Call Trace: > [850986.762469] [] :gfs2:examine_bucket+0x8c/0x100 > [850986.762481] [] :gfs2:gfs2_scand+0x0/0x70 > [850986.762494] [] :gfs2:gfs2_scand_internal+0x24/0x40 > [850986.762506] [] :gfs2:gfs2_scand+0x18/0x70 > [850986.762514] [] kthread+0xd3/0x110 > [850986.762519] [] schedule_tail+0x37/0xc0 > [850986.762525] [] keventd_create_kthread+0x0/0xa0 > [850986.762530] [] child_rip+0xa/0x12 > [850986.762535] [] keventd_create_kthread+0x0/0xa0 > [850986.762542] [] kthread+0x0/0x110 > [850986.762545] [] child_rip+0x0/0x12 > [850986.762548] > [850986.762549] > [850986.762550] Code: 0f 0b 66 66 90 eb fe 66 66 66 90 66 66 66 90 66 66 90 66 66 > [850986.762559] RIP [] :gfs2:gfs2_glmutex_unlock+0x2b/0x40 > [850986.762572] RSP > [850986.762575] <6>note: gfs2_scand[31075] exited with preempt_count 1 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster ---------------------------------------------------------------------- - Rick Stevens, Principal Engineer rstevens at vitalstream.com - - VitalStream, Inc. 
http://www.vitalstream.com - - - - I'm afraid my karma just ran over your dogma - ---------------------------------------------------------------------- From rpeterso at redhat.com Thu Mar 8 18:29:22 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Thu, 08 Mar 2007 12:29:22 -0600 Subject: [Linux-cluster] How GULM works ? In-Reply-To: References: Message-ID: <45F05602.3000803@redhat.com> Filipe Miranda wrote: > When using the DLM lock system, the CCSD pass on the structure > information to the CMAN. CMAN will be responsible for the cluster > membership, heartbeat, and cluster communication. All other layers above > relay on CMAN. > > If I'm using GULM, CMAN is not installed, how the architecture works in > this case? Hi Filipe, The Gulm locking protocol has its own cluster manager and locking layers. The layers are all still there, but they're internal to Gulm. For cman, it's broken out into pieces like cman, dlm, lock_dlm, lock_harness, and so forth. Regards, Bob Peterson Red Hat Cluster Suite From rpeterso at redhat.com Thu Mar 8 18:32:52 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Thu, 08 Mar 2007 12:32:52 -0600 Subject: [Linux-cluster] 2.6.20-rc4 gfs2 bug In-Reply-To: <1173377271.30562.9.camel@prophead.corp.publichost.com> References: <20070125050731.GA23270@chaos.ao.net> <1173377271.30562.9.camel@prophead.corp.publichost.com> Message-ID: <45F056D4.9080301@redhat.com> Rick Stevens wrote: > On Thu, 2007-01-25 at 00:07 -0500, Dan Merillat wrote: >> Running 2.6.20-rc4 _WITH_ the following patch: (Shouldn't be the issue, >> but just in case, I'm listing it here) > > Not adding to the thread here, but Dan, check the date on your machine. > This just showed up in my mailbox (8 March) and its headers say it was > sent on 25 January! Hi Rick, We discovered today that the linux-cluster mailing list had some problems and some of the messages got inadvertently stuck. When we fixed the problem, a few old messages came through that had been stuck. Bob Peterson Red Hat Cluster Suite From lshen at cisco.com Thu Mar 8 18:40:15 2007 From: lshen at cisco.com (Lin Shen (lshen)) Date: Thu, 8 Mar 2007 10:40:15 -0800 Subject: [Linux-cluster] Changing journal size with GFS2 Message-ID: <08A9A3213527A6428774900A80DBD8D80397DBA4@xmb-sjc-222.amer.cisco.com> The FAQ says that with GFS2, it will be possible to add journals w/o extending the file system. Does this require redo mkfs? Also, will it be possible to change (both increase and decrease) journal size w/o extending file system or redo mkfs? Lin From adas at redhat.com Thu Mar 8 18:53:04 2007 From: adas at redhat.com (Abhijith Das) Date: Thu, 08 Mar 2007 12:53:04 -0600 Subject: [Linux-cluster] Changing journal size with GFS2 In-Reply-To: <08A9A3213527A6428774900A80DBD8D80397DBA4@xmb-sjc-222.amer.cisco.com> References: <08A9A3213527A6428774900A80DBD8D80397DBA4@xmb-sjc-222.amer.cisco.com> Message-ID: <45F05B90.9070708@redhat.com> Lin Shen (lshen) wrote: >The FAQ says that with GFS2, it will be possible to add journals w/o >extending the file system. Does this require redo mkfs? > > The gfs2_jadd tool allows you to add journals to an existing gfs2 filesystem without having to redo mkfs. 
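For example, assuming the filesystem is mounted at /mnt/gfs2 (the mount point and
sizes here are only placeholders -- treat this as a rough sketch and check the
gfs2_jadd man page for the exact options):

    # add two more journals using the default journal size
    gfs2_jadd -j 2 /mnt/gfs2

    # add one journal with an explicit size in megabytes
    gfs2_jadd -j 1 -J 64 /mnt/gfs2

gfs2_jadd is run against the mounted filesystem; it only creates new journals and
does not touch the ones that already exist.
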
--Abhi From lshen at cisco.com Thu Mar 8 19:13:54 2007 From: lshen at cisco.com (Lin Shen (lshen)) Date: Thu, 8 Mar 2007 11:13:54 -0800 Subject: [Linux-cluster] Changing journal size with GFS2 In-Reply-To: <45F05B90.9070708@redhat.com> Message-ID: <08A9A3213527A6428774900A80DBD8D80397DBF3@xmb-sjc-222.amer.cisco.com> Looks like gfs2_jadd also allows to add journals with different sizes. Wonder if the size of the exiting journals can be changed somehow. Lin > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Abhijith Das > Sent: Thursday, March 08, 2007 10:53 AM > To: linux clustering > Subject: Re: [Linux-cluster] Changing journal size with GFS2 > > Lin Shen (lshen) wrote: > > >The FAQ says that with GFS2, it will be possible to add journals w/o > >extending the file system. Does this require redo mkfs? > > > > > The gfs2_jadd tool allows you to add journals to an existing > gfs2 filesystem without having to redo mkfs. > > --Abhi > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From lshen at cisco.com Thu Mar 8 19:29:30 2007 From: lshen at cisco.com (Lin Shen (lshen)) Date: Thu, 8 Mar 2007 11:29:30 -0800 Subject: [Linux-cluster] Using GFS2 as a local file system Message-ID: <08A9A3213527A6428774900A80DBD8D8039FF97C@xmb-sjc-222.amer.cisco.com> We have a situation that we may need to use GFS2 to share storage in our system in the future and to ease the pain of transition at that time (convert files into GFS), we're thinking of using GFS2 just as a local file system for now. How is GFS2 compared to other popular local file systems such as ext3 and Reiser in terms of performance, overhead etc? Are we hitting the wrong direction totally by using GFS2 just as a local file system? BTW, we've run bonnie on local GFS2, and the performance is decent compared to ext3 (90%). Lin From srramasw at cisco.com Thu Mar 8 19:53:40 2007 From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw)) Date: Thu, 8 Mar 2007 11:53:40 -0800 Subject: [Linux-cluster] GFS2 traceback related to DLM In-Reply-To: <20070125050731.GA23270@chaos.ao.net> Message-ID: I tried GFS2 on two-node cluster using GNBD. cfs1 - gnbd exports an IDE parition. Mount gfs2 directly on that IDE partition. cfs5 - gnbd imports the IDE parition. Mount gfs2 on top of gnbd device. I'm using, RHEL4 distro Linux kernel 2.6.20.1 (from kernel.org) cluster-2.00.00 (from tarball) udev-094 openais-0.80.2 Everything seems to be working fine. But when I mounted GFS2 on the 2nd node on top of gnbd device, I got these dlm related tracebacks. Plus dlm_recvd and dlm_sendd process are spinning cpu on both the boxes. Note the mount itself succeeded and I can use the filesystem from both the nodes. I know GFS2 is new, but anyone solution to this problem? I need to mention, I also see bunch of udev daemon related failure mesgs. I'm guessing it is due using it on old RHEL4 distribution? Not sure if that contributed to this spinlock problem reported here. Ultimately I want to run bonnie test on this configuration. But don't what to do that until the basic sanity of this GFS2 configuration is established. 
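For reference, the setup described above boils down to roughly the following. The
export name, mount point and journal count are placeholders (only the cluster/fs
name ciscogfs2:hda9 is taken from the logs below), so treat this as a sketch and
check the gnbd_export/gnbd_import man pages for the exact syntax:

on cfs1 (the exporting node):
    gnbd_serv
    gnbd_export -d /dev/hda9 -e hda9
    mkfs.gfs2 -p lock_dlm -t ciscogfs2:hda9 -j 2 /dev/hda9
    mount -t gfs2 /dev/hda9 /mnt/gfs2

on cfs5 (the importing node):
    gnbd_import -i cfs1
    mount -t gfs2 /dev/gnbd/hda9 /mnt/gfs2
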
thanks, Sridhar cfs1: Mar 7 17:45:53 cfs1 kernel: BUG: spinlock already unlocked on CPU#1, dlm_recoverd/11046 Mar 7 17:45:53 cfs1 kernel: lock: cc6b68e4, .magic: dead4ead, .owner: /-1, .owner_cpu: -1 Mar 7 17:45:53 cfs1 kernel: [] _raw_spin_unlock+0x29/0x6b Mar 7 17:45:53 cfs1 kernel: [] dlm_lowcomms_get_buffer+0x6c/0xe7 [dlm] Mar 7 17:45:53 cfs1 kernel: [] create_rcom+0x2d/0xb3 [dlm] Mar 7 17:45:53 cfs1 kernel: [] dlm_rcom_status+0x5a/0x10b [dlm] Mar 7 17:45:53 cfs1 kernel: [] make_member_array+0x84/0x14c [dlm] Mar 7 17:45:53 cfs1 kernel: [] ping_members+0x37/0x6e [dlm] Mar 7 17:45:53 cfs1 kernel: [] dlm_set_recover_status+0x14/0x24 [dlm] Mar 7 17:45:53 cfs1 kernel: [] dlm_recover_members+0x164/0x1a1 [dlm] Mar 7 17:45:53 cfs1 kernel: [] ls_recover+0x67/0x2c6 [dlm] Mar 7 17:45:53 cfs1 kernel: [] do_ls_recovery+0x5d/0x75 [dlm] Mar 7 17:45:53 cfs1 kernel: [] dlm_recoverd+0x0/0x74 [dlm] Mar 7 17:45:53 cfs1 kernel: [] dlm_recoverd+0x5b/0x74 [dlm] Mar 7 17:45:53 cfs1 kernel: [] kthread+0x72/0x96 Mar 7 17:45:53 cfs1 kernel: [] kthread+0x0/0x96 Mar 7 17:45:53 cfs1 kernel: [] kernel_thread_helper+0x7/0x10 cfs2: Mar 7 17:43:28 cfs5 gnbd_monitor[10552]: gnbd_monitor started. Monitoring device #0 Mar 7 17:43:28 cfs5 gnbd_recvd[10555]: gnbd_recvd started Mar 7 17:43:28 cfs5 kernel: resending requests Mar 7 17:43:33 cfs5 udevsend[10560]: starting udevd daemon Mar 7 17:43:35 cfs5 udevsend[10560]: unable to connect to event daemon, try to call udev directly Mar 7 17:45:51 cfs5 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "ciscogfs2:hda9" Mar 7 17:45:53 cfs5 udevsend[10598]: starting udevd daemon Mar 7 17:45:53 cfs5 udevsend[10599]: starting udevd daemon Mar 7 17:45:53 cfs5 udevsend[10610]: starting udevd daemon Mar 7 17:45:53 cfs5 kernel: dlm: got connection from 1 Mar 7 17:45:53 cfs5 kernel: BUG: spinlock already unlocked on CPU#0, dlm_recvd/10593 Mar 7 17:45:53 cfs5 kernel: lock: c8d467e4, .magic: dead4ead, .owner: /-1, .owner_cpu: -1 Mar 7 17:45:53 cfs5 kernel: [] _raw_spin_unlock+0x29/0x6b Mar 7 17:45:53 cfs5 kernel: [] dlm_lowcomms_get_buffer+0x6c/0xe7 [dlm] Mar 7 17:45:53 cfs5 kernel: [] create_rcom+0x2d/0xb3 [dlm] Mar 7 17:45:53 cfs5 kernel: [] receive_rcom_status+0x2f/0x74 [dlm] Mar 7 17:45:53 cfs5 kernel: [] dlm_find_lockspace_global+0x3c/0x41 [dlm] Mar 7 17:45:53 cfs5 kernel: [] dlm_receive_rcom+0xc1/0x17f [dlm] Mar 7 17:45:53 cfs5 udevsend[10617]: starting udevd daemon Mar 7 17:45:54 cfs5 udevsend[10626]: starting udevd daemon Mar 7 17:45:54 cfs5 udevsend[10628]: starting udevd daemon Mar 7 17:45:55 cfs5 udevsend[10598]: unable to connect to event daemon, try to call udev directly Mar 7 17:45:55 cfs5 udevsend[10599]: unable to connect to event daemon, try to call udev directly Mar 7 17:45:55 cfs5 udevsend[10610]: unable to connect to event daemon, try to call udev directly Mar 7 17:45:55 cfs5 kernel: [] dlm_process_incoming_buffer+0x148/0x1ad [dlm] Mar 7 17:45:59 cfs5 udevsend[10617]: unable to connect to event daemon, try to call udev directly Mar 7 17:46:00 cfs5 udevsend[10626]: unable to connect to event daemon, try to call udev directly Mar 7 17:46:02 cfs5 udevsend[10628]: unable to connect to event daemon, try to call udev directly Mar 7 17:46:05 cfs5 kernel: [] autoremove_wake_function+0x0/0x33 Mar 7 17:46:09 cfs5 kernel: [] __alloc_pages+0x61/0x2ad Mar 7 17:46:10 cfs5 kernel: [] receive_from_sock+0x178/0x246 [dlm] Mar 7 17:46:10 cfs5 kernel: [] process_sockets+0x55/0x90 [dlm] Mar 7 17:46:11 cfs5 kernel: [] dlm_recvd+0x0/0x69 [dlm] Mar 7 17:46:11 cfs5 kernel: [] 
dlm_recvd+0x5a/0x69 [dlm] Mar 7 17:46:12 cfs5 kernel: [] kthread+0x72/0x96 Mar 7 17:46:12 cfs5 kernel: [] kthread+0x0/0x96 Mar 7 17:46:13 cfs5 kernel: [] kernel_thread_helper+0x7/0x10 Mar 7 17:46:14 cfs5 kernel: ======================= Mar 7 17:46:14 cfs5 kernel: dlm: hda9: recover 1 Mar 7 17:46:15 cfs5 kernel: dlm: hda9: add member 1 Mar 7 17:46:15 cfs5 kernel: dlm: hda9: add member 2 Mar 7 17:46:16 cfs5 kernel: dlm: hda9: total members 2 error 0 Mar 7 17:46:17 cfs5 kernel: dlm: hda9: dlm_recover_directory Mar 7 17:46:18 cfs5 kernel: dlm: hda9: dlm_recover_directory 12 entries Mar 7 17:46:19 cfs5 kernel: GFS2: fsid=ciscogfs2:hda9.1: Joined cluster. Now mounting FS... Mar 7 17:46:20 cfs5 kernel: dlm: hda9: recover 1 done: 348 ms Mar 7 17:46:21 cfs5 kernel: GFS2: fsid=ciscogfs2:hda9.1: jid=1, already locked for use Mar 7 17:46:22 cfs5 kernel: GFS2: fsid=ciscogfs2:hda9.1: jid=1: Looking at journal... Mar 7 17:46:23 cfs5 kernel: GFS2: fsid=ciscogfs2:hda9.1: jid=1: Done > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dan Merillat > Sent: Wednesday, January 24, 2007 9:08 PM > To: linux-kernel at vger.kernel.org > Cc: linux-cluster at redhat.com > Subject: [Linux-cluster] 2.6.20-rc4 gfs2 bug > > Running 2.6.20-rc4 _WITH_ the following patch: (Shouldn't be > the issue, > but just in case, I'm listing it here) > From dave at eons.com Thu Mar 8 20:20:02 2007 From: dave at eons.com (Dave Berry) Date: Thu, 08 Mar 2007 15:20:02 -0500 Subject: [Linux-cluster] Re: Failover not working In-Reply-To: <45F036A4.1090106@eons.com> References: <45F036A4.1090106@eons.com> Message-ID: <45F06FF2.6020309@eons.com> rgmanager-1.9.54-1 > That shouldn't happen - what rgmanager RPM do you have? > > -- Lon From teigland at redhat.com Fri Mar 9 02:46:38 2007 From: teigland at redhat.com (David Teigland) Date: Thu, 8 Mar 2007 20:46:38 -0600 Subject: [Linux-cluster] GFS2 traceback related to DLM In-Reply-To: References: <20070125050731.GA23270@chaos.ao.net> Message-ID: <20070309024638.GA2954@redhat.com> On Thu, Mar 08, 2007 at 11:53:40AM -0800, Sridhar Ramaswamy (srramasw) wrote: > I'm using, > > RHEL4 distro > Linux kernel 2.6.20.1 (from kernel.org) > cluster-2.00.00 (from tarball) > udev-094 > openais-0.80.2 > > Everything seems to be working fine. But when I mounted GFS2 on the 2nd > node on top of gnbd device, I got these dlm related tracebacks. Plus > dlm_recvd and dlm_sendd process are spinning cpu on both the boxes. Note > the mount itself succeeded and I can use the filesystem from both the > nodes. > > I know GFS2 is new, but anyone solution to this problem? Yes, get the latest dlm source (and gfs2 while you're at it) from a 2.6.21-rc kernel. I usually do something like this as a shortcut: cp linux-2.6.21-rc/fs/dlm/* linux-2.6.20/fs/dlm/ cp linux-2.6.21-rc/fs/gfs2/* linux-2.6.20/fs/gfs2/ cp linux-2.6.21-rc/fs/gfs2/locking/dlm/* linux-2.6.20/fs/gfs2/locking/dlm/ No guarantees, but it often works and saves a bit of effort. 
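After copying the newer sources in, an incremental rebuild of the modules should be
all that's needed -- this is just the usual kernel module rebuild, nothing
cluster-specific, with paths as in the cp example above:

    cd linux-2.6.20
    make modules
    make modules_install

then reload the dlm/gfs2/lock_dlm modules (or simply reboot into the rebuilt
kernel) before retrying the mount.
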
Dave From wcheng at redhat.com Fri Mar 9 05:35:34 2007 From: wcheng at redhat.com (Wendy Cheng) Date: Fri, 09 Mar 2007 00:35:34 -0500 Subject: [Linux-cluster] Using GFS2 as a local file system In-Reply-To: <08A9A3213527A6428774900A80DBD8D8039FF97C@xmb-sjc-222.amer.cisco.com> References: <08A9A3213527A6428774900A80DBD8D8039FF97C@xmb-sjc-222.amer.cisco.com> Message-ID: <45F0F226.2020601@redhat.com> Lin Shen (lshen) wrote: >We have a situation that we may need to use GFS2 to share storage in our >system in the future and to ease the pain of transition at that time >(convert files into GFS), we're thinking of using GFS2 just as a local >file system for now. > >How is GFS2 compared to other popular local file systems such as ext3 >and Reiser in terms of performance, overhead etc? Are we hitting the >wrong direction totally by using GFS2 just as a local file system? > >BTW, we've run bonnie on local GFS2, and the performance is decent >compared to ext3 (90%). > > > > I personally think using GFS (both GFS1 and GFS2) as a local filesystem has many advantages. The only issue (I think ..haven't checked mkfs code in ages) is lock protocol is hard coded into on-disk super block during mkfs time - but fixing this should be trivial. If we allow interchangeable between lock_nolock and lock_dlm, then the filesystem should be able to migrate from single node into cluster environment. It is very nice (IMHO). In the mean time, you can always run GFS(s) using lock_dlm with single node. There are lock overhead though. I understand people may have different opinions about this and certainly don't have time to get into heated debating about this issue right now. BTW, the team will spend this quarter to fine-tune GFS2. Would like to suggest people wait a little bit before putting GFS2 into a production environment. -- Wendy From swhiteho at redhat.com Fri Mar 9 08:42:58 2007 From: swhiteho at redhat.com (Steven Whitehouse) Date: Fri, 09 Mar 2007 08:42:58 +0000 Subject: [Linux-cluster] Using GFS2 as a local file system In-Reply-To: <45F0F226.2020601@redhat.com> References: <08A9A3213527A6428774900A80DBD8D8039FF97C@xmb-sjc-222.amer.cisco.com> <45F0F226.2020601@redhat.com> Message-ID: <1173429778.32601.35.camel@quoit.chygwyn.com> Hi, On Fri, 2007-03-09 at 00:35 -0500, Wendy Cheng wrote: > Lin Shen (lshen) wrote: > > >We have a situation that we may need to use GFS2 to share storage in our > >system in the future and to ease the pain of transition at that time > >(convert files into GFS), we're thinking of using GFS2 just as a local > >file system for now. > > > >How is GFS2 compared to other popular local file systems such as ext3 > >and Reiser in terms of performance, overhead etc? Are we hitting the > >wrong direction totally by using GFS2 just as a local file system? > > > >BTW, we've run bonnie on local GFS2, and the performance is decent > >compared to ext3 (90%). > > > > > > > > > I personally think using GFS (both GFS1 and GFS2) as a local filesystem > has many advantages. The only issue (I think ..haven't checked mkfs code > in ages) is lock protocol is hard coded into on-disk super block during > mkfs time - but fixing this should be trivial. If we allow > interchangeable between lock_nolock and lock_dlm, then the filesystem > should be able to migrate from single node into cluster environment. It > is very nice (IMHO). > You can override the settings in the sb on the mount command line, Steve. 
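For example (a sketch only -- lockproto/locktable are the standard gfs2 mount
options, but the device, cluster and filesystem names here are made up):

    # filesystem created for cluster use
    mkfs.gfs2 -p lock_dlm -t mycluster:myfs -j 2 /dev/vg/lv

    # mounted on a single node by overriding the superblock setting
    mount -t gfs2 -o lockproto=lock_nolock /dev/vg/lv /mnt/gfs2

Going the other way, a filesystem made with -p lock_nolock can later be mounted
with -o lockproto=lock_dlm,locktable=mycluster:myfs once it sits on shared
storage, without re-running mkfs.
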
From cluster at defuturo.co.uk Thu Mar 8 18:13:45 2007 From: cluster at defuturo.co.uk (Robert Clark) Date: Thu, 08 Mar 2007 18:13:45 +0000 Subject: [Linux-cluster] cmirror performance Message-ID: <1173377626.2757.63.camel@rutabaga.defuturo.co.uk> I've been trying out cmirror for a few months on a RHEL4U4 cluster and it's now working very well for me, although I've noticed that it does have a bit of a performance hit. My set-up has a 32G GFS filesystem on a mirrored LV shared via AoE (with jumbo frame support). Just using dd with a 4k blocksize to write files on the same LV when it's mirrored and then unmirrored shows a big difference in speed: Unmirrored: 12440kB/s Mirrored: 2969kB/s which I wasn't expecting as my understanding is that the cmirror design introduces very little overhead. The two legs of the mirror are on separate, identical AoE servers and the filesystem is mounted on 3 out of 6 nodes in the cluster. This is with the cmirror-kernel_2_6_9_19 tagged version and I've tried with both core and disk logs. I suspect a bad interaction between cmirror and something else, but I'm not sure where to start looking. Any ideas? Thanks, Robert From rpeterso at redhat.com Fri Mar 9 15:41:10 2007 From: rpeterso at redhat.com (Robert Peterson) Date: Fri, 09 Mar 2007 09:41:10 -0600 Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer In-Reply-To: <0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com> References: <1638.1172372671@sss.pgh.pa.us> <45EF0ACA.1090900@redhat.com> <0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com> Message-ID: <45F18016.3050603@redhat.com> Hi Ivan, Answers embedded below: Ivan Zoratti wrote: > Hi Robert, > > First of all, thanks for your time, I really appreciate it. > I'd like to reply to two separate topics here: first, the objective of > my question and second, the cluster-awareness of MySQL and the use of > GFS with MySQL. > > My original question was mainly related to the use of Piranha to switch > over a service (ie, a specific mysql daemon) from one server to another, > in case of fault. There should be only one active service in the > cluster, therefore no concurrency or locking issues should happen. > The ideal system should be able to: > - have a list of services to launch on the cluster > - identify the node in the cluster suitable to host the service (for > example the node with less workload) > - check the availability of the service > - stop the service on a node (if the service is not already down) and > start the service on another node in case of fault > Fault tolerance in this case will be provided by the ability to switch > the service from one server to another in the cluster. > Scalability is not provided within the service, ie the limitation in > resources for the service consist of the resources available on that > specific server. > > I understand that your cluster suite can provide this functionality. I > am mainly looking for a supported set of features for an enterprise > organisation. Red Hat's Cluster Suite does all of this with the rgmanager service (not piranha). I guess I'm not sure what you're asking here. Are you asking what features rgmanager has? Its features are probably documented somewhere, but I don't know where offhand. I know it's quite full-featured and allows you to do exactly what you listed: provide High Availability (HA) of multiple services, stopping and starting services throughout cluster, with different kinds of dependencies. 
The Cluster FAQ has information on rgmanager here that you may find helpful: http://sources.redhat.com/cluster/faq.html#rgm_what If you have questions that aren't covered by the FAQ, let me know and I'll do my best to answer your questions. > The second topic is related to the use of MySQL with clusters and > specifically with GFS. It is what we use to call MySQL in active-active > clustering. I am afraid your documentation is not totally accurate. > Unfortunately, information on the Internet (and also on our web site) > are often contradictory. > It is indeed possible to run multiple mysqld services on different > cluster nodes, all sharing the same data structure on shared storage, > with this configuration: > - Only the MyISAM storage engine can be used > - Each mysqld service must start with the external-locking parameter on > - Each mysqld service hase to have the query cache parameter off (other > cache mechanisms remain on, since they are automatically invalidated by > external locking) Thanks for providing this information. I'll get it into the cluster FAQ. Maybe some day I'll find the time to play with this myself. > I am afraid this configuration still does not compete against Oracle > RAC. MySQL does not provide a solution that can be compared 1:1 with > RAC. You may find some MySQL implementations much more effective than > RAC for certain environments, as you will certainly find RAC performing > better than MySQL on other implementations. > > Based on the experience of the sales engineering team, customers have > never been disappointed by the technology that MySQL can provide as an > alternative to RAC. Decisions are based on many other factors, such as > the introduction of another (or a different) database, the cost of > migrating current applications and compatibility with third party > products. You can imagine we are working hard to remove these obstacles. > > Thanks again for your help, > > Kind Regards, > > Ivan > > -- > Ivan Zoratti - Sales Engineering Manager EMEA > > MySQL AB - Windsor - UK > Mobile: +44 7866 363 180 > > ivan at mysql.com > http://www.mysql.com If you have other questions, please let me know. You can either email me directly or join the linux-cluster mailing list where you can talk to people are using these features and everyone can benefit from the discussion. Regards, Bob Peterson Red Hat Cluster Suite From wcheng at redhat.com Fri Mar 9 16:15:29 2007 From: wcheng at redhat.com (Wendy Cheng) Date: Fri, 09 Mar 2007 11:15:29 -0500 Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer In-Reply-To: <45F18016.3050603@redhat.com> References: <1638.1172372671@sss.pgh.pa.us> <45EF0ACA.1090900@redhat.com> <0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com> <45F18016.3050603@redhat.com> Message-ID: <45F18821.3040804@redhat.com> Robert Peterson wrote: > Ivan Zoratti wrote: >> >> My original question was mainly related to the use of Piranha to >> switch over a service (ie, a specific mysql daemon) from one server >> to another, in case of fault. There should be only one active service >> in the cluster, therefore no concurrency or locking issues should >> happen. I assume this is a special daemon (say the one controls meta data) among many other mySQL daemons ? >> The ideal system should be able to: >> - have a list of services to launch on the cluster >> - identify the node in the cluster suitable to host the service (for >> example the node with less workload) The only load balancer we have (at this moment) indeed is piranha (LVS). 
However, using load balancer combining with GFS is tricky due to locking overhead (cluster locks are expensive). We do encourage individual file access to stay within one node for a proper length of time if all possible. Judging by your above statement, since switching (that particular ?) service only happens upon fault, this should be ok. Current versions of rgmanager and GFS out in the field do not have workload statistics - so knowing which node has less workload would be tricky (unless you put LVS as the front end). The newest version of cluster software using openais (Steve Dake, cc in this email, is the maintainer) that may have some features that can be used (but I'm not sure). -- Wendy >> - check the availability of the service >> - stop the service on a node (if the service is not already down) and >> start the service on another node in case of fault >> Fault tolerance in this case will be provided by the ability to >> switch the service from one server to another in the cluster. >> Scalability is not provided within the service, ie the limitation in >> resources for the service consist of the resources available on that >> specific server. >> >> I understand that your cluster suite can provide this functionality. >> I am mainly looking for a supported set of features for an enterprise >> organisation. > > Red Hat's Cluster Suite does all of this with the rgmanager service > (not piranha). I guess I'm not sure what you're asking here. Are you > asking what features rgmanager has? Its features are probably documented > somewhere, but I don't know where offhand. I know it's quite > full-featured and allows you to do exactly what you listed: > provide High Availability (HA) of multiple services, stopping and > starting services throughout cluster, with different kinds of > dependencies. The Cluster FAQ has information on rgmanager here > that you may find helpful: > > http://sources.redhat.com/cluster/faq.html#rgm_what > > If you have questions that aren't covered by the FAQ, let me know and > I'll do my best to answer your questions. > >> The second topic is related to the use of MySQL with clusters and >> specifically with GFS. It is what we use to call MySQL in >> active-active clustering. I am afraid your documentation is not >> totally accurate. Unfortunately, information on the Internet (and >> also on our web site) are often contradictory. >> It is indeed possible to run multiple mysqld services on different >> cluster nodes, all sharing the same data structure on shared storage, >> with this configuration: >> - Only the MyISAM storage engine can be used >> - Each mysqld service must start with the external-locking parameter on >> - Each mysqld service hase to have the query cache parameter off >> (other cache mechanisms remain on, since they are automatically >> invalidated by external locking) > > Thanks for providing this information. I'll get it into the cluster FAQ. > Maybe some day I'll find the time to play with this myself. > >> I am afraid this configuration still does not compete against Oracle >> RAC. MySQL does not provide a solution that can be compared 1:1 with >> RAC. You may find some MySQL implementations much more effective than >> RAC for certain environments, as you will certainly find RAC >> performing better than MySQL on other implementations. >> >> Based on the experience of the sales engineering team, customers have >> never been disappointed by the technology that MySQL can provide as >> an alternative to RAC. 
Decisions are based on many other factors, >> such as the introduction of another (or a different) database, the >> cost of migrating current applications and compatibility with third >> party products. You can imagine we are working hard to remove these >> obstacles. >> >> Thanks again for your help, >> >> Kind Regards, >> >> Ivan >> >> -- >> Ivan Zoratti - Sales Engineering Manager EMEA >> >> MySQL AB - Windsor - UK >> Mobile: +44 7866 363 180 >> >> ivan at mysql.com >> http://www.mysql.com > > If you have other questions, please let me know. You can either > email me directly or join the linux-cluster mailing list where you can > talk to people are using these features and everyone can benefit from > the discussion. > > Regards, > > Bob Peterson > Red Hat Cluster Suite > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jbrassow at redhat.com Fri Mar 9 17:49:34 2007 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Fri, 9 Mar 2007 11:49:34 -0600 Subject: [Linux-cluster] cmirror performance In-Reply-To: <1173377626.2757.63.camel@rutabaga.defuturo.co.uk> References: <1173377626.2757.63.camel@rutabaga.defuturo.co.uk> Message-ID: Nope, the first version is just slow. Next version should be coming with RHEL5.X (and should be going upstream), which should be faster. I just wrote up a perl script (which I haven't had a chance to really clean-up yet) that will give performance numbers for various request/transfer sizes. I'm including it at the end. You must have lmbench package installed (for 'lmdd'). Then run: # to give you read performance numbers 'perf_matrix.pl if=' # to give you write performance numbers 'perf_matrix.pl of=' # to do multiple runs and discard numbers outside the std deviation # The more iterations you do, the more accurate your results 'perf_matrix.pl if= iter=5 For more information on the options, do 'perf_matrix.pl -h'. Using the above, you can compare the numbers you're getting from the base device, linear target, mirror target, etc over a wide range of transfer/request sizes. Let's take a look at a couple examples. (Request sizes increase to the right by powers of two starting at 1kiB. Transfer sizes increase by rows by powers of two starting at 1MiB. 
Results are in MiB/sec): prompt> perf_matrix.pl if=/dev/vg/linear iter=5 #linear reads, 5 iterations w/ results averaged 25.24 28.16 28.82 28.93 28.96 29.25 28.72 26.54 27.39 27.84 28.94 0.00 0.00 30.48 31.57 31.66 31.32 31.89 32.19 31.66 32.00 33.98 34.23 31.93 33.30 0.00 34.00 33.46 33.39 33.12 33.50 34.32 33.57 33.78 34.81 35.03 33.68 34.25 34.68 34.82 34.33 34.32 34.20 34.49 34.89 35.20 35.24 35.39 35.33 35.56 34.94 35.18 35.50 35.37 35.53 35.37 35.54 35.53 35.41 35.60 35.38 35.53 35.54 35.45 35.33 35.72 35.76 35.82 35.81 35.81 35.80 35.81 35.82 35.81 35.84 35.66 35.78 35.76 35.96 35.97 35.87 35.91 35.98 35.99 35.97 35.97 35.98 35.99 35.90 35.96 35.95 36.05 36.05 36.05 36.03 36.03 36.03 36.06 36.08 36.06 36.07 36.07 36.06 36.06 36.10 36.08 36.08 36.08 36.08 36.10 36.08 36.09 36.10 36.11 36.09 36.10 36.11 36.11 36.11 36.11 36.11 36.11 36.11 36.11 36.12 36.12 36.12 36.12 36.12 36.12 36.13 36.12 36.12 36.12 36.12 36.12 36.13 36.12 36.12 36.13 36.13 36.13 36.13 prompt> perf_matrix.pl of=/dev/vg/linear iter=5 #linear writes, 5 iterations w/ results averaged 11.74 9.00 31.77 31.82 31.78 31.84 31.93 32.03 32.37 32.98 34.52 0.00 0.00 9.14 9.65 33.57 33.65 33.64 33.65 33.70 33.79 33.99 34.33 35.12 33.36 0.00 9.63 9.70 33.03 33.01 34.65 34.65 34.67 33.09 33.16 33.35 33.70 32.88 34.42 9.60 9.66 33.30 32.35 33.47 33.49 33.49 32.73 33.36 33.65 33.84 33.41 33.37 9.68 9.74 33.31 33.36 32.90 32.94 32.94 33.21 33.08 32.99 33.16 33.33 32.59 9.66 9.74 32.88 33.14 33.47 33.38 33.20 33.60 33.18 33.35 33.15 33.10 33.22 9.68 9.73 32.66 32.73 33.30 33.39 33.22 33.18 33.23 32.97 33.01 33.10 33.13 9.69 9.74 33.06 33.28 33.37 33.45 33.32 33.53 33.27 33.34 33.16 33.05 33.08 9.59 9.66 31.88 32.34 32.14 32.41 33.21 32.49 32.41 32.47 32.39 32.69 32.05 9.47 9.58 32.87 32.79 32.80 32.84 33.09 32.96 32.99 32.95 32.65 32.59 32.83 9.45 9.52 33.35 33.10 33.17 33.12 33.05 33.12 33.97 33.14 32.72 33.07 33.24 # if I redirect the above output to files, I can then diff them prompt> perf_matrix.pl diff clinear-read.txt clinear-write.txt -53.49% -68.04% 10.24% 9.99% 9.74% 8.85% 11.18% 20.69% 18.18% 18.46% 19.28% -.--% -.--% -70.01% -69.43% 6.03% 7.44% 5.49% 4.54% 6.44% 5.59% 0.03% 0.29% 9.99% 0.18% -.--% -71.68% -71.01% -1.08% -0.33% 3.43% 0.96% 3.28% -2.04% -4.74% -4.80% 0.06% -4.00% -0.75% -72.43% -71.86% -2.97% -5.41% -2.96% -4.01% -4.86% -7.12% -5.74% -4.76% -4.84% -4.38% -5.14% -72.73% -72.46% -6.25% -5.68% -7.43% -7.29% -6.98% -6.71% -6.50% -7.15% -6.70% -5.98% -7.76% -72.96% -72.76% -8.21% -7.46% -6.53% -6.76% -7.29% -6.20% -7.34% -6.95% -7.04% -7.49% -7.10% -73.08% -72.95% -8.95% -8.86% -7.45% -7.22% -7.65% -7.76% -7.64% -8.39% -8.05% -7.95% -7.84% -73.12% -72.98% -8.29% -7.63% -7.38% -7.16% -7.60% -7.07% -7.74% -7.57% -8.07% -8.35% -8.26% -73.43% -73.23% -11.64% -10.37% -10.92% -10.22% -7.95% -9.98% -10.22% -10.08% -10.25% -9.45% -11.24% -73.77% -73.47% -8.97% -9.19% -9.17% -9.06% -8.36% -8.75% -8.67% -8.78% -9.61% -9.77% -9.11% -73.84% -73.64% -7.67% -8.36% -8.17% -8.31% -8.52% -8.31% -5.95% -8.28% -9.44% -8.47% -8.00% I can see that writes for a linear device are much worse when request sizes are small, but get reasonably close when request sizes are >= 4kiB. I haven't had a chance to do this with (cluster) mirrors yet. It would be interesting to see the difference in performance from linear -> mirror and mirror -> cmirror... Once things are truly stable, I will concentrate more on performance. (Also note: While a mirroring is sync'ing itself, performance for nominal operations will be degraded.) 
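So for the comparison Robert is after, the workflow would look roughly like this
(device names are placeholders -- /dev/vg/linear and /dev/vg/mirror stand for the
unmirrored and mirrored LVs, and note that of= writes to the device directly, so
don't point it at an LV whose contents you care about):

    perf_matrix.pl of=/dev/vg/linear iter=5 > linear-write.txt
    perf_matrix.pl of=/dev/vg/mirror iter=5 > mirror-write.txt
    perf_matrix.pl diff linear-write.txt mirror-write.txt

which should show whether the gap is across the board or mostly at small request
sizes, as with the 4k dd runs.
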
brassow -------------- next part -------------- A non-text attachment was scrubbed... Name: perf_matrix.pl Type: application/octet-stream Size: 7099 bytes Desc: not available URL: -------------- next part -------------- On Mar 8, 2007, at 12:13 PM, Robert Clark wrote: > I've been trying out cmirror for a few months on a RHEL4U4 cluster > and > it's now working very well for me, although I've noticed that it does > have a bit of a performance hit. > > My set-up has a 32G GFS filesystem on a mirrored LV shared via AoE > (with jumbo frame support). Just using dd with a 4k blocksize to write > files on the same LV when it's mirrored and then unmirrored shows a big > difference in speed: > > Unmirrored: 12440kB/s > Mirrored: 2969kB/s > > which I wasn't expecting as my understanding is that the cmirror design > introduces very little overhead. > > The two legs of the mirror are on separate, identical AoE servers and > the filesystem is mounted on 3 out of 6 nodes in the cluster. This is > with the cmirror-kernel_2_6_9_19 tagged version and I've tried with > both > core and disk logs. > > I suspect a bad interaction between cmirror and something else, but > I'm not sure where to start looking. Any ideas? > > Thanks, > > Robert > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From sdake at redhat.com Fri Mar 9 19:41:54 2007 From: sdake at redhat.com (Steven Dake) Date: Fri, 09 Mar 2007 12:41:54 -0700 Subject: [Linux-cluster] Re: [Openais] Xen and Cluster Manager (OpenAIS::TOTEM) In-Reply-To: <45F1739E.6040403@startx.fr> References: <45F1739E.6040403@startx.fr> Message-ID: <1173469314.18932.1.camel@shih.broked.org> On Fri, 2007-03-09 at 15:47 +0100, Fabien MALFOY wrote: > Hi > > I'm trying to build a two nodes cluster with Cluster Manager on Fedora > Core 6. As I dispose of only one machine and that my solution needs > three, I decided to create two XEN virtual Fedora Core 6 (M1 and M2). I > use the Dom0 (M0) host as the storage back-end. All used packages are > those of standard repositories of Fedora Core 6 (including Xen, Cman, > OpenAIS...). So, the cluster will be made of the two DomU. > > But i'm not able to start correctly the Cman service on M1 and M2. > Here's a part of the log : > > Mar 7 15:29:17 M1 openais[2220]: [CMAN ] CMAN 2.0.60 (built Jan 24 2007 > 15:30:39) started > Mar 7 15:29:17 M1 openais[2220]: [SYNC ] Not using a virtual synchrony > filter. > Mar 7 15:29:17 M1 openais[2220]: [MAIN ] AIS Executive Service: started > and ready to provide service. > Mar 7 15:29:18 M1 ccsd[2214]: Initial status:: Inquorate > *Mar 7 15:29:32 M1 openais[2220]: [TOTEM] The consensus timeout expired. > Mar 7 15:29:32 M1 openais[2220]: [TOTEM] entering GATHER state from 3. > Mar 7 15:29:47 M1 openais[2220]: [TOTEM] The consensus timeout expired. > Mar 7 15:29:47 M1 openais[2220]: [TOTEM] entering GATHER state from 3. > Mar 7 15:30:02 M1 openais[2220]: [TOTEM] The consensus timeout expired. > Mar 7 15:30:02 M1 openais[2220]: [TOTEM] entering GATHER state from 3. > > *This till the timeout and cman fails to start. To ensure that my > configuration was well formed, I adapted my cluster.conf to exclude M2, > replaced by M0. In this case, the cman service starts correctly on M0 > and warns about the absence of M1. On M1, the service fails to start > with the same log. Is there a matter between Xen and Cluster Manager ? > Thanks for your help. 
> Fabien, I believe the problem you have is that your default firewall rules setup by fedora core 6 are not allowing the openais protocol to reach consensus. To start off with, try turning off your firewall. Regards -steve From jeff3140 at gmail.com Fri Mar 9 21:47:16 2007 From: jeff3140 at gmail.com (Jeff) Date: Fri, 9 Mar 2007 16:47:16 -0500 Subject: [Linux-cluster] DLM internals In-Reply-To: <20070215151426.GA18284@redhat.com> References: <20070215151426.GA18284@redhat.com> Message-ID: On 2/15/07, David Teigland wrote: > This is an excellent description of a dlm and the general ideas/logic > reflect very well our own dlm: > > http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf > > Dave There is a document in a DLM Source tree, http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm/doc/?cvsroot=cluster which briefly describes the DLM API. Is there anything else available? In particular I was looking for how VALNOTVALID errors are handled with a node crashes with an CR/EX lock and it had read the lock value block. From jbrassow at redhat.com Fri Mar 9 21:49:35 2007 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Fri, 9 Mar 2007 15:49:35 -0600 Subject: [Linux-cluster] cmirror performance In-Reply-To: References: <1173377626.2757.63.camel@rutabaga.defuturo.co.uk> Message-ID: clean-up of previously posted script (plus colorized diff output for easier reading). brassow -------------- next part -------------- A non-text attachment was scrubbed... Name: perf_matrix.pl Type: application/octet-stream Size: 7334 bytes Desc: not available URL: -------------- next part -------------- On Mar 9, 2007, at 11:49 AM, Jonathan E Brassow wrote: > Nope, the first version is just slow. Next version should be coming > with RHEL5.X (and should be going upstream), which should be faster. > > I just wrote up a perl script (which I haven't had a chance to really > clean-up yet) that will give performance numbers for various > request/transfer sizes. I'm including it at the end. > > You must have lmbench package installed (for 'lmdd'). Then run: > # to give you read performance numbers > 'perf_matrix.pl if=' > > # to give you write performance numbers > 'perf_matrix.pl of=' > > # to do multiple runs and discard numbers outside the std deviation > # The more iterations you do, the more accurate your results > 'perf_matrix.pl if= iter=5 > > For more information on the options, do 'perf_matrix.pl -h'. > > Using the above, you can compare the numbers you're getting from the > base device, linear target, mirror target, etc over a wide range of > transfer/request sizes. > > Let's take a look at a couple examples. (Request sizes increase to > the right by powers of two starting at 1kiB. Transfer sizes increase > by rows by powers of two starting at 1MiB. 
Results are in MiB/sec): > prompt> perf_matrix.pl if=/dev/vg/linear iter=5 #linear reads, 5 > iterations w/ results averaged > 25.24 28.16 28.82 28.93 28.96 29.25 28.72 26.54 27.39 27.84 28.94 > 0.00 0.00 > 30.48 31.57 31.66 31.32 31.89 32.19 31.66 32.00 33.98 34.23 31.93 > 33.30 0.00 > 34.00 33.46 33.39 33.12 33.50 34.32 33.57 33.78 34.81 35.03 33.68 > 34.25 34.68 > 34.82 34.33 34.32 34.20 34.49 34.89 35.20 35.24 35.39 35.33 35.56 > 34.94 35.18 > 35.50 35.37 35.53 35.37 35.54 35.53 35.41 35.60 35.38 35.53 35.54 > 35.45 35.33 > 35.72 35.76 35.82 35.81 35.81 35.80 35.81 35.82 35.81 35.84 35.66 > 35.78 35.76 > 35.96 35.97 35.87 35.91 35.98 35.99 35.97 35.97 35.98 35.99 35.90 > 35.96 35.95 > 36.05 36.05 36.05 36.03 36.03 36.03 36.06 36.08 36.06 36.07 36.07 > 36.06 36.06 > 36.10 36.08 36.08 36.08 36.08 36.10 36.08 36.09 36.10 36.11 36.09 > 36.10 36.11 > 36.11 36.11 36.11 36.11 36.11 36.11 36.11 36.12 36.12 36.12 36.12 > 36.12 36.12 > 36.13 36.12 36.12 36.12 36.12 36.12 36.13 36.12 36.12 36.13 36.13 > 36.13 36.13 > > prompt> perf_matrix.pl of=/dev/vg/linear iter=5 #linear writes, 5 > iterations w/ results averaged > 11.74 9.00 31.77 31.82 31.78 31.84 31.93 32.03 32.37 32.98 34.52 0.00 > 0.00 > 9.14 9.65 33.57 33.65 33.64 33.65 33.70 33.79 33.99 34.33 35.12 33.36 > 0.00 > 9.63 9.70 33.03 33.01 34.65 34.65 34.67 33.09 33.16 33.35 33.70 32.88 > 34.42 > 9.60 9.66 33.30 32.35 33.47 33.49 33.49 32.73 33.36 33.65 33.84 33.41 > 33.37 > 9.68 9.74 33.31 33.36 32.90 32.94 32.94 33.21 33.08 32.99 33.16 33.33 > 32.59 > 9.66 9.74 32.88 33.14 33.47 33.38 33.20 33.60 33.18 33.35 33.15 33.10 > 33.22 > 9.68 9.73 32.66 32.73 33.30 33.39 33.22 33.18 33.23 32.97 33.01 33.10 > 33.13 > 9.69 9.74 33.06 33.28 33.37 33.45 33.32 33.53 33.27 33.34 33.16 33.05 > 33.08 > 9.59 9.66 31.88 32.34 32.14 32.41 33.21 32.49 32.41 32.47 32.39 32.69 > 32.05 > 9.47 9.58 32.87 32.79 32.80 32.84 33.09 32.96 32.99 32.95 32.65 32.59 > 32.83 > 9.45 9.52 33.35 33.10 33.17 33.12 33.05 33.12 33.97 33.14 32.72 33.07 > 33.24 > > # if I redirect the above output to files, I can then diff them > prompt> perf_matrix.pl diff clinear-read.txt clinear-write.txt > -53.49% -68.04% 10.24% 9.99% 9.74% 8.85% 11.18% 20.69% 18.18% 18.46% > 19.28% -.--% -.--% > -70.01% -69.43% 6.03% 7.44% 5.49% 4.54% 6.44% 5.59% 0.03% 0.29% 9.99% > 0.18% -.--% > -71.68% -71.01% -1.08% -0.33% 3.43% 0.96% 3.28% -2.04% -4.74% -4.80% > 0.06% -4.00% -0.75% > -72.43% -71.86% -2.97% -5.41% -2.96% -4.01% -4.86% -7.12% -5.74% > -4.76% -4.84% -4.38% -5.14% > -72.73% -72.46% -6.25% -5.68% -7.43% -7.29% -6.98% -6.71% -6.50% > -7.15% -6.70% -5.98% -7.76% > -72.96% -72.76% -8.21% -7.46% -6.53% -6.76% -7.29% -6.20% -7.34% > -6.95% -7.04% -7.49% -7.10% > -73.08% -72.95% -8.95% -8.86% -7.45% -7.22% -7.65% -7.76% -7.64% > -8.39% -8.05% -7.95% -7.84% > -73.12% -72.98% -8.29% -7.63% -7.38% -7.16% -7.60% -7.07% -7.74% > -7.57% -8.07% -8.35% -8.26% > -73.43% -73.23% -11.64% -10.37% -10.92% -10.22% -7.95% -9.98% -10.22% > -10.08% -10.25% -9.45% -11.24% > -73.77% -73.47% -8.97% -9.19% -9.17% -9.06% -8.36% -8.75% -8.67% > -8.78% -9.61% -9.77% -9.11% > -73.84% -73.64% -7.67% -8.36% -8.17% -8.31% -8.52% -8.31% -5.95% > -8.28% -9.44% -8.47% -8.00% > > I can see that writes for a linear device are much worse when request > sizes are small, but get reasonably close when request sizes are >= > 4kiB. > > I haven't had a chance to do this with (cluster) mirrors yet. It > would be interesting to see the difference in performance from linear > -> mirror and mirror -> cmirror... 
> > Once things are truly stable, I will concentrate more on performance. > (Also note: While a mirroring is sync'ing itself, performance for > nominal operations will be degraded.) > > brassow > > > > > On Mar 8, 2007, at 12:13 PM, Robert Clark wrote: > >> I've been trying out cmirror for a few months on a RHEL4U4 cluster >> and >> it's now working very well for me, although I've noticed that it does >> have a bit of a performance hit. >> >> My set-up has a 32G GFS filesystem on a mirrored LV shared via AoE >> (with jumbo frame support). Just using dd with a 4k blocksize to write >> files on the same LV when it's mirrored and then unmirrored shows a >> big >> difference in speed: >> >> Unmirrored: 12440kB/s >> Mirrored: 2969kB/s >> >> which I wasn't expecting as my understanding is that the cmirror >> design >> introduces very little overhead. >> >> The two legs of the mirror are on separate, identical AoE servers >> and >> the filesystem is mounted on 3 out of 6 nodes in the cluster. This is >> with the cmirror-kernel_2_6_9_19 tagged version and I've tried with >> both >> core and disk logs. >> >> I suspect a bad interaction between cmirror and something else, but >> I'm not sure where to start looking. Any ideas? >> >> Thanks, >> >> Robert >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From ivan at mysql.com Fri Mar 9 20:18:34 2007 From: ivan at mysql.com (Ivan Zoratti) Date: Fri, 9 Mar 2007 20:18:34 +0000 Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer In-Reply-To: <45F18016.3050603@redhat.com> References: <1638.1172372671@sss.pgh.pa.us> <45EF0ACA.1090900@redhat.com> <0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com> <45F18016.3050603@redhat.com> Message-ID: Hi Robert, This is great news, thanks a lot. You (and Wendy in another email) have answered my questions, now I will start digging into more details through the documentation. Should you need any information regarding MySQL, I would be glad to help. Thanks again, Kind Regards, Ivan -- Ivan Zoratti - Sales Engineering Manager EMEA MySQL AB - Windsor - UK Mobile: +44 7866 363 180 ivan at mysql.com http://www.mysql.com -- On 9 Mar 2007, at 15:41, Robert Peterson wrote: > Hi Ivan, > > Answers embedded below: > > Ivan Zoratti wrote: >> Hi Robert, >> First of all, thanks for your time, I really appreciate it. >> I'd like to reply to two separate topics here: first, the >> objective of my question and second, the cluster-awareness of >> MySQL and the use of GFS with MySQL. >> My original question was mainly related to the use of Piranha to >> switch over a service (ie, a specific mysql daemon) from one >> server to another, in case of fault. There should be only one >> active service in the cluster, therefore no concurrency or locking >> issues should happen. >> The ideal system should be able to: >> - have a list of services to launch on the cluster >> - identify the node in the cluster suitable to host the service >> (for example the node with less workload) >> - check the availability of the service >> - stop the service on a node (if the service is not already down) >> and start the service on another node in case of fault >> Fault tolerance in this case will be provided by the ability to >> switch the service from one server to another in the cluster. 
>> Scalability is not provided within the service, ie the limitation >> in resources for the service consist of the resources available on >> that specific server. >> I understand that your cluster suite can provide this >> functionality. I am mainly looking for a supported set of features >> for an enterprise organisation. > > Red Hat's Cluster Suite does all of this with the rgmanager service > (not piranha). I guess I'm not sure what you're asking here. Are you > asking what features rgmanager has? Its features are probably > documented > somewhere, but I don't know where offhand. I know it's quite > full-featured and allows you to do exactly what you listed: > provide High Availability (HA) of multiple services, stopping and > starting services throughout cluster, with different kinds of > dependencies. The Cluster FAQ has information on rgmanager here > that you may find helpful: > > http://sources.redhat.com/cluster/faq.html#rgm_what > > If you have questions that aren't covered by the FAQ, let me know > and I'll do my best to answer your questions. > >> The second topic is related to the use of MySQL with clusters and >> specifically with GFS. It is what we use to call MySQL in active- >> active clustering. I am afraid your documentation is not totally >> accurate. Unfortunately, information on the Internet (and also on >> our web site) are often contradictory. >> It is indeed possible to run multiple mysqld services on different >> cluster nodes, all sharing the same data structure on shared >> storage, with this configuration: >> - Only the MyISAM storage engine can be used >> - Each mysqld service must start with the external-locking >> parameter on >> - Each mysqld service hase to have the query cache parameter off >> (other cache mechanisms remain on, since they are automatically >> invalidated by external locking) > > Thanks for providing this information. I'll get it into the > cluster FAQ. > Maybe some day I'll find the time to play with this myself. > >> I am afraid this configuration still does not compete against >> Oracle RAC. MySQL does not provide a solution that can be compared >> 1:1 with RAC. You may find some MySQL implementations much more >> effective than RAC for certain environments, as you will certainly >> find RAC performing better than MySQL on other implementations. >> Based on the experience of the sales engineering team, customers >> have never been disappointed by the technology that MySQL can >> provide as an alternative to RAC. Decisions are based on many >> other factors, such as the introduction of another (or a >> different) database, the cost of migrating current applications >> and compatibility with third party products. You can imagine we >> are working hard to remove these obstacles. >> Thanks again for your help, >> Kind Regards, >> Ivan >> -- >> Ivan Zoratti - Sales Engineering Manager EMEA >> MySQL AB - Windsor - UK >> Mobile: +44 7866 363 180 >> ivan at mysql.com >> http://www.mysql.com > > If you have other questions, please let me know. You can either > email me directly or join the linux-cluster mailing list where you > can talk to people are using these features and everyone can > benefit from the discussion. 
> > Regards, > > Bob Peterson > Red Hat Cluster Suite From sail at serverengines.com Fri Mar 9 23:30:15 2007 From: sail at serverengines.com (Sai Loganathan) Date: Fri, 9 Mar 2007 15:30:15 -0800 Subject: [Linux-cluster] cluster not doing failover Message-ID: <012d01c762a2$e439dd90$2702140a@se19261f2cf9ed> Hello, I am setting up a 2 node redhat cluster to test failover as part of testing effort in my company for the iscsi product we develop. I have a iscsi target which is my cluster shared storage. Downloaded and compiled the open source redhat cluster and installed the cluster components in both the nodes. Logged-into the iscsi target, created a gfs filesystem and mounted the lun on both the nodes. Created the cluster.conf using system-config-cluster gui and below that cluster.conf Using the cluster ip address (172.40.2.119), I was able to do an nfs mount of the shared lun from a 3rd machine. Started an infinite ls on that lun. To simulate failover, I just powered-down the node1 and hoping to see the read io stop but resume via the node2. But, I see the following error message on the node 2. Mar 9 12:14:49 node2 fenced[7422]: fence "node1" failed Mar 9 12:14:54 node2 fenced[7422]: fencing node "node1" Mar 9 12:14:54 node2 fenced[7422]: agent "fence_ilo" reports: Can't call method "configure" on an undefined value at /sbin/fence_ilo line 169, <> line 4. Mar 9 12:14:54 node2 fenced[7422]: fence "node1" failed Mar 9 12:14:59 node2 fenced[7422]: fencing node "node1" Mar 9 12:14:59 node2 fenced[7422]: agent "fence_ilo" reports: Can't call method "configure" on an undefined value at /sbin/fence_ilo line 169, <> line 4. Seems like I am not doing something correct with respect to fencing. Can I setup cluster without fencing first of all? I don't have any of the fencing power devices. In that case, how do I do fencing? Any help would be greatly appreciated. Thanks, Sai Logan _________________________________________________________________________________________________________________ This message and any attachment are confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient please telephone or e-mail the sender and delete this message and all attachments from your system - ServerEngines LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrassow at redhat.com Sat Mar 10 01:53:40 2007 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Fri, 9 Mar 2007 19:53:40 -0600 Subject: [Linux-cluster] cluster not doing failover In-Reply-To: <012d01c762a2$e439dd90$2702140a@se19261f2cf9ed> References: <012d01c762a2$e439dd90$2702140a@se19261f2cf9ed> Message-ID: <40407159e8e6506b05d46c82d921d936@redhat.com> On Mar 9, 2007, at 5:30 PM, Sai Loganathan wrote: > ??????????? > ??????????????????????? login="admin" name="node1_fence" passwd="admin"/> > ??????????????????????? login="admin" name="node2_fence" passwd="admin"/> > ??????????? The above line look funny to me. The hostname for the fence device is "admin"? > Using the cluster ip address (172.40.2.119), I was able to do an nfs > mount of the shared lun from a 3rd machine. Started an infinite ls on > that lun. > To simulate failover, I just powered-down the node1 and hoping to see > the read io stop but resume via the node2. But, I see the following > error message on the node 2. > Mar? 9 12:14:49 node2 fenced[7422]: fence "node1" failed > Mar? 9 12:14:54 node2 fenced[7422]: fencing node "node1" > Mar? 
9 12:14:54 node2 fenced[7422]: agent "fence_ilo" reports: Can't > call method "configure" on an undefined value at /sbin/fence_ilo line > 169, <> line 4. > Mar? 9 12:14:54 node2 fenced[7422]: fence "node1" failed > Mar? 9 12:14:59 node2 fenced[7422]: fencing node "node1" > Mar? 9 12:14:59 node2 fenced[7422]: agent "fence_ilo" reports: Can't > call method "configure" on an undefined value at /sbin/fence_ilo line > 169, <> line 4. > ? > Seems like I am not doing something correct with respect to fencing. > Can I setup cluster without fencing first of all? Yes. You can use manual fencing. That should only be used for testing purposes though... it is not a supported configuration. brassow -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/enriched Size: 3471 bytes Desc: not available URL: From mallah.rajesh at gmail.com Sat Mar 10 18:31:21 2007 From: mallah.rajesh at gmail.com (Rajesh Kumar Mallah) Date: Sun, 11 Mar 2007 00:01:21 +0530 Subject: [Linux-cluster] problem compiling redhat cluster2 with 2.6.9 Message-ID: Hi, I am facing problem in compiling cluster 1 or cluster 2 with kernel - 2.6.9. cluster sources were picked from ftp://sources.redhat.com/pub/cluster/releases/ can anyone please tell if its possible to compile GFS1 or 2 with kernel 2.6.9 any help is appreciated. transcript: cluster-2.00.00]# ./configure --kernel_src=/usr/src/vanilla/linux-2.6.9 configure gnbd-kernel Configuring Makefiles for your system... Completed Makefile configuration configure ccs Configuring Makefiles for your system... Completed Makefile configuration configure cman Configuring Makefiles for your system... Completed Makefile configuration configure group Configuring Makefiles for your system... Completed Makefile configuration configure dlm Configuring Makefiles for your system... Completed Makefile configuration configure fence Configuring Makefiles for your system... Completed Makefile configuration configure gfs-kernel Configuring Makefiles for your system... Completed Makefile configuration configure gfs Configuring Makefiles for your system... Completed Makefile configuration configure gfs2 Configuring Makefiles for your system... Completed Makefile configuration configure gnbd Configuring Makefiles for your system... Completed Makefile configuration configure rgmanager Configuring Makefiles for your system... 
From mallah.rajesh at gmail.com Sat Mar 10 18:31:21 2007
From: mallah.rajesh at gmail.com (Rajesh Kumar Mallah)
Date: Sun, 11 Mar 2007 00:01:21 +0530
Subject: [Linux-cluster] problem compiling redhat cluster2 with 2.6.9
Message-ID: 

Hi,

I am facing a problem compiling cluster 1 or cluster 2 with kernel 2.6.9.
The cluster sources were picked from
ftp://sources.redhat.com/pub/cluster/releases/

Can anyone please tell me if it is possible to compile GFS1 or GFS2 with
kernel 2.6.9? Any help is appreciated.

transcript:

cluster-2.00.00]# ./configure --kernel_src=/usr/src/vanilla/linux-2.6.9
configure gnbd-kernel
Configuring Makefiles for your system...
Completed Makefile configuration
configure ccs
Configuring Makefiles for your system...
Completed Makefile configuration
configure cman
Configuring Makefiles for your system...
Completed Makefile configuration
configure group
Configuring Makefiles for your system...
Completed Makefile configuration
configure dlm
Configuring Makefiles for your system...
Completed Makefile configuration
configure fence
Configuring Makefiles for your system...
Completed Makefile configuration
configure gfs-kernel
Configuring Makefiles for your system...
Completed Makefile configuration
configure gfs
Configuring Makefiles for your system...
Completed Makefile configuration
configure gfs2
Configuring Makefiles for your system...
Completed Makefile configuration
configure gnbd
Configuring Makefiles for your system...
Completed Makefile configuration
configure rgmanager
Configuring Makefiles for your system...
Completed Makefile configuration
[root at IPDDFG0595ATL2 cluster-2.00.00]# make
make -C gnbd-kernel all
make[1]: Entering directory `/opt/sources/gfs/cluster-2.00.00/gnbd-kernel'
make -C src all
make[2]: Entering directory `/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src'
make -C /usr/src/vanilla/linux-2.6.9 M=/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src modules USING_KBUILD=yes
make[3]: Entering directory `/usr/src/vanilla/linux-2.6.9'
  CC [M]  /opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.o
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c: In function `store_sectors':
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:181: warning: implicit declaration of function `mutex_lock'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:181: structure has no member named `i_mutex'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:183: warning: implicit declaration of function `mutex_unlock'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:183: structure has no member named `i_mutex'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c: In function `gnbd_end_request':
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:282: too many arguments to function `end_that_request_last'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c: In function `do_gnbd_request':
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:578: structure has no member named `cmd_type'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c: In function `gnbd_init':
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:888: structure has no member named `cmd_type'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:888: `REQ_TYPE_SPECIAL' undeclared (first use in this function)
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:888: (Each undeclared identifier is reported only once
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:888: for each function it appears in.)
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:893: structure has no member named `cmd_type'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:913: incompatible type for argument 1 of `elevator_exit'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:914: warning: passing arg 2 of `elevator_init' from incompatible pointer type
make[4]: *** [/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.o] Error 1
make[3]: *** [_module_/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src] Error 2
make[3]: Leaving directory `/usr/src/vanilla/linux-2.6.9'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/opt/sources/gfs/cluster-2.00.00/gnbd-kernel'
make: *** [all] Error 2

Regds
mallah.
From fajarpri at cbn.net.id Mon Mar 12 13:13:19 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Mon, 12 Mar 2007 20:13:19 +0700
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
Message-ID: <200703122013.20165.fajarpri@cbn.net.id>

Hi All,
I'd like to set up RHCS for Postgres on a 2-node cluster.
What is the best way to set up the resources? No GFS.
- Do I need to set up a script for the postgres init.d? Or should I just
leave it on on both servers from chkconfig?

Thanks.
-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial
http://linux2.arinet.org
8:09pm up 12:12, 2.6.18.2-34-default GNU/Linux
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 

From devrim at gunduz.org Mon Mar 12 13:39:57 2007
From: devrim at gunduz.org (=?iso-8859-9?Q?Devrim_G=DCND=DCZ?=)
Date: Mon, 12 Mar 2007 15:39:57 +0200 (EET)
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <200703122013.20165.fajarpri@cbn.net.id>
References: <200703122013.20165.fajarpri@cbn.net.id>
Message-ID: 

Hi,

On Mon, 12 Mar 2007, Fajar Priyanto wrote:

> I'd like to set up RHCS for Postgres on a 2-node cluster.
> What is the best way to set up the resources? No GFS.

I'd use GFS to make sure that data is not corrupted. You can "live" with
ext3 -- but in case of a problem, if two postmasters directly access the
same node, data problems may occur.

http://www.gunduz.org/download.php?dlid=142 is the link to the presentation
that I made at the 10th PostgreSQL Anniversary Summit last year, which will
give you a basic idea about these issues.

> - Do I need to set up a script for the postgres init.d? Or should I just
> leave it on on both servers from chkconfig?

AFAIR you will need to change condstart routines to start routines for
RHCS to start PostgreSQL correctly. So PostgreSQL's init script should be
fine.

Regards,
-- 
Devrim GÜNDÜZ
devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org

From lhh at redhat.com Mon Mar 12 18:07:54 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 12 Mar 2007 14:07:54 -0400
Subject: [Linux-cluster] Re: Failover not working
In-Reply-To: <45F06FF2.6020309@eons.com>
References: <45F036A4.1090106@eons.com> <45F06FF2.6020309@eons.com>
Message-ID: <1173722874.4557.23.camel@asuka.boston.devel.redhat.com>

On Thu, 2007-03-08 at 15:20 -0500, Dave Berry wrote:
> rgmanager-1.9.54-1
>
> > That shouldn't happen - what rgmanager RPM do you have?
>
> Ok, I'll poke around.

If you have an easy way to reproduce this, could you file a bugzilla with
the exact necessary steps?

-- Lon

From lhh at redhat.com Mon Mar 12 18:08:59 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 12 Mar 2007 14:08:59 -0400
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <200703122013.20165.fajarpri@cbn.net.id>
References: <200703122013.20165.fajarpri@cbn.net.id>
Message-ID: <1173722939.4557.25.camel@asuka.boston.devel.redhat.com>

On Mon, 2007-03-12 at 20:13 +0700, Fajar Priyanto wrote:
> Hi All,
> I'd like to set up RHCS for Postgres on a 2-node cluster.
> What is the best way to set up the resources? No GFS.
> - Do I need to set up a script for the postgres init.d? Or should I just
> leave it on on both servers from chkconfig?
>
> Thanks.

There's a postgres resource agent in CVS ... maybe you should start
there? :)

-- Lon
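
To make the resource layout concrete: with the script-resource approach
discussed in the thread above, a simple failover service for PostgreSQL on
plain ext3 (no GFS) is typically declared along these lines in the rm
section of cluster.conf. This is only a sketch; the IP address, device,
mount point and init script path are placeholders.

    <rm>
      <failoverdomains>
        <failoverdomain name="pg-domain" ordered="1" restricted="1">
          <failoverdomainnode name="node1" priority="1"/>
          <failoverdomainnode name="node2" priority="2"/>
        </failoverdomain>
      </failoverdomains>
      <resources>
        <!-- floating service IP that follows the service between nodes -->
        <ip address="192.168.1.100" monitor_link="1"/>
        <!-- ext3 data directory, mounted only on the node running the service -->
        <fs name="pgdata" device="/dev/sdb1" mountpoint="/var/lib/pgsql" fstype="ext3" force_unmount="1"/>
        <!-- stock init script, driven by rgmanager instead of chkconfig -->
        <script name="postgresql" file="/etc/init.d/postgresql"/>
      </resources>
      <service name="postgres" domain="pg-domain" autostart="1">
        <ip ref="192.168.1.100"/>
        <fs ref="pgdata"/>
        <script ref="postgresql"/>
      </service>
    </rm>

With a layout like this, postgresql is normally removed from chkconfig on
both nodes so that only rgmanager (or the dedicated postgres resource agent
mentioned above) starts and stops it, and only on the node that currently
owns the service.
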
From lshen at cisco.com Mon Mar 12 21:41:25 2007
From: lshen at cisco.com (Lin Shen (lshen))
Date: Mon, 12 Mar 2007 14:41:25 -0700
Subject: [Linux-cluster] Using GFS on Compact Flash
Message-ID: <08A9A3213527A6428774900A80DBD8D803A003D6@xmb-sjc-222.amer.cisco.com>

Does it make sense to use GFS (local or cluster mode) on Compact Flash?
Will it greatly reduce the life expectancy of the Compact Flash compared to
using a local file system? The rationale behind this is that GFS will issue
far more writes to the disk for its internal operations (such as DLM
locking, etc.).

Lin

From fajarpri at cbn.net.id Tue Mar 13 02:14:05 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Tue, 13 Mar 2007 09:14:05 +0700
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <1173722939.4557.25.camel@asuka.boston.devel.redhat.com>
References: <200703122013.20165.fajarpri@cbn.net.id> <1173722939.4557.25.camel@asuka.boston.devel.redhat.com>
Message-ID: <200703130914.06009.fajarpri@cbn.net.id>

On Tuesday 13 March 2007 01:08, Lon Hohberger wrote:
> On Mon, 2007-03-12 at 20:13 +0700, Fajar Priyanto wrote:
> > Hi All,
> > I'd like to set up RHCS for Postgres on a 2-node cluster.
> > What is the best way to set up the resources? No GFS.
> > - Do I need to set up a script for the postgres init.d? Or should I just
> > leave it on on both servers from chkconfig?
> >
> > Thanks.
>
> There's a postgres resource agent in CVS ... maybe you should start
> there? :)
>
> -- Lon

Thanks. BTW, can you please send me the patch file for init.d/functions?
The one that modifies the exit code for stopping an already-stopped service?
I've googled for it, but haven't found it yet.
-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial
http://linux2.arinet.org
9:12am up 0:37, 2.6.18.2-34-default GNU/Linux
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 

From fajarpri at cbn.net.id Tue Mar 13 02:17:08 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Tue, 13 Mar 2007 09:17:08 +0700
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <1173722939.4557.25.camel@asuka.boston.devel.redhat.com>
References: <200703122013.20165.fajarpri@cbn.net.id> <1173722939.4557.25.camel@asuka.boston.devel.redhat.com>
Message-ID: <200703130917.09258.fajarpri@cbn.net.id>

On Tuesday 13 March 2007 01:08, Lon Hohberger wrote:
> On Mon, 2007-03-12 at 20:13 +0700, Fajar Priyanto wrote:
> > Hi All,
> > I'd like to set up RHCS for Postgres on a 2-node cluster.
> > What is the best way to set up the resources? No GFS.
> > - Do I need to set up a script for the postgres init.d? Or should I just
> > leave it on on both servers from chkconfig?
> >
> > Thanks.
>
> There's a postgres resource agent in CVS ... maybe you should start
> there? :)
>
> -- Lon

Ugh... found it.
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=111998
-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial
http://linux2.arinet.org
9:16am up 0:42, 2.6.18.2-34-default GNU/Linux
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 
From sail at serverengines.com Tue Mar 13 02:47:45 2007
From: sail at serverengines.com (Sai Loganathan)
Date: Mon, 12 Mar 2007 19:47:45 -0700
Subject: [Linux-cluster] RE: Linux-cluster Digest, Vol 35, Issue 13
In-Reply-To: <20070310170007.4934A731DB@hormel.redhat.com>
References: <20070310170007.4934A731DB@hormel.redhat.com>
Message-ID: <015501c76519$fad5eca0$2702140a@se19261f2cf9ed>

Hello,

Thanks for the info. Now I am doing manual fencing, but I get the following
errors whenever I do a failover:

Mar 12 17:25:50 node2 clurgmgrd[6088]: State change: node1 DOWN
Mar 12 17:25:52 node2 clurgmgrd[6088]: Starting stopped service iscsi_ip
Mar 12 17:25:52 node2 clurgmgrd: [6088]: Adding IPv4 address 172.40.2.119 to eth2
Mar 12 17:25:52 node2 clurgmgrd[6088]: Starting stopped service iscsi_lun
Mar 12 17:25:53 node2 clurgmgrd[6088]: Service iscsi_lun started
Mar 12 17:25:54 node2 clurgmgrd[6088]: Service iscsi_ip started
Mar 12 17:26:24 node2 kernel: CMAN: removing node node1 from the cluster : Missed too many heartbeats
Mar 12 17:26:24 node2 fenced[6040]: node1 not a cluster member after 0 sec post_fail_delay
Mar 12 17:26:24 node2 fenced[6040]: fencing node "node1"
Mar 12 17:26:24 node2 fence_manual: Node node1 needs to be reset before recovery can procede.  Waiting for node1 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n node1)

I just power down node1 to simulate the failover to node2. Unless I execute
the command fence_ack_manual -n node1, the system will not move forward and
waits in fencing. How do I fix this?

During shutdown, I get the following error messages and the system waits
there indefinitely:

Starting Killall:
CMAN: sendmsg failed: -101
WARNING: dlm_emergency_shutdown
SM: 00000003 sm_stop: SG still joined

How do I fix this?

Thanks,
Sai Logan
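
For reference, with manual fencing this waiting behaviour is by design:
fenced deliberately blocks recovery until a human confirms that the dead
node really is off. The acknowledgement step is the command the log message
itself suggests, run on a surviving node:

    # only after verifying that node1 really is powered off
    fence_ack_manual -n node1
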
From fajarpri at cbn.net.id Tue Mar 13 07:32:27 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Tue, 13 Mar 2007 14:32:27 +0700
Subject: [Linux-cluster] If power down, no error?
Message-ID: <200703131432.28935.fajarpri@cbn.net.id>

Hi all,
A friend of mine is setting up a 2-node cluster using RHEL4u3. No GFS.
All failover tests are OK (shutting down eth0, unplugging the cables,
shutting down the service, etc.).
But powering down the currently active node doesn't make the failover occur,
and he says there's no error at all in the log. How come? Very strange.
Here's the conf: