From isplist at logicore.net  Thu Mar  1 01:16:36 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Wed, 28 Feb 2007 19:16:36 -0600
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
Message-ID: <2007228191636.191879@leena>

Is there anyone on this list who can make a suggestion to RH??? I've been 
asking about this for some time now and have no idea where to turn. 

I've been working with Qlogic on my LUN problem for a couple of weeks or so. I 
posted asking about this here but it seems that there aren't any answers here 
either.

The problem is that I need to install RHEL on fibre channel storage volumes. I 
cannot find a way of doing this from the installer. Once the OS is installed I 
can see them but that's of little use. 

In order to see the volumes which I need to gain access to, I need to install 
RHEL and then do the following;

Edit /etc/modprobe.conf
options scsi_mod max_luns=256 dev_flags=INLINE:TF200:0x242
mkinitrd -f /boot/initrd-`uname -r`.img `uname -r`
Reboot;

Then all volumes on the storage device (32 of them) can be seen by the server. 

The problem is with the Red Hat scsi layer, not with the Qlogic driver. That 
is apparent by the fact that the devices can be seen by the HBA BIOS. 
Ultimately this needs to be answered by Red Hat - how to pass "max_luns=xxx" 
to the scsi layer during the OS installation.

Or, if someone could explain how I might be able to build a custom install 
disk, that might work also. Perhaps I can pre-install the code needed so that 
anaconda can see all of the volumes?

Mike





From tmornini at engineyard.com  Thu Mar  1 03:08:42 2007
From: tmornini at engineyard.com (Tom Mornini)
Date: Wed, 28 Feb 2007 19:08:42 -0800
Subject: [Linux-cluster] cmirror status?
Message-ID: <3E495EB1-CB02-4634-98BA-6608F9D1A21E@engineyard.com>

Hello all.

What is the status of cmirror package?

I see that there have been code changes as late as this month.

We're extraordinarily interested in using it!

-- 
-- Tom Mornini, CTO
-- Engine Yard, Ruby on Rails Hosting
-- Reliability, Ease of Use, Scalability
-- (866) 518-YARD (9273)



From Alain.Moulle at bull.net  Thu Mar  1 06:57:50 2007
From: Alain.Moulle at bull.net (Alain Moulle)
Date: Thu, 01 Mar 2007 07:57:50 +0100
Subject: [Linux-cluster] Re: CS4 Update 4 / Oops in dlm module (Alain Moulle)
Message-ID: <45E6796E.5030809@bull.net>

Hi

We test it :
1/ it seems that the services stuck in stopping state is fixed
2/ about DLM Oops, we have not reproduced it but it happens
only once with former rpm version, so ... wait & see ...

3/ we have a problem just after the boot in clurgmgrd, don't
know if it is due to this new rpm or not, but we never had
this problem with former rpm version; syslog gives :
clurgmgrd[7069]: <info> Services Initialized
clurgmgrd[7069]: <crit> #10: Couldn't set up listen socket
and CS4 is stalled on the machine.

Any idea ?

Thanks a lot
Alain

>> Could you install the current rgmanager test RPM:
>>
>> http://people.redhat.com/lhh/rgmanager-1.9.54-3.228823test.i386.rpm
>>
>> ...and see if it goes away?  The above RPM is the same as 1.9.54, but
>> includes fix for an assertion failure, a way to fix services stuck in
>> the stopping state, and (the important one for you) a fix for an
>> intermittent DLM lock leak.
>>
>> ia64/x86_64/srpms here: http://people.redhat.com/lhh/packages.html
>>(Lon Hohberger)





-- 



mailto:Alain.Moulle at bull.net
+------------------------------+--------------------------------+
|	Alain Moull?	       	| from France :	04 76 29 75 99  |
|                              	| FAX number  : 04 76 29 72 49  |
| Bull SA		       	|				|
| 1, Rue de Provence  		| Adr  : FREC B1-041            |
| B.P. 208			|				|
| 38432 Echirolles - CEDEX     	| Email: Alain.Moulle at bull.net  |
| France                       	| BCOM : 229 7599               |
+-------------------------------+-------------------------------+




From jbrassow at redhat.com  Thu Mar  1 21:19:29 2007
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Thu, 1 Mar 2007 15:19:29 -0600
Subject: [Linux-cluster] cmirror status?
In-Reply-To: <3E495EB1-CB02-4634-98BA-6608F9D1A21E@engineyard.com>
References: <3E495EB1-CB02-4634-98BA-6608F9D1A21E@engineyard.com>
Message-ID: <7b40c10fa6df9aa866b93ca230ddf733@redhat.com>

There are a handful of bugs that need to be cleaned up for cmirror 
before RHEL4.5.  The speed at which that happens will depend on their 
repeatability.

The version which will eventually go upstream is still in an initial 
phase.

  brassow

On Feb 28, 2007, at 9:08 PM, Tom Mornini wrote:

> Hello all.
>
> What is the status of cmirror package?
>
> I see that there have been code changes as late as this month.
>
> We're extraordinarily interested in using it!
>
> -- 
> -- Tom Mornini, CTO
> -- Engine Yard, Ruby on Rails Hosting
> -- Reliability, Ease of Use, Scalability
> -- (866) 518-YARD (9273)
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



From tmornini at engineyard.com  Thu Mar  1 21:40:42 2007
From: tmornini at engineyard.com (Tom Mornini)
Date: Thu, 1 Mar 2007 13:40:42 -0800
Subject: [Linux-cluster] cmirror status?
In-Reply-To: <7b40c10fa6df9aa866b93ca230ddf733@redhat.com>
References: <3E495EB1-CB02-4634-98BA-6608F9D1A21E@engineyard.com>
	<7b40c10fa6df9aa866b93ca230ddf733@redhat.com>
Message-ID: <26F144F3-F48E-49A0-AE66-97D18141EF80@engineyard.com>

Thanks!

--  
-- Tom Mornini, CTO
-- Engine Yard, Ruby on Rails Hosting
-- Reliability, Ease of Use, Scalability
-- (866) 518-YARD (9273)

On Mar 1, 2007, at 1:19 PM, Jonathan E Brassow wrote:

> There are a handful of bugs that need to be cleaned up for cmirror  
> before RHEL4.5.  The speed at which that happens will depend on  
> their repeatability.
>
> The version which will eventually go upstream is still in an  
> initial phase.
>
>  brassow
>
> On Feb 28, 2007, at 9:08 PM, Tom Mornini wrote:
>
>> Hello all.
>>
>> What is the status of cmirror package?
>>
>> I see that there have been code changes as late as this month.
>>
>> We're extraordinarily interested in using it!
>>
>> -- 
>> -- Tom Mornini, CTO
>> -- Engine Yard, Ruby on Rails Hosting
>> -- Reliability, Ease of Use, Scalability
>> -- (866) 518-YARD (9273)
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From orkcu at yahoo.com  Thu Mar  1 23:55:14 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Thu, 1 Mar 2007 15:55:14 -0800 (PST)
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
In-Reply-To: <2007228191636.191879@leena>
Message-ID: <732955.27236.qm@web50607.mail.yahoo.com>
--- "isplist at logicore.net" <isplist at logicore.net>
wrote:

> Is there anyone on this list who can make a
> suggestion to RH??? I've been 
> asking about this for some time now and have no idea
> where to turn. 
> 
> I've been working with Qlogic on my LUN problem for
> a couple of weeks or so. I 
> posted asking about this here but it seems that
> there aren't any answers here 
> either.
> 
> The problem is that I need to install RHEL on fibre
> channel storage volumes. I 
> cannot find a way of doing this from the installer.
> Once the OS is installed I 
> can see them but that's of little use. 
> 
> In order to see the volumes which I need to gain
> access to, I need to install 
> RHEL and then do the following;
> 
> Edit /etc/modprobe.conf
> options scsi_mod max_luns=256
> dev_flags=INLINE:TF200:0x242
> mkinitrd -f /boot/initrd-`uname -r`.img `uname -r`
> Reboot;
> 

I insist that you can get the expected result _if_ you
type _exactly_ the _right_ line in the boot promp line
and according to this document:

http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf

what you have to writte is a litle modification of
what you type in the modprobe.conf file
maybe something like:
scsi_mod.max_luns=256
scsi_mod.scsi_dev_flags=INLINE:TF200:0x242



man, you have to test it, try and error approach maybe


cu
roger

__________________________________________
RedHat Certified Engineer ( RHCE )
Cisco Certified Network Associate ( CCNA )


 
____________________________________________________________________________________
TV dinner still cooling? 
Check out "Tonight's Picks" on Yahoo! TV.
http://tv.yahoo.com/



From isplist at logicore.net  Fri Mar  2 00:05:47 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Thu, 1 Mar 2007 18:05:47 -0600
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
In-Reply-To: <732955.27236.qm@web50607.mail.yahoo.com>
Message-ID: <20073118547.393207@leena>

I'm not sure what you're telling me here. 

Since this isn't something I knew how to do, I've been asking for help. The 
things I know how to do now, which still don't work, are all things which have 
been suggested by people who are trying to help.

 From all of this, the one thing I've learned is that it does not matter what I 
try to enter at the install command line, the problem is the redhat installer, 
there is no way of passing this information. 

I need to modify the RHEL4 install CD's initrd.img in order for anaconda to 
see this at install time... and, I don't know how to do this but am reading up 
on it.

> http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf

Thanks, I'll check this out.
 
> what you have to writte is a litle modification of
> what you type in the modprobe.conf file
> maybe something like:
> scsi_mod.max_luns=256
> scsi_mod.scsi_dev_flags=INLINE:TF200:0x242

I don't have access to modprobe.conf at install time, that's the point of my 
problem. I can modify it AFTER install and see all of the storage but I need 
to see the storage AT install time so that I can install TO the storage.
 
> man, you have to test it, try and error approach maybe

I've been at trial and error for a couple of weeks now.

Mike






From isplist at logicore.net  Fri Mar  2 03:05:12 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Thu, 1 Mar 2007 21:05:12 -0600
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
In-Reply-To: <a6d13c780703011710q3f8a063eyb58464105d6ed73b@mail.gmail.com>
Message-ID: <20073121512.491314@leena>

Interesting! The option is scsi_dev_flags, no dev_flags as I've always seen 
it. Yet another thing to try before moving on :).


On Thu, 1 Mar 2007 22:10:40 -0300, Filipe Miranda wrote:
> Mike,
> 
> I got this from the URL: http://www.redhat.com/docs/manuals/enterprise/RHEL-
> 4-Manual/ref-guide/ch-modules.html
> 
> "During installation, Red Hat Enterprise Linux uses a limited subset of
> device drivers to create a stable installation environment. Although the
> installation program supports installation on many different types of
> hardware, some drivers (including those for SCSI adapters and network
> adapters) are not included in the installation kernel."
> 
> Which means that if the installation kernel that you are using does have
> the SCSI modules, you should just load the parameters during the process of
> loading anaconda.
> 
> When the RHEL loads, the first phase of the installation, you should use
> the SCSI parameters there!
> If nothing happens, this means that the SCSI drivers that you need are not
> in the installation kernel.
> 
> I recommend using the latest release of the version of the RHEL you are
> trying to install.
> 
> Also check page number 106 (20-30) of the pdf that Roger posted, on SCSI
> options.
> 
> I hope these guidelines will help you solve the problem.
> 
> Regards,
> 
> Filipe Miranda
> 
> On 3/1/07, isplist at logicore.net < isplist at logicore.net> wrote:> I'm not
> sure what you're telling me here.
> 
>> Since this isn't something I knew how to do, I've been asking for help.
>> The
>> things I know how to do now, which still don't work, are all things which
>> have
>> been suggested by people who are trying to help.
>> 
>> From all of this, the one thing I've learned is that it does not matter
>> what I
>> try to enter at the install command line, the problem is the redhat
>> installer,
>> there is no way of passing this information.
>> 
>> I need to modify the RHEL4 install CD's initrd.img in order for anaconda
>> to
>> see this at install time... and, I don't know how to do this but am
>> reading up
>> on it.
>> 
>>> http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf
>>> 
>> Thanks, I'll check this out.
>> 
>>> what you have to writte is a litle modification of
>>> what you type in the modprobe.conf file
>>> maybe something like:
>>> scsi_mod.max_luns=256
>>> scsi_mod.scsi_dev_flags=INLINE:TF200:0x242
>>> 
>> I don't have access to modprobe.conf at install time, that's the point of
>> my
>> problem. I can modify it AFTER install and see all of the storage but I
>> need
>> to see the storage AT install time so that I can install TO the storage.
>> 
>>> man, you have to test it, try and error approach maybe
>>> 
>> I've been at trial and error for a couple of weeks now.
>> 
>> Mike
>> 
>> 
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster






From isplist at logicore.net  Fri Mar  2 02:59:57 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Thu, 1 Mar 2007 20:59:57 -0600
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
In-Reply-To: <a6d13c780703011710q3f8a063eyb58464105d6ed73b@mail.gmail.com>
Message-ID: <200731205957.022213@leena>

I have tried all of the options I can find or have been given during 
installation. So at this stage, I feel that I should be looking at building a 
custom initrd.img file in order to get this problem resolved. 

Mike


On Thu, 1 Mar 2007 22:10:40 -0300, Filipe Miranda wrote:
> Mike,
> 
> I got this from the URL: http://www.redhat.com/docs/manuals/enterprise/RHEL-
> 4-Manual/ref-guide/ch-modules.html
> 
> "During installation, Red Hat Enterprise Linux uses a limited subset of
> device drivers to create a stable installation environment. Although the
> installation program supports installation on many different types of
> hardware, some drivers (including those for SCSI adapters and network
> adapters) are not included in the installation kernel."
> 
> Which means that if the installation kernel that you are using does have
> the SCSI modules, you should just load the parameters during the process of
> loading anaconda.
> 
> When the RHEL loads, the first phase of the installation, you should use
> the SCSI parameters there!
> If nothing happens, this means that the SCSI drivers that you need are not
> in the installation kernel.
> 
> I recommend using the latest release of the version of the RHEL you are
> trying to install.
> 
> Also check page number 106 (20-30) of the pdf that Roger posted, on SCSI
> options.
> 
> I hope these guidelines will help you solve the problem.
> 
> Regards,
> 
> Filipe Miranda
> 
> On 3/1/07, isplist at logicore.net < isplist at logicore.net> wrote:> I'm not
> sure what you're telling me here.
> 
>> Since this isn't something I knew how to do, I've been asking for help.
>> The
>> things I know how to do now, which still don't work, are all things which
>> have
>> been suggested by people who are trying to help.
>> 
>> From all of this, the one thing I've learned is that it does not matter
>> what I
>> try to enter at the install command line, the problem is the redhat
>> installer,
>> there is no way of passing this information.
>> 
>> I need to modify the RHEL4 install CD's initrd.img in order for anaconda
>> to
>> see this at install time... and, I don't know how to do this but am
>> reading up
>> on it.
>> 
>>> http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf
>>> 
>> Thanks, I'll check this out.
>> 
>>> what you have to writte is a litle modification of
>>> what you type in the modprobe.conf file
>>> maybe something like:
>>> scsi_mod.max_luns=256
>>> scsi_mod.scsi_dev_flags=INLINE:TF200:0x242
>>> 
>> I don't have access to modprobe.conf at install time, that's the point of
>> my
>> problem. I can modify it AFTER install and see all of the storage but I
>> need
>> to see the storage AT install time so that I can install TO the storage.
>> 
>>> man, you have to test it, try and error approach maybe
>>> 
>> I've been at trial and error for a couple of weeks now.
>> 
>> Mike
>> 
>> 
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster






From isplist at logicore.net  Fri Mar  2 03:23:51 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Thu, 1 Mar 2007 21:23:51 -0600
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
In-Reply-To: <732955.27236.qm@web50607.mail.yahoo.com>
Message-ID: <200731212351.418306@leena>

> I insist that you can get the expected result _if_ you
> type _exactly_ the _right_ line in the boot promp line
> and according to this document:

Didn't work. I entered it just as it is here;

scsi_mod.max_luns=256 scsi_mod.scsi_dev_flags=INLINE:TF200:0x242

Mike





From filipe.miranda at gmail.com  Fri Mar  2 01:10:40 2007
From: filipe.miranda at gmail.com (Filipe Miranda)
Date: Thu, 1 Mar 2007 22:10:40 -0300
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
In-Reply-To: <20073118547.393207@leena>
References: <732955.27236.qm@web50607.mail.yahoo.com>
	<20073118547.393207@leena>
Message-ID: <a6d13c780703011710q3f8a063eyb58464105d6ed73b@mail.gmail.com>

Mike,

I got this from the URL:
http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/ref-guide/ch-modules.html

"During installation, Red Hat Enterprise Linux uses a limited subset of
device drivers to create a stable installation environment. Although the
installation program supports installation on many different types of
hardware, some drivers (including those for SCSI adapters and network
adapters) are not included in the installation kernel."

Which means that if the installation kernel that you are using does have the
SCSI modules, you should just load the parameters during the process of
loading anaconda.

When the RHEL loads, the first phase of the installation, you should use the
SCSI parameters there!
If nothing happens, this means that the SCSI drivers that you need are not
in the installation kernel.

I recommend using the latest release of the version of the RHEL you are
trying to install.

Also check page number 106 (20-30) of the pdf that Roger posted, on SCSI
options.

I hope these guidelines will help you solve the problem.

Regards,

Filipe Miranda

On 3/1/07, isplist at logicore.net <isplist at logicore.net> wrote:
>
> I'm not sure what you're telling me here.
>
> Since this isn't something I knew how to do, I've been asking for help.
> The
> things I know how to do now, which still don't work, are all things which
> have
> been suggested by people who are trying to help.
>
> From all of this, the one thing I've learned is that it does not matter
> what I
> try to enter at the install command line, the problem is the redhat
> installer,
> there is no way of passing this information.
>
> I need to modify the RHEL4 install CD's initrd.img in order for anaconda
> to
> see this at install time... and, I don't know how to do this but am
> reading up
> on it.
>
> >
> http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf
>
> Thanks, I'll check this out.
>
> > what you have to writte is a litle modification of
> > what you type in the modprobe.conf file
> > maybe something like:
> > scsi_mod.max_luns=256
> > scsi_mod.scsi_dev_flags=INLINE:TF200:0x242
>
> I don't have access to modprobe.conf at install time, that's the point of
> my
> problem. I can modify it AFTER install and see all of the storage but I
> need
> to see the storage AT install time so that I can install TO the storage.
>
> > man, you have to test it, try and error approach maybe
>
> I've been at trial and error for a couple of weeks now.
>
> Mike
>
>
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070301/01a43649/attachment.htm>

From isplist at logicore.net  Fri Mar  2 03:07:51 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Thu, 1 Mar 2007 21:07:51 -0600
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
In-Reply-To: <732955.27236.qm@web50607.mail.yahoo.com>
Message-ID: <20073121751.180050@leena>

I see now Roger... I had the wrong option wording. I never saw it this way so 
will give it a try and post my findings, thanks.


On Thu, 1 Mar 2007 15:55:14 -0800 (PST), Roger Pe?a wrote:
> 
> 
> --- "isplist at logicore.net" <isplist at logicore.net>
> wrote:
> 
>> Is there anyone on this list who can make a
>> suggestion to RH??? I've been
>> asking about this for some time now and have no idea
>> where to turn.
>> 
>> I've been working with Qlogic on my LUN problem for
>> a couple of weeks or so. I
>> posted asking about this here but it seems that
>> there aren't any answers here
>> either.
>> 
>> The problem is that I need to install RHEL on fibre
>> channel storage volumes. I
>> cannot find a way of doing this from the installer.
>> Once the OS is installed I
>> can see them but that's of little use.
>> 
>> In order to see the volumes which I need to gain
>> access to, I need to install
>> RHEL and then do the following;
>> 
>> Edit /etc/modprobe.conf
>> options scsi_mod max_luns=256
>> dev_flags=INLINE:TF200:0x242
>> mkinitrd -f /boot/initrd-`uname -r`.img `uname -r`
>> Reboot;
>> 
> 
> I insist that you can get the expected result _if_ you
> type _exactly_ the _right_ line in the boot promp line
> and according to this document:
> 
> http://www.kernel.org/pub/linux/kernel/people/gregkh/lkn/lkn_pdf/ch09.pdf
> 
> what you have to writte is a litle modification of
> what you type in the modprobe.conf file
> maybe something like:
> scsi_mod.max_luns=256
> scsi_mod.scsi_dev_flags=INLINE:TF200:0x242
> 
> 
> man, you have to test it, try and error approach maybe
> 
> 
> cu
> roger
> 
> __________________________________________
> RedHat Certified Engineer ( RHCE )
> Cisco Certified Network Associate ( CCNA )
> 
> 
> ___________________________________________________________________
___________
> ______
> TV dinner still cooling?
> Check out "Tonight's Picks" on Yahoo! TV.
> http://tv.yahoo.com/






From jose.dr.g at gmail.com  Fri Mar  2 05:01:50 2007
From: jose.dr.g at gmail.com (Jose Guevarra)
Date: Thu, 1 Mar 2007 21:01:50 -0800
Subject: [Linux-cluster] can't start GFS on Fedora
Message-ID: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com>

Hi,

I'm trying to get GFS installed and running on FC4 on a Poweredge 4400 dual
processor. Take note that I'm a total newbie at GFS.

I've installed

GFS-6.1
GFS-kernel-smp
lvm2-cluster
ccs
cman-kernel-smp
magma
fence
gnbd-kernel-smp
gnbd

I have a volume group that I want to mount w/ GFS
/dev/mapper/VolGroup00-LogVol02

I was able to create a GFS file system w/ this command...

# gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6 /dev/mapper/VolGroup00-LogVol02

Now. when I try to start ccsd it fails. so none of the other daemons start
either. /var/log/messages doesn't say anything about the start failure.

How can I troubleshoot this more? What are the required daemons that need to
start?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070301/e9d1db7a/attachment.htm>

From orkcu at yahoo.com  Fri Mar  2 13:34:18 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Fri, 2 Mar 2007 05:34:18 -0800 (PST)
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
In-Reply-To: <20073121751.180050@leena>
Message-ID: <378534.10487.qm@web50607.mail.yahoo.com>


--- "isplist at logicore.net" <isplist at logicore.net>
wrote:

> I see now Roger... I had the wrong option wording. I
> never saw it this way so 
> will give it a try and post my findings, thanks.

that was exactly what I like to point
if you mistake anything, the kernel just drop what you
type, so you have to be very carefull about what to
write.
there is also the way to give more than one option to
the same kernel module, usually the examples just show
howto give only one option to the modules but in your
case you have to pass _two_ options to the same kernel
module, how do you do that?
I guess I give you the way it should be done, but I am
not sure at all :-(
in past times I recall that you can pass several
options with a ',' separator but I couldn't fine if it
is the same with kernels 2.6.x ....

scsi_mod.max_luns=256
scsi_mod.scsi_dev_flags=INLINE:TF200:0x242

see the '.' between the module name and its option?
maybe, and just maybe, it make the trick

there is also the posibility that, because the
scsi_mod maybe be loaded as a module (not be an
in-line module), that any of the options you pass at
boot time has any use, so, why not to switch to the
console F2 and unload  scsi_mod and then loaded with
the right options?

if that do not works, then, my next try would be to
install from network (PXE boot) so I can create-modify
whatever file I like and add whatever option I like to
the kernel at boot time without recreate the installer
CD


> 
> 
> On Thu, 1 Mar 2007 15:55:14 -0800 (PST), Roger Pe?a
> wrote:
> > 
> > 
> > --- "isplist at logicore.net" <isplist at logicore.net>
> > wrote:
> > 
> >> Is there anyone on this list who can make a
> >> suggestion to RH??? I've been
> >> asking about this for some time now and have no
> idea
> >> where to turn.
> >> 
> >> I've been working with Qlogic on my LUN problem
> for
> >> a couple of weeks or so. I
> >> posted asking about this here but it seems that
> >> there aren't any answers here
> >> either.

> >> In order to see the volumes which I need to gain
> >> access to, I need to install
> >> RHEL and then do the following;
> >> 
> >> Edit /etc/modprobe.conf
> >> options scsi_mod max_luns=256
> >> dev_flags=INLINE:TF200:0x242
> >> mkinitrd -f /boot/initrd-`uname -r`.img `uname
> -r`
> >> Reboot;
> > maybe something like:
> > scsi_mod.max_luns=256
> > scsi_mod.scsi_dev_flags=INLINE:TF200:0x242

cu
roger

__________________________________________
RedHat Certified Engineer ( RHCE )
Cisco Certified Network Associate ( CCNA )


 
____________________________________________________________________________________
Have a burning question?  
Go to www.Answers.yahoo.com and get answers from real people who know.



From isplist at logicore.net  Fri Mar  2 14:47:51 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Fri, 2 Mar 2007 08:47:51 -0600
Subject: [Linux-cluster] RHEL Problem: Can't access LUNS
In-Reply-To: <378534.10487.qm@web50607.mail.yahoo.com>
Message-ID: <20073284751.501337@leena>

> if that do not works, then, my next try would be to
> install from network (PXE boot) so I can create-modify
> whatever file I like and add whatever option I like to
> the kernel at boot time without recreate the installer
> CD

You're right, this would be the best way. Plus, I can keep various versions 
handy too. This is exactly how I'd like to do this but I've not learned how to 
modify the files I need in this case yet.

Mike






From rpeterso at redhat.com  Fri Mar  2 16:33:02 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Fri, 02 Mar 2007 10:33:02 -0600
Subject: [Linux-cluster] can't start GFS on Fedora
In-Reply-To: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com>
References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com>
Message-ID: <45E851BE.4090300@redhat.com>

Jose Guevarra wrote:
> I have a volume group that I want to mount w/ GFS
> /dev/mapper/VolGroup00-LogVol02
>
> I was able to create a GFS file system w/ this command...
>
> # gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6 /dev/mapper/VolGroup00-LogVol02
>
> Now. when I try to start ccsd it fails. so none of the other daemons 
> start
> either. /var/log/messages doesn't say anything about the start failure.
>
> How can I troubleshoot this more? What are the required daemons that 
> need to
> start?
Hi Jose,

I have a couple of suggestions.  First of all, you need to determine
if you're planning to use GFS in a cluster (i.e. on shared storage like 
a SAN) or
stand-alone (and share that with us if you want help.) 

Your use of  "lock_dlm" and a cluster name makes it sound like you want
it in a cluster, but the VolGroup00-LogVol02 makes it sound like a local
hard disk and not any kind of shared storage.

If you're using it stand-alone, you don't need ccsd, since ccsd is part 
of the
cluster infrastructure.  If stand-alone, you also would want to use 
lock_nolock
rather than lock_dlm.

Now about ccsd: If you're using the cluster code shipped with FC6,
that's the "new" infrastructure code.  With the "new" stuff, you don't
need to start ccsd with a separate script like in RHEL4.  Everything should
be handled by doing: "service cman start."  The ccsd daemon is started
by the init script.  I apologize if you already knew this.  It's just that
I couldn't tell how you were starting ccsd.

You say that ccsd fails, but you didn't say much about how it fails or
what error message it gives you.

I guess the bottom line is that you didn't give us enough information to
help you.

Also, if this storage is shared in the cluster, you need to do
"service clvmd start" as well, and you may want to
change locking_type = 3 in your /etc/lvm/lvm.conf before starting clvmd.

If you're using it on shared storage in a cluster, you should probably
post your cluster.conf file, which might tell us why ccsd is having issues.

Also, the gfs_mkfs command is typically used on the logical volume, not on
the /dev/mapper device.  So something like:

# gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6 /dev/VolGroup00/LogVol02

I hope this helps.

Regards,

Bob Peterson
Red Hat Cluster Suite



From jose.dr.g at gmail.com  Fri Mar  2 19:36:23 2007
From: jose.dr.g at gmail.com (Jose Guevarra)
Date: Fri, 2 Mar 2007 11:36:23 -0800
Subject: [Linux-cluster] can't start GFS on Fedora
In-Reply-To: <45E851BE.4090300@redhat.com>
References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com>
	<45E851BE.4090300@redhat.com>
Message-ID: <3837d8af0703021136t54fb61bch2cf6f87a3d7ca6c4@mail.gmail.com>

yes, I'm trying to get a test HPC cluster going with GFS  to be used as a
SAN and shared among several nodes.

I'm currently using Fedora Core 4 which was the first version to come with
GFS.  As you say that there is now
a "new" infrast.  would you recommend that I simply upgrade to Fedora Core
6?

In terms of CCSD, 'service ccsd start' simply returns [Failed].  the logs
show ..

Mar  2 11:28:09 IQCD1 ccsd[8651]: Starting ccsd 1.0.0:
Mar  2 11:28:09 IQCD1 ccsd[8651]:  Built: Jun 16 2005 10:45:39
Mar  2 11:28:09 IQCD1 ccsd[8651]:  Copyright (C) Red Hat, Inc.  2004  All
rights reserved.

That's it.  I've now discovered that cluster.conf is nowhere to be found on
my system.  The probably
explains CCSD failing.  ccs-1.0 is installed.  What package installed a
default cluster.conf file?

Thanks.



On 3/2/07, Robert Peterson <rpeterso at redhat.com> wrote:
>
> Jose Guevarra wrote:
> > I have a volume group that I want to mount w/ GFS
> > /dev/mapper/VolGroup00-LogVol02
> >
> > I was able to create a GFS file system w/ this command...
> >
> > # gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6
> /dev/mapper/VolGroup00-LogVol02
> >
> > Now. when I try to start ccsd it fails. so none of the other daemons
> > start
> > either. /var/log/messages doesn't say anything about the start failure.
> >
> > How can I troubleshoot this more? What are the required daemons that
> > need to
> > start?
> Hi Jose,
>
> I have a couple of suggestions.  First of all, you need to determine
> if you're planning to use GFS in a cluster (i.e. on shared storage like
> a SAN) or
> stand-alone (and share that with us if you want help.)
>
> Your use of  "lock_dlm" and a cluster name makes it sound like you want
> it in a cluster, but the VolGroup00-LogVol02 makes it sound like a local
> hard disk and not any kind of shared storage.
>
> If you're using it stand-alone, you don't need ccsd, since ccsd is part
> of the
> cluster infrastructure.  If stand-alone, you also would want to use
> lock_nolock
> rather than lock_dlm.
>
> Now about ccsd: If you're using the cluster code shipped with FC6,
> that's the "new" infrastructure code.  With the "new" stuff, you don't
> need to start ccsd with a separate script like in RHEL4.  Everything
> should
> be handled by doing: "service cman start."  The ccsd daemon is started
> by the init script.  I apologize if you already knew this.  It's just that
> I couldn't tell how you were starting ccsd.
>
> You say that ccsd fails, but you didn't say much about how it fails or
> what error message it gives you.
>
> I guess the bottom line is that you didn't give us enough information to
> help you.
>
> Also, if this storage is shared in the cluster, you need to do
> "service clvmd start" as well, and you may want to
> change locking_type = 3 in your /etc/lvm/lvm.conf before starting clvmd.
>
> If you're using it on shared storage in a cluster, you should probably
> post your cluster.conf file, which might tell us why ccsd is having
> issues.
>
> Also, the gfs_mkfs command is typically used on the logical volume, not on
> the /dev/mapper device.  So something like:
>
> # gfs_mkfs -p lock_dlm -t CLUST:gfs1 -j 6 /dev/VolGroup00/LogVol02
>
> I hope this helps.
>
> Regards,
>
> Bob Peterson
> Red Hat Cluster Suite
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070302/13e8cba2/attachment.htm>

From rpeterso at redhat.com  Sat Mar  3 00:28:01 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Fri, 02 Mar 2007 18:28:01 -0600
Subject: [Linux-cluster] can't start GFS on Fedora
In-Reply-To: <3837d8af0703021136t54fb61bch2cf6f87a3d7ca6c4@mail.gmail.com>
References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com>	<45E851BE.4090300@redhat.com>
	<3837d8af0703021136t54fb61bch2cf6f87a3d7ca6c4@mail.gmail.com>
Message-ID: <45E8C111.2060801@redhat.com>

Jose Guevarra wrote:
> yes, I'm trying to get a test HPC cluster going with GFS  to be used as a
> SAN and shared among several nodes.
>
> I'm currently using Fedora Core 4 which was the first version to come 
> with
> GFS.  As you say that there is now
> a "new" infrast.  would you recommend that I simply upgrade to Fedora 
> Core
> 6?
>
> In terms of CCSD, 'service ccsd start' simply returns [Failed].  the logs
> show ..
>
> Mar  2 11:28:09 IQCD1 ccsd[8651]: Starting ccsd 1.0.0:
> Mar  2 11:28:09 IQCD1 ccsd[8651]:  Built: Jun 16 2005 10:45:39
> Mar  2 11:28:09 IQCD1 ccsd[8651]:  Copyright (C) Red Hat, Inc.  2004  All
> rights reserved.
>
> That's it.  I've now discovered that cluster.conf is nowhere to be 
> found on
> my system.  The probably
> explains CCSD failing.  ccs-1.0 is installed.  What package installed a
> default cluster.conf file?
>
> Thanks.
Hi Jose,

Well, I can't tell you to go to FC6, but I can tell you this much:
I like FC6 a lot better than FC4, plus all the cluster code has had a lot
of bug fixes and improvements.  The current CVS development tree
is geared toward newer (upstream) kernels, so FC6 will get you closer
if you want to build it from source. 

There is no default cluster.conf file.  The cluster.conf file defines
your cluster: what computers ("nodes") are in your cluster, what
fencing device(s) you are using, and the services you want for
High Availability.  There's no way any of that can be determined by
default.  That's determined by the boxes in your network.
There is, however, a couple of GUIs that may make your
life easier.  The first one is called Conga, and it's web based.
The second one is called system-config-cluster, but it's not as
user friendly as Conga.  I don't think they'll work on FC4 though.

If you haven't already done so, you might want to check out
my "NFS/GFS Cookbook" that will walk you through the
process of setting up and configuring a cluster, although that's
geared more toward RHEL4 (not the new infrastructure).
http://sources.redhat.com/cluster/doc/nfscookbook.pdf

I recently posted a link to a quick install guide to getting
the STABLE cvs branch working on an upstream kernel too.
The STABLE branch is much like the RHEL4, in that it
uses the old infrastructure.  That link is:
https://rpeterso.108.redhat.com/files/documents/98/247/STABLE.txt
The advantage of doing this is that more people on this list are
familiar with that infrastructure and can therefore answer questions.
It should work for FC6.

Hopefully this gets you going.  Learning how to set up
and manage a cluster can be a frustrating and confusing learning
experience.  At least it was for me!  But once you get going you'll
be alright.  You may have a lot of questions, and perhaps the cluster
FAQ can help with some of those:
http://sources.redhat.com/cluster/faq.html

Otherwise, the people on this list are pretty friendly and helpful.

Regards,

Bob Peterson
Red Hat Cluster Suite



From shailesh at verismonetworks.com  Mon Mar  5 05:33:13 2007
From: shailesh at verismonetworks.com (Shailesh)
Date: Mon, 05 Mar 2007 11:03:13 +0530
Subject: [Linux-cluster] Clustering questions
Message-ID: <1173072793.15762.37.camel@shailesh>

Hi All,
        I am designing a low cost storage for file serving, which will
 contain servers with directly attached storage (NO common storage). The
 requirement here is that all the servers nodes should be able to access
 EACH OTHERS directly attached storage.

 I have some of questions, your answers will be helpful?

 -  Is a file system like GFS useful in this scenario?
    If not which would be optimum based on performance?
  
 -  Can I use a ToE card on each server node to for the storage access?
    This is for both getting access to other servers storage and for
    giving access to other servers for it's own storage.

 - How safe is it to put all the storage (of individual servers) into a 
   single volume group? And then make logical volume group

 My Intentions is to just add low-cost PC to this network and provide
 both strorage and load-handling scalability.

 Thanks & Regards
Shailesh     



From lhh at redhat.com  Mon Mar  5 16:11:57 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 05 Mar 2007 11:11:57 -0500
Subject: [Linux-cluster] Power Switch recommendation
In-Reply-To: <453D02254A9EBC45866DBF28FECEA46FBADBED@ILEX01.corp.diligent.com>
References: <453D02254A9EBC45866DBF28FECEA46FBADBED@ILEX01.corp.diligent.com>
Message-ID: <1173111118.17223.6.camel@asuka.boston.devel.redhat.com>

On Sat, 2007-02-24 at 10:23 +0200, Krikler, Samuel wrote:
> Hi,
> 
>  
> 
> I want to purchase a power switch to be used as fence device.
> 
> Since I don?t any experience with this kind of devices:
> 
> Can someone recommend me a specific power switch model of those
> supported by GFS2?

Much the same as GFS1.

Any WTI IPS800, IPS1600 or NBB series:

http://www.wti.com

Most APC 79xx series:

http://www.apcc.com

Older WTI models (e.g. NPS-115 or NPS-230) and older APC models (9225
+AP9606 web/snmp card comes to mind) are often available on eBay, but
the manufacturer may have stopped supporting the units.  As such, I'd
recommend one of the newer ones for production.

Currently, we don't have much else in the way of supported remote power
controller vendors.  Black Box seems to sell re-branded WTI devices, so
it might be possible to tweak the fence_wti agent to work with those.

-- Lon




From lhh at redhat.com  Mon Mar  5 16:26:35 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 05 Mar 2007 11:26:35 -0500
Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer
In-Reply-To: <1638.1172372671@sss.pgh.pa.us>
References: <1638.1172372671@sss.pgh.pa.us>
Message-ID: <1173111995.17223.18.camel@asuka.boston.devel.redhat.com>

On Sat, 2007-02-24 at 22:04 -0500, Tom Lane wrote:
> Can someone help out this questioner?  I know zip about Cluster.
> I looked at the FAQ for a bit and thought that what he wants is
> probably doable, but I couldn't tell if it would be easy or
> painful to do load-balancing in this particular way.  (And I'm not
> qualified to say if what he wants is a sensible approach, either.)

The short answer is "yes, sort of".

With all the data on separate places on the SAN, you can certainly spawn
as many instances of MySQL as you want, and have them fail over.

There, however, is currently no way to make linux-cluster figure out
where to place new instances of MySQL based on the number of instances
of MySQL which are currently running.

Now, you can set sort of an affinity for specific nodes, and manually
have the instances of MySQL set up, say, like this:

  node 1 -> runs instances 1 and 5
  node 2 -> runs instances 2 and 6
  node 3 -> runs instances 3 and 7
  node 4 -> runs instances 4 and 8

You can make it decide to split the load, for example, set the preferred
list for instance 1 to:

   {1 2 3}

While setting instance 5 to:

   {1 4 2}

If node 1 fails, instance 1 will start on node 2, and instance 5 will
start on node 4.  With enough thought, you probably could get it so that
the instances will be equally distributed regardless of the failure
model.  Something like this for 4 nodes + 8 instances (did not check for
correctness):

Inst.  Node list (e.g. ordered/unrestricted failover domain)
  1   {1 2 3 4}
  2   {2 3 4 1}
  3   {3 4 1 2}
  4   {4 1 2 3}
  5   {1 4 3 2}
  6   {2 1 4 3}
  7   {3 2 1 4}
  8   {4 3 2 1}

-- Lon




From lhh at redhat.com  Mon Mar  5 16:29:48 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 05 Mar 2007 11:29:48 -0500
Subject: [Linux-cluster] Question about Cluster Service
In-Reply-To: <767095.27017.qm@web31809.mail.mud.yahoo.com>
References: <767095.27017.qm@web31809.mail.mud.yahoo.com>
Message-ID: <1173112189.17223.21.camel@asuka.boston.devel.redhat.com>

On Sun, 2007-02-25 at 05:10 -0800, sara sodagar wrote:
> Hi
> I would be grateful if anyone could tell me if this
> solution works or not?

It looks like it will work fine.

> As I have only 1 passive server , I should create 2
> fail over domain .
> 
> Node A ,C    (cluster service 1)
> Node B , C   (cluster service 2)
> Node c :   (Failover domain 1 : service 1, failover
> domain2: service 2)
> Each Cluster service comprises : ip address resource ,
> web serviver init script,file 
> system resource (gfs)

Yup, you certainly can do that.


>  Also I would like to know what are the advantages of
> using gfs in this solution over
> other types of files systems (like ext3) , as there
> are no 2 active servers writing on the same area at
> the
> same time.

Note that multiple "readers" from a single EXT3 file system will not be
reliable, either - so if you intend to mount on multiple servers (at all
- not just read-only), then you should use GFS.

If you are not trying to do the above, then the only practical advantage
GFS gives you is the potential for slightly faster recovery (due to the
fs already being mounted).

For most people in this case, ext3 is fine.

-- Lon



From lhh at redhat.com  Mon Mar  5 16:31:10 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 05 Mar 2007 11:31:10 -0500
Subject: [Linux-cluster] clurgmgrd[6147]: <warning> Starving for lock
	usrm::rg="SDA database"
In-Reply-To: <49900.193.133.138.40.1172572976.squirrel@lapthorn.biz>
References: <49900.193.133.138.40.1172572976.squirrel@lapthorn.biz>
Message-ID: <1173112270.17223.24.camel@asuka.boston.devel.redhat.com>

On Tue, 2007-02-27 at 10:42 +0000, James Lapthorn wrote:
> Hi Guys,
> 
> I have a 4 node cluster running RH Cluster Suite 4.  I have just added a
> DB2 service to one of the nodes and have starting gettingerrors relating
> to locks ion the system log.  I plan to restart this node at luch time
> today to see if this fixes the problem.
> 
> Is there anyone who can explain what these errors relate to so that I can
> understand the problem better.   I have checked RHN, Cluster Project
> website and Google and I cant find anything?
> 
> Its worth mentioning that the service is running fine.

cat /proc/slabinfo  | grep dlm

If you see a big number, try the rgmanager packages from here - they
should fix it:

http://people.redhat.com/lhh/packages.html

-- Lon




From lhh at redhat.com  Mon Mar  5 16:32:21 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 05 Mar 2007 11:32:21 -0500
Subject: [Linux-cluster] Typo in Makefile
In-Reply-To: <45E58780.8060901@seanodes.com>
References: <45E58780.8060901@seanodes.com>
Message-ID: <1173112341.17223.26.camel@asuka.boston.devel.redhat.com>

On Wed, 2007-02-28 at 14:45 +0100, Erwan Velu wrote:
> I found many line like this one in the Makefile or rgmanager.
> 
> rgmanager/src/utils/Makefile: $(CC) -o $@ $^ $(INLUDE) $(CFLAGS) $(LDFLAGS)
> 
> Looks like INLUDE is a typo ;)

Yes, it does.

-- Lon



From lhh at redhat.com  Mon Mar  5 16:39:46 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 05 Mar 2007 11:39:46 -0500
Subject: [Linux-cluster] can't start GFS on Fedora
In-Reply-To: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com>
References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com>
Message-ID: <1173112787.17223.34.camel@asuka.boston.devel.redhat.com>

On Thu, 2007-03-01 at 21:01 -0800, Jose Guevarra wrote:
> Hi,
> 
> I'm trying to get GFS installed and running on FC4 on a Poweredge 4400
> dual processor. Take note that I'm a total newbie at GFS.
> 
> I've installed 
> 
> GFS-6.1
> GFS-kernel-smp
> lvm2-cluster
> ccs
> cman-kernel-smp
> magma
> fence
> gnbd-kernel-smp
> gnbd


> How can I troubleshoot this more? What are the required daemons that
> need to start? 

Missing magma-plugins rpm

-- Lon






From erwan at seanodes.com  Mon Mar  5 16:42:54 2007
From: erwan at seanodes.com (Erwan Velu)
Date: Mon, 05 Mar 2007 17:42:54 +0100
Subject: [Linux-cluster] can't start GFS on Fedora
In-Reply-To: <1173112787.17223.34.camel@asuka.boston.devel.redhat.com>
References: <3837d8af0703012101m5fe475e4xadebf56d1406b986@mail.gmail.com>
	<1173112787.17223.34.camel@asuka.boston.devel.redhat.com>
Message-ID: <45EC488E.6000908@seanodes.com>

Lon Hohberger wrote:
>> How can I troubleshoot this more? What are the required daemons that
>> need to start? 
>>     
>
> Missing magma-plugins rpm
>   
It seems there is a missing dependencies from one rpm isn't it ?



From ROBERTO.RAMIREZ at hitachigst.com  Tue Mar  6 00:41:00 2007
From: ROBERTO.RAMIREZ at hitachigst.com (ROBERTO.RAMIREZ at hitachigst.com)
Date: Mon, 5 Mar 2007 16:41:00 -0800
Subject: [Linux-cluster] ipmi fence device config
Message-ID: <OFE4929638.C8D7CD23-ON88257296.00037D6F-88257296.0003BC29@hgst.com>

Hello

i am trying to setup ipmi fence device on a RHEL AS V4 Cluster Suite

i have 2 ibm 3950 servers the BMC device is setup with an ip address on 
both servers

when i add the ipmi device to the cluster and test , the fence fails

have somebody configure ipmi on Cluster Suite with IBM servers that can 
help me

also if you can tell me some basics about the ipmi fence method i would 
apriciate i have only do fence on balde centers 

regards

thank

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070305/751e4427/attachment.htm>

From Britt.Treece at savvis.net  Tue Mar  6 04:30:53 2007
From: Britt.Treece at savvis.net (Treece, Britt)
Date: Mon, 5 Mar 2007 22:30:53 -0600
Subject: [Linux-cluster] RE: Errors trying to login to LT000: ... 1006:Not
	Allowed
Message-ID: <D23D9462B361814BBF0D8205C6584BF201387589@s228130hz1ew24.apptix-01.savvis.net>

All,

After much further investigation I found /etc/hosts is off by one for
these 3 client nodes on all 3 lock servers.  Having fixed the typo's is
it safe to assume that the root of the problem trying to login to LTPX
is that /etc/hosts on the lock servers was wrong for these nodes?  If
yes, why would these 3 clients be allowed into the cluster when it was
originally started being that they had incorrect entries in /etc/hosts?

Regards,

Britt Treece
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070305/31f8b565/attachment.htm>

From britt.treece at savvis.net  Tue Mar  6 04:51:26 2007
From: britt.treece at savvis.net (Britt Treece)
Date: Mon, 05 Mar 2007 22:51:26 -0600
Subject: [Linux-cluster] RE: Errors trying to login to LT000: ...
	1006:Not Allowed
In-Reply-To: <D23D9462B361814BBF0D8205C6584BF201387589@s228130hz1ew24.apptix-01.savvis.net>
Message-ID: <C2124F6E.4721%britt.treece@savvis.net>

Not sure why my first post didn?t, but here it is...

---
I am running a 13 node GFS (6.0.2.33) cluster with 10 mounting clients and 3
dedicated lock servers.  The master lock server was rebooted and the next
slave in the voting order took over.  At that time 3 of the client nodes
started receiving login errors for the ltpx server

Mar  4 00:05:52 lock1 lock_gulmd_core[3798]: Master Node Is Logging Out NOW!
... 

Mar  4 00:05:52 lock2 lock_gulmd_core[24627]: Master Node has logged out.
Mar  4 00:05:54 lock2 lock_gulmd_core[24627]: I see no Masters, So I am
Arbitrating until enough Slaves talk to me.
Mar  4 00:05:54 lock2 lock_gulmd_LTPX[24638]: New Master at lock2
:192.168.1.3 
Mar  4 00:05:56 lock2 lock_gulmd_core[24627]: Now have Slave quorum, going
full Master. 
Mar  4 00:11:39 lock2 lock_gulmd_core[24627]: Master Node Is Logging Out
NOW! 
? 

Mar  4 00:05:52 client1 kernel: lock_gulm: Checking for journals for node
"lock1 " 
Mar  4 00:05:52 client1 lock_gulmd_core[9383]: Master Node has logged out.
Mar  4 00:05:52 client1 kernel: lock_gulm: Checking for journals for node
"lock1 " 
Mar  4 00:05:56 client1 lock_gulmd_core[9383]: Found Master at lock2 , so
I'm a Client. 
Mar  4 00:05:56 client1 lock_gulmd_core[9383]: Failed to receive a timely
heartbeat reply from Master. (t:1172988356370685 mb:1)

Mar  4 00:05:56 client1 lock_gulmd_LTPX[9390]: New Master at lock2
:192.168.1.3 
Mar  4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT002: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT000: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT000: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT002: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT004: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT001: (lock2 :192.168.1.3) 1006:Not Allowed
---

Britt


On 3/5/07 10:30 PM, "Treece, Britt" <Britt.Treece at savvis.net> wrote:

> All, 
> 
> After much further investigation I found /etc/hosts is off by one for these 3
> client nodes on all 3 lock servers.  Having fixed the typo's is it safe to
> assume that the root of the problem trying to login to LTPX is that /etc/hosts
> on the lock servers was wrong for these nodes?  If yes, why would these 3
> clients be allowed into the cluster when it was originally started being that
> they had incorrect entries in /etc/hosts?
> 
> Regards, 
> 
> Britt Treece 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070305/d6dcd9c6/attachment.htm>

From matthew at arts.usyd.edu.au  Tue Mar  6 09:25:49 2007
From: matthew at arts.usyd.edu.au (Matthew Geier)
Date: Tue, 06 Mar 2007 20:25:49 +1100
Subject: [Linux-cluster] RHEL4 cluster NFS
Message-ID: <45ED339D.7000504@arts.usyd.edu.au>


 I'm beating my head against the wall trying to figure out how to do
this properly.

 I have a active/passive file server running Samba (and NetAtalk) all
running fine. I'm using ext3 file systems on an EMC san and have all the
failover stuff mostly sorted.
 Only I need to export one of the filesystems with NFS as well.

 Not finding an obvious way to do an NFS export in system-config-cluster
(no where to enter the file system I want to share), I put the export in
/etc/exports.
 Only now the 'file service' will not shutdown cleanly as it can't stop
NFS on the exported volume thus can't unmount it...

 I put in a support call with Redhat (my cluster file service will not
shutdown cleanly) and after a little to-and-fro they have said I have
NFS incorrectly configured.

 But after much searching the web, i'm even more confused. Many articles
on Redhat's site (and others) are for RHEL 3 and not RHEL4. Things they
say to do I can't, or the config file format has changed, etc.


 Any one have a concise example on how to NFS export an ext3 filesystem
on RHEL U4 cluster suite. ?

 Thanks.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3415 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070306/51099901/attachment.bin>

From lhh at redhat.com  Tue Mar  6 15:05:52 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 06 Mar 2007 10:05:52 -0500
Subject: [Linux-cluster] RHEL4 cluster NFS
In-Reply-To: <45ED339D.7000504@arts.usyd.edu.au>
References: <45ED339D.7000504@arts.usyd.edu.au>
Message-ID: <1173193552.14390.4.camel@asuka.boston.devel.redhat.com>

On Tue, 2007-03-06 at 20:25 +1100, Matthew Geier wrote:

>  Any one have a concise example on how to NFS export an ext3 filesystem
> on RHEL U4 cluster suite. ?

Typically, it should look something like this:

<resources>
  <nfsexport name="foo"/>
</resources>
<service name="myservice">
  <fs device... mountpoint... name... type... (etc...) >
    <nfsexport ref="foo">
      <nfsclient .../>
      <nfsclient .../>
      <nfsclient .../>
    </nfsexport>
  </fs>
  <ip/>
  <samba/> <!-- unless you're using your own script, in which case,
                use <script/> -->
</service>

If you need to export something *other* than the top-level mountpoint,
you can add path="" attributes to the nfsclient resources (include the
full path; i.e. if mountpoint was /mnt/1 and you want to
export /mnt/1/foo, use path="/mnt/1/foo", not "/foo"...).

Let's see what you've got now?

-- Lon



From rpeterso at redhat.com  Tue Mar  6 15:09:00 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 06 Mar 2007 09:09:00 -0600
Subject: [Linux-cluster] RHEL4 cluster NFS
In-Reply-To: <45ED339D.7000504@arts.usyd.edu.au>
References: <45ED339D.7000504@arts.usyd.edu.au>
Message-ID: <45ED840C.3000002@redhat.com>

Matthew Geier wrote:
>  I'm beating my head against the wall trying to figure out how to do
> this properly.
>
>  I have a active/passive file server running Samba (and NetAtalk) all
> running fine. I'm using ext3 file systems on an EMC san and have all the
> failover stuff mostly sorted.
>  Only I need to export one of the filesystems with NFS as well.
>
>  Not finding an obvious way to do an NFS export in system-config-cluster
> (no where to enter the file system I want to share), I put the export in
> /etc/exports.
>  Only now the 'file service' will not shutdown cleanly as it can't stop
> NFS on the exported volume thus can't unmount it...
>
>  I put in a support call with Redhat (my cluster file service will not
> shutdown cleanly) and after a little to-and-fro they have said I have
> NFS incorrectly configured.
>
>  But after much searching the web, i'm even more confused. Many articles
> on Redhat's site (and others) are for RHEL 3 and not RHEL4. Things they
> say to do I can't, or the config file format has changed, etc.
>
>
>  Any one have a concise example on how to NFS export an ext3 filesystem
> on RHEL U4 cluster suite. ?
>
>  Thanks.
>   
Hi Matthew,

Have you looked at my NFS/GFS cookbook?  It's not too different for EXT3,
and it has concrete examples.
http://sources.redhat.com/cluster/doc/nfscookbook.pdf

Regards,

Bob Peterson
Red Hat Cluster Suite



From lhh at redhat.com  Tue Mar  6 15:20:46 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 06 Mar 2007 10:20:46 -0500
Subject: [Linux-cluster] ipmi fence device config
In-Reply-To: <OFE4929638.C8D7CD23-ON88257296.00037D6F-88257296.0003BC29@hgst.com>
References: <OFE4929638.C8D7CD23-ON88257296.00037D6F-88257296.0003BC29@hgst.com>
Message-ID: <1173194447.14390.14.camel@asuka.boston.devel.redhat.com>

On Mon, 2007-03-05 at 16:41 -0800, ROBERTO.RAMIREZ at hitachigst.com wrote:
> 
> Hello 
> 
> i am trying to setup ipmi fence device on a RHEL AS V4 Cluster Suite 
> 
> i have 2 ibm 3950 servers the BMC device is setup with an ip address
> on both servers 
> 
> when i add the ipmi device to the cluster and test , the fence fails 
> 
> have somebody configure ipmi on Cluster Suite with IBM servers that
> can help me 
> 
> also if you can tell me some basics about the ipmi fence method i
> would apriciate i have only do fence on balde centers  

Try fence_ipmi from the command line -- e.g.

fence_ipmi -a <ip> -o off -p <password> (etc.) - see what the output is.

There are two bugs which are fixed in CVS which you might be hitting: 

(1) fence_ipmi doesn't work with null passwords

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=218974

(2) fence_ipmi doesn't work with lan-plus components

(don't know the bugzilla # :( )

Also, fence_ipmi uses ipmitool.  If you get an "ipmitool not found"
warning, ensure that the OpenIPMI package from RHEL4 U3 or later is
installed.  The fence package does not require this any other packages
at install-time (Note: -not- a bug).

-- Lon



From lhh at redhat.com  Tue Mar  6 15:22:44 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 06 Mar 2007 10:22:44 -0500
Subject: [Linux-cluster] RHEL4 cluster NFS
In-Reply-To: <1173193552.14390.4.camel@asuka.boston.devel.redhat.com>
References: <45ED339D.7000504@arts.usyd.edu.au>
	<1173193552.14390.4.camel@asuka.boston.devel.redhat.com>
Message-ID: <1173194564.14390.17.camel@asuka.boston.devel.redhat.com>

On Tue, 2007-03-06 at 10:05 -0500, Lon Hohberger wrote:
> On Tue, 2007-03-06 at 20:25 +1100, Matthew Geier wrote:

Note: Bob's NFS cookbook is way better than my example.

/me goes to drink more coffee

-- Lon



From lgodoy at atichile.com  Tue Mar  6 22:48:54 2007
From: lgodoy at atichile.com (Luis Godoy Gonzalez)
Date: Tue, 06 Mar 2007 19:48:54 -0300
Subject: [Linux-cluster] node fails to join cluster after it was fenced
In-Reply-To: <45DC4934.4040504@redhat.com>
References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk>	<45D31766.3080908@redhat.com>	<1171469028.24507.109.camel@pc029.sc.diamond.ac.uk>	<45D339CF.7070408@redhat.com>	<1171474578.24507.148.camel@pc029.sc.diamond.ac.uk>	<45D422B7.30506@redhat.com>	<1171539363.24507.210.camel@pc029.sc.diamond.ac.uk>	<1172056251.18210.135.camel@pc029.sc.diamond.ac.uk>	<45DC2C5E.1040808@redhat.com>	<1172059653.18210.166.camel@pc029.sc.diamond.ac.uk>
	<45DC4934.4040504@redhat.com>
Message-ID: <45EDEFD6.8070802@atichile.com>

Hi

we have the same problem... :|   we have RHE4 U2 with cluster suite 4 
U2, in our case one node send a fenced to the other node, and we have 
not succes to rejoining the node to cluster.
On logs appeared that node 2 cannot comunicate with node 1, but the 
network connectivity is working fine
In a test we deleted the cluster.conf from node 2 and reboot it. After 
the reboot the node got the last version of cluster.conf from node 1, 
but still cannot joining to cluster again.

Below of this mail, we attached a little dump from node 1 that were the 
cluster service is running.

Thanks in advanced for any help.

Best Regards,

Luis.G.
=========================================================================
[root at lvs-gt1 ~]# tcpdump -s0 -x port 6809
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:45:30.043719 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28
        0x0000:  4500 0038 d064 4000 4011 bbea c0a8 9615  E..8.d at .@.......
        0x0010:  c0a8 96ff 1a99 1a99 0024 f69c 0101 4ed0  .........$....N.
        0x0020:  0000 b49d 0000 1900 0100 0000 0000 0000  ................
        0x0030:  0402 0100 0200 0000                      ........
16:45:30.043758 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28
        0x0000:  4500 0038 d064 4000 4011 bbea c0a8 9615  E..8.d at .@.......
        0x0010:  c0a8 96ff 1a99 1a99 0024 f69c 0101 4ed0  .........$....N.
        0x0020:  0000 b49d 0000 1900 0100 0000 0000 0000  ................
        0x0030:  0402 0100 0200 0000                      ........
16:45:30.043829 IP lvs-gt2.6809 > lvs-gt1.6809: UDP, length 92
        0x0000:  4500 0078 0226 4000 4011 8ad2 c0a8 9616  E..x.&@. at .......
        0x0010:  c0a8 9615 1a99 1a99 0064 1b42 0101 2902  .........d.B..).
        0x0020:  0000 b49d 0000 0100 0000 0000 0000 0000  ................
        0x0030:  0201 0100 0100 0000 0000 0000 0500 0000  ................
        0x0040:  0000 0000 0100 0000 0a00 0000 1000 0000  ................
        0x0050:  6c62 5f63 6c75 7374 6572 0000 0000 0000  lb_cluster......
        0x0060:  0200 1a99 c0a8 9616 0000 0000 0000 0000  ................
        0x0070:  6c76 732d 6774 3200                      lvs-gt2.
16:45:35.042945 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28
        0x0000:  4500 0038 d065 4000 4011 bbe9 c0a8 9615  E..8.e at .@.......
        0x0010:  c0a8 96ff 1a99 1a99 0024 f59c 0101 4fd0  .........$....O.
        0x0020:  0000 b49d 0000 1900 0100 0000 0000 0000  ................
        0x0030:  0402 0100 0200 0000                      ........
16:45:35.042998 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28
        0x0000:  4500 0038 d065 4000 4011 bbe9 c0a8 9615  E..8.e at .@.......
        0x0010:  c0a8 96ff 1a99 1a99 0024 f59c 0101 4fd0  .........$....O.
        0x0020:  0000 b49d 0000 1900 0100 0000 0000 0000  ................
        0x0030:  0402 0100 0200 0000                      ........
16:45:35.043075 IP lvs-gt2.6809 > lvs-gt1.6809: UDP, length 92
        0x0000:  4500 0078 0227 4000 4011 8ad1 c0a8 9616  E..x.'@. at .......
        0x0010:  c0a8 9615 1a99 1a99 0064 1a42 0101 2a02  .........d.B..*.
        0x0020:  0000 b49d 0000 0100 0000 0000 0000 0000  ................
        0x0030:  0201 0100 0100 0000 0000 0000 0500 0000  ................
        0x0040:  0000 0000 0100 0000 0a00 0000 1000 0000  ................
        0x0050:  6c62 5f63 6c75 7374 6572 0000 0000 0000  lb_cluster......
        0x0060:  0200 1a99 c0a8 9616 0000 0000 0000 0000  ................
        0x0070:  6c76 732d 6774 3200                      lvs-gt2.

6 packets captured
6 packets received by filter
0 packets dropped by kernel
=============================================================================================

Patrick Caulfield wrote:
> Frederik Ferner wrote:
>   
>> On Wed, 2007-02-21 at 11:26 +0000, Patrick Caulfield wrote:
>>     
>>> Frederik Ferner wrote:
>>>       
>>>> Hi Patrick, All,
>>>>
>>>> let me give you an update on that problem.
>>>>
>>>> On Thu, 2007-02-15 at 11:36 +0000, Frederik Ferner wrote:
>>>>         
>>>>> On Thu, 2007-02-15 at 09:07 +0000, Patrick Caulfield wrote:
>>>>>           
>>>> [node not joining cluster] 
>>>>         
>>>>>> It would be interesting to know - though you may not want to do it - if the
>>>>>> problem persists when the still-running node is rebooted.
>>>>>>             
>>>>> Obviously not at the moment, but I have a maintenance window upcoming
>>>>> soon where I might be able to do that. I'll keep you informed about the
>>>>> result.
>>>>>           
>>>> Today I had the possibility to reboot the node that was still quorate
>>>> (i04-storage1) while the other node (i04-storage2) was still trying to
>>>> join. 
>>>> When i04-storage1 came to the stage where the cluster services are
>>>> started, both nodes joined the cluster at the same time.
>>>>
>>>> With this running cluster, I tried to reproduce the problem by fencing
>>>> one node but after rebooting this immediately joined the cluster.
>>>>         
>>> Interesting. it sounds similar to a cman bug that was introduced in U3, but it
>>> was fixed in U4 - which you said you were running.
>>>       
>> Let's verify that then. I have the following RHCS related packages
>> installed:
>> ccs-1.0.7-0
>> rgmanager-1.9.54-1
>> cman-1.0.11-0
>> fence-1.32.25-1
>> cman-kernel-smp-2.6.9-45.8
>> dlm-kernel-smp-2.6.9-44.3
>> dlm-1.0.1-1
>>     
>
> Yes, those look fine.
>
>   



From ROBERTO.RAMIREZ at hitachigst.com  Tue Mar  6 23:17:52 2007
From: ROBERTO.RAMIREZ at hitachigst.com (ROBERTO.RAMIREZ at hitachigst.com)
Date: Tue, 6 Mar 2007 15:17:52 -0800
Subject: [Linux-cluster] node fails to join cluster after it was fenced
In-Reply-To: <45EDEFD6.8070802@atichile.com>
Message-ID: <OF4672BF0B.940BD234-ON88257296.007EA813-88257296.007FF655@hgst.com>

Luis have you check it the iptables are off if they are on try to disable 
them for a test and try again

service iptables stop
chkconfig iptables off

fence and see if it get back




Luis Godoy Gonzalez <lgodoy at atichile.com> 
Sent by: linux-cluster-bounces at redhat.com
03/06/2007 02:48 PM
Please respond to
linux clustering <linux-cluster at redhat.com>


To
linux clustering <linux-cluster at redhat.com>
cc

Subject
Re: [Linux-cluster] node fails to join cluster after it was fenced






Hi

we have the same problem... :|   we have RHE4 U2 with cluster suite 4 
U2, in our case one node send a fenced to the other node, and we have 
not succes to rejoining the node to cluster.
On logs appeared that node 2 cannot comunicate with node 1, but the 
network connectivity is working fine
In a test we deleted the cluster.conf from node 2 and reboot it. After 
the reboot the node got the last version of cluster.conf from node 1, 
but still cannot joining to cluster again.

Below of this mail, we attached a little dump from node 1 that were the 
cluster service is running.

Thanks in advanced for any help.

Best Regards,

Luis.G.
=========================================================================
[root at lvs-gt1 ~]# tcpdump -s0 -x port 6809
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:45:30.043719 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28
        0x0000:  4500 0038 d064 4000 4011 bbea c0a8 9615  E..8.d at .@.......
        0x0010:  c0a8 96ff 1a99 1a99 0024 f69c 0101 4ed0  .........$....N.
        0x0020:  0000 b49d 0000 1900 0100 0000 0000 0000  ................
        0x0030:  0402 0100 0200 0000                      ........
16:45:30.043758 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28
        0x0000:  4500 0038 d064 4000 4011 bbea c0a8 9615  E..8.d at .@.......
        0x0010:  c0a8 96ff 1a99 1a99 0024 f69c 0101 4ed0  .........$....N.
        0x0020:  0000 b49d 0000 1900 0100 0000 0000 0000  ................
        0x0030:  0402 0100 0200 0000                      ........
16:45:30.043829 IP lvs-gt2.6809 > lvs-gt1.6809: UDP, length 92
        0x0000:  4500 0078 0226 4000 4011 8ad2 c0a8 9616  E..x.&@. at .......
        0x0010:  c0a8 9615 1a99 1a99 0064 1b42 0101 2902  .........d.B..).
        0x0020:  0000 b49d 0000 0100 0000 0000 0000 0000  ................
        0x0030:  0201 0100 0100 0000 0000 0000 0500 0000  ................
        0x0040:  0000 0000 0100 0000 0a00 0000 1000 0000  ................
        0x0050:  6c62 5f63 6c75 7374 6572 0000 0000 0000  lb_cluster......
        0x0060:  0200 1a99 c0a8 9616 0000 0000 0000 0000  ................
        0x0070:  6c76 732d 6774 3200                      lvs-gt2.
16:45:35.042945 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28
        0x0000:  4500 0038 d065 4000 4011 bbe9 c0a8 9615  E..8.e at .@.......
        0x0010:  c0a8 96ff 1a99 1a99 0024 f59c 0101 4fd0  .........$....O.
        0x0020:  0000 b49d 0000 1900 0100 0000 0000 0000  ................
        0x0030:  0402 0100 0200 0000                      ........
16:45:35.042998 IP lvs-gt1.6809 > 192.168.150.255.6809: UDP, length 28
        0x0000:  4500 0038 d065 4000 4011 bbe9 c0a8 9615  E..8.e at .@.......
        0x0010:  c0a8 96ff 1a99 1a99 0024 f59c 0101 4fd0  .........$....O.
        0x0020:  0000 b49d 0000 1900 0100 0000 0000 0000  ................
        0x0030:  0402 0100 0200 0000                      ........
16:45:35.043075 IP lvs-gt2.6809 > lvs-gt1.6809: UDP, length 92
        0x0000:  4500 0078 0227 4000 4011 8ad1 c0a8 9616  E..x.'@. at .......
        0x0010:  c0a8 9615 1a99 1a99 0064 1a42 0101 2a02  .........d.B..*.
        0x0020:  0000 b49d 0000 0100 0000 0000 0000 0000  ................
        0x0030:  0201 0100 0100 0000 0000 0000 0500 0000  ................
        0x0040:  0000 0000 0100 0000 0a00 0000 1000 0000  ................
        0x0050:  6c62 5f63 6c75 7374 6572 0000 0000 0000  lb_cluster......
        0x0060:  0200 1a99 c0a8 9616 0000 0000 0000 0000  ................
        0x0070:  6c76 732d 6774 3200                      lvs-gt2.

6 packets captured
6 packets received by filter
0 packets dropped by kernel
=============================================================================================

Patrick Caulfield wrote:
> Frederik Ferner wrote:
> 
>> On Wed, 2007-02-21 at 11:26 +0000, Patrick Caulfield wrote:
>> 
>>> Frederik Ferner wrote:
>>> 
>>>> Hi Patrick, All,
>>>>
>>>> let me give you an update on that problem.
>>>>
>>>> On Thu, 2007-02-15 at 11:36 +0000, Frederik Ferner wrote:
>>>> 
>>>>> On Thu, 2007-02-15 at 09:07 +0000, Patrick Caulfield wrote:
>>>>> 
>>>> [node not joining cluster] 
>>>> 
>>>>>> It would be interesting to know - though you may not want to do it 
- if the
>>>>>> problem persists when the still-running node is rebooted.
>>>>>> 
>>>>> Obviously not at the moment, but I have a maintenance window 
upcoming
>>>>> soon where I might be able to do that. I'll keep you informed about 
the
>>>>> result.
>>>>> 
>>>> Today I had the possibility to reboot the node that was still quorate
>>>> (i04-storage1) while the other node (i04-storage2) was still trying 
to
>>>> join. 
>>>> When i04-storage1 came to the stage where the cluster services are
>>>> started, both nodes joined the cluster at the same time.
>>>>
>>>> With this running cluster, I tried to reproduce the problem by 
fencing
>>>> one node but after rebooting this immediately joined the cluster.
>>>> 
>>> Interesting. it sounds similar to a cman bug that was introduced in 
U3, but it
>>> was fixed in U4 - which you said you were running.
>>> 
>> Let's verify that then. I have the following RHCS related packages
>> installed:
>> ccs-1.0.7-0
>> rgmanager-1.9.54-1
>> cman-1.0.11-0
>> fence-1.32.25-1
>> cman-kernel-smp-2.6.9-45.8
>> dlm-kernel-smp-2.6.9-44.3
>> dlm-1.0.1-1
>> 
>
> Yes, those look fine.
>
> 

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070306/3030ac98/attachment.htm>

From matthew at arts.usyd.edu.au  Tue Mar  6 23:23:51 2007
From: matthew at arts.usyd.edu.au (Matthew Geier)
Date: Wed, 07 Mar 2007 10:23:51 +1100
Subject: [Linux-cluster] RHEL4 cluster NFS
In-Reply-To: <1173193552.14390.4.camel@asuka.boston.devel.redhat.com>
References: <45ED339D.7000504@arts.usyd.edu.au>
	<1173193552.14390.4.camel@asuka.boston.devel.redhat.com>
Message-ID: <45EDF807.3050104@arts.usyd.edu.au>

Lon Hohberger wrote:
> On Tue, 2007-03-06 at 20:25 +1100, Matthew Geier wrote:
> 
>>  Any one have a concise example on how to NFS export an ext3 filesystem
>> on RHEL U4 cluster suite. ?

  Ok, thanks, it's 'clicked' now and I have the idea. Some one emailed 
me directly a screen grab of the layout in system-config-cluster that 
showed me the relationship I was missing, however last evening while 
relaxing in the bath, I had a 'flash of inspiration' on how the 
relationship between the file systems in the services section and the 
NFS clients went together and I tried it remotely and it seems to work. 
all your helpful emails arrived later.

  It's still not perfect, but functional

                         <fs ref="Files - u3">
                                 <nfsexport name="NFS Export u3">
                                         <nfsclient name="whitestar NFS" 
options="async,rw" target="whitestar.arts.usyd.edu.au"/>
                                 </nfsexport>
                         </fs>

   I gather the nfsclient should be a public resource so it can be 
reused on other file systems. I made it private. Have to wait to my next 
maintenance window to change it as the resulting service restart will 
annoy all my Mac users. (Unlike Windows, Mac's don't expect their 
servers to go down all the time :-)

  The examples put the nfsexport in the main resources section as well, 
which is why I couldn't make the connection with the export resource and 
the file system it exports. They neatly put the resource configuration 
in order, directly after the file system is is going to reference, which 
has given me a false impression of how it was supposed to work, as I 
mistakenly thought the binding must happen in resources (and I couldn't 
get it to work) when it actually happens in the service section.

  What does the actual nfsexport directive do ?. It seems to be that 
adding an nfsclient to a filesystem resource would imply it.

  Thanks to all those that helped.




From srramasw at cisco.com  Wed Mar  7 02:03:22 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Tue, 6 Mar 2007 18:03:22 -0800
Subject: [Linux-cluster] uboot / bootloader support for GFS
Message-ID: <B14199FA0DBAAF4AA89E83EB41D354350329633A@xmb-sjc-22c.amer.cisco.com>

Has anyone tried to use uboot bootloader to boot off a local GFS
filesystem? For that matter any other bootloader?
 
thanks,
Sridhar
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070306/8dfaa715/attachment.htm>

From shailesh at verismonetworks.com  Wed Mar  7 05:38:50 2007
From: shailesh at verismonetworks.com (Shailesh)
Date: Wed, 07 Mar 2007 11:08:50 +0530
Subject: [Linux-cluster] fence and lvm
Message-ID: <1173245930.20588.21.camel@shailesh>

Hi,
      what are the uses of 'fenced' in a clustered environment where I
am not using any power device ?  Can you  elaborate any other uses.

      Is it possible to use 'lvm' (make logical volumes) in a RAID-5/6
disk array ? If so how are the parity blocks of RAID 5/6 taken care of
in the logical volumes.

Thanks & Regards 
Shailesh 



From shailesh at verismonetworks.com  Wed Mar  7 06:52:12 2007
From: shailesh at verismonetworks.com (Shailesh)
Date: Wed, 07 Mar 2007 12:22:12 +0530
Subject: [Linux-cluster] Any RH cluster suite MIB
Message-ID: <1173250332.20588.25.camel@shailesh>

I am looking for private mibs defined for GFS and the other cluster
suite. Do you know of any?

Thanks & Regards
Shailesh





From pcaulfie at redhat.com  Wed Mar  7 08:53:18 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 07 Mar 2007 08:53:18 +0000
Subject: [Linux-cluster] node fails to join cluster after it was fenced
In-Reply-To: <45EDEFD6.8070802@atichile.com>
References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk>	<45D31766.3080908@redhat.com>	<1171469028.24507.109.camel@pc029.sc.diamond.ac.uk>	<45D339CF.7070408@redhat.com>	<1171474578.24507.148.camel@pc029.sc.diamond.ac.uk>	<45D422B7.30506@redhat.com>	<1171539363.24507.210.camel@pc029.sc.diamond.ac.uk>	<1172056251.18210.135.camel@pc029.sc.diamond.ac.uk>	<45DC2C5E.1040808@redhat.com>	<1172059653.18210.166.camel@pc029.sc.diamond.ac.uk>	<45DC4934.4040504@redhat.com>
	<45EDEFD6.8070802@atichile.com>
Message-ID: <45EE7D7E.6030101@redhat.com>

Luis Godoy Gonzalez wrote:
> Hi
> 
> we have the same problem... :|   we have RHE4 U2 with cluster suite 4
> U2, in our case one node send a fenced to the other node, and we have
> not succes to rejoining the node to cluster.
> On logs appeared that node 2 cannot comunicate with node 1, but the
> network connectivity is working fine
> In a test we deleted the cluster.conf from node 2 and reboot it. After
> the reboot the node got the last version of cluster.conf from node 1,
> but still cannot joining to cluster again.
> 
> Below of this mail, we attached a little dump from node 1 that were the
> cluster service is running. 

That's showing the same symptoms. The new node is sending joinreq messages but
they are not received by the node that's already in the cluster.

If you're running U2 you should upgrade anyway. there are lots of bugs fixed
between that and the current U4.

-- 

patrick



From lgodoy at atichile.com  Wed Mar  7 13:56:55 2007
From: lgodoy at atichile.com (Luis Godoy Gonzalez)
Date: Wed, 07 Mar 2007 10:56:55 -0300
Subject: [Linux-cluster] node fails to join cluster after it was fenced
In-Reply-To: <45EE7D7E.6030101@redhat.com>
References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk>	<45D31766.3080908@redhat.com>	<1171469028.24507.109.camel@pc029.sc.diamond.ac.uk>	<45D339CF.7070408@redhat.com>	<1171474578.24507.148.camel@pc029.sc.diamond.ac.uk>	<45D422B7.30506@redhat.com>	<1171539363.24507.210.camel@pc029.sc.diamond.ac.uk>	<1172056251.18210.135.camel@pc029.sc.diamond.ac.uk>	<45DC2C5E.1040808@redhat.com>	<1172059653.18210.166.camel@pc029.sc.diamond.ac.uk>	<45DC4934.4040504@redhat.com>	<45EDEFD6.8070802@atichile.com>
	<45EE7D7E.6030101@redhat.com>
Message-ID: <45EEC4A7.8030200@atichile.com>

Hi

The "IPtable" service is not running on both nodes.
We are thinking in update the platform (RHE4 U4 RHCS 4U4) but thid is 
not easy right now because we have several servers on production.
Another reason to not do it the version update is that we are waiting 
for an update 5 por RHE4 or the production release for RHE5.
In this moment we only update "rgmanager" in some sites (we have several 
issues with the rgmanager of update 2 RHCS4).

Thanks again for you reply.

Best regards,

Luis G.


Patrick Caulfield wrote:
> Luis Godoy Gonzalez wrote:
>   
>> Hi
>>
>> we have the same problem... :|   we have RHE4 U2 with cluster suite 4
>> U2, in our case one node send a fenced to the other node, and we have
>> not succes to rejoining the node to cluster.
>> On logs appeared that node 2 cannot comunicate with node 1, but the
>> network connectivity is working fine
>> In a test we deleted the cluster.conf from node 2 and reboot it. After
>> the reboot the node got the last version of cluster.conf from node 1,
>> but still cannot joining to cluster again.
>>
>> Below of this mail, we attached a little dump from node 1 that were the
>> cluster service is running. 
>>     
>
> That's showing the same symptoms. The new node is sending joinreq messages but
> they are not received by the node that's already in the cluster.
>
> If you're running U2 you should upgrade anyway. there are lots of bugs fixed
> between that and the current U4.
>
>   



From pcaulfie at redhat.com  Wed Mar  7 14:56:55 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 07 Mar 2007 14:56:55 +0000
Subject: [Linux-cluster] node fails to join cluster after it was fenced
In-Reply-To: <45EEC4A7.8030200@atichile.com>
References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk>	<45D31766.3080908@redhat.com>	<1171469028.24507.109.camel@pc029.sc.diamond.ac.uk>	<45D339CF.7070408@redhat.com>	<1171474578.24507.148.camel@pc029.sc.diamond.ac.uk>	<45D422B7.30506@redhat.com>	<1171539363.24507.210.camel@pc029.sc.diamond.ac.uk>	<1172056251.18210.135.camel@pc029.sc.diamond.ac.uk>	<45DC2C5E.1040808@redhat.com>	<1172059653.18210.166.camel@pc029.sc.diamond.ac.uk>	<45DC4934.4040504@redhat.com>	<45EDEFD6.8070802@atichile.com>	<45EE7D7E.6030101@redhat.com>
	<45EEC4A7.8030200@atichile.com>
Message-ID: <45EED2B7.6040002@redhat.com>

Luis Godoy Gonzalez wrote:
> Hi
> 
> The "IPtable" service is not running on both nodes.
> We are thinking in update the platform (RHE4 U4 RHCS 4U4) but thid is
> not easy right now because we have several servers on production.
> Another reason to not do it the version update is that we are waiting
> for an update 5 por RHE4 or the production release for RHE5.
> In this moment we only update "rgmanager" in some sites (we have several
> issues with the rgmanager of update 2 RHCS4).
> 

It is really rather odd. Node 1 can obviously see the joinreq messages - at
least tcpdump can, but cman is either not seeing them or ignoring them.

What really bothers me is that this seems to be affecting U2 and U4 - if both of
you were using U3 I would think no more of it :)

Annoyingly it's hard to debug at this level (you can't strace a kernel thread!).
I"m pretty sure that a reboot of node1 would fix the problem but that's hardly
helpful.

-- 

patrick



From Britt.Treece at savvis.net  Wed Mar  7 15:17:17 2007
From: Britt.Treece at savvis.net (Treece, Britt)
Date: Wed, 7 Mar 2007 09:17:17 -0600
Subject: [Linux-cluster] RE: Errors trying to login to LT000: ...1006:Not
	Allowed
In-Reply-To: <C2124F6E.4721%britt.treece@savvis.net>
Message-ID: <D23D9462B361814BBF0D8205C6584BF2C33B3F@s228130hz1ew24.apptix-01.savvis.net>

Does anyone have any idea why incorrect entries in /etc/hosts of the
lock servers would intermittently cause the "Errors trying to login to
LT000: ...1006:Not Allowed?"  I would think this would be something that
if wrong should *consistently* cause the client not to be allowed into
the lockspace.
 
Additionally can anyone explain the fundamentals of GFS 6.0 lock tables
and the locking process.  A couple specific questions I have...
 
    What is the difference between LTPX and the LT000?
 
    What is the advantage of having additional lock tables and when
would having more be a disadvantage?
 
    Is each lock propagated to each locktable or is it held in only one
table?
 
    Is the highwater mark for each locktable or the sum of locks across
all locktables?
 
 
Regards,

Britt Treece

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Britt Treece
Sent: Monday, March 05, 2007 10:51 PM
To: linux clustering
Subject: Re: [Linux-cluster] RE: Errors trying to login to LT000:
...1006:Not Allowed


Not sure why my first post didn't, but here it is...

---
I am running a 13 node GFS (6.0.2.33) cluster with 10 mounting clients
and 3 dedicated lock servers.  The master lock server was rebooted and
the next slave in the voting order took over.  At that time 3 of the
client nodes started receiving login errors for the ltpx server

Mar  4 00:05:52 lock1 lock_gulmd_core[3798]: Master Node Is Logging Out
NOW! 
... 

Mar  4 00:05:52 lock2 lock_gulmd_core[24627]: Master Node has logged
out. 
Mar  4 00:05:54 lock2 lock_gulmd_core[24627]: I see no Masters, So I am
Arbitrating until enough Slaves talk to me. 
Mar  4 00:05:54 lock2 lock_gulmd_LTPX[24638]: New Master at lock2
:192.168.1.3 
Mar  4 00:05:56 lock2 lock_gulmd_core[24627]: Now have Slave quorum,
going full Master. 
Mar  4 00:11:39 lock2 lock_gulmd_core[24627]: Master Node Is Logging Out
NOW! 
... 

Mar  4 00:05:52 client1 kernel: lock_gulm: Checking for journals for
node "lock1 " 
Mar  4 00:05:52 client1 lock_gulmd_core[9383]: Master Node has logged
out. 
Mar  4 00:05:52 client1 kernel: lock_gulm: Checking for journals for
node "lock1 " 
Mar  4 00:05:56 client1 lock_gulmd_core[9383]: Found Master at lock2 ,
so I'm a Client. 
Mar  4 00:05:56 client1 lock_gulmd_core[9383]: Failed to receive a
timely heartbeat reply from Master. (t:1172988356370685 mb:1)

Mar  4 00:05:56 client1 lock_gulmd_LTPX[9390]: New Master at lock2
:192.168.1.3 
Mar  4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT002: (lock2 :192.168.1.3) 1006:Not Allowed 
Mar  4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT000: (lock2 :192.168.1.3) 1006:Not Allowed 
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT000: (lock2 :192.168.1.3) 1006:Not Allowed 
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT002: (lock2 :192.168.1.3) 1006:Not Allowed 
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT004: (lock2 :192.168.1.3) 1006:Not Allowed 
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT001: (lock2 :192.168.1.3) 1006:Not Allowed
---

Britt


On 3/5/07 10:30 PM, "Treece, Britt" <Britt.Treece at savvis.net> wrote:



	All, 
	
	After much further investigation I found /etc/hosts is off by
one for these 3 client nodes on all 3 lock servers.  Having fixed the
typo's is it safe to assume that the root of the problem trying to login
to LTPX is that /etc/hosts on the lock servers was wrong for these
nodes?  If yes, why would these 3 clients be allowed into the cluster
when it was originally started being that they had incorrect entries in
/etc/hosts?
	
	Regards, 
	
	Britt Treece 
	
	
________________________________

	--
	Linux-cluster mailing list
	Linux-cluster at redhat.com
	https://www.redhat.com/mailman/listinfo/linux-cluster
	



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070307/64f42ce4/attachment.htm>

From rpeterso at redhat.com  Wed Mar  7 16:15:41 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 07 Mar 2007 10:15:41 -0600
Subject: [Linux-cluster] uboot / bootloader support for GFS
In-Reply-To: <B14199FA0DBAAF4AA89E83EB41D354350329633A@xmb-sjc-22c.amer.cisco.com>
References: <B14199FA0DBAAF4AA89E83EB41D354350329633A@xmb-sjc-22c.amer.cisco.com>
Message-ID: <45EEE52D.6070405@redhat.com>

Sridhar Ramaswamy (srramasw) wrote:
> Has anyone tried to use uboot bootloader to boot off a local GFS
> filesystem? For that matter any other bootloader?
>  
> thanks,
> Sridhar

Hi Sridhar,

I'm familiar with uboot, but I haven't tried booting a GFS root from it.
I only know about one platform for using GFS as a root partition, and
that is open-sharedroot.  See:

http://sources.redhat.com/cluster/faq.html#gfs_diskless

Unfortunately, I haven't played with that either.

Regards,

Bob Peterson
Red Hat Cluster Suite



From james.lapthorn at lapthornconsulting.com  Wed Mar  7 16:36:14 2007
From: james.lapthorn at lapthornconsulting.com (James Lapthorn)
Date: Wed, 7 Mar 2007 16:36:14 -0000 (UTC)
Subject: [Linux-cluster] Quorum Disk question
Message-ID: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz>

Good afternoon,

Hopefully somebody can help with a Quorum disk question I have.  I have a
4 node cluster and have adopted a 'last man standing' approach.  Because
of this I use a quorum disk which is I have setup on my SAN.

I have added the following configuration into my cluster.conf file.

<quorumd device="/dev/vpathe" interval="1" tko="10" votes="4">
     <heuristic interval="2" program="ping leoukldb1 -c1 -t1" score="     
    1"/>
     <heuristic interval="2" program="ping leoukldb2 -c1 -t1" score="     
    1"/>
     <heuristic interval="2" program="ping leoukldb3 -c1 -t1" score="     
    1"/>
     <heuristic interval="2" program="ping leoukldb4 -c1 -t1" score="     
    1"/>
</quorumd>

Each node has 1 vote and the quorum disk has 4, therefore Quorum should
remain if 3 nodes are removed from the cluster.

While testing this I have noticed that i get the following error in the log:

 qdiskd[26676]: <notice> Score insufficient for master operation (1/2;
max=4); downgrading

At this point Activity is blocked.

Is this to do with my heuristic programs, should I not ping each memeber
node and maybe ping something like 'localhost'



PLease help


James Lapthorn
_________________________________
This email has been ClamScanned !
          www.clamav.net



From lgodoy at atichile.com  Wed Mar  7 17:19:27 2007
From: lgodoy at atichile.com (Luis Godoy Gonzalez)
Date: Wed, 07 Mar 2007 14:19:27 -0300
Subject: [Linux-cluster] node fails to join cluster after it was fenced
In-Reply-To: <45EED2B7.6040002@redhat.com>
References: <1171458304.24507.91.camel@pc029.sc.diamond.ac.uk>	<45D31766.3080908@redhat.com>	<1171469028.24507.109.camel@pc029.sc.diamond.ac.uk>	<45D339CF.7070408@redhat.com>	<1171474578.24507.148.camel@pc029.sc.diamond.ac.uk>	<45D422B7.30506@redhat.com>	<1171539363.24507.210.camel@pc029.sc.diamond.ac.uk>	<1172056251.18210.135.camel@pc029.sc.diamond.ac.uk>	<45DC2C5E.1040808@redhat.com>	<1172059653.18210.166.camel@pc029.sc.diamond.ac.uk>	<45DC4934.4040504@redhat.com>	<45EDEFD6.8070802@atichile.com>	<45EE7D7E.6030101@redhat.com>	<45EEC4A7.8030200@atichile.com>
	<45EED2B7.6040002@redhat.com>
Message-ID: <45EEF41F.1040704@atichile.com>


We are programming a Maintenance Window to reboot node 1, bellow you can 
find more configuration info.

Until this moment, we have had two problems that we describe like "big 
problems". One of them was solved with a rgmanager update, and the other 
(more extrange) was solved  changing a 10/100/1000 switch for a 10/100 
switch (that is the used in our producction platforms) .

Bellow I athach too a generic diagram of this instalation. This 
instalation particulary only have one switch (commonly we have two 
switch for redundant)

Thanks & Regards
Luis G.

================================================================================

[root at lvs-gt1 ~]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  lvs-gt2                                  Offline
  lvs-gt1                                  Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  XXX1               lvs-gt1                        started
  XXX2               lvs-gt1                        started
[root at lvs-gt1 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 10
Cluster name: lb_cluster
Cluster ID: 40372
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1
Active subsystems: 4
Node name: lvs-gt1
Node addresses: 192.168.150.21

[root at lvs-gt1 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    1   M   lvs-gt1
   2    1    1   X   lvs-gt2

[root at lvs-gt1 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1]

DLM Lock Space:  "Magma"                             3   4 run       -
[1]

User:            "usrm::manager"                     2   3 run       -
[1]

[root at lvs-gt1 ~]# uname -a
Linux lvs-gt1 2.6.9-22.EL #1 Mon Sep 19 18:20:28 EDT 2005 i686 athlon 
i386 GNU/Linux
[root at lvs-gt1 ~]# rpm -qa | grep cman
cman-kernel-2.6.9-39.5
cman-1.0.2-0
cman-kernel-hugemem-2.6.9-39.5
cman-kernheaders-2.6.9-39.5
cman-kernel-smp-2.6.9-39.5
[root at lvs-gt1 ~]# rpm -qa | grep -i  ccs
ccs-1.0.2-0
[root at lvs-gt1 ~]# rpm -qa | grep -i  fence
fence-1.32.6-0
[root at lvs-gt1 ~]# rpm -qa | grep -i  rgma
rgmanager-1.9.53-0

OTHER NODE
==========
[root at lvs-gt1 log]# ssh lvs-gt2
Last login: Tue Mar  6 17:57:07 2007 from 172.22.22.52
[root at lvs-gt2 ~]# tail /var/log/messages
Mar  7 09:47:06 lvs-gt2 kernel: CMAN: sending membership request
Mar  7 09:47:41 lvs-gt2 last message repeated 7 times
Mar  7 09:47:56 lvs-gt2 last message repeated 3 times
Mar  7 09:47:57 lvs-gt2 sshd(pam_unix)[13006]: session opened for user 
root by root(uid=0)
Mar  7 09:48:01 lvs-gt2 kernel: CMAN: sending membership request
Mar  7 09:48:01 lvs-gt2 crond(pam_unix)[12936]: session closed for user root
Mar  7 09:48:01 lvs-gt2 crond(pam_unix)[13039]: session opened for user 
root by (uid=0)
Mar  7 09:48:01 lvs-gt2 su(pam_unix)[13044]: session opened for user 
admin by (uid=0)
Mar  7 09:48:01 lvs-gt2 su(pam_unix)[13044]: session closed for user admin
Mar  7 09:48:06 lvs-gt2 kernel: CMAN: sending membership request
[root at lvs-gt2 ~]#
[root at lvs-gt2 ~]# clustat
Segmentation fault
[root at lvs-gt2 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 10
Cluster name: lb_cluster
Cluster ID: 40372
Cluster Member: No
Membership state: Joining







Patrick Caulfield wrote:
> Luis Godoy Gonzalez wrote:
>   
>> Hi
>>
>> The "IPtable" service is not running on both nodes.
>> We are thinking in update the platform (RHE4 U4 RHCS 4U4) but thid is
>> not easy right now because we have several servers on production.
>> Another reason to not do it the version update is that we are waiting
>> for an update 5 por RHE4 or the production release for RHE5.
>> In this moment we only update "rgmanager" in some sites (we have several
>> issues with the rgmanager of update 2 RHCS4).
>>
>>     
>
> It is really rather odd. Node 1 can obviously see the joinreq messages - at
> least tcpdump can, but cman is either not seeing them or ignoring them.
>
> What really bothers me is that this seems to be affecting U2 and U4 - if both of
> you were using U3 I would think no more of it :)
>
> Annoyingly it's hard to debug at this level (you can't strace a kernel thread!).
> I"m pretty sure that a reboot of node1 would fix the problem but that's hardly
> helpful.
>
>   

-------------- next part --------------
A non-text attachment was scrubbed...
Name: generico.png
Type: image/png
Size: 31248 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070307/234bbb10/attachment.png>

From Bowie_Bailey at BUC.com  Wed Mar  7 17:23:15 2007
From: Bowie_Bailey at BUC.com (Bowie Bailey)
Date: Wed, 7 Mar 2007 12:23:15 -0500 
Subject: [Linux-cluster] Shutdown a cluster
Message-ID: <4766EEE585A6D311ADF500E018C154E30268524C@bnifex.cis.buc.com>

What is the proper way to shutdown an entire cluster?

I have a 3-node cluster.  I can shutdown any one node with no problems,
but when I try to shutdown the second and third nodes, it locks up
trying to stop the cluster processes since it loses quorum at that
point.  There has to be a way to tell the cluster to do a complete
shutdown.

Any pointers?

Thanks,

-- 
Bowie



From jleafey at utmem.edu  Wed Mar  7 17:32:20 2007
From: jleafey at utmem.edu (Jay Leafey)
Date: Wed, 07 Mar 2007 11:32:20 -0600
Subject: [Linux-cluster] Quorum Disk question
In-Reply-To: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz>
References: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz>
Message-ID: <45EEF724.2050408@utmem.edu>

James Lapthorn wrote:
> Good afternoon,
> 
> Hopefully somebody can help with a Quorum disk question I have.  I have a
> 4 node cluster and have adopted a 'last man standing' approach.  Because
> of this I use a quorum disk which is I have setup on my SAN.
> 
> I have added the following configuration into my cluster.conf file.
> 
> <quorumd device="/dev/vpathe" interval="1" tko="10" votes="4">
>      <heuristic interval="2" program="ping leoukldb1 -c1 -t1" score="     
>     1"/>
>      <heuristic interval="2" program="ping leoukldb2 -c1 -t1" score="     
>     1"/>
>      <heuristic interval="2" program="ping leoukldb3 -c1 -t1" score="     
>     1"/>
>      <heuristic interval="2" program="ping leoukldb4 -c1 -t1" score="     
>     1"/>
> </quorumd>
> 
> Each node has 1 vote and the quorum disk has 4, therefore Quorum should
> remain if 3 nodes are removed from the cluster.
> 
> While testing this I have noticed that i get the following error in the log:
> 
>  qdiskd[26676]: <notice> Score insufficient for master operation (1/2;
> max=4); downgrading
> 
> At this point Activity is blocked.
> 
> Is this to do with my heuristic programs, should I not ping each memeber
> node and maybe ping something like 'localhost'
> 

Instead of pinging the individual nodes in the cluster, how about 
pinging the default router?  If the entire network goes away, all nodes 
in the cluster should become inquorate and block activity.  If an 
Ethernet drop to a single host goes bad, it will lose quorum but the 
other hosts should remain quorate.  That's the approach we're using and 
it seems to work OK so far.

Hope that helps!
-- 
Jay Leafey - University of Tennessee
E-Mail:  jleafey at utmem.edu  Phone:  901-448-5848  FAX:  901-448-8199
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 5153 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070307/69f68923/attachment.bin>

From lhh at redhat.com  Wed Mar  7 17:54:53 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 07 Mar 2007 12:54:53 -0500
Subject: [Linux-cluster] RHEL4 cluster NFS
In-Reply-To: <45EDF807.3050104@arts.usyd.edu.au>
References: <45ED339D.7000504@arts.usyd.edu.au>
	<1173193552.14390.4.camel@asuka.boston.devel.redhat.com>
	<45EDF807.3050104@arts.usyd.edu.au>
Message-ID: <1173290093.12686.16.camel@asuka.boston.devel.redhat.com>

On Wed, 2007-03-07 at 10:23 +1100, Matthew Geier wrote:
> Lon Hohberger wrote:
> > On Tue, 2007-03-06 at 20:25 +1100, Matthew Geier wrote:
> > 
> >>  Any one have a concise example on how to NFS export an ext3 filesystem
> >> on RHEL U4 cluster suite. ?
> 
>   Ok, thanks, it's 'clicked' now and I have the idea. Some one emailed 
> me directly a screen grab of the layout in system-config-cluster that 
> showed me the relationship I was missing, however last evening while 
> relaxing in the bath, I had a 'flash of inspiration' on how the 
> relationship between the file systems in the services section and the 
> NFS clients went together and I tried it remotely and it seems to work. 
> all your helpful emails arrived later.
> 
>   It's still not perfect, but functional
> 
>                          <fs ref="Files - u3">
>                                  <nfsexport name="NFS Export u3">
>                                          <nfsclient name="whitestar NFS" 
> options="async,rw" target="whitestar.arts.usyd.edu.au"/>
>                                  </nfsexport>
>                          </fs>
> 
>    I gather the nfsclient should be a public resource so it can be 
> reused on other file systems. I made it private. Have to wait to my next 
> maintenance window to change it as the resulting service restart will 
> annoy all my Mac users. (Unlike Windows, Mac's don't expect their 
> servers to go down all the time :-)

haha :)

>   What does the actual nfsexport directive do ?. It seems to be that 
> adding an nfsclient to a filesystem resource would imply it.

It's basically a per-mountpoint script that does NFS cleanups prior to
allowing the file systems to be unmounted, and does sanity checks (makes
sure nfsd is running prior to trying to call 'exportfs', for example).

It's also a placeholder for future, um "special steps", that might need
to happen as Linux NFS changes over time.

It is designed to inherit everything it needs to know from the parent
<fs> resource (and its parent service resource), which is why it can be
reused.

-- Lon



From rpeterso at redhat.com  Wed Mar  7 17:56:05 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 07 Mar 2007 11:56:05 -0600
Subject: [Linux-cluster] Shutdown a cluster
In-Reply-To: <4766EEE585A6D311ADF500E018C154E30268524C@bnifex.cis.buc.com>
References: <4766EEE585A6D311ADF500E018C154E30268524C@bnifex.cis.buc.com>
Message-ID: <45EEFCB5.4000607@redhat.com>

Bowie Bailey wrote:
> What is the proper way to shutdown an entire cluster?
> 
> I have a 3-node cluster.  I can shutdown any one node with no problems,
> but when I try to shutdown the second and third nodes, it locks up
> trying to stop the cluster processes since it loses quorum at that
> point.  There has to be a way to tell the cluster to do a complete
> shutdown.
> 
> Any pointers?
> 
> Thanks,
> 
Hi Bowie,

http://sources.redhat.com/cluster/faq.html#cman_shutdown

Regards,

Bob Peterson
Red Hat Cluster Suite



From lhh at redhat.com  Wed Mar  7 18:03:32 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 07 Mar 2007 13:03:32 -0500
Subject: [Linux-cluster] Quorum Disk question
In-Reply-To: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz>
References: <35859.193.133.138.40.1173285374.squirrel@www.lapthorn.biz>
Message-ID: <1173290612.12686.25.camel@asuka.boston.devel.redhat.com>

On Wed, 2007-03-07 at 16:36 +0000, James Lapthorn wrote:
> Good afternoon,
> 
> Hopefully somebody can help with a Quorum disk question I have.  I have a
> 4 node cluster and have adopted a 'last man standing' approach.  Because
> of this I use a quorum disk which is I have setup on my SAN.
> 
> I have added the following configuration into my cluster.conf file.
> 
> <quorumd device="/dev/vpathe" interval="1" tko="10" votes="4">
>      <heuristic interval="2" program="ping leoukldb1 -c1 -t1" score="     
>     1"/>
>      <heuristic interval="2" program="ping leoukldb2 -c1 -t1" score="     
>     1"/>
>      <heuristic interval="2" program="ping leoukldb3 -c1 -t1" score="     
>     1"/>
>      <heuristic interval="2" program="ping leoukldb4 -c1 -t1" score="     
>     1"/>
> </quorumd>
> 
> Each node has 1 vote and the quorum disk has 4, therefore Quorum should
> remain if 3 nodes are removed from the cluster.
> 
> While testing this I have noticed that i get the following error in the log:
> 
>  qdiskd[26676]: <notice> Score insufficient for master operation (1/2;
> max=4); downgrading
> 
> At this point Activity is blocked.
> 
> Is this to do with my heuristic programs, should I not ping each memeber
> node and maybe ping something like 'localhost'

Yes, that's the problem.  Try one heuristic with a big score and pinging
the closest router.

Note that if you have any unreliable parts of the network with the
RHEL4U4 Qdisk, you may experience false 'downgrades' (bad).

If you have a current CVS update of RHEL4/STABLE/RHEL5/etc., you can run
with no heuristics at all - which will be a "last-man-standing"
approach.

Additionally, the 'false downgrade' problem will be fixed in RHEL4U5
(and is fixed in CVS) by adding 'tko' counts to the heuristics.

I have updated cman / cman-kernel packages which have several qdisk
fixes here (including the two above):

http://people.redhat.com/lhh/packages.html

-- Lon



From Bowie_Bailey at BUC.com  Wed Mar  7 18:22:25 2007
From: Bowie_Bailey at BUC.com (Bowie Bailey)
Date: Wed, 7 Mar 2007 13:22:25 -0500 
Subject: [Linux-cluster] Shutdown a cluster
Message-ID: <4766EEE585A6D311ADF500E018C154E30268524E@bnifex.cis.buc.com>

Robert Peterson wrote:
> Bowie Bailey wrote:
> > What is the proper way to shutdown an entire cluster?
> > 
> > I have a 3-node cluster.  I can shutdown any one node with no
> > problems, but when I try to shutdown the second and third nodes, it
> > locks up trying to stop the cluster processes since it loses quorum
> > at that point.  There has to be a way to tell the cluster to do a
> > complete shutdown. 
> > 
> > Any pointers?
> > 
> > Thanks,
> > 
> Hi Bowie,
> 
> http://sources.redhat.com/cluster/faq.html#cman_shutdown

Now why couldn't I find that??  :)

Looking at my init scripts, it looks like the system should run this
command when I do a normal shutdown:

    cman_tool -t 60 -w leave remove

Should this be sufficient to allow me to shut down the entire cluster by
just issuing a "shutdown" command on each node, or is there something I
am missing?

-- 
Bowie



From rpeterso at redhat.com  Wed Mar  7 18:56:10 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 07 Mar 2007 12:56:10 -0600
Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer
In-Reply-To: <1638.1172372671@sss.pgh.pa.us>
References: <1638.1172372671@sss.pgh.pa.us>
Message-ID: <45EF0ACA.1090900@redhat.com>

Tom Lane wrote:
> Can someone help out this questioner?  I know zip about Cluster.
> I looked at the FAQ for a bit and thought that what he wants is
> probably doable, but I couldn't tell if it would be easy or
> painful to do load-balancing in this particular way.  (And I'm not
> qualified to say if what he wants is a sensible approach, either.)
> 
> 			regards, tom lane
> 
> ------- Forwarded Message
> 
> Date:    Sat, 24 Feb 2007 15:37:17 +0000
> From:    Ivan Zoratti <ivan at mysql.com>
> To:      tgl at redhat.com
> Subject: Question on RH Cluster from a MySQL Customer
> 
> Dear Tom,
> 
> first of all, let me introduce myself. I am the Sales Engineering  
> Manager for EMEA at MySQL. Kath O'Neil, our Director of Strategic  
> Alliances, kindly gave me your name for a technical question related  
> to the use of Red Hat and MySQL - hopefully leading to the adoption  
> of RH Cluster.
> 
> Our customer is looking for a solution that could provide high  
> availability and scalability in a cluster environment based on linux  
> servers that are connected to a large SAN. Their favourite choice  
> would be to go with Red Hat.
> Each server connected to the SAN would provide resources to host,  
> let's say, 5 different instances of MySQL (mysqld). Each mysqld will  
> have its own configuration, datadir, connection port and IP address.
> The clustering software should be able to load-balance new mysqld  
> instances on the available servers. For example, considering servers  
> with same specs and workload, when the first mysqld starts, it will  
> be placed on Server A, the second one will go on Server B and so on  
> for C,D and E. The sixth mysqld will then go on A again, then B and  
> so forth. If one of the server fails, the mysqld(s) is (or are)  
> "moved" on the other servers, still in a way to guarantee a load- 
> balance of the whole system.
> After my long (and hopefully clear enough) explanation, my quick  
> question is: does RH Cluster provide this kind of features? I am  
> mostly interested in the way we can instatiate mysqld and re-launch  
> them on any other server in the cluster in case of fault.
> 
> I would be very grateful if you could help me or address me to  
> somebody or something for an answer.
> 
> Thank you in advance for your help.
> 
> Kind Regards,
> 
> Ivan
> 
> 
> --
>    Ivan Zoratti - Sales Engineering Manager EMEA
> 
>    MySQL AB - Windsor - UK
>    Mobile: +44 7866 363 180
> 
>    ivan at mysql.com
>    http://www.mysql.com

Hi Tom, Ivan, and linux-cluster readers,

In theory, our Piranha / LVS (Linux Virtual Server) may be used to
load-balance the requests to numerous mysql servers in a cluster.

Our rgmanager can provide the High Availability to fail over
mysql services to other nodes in the cluster if they fail.

However, if the mysqld daemons are all running on a SAN and you're
mysqld daemons are trying to serve data from the same file system, you
probably have a problem.  To share the data/database on the SAN in
one harmonious file system, you could use the GFS file system, but 
"regular" mysql is not cluster-aware (to the best of my knowledge).  
The sum of my understanding about this may be found here:

http://sources.redhat.com/cluster/faq.html#gfs_mysql

Since Ivan works for mysql, perhaps he can clear this up if
it's not accurate.  I'd like to know more about "mysql-cluster"
and how it's implemented.  I'd like to see mysql implemented as
a cluster-friendly app using our cluster infrastructure so they
can effectively compete against Oracle RAC without reinventing
the wheel.  I'd even like to be a part of the effort to make this 
happen.  Hope this helps.

Regards,

Bob Peterson
Red Hat Cluster Suite



From Britt.Treece at savvis.net  Tue Mar  6 03:45:32 2007
From: Britt.Treece at savvis.net (Treece, Britt)
Date: Mon, 5 Mar 2007 21:45:32 -0600
Subject: [Linux-cluster] Errors trying to login to LT000: ... 1006:Not
	Allowed
Message-ID: <D23D9462B361814BBF0D8205C6584BF201387586@s228130hz1ew24.apptix-01.savvis.net>

All,

I am running a 13 node GFS (6.0.2.33) cluster with 10 mounting clients
and 3 dedicated lock servers.  The master lock server was rebooted and
the next slave in the voting order took over.  At that time 3 of the
client nodes started receiving login errors for the ltpx server

Mar  4 00:05:52 lock1 lock_gulmd_core[3798]: Master Node Is Logging Out
NOW!
...

Mar  4 00:05:52 lock2 lock_gulmd_core[24627]: Master Node has logged
out.
Mar  4 00:05:54 lock2 lock_gulmd_core[24627]: I see no Masters, So I am
Arbitrating until enough Slaves talk to me.
Mar  4 00:05:54 lock2 lock_gulmd_LTPX[24638]: New Master at lock2
:192.168.1.3
Mar  4 00:05:56 lock2 lock_gulmd_core[24627]: Now have Slave quorum,
going full Master.
Mar  4 00:11:39 lock2 lock_gulmd_core[24627]: Master Node Is Logging Out
NOW!
...

Mar  4 00:05:52 client1 kernel: lock_gulm: Checking for journals for
node "lock1 "
Mar  4 00:05:52 client1 lock_gulmd_core[9383]: Master Node has logged
out.
Mar  4 00:05:52 client1 kernel: lock_gulm: Checking for journals for
node "lock1 "
Mar  4 00:05:56 client1 lock_gulmd_core[9383]: Found Master at lock2 ,
so I'm a Client.
Mar  4 00:05:56 client1 lock_gulmd_core[9383]: Failed to receive a
timely heartbeat reply from Master. (t:1172988356370685 mb:1)
Mar  4 00:05:56 client1 lock_gulmd_LTPX[9390]: New Master at lock2
:192.168.1.3
Mar  4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT002: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:01 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT000: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT000: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT002: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT004: (lock2 :192.168.1.3) 1006:Not Allowed
Mar  4 00:06:02 client1 lock_gulmd_LTPX[9390]: Errors trying to login to
LT001: (lock2 :192.168.1.3) 1006:Not Allowed


Anyone have any idea what might be causing this?

Regards,

Britt Treece

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070305/f6a46e76/attachment.htm>

From ivan at mysql.com  Thu Mar  8 10:39:36 2007
From: ivan at mysql.com (Ivan Zoratti)
Date: Thu, 8 Mar 2007 10:39:36 +0000
Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer
In-Reply-To: <45EF0ACA.1090900@redhat.com>
References: <1638.1172372671@sss.pgh.pa.us> <45EF0ACA.1090900@redhat.com>
Message-ID: <0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com>

Hi Robert,

First of all, thanks for your time, I really appreciate it.
I'd like to reply to two separate topics here: first, the objective  
of my question and second, the cluster-awareness of MySQL and the use  
of GFS with MySQL.

My original question was mainly related to the use of Piranha to  
switch over a service (ie, a specific mysql daemon) from one server  
to another, in case of fault. There should be only one active service  
in the cluster, therefore no concurrency or locking issues should  
happen.
The ideal system should be able to:
- have a list of services to launch on the cluster
- identify the node in the cluster suitable to host the service (for  
example the node with less workload)
- check the availability of the service
- stop the service on a node (if the service is not already down) and  
start the service on another node in case of fault

Fault tolerance in this case will be provided by the ability to  
switch the service from one server to another in the cluster.
Scalability is not provided within the service, ie the limitation in  
resources for the service consist of the resources available on that  
specific server.

I understand that your cluster suite can provide this functionality.  
I am mainly looking for a supported set of features for an enterprise  
organisation.

The second topic is related to the use of MySQL with clusters and  
specifically with GFS. It is what we use to call MySQL in active- 
active clustering. I am afraid your documentation is not totally  
accurate. Unfortunately, information on the Internet (and also on our  
web site) are often contradictory.
It is indeed possible to run multiple mysqld services on different  
cluster nodes, all sharing the same data structure on shared storage,  
with this configuration:
- Only the MyISAM storage engine can be used
- Each mysqld service must start with the external-locking parameter on
- Each mysqld service hase to have the query cache parameter off  
(other cache mechanisms remain on, since they are automatically  
invalidated by external locking)

I am afraid this configuration still does not compete against Oracle  
RAC. MySQL does not provide a solution that can be compared 1:1 with  
RAC. You may find some MySQL implementations much more effective than  
RAC for certain environments, as you will certainly find RAC  
performing better than MySQL on other implementations.

Based on the experience of the sales engineering team, customers have  
never been disappointed by the technology that MySQL can provide as  
an alternative to RAC. Decisions are based on many other factors,  
such as the introduction of another (or a different) database, the  
cost of migrating current applications and compatibility with third  
party products. You can imagine we are working hard to remove these  
obstacles.

Thanks again for your help,

Kind Regards,

Ivan

--
   Ivan Zoratti - Sales Engineering Manager EMEA

   MySQL AB - Windsor - UK
   Mobile: +44 7866 363 180

   ivan at mysql.com
   http://www.mysql.com
--


On 7 Mar 2007, at 18:56, Robert Peterson wrote:

> Tom Lane wrote:
>> Can someone help out this questioner?  I know zip about Cluster.
>> I looked at the FAQ for a bit and thought that what he wants is
>> probably doable, but I couldn't tell if it would be easy or
>> painful to do load-balancing in this particular way.  (And I'm not
>> qualified to say if what he wants is a sensible approach, either.)
>> 			regards, tom lane
>> ------- Forwarded Message
>> Date:    Sat, 24 Feb 2007 15:37:17 +0000
>> From:    Ivan Zoratti <ivan at mysql.com>
>> To:      tgl at redhat.com
>> Subject: Question on RH Cluster from a MySQL Customer
>> Dear Tom,
>> first of all, let me introduce myself. I am the Sales Engineering   
>> Manager for EMEA at MySQL. Kath O'Neil, our Director of Strategic   
>> Alliances, kindly gave me your name for a technical question  
>> related  to the use of Red Hat and MySQL - hopefully leading to  
>> the adoption  of RH Cluster.
>> Our customer is looking for a solution that could provide high   
>> availability and scalability in a cluster environment based on  
>> linux  servers that are connected to a large SAN. Their favourite  
>> choice  would be to go with Red Hat.
>> Each server connected to the SAN would provide resources to host,   
>> let's say, 5 different instances of MySQL (mysqld). Each mysqld  
>> will  have its own configuration, datadir, connection port and IP  
>> address.
>> The clustering software should be able to load-balance new mysqld   
>> instances on the available servers. For example, considering  
>> servers  with same specs and workload, when the first mysqld  
>> starts, it will  be placed on Server A, the second one will go on  
>> Server B and so on  for C,D and E. The sixth mysqld will then go  
>> on A again, then B and  so forth. If one of the server fails, the  
>> mysqld(s) is (or are)  "moved" on the other servers, still in a  
>> way to guarantee a load- balance of the whole system.
>> After my long (and hopefully clear enough) explanation, my quick   
>> question is: does RH Cluster provide this kind of features? I am   
>> mostly interested in the way we can instatiate mysqld and re- 
>> launch  them on any other server in the cluster in case of fault.
>> I would be very grateful if you could help me or address me to   
>> somebody or something for an answer.
>> Thank you in advance for your help.
>> Kind Regards,
>> Ivan
>> --
>>    Ivan Zoratti - Sales Engineering Manager EMEA
>>    MySQL AB - Windsor - UK
>>    Mobile: +44 7866 363 180
>>    ivan at mysql.com
>>    http://www.mysql.com
>
> Hi Tom, Ivan, and linux-cluster readers,
>
> In theory, our Piranha / LVS (Linux Virtual Server) may be used to
> load-balance the requests to numerous mysql servers in a cluster.
>
> Our rgmanager can provide the High Availability to fail over
> mysql services to other nodes in the cluster if they fail.
>
> However, if the mysqld daemons are all running on a SAN and you're
> mysqld daemons are trying to serve data from the same file system, you
> probably have a problem.  To share the data/database on the SAN in
> one harmonious file system, you could use the GFS file system, but  
> "regular" mysql is not cluster-aware (to the best of my  
> knowledge).  The sum of my understanding about this may be found here:
>
> http://sources.redhat.com/cluster/faq.html#gfs_mysql
>
> Since Ivan works for mysql, perhaps he can clear this up if
> it's not accurate.  I'd like to know more about "mysql-cluster"
> and how it's implemented.  I'd like to see mysql implemented as
> a cluster-friendly app using our cluster infrastructure so they
> can effectively compete against Oracle RAC without reinventing
> the wheel.  I'd even like to be a part of the effort to make this  
> happen.  Hope this helps.
>
> Regards,
>
> Bob Peterson
> Red Hat Cluster Suite



From dave at eons.com  Thu Mar  8 16:15:32 2007
From: dave at eons.com (Dave Berry)
Date: Thu, 08 Mar 2007 11:15:32 -0500
Subject: [Linux-cluster] Failover not working
Message-ID: <45F036A4.1090106@eons.com>

I have a 3 node GFS cluster sharing 2 virtual IPs as 2 different 
services.  For some reason the failover is not working correctly.  The 
IPs are listed as services in the cluster.conf and the failover is set 
to use ordered/restricted.  Below is the pertinent cluster.conf parts.  
The IPs failover when the box goes down but does not fail back to the 
correctly prioritized box when it returns.   I have included the error 
from the log at the end.  Thanks.

 <failoverdomains>
                        <failoverdomain name="nfs_domain1" ordered="1" 
restricted="1">
                                <failoverdomainnode name="fs101" 
priority="1"/>
                                <failoverdomainnode name="fs102" 
priority="2"/>
                                <failoverdomainnode name="fs103" 
priority="3"/>
                        </failoverdomain>
                        <failoverdomain name="nfs_domain2" ordered="1" 
restricted="1">
                                <failoverdomainnode name="fs102" 
priority="1"/>
                                <failoverdomainnode name="fs101" 
priority="2"/>
                                <failoverdomainnode name="fs103" 
priority="3"/>
                        </failoverdomain>
                </failoverdomains>

 <resources>
                        <clusterfs device="/dev/mapper/mpath6p1" 
force_unmount="0" fsid="49841" fstype="gfs" 
mountpoint="/opt/eons/shared" name="nfs1"
 options=""/>
                        <ip address="192.168.1.200" monitor_link="1"/>
                        <ip address="192.168.1.201" monitor_link="1"/>
</resources>
 <service autostart="1" domain="nfs_domain1" name="nfs_ip1">
                        <ip ref="192.168.1.200"/>
                </service>
                <service autostart="1" domain="nfs_domain2" name="nfs_ip2">
                        <ip ref="192.168.1.201"/>
                </service>

Mar  8 11:03:26 fs101 clurgmgrd[5684]: <debug> Relocating group nfs_ip2 
to better node fs102
Mar  8 11:03:26 fs101 clurgmgrd[5684]: <debug> Event (0:2:1) Processed
Mar  8 11:03:26 fs101 clurgmgrd[5684]: <notice> Stopping service nfs_ip2
Mar  8 11:03:26 fs101 clurgmgrd[5684]: <err> #52: Failed changing RG status
Mar  8 11:03:26 fs101 clurgmgrd[5684]: <debug> Handling failure request 
for RG nfs_ip2
Mar  8 11:03:26 fs101 clurgmgrd[5684]: <err> #57: Failed changing RG status



From filipe.miranda at gmail.com  Thu Mar  8 17:24:14 2007
From: filipe.miranda at gmail.com (Filipe Miranda)
Date: Thu, 8 Mar 2007 14:24:14 -0300
Subject: [Linux-cluster] How GULM works ?
Message-ID: <a6d13c780703080924u771c3ffax8fa12d59b49b44bd@mail.gmail.com>

When using the DLM lock system, the CCSD pass on the structure
information to the CMAN. CMAN will be responsible for the cluster
membership, heartbeat, and cluster communication. All other layers above
  relay on CMAN.

If I'm using GULM, CMAN is not installed, how the architecture works in
this case?



-- 
---
Filipe T Miranda
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070308/5aae7175/attachment.htm>

From lhh at redhat.com  Thu Mar  8 17:34:46 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 08 Mar 2007 12:34:46 -0500
Subject: [Linux-cluster] Failover not working
In-Reply-To: <45F036A4.1090106@eons.com>
References: <45F036A4.1090106@eons.com>
Message-ID: <1173375286.12686.26.camel@asuka.boston.devel.redhat.com>

On Thu, 2007-03-08 at 11:15 -0500, Dave Berry wrote:
> I have a 3 node GFS cluster sharing 2 virtual IPs as 2 different 
> services.  For some reason the failover is not working correctly.  The 
> IPs are listed as services in the cluster.conf and the failover is set 
> to use ordered/restricted.  Below is the pertinent cluster.conf parts.  
> The IPs failover when the box goes down but does not fail back to the 
> correctly prioritized box when it returns.   I have included the error 
> from the log at the end.  Thanks.


> Mar  8 11:03:26 fs101 clurgmgrd[5684]: <debug> Relocating group nfs_ip2 
> to better node fs102
> Mar  8 11:03:26 fs101 clurgmgrd[5684]: <debug> Event (0:2:1) Processed
> Mar  8 11:03:26 fs101 clurgmgrd[5684]: <notice> Stopping service nfs_ip2
> Mar  8 11:03:26 fs101 clurgmgrd[5684]: <err> #52: Failed changing RG status
> Mar  8 11:03:26 fs101 clurgmgrd[5684]: <debug> Handling failure request 
> for RG nfs_ip2
> Mar  8 11:03:26 fs101 clurgmgrd[5684]: <err> #57: Failed changing RG status

That shouldn't happen - what rgmanager RPM do you have?

-- Lon



From rstevens at vitalstream.com  Thu Mar  8 18:07:51 2007
From: rstevens at vitalstream.com (Rick Stevens)
Date: Thu, 08 Mar 2007 10:07:51 -0800
Subject: [Linux-cluster] 2.6.20-rc4 gfs2 bug
In-Reply-To: <20070125050731.GA23270@chaos.ao.net>
References: <20070125050731.GA23270@chaos.ao.net>
Message-ID: <1173377271.30562.9.camel@prophead.corp.publichost.com>

On Thu, 2007-01-25 at 00:07 -0500, Dan Merillat wrote:
> Running 2.6.20-rc4 _WITH_ the following patch: (Shouldn't be the issue,
> but just in case, I'm listing it here)

Not adding to the thread here, but Dan, check the date on your machine.
This just showed up in my mailbox (8 March) and its headers say it was
sent on 25 January!

> 
> Date:	Fri, 29 Dec 2006 21:03:57 +0100
> From:	Ingo Molnar <mingo at elte.hu>
> Subject: [patch] remove MAX_ARG_PAGES
> Message-ID: <20061229200357.GA5940 at elte.hu>
> 
> Linux fileserver 2.6.20-rc4MAX_ARGS #4 PREEMPT Fri Jan 12 03:58:25 EST 2007 x86_64 GNU/Linux
> 
> This happened when I started testing gfs2 for the first time.  I
> installed userspace from CVS, loaded the gfs2/dlm modules, mkfs.gfs2,
> then "mount -t gfs2 -v /dev/vg1/gfs2 /mnt/gfs"
> 
> This was the initial mount of the new filesystem.  I can create
> directories, but attempting a stress-test with bonnie seems to have
> deadlocked something.  (at "Start 'em", immediately.)
> 
> To clarify: the two oopses happened at first mount.  After that, I
> created files/directories, then attempted to stress it a bit with
> bonnie++.  No further oops/dmesg output.
> 
> For the GFS2 folks, latest CVS gfs_tool doesn't have lockdump, is there
> any way to examine what I'm stuck on?
> 
> This machine is specifically for testing new things before I put them
> into production, so I can leave it hung like this indefinitely for
> debugging.
> 
> 
> [845566.571468] GFS2 (built Jan 12 2007 04:02:27) installed
> [849416.113382] DLM (built Jan 12 2007 04:01:21) installed
> [849416.352219] Lock_DLM (built Jan 12 2007 04:02:46) installed
> [850966.368016] GFS2: fsid=: Trying to join cluster "lock_dlm", "internal:gfs-test"
> [850971.783223] dlm: gfs-test: recover 1
> [850971.783242] dlm: gfs-test: add member 1
> [850971.783246] dlm: gfs-test: total members 1 error 0
> [850971.783248] dlm: gfs-test: dlm_recover_directory
> [850971.783260] dlm: gfs-test: dlm_recover_directory 0 entries
> [850971.783270] dlm: gfs-test: recover 1 done: 0 ms
> [850971.783454] GFS2: fsid=internal:gfs-test.0: Joined cluster. Now mounting FS...
> [850973.409048] GFS2: fsid=internal:gfs-test.0: jid=0, already locked for use
> [850973.409135] GFS2: fsid=internal:gfs-test.0: jid=0: Looking at journal...
> [850973.504558] GFS2: fsid=internal:gfs-test.0: jid=0: Done
> [850973.504653] GFS2: fsid=internal:gfs-test.0: jid=1: Trying to acquire journal lock...
> [850973.517086] GFS2: fsid=internal:gfs-test.0: jid=1: Looking at journal...
> [850973.691546] GFS2: fsid=internal:gfs-test.0: jid=1: Done
> [850973.691635] GFS2: fsid=internal:gfs-test.0: jid=2: Trying to acquire journal lock...
> [850973.702646] GFS2: fsid=internal:gfs-test.0: jid=2: Looking at journal...
> [850973.846397] GFS2: fsid=internal:gfs-test.0: jid=2: Done
> 
> 
> [850973.869288] ------------[ cut here ]------------
> [850973.869294] kernel BUG at fs/gfs2/glock.c:738!
> [850973.869297] invalid opcode: 0000 [1] PREEMPT 
> [850973.869300] CPU 0 
> [850973.869302] Modules linked in: lock_dlm dlm gfs2 scsi_tgt bttv video_buf firmware_class ir_common compat_ioctl32 btcx_risc tveeprom videodev v4l2_common v4l1_compat radeon nbd eth1394 ohci1394 dm_crypt eeprom w83627hf hwmon_vid i2c_isa i2c_viapro snd_via82xx snd_mpu401_uart snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_device snd_timer snd_page_alloc snd_util_mem snd_hwdep snd soundcore
> [850973.869324] Pid: 31076, comm: gfs2_glockd Not tainted 2.6.20-rc4MAX_ARGS #4
> [850973.869327] RIP: 0010:[<ffffffff8816cabb>]  [<ffffffff8816cabb>] :gfs2:gfs2_glmutex_unlock+0x2b/0x40
> [850973.869355] RSP: 0018:ffff81001849be70  EFLAGS: 00010282
> [850973.869359] RAX: ffff810023ff4ee0 RBX: ffff810023ff4e68 RCX: ffffffff88185800
> [850973.869363] RDX: 0000000000000000 RSI: ffff810023ff4ec0 RDI: ffff810023ff4e68
> [850973.869366] RBP: ffff810023ff4f38 R08: 0000000000000000 R09: 0000000000006052
> [850973.869370] R10: 0000000000000000 R11: ffffffff8816de60 R12: ffff810023ff4e68
> [850973.869374] R13: ffff810023ff4eb0 R14: ffff81003ffd6850 R15: ffff81003ffd6870
> [850973.869378] FS:  00002aebf51826d0(0000) GS:ffffffff807fb000(0000) knlGS:00000000f72026c0
> [850973.869381] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [850973.869384] CR2: 00002b9e93097fe0 CR3: 0000000003a79000 CR4: 00000000000006e0
> [850973.869388] Process gfs2_glockd (pid: 31076, threadinfo ffff81001849a000, task ffff810000b82890)
> [850973.869390] Stack:  ffff810023ff4eb0 ffffffff8816cc08 ffff81001849beb0 ffff810024322000
> [850973.869397]  ffff8100243223b8 ffff8100074cf968 ffffffff88163510 ffffffff88163528
> [850973.869402]  0000000000000000 ffff810000b82890 ffffffff8029fe70 ffff81001849bec8
> [850973.869407] Call Trace:
> [850973.869421]  [<ffffffff8816cc08>] :gfs2:gfs2_reclaim_glock+0x138/0x180
> [850973.869434]  [<ffffffff88163510>] :gfs2:gfs2_glockd+0x0/0xf0
> [850973.869445]  [<ffffffff88163528>] :gfs2:gfs2_glockd+0x18/0xf0
> [850973.869453]  [<ffffffff8029fe70>] autoremove_wake_function+0x0/0x30
> [850973.869465]  [<ffffffff88163510>] :gfs2:gfs2_glockd+0x0/0xf0
> [850973.869471]  [<ffffffff80234d43>] kthread+0xd3/0x110
> [850973.869476]  [<ffffffff80229407>] schedule_tail+0x37/0xc0
> [850973.869481]  [<ffffffff8029fca0>] keventd_create_kthread+0x0/0xa0
> [850973.869485]  [<ffffffff80264618>] child_rip+0xa/0x12
> [850973.869490]  [<ffffffff8029fca0>] keventd_create_kthread+0x0/0xa0
> [850973.869497]  [<ffffffff80234c70>] kthread+0x0/0x110
> [850973.869501]  [<ffffffff8026460e>] child_rip+0x0/0x12
> [850973.869504] 
> [850973.869505] 
> [850973.869506] Code: 0f 0b 66 66 90 eb fe 66 66 66 90 66 66 66 90 66 66 90 66 66 
> [850973.869514] RIP  [<ffffffff8816cabb>] :gfs2:gfs2_glmutex_unlock+0x2b/0x40
> [850973.869528]  RSP <ffff81001849be70>
> [850973.869530]  <6>note: gfs2_glockd[31076] exited with preempt_count 1
> 
> 
> [850986.762341] ------------[ cut here ]------------
> [850986.762346] kernel BUG at fs/gfs2/glock.c:738!
> [850986.762349] invalid opcode: 0000 [2] PREEMPT 
> [850986.762351] CPU 0 
> [850986.762353] Modules linked in: lock_dlm dlm gfs2 scsi_tgt bttv video_buf firmware_class ir_common compat_ioctl32 btcx_risc tveeprom videodev v4l2_common v4l1_compat radeon nbd eth1394 ohci1394 dm_crypt eeprom w83627hf hwmon_vid i2c_isa i2c_viapro snd_via82xx snd_mpu401_uart snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_device snd_timer snd_page_alloc snd_util_mem snd_hwdep snd soundcore
> [850986.762376] Pid: 31075, comm: gfs2_scand Not tainted 2.6.20-rc4MAX_ARGS #4
> [850986.762379] RIP: 0010:[<ffffffff8816cabb>]  [<ffffffff8816cabb>] :gfs2:gfs2_glmutex_unlock+0x2b/0x40
> [850986.762405] RSP: 0000:ffff81001f221e80  EFLAGS: 00010286
> [850986.762408] RAX: ffff810023ff4940 RBX: ffff810023ff48c8 RCX: 0000000000000000
> [850986.762412] RDX: 0000000000000146 RSI: ffff810023ff4920 RDI: ffff810023ff48c8
> [850986.762416] RBP: ffff810024322000 R08: ffff81001f220000 R09: 00000000f6e88388
> [850986.762418] R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000000
> [850986.762422] R13: 0000000000000000 R14: ffffffff8816d2f0 R15: ffff81003ffd6870
> [850986.762426] FS:  00002b5212e53ae0(0000) GS:ffffffff807fb000(0000) knlGS:00000000f6e88bb0
> [850986.762429] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [850986.762432] CR2: 00000000f706f000 CR3: 000000002b96b000 CR4: 00000000000006e0
> [850986.762436] Process gfs2_scand (pid: 31075, threadinfo ffff81001f220000, task ffff810025448850)
> [850986.762438] Stack:  ffff810023ff48c8 ffffffff8816a86c 0000000000000147 ffff810024322000
> [850986.762445]  ffff8100074cf968 ffffffff88163600 ffff81003ffd6850 ffffffff8816ae54
> [850986.762450]  ffff81003ffd6850 000000000000000f ffff810024322000 ffffffff88163618
> [850986.762454] Call Trace:
> [850986.762469]  [<ffffffff8816a86c>] :gfs2:examine_bucket+0x8c/0x100
> [850986.762481]  [<ffffffff88163600>] :gfs2:gfs2_scand+0x0/0x70
> [850986.762494]  [<ffffffff8816ae54>] :gfs2:gfs2_scand_internal+0x24/0x40
> [850986.762506]  [<ffffffff88163618>] :gfs2:gfs2_scand+0x18/0x70
> [850986.762514]  [<ffffffff80234d43>] kthread+0xd3/0x110
> [850986.762519]  [<ffffffff80229407>] schedule_tail+0x37/0xc0
> [850986.762525]  [<ffffffff8029fca0>] keventd_create_kthread+0x0/0xa0
> [850986.762530]  [<ffffffff80264618>] child_rip+0xa/0x12
> [850986.762535]  [<ffffffff8029fca0>] keventd_create_kthread+0x0/0xa0
> [850986.762542]  [<ffffffff80234c70>] kthread+0x0/0x110
> [850986.762545]  [<ffffffff8026460e>] child_rip+0x0/0x12
> [850986.762548] 
> [850986.762549] 
> [850986.762550] Code: 0f 0b 66 66 90 eb fe 66 66 66 90 66 66 66 90 66 66 90 66 66 
> [850986.762559] RIP  [<ffffffff8816cabb>] :gfs2:gfs2_glmutex_unlock+0x2b/0x40
> [850986.762572]  RSP <ffff81001f221e80>
> [850986.762575]  <6>note: gfs2_scand[31075] exited with preempt_count 1
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
----------------------------------------------------------------------
- Rick Stevens, Principal Engineer          rstevens at vitalstream.com -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-            I'm afraid my karma just ran over your dogma            -
----------------------------------------------------------------------



From rpeterso at redhat.com  Thu Mar  8 18:29:22 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Thu, 08 Mar 2007 12:29:22 -0600
Subject: [Linux-cluster] How GULM works ?
In-Reply-To: <a6d13c780703080924u771c3ffax8fa12d59b49b44bd@mail.gmail.com>
References: <a6d13c780703080924u771c3ffax8fa12d59b49b44bd@mail.gmail.com>
Message-ID: <45F05602.3000803@redhat.com>

Filipe Miranda wrote:
> When using the DLM lock system, the CCSD pass on the structure
> information to the CMAN. CMAN will be responsible for the cluster
> membership, heartbeat, and cluster communication. All other layers above
>  relay on CMAN.
> 
> If I'm using GULM, CMAN is not installed, how the architecture works in
> this case?
Hi Filipe,

The Gulm locking protocol has its own cluster manager and locking layers.
The layers are all still there, but they're internal to Gulm.

For cman, it's broken out into pieces like cman, dlm, lock_dlm, lock_harness,
and so forth.

Regards,

Bob Peterson
Red Hat Cluster Suite



From rpeterso at redhat.com  Thu Mar  8 18:32:52 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Thu, 08 Mar 2007 12:32:52 -0600
Subject: [Linux-cluster] 2.6.20-rc4 gfs2 bug
In-Reply-To: <1173377271.30562.9.camel@prophead.corp.publichost.com>
References: <20070125050731.GA23270@chaos.ao.net>
	<1173377271.30562.9.camel@prophead.corp.publichost.com>
Message-ID: <45F056D4.9080301@redhat.com>

Rick Stevens wrote:
> On Thu, 2007-01-25 at 00:07 -0500, Dan Merillat wrote:
>> Running 2.6.20-rc4 _WITH_ the following patch: (Shouldn't be the issue,
>> but just in case, I'm listing it here)
> 
> Not adding to the thread here, but Dan, check the date on your machine.
> This just showed up in my mailbox (8 March) and its headers say it was
> sent on 25 January!

Hi Rick,

We discovered today that the linux-cluster mailing list had
some problems and some of the messages got inadvertently stuck.
When we fixed the problem, a few old messages came through that
had been stuck.

Bob Peterson
Red Hat Cluster Suite



From lshen at cisco.com  Thu Mar  8 18:40:15 2007
From: lshen at cisco.com (Lin Shen (lshen))
Date: Thu, 8 Mar 2007 10:40:15 -0800
Subject: [Linux-cluster] Changing journal size with GFS2 
Message-ID: <08A9A3213527A6428774900A80DBD8D80397DBA4@xmb-sjc-222.amer.cisco.com>

The FAQ says that with GFS2, it will be possible to add journals w/o
extending the file system. Does this require redo mkfs?

Also, will it be possible to change (both increase and decrease) journal
size w/o extending file system or redo mkfs? 

Lin  



From adas at redhat.com  Thu Mar  8 18:53:04 2007
From: adas at redhat.com (Abhijith Das)
Date: Thu, 08 Mar 2007 12:53:04 -0600
Subject: [Linux-cluster] Changing journal size with GFS2
In-Reply-To: <08A9A3213527A6428774900A80DBD8D80397DBA4@xmb-sjc-222.amer.cisco.com>
References: <08A9A3213527A6428774900A80DBD8D80397DBA4@xmb-sjc-222.amer.cisco.com>
Message-ID: <45F05B90.9070708@redhat.com>

Lin Shen (lshen) wrote:

>The FAQ says that with GFS2, it will be possible to add journals w/o
>extending the file system. Does this require redo mkfs?
>  
>
The gfs2_jadd tool allows you to add journals to an existing gfs2 
filesystem without having to redo mkfs.
 
--Abhi



From lshen at cisco.com  Thu Mar  8 19:13:54 2007
From: lshen at cisco.com (Lin Shen (lshen))
Date: Thu, 8 Mar 2007 11:13:54 -0800
Subject: [Linux-cluster] Changing journal size with GFS2
In-Reply-To: <45F05B90.9070708@redhat.com>
Message-ID: <08A9A3213527A6428774900A80DBD8D80397DBF3@xmb-sjc-222.amer.cisco.com>

Looks like gfs2_jadd also allows to add journals with different sizes.
Wonder if the size of the exiting journals can be changed somehow. 

Lin 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Abhijith Das
> Sent: Thursday, March 08, 2007 10:53 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] Changing journal size with GFS2
> 
> Lin Shen (lshen) wrote:
> 
> >The FAQ says that with GFS2, it will be possible to add journals w/o 
> >extending the file system. Does this require redo mkfs?
> >  
> >
> The gfs2_jadd tool allows you to add journals to an existing 
> gfs2 filesystem without having to redo mkfs.
>  
> --Abhi
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From lshen at cisco.com  Thu Mar  8 19:29:30 2007
From: lshen at cisco.com (Lin Shen (lshen))
Date: Thu, 8 Mar 2007 11:29:30 -0800
Subject: [Linux-cluster] Using GFS2 as a local file system
Message-ID: <08A9A3213527A6428774900A80DBD8D8039FF97C@xmb-sjc-222.amer.cisco.com>

We have a situation that we may need to use GFS2 to share storage in our
system in the future and to ease the pain of transition at that time
(convert files into GFS), we're thinking of using GFS2 just as a local
file system for now. 

How is GFS2 compared to other popular local file systems such as ext3
and Reiser in terms of performance, overhead etc? Are we hitting the
wrong direction totally by using GFS2 just as a local file system? 

BTW, we've run bonnie on local GFS2, and the performance is decent
compared to ext3 (90%).

Lin 



From srramasw at cisco.com  Thu Mar  8 19:53:40 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Thu, 8 Mar 2007 11:53:40 -0800
Subject: [Linux-cluster] GFS2 traceback related to DLM
In-Reply-To: <20070125050731.GA23270@chaos.ao.net>
Message-ID: <B14199FA0DBAAF4AA89E83EB41D35435032FE36C@xmb-sjc-22c.amer.cisco.com>

I tried GFS2 on two-node cluster using GNBD. 

cfs1 - gnbd exports an IDE parition. Mount gfs2 directly on that IDE
partition.
cfs5 - gnbd imports the IDE parition. Mount gfs2 on top of gnbd device.

I'm using,

RHEL4 distro
Linux kernel 2.6.20.1 (from kernel.org)
cluster-2.00.00 (from tarball)
udev-094
openais-0.80.2

Everything seems to be working fine. But when I mounted GFS2 on the 2nd
node on top of gnbd device, I got these dlm related tracebacks. Plus
dlm_recvd and dlm_sendd process are spinning cpu on both the boxes. Note
the mount itself succeeded and I can use the filesystem from both the
nodes. 

I know GFS2 is new, but anyone solution to this problem?

I need to mention, I also see bunch of udev daemon related failure
mesgs. I'm guessing it is due using it on old RHEL4 distribution? Not
sure if that contributed to this spinlock problem reported here.

Ultimately I want to run bonnie test on this configuration. But don't
what to do that until the basic sanity of this GFS2 configuration is
established.

thanks,
Sridhar

cfs1:

Mar  7 17:45:53 cfs1 kernel: BUG: spinlock already unlocked on CPU#1,
dlm_recoverd/11046
Mar  7 17:45:53 cfs1 kernel:  lock: cc6b68e4, .magic: dead4ead, .owner:
<none>/-1, .owner_cpu: -1
Mar  7 17:45:53 cfs1 kernel:  [<c01d3012>] _raw_spin_unlock+0x29/0x6b
Mar  7 17:45:53 cfs1 kernel:  [<e0969912>]
dlm_lowcomms_get_buffer+0x6c/0xe7 [dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e09661df>] create_rcom+0x2d/0xb3 [dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e09663b0>] dlm_rcom_status+0x5a/0x10b
[dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e096593e>] make_member_array+0x84/0x14c
[dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e0965a3d>] ping_members+0x37/0x6e [dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e0966e01>]
dlm_set_recover_status+0x14/0x24 [dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e0965bd8>]
dlm_recover_members+0x164/0x1a1 [dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e0967b4e>] ls_recover+0x67/0x2c6 [dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e0967e0a>] do_ls_recovery+0x5d/0x75
[dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e0967e22>] dlm_recoverd+0x0/0x74 [dlm]
Mar  7 17:45:53 cfs1 kernel:  [<e0967e7d>] dlm_recoverd+0x5b/0x74 [dlm]
Mar  7 17:45:53 cfs1 kernel:  [<c012e4aa>] kthread+0x72/0x96
Mar  7 17:45:53 cfs1 kernel:  [<c012e438>] kthread+0x0/0x96
Mar  7 17:45:53 cfs1 kernel:  [<c010405f>] kernel_thread_helper+0x7/0x10


cfs2:

Mar  7 17:43:28 cfs5 gnbd_monitor[10552]: gnbd_monitor started.
Monitoring device #0 
Mar  7 17:43:28 cfs5 gnbd_recvd[10555]: gnbd_recvd started 
Mar  7 17:43:28 cfs5 kernel: resending requests
Mar  7 17:43:33 cfs5 udevsend[10560]: starting udevd daemon
Mar  7 17:43:35 cfs5 udevsend[10560]: unable to connect to event daemon,
try to call udev directly
Mar  7 17:45:51 cfs5 kernel: GFS2: fsid=: Trying to join cluster
"lock_dlm", "ciscogfs2:hda9"
Mar  7 17:45:53 cfs5 udevsend[10598]: starting udevd daemon
Mar  7 17:45:53 cfs5 udevsend[10599]: starting udevd daemon
Mar  7 17:45:53 cfs5 udevsend[10610]: starting udevd daemon
Mar  7 17:45:53 cfs5 kernel: dlm: got connection from 1
Mar  7 17:45:53 cfs5 kernel: BUG: spinlock already unlocked on CPU#0,
dlm_recvd/10593
Mar  7 17:45:53 cfs5 kernel:  lock: c8d467e4, .magic: dead4ead, .owner:
<none>/-1, .owner_cpu: -1
Mar  7 17:45:53 cfs5 kernel:  [<c01d62d2>] _raw_spin_unlock+0x29/0x6b
Mar  7 17:45:53 cfs5 kernel:  [<e0c080a2>]
dlm_lowcomms_get_buffer+0x6c/0xe7 [dlm]
Mar  7 17:45:53 cfs5 kernel:  [<e0c041f3>] create_rcom+0x2d/0xb3 [dlm]
Mar  7 17:45:53 cfs5 kernel:  [<e0c044a4>] receive_rcom_status+0x2f/0x74
[dlm]
Mar  7 17:45:53 cfs5 kernel:  [<e0c02dd6>]
dlm_find_lockspace_global+0x3c/0x41 [dlm]
Mar  7 17:45:53 cfs5 kernel:  [<e0c04bca>] dlm_receive_rcom+0xc1/0x17f
[dlm]
Mar  7 17:45:53 cfs5 udevsend[10617]: starting udevd daemon
Mar  7 17:45:54 cfs5 udevsend[10626]: starting udevd daemon
Mar  7 17:45:54 cfs5 udevsend[10628]: starting udevd daemon
Mar  7 17:45:55 cfs5 udevsend[10598]: unable to connect to event daemon,
try to call udev directly
Mar  7 17:45:55 cfs5 udevsend[10599]: unable to connect to event daemon,
try to call udev directly
Mar  7 17:45:55 cfs5 udevsend[10610]: unable to connect to event daemon,
try to call udev directly
Mar  7 17:45:55 cfs5 kernel:  [<e0c04157>]
dlm_process_incoming_buffer+0x148/0x1ad [dlm]
Mar  7 17:45:59 cfs5 udevsend[10617]: unable to connect to event daemon,
try to call udev directly
Mar  7 17:46:00 cfs5 udevsend[10626]: unable to connect to event daemon,
try to call udev directly
Mar  7 17:46:02 cfs5 udevsend[10628]: unable to connect to event daemon,
try to call udev directly
Mar  7 17:46:05 cfs5 kernel:  [<c012e862>]
autoremove_wake_function+0x0/0x33
Mar  7 17:46:09 cfs5 kernel:  [<c0146e7d>] __alloc_pages+0x61/0x2ad
Mar  7 17:46:10 cfs5 kernel:  [<e0c0790a>] receive_from_sock+0x178/0x246
[dlm]
Mar  7 17:46:10 cfs5 kernel:  [<e0c08470>] process_sockets+0x55/0x90
[dlm]
Mar  7 17:46:11 cfs5 kernel:  [<e0c085c6>] dlm_recvd+0x0/0x69 [dlm]
Mar  7 17:46:11 cfs5 kernel:  [<e0c08620>] dlm_recvd+0x5a/0x69 [dlm]
Mar  7 17:46:12 cfs5 kernel:  [<c012e51a>] kthread+0x72/0x96
Mar  7 17:46:12 cfs5 kernel:  [<c012e4a8>] kthread+0x0/0x96
Mar  7 17:46:13 cfs5 kernel:  [<c010405f>] kernel_thread_helper+0x7/0x10
Mar  7 17:46:14 cfs5 kernel:  =======================
Mar  7 17:46:14 cfs5 kernel: dlm: hda9: recover 1
Mar  7 17:46:15 cfs5 kernel: dlm: hda9: add member 1
Mar  7 17:46:15 cfs5 kernel: dlm: hda9: add member 2
Mar  7 17:46:16 cfs5 kernel: dlm: hda9: total members 2 error 0
Mar  7 17:46:17 cfs5 kernel: dlm: hda9: dlm_recover_directory
Mar  7 17:46:18 cfs5 kernel: dlm: hda9: dlm_recover_directory 12 entries
Mar  7 17:46:19 cfs5 kernel: GFS2: fsid=ciscogfs2:hda9.1: Joined
cluster. Now mounting FS...
Mar  7 17:46:20 cfs5 kernel: dlm: hda9: recover 1 done: 348 ms
Mar  7 17:46:21 cfs5 kernel: GFS2: fsid=ciscogfs2:hda9.1: jid=1, already
locked for use
Mar  7 17:46:22 cfs5 kernel: GFS2: fsid=ciscogfs2:hda9.1: jid=1: Looking
at journal...
Mar  7 17:46:23 cfs5 kernel: GFS2: fsid=ciscogfs2:hda9.1: jid=1: Done

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dan Merillat
> Sent: Wednesday, January 24, 2007 9:08 PM
> To: linux-kernel at vger.kernel.org
> Cc: linux-cluster at redhat.com
> Subject: [Linux-cluster] 2.6.20-rc4 gfs2 bug
> 
> Running 2.6.20-rc4 _WITH_ the following patch: (Shouldn't be 
> the issue,
> but just in case, I'm listing it here)
> 



From dave at eons.com  Thu Mar  8 20:20:02 2007
From: dave at eons.com (Dave Berry)
Date: Thu, 08 Mar 2007 15:20:02 -0500
Subject: [Linux-cluster] Re: Failover not working
In-Reply-To: <45F036A4.1090106@eons.com>
References: <45F036A4.1090106@eons.com>
Message-ID: <45F06FF2.6020309@eons.com>

rgmanager-1.9.54-1

>  That shouldn't happen - what rgmanager RPM do you have?
>
> -- Lon





From teigland at redhat.com  Fri Mar  9 02:46:38 2007
From: teigland at redhat.com (David Teigland)
Date: Thu, 8 Mar 2007 20:46:38 -0600
Subject: [Linux-cluster] GFS2 traceback related to DLM
In-Reply-To: <B14199FA0DBAAF4AA89E83EB41D35435032FE36C@xmb-sjc-22c.amer.cisco.com>
References: <20070125050731.GA23270@chaos.ao.net>
	<B14199FA0DBAAF4AA89E83EB41D35435032FE36C@xmb-sjc-22c.amer.cisco.com>
Message-ID: <20070309024638.GA2954@redhat.com>

On Thu, Mar 08, 2007 at 11:53:40AM -0800, Sridhar Ramaswamy (srramasw) wrote:
> I'm using,
> 
> RHEL4 distro
> Linux kernel 2.6.20.1 (from kernel.org)
> cluster-2.00.00 (from tarball)
> udev-094
> openais-0.80.2
> 
> Everything seems to be working fine. But when I mounted GFS2 on the 2nd
> node on top of gnbd device, I got these dlm related tracebacks. Plus
> dlm_recvd and dlm_sendd process are spinning cpu on both the boxes. Note
> the mount itself succeeded and I can use the filesystem from both the
> nodes. 
> 
> I know GFS2 is new, but anyone solution to this problem?

Yes, get the latest dlm source (and gfs2 while you're at it) from a
2.6.21-rc kernel.  I usually do something like this as a shortcut:

  cp linux-2.6.21-rc/fs/dlm/* linux-2.6.20/fs/dlm/
  cp linux-2.6.21-rc/fs/gfs2/* linux-2.6.20/fs/gfs2/
  cp linux-2.6.21-rc/fs/gfs2/locking/dlm/* linux-2.6.20/fs/gfs2/locking/dlm/

No guarantees, but it often works and saves a bit of effort.
Dave



From wcheng at redhat.com  Fri Mar  9 05:35:34 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Fri, 09 Mar 2007 00:35:34 -0500
Subject: [Linux-cluster] Using GFS2 as a local file system
In-Reply-To: <08A9A3213527A6428774900A80DBD8D8039FF97C@xmb-sjc-222.amer.cisco.com>
References: <08A9A3213527A6428774900A80DBD8D8039FF97C@xmb-sjc-222.amer.cisco.com>
Message-ID: <45F0F226.2020601@redhat.com>

Lin Shen (lshen) wrote:

>We have a situation that we may need to use GFS2 to share storage in our
>system in the future and to ease the pain of transition at that time
>(convert files into GFS), we're thinking of using GFS2 just as a local
>file system for now. 
>
>How is GFS2 compared to other popular local file systems such as ext3
>and Reiser in terms of performance, overhead etc? Are we hitting the
>wrong direction totally by using GFS2 just as a local file system? 
>
>BTW, we've run bonnie on local GFS2, and the performance is decent
>compared to ext3 (90%).
>
>
>  
>
I personally think using GFS (both GFS1 and GFS2) as a local filesystem 
has many advantages. The only issue (I think ..haven't checked mkfs code 
in ages) is lock protocol is hard coded into on-disk super block during 
mkfs time - but fixing this should be trivial. If we allow 
interchangeable between lock_nolock and lock_dlm, then the filesystem 
should be able to migrate from single node into cluster environment. It 
is very nice (IMHO).

In the mean time, you can always run GFS(s) using lock_dlm with single 
node. There are lock overhead though.

I understand people may have different opinions about this and certainly 
don't have time to get into heated debating about this issue right now.

BTW, the team will spend this quarter to fine-tune GFS2. Would like to 
suggest people wait a little bit before putting GFS2 into a production 
environment.

-- Wendy





From swhiteho at redhat.com  Fri Mar  9 08:42:58 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Fri, 09 Mar 2007 08:42:58 +0000
Subject: [Linux-cluster] Using GFS2 as a local file system
In-Reply-To: <45F0F226.2020601@redhat.com>
References: <08A9A3213527A6428774900A80DBD8D8039FF97C@xmb-sjc-222.amer.cisco.com>
	<45F0F226.2020601@redhat.com>
Message-ID: <1173429778.32601.35.camel@quoit.chygwyn.com>

Hi,

On Fri, 2007-03-09 at 00:35 -0500, Wendy Cheng wrote:
> Lin Shen (lshen) wrote:
> 
> >We have a situation that we may need to use GFS2 to share storage in our
> >system in the future and to ease the pain of transition at that time
> >(convert files into GFS), we're thinking of using GFS2 just as a local
> >file system for now. 
> >
> >How is GFS2 compared to other popular local file systems such as ext3
> >and Reiser in terms of performance, overhead etc? Are we hitting the
> >wrong direction totally by using GFS2 just as a local file system? 
> >
> >BTW, we've run bonnie on local GFS2, and the performance is decent
> >compared to ext3 (90%).
> >
> >
> >  
> >
> I personally think using GFS (both GFS1 and GFS2) as a local filesystem 
> has many advantages. The only issue (I think ..haven't checked mkfs code 
> in ages) is lock protocol is hard coded into on-disk super block during 
> mkfs time - but fixing this should be trivial. If we allow 
> interchangeable between lock_nolock and lock_dlm, then the filesystem 
> should be able to migrate from single node into cluster environment. It 
> is very nice (IMHO).
> 
You can override the settings in the sb on the mount command line,

Steve.




From cluster at defuturo.co.uk  Thu Mar  8 18:13:45 2007
From: cluster at defuturo.co.uk (Robert Clark)
Date: Thu, 08 Mar 2007 18:13:45 +0000
Subject: [Linux-cluster] cmirror performance
Message-ID: <1173377626.2757.63.camel@rutabaga.defuturo.co.uk>

  I've been trying out cmirror for a few months on a RHEL4U4 cluster and
it's now working very well for me, although I've noticed that it does
have a bit of a performance hit.

  My set-up has a 32G GFS filesystem on a mirrored LV shared via AoE
(with jumbo frame support). Just using dd with a 4k blocksize to write
files on the same LV when it's mirrored and then unmirrored shows a big
difference in speed:

    Unmirrored: 12440kB/s
    Mirrored:    2969kB/s

which I wasn't expecting as my understanding is that the cmirror design
introduces very little overhead.

  The two legs of the mirror are on separate, identical AoE servers and
the filesystem is mounted on 3 out of 6 nodes in the cluster. This is
with the cmirror-kernel_2_6_9_19 tagged version and I've tried with both
core and disk logs.

  I suspect a bad interaction between cmirror and something else, but
I'm not sure where to start looking. Any ideas?

	Thanks,

		Robert



From rpeterso at redhat.com  Fri Mar  9 15:41:10 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Fri, 09 Mar 2007 09:41:10 -0600
Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer
In-Reply-To: <0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com>
References: <1638.1172372671@sss.pgh.pa.us> <45EF0ACA.1090900@redhat.com>
	<0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com>
Message-ID: <45F18016.3050603@redhat.com>

Hi Ivan,

Answers embedded below:

Ivan Zoratti wrote:
> Hi Robert,
> 
> First of all, thanks for your time, I really appreciate it.
> I'd like to reply to two separate topics here: first, the objective of 
> my question and second, the cluster-awareness of MySQL and the use of 
> GFS with MySQL.
> 
> My original question was mainly related to the use of Piranha to switch 
> over a service (ie, a specific mysql daemon) from one server to another, 
> in case of fault. There should be only one active service in the 
> cluster, therefore no concurrency or locking issues should happen.
> The ideal system should be able to:
> - have a list of services to launch on the cluster
> - identify the node in the cluster suitable to host the service (for 
> example the node with less workload)
> - check the availability of the service
> - stop the service on a node (if the service is not already down) and 
> start the service on another node in case of fault
> Fault tolerance in this case will be provided by the ability to switch 
> the service from one server to another in the cluster.
> Scalability is not provided within the service, ie the limitation in 
> resources for the service consist of the resources available on that 
> specific server.
> 
> I understand that your cluster suite can provide this functionality. I 
> am mainly looking for a supported set of features for an enterprise 
> organisation.

Red Hat's Cluster Suite does all of this with the rgmanager service
(not piranha).  I guess I'm not sure what you're asking here.  Are you
asking what features rgmanager has?  Its features are probably documented
somewhere, but I don't know where offhand.  I know it's quite
full-featured and allows you to do exactly what you listed:
provide High Availability (HA) of multiple services, stopping and
starting services throughout cluster, with different kinds of
dependencies.  The Cluster FAQ has information on rgmanager here
that you may find helpful:

http://sources.redhat.com/cluster/faq.html#rgm_what

If you have questions that aren't covered by the FAQ, let me know 
and I'll do my best to answer your questions.

> The second topic is related to the use of MySQL with clusters and 
> specifically with GFS. It is what we use to call MySQL in active-active 
> clustering. I am afraid your documentation is not totally accurate. 
> Unfortunately, information on the Internet (and also on our web site) 
> are often contradictory.
> It is indeed possible to run multiple mysqld services on different 
> cluster nodes, all sharing the same data structure on shared storage, 
> with this configuration:
> - Only the MyISAM storage engine can be used
> - Each mysqld service must start with the external-locking parameter on
> - Each mysqld service hase to have the query cache parameter off (other 
> cache mechanisms remain on, since they are automatically invalidated by 
> external locking)

Thanks for providing this information.  I'll get it into the cluster FAQ.
Maybe some day I'll find the time to play with this myself.

> I am afraid this configuration still does not compete against Oracle 
> RAC. MySQL does not provide a solution that can be compared 1:1 with 
> RAC. You may find some MySQL implementations much more effective than 
> RAC for certain environments, as you will certainly find RAC performing 
> better than MySQL on other implementations.
> 
> Based on the experience of the sales engineering team, customers have 
> never been disappointed by the technology that MySQL can provide as an 
> alternative to RAC. Decisions are based on many other factors, such as 
> the introduction of another (or a different) database, the cost of 
> migrating current applications and compatibility with third party 
> products. You can imagine we are working hard to remove these obstacles.
> 
> Thanks again for your help,
> 
> Kind Regards,
> 
> Ivan
> 
> -- 
>   Ivan Zoratti - Sales Engineering Manager EMEA
> 
>   MySQL AB - Windsor - UK
>   Mobile: +44 7866 363 180
> 
>   ivan at mysql.com
>   http://www.mysql.com

If you have other questions, please let me know.  You can either
email me directly or join the linux-cluster mailing list where 
you can talk to people are using these features and everyone can 
benefit from the discussion.

Regards,

Bob Peterson
Red Hat Cluster Suite



From wcheng at redhat.com  Fri Mar  9 16:15:29 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Fri, 09 Mar 2007 11:15:29 -0500
Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer
In-Reply-To: <45F18016.3050603@redhat.com>
References: <1638.1172372671@sss.pgh.pa.us>
	<45EF0ACA.1090900@redhat.com>	<0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com>
	<45F18016.3050603@redhat.com>
Message-ID: <45F18821.3040804@redhat.com>

Robert Peterson wrote:
> Ivan Zoratti wrote:
>>
>> My original question was mainly related to the use of Piranha to 
>> switch over a service (ie, a specific mysql daemon) from one server 
>> to another, in case of fault. There should be only one active service 
>> in the cluster, therefore no concurrency or locking issues should 
>> happen.
I assume this is a special daemon (say the one controls meta data) among 
many other mySQL daemons ?

>> The ideal system should be able to:
>> - have a list of services to launch on the cluster
>> - identify the node in the cluster suitable to host the service (for 
>> example the node with less workload)
The only load balancer we have (at this moment) indeed is piranha (LVS). 
However, using load balancer combining with GFS is tricky due to locking 
overhead (cluster locks are expensive). We do encourage individual file 
access to stay within one node for a proper length of time if all 
possible. Judging by your above statement, since switching (that 
particular ?) service only happens upon fault, this should be ok. 
Current versions of rgmanager and GFS out in the field do not have 
workload statistics - so knowing which node has less workload would be 
tricky (unless you put LVS as the front end). The newest version of 
cluster software using openais (Steve Dake, cc in this email, is the 
maintainer) that may have some features that can be used (but I'm not 
sure).

-- Wendy
>> - check the availability of the service
>> - stop the service on a node (if the service is not already down) and 
>> start the service on another node in case of fault
>> Fault tolerance in this case will be provided by the ability to 
>> switch the service from one server to another in the cluster.
>> Scalability is not provided within the service, ie the limitation in 
>> resources for the service consist of the resources available on that 
>> specific server.
>>
>> I understand that your cluster suite can provide this functionality. 
>> I am mainly looking for a supported set of features for an enterprise 
>> organisation.
>
> Red Hat's Cluster Suite does all of this with the rgmanager service
> (not piranha).  I guess I'm not sure what you're asking here.  Are you
> asking what features rgmanager has?  Its features are probably documented
> somewhere, but I don't know where offhand.  I know it's quite
> full-featured and allows you to do exactly what you listed:
> provide High Availability (HA) of multiple services, stopping and
> starting services throughout cluster, with different kinds of
> dependencies.  The Cluster FAQ has information on rgmanager here
> that you may find helpful:
>
> http://sources.redhat.com/cluster/faq.html#rgm_what
>
> If you have questions that aren't covered by the FAQ, let me know and 
> I'll do my best to answer your questions.
>
>> The second topic is related to the use of MySQL with clusters and 
>> specifically with GFS. It is what we use to call MySQL in 
>> active-active clustering. I am afraid your documentation is not 
>> totally accurate. Unfortunately, information on the Internet (and 
>> also on our web site) are often contradictory.
>> It is indeed possible to run multiple mysqld services on different 
>> cluster nodes, all sharing the same data structure on shared storage, 
>> with this configuration:
>> - Only the MyISAM storage engine can be used
>> - Each mysqld service must start with the external-locking parameter on
>> - Each mysqld service hase to have the query cache parameter off 
>> (other cache mechanisms remain on, since they are automatically 
>> invalidated by external locking)
>
> Thanks for providing this information.  I'll get it into the cluster FAQ.
> Maybe some day I'll find the time to play with this myself.
>
>> I am afraid this configuration still does not compete against Oracle 
>> RAC. MySQL does not provide a solution that can be compared 1:1 with 
>> RAC. You may find some MySQL implementations much more effective than 
>> RAC for certain environments, as you will certainly find RAC 
>> performing better than MySQL on other implementations.
>>
>> Based on the experience of the sales engineering team, customers have 
>> never been disappointed by the technology that MySQL can provide as 
>> an alternative to RAC. Decisions are based on many other factors, 
>> such as the introduction of another (or a different) database, the 
>> cost of migrating current applications and compatibility with third 
>> party products. You can imagine we are working hard to remove these 
>> obstacles.
>>
>> Thanks again for your help,
>>
>> Kind Regards,
>>
>> Ivan
>>
>> -- 
>>   Ivan Zoratti - Sales Engineering Manager EMEA
>>
>>   MySQL AB - Windsor - UK
>>   Mobile: +44 7866 363 180
>>
>>   ivan at mysql.com
>>   http://www.mysql.com
>
> If you have other questions, please let me know.  You can either
> email me directly or join the linux-cluster mailing list where you can 
> talk to people are using these features and everyone can benefit from 
> the discussion.
>
> Regards,
>
> Bob Peterson
> Red Hat Cluster Suite
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From jbrassow at redhat.com  Fri Mar  9 17:49:34 2007
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Fri, 9 Mar 2007 11:49:34 -0600
Subject: [Linux-cluster] cmirror performance
In-Reply-To: <1173377626.2757.63.camel@rutabaga.defuturo.co.uk>
References: <1173377626.2757.63.camel@rutabaga.defuturo.co.uk>
Message-ID: <e323d3e75f167abaeca1020e2269ea48@redhat.com>

Nope, the first version is just slow.  Next version should be coming 
with RHEL5.X (and should be going upstream), which should be faster.

I just wrote up a perl script (which I haven't had a chance to really 
clean-up yet) that will give performance numbers for various 
request/transfer sizes.  I'm including it at the end.

You must have lmbench package installed (for 'lmdd').  Then run:
# to give you read performance numbers
'perf_matrix.pl if=<block device>'

# to give you write performance numbers
'perf_matrix.pl of=<block device>'

# to do multiple runs and discard numbers outside the std deviation
# The more iterations you do, the more accurate your results
'perf_matrix.pl if=<block device> iter=5

For more information on the options, do 'perf_matrix.pl -h'.

Using the above, you can compare the numbers you're getting from the 
base device, linear target, mirror target, etc over a wide range of 
transfer/request sizes.

Let's take a look at a couple examples.  (Request sizes increase to the 
right by powers of two starting at 1kiB.  Transfer sizes increase by 
rows by powers of two starting at 1MiB.  Results are in MiB/sec):
prompt> perf_matrix.pl if=/dev/vg/linear iter=5 #linear reads, 5 
iterations w/ results averaged
  25.24 28.16 28.82 28.93 28.96 29.25 28.72 26.54 27.39 27.84 28.94 0.00 
0.00
  30.48 31.57 31.66 31.32 31.89 32.19 31.66 32.00 33.98 34.23 31.93 
33.30 0.00
  34.00 33.46 33.39 33.12 33.50 34.32 33.57 33.78 34.81 35.03 33.68 
34.25 34.68
  34.82 34.33 34.32 34.20 34.49 34.89 35.20 35.24 35.39 35.33 35.56 
34.94 35.18
  35.50 35.37 35.53 35.37 35.54 35.53 35.41 35.60 35.38 35.53 35.54 
35.45 35.33
  35.72 35.76 35.82 35.81 35.81 35.80 35.81 35.82 35.81 35.84 35.66 
35.78 35.76
  35.96 35.97 35.87 35.91 35.98 35.99 35.97 35.97 35.98 35.99 35.90 
35.96 35.95
  36.05 36.05 36.05 36.03 36.03 36.03 36.06 36.08 36.06 36.07 36.07 
36.06 36.06
  36.10 36.08 36.08 36.08 36.08 36.10 36.08 36.09 36.10 36.11 36.09 
36.10 36.11
  36.11 36.11 36.11 36.11 36.11 36.11 36.11 36.12 36.12 36.12 36.12 
36.12 36.12
  36.13 36.12 36.12 36.12 36.12 36.12 36.13 36.12 36.12 36.13 36.13 
36.13 36.13

prompt> perf_matrix.pl of=/dev/vg/linear iter=5 #linear writes, 5 
iterations w/ results averaged
  11.74 9.00 31.77 31.82 31.78 31.84 31.93 32.03 32.37 32.98 34.52 0.00 
0.00
  9.14 9.65 33.57 33.65 33.64 33.65 33.70 33.79 33.99 34.33 35.12 33.36 
0.00
  9.63 9.70 33.03 33.01 34.65 34.65 34.67 33.09 33.16 33.35 33.70 32.88 
34.42
  9.60 9.66 33.30 32.35 33.47 33.49 33.49 32.73 33.36 33.65 33.84 33.41 
33.37
  9.68 9.74 33.31 33.36 32.90 32.94 32.94 33.21 33.08 32.99 33.16 33.33 
32.59
  9.66 9.74 32.88 33.14 33.47 33.38 33.20 33.60 33.18 33.35 33.15 33.10 
33.22
  9.68 9.73 32.66 32.73 33.30 33.39 33.22 33.18 33.23 32.97 33.01 33.10 
33.13
  9.69 9.74 33.06 33.28 33.37 33.45 33.32 33.53 33.27 33.34 33.16 33.05 
33.08
  9.59 9.66 31.88 32.34 32.14 32.41 33.21 32.49 32.41 32.47 32.39 32.69 
32.05
  9.47 9.58 32.87 32.79 32.80 32.84 33.09 32.96 32.99 32.95 32.65 32.59 
32.83
  9.45 9.52 33.35 33.10 33.17 33.12 33.05 33.12 33.97 33.14 32.72 33.07 
33.24

# if I redirect the above output to files, I can then diff them
prompt> perf_matrix.pl diff clinear-read.txt clinear-write.txt
  -53.49% -68.04% 10.24% 9.99% 9.74% 8.85% 11.18% 20.69% 18.18% 18.46% 
19.28% -.--% -.--%
  -70.01% -69.43% 6.03% 7.44% 5.49% 4.54% 6.44% 5.59% 0.03% 0.29% 9.99% 
0.18% -.--%
  -71.68% -71.01% -1.08% -0.33% 3.43% 0.96% 3.28% -2.04% -4.74% -4.80% 
0.06% -4.00% -0.75%
  -72.43% -71.86% -2.97% -5.41% -2.96% -4.01% -4.86% -7.12% -5.74% 
-4.76% -4.84% -4.38% -5.14%
  -72.73% -72.46% -6.25% -5.68% -7.43% -7.29% -6.98% -6.71% -6.50% 
-7.15% -6.70% -5.98% -7.76%
  -72.96% -72.76% -8.21% -7.46% -6.53% -6.76% -7.29% -6.20% -7.34% 
-6.95% -7.04% -7.49% -7.10%
  -73.08% -72.95% -8.95% -8.86% -7.45% -7.22% -7.65% -7.76% -7.64% 
-8.39% -8.05% -7.95% -7.84%
  -73.12% -72.98% -8.29% -7.63% -7.38% -7.16% -7.60% -7.07% -7.74% 
-7.57% -8.07% -8.35% -8.26%
  -73.43% -73.23% -11.64% -10.37% -10.92% -10.22% -7.95% -9.98% -10.22% 
-10.08% -10.25% -9.45% -11.24%
  -73.77% -73.47% -8.97% -9.19% -9.17% -9.06% -8.36% -8.75% -8.67% 
-8.78% -9.61% -9.77% -9.11%
  -73.84% -73.64% -7.67% -8.36% -8.17% -8.31% -8.52% -8.31% -5.95% 
-8.28% -9.44% -8.47% -8.00%

I can see that writes for a linear device are much worse when request 
sizes are small, but get reasonably close when request sizes are >= 
4kiB.

I haven't had a chance to do this with (cluster) mirrors yet.  It would 
be interesting to see the difference in performance from linear -> 
mirror and mirror -> cmirror...

Once things are truly stable, I will concentrate more on performance.  
(Also note:  While a mirroring is sync'ing itself, performance for 
nominal operations will be degraded.)

  brassow

-------------- next part --------------
A non-text attachment was scrubbed...
Name: perf_matrix.pl
Type: application/octet-stream
Size: 7099 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070309/178c246a/attachment.obj>
-------------- next part --------------



On Mar 8, 2007, at 12:13 PM, Robert Clark wrote:

>   I've been trying out cmirror for a few months on a RHEL4U4 cluster 
> and
> it's now working very well for me, although I've noticed that it does
> have a bit of a performance hit.
>
>   My set-up has a 32G GFS filesystem on a mirrored LV shared via AoE
> (with jumbo frame support). Just using dd with a 4k blocksize to write
> files on the same LV when it's mirrored and then unmirrored shows a big
> difference in speed:
>
>     Unmirrored: 12440kB/s
>     Mirrored:    2969kB/s
>
> which I wasn't expecting as my understanding is that the cmirror design
> introduces very little overhead.
>
>   The two legs of the mirror are on separate, identical AoE servers and
> the filesystem is mounted on 3 out of 6 nodes in the cluster. This is
> with the cmirror-kernel_2_6_9_19 tagged version and I've tried with 
> both
> core and disk logs.
>
>   I suspect a bad interaction between cmirror and something else, but
> I'm not sure where to start looking. Any ideas?
>
> 	Thanks,
>
> 		Robert
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

From sdake at redhat.com  Fri Mar  9 19:41:54 2007
From: sdake at redhat.com (Steven Dake)
Date: Fri, 09 Mar 2007 12:41:54 -0700
Subject: [Linux-cluster] Re: [Openais] Xen and Cluster Manager
	(OpenAIS::TOTEM)
In-Reply-To: <45F1739E.6040403@startx.fr>
References: <45F1739E.6040403@startx.fr>
Message-ID: <1173469314.18932.1.camel@shih.broked.org>

On Fri, 2007-03-09 at 15:47 +0100, Fabien MALFOY wrote:
> Hi
> 
> I'm trying to build a two nodes cluster with Cluster Manager on Fedora 
> Core 6. As I dispose of only one machine and that my solution needs 
> three, I decided to create two XEN virtual Fedora Core 6 (M1 and M2). I 
> use the Dom0 (M0) host as the storage back-end. All used packages are 
> those of standard repositories of Fedora Core 6 (including Xen, Cman, 
> OpenAIS...). So, the cluster will be made of the two DomU.
> 
> But i'm not able to start correctly the Cman service on M1 and M2. 
> Here's a part of the log :
> 
> Mar 7 15:29:17 M1 openais[2220]: [CMAN ] CMAN 2.0.60 (built Jan 24 2007 
> 15:30:39) started
> Mar 7 15:29:17 M1 openais[2220]: [SYNC ] Not using a virtual synchrony 
> filter.
> Mar 7 15:29:17 M1 openais[2220]: [MAIN ] AIS Executive Service: started 
> and ready to provide service.
> Mar 7 15:29:18 M1 ccsd[2214]: Initial status:: Inquorate
> *Mar 7 15:29:32 M1 openais[2220]: [TOTEM] The consensus timeout expired.
> Mar 7 15:29:32 M1 openais[2220]: [TOTEM] entering GATHER state from 3.
> Mar 7 15:29:47 M1 openais[2220]: [TOTEM] The consensus timeout expired.
> Mar 7 15:29:47 M1 openais[2220]: [TOTEM] entering GATHER state from 3.
> Mar 7 15:30:02 M1 openais[2220]: [TOTEM] The consensus timeout expired.
> Mar 7 15:30:02 M1 openais[2220]: [TOTEM] entering GATHER state from 3.
> 
> *This till the timeout and cman fails to start. To ensure that my 
> configuration was well formed, I adapted my cluster.conf to exclude M2, 
> replaced by M0. In this case, the cman service starts correctly on M0 
> and warns about the absence of M1. On M1, the service fails to start 
> with the same log. Is there a matter between Xen and Cluster Manager ?
> Thanks for your help.
> 

Fabien,

I believe the problem you have is that your default firewall rules setup
by fedora core 6 are not allowing the openais protocol to reach
consensus.  To start off with, try turning off your firewall.

Regards
-steve



From jeff3140 at gmail.com  Fri Mar  9 21:47:16 2007
From: jeff3140 at gmail.com (Jeff)
Date: Fri, 9 Mar 2007 16:47:16 -0500
Subject: [Linux-cluster] DLM internals
In-Reply-To: <20070215151426.GA18284@redhat.com>
References: <ce31d420702142150t48cf1f89h68b16b1f3116a066@mail.gmail.com>
	<20070215151426.GA18284@redhat.com>
Message-ID: <bf6826c60703091347ld14981u95369322caa0a7c1@mail.gmail.com>

On 2/15/07, David Teigland <teigland at redhat.com> wrote:
> This is an excellent description of a dlm and the general ideas/logic
> reflect very well our own dlm:
>
> http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
>
> Dave

There is a document in a DLM Source tree,
    http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm/doc/?cvsroot=cluster
which briefly describes the DLM API. Is there anything else
available? In particular I was looking for how VALNOTVALID errors
are handled with a node crashes with an CR/EX lock and it had
read the lock value block.



From jbrassow at redhat.com  Fri Mar  9 21:49:35 2007
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Fri, 9 Mar 2007 15:49:35 -0600
Subject: [Linux-cluster] cmirror performance
In-Reply-To: <e323d3e75f167abaeca1020e2269ea48@redhat.com>
References: <1173377626.2757.63.camel@rutabaga.defuturo.co.uk>
	<e323d3e75f167abaeca1020e2269ea48@redhat.com>
Message-ID: <e73daefcfdcff2e601a225933ed25b93@redhat.com>

clean-up of previously posted script (plus colorized diff output for 
easier reading).

  brassow

-------------- next part --------------
A non-text attachment was scrubbed...
Name: perf_matrix.pl
Type: application/octet-stream
Size: 7334 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070309/6b3f5bfa/attachment.obj>
-------------- next part --------------


On Mar 9, 2007, at 11:49 AM, Jonathan E Brassow wrote:

> Nope, the first version is just slow.  Next version should be coming 
> with RHEL5.X (and should be going upstream), which should be faster.
>
> I just wrote up a perl script (which I haven't had a chance to really 
> clean-up yet) that will give performance numbers for various 
> request/transfer sizes.  I'm including it at the end.
>
> You must have lmbench package installed (for 'lmdd').  Then run:
> # to give you read performance numbers
> 'perf_matrix.pl if=<block device>'
>
> # to give you write performance numbers
> 'perf_matrix.pl of=<block device>'
>
> # to do multiple runs and discard numbers outside the std deviation
> # The more iterations you do, the more accurate your results
> 'perf_matrix.pl if=<block device> iter=5
>
> For more information on the options, do 'perf_matrix.pl -h'.
>
> Using the above, you can compare the numbers you're getting from the 
> base device, linear target, mirror target, etc over a wide range of 
> transfer/request sizes.
>
> Let's take a look at a couple examples.  (Request sizes increase to 
> the right by powers of two starting at 1kiB.  Transfer sizes increase 
> by rows by powers of two starting at 1MiB.  Results are in MiB/sec):
> prompt> perf_matrix.pl if=/dev/vg/linear iter=5 #linear reads, 5 
> iterations w/ results averaged
>  25.24 28.16 28.82 28.93 28.96 29.25 28.72 26.54 27.39 27.84 28.94 
> 0.00 0.00
>  30.48 31.57 31.66 31.32 31.89 32.19 31.66 32.00 33.98 34.23 31.93 
> 33.30 0.00
>  34.00 33.46 33.39 33.12 33.50 34.32 33.57 33.78 34.81 35.03 33.68 
> 34.25 34.68
>  34.82 34.33 34.32 34.20 34.49 34.89 35.20 35.24 35.39 35.33 35.56 
> 34.94 35.18
>  35.50 35.37 35.53 35.37 35.54 35.53 35.41 35.60 35.38 35.53 35.54 
> 35.45 35.33
>  35.72 35.76 35.82 35.81 35.81 35.80 35.81 35.82 35.81 35.84 35.66 
> 35.78 35.76
>  35.96 35.97 35.87 35.91 35.98 35.99 35.97 35.97 35.98 35.99 35.90 
> 35.96 35.95
>  36.05 36.05 36.05 36.03 36.03 36.03 36.06 36.08 36.06 36.07 36.07 
> 36.06 36.06
>  36.10 36.08 36.08 36.08 36.08 36.10 36.08 36.09 36.10 36.11 36.09 
> 36.10 36.11
>  36.11 36.11 36.11 36.11 36.11 36.11 36.11 36.12 36.12 36.12 36.12 
> 36.12 36.12
>  36.13 36.12 36.12 36.12 36.12 36.12 36.13 36.12 36.12 36.13 36.13 
> 36.13 36.13
>
> prompt> perf_matrix.pl of=/dev/vg/linear iter=5 #linear writes, 5 
> iterations w/ results averaged
>  11.74 9.00 31.77 31.82 31.78 31.84 31.93 32.03 32.37 32.98 34.52 0.00 
> 0.00
>  9.14 9.65 33.57 33.65 33.64 33.65 33.70 33.79 33.99 34.33 35.12 33.36 
> 0.00
>  9.63 9.70 33.03 33.01 34.65 34.65 34.67 33.09 33.16 33.35 33.70 32.88 
> 34.42
>  9.60 9.66 33.30 32.35 33.47 33.49 33.49 32.73 33.36 33.65 33.84 33.41 
> 33.37
>  9.68 9.74 33.31 33.36 32.90 32.94 32.94 33.21 33.08 32.99 33.16 33.33 
> 32.59
>  9.66 9.74 32.88 33.14 33.47 33.38 33.20 33.60 33.18 33.35 33.15 33.10 
> 33.22
>  9.68 9.73 32.66 32.73 33.30 33.39 33.22 33.18 33.23 32.97 33.01 33.10 
> 33.13
>  9.69 9.74 33.06 33.28 33.37 33.45 33.32 33.53 33.27 33.34 33.16 33.05 
> 33.08
>  9.59 9.66 31.88 32.34 32.14 32.41 33.21 32.49 32.41 32.47 32.39 32.69 
> 32.05
>  9.47 9.58 32.87 32.79 32.80 32.84 33.09 32.96 32.99 32.95 32.65 32.59 
> 32.83
>  9.45 9.52 33.35 33.10 33.17 33.12 33.05 33.12 33.97 33.14 32.72 33.07 
> 33.24
>
> # if I redirect the above output to files, I can then diff them
> prompt> perf_matrix.pl diff clinear-read.txt clinear-write.txt
>  -53.49% -68.04% 10.24% 9.99% 9.74% 8.85% 11.18% 20.69% 18.18% 18.46% 
> 19.28% -.--% -.--%
>  -70.01% -69.43% 6.03% 7.44% 5.49% 4.54% 6.44% 5.59% 0.03% 0.29% 9.99% 
> 0.18% -.--%
>  -71.68% -71.01% -1.08% -0.33% 3.43% 0.96% 3.28% -2.04% -4.74% -4.80% 
> 0.06% -4.00% -0.75%
>  -72.43% -71.86% -2.97% -5.41% -2.96% -4.01% -4.86% -7.12% -5.74% 
> -4.76% -4.84% -4.38% -5.14%
>  -72.73% -72.46% -6.25% -5.68% -7.43% -7.29% -6.98% -6.71% -6.50% 
> -7.15% -6.70% -5.98% -7.76%
>  -72.96% -72.76% -8.21% -7.46% -6.53% -6.76% -7.29% -6.20% -7.34% 
> -6.95% -7.04% -7.49% -7.10%
>  -73.08% -72.95% -8.95% -8.86% -7.45% -7.22% -7.65% -7.76% -7.64% 
> -8.39% -8.05% -7.95% -7.84%
>  -73.12% -72.98% -8.29% -7.63% -7.38% -7.16% -7.60% -7.07% -7.74% 
> -7.57% -8.07% -8.35% -8.26%
>  -73.43% -73.23% -11.64% -10.37% -10.92% -10.22% -7.95% -9.98% -10.22% 
> -10.08% -10.25% -9.45% -11.24%
>  -73.77% -73.47% -8.97% -9.19% -9.17% -9.06% -8.36% -8.75% -8.67% 
> -8.78% -9.61% -9.77% -9.11%
>  -73.84% -73.64% -7.67% -8.36% -8.17% -8.31% -8.52% -8.31% -5.95% 
> -8.28% -9.44% -8.47% -8.00%
>
> I can see that writes for a linear device are much worse when request 
> sizes are small, but get reasonably close when request sizes are >= 
> 4kiB.
>
> I haven't had a chance to do this with (cluster) mirrors yet.  It 
> would be interesting to see the difference in performance from linear 
> -> mirror and mirror -> cmirror...
>
> Once things are truly stable, I will concentrate more on performance.  
> (Also note:  While a mirroring is sync'ing itself, performance for 
> nominal operations will be degraded.)
>
>  brassow
>
> <perf_matrix.pl>
>
>
> On Mar 8, 2007, at 12:13 PM, Robert Clark wrote:
>
>>   I've been trying out cmirror for a few months on a RHEL4U4 cluster 
>> and
>> it's now working very well for me, although I've noticed that it does
>> have a bit of a performance hit.
>>
>>   My set-up has a 32G GFS filesystem on a mirrored LV shared via AoE
>> (with jumbo frame support). Just using dd with a 4k blocksize to write
>> files on the same LV when it's mirrored and then unmirrored shows a 
>> big
>> difference in speed:
>>
>>     Unmirrored: 12440kB/s
>>     Mirrored:    2969kB/s
>>
>> which I wasn't expecting as my understanding is that the cmirror 
>> design
>> introduces very little overhead.
>>
>>   The two legs of the mirror are on separate, identical AoE servers 
>> and
>> the filesystem is mounted on 3 out of 6 nodes in the cluster. This is
>> with the cmirror-kernel_2_6_9_19 tagged version and I've tried with 
>> both
>> core and disk logs.
>>
>>   I suspect a bad interaction between cmirror and something else, but
>> I'm not sure where to start looking. Any ideas?
>>
>> 	Thanks,
>>
>> 		Robert
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From ivan at mysql.com  Fri Mar  9 20:18:34 2007
From: ivan at mysql.com (Ivan Zoratti)
Date: Fri, 9 Mar 2007 20:18:34 +0000
Subject: [Linux-cluster] FWD: Question on RH Cluster from a MySQL Customer
In-Reply-To: <45F18016.3050603@redhat.com>
References: <1638.1172372671@sss.pgh.pa.us> <45EF0ACA.1090900@redhat.com>
	<0DEA37F7-2A38-4F33-9857-01FD717DADCC@mysql.com>
	<45F18016.3050603@redhat.com>
Message-ID: <CCC26F8A-11BB-4EC9-B788-B9FFAB992AF4@mysql.com>

Hi Robert,

This is great news, thanks a lot. You (and Wendy in another email)  
have answered my questions, now I will start digging into more  
details through the documentation.

Should you need any information regarding MySQL, I would be glad to  
help.

Thanks again,

Kind Regards,

Ivan


--
   Ivan Zoratti - Sales Engineering Manager EMEA

   MySQL AB - Windsor - UK
   Mobile: +44 7866 363 180

   ivan at mysql.com
   http://www.mysql.com
--


On 9 Mar 2007, at 15:41, Robert Peterson wrote:

> Hi Ivan,
>
> Answers embedded below:
>
> Ivan Zoratti wrote:
>> Hi Robert,
>> First of all, thanks for your time, I really appreciate it.
>> I'd like to reply to two separate topics here: first, the  
>> objective of my question and second, the cluster-awareness of  
>> MySQL and the use of GFS with MySQL.
>> My original question was mainly related to the use of Piranha to  
>> switch over a service (ie, a specific mysql daemon) from one  
>> server to another, in case of fault. There should be only one  
>> active service in the cluster, therefore no concurrency or locking  
>> issues should happen.
>> The ideal system should be able to:
>> - have a list of services to launch on the cluster
>> - identify the node in the cluster suitable to host the service  
>> (for example the node with less workload)
>> - check the availability of the service
>> - stop the service on a node (if the service is not already down)  
>> and start the service on another node in case of fault
>> Fault tolerance in this case will be provided by the ability to  
>> switch the service from one server to another in the cluster.
>> Scalability is not provided within the service, ie the limitation  
>> in resources for the service consist of the resources available on  
>> that specific server.
>> I understand that your cluster suite can provide this  
>> functionality. I am mainly looking for a supported set of features  
>> for an enterprise organisation.
>
> Red Hat's Cluster Suite does all of this with the rgmanager service
> (not piranha).  I guess I'm not sure what you're asking here.  Are you
> asking what features rgmanager has?  Its features are probably  
> documented
> somewhere, but I don't know where offhand.  I know it's quite
> full-featured and allows you to do exactly what you listed:
> provide High Availability (HA) of multiple services, stopping and
> starting services throughout cluster, with different kinds of
> dependencies.  The Cluster FAQ has information on rgmanager here
> that you may find helpful:
>
> http://sources.redhat.com/cluster/faq.html#rgm_what
>
> If you have questions that aren't covered by the FAQ, let me know  
> and I'll do my best to answer your questions.
>
>> The second topic is related to the use of MySQL with clusters and  
>> specifically with GFS. It is what we use to call MySQL in active- 
>> active clustering. I am afraid your documentation is not totally  
>> accurate. Unfortunately, information on the Internet (and also on  
>> our web site) are often contradictory.
>> It is indeed possible to run multiple mysqld services on different  
>> cluster nodes, all sharing the same data structure on shared  
>> storage, with this configuration:
>> - Only the MyISAM storage engine can be used
>> - Each mysqld service must start with the external-locking  
>> parameter on
>> - Each mysqld service hase to have the query cache parameter off  
>> (other cache mechanisms remain on, since they are automatically  
>> invalidated by external locking)
>
> Thanks for providing this information.  I'll get it into the  
> cluster FAQ.
> Maybe some day I'll find the time to play with this myself.
>
>> I am afraid this configuration still does not compete against  
>> Oracle RAC. MySQL does not provide a solution that can be compared  
>> 1:1 with RAC. You may find some MySQL implementations much more  
>> effective than RAC for certain environments, as you will certainly  
>> find RAC performing better than MySQL on other implementations.
>> Based on the experience of the sales engineering team, customers  
>> have never been disappointed by the technology that MySQL can  
>> provide as an alternative to RAC. Decisions are based on many  
>> other factors, such as the introduction of another (or a  
>> different) database, the cost of migrating current applications  
>> and compatibility with third party products. You can imagine we  
>> are working hard to remove these obstacles.
>> Thanks again for your help,
>> Kind Regards,
>> Ivan
>> -- 
>>   Ivan Zoratti - Sales Engineering Manager EMEA
>>   MySQL AB - Windsor - UK
>>   Mobile: +44 7866 363 180
>>   ivan at mysql.com
>>   http://www.mysql.com
>
> If you have other questions, please let me know.  You can either
> email me directly or join the linux-cluster mailing list where you  
> can talk to people are using these features and everyone can  
> benefit from the discussion.
>
> Regards,
>
> Bob Peterson
> Red Hat Cluster Suite



From sail at serverengines.com  Fri Mar  9 23:30:15 2007
From: sail at serverengines.com (Sai Loganathan)
Date: Fri, 9 Mar 2007 15:30:15 -0800
Subject: [Linux-cluster] cluster not doing failover
Message-ID: <012d01c762a2$e439dd90$2702140a@se19261f2cf9ed>

 

Hello,

I am setting up a 2 node redhat cluster to test failover as part of 

testing effort in my company for the iscsi product we develop.

I have a iscsi target which is my cluster shared storage.

Downloaded and compiled the open source redhat cluster and installed 

the cluster components in both the nodes.

Logged-into the iscsi target, created a gfs filesystem and mounted the 

lun on both the nodes.

Created the cluster.conf using system-config-cluster gui and below that 

cluster.conf

 

<?xml version="1.0"?>

<cluster config_version="8" name="alpha_cluster">

            <fence_daemon post_fail_delay="0" post_join_delay="3"/>

            <clusternodes>

                        <clusternode name="node1" votes="1">

                                    <fence>

                                                <method name="1">

                                                            <device
name="node1_fence"/>

                                                </method>

                                    </fence>

                        </clusternode>

                        <clusternode name="node2" votes="1">

                                    <fence>

                                                <method name="1">

                                                            <device
name="node2_fence"/>

                                                </method>

                                    </fence>

                        </clusternode>

            </clusternodes>

            <cman expected_votes="1" two_node="1"/>

            <fencedevices>

                        <fencedevice agent="fence_ilo" hostname="admin" 

login="admin" name="node1_fence" passwd="admin"/>

                        <fencedevice agent="fence_ilo" hostname="admin" 

login="admin" name="node2_fence" passwd="admin"/>

            </fencedevices>

            <rm>

                        <failoverdomains>

                                    <failoverdomain name="failover"
ordered="1" 

restricted="0">

                                                <failoverdomainnode
name="node1" 

priority="1"/>

                                                <failoverdomainnode
name="node2" 

priority="2"/>

                                    </failoverdomain>

                        </failoverdomains>

                        <resources>

                                    <ip address="172.40.2.119"
monitor_link="1"/>

                                    <clusterfs device="/dev/mapper/vg0-lv" 

force_unmount="0" fsid="10938" fstype="gfs" mountpoint="/test1" 

name="lun" options=""/>

                        </resources>

                        <service autostart="1" domain="failover" 

name="iscsi_ip">

                                    <ip ref="172.40.2.119"/>

                        </service>

                        <service autostart="1" domain="failover" 

name="iscsi_lun">

                                    <clusterfs ref="lun"/>

                        </service>

            </rm>

</cluster>

 

Using the cluster ip address (172.40.2.119), I was able to do an nfs 

mount of the shared lun from a 3rd machine. Started an infinite ls on 

that lun. 

To simulate failover, I just powered-down the node1 and hoping to see 

the read io stop but resume via the node2. But, I see the following 

error message on the node 2.

Mar  9 12:14:49 node2 fenced[7422]: fence "node1" failed

Mar  9 12:14:54 node2 fenced[7422]: fencing node "node1"

Mar  9 12:14:54 node2 fenced[7422]: agent "fence_ilo" reports: Can't 

call method "configure" on an undefined value at /sbin/fence_ilo line 

169, <> line 4.

Mar  9 12:14:54 node2 fenced[7422]: fence "node1" failed

Mar  9 12:14:59 node2 fenced[7422]: fencing node "node1"

Mar  9 12:14:59 node2 fenced[7422]: agent "fence_ilo" reports: Can't 

call method "configure" on an undefined value at /sbin/fence_ilo line 

169, <> line 4.

 

Seems like I am not doing something correct with respect to fencing.

Can I setup cluster without fencing first of all?

I don't have any of the fencing power devices. In that case, how do I 

do fencing?

 

Any help would be greatly appreciated.

 

Thanks,

Sai Logan




_________________________________________________________________________________________________________________
This message and any attachment are confidential and may be privileged or otherwise protected from disclosure. If you are not the intended 
recipient please telephone or e-mail the sender and delete this message and all attachments from your system - ServerEngines  LLC
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070309/d60026b1/attachment.htm>

From jbrassow at redhat.com  Sat Mar 10 01:53:40 2007
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Fri, 9 Mar 2007 19:53:40 -0600
Subject: [Linux-cluster] cluster not doing failover
In-Reply-To: <012d01c762a2$e439dd90$2702140a@se19261f2cf9ed>
References: <012d01c762a2$e439dd90$2702140a@se19261f2cf9ed>
Message-ID: <40407159e8e6506b05d46c82d921d936@redhat.com>


On Mar 9, 2007, at 5:30 PM, Sai Loganathan wrote:
> ??????????? <fencedevices>
> ??????????????????????? <fencedevice agent="fence_ilo" hostname="admin"
> login="admin" name="node1_fence" passwd="admin"/>
> ??????????????????????? <fencedevice agent="fence_ilo" hostname="admin"
> login="admin" name="node2_fence" passwd="admin"/>
> ??????????? </fencedevices>

The above line look funny to me.  The hostname for the fence device is 
"admin"?

> Using the cluster ip address (172.40.2.119), I was able to do an nfs
> mount of the shared lun from a 3rd machine. Started an infinite ls on
> that lun.
> To simulate failover, I just powered-down the node1 and hoping to see
> the read io stop but resume via the node2. But, I see the following
> error message on the node 2.
> Mar? 9 12:14:49 node2 fenced[7422]: fence "node1" failed
> Mar? 9 12:14:54 node2 fenced[7422]: fencing node "node1"
> Mar? 9 12:14:54 node2 fenced[7422]: agent "fence_ilo" reports: Can't
> call method "configure" on an undefined value at /sbin/fence_ilo line
> 169, <> line 4.
> Mar? 9 12:14:54 node2 fenced[7422]: fence "node1" failed
> Mar? 9 12:14:59 node2 fenced[7422]: fencing node "node1"
> Mar? 9 12:14:59 node2 fenced[7422]: agent "fence_ilo" reports: Can't
> call method "configure" on an undefined value at /sbin/fence_ilo line
> 169, <> line 4.
> ?
> Seems like I am not doing something correct with respect to fencing.
> Can I setup cluster without fencing first of all?

Yes.  You can use manual fencing.  That should only be used for testing 
purposes though... it is not a supported configuration.


  brassow
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 3471 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070309/c015b8da/attachment.bin>

From mallah.rajesh at gmail.com  Sat Mar 10 18:31:21 2007
From: mallah.rajesh at gmail.com (Rajesh Kumar Mallah)
Date: Sun, 11 Mar 2007 00:01:21 +0530
Subject: [Linux-cluster] problem compiling redhat cluster2 with 2.6.9
Message-ID: <a97c77030703101031j3d6bbb32me18a015831268e00@mail.gmail.com>

Hi,

I am facing problem in compiling cluster 1 or cluster 2 with kernel - 2.6.9.
cluster sources were picked from
ftp://sources.redhat.com/pub/cluster/releases/

can anyone please tell if its possible to compile GFS1 or 2 with kernel 2.6.9
any help is appreciated.


transcript:

cluster-2.00.00]# ./configure --kernel_src=/usr/src/vanilla/linux-2.6.9
configure gnbd-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure ccs

Configuring Makefiles for your system...
Completed Makefile configuration

configure cman

Configuring Makefiles for your system...
Completed Makefile configuration

configure group

Configuring Makefiles for your system...
Completed Makefile configuration

configure dlm

Configuring Makefiles for your system...
Completed Makefile configuration

configure fence

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs-kernel

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs

Configuring Makefiles for your system...
Completed Makefile configuration

configure gfs2

Configuring Makefiles for your system...
Completed Makefile configuration

configure gnbd

Configuring Makefiles for your system...
Completed Makefile configuration

configure rgmanager

Configuring Makefiles for your system...
Completed Makefile configuration

[root at IPDDFG0595ATL2 cluster-2.00.00]# make
make -C gnbd-kernel all
make[1]: Entering directory `/opt/sources/gfs/cluster-2.00.00/gnbd-kernel'
make -C src all
make[2]: Entering directory `/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src'
make -C /usr/src/vanilla/linux-2.6.9
M=/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src modules
USING_KBUILD=yes
make[3]: Entering directory `/usr/src/vanilla/linux-2.6.9'
  CC [M]  /opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.o
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c: In function
`store_sectors':
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:181: warning:
implicit declaration of function `mutex_lock'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:181: structure
has no member named `i_mutex'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:183: warning:
implicit declaration of function `mutex_unlock'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:183: structure
has no member named `i_mutex'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c: In function
`gnbd_end_request':
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:282: too many
arguments to function `end_that_request_last'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c: In function
`do_gnbd_request':
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:578: structure
has no member named `cmd_type'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c: In function
`gnbd_init':
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:888: structure
has no member named `cmd_type'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:888:
`REQ_TYPE_SPECIAL' undeclared (first use in this function)
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:888: (Each
undeclared identifier is reported only once
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:888: for each
function it appears in.)
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:893: structure
has no member named `cmd_type'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:913:
incompatible type for argument 1 of `elevator_exit'
/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.c:914: warning:
passing arg 2 of `elevator_init' from incompatible pointer type
make[4]: *** [/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src/gnbd.o] Error 1
make[3]: *** [_module_/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src] Error 2
make[3]: Leaving directory `/usr/src/vanilla/linux-2.6.9'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/opt/sources/gfs/cluster-2.00.00/gnbd-kernel/src'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/opt/sources/gfs/cluster-2.00.00/gnbd-kernel'
make: *** [all] Error 2




Regds
mallah.



From fajarpri at cbn.net.id  Mon Mar 12 13:13:19 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Mon, 12 Mar 2007 20:13:19 +0700
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
Message-ID: <200703122013.20165.fajarpri@cbn.net.id>

Hi All,
I'd like to setup RHCS for Postgres on 2-nodes cluster.
What is the best way to setup the resources? No GFS.
- Do I need to setup a script for postgres init.d? Or should I just let it on 
on both server from chkconfig?

Thanks.
-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial 
http://linux2.arinet.org
8:09pm up 12:12, 2.6.18.2-34-default GNU/Linux 
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070312/aa8bb200/attachment.sig>

From devrim at gunduz.org  Mon Mar 12 13:39:57 2007
From: devrim at gunduz.org (=?iso-8859-9?Q?Devrim_G=DCND=DCZ?=)
Date: Mon, 12 Mar 2007 15:39:57 +0200 (EET)
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <200703122013.20165.fajarpri@cbn.net.id>
References: <200703122013.20165.fajarpri@cbn.net.id>
Message-ID: <Pine.LNX.4.63.0703121522390.20080@mail.kivi.com.tr>


Hi,

On Mon, 12 Mar 2007, Fajar Priyanto wrote:

> I'd like to setup RHCS for Postgres on 2-nodes cluster.
> What is the best way to setup the resources? No GFS.

I'd use GFS to make sure that data is not corrupted. You can "live" with 
ext3 -- but in case of a problem, if two postmaster directly accesses the 
same node, data problem may occur.

http://www.gunduz.org/download.php?dlid=142

is the link to the presentation that I made at 10th PostgreSQL Anniversary 
Summit last year; which will give you a basic idea about these issues.

> - Do I need to setup a script for postgres init.d? Or should I just let it on
> on both server from chkconfig?

AFAIR you will need to change condstart routines to start routines for 
RHCS to start PostgreSQL correctly. So PostgreSQL's init script should be 
fine.

Regards,
--
Devrim G?ND?Z
devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
                       http://www.gunduz.org



From lhh at redhat.com  Mon Mar 12 18:07:54 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 12 Mar 2007 14:07:54 -0400
Subject: [Linux-cluster] Re: Failover not working
In-Reply-To: <45F06FF2.6020309@eons.com>
References: <45F036A4.1090106@eons.com>  <45F06FF2.6020309@eons.com>
Message-ID: <1173722874.4557.23.camel@asuka.boston.devel.redhat.com>

On Thu, 2007-03-08 at 15:20 -0500, Dave Berry wrote:
> rgmanager-1.9.54-1
> 
> >  That shouldn't happen - what rgmanager RPM do you have?
> >

Ok, I'll poke around.  If you have an easy way to reproduce this, could
you file a bugzilla with the exact necessary steps?

-- Lon




From lhh at redhat.com  Mon Mar 12 18:08:59 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 12 Mar 2007 14:08:59 -0400
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <200703122013.20165.fajarpri@cbn.net.id>
References: <200703122013.20165.fajarpri@cbn.net.id>
Message-ID: <1173722939.4557.25.camel@asuka.boston.devel.redhat.com>

On Mon, 2007-03-12 at 20:13 +0700, Fajar Priyanto wrote:
> Hi All,
> I'd like to setup RHCS for Postgres on 2-nodes cluster.
> What is the best way to setup the resources? No GFS.
> - Do I need to setup a script for postgres init.d? Or should I just let it on 
> on both server from chkconfig?
> 
> Thanks.

There's a postgres resource agent in CVS ... maybe you should start
there? :)

-- Lon




From lshen at cisco.com  Mon Mar 12 21:41:25 2007
From: lshen at cisco.com (Lin Shen (lshen))
Date: Mon, 12 Mar 2007 14:41:25 -0700
Subject: [Linux-cluster] Using GFS on Compact Flash
Message-ID: <08A9A3213527A6428774900A80DBD8D803A003D6@xmb-sjc-222.amer.cisco.com>

Does it make sense to use  GFS (local or cluster mode) on Compact Flash?
Will it greatly reduce the life expectancy of the Compact Flash compared
to using a local file system? The rational behind this is that GFS will
issue way more writes to the disk for its internal operations ( such as
dlm locking etc).

Lin   



From fajarpri at cbn.net.id  Tue Mar 13 02:14:05 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Tue, 13 Mar 2007 09:14:05 +0700
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <1173722939.4557.25.camel@asuka.boston.devel.redhat.com>
References: <200703122013.20165.fajarpri@cbn.net.id>
	<1173722939.4557.25.camel@asuka.boston.devel.redhat.com>
Message-ID: <200703130914.06009.fajarpri@cbn.net.id>

On Tuesday 13 March 2007 01:08, Lon Hohberger wrote:
> On Mon, 2007-03-12 at 20:13 +0700, Fajar Priyanto wrote:
> > Hi All,
> > I'd like to setup RHCS for Postgres on 2-nodes cluster.
> > What is the best way to setup the resources? No GFS.
> > - Do I need to setup a script for postgres init.d? Or should I just let
> > it on on both server from chkconfig?
> >
> > Thanks.
>
> There's a postgres resource agent in CVS ... maybe you should start
> there? :)
>
> -- Lon

Thanks.
BTW, can you pls send me patch file for init.d/functions ? The one that modify 
the exit code for stopping stopped service? I've googled for it, but havent 
found it yet.

-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial 
http://linux2.arinet.org
9:12am up 0:37, 2.6.18.2-34-default GNU/Linux 
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070313/6dbc391d/attachment.sig>

From fajarpri at cbn.net.id  Tue Mar 13 02:17:08 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Tue, 13 Mar 2007 09:17:08 +0700
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <1173722939.4557.25.camel@asuka.boston.devel.redhat.com>
References: <200703122013.20165.fajarpri@cbn.net.id>
	<1173722939.4557.25.camel@asuka.boston.devel.redhat.com>
Message-ID: <200703130917.09258.fajarpri@cbn.net.id>

On Tuesday 13 March 2007 01:08, Lon Hohberger wrote:
> On Mon, 2007-03-12 at 20:13 +0700, Fajar Priyanto wrote:
> > Hi All,
> > I'd like to setup RHCS for Postgres on 2-nodes cluster.
> > What is the best way to setup the resources? No GFS.
> > - Do I need to setup a script for postgres init.d? Or should I just let
> > it on on both server from chkconfig?
> >
> > Thanks.
>
> There's a postgres resource agent in CVS ... maybe you should start
> there? :)
>
> -- Lon

Ugh... found it.
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=111998

-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial 
http://linux2.arinet.org
9:16am up 0:42, 2.6.18.2-34-default GNU/Linux 
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070313/0d6218d8/attachment.sig>

From sail at serverengines.com  Tue Mar 13 02:47:45 2007
From: sail at serverengines.com (Sai Loganathan)
Date: Mon, 12 Mar 2007 19:47:45 -0700
Subject: [Linux-cluster] RE: Linux-cluster Digest, Vol 35, Issue 13
In-Reply-To: <20070310170007.4934A731DB@hormel.redhat.com>
References: <20070310170007.4934A731DB@hormel.redhat.com>
Message-ID: <015501c76519$fad5eca0$2702140a@se19261f2cf9ed>

Hello,
Thanks for the info. Now I am doing manual fencing but get the following
error whenever I do a failover.

Mar 12 17:25:50 node2 clurgmgrd[6088]: <info> State change: node1 DOWN
Mar 12 17:25:52 node2 clurgmgrd[6088]: <notice> Starting stopped service
iscsi_ip
Mar 12 17:25:52 node2 clurgmgrd: [6088]: <info> Adding IPv4 address
172.40.2.119 to eth2
Mar 12 17:25:52 node2 clurgmgrd[6088]: <notice> Starting stopped service
iscsi_lun
Mar 12 17:25:53 node2 clurgmgrd[6088]: <notice> Service iscsi_lun started
Mar 12 17:25:54 node2 clurgmgrd[6088]: <notice> Service iscsi_ip started
Mar 12 17:26:24 node2 kernel: CMAN: removing node node1 from the cluster :
Missed too many heartbeats
Mar 12 17:26:24 node2 fenced[6040]: node1 not a cluster member after 0 sec
post_fail_delay
Mar 12 17:26:24 node2 fenced[6040]: fencing node "node1"
Mar 12 17:26:24 node2 fence_manual: Node node1 needs to be reset before
recovery can procede.  Waiting for node1 to rejoin the cluster or for manual
acknowledgement that it has been reset (i.e. fence_ack_manual -n node1)

I just power down node 1 to simulate the failover to node2. Unless I execute
the command fence_ack_manual -n node1, the system will not move forward and
wait in fencing. How to fix this error?

During shutdown, I get the following error message and system waits there
infinitely.
Starting Killall: CMAN: sendmsg failed: -101
WARNING: dlm_emergency_shutdown
SM: 00000003 sm_stop: SG stilljoined
How to fix this error?

Thanks,
Sai Logan



-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of
linux-cluster-request at redhat.com
Sent: Saturday, March 10, 2007 9:00 AM
To: linux-cluster at redhat.com
Subject: Linux-cluster Digest, Vol 35, Issue 13

Send Linux-cluster mailing list submissions to
	linux-cluster at redhat.com

To subscribe or unsubscribe via the World Wide Web, visit
	https://www.redhat.com/mailman/listinfo/linux-cluster
or, via email, send a message with subject or body 'help' to
	linux-cluster-request at redhat.com

You can reach the person managing the list at
	linux-cluster-owner at redhat.com

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Linux-cluster digest..."


Today's Topics:

   1. Re: cluster not doing failover (Jonathan E Brassow)


----------------------------------------------------------------------

Message: 1
Date: Fri, 9 Mar 2007 19:53:40 -0600
From: Jonathan E Brassow <jbrassow at redhat.com>
Subject: Re: [Linux-cluster] cluster not doing failover
To: linux clustering <linux-cluster at redhat.com>
Message-ID: <40407159e8e6506b05d46c82d921d936 at redhat.com>
Content-Type: text/plain; charset="iso-8859-1"


On Mar 9, 2007, at 5:30 PM, Sai Loganathan wrote:
>             <fencedevices>
>                         <fencedevice agent="fence_ilo" hostname="admin"
> login="admin" name="node1_fence" passwd="admin"/>
>                         <fencedevice agent="fence_ilo" hostname="admin"
> login="admin" name="node2_fence" passwd="admin"/>
>             </fencedevices>

The above line look funny to me.  The hostname for the fence device is 
"admin"?

> Using the cluster ip address (172.40.2.119), I was able to do an nfs
> mount of the shared lun from a 3rd machine. Started an infinite ls on
> that lun.
> To simulate failover, I just powered-down the node1 and hoping to see
> the read io stop but resume via the node2. But, I see the following
> error message on the node 2.
> Mar  9 12:14:49 node2 fenced[7422]: fence "node1" failed
> Mar  9 12:14:54 node2 fenced[7422]: fencing node "node1"
> Mar  9 12:14:54 node2 fenced[7422]: agent "fence_ilo" reports: Can't
> call method "configure" on an undefined value at /sbin/fence_ilo line
> 169, <> line 4.
> Mar  9 12:14:54 node2 fenced[7422]: fence "node1" failed
> Mar  9 12:14:59 node2 fenced[7422]: fencing node "node1"
> Mar  9 12:14:59 node2 fenced[7422]: agent "fence_ilo" reports: Can't
> call method "configure" on an undefined value at /sbin/fence_ilo line
> 169, <> line 4.
>  
> Seems like I am not doing something correct with respect to fencing.
> Can I setup cluster without fencing first of all?

Yes.  You can use manual fencing.  That should only be used for testing 
purposes though... it is not a supported configuration.


  brassow
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 3471 bytes
Desc: not available
Url :
https://www.redhat.com/archives/linux-cluster/attachments/20070309/c015b8da/
attachment.bin

------------------------------

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

End of Linux-cluster Digest, Vol 35, Issue 13
*********************************************



_________________________________________________________________________________________________________________
This message and any attachment are confidential and may be privileged or otherwise protected from disclosure. If you are not the intended 
recipient please telephone or e-mail the sender and delete this message and all attachments from your system - ServerEngines  LLC





From fajarpri at cbn.net.id  Tue Mar 13 07:32:27 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Tue, 13 Mar 2007 14:32:27 +0700
Subject: [Linux-cluster] If power down, no error?
Message-ID: <200703131432.28935.fajarpri@cbn.net.id>

Hi all,
A friend of mine is setting up a 2-node cluster using RHEL4u3. No GFS.
All failover testings are ok (shutting down eth0, unplugging the cables, 
shutting down the service, etc). But, powering down the current active node 
doesnt make the failover to occur. And he said that in the log there's no 
error at all. How come? So strange.

Here's the conf:
<?xml version="1.0"?>
<cluster config_version="74" name="pgsql">
	<fence_daemon post_fail_delay="3" post_join_delay="1"/>
	<clusternodes>
		<clusternode name="solidpoint1" votes="1">
			<fence>
				<method name="1">
					<device name="solidpoint1ilo"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="solidpoint2" votes="1">
			<fence>
				<method name="1">
					<device name="solidpoint2ilo"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices>
		<fencedevice agent="fence_ilo" hostname="solidpoint1ilo" login="root" 
name="solidpoint1ilo" passwd="redhat123"/>
		<fencedevice agent="fence_ilo" hostname="solidpoint2ilo" login="root" 
name="solidpoint2ilo" passwd="redhat123"/>
	</fencedevices>
	<rm>
		<failoverdomains>
			<failoverdomain name="pgcluster" ordered="1" restricted="0">
				<failoverdomainnode name="solidpoint1" priority="1"/>
				<failoverdomainnode name="solidpoint2" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="192.168.0.89" monitor_link="1"/>
			<fs device="/dev/sda1" force_fsck="0" force_unmount="0" fsid="11199" 
fstype="ext3" mountpoint="/home" name="pgdevice" options="" self_fence="0"/>
			<script file="/etc/init.d/postgresql" name="postgresql"/>
		</resources>
		<service autostart="1" domain="pgcluster" exclusive="1" 
name="postgreservice" recovery="restart">
			<ip ref="192.168.0.89">
				<fs ref="pgdevice">
					<script ref="postgresql"/>
				</fs>
			</ip>
		</service>
	</rm>
</cluster>

Thank you very much.
-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial 
http://linux2.arinet.org
2:32pm up 5:58, 2.6.18.2-34-default GNU/Linux 
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070313/f04a70a9/attachment.sig>

From Hugues.Lafarge at afp.com  Tue Mar 13 09:49:23 2007
From: Hugues.Lafarge at afp.com (Hugues Lafarge)
Date: Tue, 13 Mar 2007 10:49:23 +0100
Subject: [Linux-cluster] FAQ correction ?!
Message-ID: <877itlqxr0.fsf@merlot.par.afp.com>


I'm not sure i correctly understand http://sources.redhat.com/cluster/faq.html#cman_2to3
so i'm kindly asking you if there is not a typo in the following sentence:

8. How do I add a third node to my two-node cluster?

 Unfortunately, two-node clusters are a special case. Because a
 two-node cluster only needs two nodes to establish quorum, the only
 way to add a third node is to shut down the cluster, add the third
 node into your /etc/cluster/cluster.conf file and get rid of
 two_node="1", then restart all three nodes. 


I would have expected that to be:

 Unfortunately, two-node clusters are a special case. Because a
 two-node cluster only needs ONE NODE to establish quorum, ...

Am i right there, or misunderstanding something ?

-- 
Hugues Lafarge                    || Email: Hugues.Lafarge at afp.com
Agence France Presse              || Phone: +33 1 40 41 77 15
4 rue de la bourse, 75002 Paris   ||   Fax: +33 1 40 41 79 24



This e-mail, and any file transmitted with it, is confidential and  intended solely for the use of the individual or entity to whom it is addressed. If you have received this email in error, please  contact the sender and delete the email from your system. If you are  not the named addressee you should not disseminate, distribute or copy  this email.

For more information on Agence France-Presse, please visit our web site at http://www.afp.com



From Dan.HAWKER at uk4.astrium.eads.net  Tue Mar 13 09:56:32 2007
From: Dan.HAWKER at uk4.astrium.eads.net (HAWKER, Dan (external))
Date: Tue, 13 Mar 2007 09:56:32 -0000
Subject: [Linux-cluster] Using GFS on Compact Flash
Message-ID: <7F6B06837A5DBD49AC6E1650EFF5490601223274@auk52177.ukr.astrium.corp>


> Does it make sense to use  GFS (local or cluster mode) on
> Compact Flash? Will it greatly reduce the life expectancy of
> the Compact Flash compared to using a local file system? The
> rational behind this is that GFS will issue way more writes
> to the disk for its internal operations ( such as dlm locking etc).
>

Ummm, why would you want to use CF & GFS together???

Surely you'd only use GFS for a shared pool of storage. As the largest
CF card you can sensibly get atm is around 8GB, that's not a lot of
storage to be shared by multiple nodes.

Now we use CF cards for embedded boxes, but for exactly the reasons you
mentioned (poor MTBF/reduced lifetime due to the read/write process),
our cards are invariably read-only filesystems and we use ext2 to get
rid of the journaling too. No swap either (which has caused some trouble
on one or two occasions, but that's another story).

Dan

--

Dan Hawker
Linux System Administrator
Astrium
http://www.astrium.eads.net

--

This email (including any attachments) may contain confidential and/or privileged information or information otherwise protected from disclosure.
If you are not the intended recipient, please notify the sender immediately, do not copy this message or any attachments and do not use it for any purpose or disclose its content to any person, but delete this message and any attachments from your system.
Astrium disclaims any and all liability if this email transmission was virus corrupted, altered or falsified.
---------------------------------------------------------------------
Astrium Limited, Registered in England and Wales No. 2449259
Registered Office: Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England



From fajarpri at cbn.net.id  Tue Mar 13 10:07:45 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Tue, 13 Mar 2007 17:07:45 +0700
Subject: [Linux-cluster] FAQ correction ?!
In-Reply-To: <877itlqxr0.fsf@merlot.par.afp.com>
References: <877itlqxr0.fsf@merlot.par.afp.com>
Message-ID: <200703131707.46648.fajarpri@cbn.net.id>

On Tuesday 13 March 2007 16:49, Hugues Lafarge wrote:
> I'm not sure i correctly understand
> http://sources.redhat.com/cluster/faq.html#cman_2to3 so i'm kindly asking
> you if there is not a typo in the following sentence:
>
> 8. How do I add a third node to my two-node cluster?
>
>  Unfortunately, two-node clusters are a special case. Because a
>  two-node cluster only needs two nodes to establish quorum, the only
>  way to add a third node is to shut down the cluster, add the third
>  node into your /etc/cluster/cluster.conf file and get rid of
>  two_node="1", then restart all three nodes.
>
>
> I would have expected that to be:
>
>  Unfortunately, two-node clusters are a special case. Because a
>  two-node cluster only needs ONE NODE to establish quorum, ...
>
> Am i right there, or misunderstanding something ?

In my opinion, you're both correct.
But, this sentence can be easier to understand:
Two-node clusters are a special case. Because a
two-node cluster only needs two nodes to establish quorum, the only
way to add a third node is to shut down the cluster, add the third
node into your /etc/cluster/cluster.conf file (either by hand or 
system-config-cluster), copy it into the newly joined cluster, and restart 
all of the nodes. (the two_node="1" will be removed automatically by 
system-config-cluster).

-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial 
http://linux2.arinet.org
5:07pm up 8:33, 2.6.18.2-34-default GNU/Linux 
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070313/ef48a7a0/attachment.sig>

From bobby.m.dalton at nasa.gov  Tue Mar 13 13:12:30 2007
From: bobby.m.dalton at nasa.gov (Dalton, Maurice)
Date: Tue, 13 Mar 2007 08:12:30 -0500
Subject: [Linux-cluster] Qdiskd problems
Message-ID: <EB190CD1E73E1146ACB7694746E205A802E6007E@hx1.ums.msfc.nasa.gov>

Cannot get qdiskd to start on my cluster.

This is what I have added to cluster.conf

<quorumd interval="1" tko="10" votes="3" log_level="7"
label="/quorum_disk_sda1">
            <heuristic program="ping 198.xxx.xx.xx -c1 -t1" score="1"
interval="2"/>
        </quorumd>

This is all I get in the logs..

qdiskd[14896]: <info> Quorum Daemon Initializing
qdiskd[14896]: <crit> Initialization failed

Output from mkqdisk -L
mkqdisk v0.5
/dev/sda1:
        Magic:   eb7a62c2
        Label:   /quorum_disk_sda1
        Created: Tue Feb  6 19:50:14 2007
        Host:    xxxxxxxxxx

Not sure what I am doing wrong..




-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070313/588b6e30/attachment.htm>

From maciej.bogucki at artegence.com  Tue Mar 13 15:51:16 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Tue, 13 Mar 2007 16:51:16 +0100
Subject: [Linux-cluster] FAQ: How do I change the time after which a
 non-responsive node is considered dead?
Message-ID: <45F6C874.9020700@artegence.com>

Hello,

In the FAQ we have:

---cut---
Just add deadnode_timer="value" to the cman section in your cluster.conf
file. For example:

<cman deadnode_timer="21">

The default value is 21 seconds.
---cut---

Do I need to restart cman for the changes to take effect?

When i propagate the cluster.conf file to a running cluster I still have
old values in /proc/ but cluster.conf in updated.

# cat /proc/cluster/config/cman/deadnode_timeout
21
# grep dead /etc/cluster/cluster.conf
        <cman expected_votes="1" two_node="1" deadnode_timeout="120"
hello_timer="10"/>
#


Best Regards
Maciej Bogucki



From pcaulfie at redhat.com  Tue Mar 13 16:29:34 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Tue, 13 Mar 2007 16:29:34 +0000
Subject: [Linux-cluster] FAQ: How do I change the time after which a
	non-responsive node is considered dead?
In-Reply-To: <45F6C874.9020700@artegence.com>
References: <45F6C874.9020700@artegence.com>
Message-ID: <45F6D16E.2040806@redhat.com>

Maciej Bogucki wrote:
> Hello,
> 
> In the FAQ we have:
> 
> ---cut---
> Just add deadnode_timer="value" to the cman section in your cluster.conf
> file. For example:
> 
> <cman deadnode_timer="21">
> 
> The default value is 21 seconds.
> ---cut---
> 
> Do I need to restart cman for the changes to take effect?
> 
> When i propagate the cluster.conf file to a running cluster I still have
> old values in /proc/ but cluster.conf in updated.
> 
> # cat /proc/cluster/config/cman/deadnode_timeout
> 21
> # grep dead /etc/cluster/cluster.conf
>         <cman expected_votes="1" two_node="1" deadnode_timeout="120"
> hello_timer="10"/>
> #


Updating cluster.conf just changes the values that will be used next time cman
starts up. If you need to change it on the fly you'll also need to echo the new
values into /proc/cluster/config/cman/deadnode_timeout (etc)

-- 

patrick



From rpeterso at redhat.com  Tue Mar 13 17:41:40 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 13 Mar 2007 12:41:40 -0500
Subject: [Linux-cluster] FAQ correction ?!
In-Reply-To: <200703131707.46648.fajarpri@cbn.net.id>
References: <877itlqxr0.fsf@merlot.par.afp.com>
	<200703131707.46648.fajarpri@cbn.net.id>
Message-ID: <45F6E254.9020307@redhat.com>

Fajar Priyanto wrote:
> On Tuesday 13 March 2007 16:49, Hugues Lafarge wrote:
>> I'm not sure i correctly understand
>> http://sources.redhat.com/cluster/faq.html#cman_2to3 so i'm kindly asking
>> you if there is not a typo in the following sentence:
>>
>> 8. How do I add a third node to my two-node cluster?
>>
>>  Unfortunately, two-node clusters are a special case. Because a
>>  two-node cluster only needs two nodes to establish quorum, the only
>>  way to add a third node is to shut down the cluster, add the third
>>  node into your /etc/cluster/cluster.conf file and get rid of
>>  two_node="1", then restart all three nodes.
>>
>>
>> I would have expected that to be:
>>
>>  Unfortunately, two-node clusters are a special case. Because a
>>  two-node cluster only needs ONE NODE to establish quorum, ...
>>
>> Am i right there, or misunderstanding something ?
> 
> In my opinion, you're both correct.
> But, this sentence can be easier to understand:
> Two-node clusters are a special case. Because a
> two-node cluster only needs two nodes to establish quorum, the only
> way to add a third node is to shut down the cluster, add the third
> node into your /etc/cluster/cluster.conf file (either by hand or 
> system-config-cluster), copy it into the newly joined cluster, and restart 
> all of the nodes. (the two_node="1" will be removed automatically by 
> system-config-cluster).

Hi,

Actually, in a two-node cluster (without quorum disk/partition)
two nodes are necessary to establish quorum, but only one node
is enough to maintain quorum once it's established.
By design, a two-node cluster can't be started with only
one node, even with a quorum disk/partition that contributes votes.
That was done to guard against split-brain.

That still doesn't explain why you have to restart the cluster
when adding a third node:  it's because the two_node value
can't be undone once it's been set in memory.

It was just a bad explanation on my part, and I apologize.
I updated the FAQ so it's a little bit more accurate.

Regards,

Bob Peterson
Red Hat Cluster Suite



From lshen at cisco.com  Tue Mar 13 17:56:26 2007
From: lshen at cisco.com (Lin Shen (lshen))
Date: Tue, 13 Mar 2007 10:56:26 -0700
Subject: [Linux-cluster] Using GFS on Compact Flash
In-Reply-To: <7F6B06837A5DBD49AC6E1650EFF5490601223274@auk52177.ukr.astrium.corp>
Message-ID: <08A9A3213527A6428774900A80DBD8D803A006A2@xmb-sjc-222.amer.cisco.com>

 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> HAWKER, Dan (external)
> Sent: Tuesday, March 13, 2007 2:57 AM
> To: linux clustering
> Subject: RE: [Linux-cluster] Using GFS on Compact Flash
> 
> 
> > Does it make sense to use  GFS (local or cluster mode) on Compact 
> > Flash? Will it greatly reduce the life expectancy of the 
> Compact Flash 
> > compared to using a local file system? The rational behind this is 
> > that GFS will issue way more writes to the disk for its internal 
> > operations ( such as dlm locking etc).
> >
> 
> Ummm, why would you want to use CF & GFS together???
> 
> Surely you'd only use GFS for a shared pool of storage. As 
> the largest CF card you can sensibly get atm is around 8GB, 
> that's not a lot of storage to be shared by multiple nodes.
> 
> Now we use CF cards for embedded boxes, but for exactly the 
> reasons you mentioned (poor MTBF/reduced lifetime due to the 
> read/write process), our cards are invariably read-only 
> filesystems and we use ext2 to get rid of the journaling too. 
> No swap either (which has caused some trouble on one or two 
> occasions, but that's another story).
> 
> Dan

Have you done any measurements in regard to the amount of extra
read/write introduced by GFS+cluster suite+fencing+journaling etc?
And/or how much CF lifetime reduction is that translated to?

Lin  


> 
> --
> 
> Dan Hawker
> Linux System Administrator
> Astrium
> http://www.astrium.eads.net
> 
> --
> 
> This email (including any attachments) may contain 
> confidential and/or privileged information or information 
> otherwise protected from disclosure.
> If you are not the intended recipient, please notify the 
> sender immediately, do not copy this message or any 
> attachments and do not use it for any purpose or disclose its 
> content to any person, but delete this message and any 
> attachments from your system.
> Astrium disclaims any and all liability if this email 
> transmission was virus corrupted, altered or falsified.
> ---------------------------------------------------------------------
> Astrium Limited, Registered in England and Wales No. 2449259 
> Registered Office: Gunnels Wood Road, Stevenage, 
> Hertfordshire, SG1 2AS, England
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From rpeterso at redhat.com  Tue Mar 13 18:16:19 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 13 Mar 2007 13:16:19 -0500
Subject: [Linux-cluster] Using GFS on Compact Flash
In-Reply-To: <08A9A3213527A6428774900A80DBD8D803A003D6@xmb-sjc-222.amer.cisco.com>
References: <08A9A3213527A6428774900A80DBD8D803A003D6@xmb-sjc-222.amer.cisco.com>
Message-ID: <45F6EA73.8040409@redhat.com>

Lin Shen (lshen) wrote:
> Does it make sense to use  GFS (local or cluster mode) on Compact Flash?
> Will it greatly reduce the life expectancy of the Compact Flash compared
> to using a local file system? The rational behind this is that GFS will
> issue way more writes to the disk for its internal operations ( such as
> dlm locking etc).
> 
> Lin   
Hi Lin,

Here are my thoughts:

I'm not aware of anyone using compact flash with GFS.
GFS has no wear-leveling, so life expectancy might be an issue.
The same goes for most file systems except those specifically
written for CF, like jffs2.

The file and directory data isn't a concern: Linux page cache should 
manage the data buffers normally.  The cluster locks and glocks are 
also not a concern, since everything is managed in memory.  However, 
the GFS Resource Groups (RGs) (Not to be confused with rgmanager's 
resource groups) are a bigger concern.  Part of the RGs have bitmaps
of blocks to indicate which blocks are allocated and that may be rewritten
many times as blocks are allocated and released.  However,
I haven't studied how often these are actually written back to the media.

Another concern are the journal areas, which are being written over and over.
They aren't as bad as the RGs, though, because the journals have
lots of space, and therefore it's not always the same block getting
written over and over.  The journals may be forced to disk more though,
because if a node crashes for whatever reason, the other nodes need access
to the journals to replay the node's data to ensure file system integrity.

Regards,

Bob Peterson
Red Hat Cluster Suite



From lhh at redhat.com  Tue Mar 13 18:51:16 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 13 Mar 2007 14:51:16 -0400
Subject: [Linux-cluster] Qdiskd problems
In-Reply-To: <EB190CD1E73E1146ACB7694746E205A802E6007E@hx1.ums.msfc.nasa.gov>
References: <EB190CD1E73E1146ACB7694746E205A802E6007E@hx1.ums.msfc.nasa.gov>
Message-ID: <1173811876.4557.35.camel@asuka.boston.devel.redhat.com>

On Tue, 2007-03-13 at 08:12 -0500, Dalton, Maurice wrote:
> Cannot get qdiskd to start on my cluster.
> 
> This is what I have added to cluster.conf
> 
> <quorumd interval="1" tko="10" votes="3" log_level="7"
> label="/quorum_disk_sda1"> 
>             <heuristic program="ping 198.xxx.xx.xx -c1 -t1" score="1"
> interval="2"/> 
>         </quorumd>
> 
> This is all I get in the logs..
> 
> qdiskd[14896]: <info> Quorum Daemon Initializing 
> qdiskd[14896]: <crit> Initialization failed
> 
> Output from mkqdisk -L 
> mkqdisk v0.5 
> /dev/sda1: 
>         Magic:   eb7a62c2 
>         Label:   /quorum_disk_sda1 
>         Created: Tue Feb  6 19:50:14 2007 
>         Host:    xxxxxxxxxx
> 
> Not sure what I am doing wrong..

Try 'qdiskd -fd' and see what it tells you?

In the current release as of 4.4, there are a number of things qdiskd
doesn't wait around for - e.g. cman running, for example.  In the 4.5
release, this will be fixed.

Updated packages are available from my people page if you would like to
test them:

http://people.redhat.com/lhh/packages.html

-- Lon



From bobby.m.dalton at nasa.gov  Tue Mar 13 19:04:56 2007
From: bobby.m.dalton at nasa.gov (Dalton, Maurice)
Date: Tue, 13 Mar 2007 14:04:56 -0500
Subject: [Linux-cluster] Qdiskd problems
In-Reply-To: <1173811876.4557.35.camel@asuka.boston.devel.redhat.com>
References: <EB190CD1E73E1146ACB7694746E205A802E6007E@hx1.ums.msfc.nasa.gov>
	<1173811876.4557.35.camel@asuka.boston.devel.redhat.com>
Message-ID: <EB190CD1E73E1146ACB7694746E205A802E6008F@hx1.ums.msfc.nasa.gov>

Thanks.. I think I just got it working a couple of minutes ago..

I will look at the new releases.. Thanks again..
 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger
Sent: Tuesday, March 13, 2007 1:51 PM
To: linux clustering
Subject: Re: [Linux-cluster] Qdiskd problems

On Tue, 2007-03-13 at 08:12 -0500, Dalton, Maurice wrote:
> Cannot get qdiskd to start on my cluster.
> 
> This is what I have added to cluster.conf
> 
> <quorumd interval="1" tko="10" votes="3" log_level="7"
> label="/quorum_disk_sda1"> 
>             <heuristic program="ping 198.xxx.xx.xx -c1 -t1" score="1"
> interval="2"/> 
>         </quorumd>
> 
> This is all I get in the logs..
> 
> qdiskd[14896]: <info> Quorum Daemon Initializing
> qdiskd[14896]: <crit> Initialization failed
> 
> Output from mkqdisk -L
> mkqdisk v0.5
> /dev/sda1: 
>         Magic:   eb7a62c2 
>         Label:   /quorum_disk_sda1 
>         Created: Tue Feb  6 19:50:14 2007 
>         Host:    xxxxxxxxxx
> 
> Not sure what I am doing wrong..

Try 'qdiskd -fd' and see what it tells you?

In the current release as of 4.4, there are a number of things qdiskd
doesn't wait around for - e.g. cman running, for example.  In the 4.5
release, this will be fixed.

Updated packages are available from my people page if you would like to
test them:

http://people.redhat.com/lhh/packages.html

-- Lon

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Tue Mar 13 19:34:43 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 13 Mar 2007 15:34:43 -0400
Subject: [Linux-cluster] FAQ correction ?!
In-Reply-To: <45F6E254.9020307@redhat.com>
References: <877itlqxr0.fsf@merlot.par.afp.com>
	<200703131707.46648.fajarpri@cbn.net.id> <45F6E254.9020307@redhat.com>
Message-ID: <1173814483.4557.46.camel@asuka.boston.devel.redhat.com>

On Tue, 2007-03-13 at 12:41 -0500, Robert Peterson wrote:
> Fajar Priyanto wrote:
> > On Tuesday 13 March 2007 16:49, Hugues Lafarge wrote:
> >> I'm not sure i correctly understand
> >> http://sources.redhat.com/cluster/faq.html#cman_2to3 so i'm kindly asking
> >> you if there is not a typo in the following sentence:
> >>
> >> 8. How do I add a third node to my two-node cluster?
> >>
> >>  Unfortunately, two-node clusters are a special case. Because a
> >>  two-node cluster only needs two nodes to establish quorum, the only
> >>  way to add a third node is to shut down the cluster, add the third
> >>  node into your /etc/cluster/cluster.conf file and get rid of
> >>  two_node="1", then restart all three nodes.
> >>
> >>
> >> I would have expected that to be:
> >>
> >>  Unfortunately, two-node clusters are a special case. Because a
> >>  two-node cluster only needs ONE NODE to establish quorum, ...
> >>
> >> Am i right there, or misunderstanding something ?
> > 
> > In my opinion, you're both correct.
> > But, this sentence can be easier to understand:
> > Two-node clusters are a special case. Because a
> > two-node cluster only needs two nodes to establish quorum, the only
> > way to add a third node is to shut down the cluster, add the third
> > node into your /etc/cluster/cluster.conf file (either by hand or 
> > system-config-cluster), copy it into the newly joined cluster, and restart 
> > all of the nodes. (the two_node="1" will be removed automatically by 
> > system-config-cluster).
> 
> Hi,
> 
> Actually, in a two-node cluster (without quorum disk/partition)
> two nodes are necessary to establish quorum, but only one node
> is enough to maintain quorum once it's established.

Actually, that's not true.  With two_node="1", cman will fence the node
that isn't a cluster member in order to establish itself as the only
node alive.  If the two nodes can not see each other but are both online
(e.g. network partition), this situation is called the 'race-to-fence'
condition.

Under normal circumstances, one node can not communicate - and therefore
- cannot fence the other node. 

(For RHCS3, you are correct - both nodes or manual intervention are
required if disk-tiebreaker isn't used...)

> By design, a two-node cluster can't be started with only
> one node, even with a quorum disk/partition that contributes votes.
> That was done to guard against split-brain.

You can start it with disk votes, too.  It currently doesn't work
correctly because of a bug in CMAN where it doesn't advertise qdisk
votes to the service manager.

Patch is here:

  http://people.redhat.com/lhh/cman-kernel-fixes.patch

(^^ Patrick made that patch, really...).

The requirement for fencing is *not* obviated by the use of qdisk,
however; CMAN can and will still fence the presumed-dead node(s) before
a quorum is established.

-- Lon



From lshen at cisco.com  Tue Mar 13 21:15:01 2007
From: lshen at cisco.com (Lin Shen (lshen))
Date: Tue, 13 Mar 2007 14:15:01 -0700
Subject: [Linux-cluster] Using GFS on Compact Flash
In-Reply-To: <45F6EA73.8040409@redhat.com>
Message-ID: <08A9A3213527A6428774900A80DBD8D803A007F6@xmb-sjc-222.amer.cisco.com>

 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Peterson
> Sent: Tuesday, March 13, 2007 11:16 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] Using GFS on Compact Flash
> 
> Lin Shen (lshen) wrote:
> > Does it make sense to use  GFS (local or cluster mode) on 
> Compact Flash?
> > Will it greatly reduce the life expectancy of the Compact Flash 
> > compared to using a local file system? The rational behind this is 
> > that GFS will issue way more writes to the disk for its internal 
> > operations ( such as dlm locking etc).
> > 
> > Lin   
> Hi Lin,
> 
> Here are my thoughts:
> 
> I'm not aware of anyone using compact flash with GFS.
> GFS has no wear-leveling, so life expectancy might be an issue.
> The same goes for most file systems except those specifically 
> written for CF, like jffs2.
> 
> The file and directory data isn't a concern: Linux page cache 
> should manage the data buffers normally.  The cluster locks 
> and glocks are also not a concern, since everything is 
> managed in memory.  However, the GFS Resource Groups (RGs) 
> (Not to be confused with rgmanager's resource groups) are a 
> bigger concern.  Part of the RGs have bitmaps of blocks to 
> indicate which blocks are allocated and that may be rewritten 
> many times as blocks are allocated and released.  However, I 
> haven't studied how often these are actually written back to 
> the media.
> 
> Another concern are the journal areas, which are being 
> written over and over.
> They aren't as bad as the RGs, though, because the journals 
> have lots of space, and therefore it's not always the same 
> block getting written over and over.  The journals may be 
> forced to disk more though, because if a node crashes for 
> whatever reason, the other nodes need access to the journals 
> to replay the node's data to ensure file system integrity.
> 
> Regards,
> 
> Bob Peterson
> Red Hat Cluster Suite

Hi Bob,

This is very good info. The CF we use has builtin wear-levelling, wonder
if this will make running GFS on top of it more resonable. Of course, if
GFS is like FAT in the way that file system structure is frequently
re-written in-place and cannot be reallocated or moved after wear
failure, then the builtin wear-levelling is not good enough. 

Since we'll need to run a file system on the CF, and most likely one of
either GFS, Reiser or FAT, we'd really like to compare the extra disk
writes introduced by the various file systems. I know it's hard to get
the exact numbers, but do you think GFS is far more worse in this
regard. 

BTW, is there also a big difference when GFS is running in local and
clustered mode?

Last, does GFS issue any disk writes when it's idle, meaning no one is
using the file system? Do the cluster and fencing stuff also introduce
disk writes?

Lin     


> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From jstoner at opsource.net  Tue Mar 13 21:17:10 2007
From: jstoner at opsource.net (Jeff Stoner)
Date: Tue, 13 Mar 2007 21:17:10 -0000
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <Pine.LNX.4.63.0703121522390.20080@mail.kivi.com.tr>
Message-ID: <38A48FA2F0103444906AD22E14F1B5A305791E4B@mailxchg01.corp.opsource.net>

> -----Original Message-----
> I'd use GFS to make sure that data is not corrupted. You can 
> "live" with
> ext3 -- but in case of a problem, if two postmaster directly 
> accesses the same node, data problem may occur.

It's not just "if 2 systems access it at the same time" - the filesystem
itself my get corrupted (yes, I've seen journaled filesystems get
corrupted.)  Which means the filesystem on a SAN becomes a single point
of failure. For this reason, I have a 3rd MySQL server set up as a slave
replication server to my clustered MySQL database. It provides that
extra layer of protection against this SPoF.

Food for thought.

--Jeff
SME - UNIX
OpSource Inc.

PGP Key ID 0x6CB364CA 




From srramasw at cisco.com  Wed Mar 14 00:25:14 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Tue, 13 Mar 2007 17:25:14 -0700
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
Message-ID: <B14199FA0DBAAF4AA89E83EB41D3543503367902@xmb-sjc-22c.amer.cisco.com>

I hit various kernel oops while running some metadata intensive test on
GFS2 filesystem. This volume is intended to be used as local fileystem.
So it was created using lock_nolock. Also no cman, fence and any such
things are running in the system.
 
RHEL4 distro
Linux kernel 2.6.20.1 (from kernel.org)
cluster-2.00.00 (from tarball)
 
David.T earlier suggested to move to newer kernel (2.6.21-rc), which I'm
planning to do. Meanwhile I want to poll the alias on such issues in
GFS2. Ofcourse I'm all ears on any other ideas to resolve this!
 
thanks,
Sridhar
 
$time bonnie++ -u 99 -f -x 3 -d /mnt/gfs2/bonnie -s 0 -n 30:4096:4096:30
 
(1) 
 
Mar 13 15:25:56 cfs1 kernel: ------------[ cut here ]------------
Mar 13 15:25:56 cfs1 kernel: kernel BUG at fs/gfs2/meta_io.c:474!
Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000 [#1]
Mar 13 15:25:56 cfs1 kernel: SMP
Mar 13 15:25:56 cfs1 kernel: Modules linked in: lock_nolock gfs2
reiserfs nfsd exportfs nfs lockd nfs_acl ipv6 parport_pc lp parport
autofs4 sunrpc dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd
intel_rng rng_core i2c_i801 i2c_core e1000 e100 mii floppy ext3 jbd
Mar 13 15:25:56 cfs1 kernel: CPU:    1
Mar 13 15:25:56 cfs1 kernel: EIP:    0060:[<e0c88da5>]    Not tainted
VLI
Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246   (2.6.20.1 #1)
Mar 13 15:25:56 cfs1 kernel: EIP is at
gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx: 00012bf5   ecx:
ce5a6dd4   edx: dc5eae00
Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi: 00000000   ebp:
dc5ea9a8   esp: ce5a6d58
Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
Mar 13 15:25:56 cfs1 kernel: Process bonnie++ (pid: 5509, ti=ce5a6000
task=dc320570 task.ti=ce5a6000)
Mar 13 15:25:56 cfs1 kernel: Stack: c156d274 ce5a6dd4 c016eb91 ce5a6dd4
00000000 dc5eae00 00000000 d6f34000
Mar 13 15:25:56 cfs1 kernel:        00000000 00000000 dc5ea9a8 d57794a8
ce5a6e08 e0c83ad3 00012bf5 00000000
Mar 13 15:25:56 cfs1 kernel:        00000000 ce5a6da0 00000000 00000000
dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
Mar 13 15:25:56 cfs1 kernel: Call Trace:
Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>] iget5_locked+0x3d/0x67
Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] gfs2_inode_refresh+0x34/0xfe
[gfs2]
Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] gfs2_createi+0x12c/0x191
[gfs2]
Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>] gfs2_create+0x5c/0x103 [gfs2]
Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>] gfs2_createi+0x4a/0x191
[gfs2]
Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>] gfs2_glock_nq_num+0x3f/0x64
[gfs2]
Mar 13 15:25:56 cfs1 kernel:  [<c016552e>] vfs_create+0xc3/0x126
Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>] open_namei_create+0x47/0x88
Mar 13 15:25:56 cfs1 kernel:  [<c016597d>] open_namei+0x14a/0x539
Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>] do_filp_open+0x25/0x39
Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>] strncpy_from_user+0x3c/0x5b
Mar 13 15:25:56 cfs1 kernel:  [<c015d431>] get_unused_fd+0xa8/0xb1
Mar 13 15:25:56 cfs1 kernel:  [<c015d509>] do_sys_open+0x42/0xbe
Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>] sys_open+0x1a/0x1c
Mar 13 15:25:56 cfs1 kernel:  [<c015d5dd>] sys_creat+0x1f/0x23
Mar 13 15:25:56 cfs1 kernel:  [<c0103410>] sysenter_past_esp+0x5d/0x81
Mar 13 15:25:56 cfs1 kernel:  =======================
Mar 13 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44 24 1c 8b 85 f0
01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85 c0 89 44 24 18 c7 44 24
10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c 00 75 04 0f 0b eb fe 8d
85 24 04 00 00
Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>]
gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2] SS:ESP 0068:ce5a6d58
 
 
(2) 
 
Mar 13 17:00:30 cfs1 kernel: Call Trace:
Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] gfs2_unlink+0x53/0xe0 [gfs2]
Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>] gfs2_unlink+0x3a/0xe0 [gfs2]
Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] gfs2_unlink+0x53/0xe0 [gfs2]
Mar 13 17:00:30 cfs1 kernel:  [<c0166572>] vfs_unlink+0xa1/0xc5
Mar 13 17:00:30 cfs1 kernel:  [<c016662b>] do_unlinkat+0x95/0xf5
Mar 13 17:00:30 cfs1 kernel:  [<c01187de>] scheduler_tick+0x8f/0x95
Mar 13 17:00:30 cfs1 kernel:  [<c0103410>] sysenter_past_esp+0x5d/0x81
Mar 13 17:00:30 cfs1 kernel:  =======================

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070313/f64de53d/attachment.htm>

From wcheng at redhat.com  Wed Mar 14 05:29:26 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Wed, 14 Mar 2007 00:29:26 -0500
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <B14199FA0DBAAF4AA89E83EB41D3543503367902@xmb-sjc-22c.amer.cisco.com>
References: <B14199FA0DBAAF4AA89E83EB41D3543503367902@xmb-sjc-22c.amer.cisco.com>
Message-ID: <45F78836.20404@redhat.com>

Sridhar Ramaswamy (srramasw) wrote:

>
> Mar 13 15:25:56 cfs1 kernel: ------------[ cut here ]------------
> Mar 13 15:25:56 cfs1 kernel: kernel BUG at fs/gfs2/meta_io.c:474!


I don't have time to pull kernel.org source code right now. Could you 
cut-and-paste the following two routines (from fs/gfs2/inode.c):
gfs2_ilookup() and gfs2_iget() so we can look into this quickly ? They 
are all one-liner so shouldn't be too much troubles. GFS2 currently has 
ino issue with its lookup code.

-- Wendy

> Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000 [#1]
> Mar 13 15:25:56 cfs1 kernel: SMP
> Mar 13 15:25:56 cfs1 kernel: Modules linked in: lock_nolock gfs2 
> reiserfs nfsd exportfs nfs lockd nfs_acl ipv6 parport_pc lp parport 
> autofs4 sunrpc dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd 
> intel_rng rng_core i2c_i801 i2c_core e1000 e100 mii floppy ext3 jbd
> Mar 13 15:25:56 cfs1 kernel: CPU:    1
> Mar 13 15:25:56 cfs1 kernel: EIP:    0060:[<e0c88da5>]    Not tainted VLI
> Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246   (2.6.20.1 #1)
> Mar 13 15:25:56 cfs1 kernel: EIP is at 
> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
> Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx: 00012bf5   ecx: 
> ce5a6dd4   edx: dc5eae00
> Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi: 00000000   ebp: 
> dc5ea9a8   esp: ce5a6d58
> Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
> Mar 13 15:25:56 cfs1 kernel: Process bonnie++ (pid: 5509, ti=ce5a6000 
> task=dc320570 task.ti=ce5a6000)
> Mar 13 15:25:56 cfs1 kernel: Stack: c156d274 ce5a6dd4 c016eb91 
> ce5a6dd4 00000000 dc5eae00 00000000 d6f34000
> Mar 13 15:25:56 cfs1 kernel:        00000000 00000000 dc5ea9a8 
> d57794a8 ce5a6e08 e0c83ad3 00012bf5 00000000
> Mar 13 15:25:56 cfs1 kernel:        00000000 ce5a6da0 00000000 
> 00000000 dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
> Mar 13 15:25:56 cfs1 kernel: Call Trace:
> Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>] iget5_locked+0x3d/0x67
> Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] 
> gfs2_inode_refresh+0x34/0xfe [gfs2]
> Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] gfs2_createi+0x12c/0x191 [gfs2]
> Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>] gfs2_create+0x5c/0x103 [gfs2]
> Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>] gfs2_createi+0x4a/0x191 [gfs2]
> Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>] gfs2_glock_nq_num+0x3f/0x64 
> [gfs2]
> Mar 13 15:25:56 cfs1 kernel:  [<c016552e>] vfs_create+0xc3/0x126
> Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>] open_namei_create+0x47/0x88
> Mar 13 15:25:56 cfs1 kernel:  [<c016597d>] open_namei+0x14a/0x539
> Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>] do_filp_open+0x25/0x39
> Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>] strncpy_from_user+0x3c/0x5b
> Mar 13 15:25:56 cfs1 kernel:  [<c015d431>] get_unused_fd+0xa8/0xb1
> Mar 13 15:25:56 cfs1 kernel:  [<c015d509>] do_sys_open+0x42/0xbe
> Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>] sys_open+0x1a/0x1c
> Mar 13 15:25:56 cfs1 kernel:  [<c015d5dd>] sys_creat+0x1f/0x23
> Mar 13 15:25:56 cfs1 kernel:  [<c0103410>] sysenter_past_esp+0x5d/0x81
> Mar 13 15:25:56 cfs1 kernel:  =======================
> Mar 13 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44 24 1c 8b 85 f0 
> 01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85 c0 89 44 24 18 c7 44 
> 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c 00 75 04 0f 0b eb fe 
> 8d 85 24 04 00 00
> Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>] 
> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2] SS:ESP 0068:ce5a6d58
>  
>  
> (2)
>  
> Mar 13 17:00:30 cfs1 kernel: Call Trace:
> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] gfs2_unlink+0x53/0xe0 [gfs2]
> Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>] gfs2_unlink+0x3a/0xe0 [gfs2]
> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] gfs2_unlink+0x53/0xe0 [gfs2]
> Mar 13 17:00:30 cfs1 kernel:  [<c0166572>] vfs_unlink+0xa1/0xc5
> Mar 13 17:00:30 cfs1 kernel:  [<c016662b>] do_unlinkat+0x95/0xf5
> Mar 13 17:00:30 cfs1 kernel:  [<c01187de>] scheduler_tick+0x8f/0x95
> Mar 13 17:00:30 cfs1 kernel:  [<c0103410>] sysenter_past_esp+0x5d/0x81
> Mar 13 17:00:30 cfs1 kernel:  =======================
>
>------------------------------------------------------------------------
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster
>



From srramasw at cisco.com  Wed Mar 14 07:15:41 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Wed, 14 Mar 2007 00:15:41 -0700
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <45F78836.20404@redhat.com>
Message-ID: <B14199FA0DBAAF4AA89E83EB41D35435033679D9@xmb-sjc-22c.amer.cisco.com>

Sure Wendy. Here it is,

"fs/gfs2/inode.c" 1256
struct inode *gfs2_ilookup(struct super_block *sb, struct gfs2_inum_host
*inum)
{
        return ilookup5(sb, (unsigned long)inum->no_formal_ino,
                        iget_test, inum);
}

static struct inode *gfs2_iget(struct super_block *sb, struct
gfs2_inum_host *inum)
{
        return iget5_locked(sb, (unsigned long)inum->no_formal_ino,
                     iget_test, iget_set, inum);
}

BTW - this code is from kernel version 2.6.20.1. 

Thanks,
Sridhar

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
> Sent: Tuesday, March 13, 2007 10:29 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
> 
> Sridhar Ramaswamy (srramasw) wrote:
> 
> >
> > Mar 13 15:25:56 cfs1 kernel: ------------[ cut here ]------------
> > Mar 13 15:25:56 cfs1 kernel: kernel BUG at fs/gfs2/meta_io.c:474!
> 
> 
> I don't have time to pull kernel.org source code right now. Could you 
> cut-and-paste the following two routines (from fs/gfs2/inode.c):
> gfs2_ilookup() and gfs2_iget() so we can look into this 
> quickly ? They 
> are all one-liner so shouldn't be too much troubles. GFS2 
> currently has 
> ino issue with its lookup code.
> 
> -- Wendy
> 
> > Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000 [#1]
> > Mar 13 15:25:56 cfs1 kernel: SMP
> > Mar 13 15:25:56 cfs1 kernel: Modules linked in: lock_nolock gfs2 
> > reiserfs nfsd exportfs nfs lockd nfs_acl ipv6 parport_pc lp parport 
> > autofs4 sunrpc dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd 
> > intel_rng rng_core i2c_i801 i2c_core e1000 e100 mii floppy ext3 jbd
> > Mar 13 15:25:56 cfs1 kernel: CPU:    1
> > Mar 13 15:25:56 cfs1 kernel: EIP:    0060:[<e0c88da5>]    
> Not tainted VLI
> > Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246   (2.6.20.1 #1)
> > Mar 13 15:25:56 cfs1 kernel: EIP is at 
> > gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
> > Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx: 00012bf5   ecx: 
> > ce5a6dd4   edx: dc5eae00
> > Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi: 00000000   ebp: 
> > dc5ea9a8   esp: ce5a6d58
> > Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
> > Mar 13 15:25:56 cfs1 kernel: Process bonnie++ (pid: 5509, 
> ti=ce5a6000 
> > task=dc320570 task.ti=ce5a6000)
> > Mar 13 15:25:56 cfs1 kernel: Stack: c156d274 ce5a6dd4 c016eb91 
> > ce5a6dd4 00000000 dc5eae00 00000000 d6f34000
> > Mar 13 15:25:56 cfs1 kernel:        00000000 00000000 dc5ea9a8 
> > d57794a8 ce5a6e08 e0c83ad3 00012bf5 00000000
> > Mar 13 15:25:56 cfs1 kernel:        00000000 ce5a6da0 00000000 
> > 00000000 dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
> > Mar 13 15:25:56 cfs1 kernel: Call Trace:
> > Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>] iget5_locked+0x3d/0x67
> > Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] 
> > gfs2_inode_refresh+0x34/0xfe [gfs2]
> > Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] 
> gfs2_createi+0x12c/0x191 [gfs2]
> > Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>] 
> gfs2_create+0x5c/0x103 [gfs2]
> > Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>] 
> gfs2_createi+0x4a/0x191 [gfs2]
> > Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>] 
> gfs2_glock_nq_num+0x3f/0x64 
> > [gfs2]
> > Mar 13 15:25:56 cfs1 kernel:  [<c016552e>] vfs_create+0xc3/0x126
> > Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>] 
> open_namei_create+0x47/0x88
> > Mar 13 15:25:56 cfs1 kernel:  [<c016597d>] open_namei+0x14a/0x539
> > Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>] do_filp_open+0x25/0x39
> > Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>] 
> strncpy_from_user+0x3c/0x5b
> > Mar 13 15:25:56 cfs1 kernel:  [<c015d431>] get_unused_fd+0xa8/0xb1
> > Mar 13 15:25:56 cfs1 kernel:  [<c015d509>] do_sys_open+0x42/0xbe
> > Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>] sys_open+0x1a/0x1c
> > Mar 13 15:25:56 cfs1 kernel:  [<c015d5dd>] sys_creat+0x1f/0x23
> > Mar 13 15:25:56 cfs1 kernel:  [<c0103410>] 
> sysenter_past_esp+0x5d/0x81
> > Mar 13 15:25:56 cfs1 kernel:  =======================
> > Mar 13 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44 24 
> 1c 8b 85 f0 
> > 01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85 c0 89 44 24 
> 18 c7 44 
> > 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c 00 75 04 
> 0f 0b eb fe 
> > 8d 85 24 04 00 00
> > Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>] 
> > gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2] SS:ESP 0068:ce5a6d58
> >  
> >  
> > (2)
> >  
> > Mar 13 17:00:30 cfs1 kernel: Call Trace:
> > Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] 
> gfs2_unlink+0x53/0xe0 [gfs2]
> > Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>] 
> gfs2_unlink+0x3a/0xe0 [gfs2]
> > Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] 
> gfs2_unlink+0x53/0xe0 [gfs2]
> > Mar 13 17:00:30 cfs1 kernel:  [<c0166572>] vfs_unlink+0xa1/0xc5
> > Mar 13 17:00:30 cfs1 kernel:  [<c016662b>] do_unlinkat+0x95/0xf5
> > Mar 13 17:00:30 cfs1 kernel:  [<c01187de>] scheduler_tick+0x8f/0x95
> > Mar 13 17:00:30 cfs1 kernel:  [<c0103410>] 
> sysenter_past_esp+0x5d/0x81
> > Mar 13 17:00:30 cfs1 kernel:  =======================
> >
> >-------------------------------------------------------------
> -----------
> >
> >--
> >Linux-cluster mailing list
> >Linux-cluster at redhat.com
> >https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From swhiteho at redhat.com  Wed Mar 14 08:27:37 2007
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Wed, 14 Mar 2007 08:27:37 +0000
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <B14199FA0DBAAF4AA89E83EB41D35435033679D9@xmb-sjc-22c.amer.cisco.com>
References: <B14199FA0DBAAF4AA89E83EB41D35435033679D9@xmb-sjc-22c.amer.cisco.com>
Message-ID: <1173860857.32601.80.camel@quoit.chygwyn.com>

Hi,

This looks like Red Hat bugzilla 229831 and if so then there is already
a fix in Linus' upstream kernel and in the latest kernel build for
Fedora (both 5 and 6),

Steve.

On Wed, 2007-03-14 at 00:15 -0700, Sridhar Ramaswamy (srramasw) wrote:
> Sure Wendy. Here it is,
> 
> "fs/gfs2/inode.c" 1256
> struct inode *gfs2_ilookup(struct super_block *sb, struct gfs2_inum_host
> *inum)
> {
>         return ilookup5(sb, (unsigned long)inum->no_formal_ino,
>                         iget_test, inum);
> }
> 
> static struct inode *gfs2_iget(struct super_block *sb, struct
> gfs2_inum_host *inum)
> {
>         return iget5_locked(sb, (unsigned long)inum->no_formal_ino,
>                      iget_test, iget_set, inum);
> }
> 
> BTW - this code is from kernel version 2.6.20.1. 
> 
> Thanks,
> Sridhar
> 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com 
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
> > Sent: Tuesday, March 13, 2007 10:29 PM
> > To: linux clustering
> > Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
> > 
> > Sridhar Ramaswamy (srramasw) wrote:
> > 
> > >
> > > Mar 13 15:25:56 cfs1 kernel: ------------[ cut here ]------------
> > > Mar 13 15:25:56 cfs1 kernel: kernel BUG at fs/gfs2/meta_io.c:474!
> > 
> > 
> > I don't have time to pull kernel.org source code right now. Could you 
> > cut-and-paste the following two routines (from fs/gfs2/inode.c):
> > gfs2_ilookup() and gfs2_iget() so we can look into this 
> > quickly ? They 
> > are all one-liner so shouldn't be too much troubles. GFS2 
> > currently has 
> > ino issue with its lookup code.
> > 
> > -- Wendy
> > 
> > > Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000 [#1]
> > > Mar 13 15:25:56 cfs1 kernel: SMP
> > > Mar 13 15:25:56 cfs1 kernel: Modules linked in: lock_nolock gfs2 
> > > reiserfs nfsd exportfs nfs lockd nfs_acl ipv6 parport_pc lp parport 
> > > autofs4 sunrpc dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd 
> > > intel_rng rng_core i2c_i801 i2c_core e1000 e100 mii floppy ext3 jbd
> > > Mar 13 15:25:56 cfs1 kernel: CPU:    1
> > > Mar 13 15:25:56 cfs1 kernel: EIP:    0060:[<e0c88da5>]    
> > Not tainted VLI
> > > Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246   (2.6.20.1 #1)
> > > Mar 13 15:25:56 cfs1 kernel: EIP is at 
> > > gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
> > > Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx: 00012bf5   ecx: 
> > > ce5a6dd4   edx: dc5eae00
> > > Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi: 00000000   ebp: 
> > > dc5ea9a8   esp: ce5a6d58
> > > Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
> > > Mar 13 15:25:56 cfs1 kernel: Process bonnie++ (pid: 5509, 
> > ti=ce5a6000 
> > > task=dc320570 task.ti=ce5a6000)
> > > Mar 13 15:25:56 cfs1 kernel: Stack: c156d274 ce5a6dd4 c016eb91 
> > > ce5a6dd4 00000000 dc5eae00 00000000 d6f34000
> > > Mar 13 15:25:56 cfs1 kernel:        00000000 00000000 dc5ea9a8 
> > > d57794a8 ce5a6e08 e0c83ad3 00012bf5 00000000
> > > Mar 13 15:25:56 cfs1 kernel:        00000000 ce5a6da0 00000000 
> > > 00000000 dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
> > > Mar 13 15:25:56 cfs1 kernel: Call Trace:
> > > Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>] iget5_locked+0x3d/0x67
> > > Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] 
> > > gfs2_inode_refresh+0x34/0xfe [gfs2]
> > > Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] 
> > gfs2_createi+0x12c/0x191 [gfs2]
> > > Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>] 
> > gfs2_create+0x5c/0x103 [gfs2]
> > > Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>] 
> > gfs2_createi+0x4a/0x191 [gfs2]
> > > Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>] 
> > gfs2_glock_nq_num+0x3f/0x64 
> > > [gfs2]
> > > Mar 13 15:25:56 cfs1 kernel:  [<c016552e>] vfs_create+0xc3/0x126
> > > Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>] 
> > open_namei_create+0x47/0x88
> > > Mar 13 15:25:56 cfs1 kernel:  [<c016597d>] open_namei+0x14a/0x539
> > > Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>] do_filp_open+0x25/0x39
> > > Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>] 
> > strncpy_from_user+0x3c/0x5b
> > > Mar 13 15:25:56 cfs1 kernel:  [<c015d431>] get_unused_fd+0xa8/0xb1
> > > Mar 13 15:25:56 cfs1 kernel:  [<c015d509>] do_sys_open+0x42/0xbe
> > > Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>] sys_open+0x1a/0x1c
> > > Mar 13 15:25:56 cfs1 kernel:  [<c015d5dd>] sys_creat+0x1f/0x23
> > > Mar 13 15:25:56 cfs1 kernel:  [<c0103410>] 
> > sysenter_past_esp+0x5d/0x81
> > > Mar 13 15:25:56 cfs1 kernel:  =======================
> > > Mar 13 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44 24 
> > 1c 8b 85 f0 
> > > 01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85 c0 89 44 24 
> > 18 c7 44 
> > > 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c 00 75 04 
> > 0f 0b eb fe 
> > > 8d 85 24 04 00 00
> > > Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>] 
> > > gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2] SS:ESP 0068:ce5a6d58
> > >  
> > >  
> > > (2)
> > >  
> > > Mar 13 17:00:30 cfs1 kernel: Call Trace:
> > > Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] 
> > gfs2_unlink+0x53/0xe0 [gfs2]
> > > Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>] 
> > gfs2_unlink+0x3a/0xe0 [gfs2]
> > > Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] 
> > gfs2_unlink+0x53/0xe0 [gfs2]
> > > Mar 13 17:00:30 cfs1 kernel:  [<c0166572>] vfs_unlink+0xa1/0xc5
> > > Mar 13 17:00:30 cfs1 kernel:  [<c016662b>] do_unlinkat+0x95/0xf5
> > > Mar 13 17:00:30 cfs1 kernel:  [<c01187de>] scheduler_tick+0x8f/0x95
> > > Mar 13 17:00:30 cfs1 kernel:  [<c0103410>] 
> > sysenter_past_esp+0x5d/0x81
> > > Mar 13 17:00:30 cfs1 kernel:  =======================
> > >
> > >-------------------------------------------------------------
> > -----------
> > >
> > >--
> > >Linux-cluster mailing list
> > >Linux-cluster at redhat.com
> > >https://www.redhat.com/mailman/listinfo/linux-cluster
> > >
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From wcheng at redhat.com  Wed Mar 14 13:32:55 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Wed, 14 Mar 2007 09:32:55 -0400
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <1173860857.32601.80.camel@quoit.chygwyn.com>
References: <B14199FA0DBAAF4AA89E83EB41D35435033679D9@xmb-sjc-22c.amer.cisco.com>
	<1173860857.32601.80.camel@quoit.chygwyn.com>
Message-ID: <45F7F987.408@redhat.com>

Steven Whitehouse wrote:
> Hi,
>
> This looks like Red Hat bugzilla 229831 and if so then there is already
> a fix in Linus' upstream kernel and in the latest kernel build for
> Fedora (both 5 and 6),
>
>   
yeah... agree. I missed the unlink part of the trace in previous post 
and thought it might have something to do with my ino number hacking in 
the lookup code.

-- Wendy
> Steve.
>
> On Wed, 2007-03-14 at 00:15 -0700, Sridhar Ramaswamy (srramasw) wrote:
>   
>> Sure Wendy. Here it is,
>>
>> "fs/gfs2/inode.c" 1256
>> struct inode *gfs2_ilookup(struct super_block *sb, struct gfs2_inum_host
>> *inum)
>> {
>>         return ilookup5(sb, (unsigned long)inum->no_formal_ino,
>>                         iget_test, inum);
>> }
>>
>> static struct inode *gfs2_iget(struct super_block *sb, struct
>> gfs2_inum_host *inum)
>> {
>>         return iget5_locked(sb, (unsigned long)inum->no_formal_ino,
>>                      iget_test, iget_set, inum);
>> }
>>
>> BTW - this code is from kernel version 2.6.20.1. 
>>
>> Thanks,
>> Sridhar
>>
>>     
>>> -----Original Message-----
>>> From: linux-cluster-bounces at redhat.com 
>>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
>>> Sent: Tuesday, March 13, 2007 10:29 PM
>>> To: linux clustering
>>> Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
>>>
>>> Sridhar Ramaswamy (srramasw) wrote:
>>>
>>>       
>>>> Mar 13 15:25:56 cfs1 kernel: ------------[ cut here ]------------
>>>> Mar 13 15:25:56 cfs1 kernel: kernel BUG at fs/gfs2/meta_io.c:474!
>>>>         
>>> I don't have time to pull kernel.org source code right now. Could you 
>>> cut-and-paste the following two routines (from fs/gfs2/inode.c):
>>> gfs2_ilookup() and gfs2_iget() so we can look into this 
>>> quickly ? They 
>>> are all one-liner so shouldn't be too much troubles. GFS2 
>>> currently has 
>>> ino issue with its lookup code.
>>>
>>> -- Wendy
>>>
>>>       
>>>> Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000 [#1]
>>>> Mar 13 15:25:56 cfs1 kernel: SMP
>>>> Mar 13 15:25:56 cfs1 kernel: Modules linked in: lock_nolock gfs2 
>>>> reiserfs nfsd exportfs nfs lockd nfs_acl ipv6 parport_pc lp parport 
>>>> autofs4 sunrpc dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd 
>>>> intel_rng rng_core i2c_i801 i2c_core e1000 e100 mii floppy ext3 jbd
>>>> Mar 13 15:25:56 cfs1 kernel: CPU:    1
>>>> Mar 13 15:25:56 cfs1 kernel: EIP:    0060:[<e0c88da5>]    
>>>>         
>>> Not tainted VLI
>>>       
>>>> Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246   (2.6.20.1 #1)
>>>> Mar 13 15:25:56 cfs1 kernel: EIP is at 
>>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
>>>> Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx: 00012bf5   ecx: 
>>>> ce5a6dd4   edx: dc5eae00
>>>> Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi: 00000000   ebp: 
>>>> dc5ea9a8   esp: ce5a6d58
>>>> Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
>>>> Mar 13 15:25:56 cfs1 kernel: Process bonnie++ (pid: 5509, 
>>>>         
>>> ti=ce5a6000 
>>>       
>>>> task=dc320570 task.ti=ce5a6000)
>>>> Mar 13 15:25:56 cfs1 kernel: Stack: c156d274 ce5a6dd4 c016eb91 
>>>> ce5a6dd4 00000000 dc5eae00 00000000 d6f34000
>>>> Mar 13 15:25:56 cfs1 kernel:        00000000 00000000 dc5ea9a8 
>>>> d57794a8 ce5a6e08 e0c83ad3 00012bf5 00000000
>>>> Mar 13 15:25:56 cfs1 kernel:        00000000 ce5a6da0 00000000 
>>>> 00000000 dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
>>>> Mar 13 15:25:56 cfs1 kernel: Call Trace:
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>] iget5_locked+0x3d/0x67
>>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] 
>>>> gfs2_inode_refresh+0x34/0xfe [gfs2]
>>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] 
>>>>         
>>> gfs2_createi+0x12c/0x191 [gfs2]
>>>       
>>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>] 
>>>>         
>>> gfs2_create+0x5c/0x103 [gfs2]
>>>       
>>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>] 
>>>>         
>>> gfs2_createi+0x4a/0x191 [gfs2]
>>>       
>>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>] 
>>>>         
>>> gfs2_glock_nq_num+0x3f/0x64 
>>>       
>>>> [gfs2]
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c016552e>] vfs_create+0xc3/0x126
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>] 
>>>>         
>>> open_namei_create+0x47/0x88
>>>       
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c016597d>] open_namei+0x14a/0x539
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>] do_filp_open+0x25/0x39
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>] 
>>>>         
>>> strncpy_from_user+0x3c/0x5b
>>>       
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d431>] get_unused_fd+0xa8/0xb1
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d509>] do_sys_open+0x42/0xbe
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>] sys_open+0x1a/0x1c
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d5dd>] sys_creat+0x1f/0x23
>>>> Mar 13 15:25:56 cfs1 kernel:  [<c0103410>] 
>>>>         
>>> sysenter_past_esp+0x5d/0x81
>>>       
>>>> Mar 13 15:25:56 cfs1 kernel:  =======================
>>>> Mar 13 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44 24 
>>>>         
>>> 1c 8b 85 f0 
>>>       
>>>> 01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85 c0 89 44 24 
>>>>         
>>> 18 c7 44 
>>>       
>>>> 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c 00 75 04 
>>>>         
>>> 0f 0b eb fe 
>>>       
>>>> 8d 85 24 04 00 00
>>>> Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>] 
>>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2] SS:ESP 0068:ce5a6d58
>>>>  
>>>>  
>>>> (2)
>>>>  
>>>> Mar 13 17:00:30 cfs1 kernel: Call Trace:
>>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] 
>>>>         
>>> gfs2_unlink+0x53/0xe0 [gfs2]
>>>       
>>>> Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>] 
>>>>         
>>> gfs2_unlink+0x3a/0xe0 [gfs2]
>>>       
>>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] 
>>>>         
>>> gfs2_unlink+0x53/0xe0 [gfs2]
>>>       
>>>> Mar 13 17:00:30 cfs1 kernel:  [<c0166572>] vfs_unlink+0xa1/0xc5
>>>> Mar 13 17:00:30 cfs1 kernel:  [<c016662b>] do_unlinkat+0x95/0xf5
>>>> Mar 13 17:00:30 cfs1 kernel:  [<c01187de>] scheduler_tick+0x8f/0x95
>>>> Mar 13 17:00:30 cfs1 kernel:  [<c0103410>] 
>>>>         
>>> sysenter_past_esp+0x5d/0x81
>>>       
>>>> Mar 13 17:00:30 cfs1 kernel:  =======================
>>>>
>>>> -------------------------------------------------------------
>>>>         
>>> -----------
>>>       
>>>> --
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>>         
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>>       
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>     
>
>   



From holger.ratzel at she.net  Wed Mar 14 14:55:32 2007
From: holger.ratzel at she.net (Holger L. Ratzel)
Date: Wed, 14 Mar 2007 15:55:32 +0100
Subject: [Linux-cluster] Problem with shared SCSI-Storage
Message-ID: <200703141555.32868.holger.ratzel@she.net>

Hi All,

I've a problem with a RHEL4 cluster running on two Dell PowerEdge 1850 and an 
Dell PowerVault PV220S. After running for a while I get errors from the 
megaraid driver in /var/log/messages saying that the device was "offlined" 
and the service gets relocated to the other node while the original node 
reboots. The relevant parts from /var/log/messages can be downloaded from 
http://www.she.net/hra/web01-messages.txt

Dell has checked the system and all firmwares have been updated, so it seems 
not to be a hardware problem.

Please mail me if you have any questions or need further information.

I would appreciate any suggestions.

Many thanks an best regards,

	Holger Ratzel

-- 
----------------- SHE - IT-Sicherheit von Experten ------------------
SHE Informationstechnologie AG
Holger L. Ratzel                               Fon:+49 621 5200 - 210 
Service Delivery & Support                     Fax:+49 621 5200 - 555
Donnersbergweg 3                                holger.ratzel at she.net
D-67059 Ludwigshafen                              http://www.she.net/
Sitz der Gesellschaft und Registergericht Ludwigshafen HRB 4593
Aufsichtsratsvorsitzender: Ulrich Engelhardt
Vorstand: Klaus Schulz
-------------- Drink wet cement and get really stoned! --------------

PGP-Fingerprint:
9A 73 40 22 72 64 BE D1  D8 1A 54 3C 5B 64 AF C3  CC E3 CA A8
Get my PGP public key at: http://pgp.she.net/



From lhh at redhat.com  Wed Mar 14 18:13:00 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 14 Mar 2007 14:13:00 -0400
Subject: [Linux-cluster] Problem with shared SCSI-Storage
In-Reply-To: <200703141555.32868.holger.ratzel@she.net>
References: <200703141555.32868.holger.ratzel@she.net>
Message-ID: <1173895980.4557.90.camel@asuka.boston.devel.redhat.com>

On Wed, 2007-03-14 at 15:55 +0100, Holger L. Ratzel wrote:
> Hi All,
> 
> I've a problem with a RHEL4 cluster running on two Dell PowerEdge 1850 and an 
> Dell PowerVault PV220S. After running for a while I get errors from the 
> megaraid driver in /var/log/messages saying that the device was "offlined" 
> and the service gets relocated to the other node while the original node 
> reboots. The relevant parts from /var/log/messages can be downloaded from 
> http://www.she.net/hra/web01-messages.txt
> 
> Dell has checked the system and all firmwares have been updated, so it seems 
> not to be a hardware problem.

Sounds like this might help:

http://marc.theaimsgroup.com/?t=117082438600008&r=1&w=2

-- Lon




From srramasw at cisco.com  Wed Mar 14 18:20:25 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Wed, 14 Mar 2007 11:20:25 -0700
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <45F7F987.408@redhat.com>
Message-ID: <B14199FA0DBAAF4AA89E83EB41D3543503367BC5@xmb-sjc-22c.amer.cisco.com>

Hi folks,

This bugzilla 229831 seems "restricted" :(  Can someone please forward
its details? I'm interested in understanding the actual problem and its
proposed fix.

Meanwhile I'll try an upstream kernel. Should 2.6.21.rc1 be fine?

Thanks,
Sridhar

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
> Sent: Wednesday, March 14, 2007 6:33 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
> 
> Steven Whitehouse wrote:
> > Hi,
> >
> > This looks like Red Hat bugzilla 229831 and if so then 
> there is already
> > a fix in Linus' upstream kernel and in the latest kernel build for
> > Fedora (both 5 and 6),
> >
> >   
> yeah... agree. I missed the unlink part of the trace in previous post 
> and thought it might have something to do with my ino number 
> hacking in 
> the lookup code.
> 
> -- Wendy
> > Steve.
> >
> > On Wed, 2007-03-14 at 00:15 -0700, Sridhar Ramaswamy 
> (srramasw) wrote:
> >   
> >> Sure Wendy. Here it is,
> >>
> >> "fs/gfs2/inode.c" 1256
> >> struct inode *gfs2_ilookup(struct super_block *sb, struct 
> gfs2_inum_host
> >> *inum)
> >> {
> >>         return ilookup5(sb, (unsigned long)inum->no_formal_ino,
> >>                         iget_test, inum);
> >> }
> >>
> >> static struct inode *gfs2_iget(struct super_block *sb, struct
> >> gfs2_inum_host *inum)
> >> {
> >>         return iget5_locked(sb, (unsigned long)inum->no_formal_ino,
> >>                      iget_test, iget_set, inum);
> >> }
> >>
> >> BTW - this code is from kernel version 2.6.20.1. 
> >>
> >> Thanks,
> >> Sridhar
> >>
> >>     
> >>> -----Original Message-----
> >>> From: linux-cluster-bounces at redhat.com 
> >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
> >>> Sent: Tuesday, March 13, 2007 10:29 PM
> >>> To: linux clustering
> >>> Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
> >>>
> >>> Sridhar Ramaswamy (srramasw) wrote:
> >>>
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel: ------------[ cut here ]------------
> >>>> Mar 13 15:25:56 cfs1 kernel: kernel BUG at fs/gfs2/meta_io.c:474!
> >>>>         
> >>> I don't have time to pull kernel.org source code right 
> now. Could you 
> >>> cut-and-paste the following two routines (from fs/gfs2/inode.c):
> >>> gfs2_ilookup() and gfs2_iget() so we can look into this 
> >>> quickly ? They 
> >>> are all one-liner so shouldn't be too much troubles. GFS2 
> >>> currently has 
> >>> ino issue with its lookup code.
> >>>
> >>> -- Wendy
> >>>
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000 [#1]
> >>>> Mar 13 15:25:56 cfs1 kernel: SMP
> >>>> Mar 13 15:25:56 cfs1 kernel: Modules linked in: lock_nolock gfs2 
> >>>> reiserfs nfsd exportfs nfs lockd nfs_acl ipv6 parport_pc 
> lp parport 
> >>>> autofs4 sunrpc dm_mirror dm_mod button battery ac 
> uhci_hcd ehci_hcd 
> >>>> intel_rng rng_core i2c_i801 i2c_core e1000 e100 mii 
> floppy ext3 jbd
> >>>> Mar 13 15:25:56 cfs1 kernel: CPU:    1
> >>>> Mar 13 15:25:56 cfs1 kernel: EIP:    0060:[<e0c88da5>]    
> >>>>         
> >>> Not tainted VLI
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246   (2.6.20.1 #1)
> >>>> Mar 13 15:25:56 cfs1 kernel: EIP is at 
> >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
> >>>> Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx: 
> 00012bf5   ecx: 
> >>>> ce5a6dd4   edx: dc5eae00
> >>>> Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi: 
> 00000000   ebp: 
> >>>> dc5ea9a8   esp: ce5a6d58
> >>>> Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
> >>>> Mar 13 15:25:56 cfs1 kernel: Process bonnie++ (pid: 5509, 
> >>>>         
> >>> ti=ce5a6000 
> >>>       
> >>>> task=dc320570 task.ti=ce5a6000)
> >>>> Mar 13 15:25:56 cfs1 kernel: Stack: c156d274 ce5a6dd4 c016eb91 
> >>>> ce5a6dd4 00000000 dc5eae00 00000000 d6f34000
> >>>> Mar 13 15:25:56 cfs1 kernel:        00000000 00000000 dc5ea9a8 
> >>>> d57794a8 ce5a6e08 e0c83ad3 00012bf5 00000000
> >>>> Mar 13 15:25:56 cfs1 kernel:        00000000 ce5a6da0 00000000 
> >>>> 00000000 dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
> >>>> Mar 13 15:25:56 cfs1 kernel: Call Trace:
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>] iget5_locked+0x3d/0x67
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] 
> >>>> gfs2_inode_refresh+0x34/0xfe [gfs2]
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] 
> >>>>         
> >>> gfs2_createi+0x12c/0x191 [gfs2]
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>] 
> >>>>         
> >>> gfs2_create+0x5c/0x103 [gfs2]
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>] 
> >>>>         
> >>> gfs2_createi+0x4a/0x191 [gfs2]
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>] 
> >>>>         
> >>> gfs2_glock_nq_num+0x3f/0x64 
> >>>       
> >>>> [gfs2]
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016552e>] vfs_create+0xc3/0x126
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>] 
> >>>>         
> >>> open_namei_create+0x47/0x88
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016597d>] open_namei+0x14a/0x539
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>] do_filp_open+0x25/0x39
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>] 
> >>>>         
> >>> strncpy_from_user+0x3c/0x5b
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d431>] 
> get_unused_fd+0xa8/0xb1
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d509>] do_sys_open+0x42/0xbe
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>] sys_open+0x1a/0x1c
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d5dd>] sys_creat+0x1f/0x23
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c0103410>] 
> >>>>         
> >>> sysenter_past_esp+0x5d/0x81
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  =======================
> >>>> Mar 13 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44 24 
> >>>>         
> >>> 1c 8b 85 f0 
> >>>       
> >>>> 01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85 c0 89 44 24 
> >>>>         
> >>> 18 c7 44 
> >>>       
> >>>> 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c 00 75 04 
> >>>>         
> >>> 0f 0b eb fe 
> >>>       
> >>>> 8d 85 24 04 00 00
> >>>> Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>] 
> >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2] SS:ESP 0068:ce5a6d58
> >>>>  
> >>>>  
> >>>> (2)
> >>>>  
> >>>> Mar 13 17:00:30 cfs1 kernel: Call Trace:
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] 
> >>>>         
> >>> gfs2_unlink+0x53/0xe0 [gfs2]
> >>>       
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>] 
> >>>>         
> >>> gfs2_unlink+0x3a/0xe0 [gfs2]
> >>>       
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>] 
> >>>>         
> >>> gfs2_unlink+0x53/0xe0 [gfs2]
> >>>       
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0166572>] vfs_unlink+0xa1/0xc5
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c016662b>] do_unlinkat+0x95/0xf5
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c01187de>] 
> scheduler_tick+0x8f/0x95
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0103410>] 
> >>>>         
> >>> sysenter_past_esp+0x5d/0x81
> >>>       
> >>>> Mar 13 17:00:30 cfs1 kernel:  =======================
> >>>>
> >>>> -------------------------------------------------------------
> >>>>         
> >>> -----------
> >>>       
> >>>> --
> >>>> Linux-cluster mailing list
> >>>> Linux-cluster at redhat.com
> >>>> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>>>
> >>>>         
> >>> --
> >>> Linux-cluster mailing list
> >>> Linux-cluster at redhat.com
> >>> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>>
> >>>       
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>     
> >
> >   
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From isplist at logicore.net  Thu Mar 15 18:13:14 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Thu, 15 Mar 2007 12:13:14 -0600
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
Message-ID: <2007315121314.653119@leena>

Is there anyone out there who might have a custom anaconda version which could 
see at least 32 LUNS or more?

I have yet to find a way of installing blades onto 32 volumes which are on a 
fibre channel storage device. I must find a solution and was now told that I 
might be able to modify anaconda. Not knowing how to do this, I wonder if 
someone might already have a modified version?

Mike

This was my old message, things I've done/tried which have not worked.

//
I've tried everything I can find and think of or that has been suggested.

I have a Xyratex/MTI type chassis split into 32 volumes. I need to install 32
blades onto each volume so that I can remove the drives on each blade.

When I start a Linux install on a blade, it never sees all of the volumes,
only LUNS 0/1. I need to see all 32 LUNS so that I can install all of my
servers.

I've tried all of the following;

RHEL4, CentOS4.4, others. I don't really care what the distro is, so long as
it runs basic services such as web/php, qmail, etc. What I do need however is
that they be GFS/Cluster machines.

I've tried passing the information at the installers command line;

scsi_mod.max_luns=256
scsi_mod.scsi_dev_flags=INLINE:TF200:0x242

I've tried many variations of these types of commands with no result.

I've then set up a PXE boot server thinking that I might be able to pass the
options using pxelinux.cfg. Still no luck, it only sees two LUNS.

I've tried installing from network with a recompilled initrd.img from a
machine which was already installed and modprobe.conf modified to see the
LUNS. The donor server can see all of the volumes, the installing version dies
with a kernel problem since it cannot see the same already installed volume.

What in the world can I do? Is there a guru here who can tell me how I can do
this?
//





From mwill at penguincomputing.com  Thu Mar 15 17:41:41 2007
From: mwill at penguincomputing.com (Michael Will)
Date: Thu, 15 Mar 2007 10:41:41 -0700
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
In-Reply-To: <2007315121314.653119@leena>
Message-ID: <433093DF7AD7444DA65EFAFE3987879C3BA0AA@orca.penguincomputing.com>

How does a blade know to boot from a specific lun?

Can you do a LUN-SAN mapping that would make a lun available to only one
specific blade each
under the correct name?


Michael
-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of
isplist at logicore.net
Sent: Thursday, March 15, 2007 11:13 AM
To: linux-cluster; anaconda-devel-list at redhat.com;
redhat-install-list at redhat.com
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?

Is there anyone out there who might have a custom anaconda version which
could see at least 32 LUNS or more?

I have yet to find a way of installing blades onto 32 volumes which are
on a fibre channel storage device. I must find a solution and was now
told that I might be able to modify anaconda. Not knowing how to do
this, I wonder if someone might already have a modified version?

Mike

This was my old message, things I've done/tried which have not worked.

//
I've tried everything I can find and think of or that has been
suggested.

I have a Xyratex/MTI type chassis split into 32 volumes. I need to
install 32 blades onto each volume so that I can remove the drives on
each blade.

When I start a Linux install on a blade, it never sees all of the
volumes, only LUNS 0/1. I need to see all 32 LUNS so that I can install
all of my servers.

I've tried all of the following;

RHEL4, CentOS4.4, others. I don't really care what the distro is, so
long as it runs basic services such as web/php, qmail, etc. What I do
need however is that they be GFS/Cluster machines.

I've tried passing the information at the installers command line;

scsi_mod.max_luns=256
scsi_mod.scsi_dev_flags=INLINE:TF200:0x242

I've tried many variations of these types of commands with no result.

I've then set up a PXE boot server thinking that I might be able to pass
the options using pxelinux.cfg. Still no luck, it only sees two LUNS.

I've tried installing from network with a recompilled initrd.img from a
machine which was already installed and modprobe.conf modified to see
the LUNS. The donor server can see all of the volumes, the installing
version dies with a kernel problem since it cannot see the same already
installed volume.

What in the world can I do? Is there a guru here who can tell me how I
can do this?
//



--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From Derek.Anderson at compellent.com  Thu Mar 15 18:27:32 2007
From: Derek.Anderson at compellent.com (Derek Anderson)
Date: Thu, 15 Mar 2007 13:27:32 -0500
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
In-Reply-To: <433093DF7AD7444DA65EFAFE3987879C3BA0AA@orca.penguincomputing.com>
References: <2007315121314.653119@leena>
	<433093DF7AD7444DA65EFAFE3987879C3BA0AA@orca.penguincomputing.com>
Message-ID: <99E0F1976E2DA2499F3E6EB18B25F036EEC7C1@honeywheat.Beer.Town>

Exactly.  Mike seemed to be on to this path in February:
https://www.redhat.com/archives/linux-cluster/2007-February/msg00178.htm
l

The Xyratex StorView software should be able to help you LUN Mask/Map so
that each blade sees only its own LUNs.  As long as each of your blades
have unique WWNs.  ?

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Will
> Sent: Thursday, March 15, 2007 12:42 PM
> To: isplist at logicore.net; linux clustering; 
> anaconda-devel-list at redhat.com; redhat-install-list at redhat.com
> Subject: RE: [Linux-cluster] Custom Anaconda to see 32 LUNS?
> 
> How does a blade know to boot from a specific lun?
> 
> Can you do a LUN-SAN mapping that would make a lun available 
> to only one specific blade each under the correct name?
> 
> 
> Michael
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> isplist at logicore.net
> Sent: Thursday, March 15, 2007 11:13 AM
> To: linux-cluster; anaconda-devel-list at redhat.com; 
> redhat-install-list at redhat.com
> Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
> 
> Is there anyone out there who might have a custom anaconda 
> version which could see at least 32 LUNS or more?
> 
> I have yet to find a way of installing blades onto 32 volumes 
> which are on a fibre channel storage device. I must find a 
> solution and was now told that I might be able to modify 
> anaconda. Not knowing how to do this, I wonder if someone 
> might already have a modified version?
> 
> Mike
> 
> This was my old message, things I've done/tried which have not worked.
> 
> //
> I've tried everything I can find and think of or that has 
> been suggested.
> 
> I have a Xyratex/MTI type chassis split into 32 volumes. I 
> need to install 32 blades onto each volume so that I can 
> remove the drives on each blade.
> 
> When I start a Linux install on a blade, it never sees all of 
> the volumes, only LUNS 0/1. I need to see all 32 LUNS so that 
> I can install all of my servers.
> 
> I've tried all of the following;
> 
> RHEL4, CentOS4.4, others. I don't really care what the distro 
> is, so long as it runs basic services such as web/php, qmail, 
> etc. What I do need however is that they be GFS/Cluster machines.
> 
> I've tried passing the information at the installers command line;
> 
> scsi_mod.max_luns=256
> scsi_mod.scsi_dev_flags=INLINE:TF200:0x242
> 
> I've tried many variations of these types of commands with no result.
> 
> I've then set up a PXE boot server thinking that I might be 
> able to pass the options using pxelinux.cfg. Still no luck, 
> it only sees two LUNS.
> 
> I've tried installing from network with a recompilled 
> initrd.img from a machine which was already installed and 
> modprobe.conf modified to see the LUNS. The donor server can 
> see all of the volumes, the installing version dies with a 
> kernel problem since it cannot see the same already installed volume.
> 
> What in the world can I do? Is there a guru here who can tell 
> me how I can do this?
> //
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From isplist at logicore.net  Thu Mar 15 19:34:43 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Thu, 15 Mar 2007 13:34:43 -0600
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
In-Reply-To: <99E0F1976E2DA2499F3E6EB18B25F036EEC7C1@honeywheat.Beer.Town>
Message-ID: <2007315133443.075288@leena>

> Exactly.  Mike seemed to be on to this path in February:
> https://www.redhat.com/archives/linux-cluster/2007-February/msg00178.htm

I received an error at this link.

> The Xyratex StorView software should be able to help you LUN Mask/Map so
> that each blade sees only its own LUNs.  As long as each of your blades
> have unique WWNs.  ?

Yes, each blade does have a unique WWN. Problem is, I've not found any way of 
entering that into the Brocade or the INLINE TF200 storage unit. It's of 
similar style to the Xyratex. 

Mike





From Derek.Anderson at compellent.com  Thu Mar 15 18:57:13 2007
From: Derek.Anderson at compellent.com (Derek Anderson)
Date: Thu, 15 Mar 2007 13:57:13 -0500
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
In-Reply-To: <2007315133443.075288@leena>
References: <99E0F1976E2DA2499F3E6EB18B25F036EEC7C1@honeywheat.Beer.Town>
	<2007315133443.075288@leena>
Message-ID: <99E0F1976E2DA2499F3E6EB18B25F036EEC7C2@honeywheat.Beer.Town>

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> isplist at logicore.net
> Sent: Thursday, March 15, 2007 2:35 PM
> To: linux clustering
> Subject: RE: [Linux-cluster] Custom Anaconda to see 32 LUNS?
> 
> > Exactly.  Mike seemed to be on to this path in February:
> > 
> https://www.redhat.com/archives/linux-cluster/2007-February/msg00178.h
> > tm
> 
> I received an error at this link.

Sorry, bad paste.  ".html"

> 
> > The Xyratex StorView software should be able to help you 
> LUN Mask/Map 
> > so that each blade sees only its own LUNs.  As long as each of your 
> > blades have unique WWNs.  ?
> 
> Yes, each blade does have a unique WWN. Problem is, I've not 
> found any way of entering that into the Brocade or the INLINE 
> TF200 storage unit. It's of similar style to the Xyratex. 

I haven't used Inline enclosures, but certainly seems like the TF200 has
this capability.  See the WWN Privileged Access section of this
document:
http://www.storageheaven.com/downloads/data_security.pdf

And a link to info about an Inline storage management app which may help
you configure it.
http://www.inlinecorp.com/products/products_MorStorView.htm

Hope this helps.
- Derek

> 
> Mike
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From lhh at redhat.com  Thu Mar 15 20:00:04 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 15 Mar 2007 16:00:04 -0400
Subject: [Linux-cluster] [PATCH] WTI RSM serial passthrough for fence_wti
Message-ID: <1173988804.13796.62.camel@asuka.boston.devel.redhat.com>

Problem statement:

Fencing fails if WTI device loses network connectivity.  It sure would
be Nice(tm) to have a backup method to access the fence device.

Solution:

With this patch, you can use your WTI RSM serial port server as a backup
method to access your WTI IPS/NPS/NBB/TPS/. remote power controller in
the event that the IPS/*/etc. loses its network connectivity (I'll just
say "IPS" from now on to mean "all of 'em").

IMPORTANT: If you want it to work with your particular serial port
server (that is not a WTI RSM), I welcome your patch or your
hardware. ;)

This patch was tested with an IPS-800 as the back-end and a RSM-8 as the
front-end.  It *should* work with any RSM and any supported WTI remote
power switch (e.g. NPS, IPS, NBB, TPS series).

Configuration notes:

* You must enable "Direct Connect" access for the port connected to the
IPS. (/P [number], option 31).  It should be set to "On - Password".

* The perl Net::Telnet module does not work with the RSM's standard
telnet port; I don't know why (nor do I care to figure it out :) ).
However, because of this, you must enable "Raw Socket Access" (/N,
option 31); using the standard direct access telnet port will not work.

* You may use a script to retrieve the password (use rsm_passwd_script
option instead of rsm_passwd).

* You must add a new fence device (manually) to the cluster.conf
(example below).  There is no UI support for this.

...
                <clusternode name="green" nodeid="2" votes="1">
                        <multicast addr="225.0.0.12" interface="eth0"/>
                        <fence>
                                <method name="1">
                                        <device name="ips-rack9"
                                                port="2"/>
                                </method>
                                <method name="2">
                                        <device name="ips-rack9-backup"
                                                port="2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
               <fencedevice agent="fence_wti" 
                            name="ips-rack9"
                            passwd="wti"
                            ipaddr="ips-rack9"/>
               <fencedevice agent="fence_wti"
                            name="ips-rack9-backup"
                            passwd="wti"
                            ipaddr="rsm-rack9"
                            rsm_enable="1"
                            rsm_login="super"
                            rsm_passwd="super"
                            tcpport="3108"/>
        </fencedevices>
...

In my case, I plugged the IPS into port 8 on the RSM - so the raw port
becomes '3108'.  This reduces the complexity of the patch (when compared
to parsing RSM output).  Note that the passwd, agent, and port (plug
number) stay the same, but the host we're talking to changes, and a
bunch of additional stuff is added to talk to the RSM prior to even
getting to the IPS.


How to test:

* Apply patch to CVS/head; rebuild / install fence_wti.

* Configure your cluster similar to above (don't forget to run ccs_tool
update)

* Pull the network jack from your WTI power switch.

* Run fence_node <nodename>.  The first fence level will fail, but the
second one should succeed.

Other information:

* This patch has been tested with the same RSM with direct access for
the port set to 'no password' as well.  In this case, the relevant extra
fence device would more or less look like the following:

...
               <fencedevice agent="fence_wti"
                            name="ips-rack9-backup"
                            passwd="wti"
                            ipaddr="rsm-rack9"
                            serial="1"
                            tcpport="3108"/>
...

I do not recommend this configuration because it is possible to log out
of the RSM without being logged out of the IPS.  The effect is that
someone might be able to reconnect to the existing IPS session without
being prompted for any password (that's bad!).  The upshot is that raw
serial mode *MIGHT* work with other serial port server appliances which
support raw / unpassworded / direct telnet access.


FAQ:

Q: What units was this patch developed on?
A: WTI RSM-8 & WTI IPS-800

Q: What versions of Linux-cluster will this work with?
A: Currently, just CVS - this is a new patch.

Q: The RSM supports SSH - why is there no SSH support?
A: Because it would increase the complexity of the fence_wti agent
significantly.  Patches accepted.

Q: Why not just have two hosts (rsm_host, for example) and do the
retrying internally?
A: fenced already retries and already has provisions for doing this;
implementing it within fence agents is redundant (not to mention, it
adds needless complexity).

Q: Why did you choose the WTI RSM?
A: First, because people like solutions from as few vendors as possible
- so, the RSM + IPS was a natural fit in that regard.  Second, because
it's what I have on-hand.

Q: Why don't you support the [insert your favorite serial server here]?
A: Because I don't have one and because it adds even more complexity to
the agent.  Taking patches (should amount to changing the login/password
expect strings in the rsm-login section...).

-- Lon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fence-wti-rsm-serial-passthrough.patch
Type: text/x-patch
Size: 4968 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070315/62af5335/attachment.bin>

From lhh at redhat.com  Thu Mar 15 20:15:13 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 15 Mar 2007 16:15:13 -0400
Subject: [Linux-cluster] Has anyone succeeded in using WTI RPS-10?
In-Reply-To: <200701222205.43121.fajarpri@cbn.net.id>
References: <200701221110.39958.fajarpri@cbn.net.id>
	<200701222205.43121.fajarpri@cbn.net.id>
Message-ID: <1173989713.13796.76.camel@asuka.boston.devel.redhat.com>

On Mon, 2007-01-22 at 22:05 +0700, Fajar Priyanto wrote:
> On Monday 22 January 2007 11:10, Fajar Priyanto wrote:
> > Hi all,
> > We are setting up 2-note cluster with RHEL4U3 with RHCS-1.0.25 using WTI
> > Remote Power Switch as fence device. We are aware that the device is rather
> > obsolete, but in the mean time that's all we have.

It only works with RPS-10M; I don't think it works with the RPS-10S
units.

-- Lon



From lhh at redhat.com  Thu Mar 15 20:15:59 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 15 Mar 2007 16:15:59 -0400
Subject: [Linux-cluster] Has anyone succeeded in using WTI RPS-10?
In-Reply-To: <1173989713.13796.76.camel@asuka.boston.devel.redhat.com>
References: <200701221110.39958.fajarpri@cbn.net.id>
	<200701222205.43121.fajarpri@cbn.net.id>
	<1173989713.13796.76.camel@asuka.boston.devel.redhat.com>
Message-ID: <1173989759.13796.78.camel@asuka.boston.devel.redhat.com>

On Thu, 2007-03-15 at 16:15 -0400, Lon Hohberger wrote:
> On Mon, 2007-01-22 at 22:05 +0700, Fajar Priyanto wrote:

Oh - and - sorry for the really, really late response.

-- Lon



From fajarpri at cbn.net.id  Fri Mar 16 12:10:12 2007
From: fajarpri at cbn.net.id (Fajar Priyanto)
Date: Fri, 16 Mar 2007 19:10:12 +0700
Subject: [Linux-cluster] Has anyone succeeded in using WTI RPS-10?
In-Reply-To: <1173989759.13796.78.camel@asuka.boston.devel.redhat.com>
References: <200701221110.39958.fajarpri@cbn.net.id>
	<1173989713.13796.76.camel@asuka.boston.devel.redhat.com>
	<1173989759.13796.78.camel@asuka.boston.devel.redhat.com>
Message-ID: <200703161910.13108.fajarpri@cbn.net.id>

On Friday 16 March 2007 03:15, Lon Hohberger wrote:
> On Thu, 2007-03-15 at 16:15 -0400, Lon Hohberger wrote:
> > On Mon, 2007-01-22 at 22:05 +0700, Fajar Priyanto wrote:
>
> Oh - and - sorry for the really, really late response.

That's ok Lon. I really appreciate it.

-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial 
http://linux2.arinet.org
7:10pm up 0:28, 2.6.18.2-34-default GNU/Linux 
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070316/45f7540d/attachment.sig>

From orkcu at yahoo.com  Fri Mar 16 13:46:59 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Fri, 16 Mar 2007 06:46:59 -0700 (PDT)
Subject: [Linux-cluster] GFS and RHCS will be included now in RHEL4 Advanced
	Platform ?
Message-ID: <479149.36421.qm@web50606.mail.re2.yahoo.com>

Hi 
I just email sales contact in redhat but I do not get
an anwser yet :-( I guess they are very busy now :-)

According to redhat website, now GFS and Cluster suit
are included in the Advanced Platform , what does that
mean for RHEL-4 GFS and Cluster suit?

If I now buy a RHEL Advanced Platform license what do
I get is just for RHEL-5 plataform? or this features:

http://www.redhat.com/rhel/server/advanced/

are also true for RHEL-4 line of products ?

more precise, one month ago, if I would had to have
support for RHEL4 GFS I would had to buy this:
two RHEL licenses
two GFS licenses
which mean more than $5k .
now with only $3k I got the same for RHEL5 plataform.
what I need to know is: the same is true for RHEL-4
plataform?

thanks
roger


__________________________________________
RedHat Certified Engineer ( RHCE )
Cisco Certified Network Associate ( CCNA )


 
____________________________________________________________________________________
Finding fabulous fares is fun.  
Let Yahoo! FareChase search your favorite travel sites to find flight and hotel bargains.
http://farechase.yahoo.com/promo-generic-14795097



From hlawatschek at atix.de  Fri Mar 16 13:49:15 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Fri, 16 Mar 2007 14:49:15 +0100
Subject: [Linux-cluster] rgmanager is ping-ponging service
Message-ID: <200703161449.16182.hlawatschek@atix.de>

Hi,

I have the following configuration:
2-node cluster with a priorized failover group:
[...]
<failoverdomain name="FD" ordered="1" restricted="1">
                                <failoverdomainnode name="node1" 
priority="1"/>
                                <failoverdomainnode name="node2" 
priority="2"/>
</failoverdomain>
[...]

Two IP resources configured:
<resources>
          <ip address="10.0.46.46" monitor_link="1"/>
          <ip address="10.226.3.80" monitor_link="1"/>
 </resources>

and a service:
<service autostart="1" domain="FD" name="VIP">
          <ip ref="10.0.46.46"/>
          <ip ref="10.226.3.80"/>
</service>

bond1 is configured 
- node1: 10.0.46.48
- node2: 10.0.46.47

bond2 is configured:
- node1: 10.226.3.82
- node2: 10.226.3.81

Now, if I shut down bond1 on node1. the service ping-pongs on node1:
Here's the rgmanager output:

node1:
<debug>  Checking 10.0.46.46, Level 0
<warn>   10.0.46.46 is not configured
[28810] notice: status on ip "10.0.46.46" returned 1 (generic error)
[28810] notice: Stopping service VIP
<debug>  10.0.46.46 is not configured
<info>   Removing IPv4 address 10.226.3.80 from bond2

node2:
[23903] notice: Service VIP is stopped
[23903] debug: Sent relocate request to 1

node1:
[29867] notice: Service VIP is recovering
[29867] notice: Recovering failed service VIP
<debug>  Link for bond2: Detected
<info>   Adding IPv4 address 10.226.3.80 to bond2
<debug>  Sending gratuitous ARP: 10.226.3.80 00:19:bb:e9:2a:c6 brd 
ff:ff:ff:ff:ff:ff
[29867] notice: Service VIP started
<info>   Executing /etc/init.d/ldap status
<debug>  Checking 10.0.46.46, Level 10
<warn>   10.0.46.46 is not configured
[30905] notice: status on ip "10.0.46.46" returned 1 (generic error)
[30905] notice: Stopping service VIP
<debug>  10.0.46.46 is not configured
<info>   Removing IPv4 address 10.226.3.80 from bond2

Any ideas ?

Mark

-- 
Gruss / Regards,

** Visit us at CeBIT 2007 in Hannover/Germany **
** in Hall 5, Booth G48/2  (15.-21. of March) **

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany
Registergericht: Amtsgericht M?nchen
Registernummer: HR131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From hlawatschek at atix.de  Fri Mar 16 14:05:02 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Fri, 16 Mar 2007 15:05:02 +0100
Subject: [Linux-cluster] rgmanager is ping-ponging service
In-Reply-To: <200703161449.16182.hlawatschek@atix.de>
References: <200703161449.16182.hlawatschek@atix.de>
Message-ID: <200703161505.02131.hlawatschek@atix.de>

Hi, 

some additional info:

setting recovery="relocate" to the service definition solved it for now. 

Nevertheless, with recovery="restart" rgmanager should also recognize that 
bond1 is not up and relocate the service to node2.

Any opinions ?

Mark

On Friday 16 March 2007 14:49:15 Mark Hlawatschek wrote:
> Hi,
>
> I have the following configuration:
> 2-node cluster with a priorized failover group:
> [...]
> <failoverdomain name="FD" ordered="1" restricted="1">
>                                 <failoverdomainnode name="node1"
> priority="1"/>
>                                 <failoverdomainnode name="node2"
> priority="2"/>
> </failoverdomain>
> [...]
>
> Two IP resources configured:
> <resources>
>           <ip address="10.0.46.46" monitor_link="1"/>
>           <ip address="10.226.3.80" monitor_link="1"/>
>  </resources>
>
> and a service:
> <service autostart="1" domain="FD" name="VIP">
>           <ip ref="10.0.46.46"/>
>           <ip ref="10.226.3.80"/>
> </service>
>
> bond1 is configured
> - node1: 10.0.46.48
> - node2: 10.0.46.47
>
> bond2 is configured:
> - node1: 10.226.3.82
> - node2: 10.226.3.81
>
> Now, if I shut down bond1 on node1. the service ping-pongs on node1:
> Here's the rgmanager output:
>
> node1:
> <debug>  Checking 10.0.46.46, Level 0
> <warn>   10.0.46.46 is not configured
> [28810] notice: status on ip "10.0.46.46" returned 1 (generic error)
> [28810] notice: Stopping service VIP
> <debug>  10.0.46.46 is not configured
> <info>   Removing IPv4 address 10.226.3.80 from bond2
>
> node2:
> [23903] notice: Service VIP is stopped
> [23903] debug: Sent relocate request to 1
>
> node1:
> [29867] notice: Service VIP is recovering
> [29867] notice: Recovering failed service VIP
> <debug>  Link for bond2: Detected
> <info>   Adding IPv4 address 10.226.3.80 to bond2
> <debug>  Sending gratuitous ARP: 10.226.3.80 00:19:bb:e9:2a:c6 brd
> ff:ff:ff:ff:ff:ff
> [29867] notice: Service VIP started
> <info>   Executing /etc/init.d/ldap status
> <debug>  Checking 10.0.46.46, Level 10
> <warn>   10.0.46.46 is not configured
> [30905] notice: status on ip "10.0.46.46" returned 1 (generic error)
> [30905] notice: Stopping service VIP
> <debug>  10.0.46.46 is not configured
> <info>   Removing IPv4 address 10.226.3.80 from bond2
>
> Any ideas ?
>
> Mark

-- 
Gruss / Regards,

** Visit us at CeBIT 2007 in Hannover/Germany **
** in Hall 5, Booth G48/2  (15.-21. of March) **

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany



From isplist at logicore.net  Fri Mar 16 16:23:22 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Fri, 16 Mar 2007 10:23:22 -0600
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS: Resolved
Message-ID: <2007316102322.068499@leena>

First: THANK YOU!!!! To all who tried to help me with this.

The solution was, um, rather simple... pull another chassis out, try again.

Without even an anaconda/installer option, RHEL4 finds all 32 LUNS plus all of 
the other storage systems. 

I didn't think of trying another chassis because this one is slightly 
different than all of the rest I have so it was dedicated to this task. 

Anyhow, now the new challenge is figuring out how to assign a volume to each 
of 30 blades/servers. Each server has a Qlogic 2200 installed. On each blade, 
I can assign a boot device within the Qlogic BIOS. Of course, that's not 
working as well as I'd hoped.

Anyone doing this who might be able to offer some insight?

Mike





From lhh at redhat.com  Fri Mar 16 17:04:25 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 16 Mar 2007 13:04:25 -0400
Subject: [Linux-cluster] rgmanager is ping-ponging service
In-Reply-To: <200703161449.16182.hlawatschek@atix.de>
References: <200703161449.16182.hlawatschek@atix.de>
Message-ID: <1174064666.13796.86.camel@asuka.boston.devel.redhat.com>

On Fri, 2007-03-16 at 14:49 +0100, Mark Hlawatschek wrote:

> [30905] notice: status on ip "10.0.46.46" returned 1 (generic error)
> [30905] notice: Stopping service VIP
> <debug>  10.0.46.46 is not configured
> <info>   Removing IPv4 address 10.226.3.80 from bond2

Noted here:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=222484

Patch here:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/rgmanager/src/resources/ip.sh.diff?cvsroot=cluster&r1=text&tr1=1.22&r2=text&tr2=1.21&f=u

Will be included in update 5 of RHCS4.

-- Lon



From jeff3140 at gmail.com  Fri Mar 16 18:49:10 2007
From: jeff3140 at gmail.com (Jeff)
Date: Fri, 16 Mar 2007 14:49:10 -0400
Subject: [Linux-cluster] DLM Documentation
Message-ID: <bf6826c60703161149j49468a62y383d1d4e730b2b59@mail.gmail.com>

There is a document in a DLM Source tree,
   http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm/doc/?cvsroot=cluster
which briefly describes the DLM API.
Is there anything else available?

In particular I was looking for how VALNOTVALID errors are handled and
signaled to the user.



From rkenna at redhat.com  Fri Mar 16 18:57:18 2007
From: rkenna at redhat.com (Rob Kenna)
Date: Fri, 16 Mar 2007 14:57:18 -0400
Subject: [Linux-cluster] GFS and RHCS will be included now in RHEL4
	Advanced Platform ?
In-Reply-To: <479149.36421.qm@web50606.mail.re2.yahoo.com>
References: <479149.36421.qm@web50606.mail.re2.yahoo.com>
Message-ID: <45FAE88E.6010100@redhat.com>

The RHEL 5 Advanced Platform provides a great deal, both in terms of 
functionality and price.  Advanced Platform only exists on RHEL 5, there 
is no RHEL 4 Advanced Platform. It does not include the entitlements to 
run GFS & Cluster Suite on RHEL 4.  RHEL 4 packaging is the same as before.

If you have a RHEL 4 AS entitlement you can call Red Hat and upgrade, 
for no charge, into the RHEL 5 AP platform which includes GFS & Cluster 
Suite.

- Rob

Roger Pe?a wrote:
> Hi 
> I just email sales contact in redhat but I do not get
> an anwser yet :-( I guess they are very busy now :-)
>
> According to redhat website, now GFS and Cluster suit
> are included in the Advanced Platform , what does that
> mean for RHEL-4 GFS and Cluster suit?
>
> If I now buy a RHEL Advanced Platform license what do
> I get is just for RHEL-5 plataform? or this features:
>
> http://www.redhat.com/rhel/server/advanced/
>
> are also true for RHEL-4 line of products ?
>
> more precise, one month ago, if I would had to have
> support for RHEL4 GFS I would had to buy this:
> two RHEL licenses
> two GFS licenses
> which mean more than $5k .
> now with only $3k I got the same for RHEL5 plataform.
> what I need to know is: the same is true for RHEL-4
> plataform?
>
> thanks
> roger
>
>
> __________________________________________
> RedHat Certified Engineer ( RHCE )
> Cisco Certified Network Associate ( CCNA )
>
>
>  
> ____________________________________________________________________________________
> Finding fabulous fares is fun.  
> Let Yahoo! FareChase search your favorite travel sites to find flight and hotel bargains.
> http://farechase.yahoo.com/promo-generic-14795097
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   

-- 
Robert Kenna / Red Hat
Sr Product Mgr - Clustering & Storage
10 Technology Park Drive
Westford, MA 01886
o: (978) 392-2410 f: (978) 392-1001
c: (978) 771-6314 text: 9787716314 at vtext.com
rkenna at redhat.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070316/f760e272/attachment.htm>

From orkcu at yahoo.com  Fri Mar 16 19:13:40 2007
From: orkcu at yahoo.com (=?iso-8859-1?Q?Roger_Pe=F1a?=)
Date: Fri, 16 Mar 2007 12:13:40 -0700 (PDT)
Subject: [Linux-cluster] GFS and RHCS will be included now in RHEL4
	Advanced Platform ?
In-Reply-To: <45FAE88E.6010100@redhat.com>
Message-ID: <947590.47976.qm@web50606.mail.re2.yahoo.com>


--- Rob Kenna <rkenna at redhat.com> wrote:

> The RHEL 5 Advanced Platform provides a great deal,
> both in terms of 
> functionality and price.  Advanced Platform only
> exists on RHEL 5, there 
> is no RHEL 4 Advanced Platform. It does not include
> the entitlements to 
> run GFS & Cluster Suite on RHEL 4.  RHEL 4 packaging
> is the same as before.
> 
> If you have a RHEL 4 AS entitlement you can call Red
> Hat and upgrade, 
> for no charge, into the RHEL 5 AP platform which
> includes GFS & Cluster 
> Suite.
> 

thanks, that was exactly what I need to clarify
:-)



> - Rob
> 
> Roger Pe?a wrote:
> > Hi 
> > I just email sales contact in redhat but I do not
> get
> > an anwser yet :-( I guess they are very busy now
> :-)
> >
> > According to redhat website, now GFS and Cluster
> suit
> > are included in the Advanced Platform , what does
> that
> > mean for RHEL-4 GFS and Cluster suit?
> >
> > If I now buy a RHEL Advanced Platform license what
> do
> > I get is just for RHEL-5 plataform? or this
> features:
> >
> > http://www.redhat.com/rhel/server/advanced/
> >
> > are also true for RHEL-4 line of products ?
> >
> > more precise, one month ago, if I would had to
> have
> > support for RHEL4 GFS I would had to buy this:
> > two RHEL licenses
> > two GFS licenses
> > which mean more than $5k .
> > now with only $3k I got the same for RHEL5
> plataform.
> > what I need to know is: the same is true for
> RHEL-4
> > plataform?

cu
roger

__________________________________________
RedHat Certified Engineer ( RHCE )
Cisco Certified Network Associate ( CCNA )


 
____________________________________________________________________________________
TV dinner still cooling? 
Check out "Tonight's Picks" on Yahoo! TV.
http://tv.yahoo.com/



From isplist at logicore.net  Fri Mar 16 21:07:12 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Fri, 16 Mar 2007 15:07:12 -0600
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS: Resolved
Message-ID: <200731615712.368472@leena>

My mailer accidentally nuked replies on this topic. Please re-send as I saw 
replies but didn't get the chance to note from whom.

Thanks much.

Mike





From cluster at ew.nsci.us  Sat Mar 17 03:54:47 2007
From: cluster at ew.nsci.us (cluster at ew.nsci.us)
Date: Fri, 16 Mar 2007 19:54:47 -0800 (PST)
Subject: [Linux-cluster] DRBD + GNBD + GFS Race Conditions?
Message-ID: <Pine.LNX.4.64.0703161934140.21214@xeon.national-security.net>


Dear listmates,

Floating about the Internet are many howtos and references to backing GNBD 
with DRBD in order to have failover GNBD and mount GFS atop of the GNBD 
device.  Does anyone know how the following possible race condition is 
handled?

1. GFS writes to its GNBD device.
    GNBD client node writes to GNBD server node.
    GNBD server writes to DRBD-primary.
    DRBD begins to write to itself and to DRBD-secondary.
    Before DRBD completes the write to DRBD-secondary (thus, before 
it returns since writes are synchronous) the DRBD-primary node 
looses power.
    The GNBD server dies with the power loss.
    GNBD client node drops connection to the GNBD server.

2. Heartbeat notices the death of DRBD-primary, switches the 
DRBD-secondary to DRBD-primary, re-exports /dev/drbd0 via GNBD, and 
re-creates the virtual IP which the GNBD client was connecting to.

3. The GNBD client writing on behalf of GFS reconnects.

Now, what happens to the write originally going to the DRBD volume? Will 
the GNBD-client retry the write?  Are there situations where the write 
could be dropped all together?

Are there other kinds of race conditions which could take place?  Other 
concerns outside of this scenario?

We are thinking about implementing DRBD+GNBD+GFS+Xen to support failover 
and domain migration.  In the event of a failure like power loss, I would 
like to be certain that when the failed-to GNBD server node comes online, 
that any GNBD clients which were half-way through a write will re-commit 
the write.

Thoughts?

-Eric



From davegu1 at hotmail.com  Sat Mar 17 06:11:04 2007
From: davegu1 at hotmail.com (David)
Date: Sat, 17 Mar 2007 00:11:04 -0600
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <B14199FA0DBAAF4AA89E83EB41D3543503367BC5@xmb-sjc-22c.amer.cisco.com>
Message-ID: <BAY109-DAV128548AD7BFF5992DCB8F1FA700@phx.gbl>

Once you get it, feel free to share with all of us.

David 

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sridhar Ramaswamy
(srramasw)
Sent: Wednesday, March 14, 2007 12:20 PM
To: linux clustering
Subject: RE: [Linux-cluster] Kernel oops when GFS2 used as localfs


Hi folks,

This bugzilla 229831 seems "restricted" :(  Can someone please forward
its details? I'm interested in understanding the actual problem and its
proposed fix.

Meanwhile I'll try an upstream kernel. Should 2.6.21.rc1 be fine?

Thanks,
Sridhar

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
> Sent: Wednesday, March 14, 2007 6:33 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
> 
> Steven Whitehouse wrote:
> > Hi,
> >
> > This looks like Red Hat bugzilla 229831 and if so then
> there is already
> > a fix in Linus' upstream kernel and in the latest kernel build for 
> > Fedora (both 5 and 6),
> >
> >   
> yeah... agree. I missed the unlink part of the trace in previous post
> and thought it might have something to do with my ino number 
> hacking in 
> the lookup code.
> 
> -- Wendy
> > Steve.
> >
> > On Wed, 2007-03-14 at 00:15 -0700, Sridhar Ramaswamy
> (srramasw) wrote:
> >   
> >> Sure Wendy. Here it is,
> >>
> >> "fs/gfs2/inode.c" 1256
> >> struct inode *gfs2_ilookup(struct super_block *sb, struct
> gfs2_inum_host
> >> *inum)
> >> {
> >>         return ilookup5(sb, (unsigned long)inum->no_formal_ino,
> >>                         iget_test, inum);
> >> }
> >>
> >> static struct inode *gfs2_iget(struct super_block *sb, struct 
> >> gfs2_inum_host *inum) {
> >>         return iget5_locked(sb, (unsigned long)inum->no_formal_ino,
> >>                      iget_test, iget_set, inum);
> >> }
> >>
> >> BTW - this code is from kernel version 2.6.20.1.
> >>
> >> Thanks,
> >> Sridhar
> >>
> >>     
> >>> -----Original Message-----
> >>> From: linux-cluster-bounces at redhat.com
> >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
> >>> Sent: Tuesday, March 13, 2007 10:29 PM
> >>> To: linux clustering
> >>> Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
> >>>
> >>> Sridhar Ramaswamy (srramasw) wrote:
> >>>
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel: ------------[ cut here ]------------

> >>>> Mar 13 15:25:56 cfs1 kernel: kernel BUG at fs/gfs2/meta_io.c:474!
> >>>>         
> >>> I don't have time to pull kernel.org source code right
> now. Could you
> >>> cut-and-paste the following two routines (from fs/gfs2/inode.c):
> >>> gfs2_ilookup() and gfs2_iget() so we can look into this
> >>> quickly ? They 
> >>> are all one-liner so shouldn't be too much troubles. GFS2 
> >>> currently has 
> >>> ino issue with its lookup code.
> >>>
> >>> -- Wendy
> >>>
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000 [#1] Mar 13 
> >>>> 15:25:56 cfs1 kernel: SMP Mar 13 15:25:56 cfs1 kernel: Modules 
> >>>> linked in: lock_nolock gfs2 reiserfs nfsd exportfs nfs lockd 
> >>>> nfs_acl ipv6 parport_pc
> lp parport
> >>>> autofs4 sunrpc dm_mirror dm_mod button battery ac
> uhci_hcd ehci_hcd
> >>>> intel_rng rng_core i2c_i801 i2c_core e1000 e100 mii
> floppy ext3 jbd
> >>>> Mar 13 15:25:56 cfs1 kernel: CPU:    1
> >>>> Mar 13 15:25:56 cfs1 kernel: EIP:    0060:[<e0c88da5>]    
> >>>>         
> >>> Not tainted VLI
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246   (2.6.20.1 #1)
> >>>> Mar 13 15:25:56 cfs1 kernel: EIP is at
> >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
> >>>> Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx: 
> 00012bf5   ecx: 
> >>>> ce5a6dd4   edx: dc5eae00
> >>>> Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi: 
> 00000000   ebp: 
> >>>> dc5ea9a8   esp: ce5a6d58
> >>>> Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
> >>>> Mar 13 15:25:56 cfs1 kernel: Process bonnie++ (pid: 5509,
> >>>>         
> >>> ti=ce5a6000
> >>>       
> >>>> task=dc320570 task.ti=ce5a6000)
> >>>> Mar 13 15:25:56 cfs1 kernel: Stack: c156d274 ce5a6dd4 c016eb91
> >>>> ce5a6dd4 00000000 dc5eae00 00000000 d6f34000
> >>>> Mar 13 15:25:56 cfs1 kernel:        00000000 00000000 dc5ea9a8 
> >>>> d57794a8 ce5a6e08 e0c83ad3 00012bf5 00000000
> >>>> Mar 13 15:25:56 cfs1 kernel:        00000000 ce5a6da0 00000000 
> >>>> 00000000 dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
> >>>> Mar 13 15:25:56 cfs1 kernel: Call Trace:
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>] iget5_locked+0x3d/0x67
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] 
> >>>> gfs2_inode_refresh+0x34/0xfe [gfs2]
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] 
> >>>>         
> >>> gfs2_createi+0x12c/0x191 [gfs2]
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>]
> >>>>         
> >>> gfs2_create+0x5c/0x103 [gfs2]
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>]
> >>>>         
> >>> gfs2_createi+0x4a/0x191 [gfs2]
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>]
> >>>>         
> >>> gfs2_glock_nq_num+0x3f/0x64
> >>>       
> >>>> [gfs2]
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016552e>] vfs_create+0xc3/0x126 
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>]
> >>>>         
> >>> open_namei_create+0x47/0x88
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016597d>] open_namei+0x14a/0x539

> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>] do_filp_open+0x25/0x39

> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>]
> >>>>         
> >>> strncpy_from_user+0x3c/0x5b
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d431>]
> get_unused_fd+0xa8/0xb1
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d509>] do_sys_open+0x42/0xbe 
> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>] sys_open+0x1a/0x1c Mar

> >>>> 13 15:25:56 cfs1 kernel:  [<c015d5dd>] sys_creat+0x1f/0x23 Mar 13

> >>>> 15:25:56 cfs1 kernel:  [<c0103410>]
> >>>>         
> >>> sysenter_past_esp+0x5d/0x81
> >>>       
> >>>> Mar 13 15:25:56 cfs1 kernel:  ======================= Mar 13 
> >>>> 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44 24
> >>>>         
> >>> 1c 8b 85 f0
> >>>       
> >>>> 01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85 c0 89 44 24
> >>>>         
> >>> 18 c7 44
> >>>       
> >>>> 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c 00 75 04
> >>>>         
> >>> 0f 0b eb fe
> >>>       
> >>>> 8d 85 24 04 00 00
> >>>> Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>]
> >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2] SS:ESP 0068:ce5a6d58
> >>>>  
> >>>>  
> >>>> (2)
> >>>>  
> >>>> Mar 13 17:00:30 cfs1 kernel: Call Trace:
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>]
> >>>>         
> >>> gfs2_unlink+0x53/0xe0 [gfs2]
> >>>       
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>]
> >>>>         
> >>> gfs2_unlink+0x3a/0xe0 [gfs2]
> >>>       
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>]
> >>>>         
> >>> gfs2_unlink+0x53/0xe0 [gfs2]
> >>>       
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0166572>] vfs_unlink+0xa1/0xc5 
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c016662b>] do_unlinkat+0x95/0xf5 
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c01187de>]
> scheduler_tick+0x8f/0x95
> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0103410>]
> >>>>         
> >>> sysenter_past_esp+0x5d/0x81
> >>>       
> >>>> Mar 13 17:00:30 cfs1 kernel:  =======================
> >>>>
> >>>> -------------------------------------------------------------
> >>>>         
> >>> -----------
> >>>       
> >>>> --
> >>>> Linux-cluster mailing list
> >>>> Linux-cluster at redhat.com 
> >>>> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>>>
> >>>>         
> >>> --
> >>> Linux-cluster mailing list
> >>> Linux-cluster at redhat.com 
> >>> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>>
> >>>       
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com 
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>     
> >
> >   
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com 
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From srramasw at cisco.com  Sat Mar 17 05:34:12 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Fri, 16 Mar 2007 22:34:12 -0700
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <003301c7685b$0b393680$6701a8c0@gutierrezda>
Message-ID: <B14199FA0DBAAF4AA89E83EB41D35435033D14C5@xmb-sjc-22c.amer.cisco.com>

Sure. I don't have any breakthru' yet. Also I didn't got any further
details on this bugzilla contents.

- Sridhar 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of David
> Sent: Friday, March 16, 2007 11:11 PM
> To: 'linux clustering'
> Subject: RE: [Linux-cluster] Kernel oops when GFS2 used as localfs
> 
> Once you get it, feel free to share with all of us.
> 
> David 
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Sridhar Ramaswamy
> (srramasw)
> Sent: Wednesday, March 14, 2007 12:20 PM
> To: linux clustering
> Subject: RE: [Linux-cluster] Kernel oops when GFS2 used as localfs
> 
> 
> Hi folks,
> 
> This bugzilla 229831 seems "restricted" :(  Can someone please forward
> its details? I'm interested in understanding the actual 
> problem and its
> proposed fix.
> 
> Meanwhile I'll try an upstream kernel. Should 2.6.21.rc1 be fine?
> 
> Thanks,
> Sridhar
> 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
> > Sent: Wednesday, March 14, 2007 6:33 AM
> > To: linux clustering
> > Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
> > 
> > Steven Whitehouse wrote:
> > > Hi,
> > >
> > > This looks like Red Hat bugzilla 229831 and if so then
> > there is already
> > > a fix in Linus' upstream kernel and in the latest kernel 
> build for 
> > > Fedora (both 5 and 6),
> > >
> > >   
> > yeah... agree. I missed the unlink part of the trace in 
> previous post
> > and thought it might have something to do with my ino number 
> > hacking in 
> > the lookup code.
> > 
> > -- Wendy
> > > Steve.
> > >
> > > On Wed, 2007-03-14 at 00:15 -0700, Sridhar Ramaswamy
> > (srramasw) wrote:
> > >   
> > >> Sure Wendy. Here it is,
> > >>
> > >> "fs/gfs2/inode.c" 1256
> > >> struct inode *gfs2_ilookup(struct super_block *sb, struct
> > gfs2_inum_host
> > >> *inum)
> > >> {
> > >>         return ilookup5(sb, (unsigned long)inum->no_formal_ino,
> > >>                         iget_test, inum);
> > >> }
> > >>
> > >> static struct inode *gfs2_iget(struct super_block *sb, struct 
> > >> gfs2_inum_host *inum) {
> > >>         return iget5_locked(sb, (unsigned 
> long)inum->no_formal_ino,
> > >>                      iget_test, iget_set, inum);
> > >> }
> > >>
> > >> BTW - this code is from kernel version 2.6.20.1.
> > >>
> > >> Thanks,
> > >> Sridhar
> > >>
> > >>     
> > >>> -----Original Message-----
> > >>> From: linux-cluster-bounces at redhat.com
> > >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of 
> Wendy Cheng
> > >>> Sent: Tuesday, March 13, 2007 10:29 PM
> > >>> To: linux clustering
> > >>> Subject: Re: [Linux-cluster] Kernel oops when GFS2 used 
> as localfs
> > >>>
> > >>> Sridhar Ramaswamy (srramasw) wrote:
> > >>>
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel: ------------[ cut here 
> ]------------
> 
> > >>>> Mar 13 15:25:56 cfs1 kernel: kernel BUG at 
> fs/gfs2/meta_io.c:474!
> > >>>>         
> > >>> I don't have time to pull kernel.org source code right
> > now. Could you
> > >>> cut-and-paste the following two routines (from fs/gfs2/inode.c):
> > >>> gfs2_ilookup() and gfs2_iget() so we can look into this
> > >>> quickly ? They 
> > >>> are all one-liner so shouldn't be too much troubles. GFS2 
> > >>> currently has 
> > >>> ino issue with its lookup code.
> > >>>
> > >>> -- Wendy
> > >>>
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000 [#1] Mar 13 
> > >>>> 15:25:56 cfs1 kernel: SMP Mar 13 15:25:56 cfs1 kernel: Modules 
> > >>>> linked in: lock_nolock gfs2 reiserfs nfsd exportfs nfs lockd 
> > >>>> nfs_acl ipv6 parport_pc
> > lp parport
> > >>>> autofs4 sunrpc dm_mirror dm_mod button battery ac
> > uhci_hcd ehci_hcd
> > >>>> intel_rng rng_core i2c_i801 i2c_core e1000 e100 mii
> > floppy ext3 jbd
> > >>>> Mar 13 15:25:56 cfs1 kernel: CPU:    1
> > >>>> Mar 13 15:25:56 cfs1 kernel: EIP:    0060:[<e0c88da5>]    
> > >>>>         
> > >>> Not tainted VLI
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246   (2.6.20.1 #1)
> > >>>> Mar 13 15:25:56 cfs1 kernel: EIP is at
> > >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
> > >>>> Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx: 
> > 00012bf5   ecx: 
> > >>>> ce5a6dd4   edx: dc5eae00
> > >>>> Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi: 
> > 00000000   ebp: 
> > >>>> dc5ea9a8   esp: ce5a6d58
> > >>>> Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
> > >>>> Mar 13 15:25:56 cfs1 kernel: Process bonnie++ (pid: 5509,
> > >>>>         
> > >>> ti=ce5a6000
> > >>>       
> > >>>> task=dc320570 task.ti=ce5a6000)
> > >>>> Mar 13 15:25:56 cfs1 kernel: Stack: c156d274 ce5a6dd4 c016eb91
> > >>>> ce5a6dd4 00000000 dc5eae00 00000000 d6f34000
> > >>>> Mar 13 15:25:56 cfs1 kernel:        00000000 00000000 dc5ea9a8 
> > >>>> d57794a8 ce5a6e08 e0c83ad3 00012bf5 00000000
> > >>>> Mar 13 15:25:56 cfs1 kernel:        00000000 ce5a6da0 00000000 
> > >>>> 00000000 dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
> > >>>> Mar 13 15:25:56 cfs1 kernel: Call Trace:
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>] 
> iget5_locked+0x3d/0x67
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] 
> > >>>> gfs2_inode_refresh+0x34/0xfe [gfs2]
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] 
> > >>>>         
> > >>> gfs2_createi+0x12c/0x191 [gfs2]
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>]
> > >>>>         
> > >>> gfs2_create+0x5c/0x103 [gfs2]
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>]
> > >>>>         
> > >>> gfs2_createi+0x4a/0x191 [gfs2]
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>]
> > >>>>         
> > >>> gfs2_glock_nq_num+0x3f/0x64
> > >>>       
> > >>>> [gfs2]
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016552e>] 
> vfs_create+0xc3/0x126 
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>]
> > >>>>         
> > >>> open_namei_create+0x47/0x88
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016597d>] 
> open_namei+0x14a/0x539
> 
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>] 
> do_filp_open+0x25/0x39
> 
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>]
> > >>>>         
> > >>> strncpy_from_user+0x3c/0x5b
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d431>]
> > get_unused_fd+0xa8/0xb1
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d509>] 
> do_sys_open+0x42/0xbe 
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>] 
> sys_open+0x1a/0x1c Mar
> 
> > >>>> 13 15:25:56 cfs1 kernel:  [<c015d5dd>] 
> sys_creat+0x1f/0x23 Mar 13
> 
> > >>>> 15:25:56 cfs1 kernel:  [<c0103410>]
> > >>>>         
> > >>> sysenter_past_esp+0x5d/0x81
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  ======================= Mar 13 
> > >>>> 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44 24
> > >>>>         
> > >>> 1c 8b 85 f0
> > >>>       
> > >>>> 01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85 c0 89 44 24
> > >>>>         
> > >>> 18 c7 44
> > >>>       
> > >>>> 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c 00 75 04
> > >>>>         
> > >>> 0f 0b eb fe
> > >>>       
> > >>>> 8d 85 24 04 00 00
> > >>>> Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>]
> > >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2] SS:ESP 
> 0068:ce5a6d58
> > >>>>  
> > >>>>  
> > >>>> (2)
> > >>>>  
> > >>>> Mar 13 17:00:30 cfs1 kernel: Call Trace:
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>]
> > >>>>         
> > >>> gfs2_unlink+0x53/0xe0 [gfs2]
> > >>>       
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>]
> > >>>>         
> > >>> gfs2_unlink+0x3a/0xe0 [gfs2]
> > >>>       
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>]
> > >>>>         
> > >>> gfs2_unlink+0x53/0xe0 [gfs2]
> > >>>       
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0166572>] 
> vfs_unlink+0xa1/0xc5 
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<c016662b>] 
> do_unlinkat+0x95/0xf5 
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<c01187de>]
> > scheduler_tick+0x8f/0x95
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0103410>]
> > >>>>         
> > >>> sysenter_past_esp+0x5d/0x81
> > >>>       
> > >>>> Mar 13 17:00:30 cfs1 kernel:  =======================
> > >>>>
> > >>>> -------------------------------------------------------------
> > >>>>         
> > >>> -----------
> > >>>       
> > >>>> --
> > >>>> Linux-cluster mailing list
> > >>>> Linux-cluster at redhat.com 
> > >>>> https://www.redhat.com/mailman/listinfo/linux-cluster
> > >>>>
> > >>>>         
> > >>> --
> > >>> Linux-cluster mailing list
> > >>> Linux-cluster at redhat.com 
> > >>> https://www.redhat.com/mailman/listinfo/linux-cluster
> > >>>
> > >>>       
> > >> --
> > >> Linux-cluster mailing list
> > >> Linux-cluster at redhat.com 
> > >> https://www.redhat.com/mailman/listinfo/linux-cluster
> > >>     
> > >
> > >   
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com 
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From kanderso at redhat.com  Sat Mar 17 09:36:33 2007
From: kanderso at redhat.com (Kevin Anderson)
Date: Sat, 17 Mar 2007 04:36:33 -0500
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <003301c7685b$0b393680$6701a8c0@gutierrezda>
References: <003301c7685b$0b393680$6701a8c0@gutierrezda>
Message-ID: <1174124193.2897.18.camel@localhost.localdomain>

Just opened up bugzilla 229831, you should be able to see it now.

Kevin

On Sat, 2007-03-17 at 00:11 -0600, David wrote:
> Once you get it, feel free to share with all of us.
> 
> David 
> 
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sridhar Ramaswamy
> (srramasw)
> Sent: Wednesday, March 14, 2007 12:20 PM
> To: linux clustering
> Subject: RE: [Linux-cluster] Kernel oops when GFS2 used as localfs
> 
> 
> Hi folks,
> 
> This bugzilla 229831 seems "restricted" :(  Can someone please forward
> its details? I'm interested in understanding the actual problem and its
> proposed fix.
> 
> Meanwhile I'll try an upstream kernel. Should 2.6.21.rc1 be fine?
> 
> Thanks,
> Sridhar
> 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
> > Sent: Wednesday, March 14, 2007 6:33 AM
> > To: linux clustering
> > Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
> > 
> > Steven Whitehouse wrote:
> > > Hi,
> > >
> > > This looks like Red Hat bugzilla 229831 and if so then
> > there is already
> > > a fix in Linus' upstream kernel and in the latest kernel build for 
> > > Fedora (both 5 and 6),
> > >
> > >   
> > yeah... agree. I missed the unlink part of the trace in previous post
> > and thought it might have something to do with my ino number 
> > hacking in 
> > the lookup code.
> > 
> > -- Wendy
> > > Steve.
> > >
> > > On Wed, 2007-03-14 at 00:15 -0700, Sridhar Ramaswamy
> > (srramasw) wrote:
> > >   
> > >> Sure Wendy. Here it is,
> > >>
> > >> "fs/gfs2/inode.c" 1256
> > >> struct inode *gfs2_ilookup(struct super_block *sb, struct
> > gfs2_inum_host
> > >> *inum)
> > >> {
> > >>         return ilookup5(sb, (unsigned long)inum->no_formal_ino,
> > >>                         iget_test, inum);
> > >> }
> > >>
> > >> static struct inode *gfs2_iget(struct super_block *sb, struct 
> > >> gfs2_inum_host *inum) {
> > >>         return iget5_locked(sb, (unsigned long)inum->no_formal_ino,
> > >>                      iget_test, iget_set, inum);
> > >> }
> > >>
> > >> BTW - this code is from kernel version 2.6.20.1.
> > >>
> > >> Thanks,
> > >> Sridhar
> > >>
> > >>     
> > >>> -----Original Message-----
> > >>> From: linux-cluster-bounces at redhat.com
> > >>> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Wendy Cheng
> > >>> Sent: Tuesday, March 13, 2007 10:29 PM
> > >>> To: linux clustering
> > >>> Subject: Re: [Linux-cluster] Kernel oops when GFS2 used as localfs
> > >>>
> > >>> Sridhar Ramaswamy (srramasw) wrote:
> > >>>
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel: ------------[ cut here ]------------
> 
> > >>>> Mar 13 15:25:56 cfs1 kernel: kernel BUG at fs/gfs2/meta_io.c:474!
> > >>>>         
> > >>> I don't have time to pull kernel.org source code right
> > now. Could you
> > >>> cut-and-paste the following two routines (from fs/gfs2/inode.c):
> > >>> gfs2_ilookup() and gfs2_iget() so we can look into this
> > >>> quickly ? They 
> > >>> are all one-liner so shouldn't be too much troubles. GFS2 
> > >>> currently has 
> > >>> ino issue with its lookup code.
> > >>>
> > >>> -- Wendy
> > >>>
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000 [#1] Mar 13 
> > >>>> 15:25:56 cfs1 kernel: SMP Mar 13 15:25:56 cfs1 kernel: Modules 
> > >>>> linked in: lock_nolock gfs2 reiserfs nfsd exportfs nfs lockd 
> > >>>> nfs_acl ipv6 parport_pc
> > lp parport
> > >>>> autofs4 sunrpc dm_mirror dm_mod button battery ac
> > uhci_hcd ehci_hcd
> > >>>> intel_rng rng_core i2c_i801 i2c_core e1000 e100 mii
> > floppy ext3 jbd
> > >>>> Mar 13 15:25:56 cfs1 kernel: CPU:    1
> > >>>> Mar 13 15:25:56 cfs1 kernel: EIP:    0060:[<e0c88da5>]    
> > >>>>         
> > >>> Not tainted VLI
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246   (2.6.20.1 #1)
> > >>>> Mar 13 15:25:56 cfs1 kernel: EIP is at
> > >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
> > >>>> Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx: 
> > 00012bf5   ecx: 
> > >>>> ce5a6dd4   edx: dc5eae00
> > >>>> Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi: 
> > 00000000   ebp: 
> > >>>> dc5ea9a8   esp: ce5a6d58
> > >>>> Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b   ss: 0068
> > >>>> Mar 13 15:25:56 cfs1 kernel: Process bonnie++ (pid: 5509,
> > >>>>         
> > >>> ti=ce5a6000
> > >>>       
> > >>>> task=dc320570 task.ti=ce5a6000)
> > >>>> Mar 13 15:25:56 cfs1 kernel: Stack: c156d274 ce5a6dd4 c016eb91
> > >>>> ce5a6dd4 00000000 dc5eae00 00000000 d6f34000
> > >>>> Mar 13 15:25:56 cfs1 kernel:        00000000 00000000 dc5ea9a8 
> > >>>> d57794a8 ce5a6e08 e0c83ad3 00012bf5 00000000
> > >>>> Mar 13 15:25:56 cfs1 kernel:        00000000 ce5a6da0 00000000 
> > >>>> 00000000 dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
> > >>>> Mar 13 15:25:56 cfs1 kernel: Call Trace:
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>] iget5_locked+0x3d/0x67
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] 
> > >>>> gfs2_inode_refresh+0x34/0xfe [gfs2]
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] 
> > >>>>         
> > >>> gfs2_createi+0x12c/0x191 [gfs2]
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>]
> > >>>>         
> > >>> gfs2_create+0x5c/0x103 [gfs2]
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>]
> > >>>>         
> > >>> gfs2_createi+0x4a/0x191 [gfs2]
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>]
> > >>>>         
> > >>> gfs2_glock_nq_num+0x3f/0x64
> > >>>       
> > >>>> [gfs2]
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016552e>] vfs_create+0xc3/0x126 
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>]
> > >>>>         
> > >>> open_namei_create+0x47/0x88
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016597d>] open_namei+0x14a/0x539
> 
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>] do_filp_open+0x25/0x39
> 
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>]
> > >>>>         
> > >>> strncpy_from_user+0x3c/0x5b
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d431>]
> > get_unused_fd+0xa8/0xb1
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d509>] do_sys_open+0x42/0xbe 
> > >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>] sys_open+0x1a/0x1c Mar
> 
> > >>>> 13 15:25:56 cfs1 kernel:  [<c015d5dd>] sys_creat+0x1f/0x23 Mar 13
> 
> > >>>> 15:25:56 cfs1 kernel:  [<c0103410>]
> > >>>>         
> > >>> sysenter_past_esp+0x5d/0x81
> > >>>       
> > >>>> Mar 13 15:25:56 cfs1 kernel:  ======================= Mar 13 
> > >>>> 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44 24
> > >>>>         
> > >>> 1c 8b 85 f0
> > >>>       
> > >>>> 01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85 c0 89 44 24
> > >>>>         
> > >>> 18 c7 44
> > >>>       
> > >>>> 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c 00 75 04
> > >>>>         
> > >>> 0f 0b eb fe
> > >>>       
> > >>>> 8d 85 24 04 00 00
> > >>>> Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>]
> > >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2] SS:ESP 0068:ce5a6d58
> > >>>>  
> > >>>>  
> > >>>> (2)
> > >>>>  
> > >>>> Mar 13 17:00:30 cfs1 kernel: Call Trace:
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>]
> > >>>>         
> > >>> gfs2_unlink+0x53/0xe0 [gfs2]
> > >>>       
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>]
> > >>>>         
> > >>> gfs2_unlink+0x3a/0xe0 [gfs2]
> > >>>       
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>]
> > >>>>         
> > >>> gfs2_unlink+0x53/0xe0 [gfs2]
> > >>>       
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0166572>] vfs_unlink+0xa1/0xc5 
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<c016662b>] do_unlinkat+0x95/0xf5 
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<c01187de>]
> > scheduler_tick+0x8f/0x95
> > >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0103410>]
> > >>>>         
> > >>> sysenter_past_esp+0x5d/0x81
> > >>>       
> > >>>> Mar 13 17:00:30 cfs1 kernel:  =======================
> > >>>>
> > >>>> -------------------------------------------------------------
> > >>>>         
> > >>> -----------
> > >>>       
> > >>>> --
> > >>>> Linux-cluster mailing list
> > >>>> Linux-cluster at redhat.com 
> > >>>> https://www.redhat.com/mailman/listinfo/linux-cluster
> > >>>>
> > >>>>         
> > >>> --
> > >>> Linux-cluster mailing list
> > >>> Linux-cluster at redhat.com 
> > >>> https://www.redhat.com/mailman/listinfo/linux-cluster
> > >>>
> > >>>       
> > >> --
> > >> Linux-cluster mailing list
> > >> Linux-cluster at redhat.com 
> > >> https://www.redhat.com/mailman/listinfo/linux-cluster
> > >>     
> > >
> > >   
> > 
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com 
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> > 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070317/ed969d3b/attachment.htm>

From J.vandenHorn at xb.nl  Sat Mar 17 15:46:54 2007
From: J.vandenHorn at xb.nl (Jeroen van den Horn)
Date: Sat, 17 Mar 2007 16:46:54 +0100
Subject: [Linux-cluster] GFS instabilities - what am I doing wrong?
Message-ID: <45FC0D6E.2020808@xb.nl>

All,

I'm trying to get a two-node Debian cluster up-and-running with GFS. The 
Linux-machines themselves are running on top op VMWare on blade systems.

I can succesfully start all required subsystems (css, cman, fenced) and 
am able to mount the GFS filesystem on both nodes. After some time 
however (usually in the middle of the night - cause unknown but it might 
have something to do with clock-drift on the VM's) one of the nodes 
determines that the heartbeat is late and kicks the node out of the cluster.

On the 'reporting' node:

Mar 17 03:58:34 srv129bh heartbeat[2722]: WARN: Late heartbeat: Node 
srv128bh: interval 15150 ms
Mar 17 03:58:40 srv129bh kernel: CMAN: removing node srv128bh from the 
cluster : Missed too many heartbeats

On the 'failing' node:

Mar 17 03:58:24 srv128bh kernel: CMAN: Being told to leave the cluster 
by node 2
Mar 17 03:58:24 srv128bh kernel: CMAN: we are leaving the cluster.
Mar 17 03:58:24 srv128bh kernel: WARNING: dlm_emergency_shutdown
Mar 17 03:58:24 srv128bh kernel: WARNING: dlm_emergency_shutdown

Note the approx 15 second delay that may indicate a clock-drift between 
the nodes.

Some hours later I am getting this error on the 'failing' node:

Mar 17 06:26:25 srv128bh kernel: dlm: dlm_lock: no lockspace
Mar 17 06:26:25 srv128bh kernel: s_1 rebuild resource directory
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuilt 16 resources
Mar 17 06:26:25 srv128bh kernel: gfs_1 purge requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 purged 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 mark waiting requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 marked 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 purge locks of departed nodes
Mar 17 06:26:25 srv128bh kernel: gfs_1 purged 14 locks
Mar 17 06:26:25 srv128bh kernel: gfs_1 update remastered resources
Mar 17 06:26:25 srv128bh kernel: gfs_1 updated 1 resources
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuild locks
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuilt 0 locks
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 8 done
Mar 17 06:26:25 srv128bh kernel: gfs_1 move flags 0,0,1 ids 6,8,8
Mar 17 06:26:25 srv128bh kernel: gfs_1 process held requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 processed 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 resend marked requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 resent 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 8 finished
Mar 17 06:26:25 srv128bh kernel: gfs_1 move flags 1,0,0 ids 8,8,8
Mar 17 06:26:25 srv128bh kernel: gfs_1 move flags 0,1,0 ids 8,10,8
Mar 17 06:26:25 srv128bh kernel: gfs_1 move use event 10
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 10
Mar 17 06:26:25 srv128bh kernel: gfs_1 add node 2
Mar 17 06:26:25 srv128bh kernel: gfs_1 total nodes 2
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuild resource directory
Mar 17 06:26:25 srv128bh kernel: gfs_1 rebuilt 9 resources
Mar 17 06:26:25 srv128bh kernel: gfs_1 purge requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 purged 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 mark waiting requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 marked 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 10 done
Mar 17 06:26:25 srv128bh kernel: gfs_1 move flags 0,0,1 ids 8,10,10
Mar 17 06:26:25 srv128bh kernel: gfs_1 process held requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 processed 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 resend marked requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 resent 0 requests
Mar 17 06:26:25 srv128bh kernel: gfs_1 recover event 10 finished
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start last_stop 0 last_start 4 
last_finish 0
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start count 1 type 2 event 4 
flags 250
Mar 17 06:26:25 srv128bh kernel: 3259 claim_jid 0
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start 4 done 1
Mar 17 06:26:25 srv128bh kernel: 3259 pr_finish flags 5b
Mar 17 06:26:25 srv128bh kernel: 3250 recovery_done jid 0 msg 309 b
Mar 17 06:26:25 srv128bh kernel: 3250 recovery_done nodeid 1 flg 18
Mar 17 06:26:25 srv128bh kernel: 3250 recovery_done jid 1 msg 309 b
Mar 17 06:26:25 srv128bh kernel: 3250 others_may_mount b
Mar 17 06:26:25 srv128bh kernel: 3258 pr_start last_stop 4 last_start 7 
last_finish 4
Mar 17 06:26:25 srv128bh kernel: 3258 pr_start count 2 type 2 event 7 
flags a1b
Mar 17 06:26:25 srv128bh kernel: 3258 pr_start 7 done 1
Mar 17 06:26:25 srv128bh kernel: 3258 pr_finish flags 81b
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start last_stop 7 last_start 8 
last_finish 7
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start count 1 type 1 event 8 
flags a1b
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start cb jid 1 id 2
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start 8 done 0
Mar 17 06:26:25 srv128bh kernel: 3262 recovery_done jid 1 msg 309 91b
Mar 17 06:26:25 srv128bh kernel: 3262 recovery_done nodeid 2 flg 1b
Mar 17 06:26:25 srv128bh kernel: 3262 recovery_done start_done 8
Mar 17 06:26:25 srv128bh kernel: 3258 pr_finish flags 81b
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start last_stop 8 last_start 11 
last_finish 8
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start count 2 type 2 event 11 
flags a1b
Mar 17 06:26:25 srv128bh kernel: 3259 pr_start 11 done 1
Mar 17 06:26:25 srv128bh kernel: 3259 pr_finish flags 81b
Mar 17 06:26:25 srv128bh kernel:
Mar 17 06:26:25 srv128bh kernel: lock_dlm:  Assertion failed on line 428 
of file 
/home/jnhon/redhat-cluster/linux-modules-extra-2.6-2.6.18/debian/build/build_i386_none_k7_redhat-cluster/gfs/dlm/lock.c
Mar 17 06:26:25 srv128bh kernel: lock_dlm:  assertion:  "!error"
Mar 17 06:26:25 srv128bh kernel: lock_dlm:  time = 12311916
Mar 17 06:26:25 srv128bh kernel: gfs_1: num=2,17 err=-22 cur=-1 req=3 
lkf=10000
Mar 17 06:26:25 srv128bh kernel:
Mar 17 06:26:25 srv128bh kernel: Modules linked in: ip_vs_wlc ip_vs 
button ac battery vmmemctl ipv6 lock_nolock lock_dlm dlm gfs 
lock_harness cman loop tsdev parport_pc parport intel_agp floppy 
i2c_piix4 psmouse agpgart shpchp pci_hotplug i2c_core serio_raw evdev 
rtc pcspkr ext3 jbd mbcache dm_mirror dm_snapshot dm_mod sd_mod ide_cd 
cdrom pcnet32 mii mptspi mptscsih mptbase scsi_transport_spi scsi_mod 
piix generic ide_core thermal processor fan
Mar 17 06:26:25 srv128bh kernel: EIP:    0060:[<f8c2f4c2>]    Tainted: 
P      VLI
Mar 17 06:26:25 srv128bh kernel: EFLAGS: 00010296   (2.6.18-4-k7 #1)

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: ------------[ cut here ]------------

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: kernel BUG at 
/home/jnhon/redhat-cluster/linux-modules-extra-2.6-2.6.18/debian/build/build_i386_none_k7_redhat-cluster/gfs/dlm/lock.c:428!

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: esi: dfdd5080   edi: f28f01c0   ebp: 00000001   esp: 
f4feddd8

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: ds: 007b   es: 007b   ss: 0068

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: Process find (pid: 5178, ti=f4fec000 task=c214caa0 
task.ti=f4fec000)

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: Stack: f8c33955 00000000 202001c0 20202020 20203220 
20202020 20202020 20202020

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel:        00183731 f28f01c0 00000000 00000003 f28f01c0 
f8c2f61b 00000003 f8c36c60

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: invalid opcode: 0000 [#1]

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: SMP

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: CPU:    0

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: EIP is at do_dlm_lock+0x133/0x14d [lock_dlm]

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel:        f8cf0000 00000000 f8cc5662 00000008 f5069468 
f8cf0000 00000008 00000003

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: Call Trace:

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: Code: 2c 50 0f bf 47 24 50 53 ff 77 08 ff 77 04 ff 77 
0c ff 76 18 68 c2 39 c3 f8 e8 b8 e5 4e c7 83 c4 38 68 55 39 c3 f8 e8 ab 
e5 4e c7 <0f> 0b ac 01 4d 38 c3 f8 68 57 39 c3 f8 e8 d4 db 4e c7 83 c4 20

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: EIP: [<f8c2f4c2>] do_dlm_lock+0x133/0x14d [lock_dlm] 
SS:ESP 0068:f4feddd8

Message from syslogd at srv128bh at Sat Mar 17 06:26:25 2007 ...
srv128bh kernel: eax: 00000004   ebx: ffffffea   ecx: ffffffff   edx: 
fffff696
Mar 17 06:26:25 srv128bh kernel:  [<f8c2f61b>] lm_dlm_lock+0xb6/0xc0 
[lock_dlm]
Mar 17 06:26:25 srv128bh kernel:  [<f8cc5662>] gfs_lm_lock+0x35/0x4d [gfs]
Mar 17 06:26:25 srv128bh kernel:  [<f8cbe220>] 
gfs_glock_xmote_th+0x113/0x151 [gfs]
Mar 17 06:26:25 srv128bh kernel:  [<f8cbccd6>] run_queue+0x221/0x2e4 [gfs]
Mar 17 06:26:25 srv128bh kernel:  [<f8cbd329>] gfs_glock_nq+0x353/0x3a4 
[gfs]
Mar 17 06:26:25 srv128bh kernel:  [<f8cbd38d>] 
gfs_glock_nq_init+0x13/0x26 [gfs]
Mar 17 06:26:25 srv128bh kernel:  [<f8ccfb17>] gfs_getattr+0x37/0x58 [gfs]
Mar 17 06:26:25 srv128bh kernel:  [<f8ccfae0>] gfs_getattr+0x0/0x58 [gfs]
Mar 17 06:26:25 srv128bh kernel:  [<c016240b>] vfs_getattr+0x40/0x99
Mar 17 06:26:25 srv128bh kernel:  [<c016248b>] vfs_lstat_fd+0x27/0x39
Mar 17 06:26:25 srv128bh kernel:  [<c01592df>] do_filp_open+0x2b/0x31
Mar 17 06:26:25 srv128bh kernel:  [<c01624e2>] sys_lstat64+0xf/0x23
Mar 17 06:26:25 srv128bh kernel:  [<c016e2cb>] dput+0x1a/0x119
Mar 17 06:26:25 srv128bh kernel:  [<c015b4fa>] __fput+0x11c/0x13f
Mar 17 06:26:25 srv128bh kernel:  [<c01717c2>] mntput_no_expire+0x11/0x68
Mar 17 06:26:25 srv128bh kernel:  [<c0158fa1>] filp_close+0x4e/0x54

As soon as I now touch the filesystem on the 'reporting' node it simply 
freezes (it's pingable but any filesystem-access seems to hang).

One of the things that I'm experiencing is that fenced never gets 
'invoked'. I've (for now) setup manual fencing and have tried with 
others but it appears that the failing node never gets properly fenced.

My questions are:

1. Can clock-drift between the nodes cause the 'late heartbeat'?
2. Under what circumstances doesn't a node get fenced?

For completeness, my cluster.conf:

<?xml version="1.0"?>

<cluster name="gfs_ontwikkel" config_version="1">

<cman two_node="1" expected_votes="1"></cman>


<fencedevices>

<fencedevice name="manual" agent="fence_manual"/>

</fencedevices>


<clusternodes>

<clusternode name="srv128bh" votes="1" nodeid="srv128bh">

<fence>

<method name="single">

<device name="manual"/>

</method>

</fence>

</clusternode>


<clusternode name="srv129bh" votes="1" nodeid="srv129bh">

<fence>

<method name="single">

<device name="manual"/>

</method>

</fence>

</clusternode>

</clusternodes>


<rm>

<failoverdomains/>

<resources/>

</rm>

</cluster>


Thanks in advance,
Jeroen
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070317/3c86b137/attachment.htm>

From isplist at logicore.net  Sat Mar 17 17:17:27 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Sat, 17 Mar 2007 11:17:27 -0600
Subject: [Linux-cluster] Sharedroot or shared GFS? Flash Boot
Message-ID: <2007317111727.746342@leena>

I've looked around at the sharedroot projects and have been working on 
diskless blades that each share their own volume on a storage device.

I have two questions and I'm not sure if this is the place to ask them so 
please re-direct me to another list if this isn't the place.

First, sharedroot and shared GFS space aren't the same, this I know. However, 
could GFS be used as a shared root for multiple servers which are identical? 
They all do the same things, they are all configured identically except for 
their IP. So, why could they not all boot off of one single GFS partition?

Second, I have 32 volumes on a storage device. I have a flash drive for the 
/boot partition only. The system boots off of the lash card then uses one of 
the volumes as it's own OS swap and root partitions. Rather than installing 
over and over again, is there some way of making a copy of one of the volumes 
and using that to mirror the rest? Then I'd only need to modify the grub.conf 
file to tell each blade what it's volume disk is. 

And last, just out of curiosity. I can install /boot on my DOM based flash 
cards but the CF based ones give this error (or similar);

Assertion (cyl_size <=255-63) at disk_dos,c:556 in Function 
probe_partition_for_geom() Failed

I tried pre-formatting the disk to DOS, would it help to pre-format it to 
Linux or am I missing something?

Thanks in advance.

Mike





From rainer at ultra-secure.de  Sat Mar 17 16:18:12 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Sat, 17 Mar 2007 17:18:12 +0100
Subject: [Linux-cluster] GFS and RHCS will be included now in
	RHEL4	Advanced Platform ?
In-Reply-To: <45FAE88E.6010100@redhat.com>
References: <479149.36421.qm@web50606.mail.re2.yahoo.com>
	<45FAE88E.6010100@redhat.com>
Message-ID: <45FC14C4.6000805@ultra-secure.de>

Rob Kenna wrote:
> The RHEL 5 Advanced Platform provides a great deal, both in terms of 
> functionality and price.  Advanced Platform only exists on RHEL 5, 
> there is no RHEL 4 Advanced Platform. It does not include the 
> entitlements to run GFS & Cluster Suite on RHEL 4.  RHEL 4 packaging 
> is the same as before.
>
> If you have a RHEL 4 AS entitlement you can call Red Hat and upgrade, 
> for no charge, into the RHEL 5 AP platform which includes GFS & 
> Cluster Suite.
>
>


Do you have an idea when we will see various hardware being certified 
for RHEL5?
Specifically HP EVA storage and Qlogic HBAs.
Currently, the HP download pages offer "only" drivers for RHEL4 (but at 
least both for x86-64 and i386).



cheers,
Rainer



From ianbrn at gmail.com  Sun Mar 18 16:02:40 2007
From: ianbrn at gmail.com (Ian Brown)
Date: Sun, 18 Mar 2007 18:02:40 +0200
Subject: [Linux-cluster] Invalid cross-device link when building cluster-2
Message-ID: <d0383f90703180902j67d9a824rb51cbaee0676e908@mail.gmail.com>

Hello,

I encountered a problem with building of cluster2:

According to the doc/usage.txt, I ran:

svn checkout http://svn.osdl.org/openais
  cd openais/branches/whitetank
  make; make install DESTDIR=/

and also
  downloaded udev-094.tar.bz2, and ran
   make EXTRAS="extras/volume_id" install
	
Then, cd to cluster-2.00.00 and ran:	
./configure --kernel_src=path to sources of linux-2.6.20.2
and then

make install

after some time I see:

make[2]: Entering directory `/work/src/cluster-2.00.00/gfs2/mkfs'
install -m 0755 mkfs.gfs2 //sbin
ln -f mkfs.gfs2 //sbin/gfs2_jadd
ln: creating hard link `//sbin/gfs2_jadd' to `mkfs.gfs2': Invalid
cross-device link

Any idea what can be wrong here ?
regards,
Ian



From nirmaltom at hotmail.com  Sun Mar 18 18:26:00 2007
From: nirmaltom at hotmail.com (nirmal tom)
Date: Sun, 18 Mar 2007 23:56:00 +0530
Subject: [Linux-cluster] make errors for 2.6.19 on fc6
Message-ID: <BAY136-F234D32FA8137FF10B5D7DCAC770@phx.gbl>

hi,
I am running fc6 and my  kernel version is 2.6.19-1.2911.fc6.I installed 
openais succesfully and i thind i dont need to install libvolume_id as i am 
running fc6,it already contains udev-095-14

i downloaded development version using
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster
and now when i run make , it fetches me the error
make -C dlm_controld all
make[2]: Entering directory `/opt/cluster/group/dlm_controld'
gcc -Wall  -g -I//usr/include -I../config -idirafter 
/usr/src/kernels/2.6.19-1.2911.fc6-i686/include/linux -I../../group/lib/ 
-I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c -o main.o main.c
main.c: In function ?setup_uevent?:
main.c:188: error: ?NETLINK_KOBJECT_UEVENT? undeclared (first use in this 
function)
main.c:188: error: (Each undeclared identifier is reported only once
main.c:188: error: for each function it appears in.)
make[2]: *** [main.o] Error 1
make[2]: Leaving directory `/opt/cluster/group/dlm_controld'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/opt/cluster/group'
make: *** [all] Error 2

As for stable verion is consent, i got the same error.

Thanks for any response.

regards,
Nirmal Tom.




From nirmaltom at hotmail.com  Sun Mar 18 18:26:48 2007
From: nirmaltom at hotmail.com (nirmal tom)
Date: Sun, 18 Mar 2007 23:56:48 +0530
Subject: [Linux-cluster] make errors for 2.6.19 on fc6
Message-ID: <BAY136-F230B4C6F61A8AF2578B0F1AC770@phx.gbl>

hi,
I am running fc6 and my  kernel version is 2.6.19-1.2911.fc6.I installed 
openais succesfully and i thind i dont need to install libvolume_id as i am 
running fc6,it already contains udev-095-14

i downloaded development version using
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster
and now when i run make , it fetches me the error
make -C dlm_controld all
make[2]: Entering directory `/opt/cluster/group/dlm_controld'
gcc -Wall  -g -I//usr/include -I../config -idirafter 
/usr/src/kernels/2.6.19-1.2911.fc6-i686/include/linux -I../../group/lib/ 
-I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c -o main.o main.c
main.c: In function ?setup_uevent?:
main.c:188: error: ?NETLINK_KOBJECT_UEVENT? undeclared (first use in this 
function)
main.c:188: error: (Each undeclared identifier is reported only once
main.c:188: error: for each function it appears in.)
make[2]: *** [main.o] Error 1
make[2]: Leaving directory `/opt/cluster/group/dlm_controld'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/opt/cluster/group'
make: *** [all] Error 2

As for stable verion is consent, i got the same error.

Thanks for any response.

regards,
Nirmal Tom.




From nirmaltom at hotmail.com  Sun Mar 18 18:26:30 2007
From: nirmaltom at hotmail.com (nirmal tom)
Date: Sun, 18 Mar 2007 23:56:30 +0530
Subject: [Linux-cluster] make errors for 2.6.19 on fc6
Message-ID: <BAY136-F270D1C51A71DC1CFE05958AC770@phx.gbl>

hi,
I am running fc6 and my  kernel version is 2.6.19-1.2911.fc6.I installed 
openais succesfully and i thind i dont need to install libvolume_id as i am 
running fc6,it already contains udev-095-14

i downloaded development version using
cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster
and now when i run make , it fetches me the error
make -C dlm_controld all
make[2]: Entering directory `/opt/cluster/group/dlm_controld'
gcc -Wall  -g -I//usr/include -I../config -idirafter 
/usr/src/kernels/2.6.19-1.2911.fc6-i686/include/linux -I../../group/lib/ 
-I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c -o main.o main.c
main.c: In function ?setup_uevent?:
main.c:188: error: ?NETLINK_KOBJECT_UEVENT? undeclared (first use in this 
function)
main.c:188: error: (Each undeclared identifier is reported only once
main.c:188: error: for each function it appears in.)
make[2]: *** [main.o] Error 1
make[2]: Leaving directory `/opt/cluster/group/dlm_controld'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/opt/cluster/group'
make: *** [all] Error 2

As for stable verion is consent, i got the same error.

Thanks for any response.

regards,
Nirmal Tom.




From Phillip.Brown at hp.com  Sun Mar 18 20:30:37 2007
From: Phillip.Brown at hp.com (Brown, Phillip)
Date: Sun, 18 Mar 2007 20:30:37 -0000
Subject: [Linux-cluster] GFS 6.1 Example of a working fence.ccs and
	nodes.ccs when using RILOE 
Message-ID: <0B8C713E581B2747B2AD4220537D15215580ED@G3W0639.americas.hpqcorp.net>

Hey folks:

Some Background
===============

For customers using GFS who would like to setup fencing with the RILOE
method: how to setup fencing depends on the device and method being
used. That's in Section 10.2 of the GFS 6.0 doc - Fencing Methods

 
http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-fence-methods.ht
ml

RILOE is there and supported, obviously. The documentation for GFS 6.0
instructs the use of the deprecated fence_rib fencing agent for the
RILOE. 

The examples there for the RILOE method are good in the 6.0 doc:

 Format for fence.ccs:
 
http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-sf-fence.html#FI
G-FENCE-RIB

 Example of a RILOE fence.ccs:
 
http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-sf-fence.html#EX
-FENCE-RILOE

 Format for nodes.ccs:
 
http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-sf-nodes.html#FI
G-SF-NODESRILEOSINGLE

 Example of a RILOE nodes.ccs
 
http://www.redhat.com/docs/manuals/csgfs/admin-guide/s1-sf-nodes.html#EX
-SF-RILOE

As you know, I'm sure, the fence_rib agent has been deprecated. The
fence_ilo agent should be used instead, according to multiple references
including the man pages for both.

I hoped to find some explicit examples of the two files in the
documentation for GFS 6.1 using fence_ilo, but the documentation has
changed. Specifically, the GFS Admin Guide has a short: "Fencing is
configured and managed in Red Hat Cluster Suite. For more information
about fencing options, refer to Red Hat Cluster Suite Configuring and
Managing a Cluster."

That particular section is here:
 
http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/s1-config-fence
-devices.html

You'll note that it's not anywhere near the detail as the above GFS 6.0
documentation, as it's GUI-based. 

The man page for fence_ilo(8) indicates the following fields are used:

  action = The action required. reboot (default), off, on or status. 
  agent = This option is used by fence_node(8) and is ignored by
fence_ilo. 
  hostname = IP address or hostname of the iLO card. 
  login = Login name. 
  passwd = Password for login. 
  ribcl = RIBCL protocol to use. Default is to autodetect. 
  verbose = Verbose mode.

Since agent is ignored, looks like we can simply add the "action"
parameter to the fence.ccs? I guess action is analogous to the drop-down
menu at the top of the GUI? 

And ribcl and verbose can safely be left alone?  They're not in the GUI
fields.


Questions
=========

So, I'm wondering: does anyone have an example of a working fence.ccs
and nodes.ccs for GFS 6.1 when using RILOE?  I don't have a GFS 6.1
config to work on just now. 

Also, it appears that this is the way to test?

 # fence_ilo -v -a ipaddr -l login -p passwd -o off
 

Kind regards

Phil





From rbravo at di.uc3m.es  Sun Mar 18 22:58:03 2007
From: rbravo at di.uc3m.es (Rafael Bravo)
Date: Sun, 18 Mar 2007 23:58:03 +0100
Subject: [Linux-cluster] make errors for 2.6.19 on fc6
In-Reply-To: <BAY136-F270D1C51A71DC1CFE05958AC770@phx.gbl>
References: <BAY136-F270D1C51A71DC1CFE05958AC770@phx.gbl>
Message-ID: <45FDC3FB.9060908@di.uc3m.es>

nirmal tom wrote:

> hi,
> I am running fc6 and my  kernel version is 2.6.19-1.2911.fc6.I 
> installed openais succesfully and i thind i dont need to install 
> libvolume_id as i am running fc6,it already contains udev-095-14
>
> i downloaded development version using
> cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster
> and now when i run make , it fetches me the error
> make -C dlm_controld all
> make[2]: Entering directory `/opt/cluster/group/dlm_controld'
> gcc -Wall  -g -I//usr/include -I../config -idirafter 
> /usr/src/kernels/2.6.19-1.2911.fc6-i686/include/linux 
> -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c 
> -o main.o main.c
> main.c: In function ?setup_uevent?:
> main.c:188: error: ?NETLINK_KOBJECT_UEVENT? undeclared (first use in 
> this function)
> main.c:188: error: (Each undeclared identifier is reported only once
> main.c:188: error: for each function it appears in.)
> make[2]: *** [main.o] Error 1
> make[2]: Leaving directory `/opt/cluster/group/dlm_controld'
> make[1]: *** [all] Error 2
> make[1]: Leaving directory `/opt/cluster/group'
> make: *** [all] Error 2
>
> As for stable verion is consent, i got the same error.

    Try...
    # cd /usr/include
    # mv linux linux.old
    # ln -s /usr/src/kernels/2.6.19-1.2911.fc6-i686/include/linux linux
    Regards ;-)  

>
> Thanks for any response.
>
> regards,
> Nirmal Tom.
>
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




From srramasw at cisco.com  Mon Mar 19 03:14:49 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Sun, 18 Mar 2007 20:14:49 -0700
Subject: [Linux-cluster] make errors for 2.6.19 on fc6
In-Reply-To: <45FDC3FB.9060908@di.uc3m.es>
Message-ID: <B14199FA0DBAAF4AA89E83EB41D35435033D15D3@xmb-sjc-22c.amer.cisco.com>

My distro is an old 2.6.9. So I ended up fixing the Makefile in that dir
to solve the problem.

[root at cfs1 dlm_controld]$ diff -cpt Makefile.ORG Makefile
*** Makefile.ORG        2006-08-11 08:18:15.000000000 -0700
--- Makefile    2007-02-27 15:08:39.000000000 -0800
*************** include ${top_srcdir}/make/defines.mk
*** 17,23 ****
  
  CFLAGS+= -g -I${incdir} -I${top_srcdir}/config
  
! CFLAGS+= -idirafter ${KERNEL_SRC}/include/linux \
          -I../../group/lib/ \
          -I../../ccs/lib/ \
          -I../../cman/lib/ \
--- 17,23 ----
  
  CFLAGS+= -g -I${incdir} -I${top_srcdir}/config
  
! CFLAGS+= -I${KERNEL_SRC}/include/ \
          -I../../group/lib/ \
          -I../../ccs/lib/ \


Also did the same for cluster-2.00.00/group/gfs_controld/Makefile for
similar error.

Thanks,
Sridhar 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Rafael Bravo
> Sent: Sunday, March 18, 2007 3:58 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] make errors for 2.6.19 on fc6
> 
> nirmal tom wrote:
> 
> > hi,
> > I am running fc6 and my  kernel version is 2.6.19-1.2911.fc6.I 
> > installed openais succesfully and i thind i dont need to install 
> > libvolume_id as i am running fc6,it already contains udev-095-14
> >
> > i downloaded development version using
> > cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster
> > and now when i run make , it fetches me the error
> > make -C dlm_controld all
> > make[2]: Entering directory `/opt/cluster/group/dlm_controld'
> > gcc -Wall  -g -I//usr/include -I../config -idirafter 
> > /usr/src/kernels/2.6.19-1.2911.fc6-i686/include/linux 
> > -I../../group/lib/ -I../../ccs/lib/ -I../../cman/lib/ 
> -I../include/ -c 
> > -o main.o main.c
> > main.c: In function 'setup_uevent':
> > main.c:188: error: 'NETLINK_KOBJECT_UEVENT' undeclared 
> (first use in 
> > this function)
> > main.c:188: error: (Each undeclared identifier is reported only once
> > main.c:188: error: for each function it appears in.)
> > make[2]: *** [main.o] Error 1
> > make[2]: Leaving directory `/opt/cluster/group/dlm_controld'
> > make[1]: *** [all] Error 2
> > make[1]: Leaving directory `/opt/cluster/group'
> > make: *** [all] Error 2
> >
> > As for stable verion is consent, i got the same error.
> 
>     Try...
>     # cd /usr/include
>     # mv linux linux.old
>     # ln -s 
> /usr/src/kernels/2.6.19-1.2911.fc6-i686/include/linux linux
>     Regards ;-)  
> 
> >
> > Thanks for any response.
> >
> > regards,
> > Nirmal Tom.
> >
> >
> > -- 
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From srramasw at cisco.com  Mon Mar 19 03:26:06 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Sun, 18 Mar 2007 20:26:06 -0700
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <1174124193.2897.18.camel@localhost.localdomain>
Message-ID: <B14199FA0DBAAF4AA89E83EB41D35435033D15D9@xmb-sjc-22c.amer.cisco.com>

Thanks Kevin! That helps.
 
- Sridhar


________________________________

	From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kevin Anderson
	Sent: Saturday, March 17, 2007 2:37 AM
	To: linux clustering
	Subject: RE: [Linux-cluster] Kernel oops when GFS2 used as
localfs
	
	
	Just opened up bugzilla 229831, you should be able to see it
now.
	
	Kevin
	
	On Sat, 2007-03-17 at 00:11 -0600, David wrote: 

		Once you get it, feel free to share with all of us.
		
		David 
		
		-----Original Message-----
		From: linux-cluster-bounces at redhat.com
		[mailto:linux-cluster-bounces at redhat.com] On Behalf Of
Sridhar Ramaswamy
		(srramasw)
		Sent: Wednesday, March 14, 2007 12:20 PM
		To: linux clustering
		Subject: RE: [Linux-cluster] Kernel oops when GFS2 used
as localfs
		
		
		Hi folks,
		
		This bugzilla 229831 seems "restricted" :(  Can someone
please forward
		its details? I'm interested in understanding the actual
problem and its
		proposed fix.
		
		Meanwhile I'll try an upstream kernel. Should 2.6.21.rc1
be fine?
		
		Thanks,
		Sridhar
		
		> -----Original Message-----
		> From: linux-cluster-bounces at redhat.com
		> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of
Wendy Cheng
		> Sent: Wednesday, March 14, 2007 6:33 AM
		> To: linux clustering
		> Subject: Re: [Linux-cluster] Kernel oops when GFS2
used as localfs
		> 
		> Steven Whitehouse wrote:
		> > Hi,
		> >
		> > This looks like Red Hat bugzilla 229831 and if so
then
		> there is already
		> > a fix in Linus' upstream kernel and in the latest
kernel build for 
		> > Fedora (both 5 and 6),
		> >
		> >   
		> yeah... agree. I missed the unlink part of the trace
in previous post
		> and thought it might have something to do with my ino
number 
		> hacking in 
		> the lookup code.
		> 
		> -- Wendy
		> > Steve.
		> >
		> > On Wed, 2007-03-14 at 00:15 -0700, Sridhar Ramaswamy
		> (srramasw) wrote:
		> >   
		> >> Sure Wendy. Here it is,
		> >>
		> >> "fs/gfs2/inode.c" 1256
		> >> struct inode *gfs2_ilookup(struct super_block *sb,
struct
		> gfs2_inum_host
		> >> *inum)
		> >> {
		> >>         return ilookup5(sb, (unsigned
long)inum->no_formal_ino,
		> >>                         iget_test, inum);
		> >> }
		> >>
		> >> static struct inode *gfs2_iget(struct super_block
*sb, struct 
		> >> gfs2_inum_host *inum) {
		> >>         return iget5_locked(sb, (unsigned
long)inum->no_formal_ino,
		> >>                      iget_test, iget_set, inum);
		> >> }
		> >>
		> >> BTW - this code is from kernel version 2.6.20.1.
		> >>
		> >> Thanks,
		> >> Sridhar
		> >>
		> >>     
		> >>> -----Original Message-----
		> >>> From: linux-cluster-bounces at redhat.com
		> >>> [mailto:linux-cluster-bounces at redhat.com] On
Behalf Of Wendy Cheng
		> >>> Sent: Tuesday, March 13, 2007 10:29 PM
		> >>> To: linux clustering
		> >>> Subject: Re: [Linux-cluster] Kernel oops when GFS2
used as localfs
		> >>>
		> >>> Sridhar Ramaswamy (srramasw) wrote:
		> >>>
		> >>>       
		> >>>> Mar 13 15:25:56 cfs1 kernel: ------------[ cut
here ]------------
		
		> >>>> Mar 13 15:25:56 cfs1 kernel: kernel BUG at
fs/gfs2/meta_io.c:474!
		> >>>>         
		> >>> I don't have time to pull kernel.org source code
right
		> now. Could you
		> >>> cut-and-paste the following two routines (from
fs/gfs2/inode.c):
		> >>> gfs2_ilookup() and gfs2_iget() so we can look into
this
		> >>> quickly ? They 
		> >>> are all one-liner so shouldn't be too much
troubles. GFS2 
		> >>> currently has 
		> >>> ino issue with its lookup code.
		> >>>
		> >>> -- Wendy
		> >>>
		> >>>       
		> >>>> Mar 13 15:25:56 cfs1 kernel: invalid opcode: 0000
[#1] Mar 13 
		> >>>> 15:25:56 cfs1 kernel: SMP Mar 13 15:25:56 cfs1
kernel: Modules 
		> >>>> linked in: lock_nolock gfs2 reiserfs nfsd
exportfs nfs lockd 
		> >>>> nfs_acl ipv6 parport_pc
		> lp parport
		> >>>> autofs4 sunrpc dm_mirror dm_mod button battery ac
		> uhci_hcd ehci_hcd
		> >>>> intel_rng rng_core i2c_i801 i2c_core e1000 e100
mii
		> floppy ext3 jbd
		> >>>> Mar 13 15:25:56 cfs1 kernel: CPU:    1
		> >>>> Mar 13 15:25:56 cfs1 kernel: EIP:
0060:[<e0c88da5>]    
		> >>>>         
		> >>> Not tainted VLI
		> >>>       
		> >>>> Mar 13 15:25:56 cfs1 kernel: EFLAGS: 00010246
(2.6.20.1 #1)
		> >>>> Mar 13 15:25:56 cfs1 kernel: EIP is at
		> >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
		> >>>> Mar 13 15:25:56 cfs1 kernel: eax: 00000000   ebx:

		> 00012bf5   ecx: 
		> >>>> ce5a6dd4   edx: dc5eae00
		> >>>> Mar 13 15:25:56 cfs1 kernel: esi: 00000000   edi:

		> 00000000   ebp: 
		> >>>> dc5ea9a8   esp: ce5a6d58
		> >>>> Mar 13 15:25:56 cfs1 kernel: ds: 007b   es: 007b
ss: 0068
		> >>>> Mar 13 15:25:56 cfs1 kernel: Process bonnie++
(pid: 5509,
		> >>>>         
		> >>> ti=ce5a6000
		> >>>       
		> >>>> task=dc320570 task.ti=ce5a6000)
		> >>>> Mar 13 15:25:56 cfs1 kernel: Stack: c156d274
ce5a6dd4 c016eb91
		> >>>> ce5a6dd4 00000000 dc5eae00 00000000 d6f34000
		> >>>> Mar 13 15:25:56 cfs1 kernel:        00000000
00000000 dc5ea9a8 
		> >>>> d57794a8 ce5a6e08 e0c83ad3 00012bf5 00000000
		> >>>> Mar 13 15:25:56 cfs1 kernel:        00000000
ce5a6da0 00000000 
		> >>>> 00000000 dc5ea9a8 d57794a8 e0c84cc9 ce5a6dd4
		> >>>> Mar 13 15:25:56 cfs1 kernel: Call Trace:
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016eb91>]
iget5_locked+0x3d/0x67
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c83ad3>] 
		> >>>> gfs2_inode_refresh+0x34/0xfe [gfs2]
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84cc9>] 
		> >>>>         
		> >>> gfs2_createi+0x12c/0x191 [gfs2]
		> >>>       
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c8da3c>]
		> >>>>         
		> >>> gfs2_create+0x5c/0x103 [gfs2]
		> >>>       
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c84be7>]
		> >>>>         
		> >>> gfs2_createi+0x4a/0x191 [gfs2]
		> >>>       
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<e0c820c4>]
		> >>>>         
		> >>> gfs2_glock_nq_num+0x3f/0x64
		> >>>       
		> >>>> [gfs2]
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016552e>]
vfs_create+0xc3/0x126 
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01657f2>]
		> >>>>         
		> >>> open_namei_create+0x47/0x88
		> >>>       
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c016597d>]
open_namei+0x14a/0x539
		
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d27b>]
do_filp_open+0x25/0x39
		
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c01d200a>]
		> >>>>         
		> >>> strncpy_from_user+0x3c/0x5b
		> >>>       
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d431>]
		> get_unused_fd+0xa8/0xb1
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d509>]
do_sys_open+0x42/0xbe 
		> >>>> Mar 13 15:25:56 cfs1 kernel:  [<c015d59f>]
sys_open+0x1a/0x1c Mar
		
		> >>>> 13 15:25:56 cfs1 kernel:  [<c015d5dd>]
sys_creat+0x1f/0x23 Mar 13
		
		> >>>> 15:25:56 cfs1 kernel:  [<c0103410>]
		> >>>>         
		> >>> sysenter_past_esp+0x5d/0x81
		> >>>       
		> >>>> Mar 13 15:25:56 cfs1 kernel:
======================= Mar 13 
		> >>>> 15:25:56 cfs1 kernel: Code: 80 a8 01 00 00 89 44
24
		> >>>>         
		> >>> 1c 8b 85 f0
		> >>>       
		> >>>> 01 00 00 c7 44 24 20 00 00 00 00 89 54 24 14 85
c0 89 44 24
		> >>>>         
		> >>> 18 c7 44
		> >>>       
		> >>>> 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83 7c 24 1c
00 75 04
		> >>>>         
		> >>> 0f 0b eb fe
		> >>>       
		> >>>> 8d 85 24 04 00 00
		> >>>> Mar 13 15:25:57 cfs1 kernel: EIP: [<e0c88da5>]
		> >>>> gfs2_meta_indirect_buffer+0x4c/0x278 [gfs2]
SS:ESP 0068:ce5a6d58
		> >>>>  
		> >>>>  
		> >>>> (2)
		> >>>>  
		> >>>> Mar 13 17:00:30 cfs1 kernel: Call Trace:
		> >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>]
		> >>>>         
		> >>> gfs2_unlink+0x53/0xe0 [gfs2]
		> >>>       
		> >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0baddc8>]
		> >>>>         
		> >>> gfs2_unlink+0x3a/0xe0 [gfs2]
		> >>>       
		> >>>> Mar 13 17:00:30 cfs1 kernel:  [<e0badde1>]
		> >>>>         
		> >>> gfs2_unlink+0x53/0xe0 [gfs2]
		> >>>       
		> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0166572>]
vfs_unlink+0xa1/0xc5 
		> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c016662b>]
do_unlinkat+0x95/0xf5 
		> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c01187de>]
		> scheduler_tick+0x8f/0x95
		> >>>> Mar 13 17:00:30 cfs1 kernel:  [<c0103410>]
		> >>>>         
		> >>> sysenter_past_esp+0x5d/0x81
		> >>>       
		> >>>> Mar 13 17:00:30 cfs1 kernel:
=======================
		> >>>>
		> >>>>
-------------------------------------------------------------
		> >>>>         
		> >>> -----------
		> >>>       
		> >>>> --
		> >>>> Linux-cluster mailing list
		> >>>> Linux-cluster at redhat.com 
		> >>>>
https://www.redhat.com/mailman/listinfo/linux-cluster
		> >>>>
		> >>>>         
		> >>> --
		> >>> Linux-cluster mailing list
		> >>> Linux-cluster at redhat.com 
		> >>>
https://www.redhat.com/mailman/listinfo/linux-cluster
		> >>>
		> >>>       
		> >> --
		> >> Linux-cluster mailing list
		> >> Linux-cluster at redhat.com 
		> >>
https://www.redhat.com/mailman/listinfo/linux-cluster
		> >>     
		> >
		> >   
		> 
		> --
		> Linux-cluster mailing list
		> Linux-cluster at redhat.com 
		> https://www.redhat.com/mailman/listinfo/linux-cluster
		> 
		
		--
		Linux-cluster mailing list
		Linux-cluster at redhat.com
		https://www.redhat.com/mailman/listinfo/linux-cluster
		
		--
		Linux-cluster mailing list
		Linux-cluster at redhat.com
		https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070318/88adcce7/attachment.htm>

From hlawatschek at atix.de  Mon Mar 19 08:57:34 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Mon, 19 Mar 2007 09:57:34 +0100
Subject: [Linux-cluster] Sharedroot or shared GFS? Flash Boot
In-Reply-To: <2007317111727.746342@leena>
References: <2007317111727.746342@leena>
Message-ID: <200703190957.34934.hlawatschek@atix.de>

> First, sharedroot and shared GFS space aren't the same, this I know.
> However, could GFS be used as a shared root for multiple servers which are
> identical? They all do the same things, they are all configured identically
> except for their IP. So, why could they not all boot off of one single GFS
> partition?
Yes, you can do this. 
That's why http://www.open-sharedroot.org/ originally has been created.
But it can do a lot more.
E.g. you can also create clusters from servers with different hardware setup, 
as fibre channel and network hbas will be automatically detected during the 
initrd bootsequence.
You can assign different root partitions for the cluster nodes. I.e. a cluster 
can have multiple root partitions and servers can be booted into different 
sub-clusters on demand. (e.g. a php cluster node can be booted as a database 
cluster node on demand)
 
> Second, I have 32 volumes on a storage device. I have a flash drive for the
> /boot partition only. The system boots off of the lash card then uses one
> of the volumes as it's own OS swap and root partitions. Rather than
> installing over and over again, is there some way of making a copy of one
> of the volumes and using that to mirror the rest? Then I'd only need to
> modify the grub.conf file to tell each blade what it's volume disk is.

open-sharedroot has a sub project called comoonics-enterprisecopy. With this 
software you can create bootable clones from either single node 
configurations or sharedroot clusters. 
They can be used to create identical bootable shadow installations, perform 
regular backups (right now tar files and legato networker client have beed 
integrated) or to deploy new single node servers or sharedroot clusters.

Mark 


-- 
Gruss / Regards,

** Visit us at CeBIT 2007 in Hannover/Germany **
** in Hall 5, Booth G48/2  (15.-21. of March) **

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany



From hlawatschek at atix.de  Mon Mar 19 09:48:21 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Mon, 19 Mar 2007 10:48:21 +0100
Subject: [Linux-cluster] rgmanager is ping-ponging service
In-Reply-To: <1174064666.13796.86.camel@asuka.boston.devel.redhat.com>
References: <200703161449.16182.hlawatschek@atix.de>
	<1174064666.13796.86.camel@asuka.boston.devel.redhat.com>
Message-ID: <200703191048.21688.hlawatschek@atix.de>

Hi Lon,

thanks for the answer. 
The patch you mentioned in you mail was already applied before the test. 

In my oppinion, the problem occurs when more than on IP addresses are 
configured in a service and only one of them fails.

>From my previous mail:
[snip]
Two IP resources configured:
<resources>
          <ip address="10.0.46.46" monitor_link="1"/>
          <ip address="10.226.3.80" monitor_link="1"/>
 </resources>

and a service:
<service autostart="1" domain="FD" name="VIP">
          <ip ref="10.0.46.46"/>
          <ip ref="10.226.3.80"/>
</service>

bond1 is configured 
- node1: 10.0.46.48
- node2: 10.0.46.47

bond2 is configured:
- node1: 10.226.3.82
- node2: 10.226.3.81

Now, if I shut down bond1 on node1. the service ping-pongs on node1
[snap]

Thanks,

Mark

On Friday 16 March 2007 18:04:25 Lon Hohberger wrote:
> On Fri, 2007-03-16 at 14:49 +0100, Mark Hlawatschek wrote:
> > [30905] notice: status on ip "10.0.46.46" returned 1 (generic error)
> > [30905] notice: Stopping service VIP
> > <debug>  10.0.46.46 is not configured
> > <info>   Removing IPv4 address 10.226.3.80 from bond2
>
> Noted here:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=222484
>
> Patch here:
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/rgmanager/src/resource
>s/ip.sh.diff?cvsroot=cluster&r1=text&tr1=1.22&r2=text&tr2=1.21&f=u
>
> Will be included in update 5 of RHCS4.
>

-- 
Gruss / Regards,

** Visit us at CeBIT 2007 in Hannover/Germany **
** in Hall 5, Booth G48/2  (15.-21. of March) **

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany




From nirmaltom at hotmail.com  Mon Mar 19 13:06:06 2007
From: nirmaltom at hotmail.com (nirmal tom)
Date: Mon, 19 Mar 2007 18:36:06 +0530
Subject: [Linux-cluster] make errors for 2.6.19 on fc6
In-Reply-To: <45FDC3FB.9060908@di.uc3m.es>
Message-ID: <BAY136-F4054B56ECA91DFBC864DCDAC760@phx.gbl>

hi,
Thanks a lot to both Sridhar and Rafael.Now gfs-kernel stucks, i think.

/opt/cluster/gfs-kernel/src/gfs/glock.c:41: error: field ?gr_work? has 
incomplete type
/opt/cluster/gfs-kernel/src/gfs/glock.c: In function ?greedy_work?:
/opt/cluster/gfs-kernel/src/gfs/glock.c:1762: warning: type defaults to 
?int? in declaration of ?__mptr?
/opt/cluster/gfs-kernel/src/gfs/glock.c:1762: warning: initialization from 
incompatible pointer type
/opt/cluster/gfs-kernel/src/gfs/glock.c: In function ?gfs_glock_be_greedy?:
/opt/cluster/gfs-kernel/src/gfs/glock.c:1817: error: implicit declaration of 
function ?INIT_DELAYED_WORK?
make[5]: *** [/opt/cluster/gfs-kernel/src/gfs/glock.o] Error 1
make[4]: *** [_module_/opt/cluster/gfs-kernel/src/gfs] Error 2
make[4]: Leaving directory `/usr/src/kernels/2.6.19-1.2911.fc6-i686'
make[3]: *** [all] Error 2
make[3]: Leaving directory `/opt/cluster/gfs-kernel/src/gfs'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/opt/cluster/gfs-kernel/src'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/opt/cluster/gfs-kernel'
make: *** [all] Error 2


on both stable and development, i got this.how to fix this one?

thanks and regards,
Nirmal Tom.

>From: Rafael Bravo <rbravo at di.uc3m.es>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: linux clustering <linux-cluster at redhat.com>
>Subject: Re: [Linux-cluster] make errors for 2.6.19 on fc6
>Date: Sun, 18 Mar 2007 23:58:03 +0100
>
>nirmal tom wrote:
>
>>hi,
>>I am running fc6 and my  kernel version is 2.6.19-1.2911.fc6.I installed 
>>openais succesfully and i thind i dont need to install libvolume_id as i 
>>am running fc6,it already contains udev-095-14
>>
>>i downloaded development version using
>>cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster checkout cluster
>>and now when i run make , it fetches me the error
>>make -C dlm_controld all
>>make[2]: Entering directory `/opt/cluster/group/dlm_controld'
>>gcc -Wall  -g -I//usr/include -I../config -idirafter 
>>/usr/src/kernels/2.6.19-1.2911.fc6-i686/include/linux -I../../group/lib/ 
>>-I../../ccs/lib/ -I../../cman/lib/ -I../include/ -c -o main.o main.c
>>main.c: In function ?setup_uevent?:
>>main.c:188: error: ?NETLINK_KOBJECT_UEVENT? undeclared (first use in this 
>>function)
>>main.c:188: error: (Each undeclared identifier is reported only once
>>main.c:188: error: for each function it appears in.)
>>make[2]: *** [main.o] Error 1
>>make[2]: Leaving directory `/opt/cluster/group/dlm_controld'
>>make[1]: *** [all] Error 2
>>make[1]: Leaving directory `/opt/cluster/group'
>>make: *** [all] Error 2
>>
>>As for stable verion is consent, i got the same error.
>
>    Try...
>    # cd /usr/include
>    # mv linux linux.old
>    # ln -s /usr/src/kernels/2.6.19-1.2911.fc6-i686/include/linux linux
>    Regards ;-)
>
>>
>>Thanks for any response.
>>
>>regards,
>>Nirmal Tom.
>>
>>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster




From Michael.Roethlein at ri-solution.com  Mon Mar 19 14:42:50 2007
From: Michael.Roethlein at ri-solution.com (=?iso-8859-1?Q?R=F6thlein_Michael_=28RI-Solution=29?=)
Date: Mon, 19 Mar 2007 15:42:50 +0100
Subject: [Linux-cluster] Filesystem Freeze
Message-ID: <992633B6A0E42B49BC5A41C10A8C841B05905561@MUCEX004.root.local>

Hello,
 
We use GFS for about 1 year now on a 4 node cluster connected to a SAN. In the past weeks nearly every day we had troubles desribed here: http://www.open-sharedroot.org/faq/troubleshooting-guide/system-failures/the-whole-cluster-freezes/ as "Freeze of filesystem"
> There you will not see anything in the syslogs. The only thing you will see is rapidly increasing load on the system an 
> rapidly increasing amounts of processes in the D State (means waiting for I/O). Here a lock is hung and any process 
> accessing the resource (normally directory) locked by this lock will end up in the processstate D. 
 
Is there any sure way to determine which node causes the lock or any fix for this issue? 
We use this cluster as web server running apache 2.

Thanks in advance

Yours

Michael



From rpeterso at redhat.com  Mon Mar 19 15:06:34 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Mon, 19 Mar 2007 10:06:34 -0500
Subject: [Linux-cluster] make errors for 2.6.19 on fc6
In-Reply-To: <BAY136-F4054B56ECA91DFBC864DCDAC760@phx.gbl>
References: <BAY136-F4054B56ECA91DFBC864DCDAC760@phx.gbl>
Message-ID: <45FEA6FA.9070500@redhat.com>

nirmal tom wrote:
> hi,
> Thanks a lot to both Sridhar and Rafael.Now gfs-kernel stucks, i think.
> 
> /opt/cluster/gfs-kernel/src/gfs/glock.c:41: error: field ?gr_work? has 
> incomplete type
> /opt/cluster/gfs-kernel/src/gfs/glock.c: In function ?greedy_work?:
> /opt/cluster/gfs-kernel/src/gfs/glock.c:1762: warning: type defaults to 
> ?int? in declaration of ?__mptr?
> /opt/cluster/gfs-kernel/src/gfs/glock.c:1762: warning: initialization 
> from incompatible pointer type
> /opt/cluster/gfs-kernel/src/gfs/glock.c: In function ?gfs_glock_be_greedy?:
> /opt/cluster/gfs-kernel/src/gfs/glock.c:1817: error: implicit 
> declaration of function ?INIT_DELAYED_WORK?
> make[5]: *** [/opt/cluster/gfs-kernel/src/gfs/glock.o] Error 1
> make[4]: *** [_module_/opt/cluster/gfs-kernel/src/gfs] Error 2
> make[4]: Leaving directory `/usr/src/kernels/2.6.19-1.2911.fc6-i686'
> make[3]: *** [all] Error 2
> make[3]: Leaving directory `/opt/cluster/gfs-kernel/src/gfs'
> make[2]: *** [all] Error 2
> make[2]: Leaving directory `/opt/cluster/gfs-kernel/src'
> make[1]: *** [all] Error 2
> make[1]: Leaving directory `/opt/cluster/gfs-kernel'
> make: *** [all] Error 2
> 
> 
> on both stable and development, i got this.how to fix this one?
> 
> thanks and regards,
> Nirmal Tom.

Hi Nirmal,

This has to do with changes made to the upstream kernels, which our STABLE
and HEAD branches of CVS track.  In 2.6.19, the "struct delayed_work" was 
"struct work_struct" and "INIT_DELAYED_WORK" was "INIT_WORK".
There was also a change in what was passed.  The following diff shows the 
change made to bring glock.c up to the newer kernels (HEAD branch):

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/gfs/glock.c.diff?r1=1.30&r2=1.31&cvsroot=cluster&f=h

If you really don't want to upgrade to a newer kernel, you could back out 
the last change I did, at least to glock.c.  To compare them, also see:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/gfs/glock.c?cvsroot=cluster

Regards,

Bob Peterson
Red Hat Cluster Suite



From isplist at logicore.net  Mon Mar 19 16:45:53 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Mon, 19 Mar 2007 10:45:53 -0600
Subject: [Linux-cluster] Install over FC SAN works, boot doesn't
Message-ID: <2007319104553.467418@leena>

I am able to install Linux over the FC network and onto a storage volume. 
However, when I try to boot these installs, I only get a blank screen and 
cannot boot.

I have found that I can install the /boot directory to a flash card on each 
blade and then am able to boot fine with swap and root on the remote volume. 

Why is this and how can I make my install 100% remote?

Also strange (to me at least), I have found that if I connect the multiple 
volume storage device to the rest of the FC network, already installed 
machines with Qlogic cards in them no longer boot. They come up with many 
errors and all finally stall at the syslog starting point. Disconnecting the 
new storage allows them to boot again.

I believe this has something to do with LUN masking/zoning (which I've yet to 
learn, have to start reading Brocade 2800 docs next) but what I don't get is 
why the new blades can see their own LUNS if installed as noted above with 
flash /boot, but old machines get stuck?

Rather confusing.

Mike





From Brian.Hartin at pearson.com  Mon Mar 19 15:39:48 2007
From: Brian.Hartin at pearson.com (Hartin, Brian)
Date: Mon, 19 Mar 2007 10:39:48 -0500
Subject: [Linux-cluster] Problem with SAN after migrating to RH cluster suite
Message-ID: <67208505BB6AB0459AB885493C6B412B017F76BD@IOWACSRVE3MX03.NCSP.PEROOT.COM>

Hello all,

I'm relatively new to Linux, so forgive me if this question seems off.

We recently moved from a cluster running RHEL 4/Veritas to a new cluster
running RHEL 4/Red Hat Cluster Suite.  In both cases, a SAN was
involved.

After migrating, we see a considerable increase in the time it takes to
mount the SAN.  Some of our init.d scripts fail because the SAN is not
up yet.  Our admin tried changing run levels to make the scripts run
later, but this doesn't help.  One can even log in via SSH shortly after
boot and the SAN is not yet mounted.  Could this be normal behavior?
When a service needs access to files on the SAN should it be started by
some cluster mechanism?  Or should we be looking for some underlying
problem?

Incidentally, the files on the SAN are not config files, they are data.
All config files are on local disk.

Thanks for any help,

B
**************************************************************************** 
This email may contain confidential material. 
If you were not an intended recipient, 
Please notify the sender and delete all copies. 
We may monitor email to and from our network. 
****************************************************************************



From maciej.bogucki at artegence.com  Mon Mar 19 16:38:57 2007
From: maciej.bogucki at artegence.com (Maciej Bogucki)
Date: Mon, 19 Mar 2007 17:38:57 +0100
Subject: [Linux-cluster] Re: GFS problem with du and df - RESOLVED
In-Reply-To: <45C7665B.7020703@artegence.com>
References: <45C7665B.7020703@artegence.com>
Message-ID: <45FEBCA1.7070604@artegence.com>

Maciej Bogucki napisa?(a):
> Hello,
> 
> I hava 5 node cluster with GFS filesystem. One of the partition is used
> to share indexes for Apache Lucene search engine -
> http://lucene.apache.org/java/docs/ across 5 application nodes.
> The problem is that "du" command show different output than "df".
> "du" output is correct, I'm sure that there is only 15M data.
> The strangest thins is that "df" output of used space grows from minute
> to minute.
> 
> Here is the output of du, df, and mount.
> [root at repo05 ~]# df /datafs/search-indexes/
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/mapper/cluster_116550929916-lucene
>                        4980192   3169396   1810796  64%
> /datafs/search-indexes
> [root at repo05 ~]# du -sh /datafs/search-indexes/
> 15M     /datafs/search-indexes/
> [root at repo05 ~]# mount | grep search
> /dev/mapper/cluster_116550929916-lucene on /datafs/search-indexes type
> gfs (rw,noatime,nodiratime)
> [root at repo05 ~]#
> 
> Any ideas?
> 
> Best Regards
> Maciej Bogucki
> 

Solution: gfs_tool reclaim

http://archives.free.net.ph/message/20061220.150843.80921ed1.en.html

Best Regards
Maciej Bogucki



From rpeterso at redhat.com  Mon Mar 19 17:42:14 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Mon, 19 Mar 2007 12:42:14 -0500
Subject: [Linux-cluster] Problem with SAN after migrating to RH cluster
	suite
In-Reply-To: <67208505BB6AB0459AB885493C6B412B017F76BD@IOWACSRVE3MX03.NCSP.PEROOT.COM>
References: <67208505BB6AB0459AB885493C6B412B017F76BD@IOWACSRVE3MX03.NCSP.PEROOT.COM>
Message-ID: <45FECB76.20408@redhat.com>

Hartin, Brian wrote:
> Hello all,
> 
> I'm relatively new to Linux, so forgive me if this question seems off.
> 
> We recently moved from a cluster running RHEL 4/Veritas to a new cluster
> running RHEL 4/Red Hat Cluster Suite.  In both cases, a SAN was
> involved.
> 
> After migrating, we see a considerable increase in the time it takes to
> mount the SAN.  Some of our init.d scripts fail because the SAN is not
> up yet.  Our admin tried changing run levels to make the scripts run
> later, but this doesn't help.  One can even log in via SSH shortly after
> boot and the SAN is not yet mounted.  Could this be normal behavior?
> When a service needs access to files on the SAN should it be started by
> some cluster mechanism?  Or should we be looking for some underlying
> problem?
> 
> Incidentally, the files on the SAN are not config files, they are data.
> All config files are on local disk.
> 
> Thanks for any help,
> 
> B
Hi Brian,

I'm not quite sure I understand what the problem is.  Under ordinary
circumstances, there should be no extra time required as far as I know.
If you have all of your cluster startup scripts in place and chkconfiged
on, then I think you should be able to mount immediately.
What init scripts are failing because of the SAN and what do they say
when they fail?

In theory, you should be able to have all of these init scripts turned 
"on" so they run at init time:

ccsd
cman
fenced

If you have GFS mount points in your /etc/inittab, you may also want to
enable:

gfs

(You can also check rgmanager if you're using rgmanager failover services
for High Availability).

You can check these by doing this command:

chkconfig --list | grep "ccsd\|cman\|fenced\|gfs"

So a gfs file system should be able to mount when the system is booting.
I don't recommend messing with the order of the scripts though.
I hope this helps.

Regards,

Bob Peterson
Red Hat Cluster Suite



From Brian.Hartin at pearson.com  Mon Mar 19 18:13:37 2007
From: Brian.Hartin at pearson.com (Hartin, Brian)
Date: Mon, 19 Mar 2007 13:13:37 -0500
Subject: [Linux-cluster] Problem with SAN after migrating to RH cluster
	suite
In-Reply-To: <45FECB76.20408@redhat.com>
References: <67208505BB6AB0459AB885493C6B412B017F76BD@IOWACSRVE3MX03.NCSP.PEROOT.COM>
	<45FECB76.20408@redhat.com>
Message-ID: <67208505BB6AB0459AB885493C6B412B017F76C4@IOWACSRVE3MX03.NCSP.PEROOT.COM>

Robert,

Thanks for replying.  In particular, the script for the Perforce server
(p4d) is failing with a 'no such file or directory' error.  However, if
one waits 30 seconds and runs /etc/init.d/p4d, it works.  We made sure
that this is not an application problem by changing /etc/init.d/p4d to
include other commands like 'ls' and 'touch' which also report that the
directory on the SAN is not available.

I ran the 'chkconfig' command you suggested and it produced

ccsd            0:off   1:off   2:on    3:on    4:on    5:on    6:off
fenced          0:off   1:off   2:on    3:on    4:on    5:on    6:off
cman            0:off   1:off   2:on    3:on    4:on    5:on    6:off

Does that shed any light?

Thanks for your help!

Brian Hartin


-----Original Message-----
From: Robert Peterson [mailto:rpeterso at redhat.com] 
Sent: Monday, March 19, 2007 12:42 PM
To: linux clustering
Cc: Hartin, Brian; Seeley, Joiey
Subject: Re: [Linux-cluster] Problem with SAN after migrating to RH
cluster suite

Hartin, Brian wrote:
> Hello all,
> 
> I'm relatively new to Linux, so forgive me if this question seems off.
> 
> We recently moved from a cluster running RHEL 4/Veritas to a new
cluster
> running RHEL 4/Red Hat Cluster Suite.  In both cases, a SAN was
> involved.
> 
> After migrating, we see a considerable increase in the time it takes
to
> mount the SAN.  Some of our init.d scripts fail because the SAN is not
> up yet.  Our admin tried changing run levels to make the scripts run
> later, but this doesn't help.  One can even log in via SSH shortly
after
> boot and the SAN is not yet mounted.  Could this be normal behavior?
> When a service needs access to files on the SAN should it be started
by
> some cluster mechanism?  Or should we be looking for some underlying
> problem?
> 
> Incidentally, the files on the SAN are not config files, they are
data.
> All config files are on local disk.
> 
> Thanks for any help,
> 
> B
Hi Brian,

I'm not quite sure I understand what the problem is.  Under ordinary
circumstances, there should be no extra time required as far as I know.
If you have all of your cluster startup scripts in place and chkconfiged
on, then I think you should be able to mount immediately.
What init scripts are failing because of the SAN and what do they say
when they fail?

In theory, you should be able to have all of these init scripts turned 
"on" so they run at init time:

ccsd
cman
fenced

If you have GFS mount points in your /etc/inittab, you may also want to
enable:

gfs

(You can also check rgmanager if you're using rgmanager failover
services
for High Availability).

You can check these by doing this command:

chkconfig --list | grep "ccsd\|cman\|fenced\|gfs"

So a gfs file system should be able to mount when the system is booting.
I don't recommend messing with the order of the scripts though.
I hope this helps.

Regards,

Bob Peterson
Red Hat Cluster Suite
**************************************************************************** 
This email may contain confidential material. 
If you were not an intended recipient, 
Please notify the sender and delete all copies. 
We may monitor email to and from our network. 
****************************************************************************



From nirmaltom at hotmail.com  Mon Mar 19 18:49:35 2007
From: nirmaltom at hotmail.com (nirmal tom)
Date: Tue, 20 Mar 2007 00:19:35 +0530
Subject: [Linux-cluster] make errors for 2.6.19 on fc6
In-Reply-To: <45FEA6FA.9070500@redhat.com>
Message-ID: <BAY136-F21F4CB5B18F2B6DEA3E1D9AC760@phx.gbl>

hi,
Thanks man.
The reason for not switching to vanilla(2.6.20 and i have it installed) is , 
i am running iscsi-target for 2.6.19.I got similar make errors for gfs and 
gfs2 programs.but i left them by marking the lines in make file as commands, 
as i installed gfs2-utils rpm from update yum repository and i am gonna to 
only use gfs2.The installation is successful.
but when i do,
[root at server ~]# dlm_controld
[root at server ~]#  modprobe lock_dlm
WARNING: Error inserting debugfs 
(/lib/modules/2.6.19-1.2911.fc6/kernel/fs/debugfs/debugfs.ko): Invalid 
module format
[root at server ~]# modprobe dlm
WARNING: Error inserting debugfs 
(/lib/modules/2.6.19-1.2911.fc6/kernel/fs/debugfs/debugfs.ko): Invalid 
module format

Also,
[root at server ~]# ccsd
[root at server ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... failed
/usr/sbin/cman_tool: ccsd is not running
                                                           [FAILED]
[root at server ~]# ccsd
Failed to create lockfile.
Hint: ccsd is already running.


where i am missing?

out of curiosity:you referenced me a web based diff tool at redhat site.how 
can i find the version of a source file with respec to kernel .For example: 
glock.c contains no version information inside it!


Thanks and regards,
Nirmal Tom.V




>From: Robert Peterson <rpeterso at redhat.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: linux clustering <linux-cluster at redhat.com>
>Subject: Re: [Linux-cluster] make errors for 2.6.19 on fc6
>Date: Mon, 19 Mar 2007 10:06:34 -0500
>
>nirmal tom wrote:
>>hi,
>>Thanks a lot to both Sridhar and Rafael.Now gfs-kernel stucks, i think.
>>
>>/opt/cluster/gfs-kernel/src/gfs/glock.c:41: error: field ?gr_work? has 
>>incomplete type
>>/opt/cluster/gfs-kernel/src/gfs/glock.c: In function ?greedy_work?:
>>/opt/cluster/gfs-kernel/src/gfs/glock.c:1762: warning: type defaults to 
>>?int? in declaration of ?__mptr?
>>/opt/cluster/gfs-kernel/src/gfs/glock.c:1762: warning: initialization from 
>>incompatible pointer type
>>/opt/cluster/gfs-kernel/src/gfs/glock.c: In function 
>>?gfs_glock_be_greedy?:
>>/opt/cluster/gfs-kernel/src/gfs/glock.c:1817: error: implicit declaration 
>>of function ?INIT_DELAYED_WORK?
>>make[5]: *** [/opt/cluster/gfs-kernel/src/gfs/glock.o] Error 1
>>make[4]: *** [_module_/opt/cluster/gfs-kernel/src/gfs] Error 2
>>make[4]: Leaving directory `/usr/src/kernels/2.6.19-1.2911.fc6-i686'
>>make[3]: *** [all] Error 2
>>make[3]: Leaving directory `/opt/cluster/gfs-kernel/src/gfs'
>>make[2]: *** [all] Error 2
>>make[2]: Leaving directory `/opt/cluster/gfs-kernel/src'
>>make[1]: *** [all] Error 2
>>make[1]: Leaving directory `/opt/cluster/gfs-kernel'
>>make: *** [all] Error 2
>>
>>
>>on both stable and development, i got this.how to fix this one?
>>
>>thanks and regards,
>>Nirmal Tom.
>
>Hi Nirmal,
>
>This has to do with changes made to the upstream kernels, which our STABLE
>and HEAD branches of CVS track.  In 2.6.19, the "struct delayed_work" was 
>"struct work_struct" and "INIT_DELAYED_WORK" was "INIT_WORK".
>There was also a change in what was passed.  The following diff shows the 
>change made to bring glock.c up to the newer kernels (HEAD branch):
>
>http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/gfs/glock.c.diff?r1=1.30&r2=1.31&cvsroot=cluster&f=h
>
>If you really don't want to upgrade to a newer kernel, you could back out 
>the last change I did, at least to glock.c.  To compare them, also see:
>
>http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/gfs-kernel/src/gfs/glock.c?cvsroot=cluster
>
>Regards,
>
>Bob Peterson
>Red Hat Cluster Suite
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster




From lshen at cisco.com  Mon Mar 19 20:05:23 2007
From: lshen at cisco.com (Lin Shen (lshen))
Date: Mon, 19 Mar 2007 13:05:23 -0700
Subject: [Linux-cluster] Can GFS2 be compiled on top of 2.6.11?
Message-ID: <08A9A3213527A6428774900A80DBD8D803A9A595@xmb-sjc-222.amer.cisco.com>

I need to run GFS2 on top of 2.6.11. Is it possible?

Lin 



From pcaulfie at redhat.com  Mon Mar 19 09:14:50 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 19 Mar 2007 09:14:50 +0000
Subject: [Linux-cluster] CMAN cluster event notification broken? (fwd)
In-Reply-To: <49622.192.168.1.140.1174070249.squirrel@mail.max-t.com>
References: <Pine.LNX.4.63.0703161433460.10489@laurentian>
	<49622.192.168.1.140.1174070249.squirrel@mail.max-t.com>
Message-ID: <45FE548A.5010905@redhat.com>

Ashutosh Rajekar wrote:
> I just rebuilt RPMs from the source RPMs of RHEL5 released a few days ago
> from
> (ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/cman-2.0.60-1.el5.src.rpm)
> 
> cman_start_notification() still seems broken, it doesn't even trigger the
> callback registered with cman_start_notification().

cmn_start_notification doesn't trigger any callbacks itself. it simple enables
callbacks to happen when they occur.


> In addition to that, monitoring the fd returned by cman_get_fd() through
> select returns no events when I force other nodes to close their
> applications. Is application level notification broken or unsupported in
> this release?

cman doesn't know anything about applications. It only deals with nodes. If you
need application failover look at rgmanager.

-- 

patrick



From fajarpri at arinet.org  Mon Mar 12 13:12:05 2007
From: fajarpri at arinet.org (Fajar Priyanto)
Date: Mon, 12 Mar 2007 20:12:05 +0700
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
Message-ID: <200703122012.06172.fajarpri@arinet.org>

Hi All,
I'd like to setup RHCS for Postgres on 2-nodes cluster.
What is the best way to setup the resources? No GFS.
- Do I need to setup a script for postgres init.d? Or should I just let it on 
on both server from chkconfig?

Thanks.
-- 
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial 
http://linux2.arinet.org
8:09pm up 12:12, 2.6.18.2-34-default GNU/Linux 
Let's use OpenOffice. http://www.openoffice.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070312/3edc31c2/attachment.sig>

From christian.brandes at forschungsgruppe.de  Wed Mar 14 10:53:11 2007
From: christian.brandes at forschungsgruppe.de (Christian Brandes)
Date: Wed, 14 Mar 2007 11:53:11 +0100
Subject: [Linux-cluster] NFSv4 with ACL-Support on GFS
Message-ID: <45F7D417.8010008@forschungsgruppe.de>

Dear readers,

I would like to set up an active/active cluster of redundant NFSv4 
servers with ACLs that have the same GFS file system exported at the 
same time.
Is that supported?
What versions are needed minimum:
	Kernel
	GFS-Tools
	NFS-Tools ...

At the moment I have a two node cluster of test servers with a GFS 
exported over NFSv4, but I can not get or set ACLs with getfacl or 
setfacl, which is the same behavior I had with samba.

Some weeks ago I tried two SAMBA servers on GFS, though it was said not 
to work in the Cluster FAQ.
It seemed to work -- but no ACLs.

Do you have any ideas?

Best regards
	Christian

-- 
____________________________________________________________

Christian Brandes

FGW Forschungsgruppe Wahlen Telefonfeld GmbH
N7, 13-15
68161 Mannheim

Tel: +49 621 1233-117
Fax: +49 621 1233-199

E-Mail: Christian.Brandes at forschungsgruppe.de
Internet: http://www.forschungsgruppe.de

Amtsgericht Mannheim HRB 6318
Gesch?ftsf?hrer: Matthias Jung, Andrea Wolf
____________________________________________________________
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 4348 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070314/e2c06d3d/attachment.bin>

From chekov at ucla.edu  Wed Mar 14 21:34:53 2007
From: chekov at ucla.edu (Alan Wood)
Date: Wed, 14 Mar 2007 14:34:53 -0700 (PDT)
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 35, Issue 18
In-Reply-To: <20070314160007.39B8F731D7@hormel.redhat.com>
References: <20070314160007.39B8F731D7@hormel.redhat.com>
Message-ID: <Pine.LNX.4.64.0703141426450.8571@cpe-76-168-2-122.socal.res.rr.com>

Holger,

we ran into that problem and went back and forth with dell for literally 
months as they replaces every component in the 220, our Perc cards, cables, 
etc.  we were never able to solve the problem and I ended up buying a 
couple of iSCSI enclosures instead, which I've been very happy with.  Much 
of the consensus from experts I've talked to is to avoid using shared SCSI 
buses for concurrent access, at least with the megaraid driver.  I have not 
had any problems with the PV enclosure since shifting it to a single-system 
connection.

wish I could tell you something more positive.  In the end I was not able 
to correlate the crashses with anything (load, calendar, processes, users, 
or settings).  Sometimes it would go a month without crashing sometimes it 
would crash many times in a week. 
-alan

> ------------------------------
>
> Date: Wed, 14 Mar 2007 15:55:32 +0100
> From: "Holger L. Ratzel" <holger.ratzel at she.net>
> Subject: [Linux-cluster] Problem with shared SCSI-Storage
> To: linux-cluster at redhat.com
> Cc: "Stempel, Steffen" <Steffen.Stempel at she.net>
> Message-ID: <200703141555.32868.holger.ratzel at she.net>
> Content-Type: text/plain;	charset="us-ascii"
>
> Hi All,
>
> I've a problem with a RHEL4 cluster running on two Dell PowerEdge 1850 and an
> Dell PowerVault PV220S. After running for a while I get errors from the
> megaraid driver in /var/log/messages saying that the device was "offlined"
> and the service gets relocated to the other node while the original node
> reboots. The relevant parts from /var/log/messages can be downloaded from
> http://www.she.net/hra/web01-messages.txt
>
> Dell has checked the system and all firmwares have been updated, so it seems
> not to be a hardware problem.
>
> Please mail me if you have any questions or need further information.
>
> I would appreciate any suggestions.
>
> Many thanks an best regards,
>
> 	Holger Ratzel
>
>



From rstevens at internap.com  Thu Mar 15 18:38:01 2007
From: rstevens at internap.com (Rick Stevens)
Date: Thu, 15 Mar 2007 11:38:01 -0700
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
In-Reply-To: <433093DF7AD7444DA65EFAFE3987879C3BA0AA@orca.penguincomputing.com>
References: <433093DF7AD7444DA65EFAFE3987879C3BA0AA@orca.penguincomputing.com>
Message-ID: <1173983881.5971.26.camel@prophead.corp.publichost.com>

On Thu, 2007-03-15 at 10:41 -0700, Michael Will wrote:
> How does a blade know to boot from a specific lun?
> 
> Can you do a LUN-SAN mapping that would make a lun available to only one
> specific blade each
> under the correct name?

That is an issue.  In those cases, generally you must get onto the
storage device and do host LUN mapping.  Essentially you only expose
the LUN you want to boot from to the WWN of the HBA of the host and hide
it from the other WWNs.

> 
> 
> Michael
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of
> isplist at logicore.net
> Sent: Thursday, March 15, 2007 11:13 AM
> To: linux-cluster; anaconda-devel-list at redhat.com;
> redhat-install-list at redhat.com
> Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
> 
> Is there anyone out there who might have a custom anaconda version which
> could see at least 32 LUNS or more?
> 
> I have yet to find a way of installing blades onto 32 volumes which are
> on a fibre channel storage device. I must find a solution and was now
> told that I might be able to modify anaconda. Not knowing how to do
> this, I wonder if someone might already have a modified version?
> 
> Mike
> 
> This was my old message, things I've done/tried which have not worked.
> 
> //
> I've tried everything I can find and think of or that has been
> suggested.
> 
> I have a Xyratex/MTI type chassis split into 32 volumes. I need to
> install 32 blades onto each volume so that I can remove the drives on
> each blade.
> 
> When I start a Linux install on a blade, it never sees all of the
> volumes, only LUNS 0/1. I need to see all 32 LUNS so that I can install
> all of my servers.
> 
> I've tried all of the following;
> 
> RHEL4, CentOS4.4, others. I don't really care what the distro is, so
> long as it runs basic services such as web/php, qmail, etc. What I do
> need however is that they be GFS/Cluster machines.
> 
> I've tried passing the information at the installers command line;
> 
> scsi_mod.max_luns=256
> scsi_mod.scsi_dev_flags=INLINE:TF200:0x242
> 
> I've tried many variations of these types of commands with no result.
> 
> I've then set up a PXE boot server thinking that I might be able to pass
> the options using pxelinux.cfg. Still no luck, it only sees two LUNS.
> 
> I've tried installing from network with a recompilled initrd.img from a
> machine which was already installed and modprobe.conf modified to see
> the LUNS. The donor server can see all of the volumes, the installing
> version dies with a kernel problem since it cannot see the same already
> installed volume.
> 
> What in the world can I do? Is there a guru here who can tell me how I
> can do this?
> //
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
----------------------------------------------------------------------
- Rick Stevens, Principal Engineer          rstevens at vitalstream.com -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-         It is better to have loved and lost.  Cheaper, too!        -
----------------------------------------------------------------------



From arajekar at max-t.com  Fri Mar 16 18:37:29 2007
From: arajekar at max-t.com (Ashutosh Rajekar)
Date: Fri, 16 Mar 2007 14:37:29 -0400 (EDT)
Subject: [Linux-cluster] CMAN cluster event notification broken? (fwd)
In-Reply-To: <Pine.LNX.4.63.0703161433460.10489@laurentian>
References: <Pine.LNX.4.63.0703161433460.10489@laurentian>
Message-ID: <49622.192.168.1.140.1174070249.squirrel@mail.max-t.com>

>
> Mikhail A Zelikov wrote:
>> I have a 2-node cluster, and I wrote a simple program that registers for
>> cluster events using:
>> cman_start_notification and cman_start_recv_data.  If a node dies
>> (manual
>> reboot/shutdown) then I receive CMAN_REASON_PORTCLOSED event and then NO
>> CMAN_REASON_STATECHANGE (as I woudl expect). I receive
>> CMAN_REASON_STATECHANGE once the node is booted again. If I do not
>> register for data then I do not receive CMAN_REASON_PORTCLOSED as
>> expected.
>> Is the notification mechanism broken?
>
> It used to be broken for the last node in the cluster. It should be fixed
> in U3.

I just rebuilt RPMs from the source RPMs of RHEL5 released a few days ago
from
(ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/cman-2.0.60-1.el5.src.rpm)

cman_start_notification() still seems broken, it doesn't even trigger the
callback registered with cman_start_notification().

In addition to that, monitoring the fd returned by cman_get_fd() through
select returns no events when I force other nodes to close their
applications. Is application level notification broken or unsupported in
this release?


Thanks in advance,
Ashutosh



From raziebe at gmail.com  Mon Mar 12 15:48:20 2007
From: raziebe at gmail.com (Raz Ben-Jehuda(caro))
Date: Mon, 12 Mar 2007 17:48:20 +0200
Subject: [Linux-cluster] register_blkdev: failed to get major for gfs_diape
Message-ID: <5d96567b0703120848w7718335sd1ee16d335fdf3b1@mail.gmail.com>

I am getting this when i try to insert gfs.ko .
I am using 2.6.20-git3 and i have checked out
the STABLE version on fedora 4.

any idea ?

thank you
-- 
Raz
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070312/db10ef71/attachment.htm>

From raziebe at gmail.com  Tue Mar 13 17:34:42 2007
From: raziebe at gmail.com (Raz Ben-Jehuda(caro))
Date: Tue, 13 Mar 2007 19:34:42 +0200
Subject: [Linux-cluster] fstat syscall hangs
Message-ID: <5d96567b0703131034u6d58515eqa02c2ddd8ae6ec7d@mail.gmail.com>

Patrick Hello.
I have installed open gfs over iscsi. It works fine except for the fact that
"df" doesn't work.
I have used 1.03 and cvs STABLE version , problem remained.
debugging it i have learned that fstat hangs over gfs..iscsi.
any ideas ?

thank you
-- 
Raz
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070313/f5c5a671/attachment.htm>

From raziebe at gmail.com  Fri Mar 16 20:28:47 2007
From: raziebe at gmail.com (Raz Ben-Jehuda(caro))
Date: Fri, 16 Mar 2007 22:28:47 +0200
Subject: [Linux-cluster] gfs tools
Message-ID: <5d96567b0703161328ta902662v7f6d3a4ff97202f7@mail.gmail.com>

Hello.

I am looking for a way to look at a bitmap
of a gfs file, meaning, how does it stored on the storage.

thank you

-- 
Raz
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070316/f69996a3/attachment.htm>

From isplist at logicore.net  Mon Mar 19 22:12:10 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Mon, 19 Mar 2007 16:12:10 -0600
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
In-Reply-To: <1173983881.5971.26.camel@prophead.corp.publichost.com>
Message-ID: <2007319161210.772196@leena>

> That is an issue.  In those cases, generally you must get onto the
> storage device and do host LUN mapping.  Essentially you only expose
> the LUN you want to boot from to the WWN of the HBA of the host and hide
> it from the other WWNs.

In looking at the Brocade switches I'm using, I think I can do this. I just 
need to learn how to do it with the devices I have.

Mike





From rpeterso at redhat.com  Mon Mar 19 21:24:28 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Mon, 19 Mar 2007 16:24:28 -0500
Subject: [Linux-cluster] gfs tools
In-Reply-To: <5d96567b0703161328ta902662v7f6d3a4ff97202f7@mail.gmail.com>
References: <5d96567b0703161328ta902662v7f6d3a4ff97202f7@mail.gmail.com>
Message-ID: <45FEFF8C.1070903@redhat.com>

Raz Ben-Jehuda(caro) wrote:
> Hello.
> 
> I am looking for a way to look at a bitmap
> of a gfs file, meaning, how does it stored on the storage.
> 
> thank you

Hi Raz,

For what version?
For RHEL4 and similar, there is gfs_tool layout.

For RHEL5, gfs_tool layout should work for gfs1, but there is
not yet a gfs2 equivalent.  We're working on it, though.  See:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=221301

Regards,

Bob Peterson
Red Hat Cluster Suite



From rstevens at internap.com  Mon Mar 19 21:35:00 2007
From: rstevens at internap.com (Rick Stevens)
Date: Mon, 19 Mar 2007 14:35:00 -0700
Subject: [Linux-cluster] Custom Anaconda to see 32 LUNS?
In-Reply-To: <2007319161210.772196@leena>
References: <2007319161210.772196@leena>
Message-ID: <1174340100.5724.8.camel@prophead.corp.publichost.com>

On Mon, 2007-03-19 at 16:12 -0600, isplist at logicore.net wrote:
> > That is an issue.  In those cases, generally you must get onto the
> > storage device and do host LUN mapping.  Essentially you only expose
> > the LUN you want to boot from to the WWN of the HBA of the host and hide
> > it from the other WWNs.
> 
> In looking at the Brocade switches I'm using, I think I can do this. I just 
> need to learn how to do it with the devices I have.

Yes, you can do it with the Brocades (in fact, almost any FC switch
provides ways to do it).  If you aren't using Brocades, then most SAN
controllers have methods to do it.  We use Hitachi and EMC SANs and can
do it from there if needed.

----------------------------------------------------------------------
- Rick Stevens, Principal Engineer             rstevens at internap.com -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-        Hard work has a future payoff. Laziness pays off now.       -
----------------------------------------------------------------------



From nirmaltom at hotmail.com  Tue Mar 20 10:32:33 2007
From: nirmaltom at hotmail.com (nirmal tom)
Date: Tue, 20 Mar 2007 16:02:33 +0530
Subject: [Linux-cluster] iscsi and clusterd volume group
Message-ID: <BAY136-F30F715479F5B0DB9748CCCAC750@phx.gbl>

hi,
as per the usage guide , i activated the c or clustered bit for the volume 
group, rac.Now it is shown to my kernel or system.So, the iscsi target fails 
and whatever command i gave with respect to lvm it simly reponds skipping as 
u see below

[root at server ~]# vgchange -a y rac
  Skipping clustered volume group rac
[root at server ~]# vgchange -c n rac
  Skipping clustered volume group rac
[root at server ~]# vgremove rac
  Skipping clustered volume group rac
[root at server ~]# vgchange --clustered n rac
  Skipping clustered volume group rac
ered volume group rac
[root at server ~]# vgchange --available e y rac
  Volume group "y" not found
  Skipping clustered volume group rac

how to set the clustered bit to turn off? or to make it avialabel as usual

thanks for any response

regards,
Nirmal Tom.




From rpeterso at redhat.com  Tue Mar 20 14:16:46 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 20 Mar 2007 09:16:46 -0500
Subject: [Linux-cluster] iscsi and clusterd volume group
In-Reply-To: <BAY136-F30F715479F5B0DB9748CCCAC750@phx.gbl>
References: <BAY136-F30F715479F5B0DB9748CCCAC750@phx.gbl>
Message-ID: <45FFECCE.8000502@redhat.com>

nirmal tom wrote:
> hi,
> as per the usage guide , i activated the c or clustered bit for the 
> volume group, rac.Now it is shown to my kernel or system.So, the iscsi 
> target fails and whatever command i gave with respect to lvm it simly 
> reponds skipping as u see below
> 
> [root at server ~]# vgchange -a y rac
>  Skipping clustered volume group rac
> [root at server ~]# vgchange -c n rac
>  Skipping clustered volume group rac
> [root at server ~]# vgremove rac
>  Skipping clustered volume group rac
> [root at server ~]# vgchange --clustered n rac
>  Skipping clustered volume group rac
> ered volume group rac
> [root at server ~]# vgchange --available e y rac
>  Volume group "y" not found
>  Skipping clustered volume group rac
> 
> how to set the clustered bit to turn off? or to make it avialabel as usual
> 
> thanks for any response
> 
> regards,
> Nirmal Tom.

Hi Nirmal,

I think what happened here is that maybe you have 
locking_type = 2 but clvmd isn't running.  Therefore,
lvm can't do the clustered locking necessary to "see" the
volume group, and it has to "see" the volume group before
it can change its attributes.

Your choices are to (1) activate the clustered locking
mechanism and leave the volume group clustered, or 
(2) change the locking_type = 1 in your 
/etc/lvm/lvm.conf file, at least temporarily, so you can
change the clustered bit off.

To get the clustered volume to be active, make sure
locking_type = 2 (for RHEL4 and similar) or locking_type = 3
(for RHEL5 and similar) in /etc/lvm/lvm.conf, then you need
to run the clustered LVM manager: service clvmd start.
The clvmd service can only run properly, however, if the rest
of the cluster infrastructure is running.

If you're sharing the volume group between systems, you
want to keep the clustered bit on.  If not, you can turn it off.

After clvmd is started, or locking_type changed back to 1,
you should be able to do what you want.  For example:

vgscan
vgchange -aln
vgchange -cn rac 

Regards,

Bob Peterson
Red Hat Cluster Suite



From rpeterso at redhat.com  Tue Mar 20 14:25:59 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Tue, 20 Mar 2007 09:25:59 -0500
Subject: [Linux-cluster] make errors for 2.6.19 on fc6
In-Reply-To: <BAY136-F21F4CB5B18F2B6DEA3E1D9AC760@phx.gbl>
References: <BAY136-F21F4CB5B18F2B6DEA3E1D9AC760@phx.gbl>
Message-ID: <45FFEEF7.9020003@redhat.com>

nirmal tom wrote:
> hi,
> Thanks man.
> The reason for not switching to vanilla(2.6.20 and i have it installed) 
> is , i am running iscsi-target for 2.6.19.I got similar make errors for 
> gfs and gfs2 programs.but i left them by marking the lines in make file 
> as commands, as i installed gfs2-utils rpm from update yum repository 
> and i am gonna to only use gfs2.The installation is successful.
> but when i do,
> [root at server ~]# dlm_controld
> [root at server ~]#  modprobe lock_dlm
> WARNING: Error inserting debugfs 
> (/lib/modules/2.6.19-1.2911.fc6/kernel/fs/debugfs/debugfs.ko): Invalid 
> module format
> [root at server ~]# modprobe dlm
> WARNING: Error inserting debugfs 
> (/lib/modules/2.6.19-1.2911.fc6/kernel/fs/debugfs/debugfs.ko): Invalid 
> module format
> 
> Also,
> [root at server ~]# ccsd
> [root at server ~]# service cman start
> Starting cluster:
>   Loading modules... done
>   Mounting configfs... done
>   Starting ccsd... done
>   Starting cman... failed
> /usr/sbin/cman_tool: ccsd is not running
>                                                           [FAILED]
> [root at server ~]# ccsd
> Failed to create lockfile.
> Hint: ccsd is already running.
> 
> 
> where i am missing?
> 
> out of curiosity:you referenced me a web based diff tool at redhat 
> site.how can i find the version of a source file with respec to kernel 
> .For example: glock.c contains no version information inside it!
> 
> 
> Thanks and regards,
> Nirmal Tom.V

Hi Nirmal,

It sounds like your kernel does not have the debugfs kernel
option compiled in.  To compile it in:

cd /your/kernel/source/location
make menuconfig
scroll down to "Kernel hacking" hit enter, then
scroll down to "Debug filesystem" hit space until it becomes a "*",
then right-arrow to exit, save, changes, make, make install, reboot, etc.

The kernel module probably doesn't contain version information from cvs.

Regards,

Bob Peterson
Red Hat Cluster Suite



From teigland at redhat.com  Tue Mar 20 15:09:52 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 20 Mar 2007 10:09:52 -0500
Subject: [Linux-cluster] DLM Documentation
In-Reply-To: <bf6826c60703161149j49468a62y383d1d4e730b2b59@mail.gmail.com>
References: <bf6826c60703161149j49468a62y383d1d4e730b2b59@mail.gmail.com>
Message-ID: <20070320150951.GD20026@redhat.com>

On Fri, Mar 16, 2007 at 02:49:10PM -0400, Jeff wrote:
> There is a document in a DLM Source tree,
>   http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm/doc/?cvsroot=cluster
> which briefly describes the DLM API.
> Is there anything else available?
> 
> In particular I was looking for how VALNOTVALID errors are handled and
> signaled to the user.

The userland API is somewhat half-baked, it does just enough for the
minimal use we've made of it.  VALNOTVALID in particular is something we
don't use, so it exists, but in practice it probably doesn't work right.
The dlm code that supports userland locking is currently being overhauled,
and part of that will include the API.  It's part of a larger effort to
make the dlm useful for serious userland apps.  If you're interested in
specific things from the dlm or the API, now would be a great time to let
us know.

Dave



From rhurst at bidmc.harvard.edu  Tue Mar 20 15:39:11 2007
From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu)
Date: Tue, 20 Mar 2007 11:39:11 -0400
Subject: [Linux-cluster] GFS/CS blocks all I/O on 1 server reboot of 11
	nodes?
Message-ID: <1174405151.6895.19.camel@WSBID06223>

Troubling, this behavior has not occurred prior to our Mar 2nd up2date
on our RHEL GFS/CS subscription.  I rebooted an application server
(app1) in an 11-node cluster, and from viewing its console, it 'hung' on
a service cman stop.  Consequently, ALL GFS I/O got blocked on ALL
nodes.  All servers are configured the same:

AMD64 dual CPU/duo core HP DL385, 8GB RAM, dual hba (PowerPath)
# uname -r
2.6.9-42.0.10.ELsmp

ccs-1.0.7-0
cman-1.0.11-0
dlm-1.0.1-1
fence-1.32.25-1
GFS-6.1.6-1
magma-1.0.6-0
magma-plugins-1.0.9-0
rgmanager-1.9.54-1

My central syslog server showed that all nodes registered the membership
change, yet the service continued to hang.

Mar 20 11:06:18 app1 shutdown: shutting down for system reboot 
Mar 20 11:06:18 app1 init: Switching to runlevel: 6 
Mar 20 11:06:19 app1 rgmanager: [1873]: <notice> Shutting down Cluster
Service Manager...  
Mar 20 11:06:20 app1 clurgmgrd[11220]: <notice> Shutting down  
Mar 20 11:06:20 net2 clurgmgrd[30893]: <info> State change: app1 DOWN 
Mar 20 11:06:20 app3 clurgmgrd[11092]: <info> State change: app1 DOWN  
Mar 20 11:06:20 db1 clurgmgrd[8351]: <info> State change: app1 DOWN  
Mar 20 11:06:20 db3 clurgmgrd[8279]: <info> State change: app1 DOWN  
Mar 20 11:06:20 db2 clurgmgrd[10875]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app6 clurgmgrd[10959]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app4 clurgmgrd[11146]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app2 clurgmgrd[10835]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app5 clurgmgrd[11198]: <info> State change: app1 DOWN  
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN  
Mar 20 11:12:26 net2 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats
Mar 20 11:12:26 db2 kernel: CMAN: removing node app1 from the cluster :
Missed too many heartbeats 
Mar 20 11:12:26 db3 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app4 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app5 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app6 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app3 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 db1 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app2 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0
sec post_fail_delay 
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1" 
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success 


I issued a 'power reset' on its HP ILO management port to hardware
reboot the server around 11:12.  That is when the net1 server attempted
to fence app1, after it was missing.  Here's net1's syslog entries on
that event:

Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> Magma Event: Membership
Change 
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0
sec post_fail_delay
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1"
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success
Mar 20 11:15:45 net1 kernel: CMAN: node app1 rejoining
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> Magma Event: Membership
Change 
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> State change: app1 UP 


Robert Hurst, Sr. Cach? Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070320/ed68c110/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2178 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070320/ed68c110/attachment.p7s>

From teigland at redhat.com  Tue Mar 20 16:03:05 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 20 Mar 2007 11:03:05 -0500
Subject: [Linux-cluster] problem compiling redhat cluster2 with 2.6.9
In-Reply-To: <a97c77030703101031j3d6bbb32me18a015831268e00@mail.gmail.com>
References: <a97c77030703101031j3d6bbb32me18a015831268e00@mail.gmail.com>
Message-ID: <20070320160305.GE20026@redhat.com>

On Sun, Mar 11, 2007 at 12:01:21AM +0530, Rajesh Kumar Mallah wrote:
> Hi,
> 
> I am facing problem in compiling cluster 1 or cluster 2 with kernel - 2.6.9.
> cluster sources were picked from
> ftp://sources.redhat.com/pub/cluster/releases/
> 
> can anyone please tell if its possible to compile GFS1 or 2 with kernel 
> 2.6.9

cluster-1.04 and -2.00 are intended for use with 2.6.20.

Dave



From teigland at redhat.com  Tue Mar 20 16:20:05 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 20 Mar 2007 11:20:05 -0500
Subject: [Linux-cluster] Can GFS2 be compiled on top of 2.6.11?
In-Reply-To: <08A9A3213527A6428774900A80DBD8D803A9A595@xmb-sjc-222.amer.cisco.com>
References: <08A9A3213527A6428774900A80DBD8D803A9A595@xmb-sjc-222.amer.cisco.com>
Message-ID: <20070320162005.GF20026@redhat.com>

On Mon, Mar 19, 2007 at 01:05:23PM -0700, Lin Shen (lshen) wrote:
> I need to run GFS2 on top of 2.6.11. Is it possible?

No, apart from a _lot_ of backporting.

Dave



From isplist at logicore.net  Tue Mar 20 18:59:33 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Tue, 20 Mar 2007 12:59:33 -0600
Subject: [Linux-cluster] Solution: Install over FC SAN works, boot doesn't
In-Reply-To: <ae8f10de0703201044n6fc11b1fsed896c5da6a8b89f@mail.gmail.com>
Message-ID: <2007320125933.096787@leena>

After endless trial and errors, the solution was so much simpler than the 
problem. 
The problem was that I was able to install over the SAN to a volume but I 
could not boot from that installation.
It was as simple as going into the 'Advanced setting' of the boot installer 
while installing the server and picking the proper drive to boot.

Wish RH would make it a pick list rather than having to highlight the drive, 
then click it up one at a time but hey, it works. 

The answer was from a way cool guy at Qlogic who spend two days looking into 
this and figured out the problem. Now I have my central storage AND SAN based 
roots for each blade, no need for drives, flash drives, nothing, just the 
storage.

Mike


On Tue, 20 Mar 2007 17:44:35 +0000, William Rizzo wrote:
> Hi you're trying to do that?
> 
> When you say "Install Linux over the FC Network", do you mean Boot From 
SAN?
> 
> I guess that if can see the LUN at any stage, the LUN Masking and the
> Zoning aren't the problems.
> 
> 
> William
> 
> 
> 2007/3/19, isplist at logicore.net <isplist at logicore.net>: > I am able to
> install Linux over the FC network and onto a storage volume.
>> However, when I try to boot these installs, I only get a blank screen and
>> cannot boot.
>> 
>> I have found that I can install the /boot directory to a flash card on
>> each
>> blade and then am able to boot fine with swap and root on the remote
>> volume.
>> 
>> Why is this and how can I make my install 100% remote?
>> 
>> Also strange (to me at least), I have found that if I connect the multiple
>> volume storage device to the rest of the FC network, already installed
>> machines with Qlogic cards in them no longer boot. They come up with many
>> errors and all finally stall at the syslog starting point. Disconnecting
>> the
>> new storage allows them to boot again.
>> 
>> I believe this has something to do with LUN masking/zoning (which I've
>> yet to
>> learn, have to start reading Brocade 2800 docs next) but what I don't get
>> is
>> why the new blades can see their own LUNS if installed as noted above with
>> flash /boot, but old machines get stuck?
>> 
>> Rather confusing.
>> 
>> Mike
>> 
>> 
>> _______________________________________________
>> Anaconda-devel-list mailing list
>> Anaconda-devel-list at redhat.com
>> https://www.redhat.com/mailman/listinfo/anaconda-devel-list






From rstevens at internap.com  Tue Mar 20 18:21:30 2007
From: rstevens at internap.com (Rick Stevens)
Date: Tue, 20 Mar 2007 11:21:30 -0700
Subject: [Linux-cluster] Solution: Install over FC SAN works, boot doesn't
In-Reply-To: <2007320125933.096787@leena>
References: <2007320125933.096787@leena>
Message-ID: <1174414891.3172.4.camel@prophead.corp.publichost.com>

On Tue, 2007-03-20 at 12:59 -0600, isplist at logicore.net wrote:
> After endless trial and errors, the solution was so much simpler than the 
> problem. 
> The problem was that I was able to install over the SAN to a volume but I 
> could not boot from that installation.
> It was as simple as going into the 'Advanced setting' of the boot installer 
> while installing the server and picking the proper drive to boot.
> 
> Wish RH would make it a pick list rather than having to highlight the drive, 
> then click it up one at a time but hey, it works. 
> 
> The answer was from a way cool guy at Qlogic who spend two days looking into 
> this and figured out the problem. Now I have my central storage AND SAN based 
> roots for each blade, no need for drives, flash drives, nothing, just the 
> storage.

Glad you got it sorted out.  So long as the drives don't move around,
you should be good to go.

> On Tue, 20 Mar 2007 17:44:35 +0000, William Rizzo wrote:
> > Hi you're trying to do that?
> > 
> > When you say "Install Linux over the FC Network", do you mean Boot From 
> SAN?
> > 
> > I guess that if can see the LUN at any stage, the LUN Masking and the
> > Zoning aren't the problems.
> > 
> > 
> > William
> > 
> > 
> > 2007/3/19, isplist at logicore.net <isplist at logicore.net>: > I am able to
> > install Linux over the FC network and onto a storage volume.
> >> However, when I try to boot these installs, I only get a blank screen and
> >> cannot boot.
> >> 
> >> I have found that I can install the /boot directory to a flash card on
> >> each
> >> blade and then am able to boot fine with swap and root on the remote
> >> volume.
> >> 
> >> Why is this and how can I make my install 100% remote?
> >> 
> >> Also strange (to me at least), I have found that if I connect the multiple
> >> volume storage device to the rest of the FC network, already installed
> >> machines with Qlogic cards in them no longer boot. They come up with many
> >> errors and all finally stall at the syslog starting point. Disconnecting
> >> the
> >> new storage allows them to boot again.
> >> 
> >> I believe this has something to do with LUN masking/zoning (which I've
> >> yet to
> >> learn, have to start reading Brocade 2800 docs next) but what I don't get
> >> is
> >> why the new blades can see their own LUNS if installed as noted above with
> >> flash /boot, but old machines get stuck?
> >> 
> >> Rather confusing.
> >> 
> >> Mike
> >> 
> >> 
> >> _______________________________________________
> >> Anaconda-devel-list mailing list
> >> Anaconda-devel-list at redhat.com
> >> https://www.redhat.com/mailman/listinfo/anaconda-devel-list
> 
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
----------------------------------------------------------------------
- Rick Stevens, Principal Engineer             rstevens at internap.com -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-         It is better to have loved and lost.  Cheaper, too!        -
----------------------------------------------------------------------



From dshwatrz at gmail.com  Tue Mar 20 19:47:07 2007
From: dshwatrz at gmail.com (David Shwatrz)
Date: Tue, 20 Mar 2007 21:47:07 +0200
Subject: [Linux-cluster] Minimum kernel version for cluster2.00
Message-ID: <31436f4a0703201247r20d4f8fbv83105783ccc3d7c1@mail.gmail.com>

Hello,
 I saw somewhere (don't remember where - it could be on this list ) that
cluster2.00  can be built with linux -2.6.20.
My question: is this kernel version, 2.6.20 , is a minimum version ? are
there any chances that it will be built with kernel version
lower thatn 2.6.20 ? couldn't find info about it in the source tarball of
cluster2.00

Regaeds,
DS
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070320/000f09fc/attachment.htm>

From ianbrn at gmail.com  Tue Mar 20 20:44:22 2007
From: ianbrn at gmail.com (Ian Brown)
Date: Tue, 20 Mar 2007 22:44:22 +0200
Subject: [Linux-cluster] errors while building cluster-1.03.00 with
	2.6.18-1.2798.fc6xen
Message-ID: <d0383f90703201344k65ebabc0ub465492eeeeeb937@mail.gmail.com>

Hello,
 I try to build cluster-1.03.00 on x86_64 against a xen tree
(2.6.18-1.2798.fc6xen)

I ran:

./configure --kernel_src=/lib/modules/2.6.18-1.2798.fc6xen/build
And then:
make install

I get these errors:

make[4]: Entering directory `/usr/src/kernels/2.6.18-1.2798.fc6-xen-x86_64'
  CC [M]  /work/src/cluster-1.03.00/gfs-kernel/src/harness/main.o
/work/src/cluster-1.03.00/gfs-kernel/src/harness/main.c:121: error:
expected declaration specifiers or '...' before 'lm_fsdata_t'
/work/src/cluster-1.03.00/gfs-kernel/src/harness/main.c: In function 'lm_mount':
/work/src/cluster-1.03.00/gfs-kernel/src/harness/main.c:162: error:
'fsdata' undeclared (first use in this function)
...
...

Any idea why these errors ?

Regards,
Ian



From rhurst at bidmc.harvard.edu  Wed Mar 21 00:36:59 2007
From: rhurst at bidmc.harvard.edu (rhurst at bidmc.harvard.edu)
Date: Tue, 20 Mar 2007 20:36:59 -0400
Subject: [Linux-cluster] GFS/CS blocks all I/O on 1 server reboot of
	11 nodes?
References: <1174405151.6895.19.camel@WSBID06223>
Message-ID: <C44AB05AAC2D694A907A91DA40CCF37312E166@EVS6.its.caregroup.org>

I ran a series of reboots, and this problem is totally reproducible.  Should I be opening a ticket at Red Hat Support on this?

The problem is immediate with 'service rgmanager stop', as it hangs in its sleep loop forever, even though all nodes in the cluster report that it changed its state to down.  But worse than that, it also hangs all GFS I/O and the load average on all nodes start to spike (>9.00) -- I see gfs_scand in top racing away.

It only gets fixed when I manually 'power reset' the node, then I get the 'Missed too many heartbeats' followed by fencing.  Help.


Robert Hurst, Sr. Cach? Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.

-----Original Message-----
From: linux-cluster-bounces at redhat.com on behalf of rhurst at bidmc.harvard.edu
Sent: Tue 3/20/2007 11:39 AM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] GFS/CS blocks all I/O on 1 server reboot of 11 nodes?
 
Troubling, this behavior has not occurred prior to our Mar 2nd up2date
on our RHEL GFS/CS subscription.  I rebooted an application server
(app1) in an 11-node cluster, and from viewing its console, it 'hung' on
a service cman stop.  Consequently, ALL GFS I/O got blocked on ALL
nodes.  All servers are configured the same:

AMD64 dual CPU/duo core HP DL385, 8GB RAM, dual hba (PowerPath)
# uname -r
2.6.9-42.0.10.ELsmp

ccs-1.0.7-0
cman-1.0.11-0
dlm-1.0.1-1
fence-1.32.25-1
GFS-6.1.6-1
magma-1.0.6-0
magma-plugins-1.0.9-0
rgmanager-1.9.54-1

My central syslog server showed that all nodes registered the membership
change, yet the service continued to hang.

Mar 20 11:06:18 app1 shutdown: shutting down for system reboot 
Mar 20 11:06:18 app1 init: Switching to runlevel: 6 
Mar 20 11:06:19 app1 rgmanager: [1873]: <notice> Shutting down Cluster
Service Manager...  
Mar 20 11:06:20 app1 clurgmgrd[11220]: <notice> Shutting down  
Mar 20 11:06:20 net2 clurgmgrd[30893]: <info> State change: app1 DOWN 
Mar 20 11:06:20 app3 clurgmgrd[11092]: <info> State change: app1 DOWN  
Mar 20 11:06:20 db1 clurgmgrd[8351]: <info> State change: app1 DOWN  
Mar 20 11:06:20 db3 clurgmgrd[8279]: <info> State change: app1 DOWN  
Mar 20 11:06:20 db2 clurgmgrd[10875]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app6 clurgmgrd[10959]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app4 clurgmgrd[11146]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app2 clurgmgrd[10835]: <info> State change: app1 DOWN  
Mar 20 11:06:20 app5 clurgmgrd[11198]: <info> State change: app1 DOWN  
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN  
Mar 20 11:12:26 net2 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats
Mar 20 11:12:26 db2 kernel: CMAN: removing node app1 from the cluster :
Missed too many heartbeats 
Mar 20 11:12:26 db3 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app4 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app5 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app6 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app3 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 db1 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:26 app2 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats 
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0
sec post_fail_delay 
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1" 
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success 


I issued a 'power reset' on its HP ILO management port to hardware
reboot the server around 11:12.  That is when the net1 server attempted
to fence app1, after it was missing.  Here's net1's syslog entries on
that event:

Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> Magma Event: Membership
Change 
Mar 20 11:06:20 net1 clurgmgrd[12689]: <info> State change: app1 DOWN
Mar 20 11:12:26 net1 kernel: CMAN: node app1 has been removed from the
cluster : Missed too many heartbeats
Mar 20 11:12:32 net1 fenced[10510]: app1 not a cluster member after 0
sec post_fail_delay
Mar 20 11:12:32 net1 fenced[10510]: fencing node "app1"
Mar 20 11:13:42 net1 fenced[10510]: fence "app1" success
Mar 20 11:15:45 net1 kernel: CMAN: node app1 rejoining
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> Magma Event: Membership
Change 
Mar 20 11:18:05 net1 clurgmgrd[12689]: <info> State change: app1 UP 


Robert Hurst, Sr. Cach? Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 5074 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070320/adf80302/attachment.bin>

From quanong_os at yahoo.com  Wed Mar 21 00:45:48 2007
From: quanong_os at yahoo.com (qua nong)
Date: Wed, 21 Mar 2007 11:45:48 +1100 (EST)
Subject: [Linux-cluster] Cluster in virtual machines ???
Message-ID: <20070321004548.22398.qmail@web63203.mail.re1.yahoo.com>

Hi,

I have only 1 physical server with a number of local harddisks (no SAN or external hardisks).
I would like set up at least 2 virtual machines using either xen or vmware server, then cluster the virtual machines.

I wonder how could  the same disks be presented to both virtual machines. i.e. both virtual machines can share the same disks ? 
Is it possible to configure the iSCSI to both virtual machines ?

The problem is I donnot have any SAN or external storage which  can be shared to both servers.

QN




 
____________________________________________________________________________________
Get your own web address.  
Have a HUGE year through Yahoo! Small Business.
http://smallbusiness.yahoo.com/domains/?p=BESTDEAL
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070321/00f10937/attachment.htm>

From jwhiter at redhat.com  Wed Mar 21 00:53:19 2007
From: jwhiter at redhat.com (Josef Bacik)
Date: Tue, 20 Mar 2007 19:53:19 -0500
Subject: [Linux-cluster] GFS/CS blocks all I/O on 1 server reboot of 11
	nodes?
In-Reply-To: <C44AB05AAC2D694A907A91DA40CCF37312E166@EVS6.its.caregroup.org>
References: <1174405151.6895.19.camel@WSBID06223>
	<C44AB05AAC2D694A907A91DA40CCF37312E166@EVS6.its.caregroup.org>
Message-ID: <20070321005319.GB2637@korben.rdu.redhat.com>

On Tue, Mar 20, 2007 at 08:36:59PM -0400, rhurst at bidmc.harvard.edu wrote:
> I ran a series of reboots, and this problem is totally reproducible.  Should I be opening a ticket at Red Hat Support on this?
> 
> The problem is immediate with 'service rgmanager stop', as it hangs in its sleep loop forever, even though all nodes in the cluster report that it changed its state to down.  But worse than that, it also hangs all GFS I/O and the load average on all nodes start to spike (>9.00) -- I see gfs_scand in top racing away.
> 
> It only gets fixed when I manually 'power reset' the node, then I get the 'Missed too many heartbeats' followed by fencing.  Help.
> 
>

echo "RGMGR_OPTS=-d" > /etc/sysconfig/cluster

and reproduce and then open a ticket with support.  Its possible that it's
waiting for one of your service scripts to stop and it's not returning.  Also
there was a bug where bash would segfault and rgmanager would just hang.  Make
sure you have the newest version of bash and see if the problem still
reproduces.  If none of the above helps definitely file a support ticket, if
frontline cannot figure it out it will probably make it back to me and I'll take
a look.

Josef



From sail at serverengines.com  Wed Mar 21 01:37:50 2007
From: sail at serverengines.com (Sai Loganathan)
Date: Tue, 20 Mar 2007 18:37:50 -0700
Subject: [Linux-cluster] Re: cluster not doing failover 
References: <20070310170007.4934A731DB@hormel.redhat.com> 
Message-ID: <009801c76b59$89833fd0$2702140a@se19261f2cf9ed>

Hello,
Thanks for the info. Now I am doing manual fencing but get the following
error whenever I do a failover.

Mar 12 17:25:50 node2 clurgmgrd[6088]: <info> State change: node1 DOWN
Mar 12 17:25:52 node2 clurgmgrd[6088]: <notice> Starting stopped service
iscsi_ip
Mar 12 17:25:52 node2 clurgmgrd: [6088]: <info> Adding IPv4 address
172.40.2.119 to eth2
Mar 12 17:25:52 node2 clurgmgrd[6088]: <notice> Starting stopped service
iscsi_lun
Mar 12 17:25:53 node2 clurgmgrd[6088]: <notice> Service iscsi_lun started
Mar 12 17:25:54 node2 clurgmgrd[6088]: <notice> Service iscsi_ip started
Mar 12 17:26:24 node2 kernel: CMAN: removing node node1 from the cluster :
Missed too many heartbeats
Mar 12 17:26:24 node2 fenced[6040]: node1 not a cluster member after 0 sec
post_fail_delay
Mar 12 17:26:24 node2 fenced[6040]: fencing node "node1"
Mar 12 17:26:24 node2 fence_manual: Node node1 needs to be reset before
recovery can procede.  Waiting for node1 to rejoin the cluster or for manual
acknowledgement that it has been reset (i.e. fence_ack_manual -n node1)

I just power down node 1 to simulate the failover to node2. Unless I execute
the command fence_ack_manual -n node1, the system will not move forward and
wait in fencing. How to fix this error?

During shutdown, I get the following error message and system waits there
infinitely.
Starting Killall: CMAN: sendmsg failed: -101
WARNING: dlm_emergency_shutdown
SM: 00000003 sm_stop: SG stilljoined
How to fix this error?

Thanks,
Sai Logan



-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of
linux-cluster-request at redhat.com
Sent: Saturday, March 10, 2007 9:00 AM
To: linux-cluster at redhat.com
Subject: Linux-cluster Digest, Vol 35, Issue 13

Send Linux-cluster mailing list submissions to
	linux-cluster at redhat.com

To subscribe or unsubscribe via the World Wide Web, visit
	https://www.redhat.com/mailman/listinfo/linux-cluster
or, via email, send a message with subject or body 'help' to
	linux-cluster-request at redhat.com

You can reach the person managing the list at
	linux-cluster-owner at redhat.com

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Linux-cluster digest..."


Today's Topics:

   1. Re: cluster not doing failover (Jonathan E Brassow)


----------------------------------------------------------------------

Message: 1
Date: Fri, 9 Mar 2007 19:53:40 -0600
From: Jonathan E Brassow <jbrassow at redhat.com>
Subject: Re: [Linux-cluster] cluster not doing failover
To: linux clustering <linux-cluster at redhat.com>
Message-ID: <40407159e8e6506b05d46c82d921d936 at redhat.com>
Content-Type: text/plain; charset="iso-8859-1"


On Mar 9, 2007, at 5:30 PM, Sai Loganathan wrote:
>             <fencedevices>
>                         <fencedevice agent="fence_ilo" hostname="admin"
> login="admin" name="node1_fence" passwd="admin"/>
>                         <fencedevice agent="fence_ilo" hostname="admin"
> login="admin" name="node2_fence" passwd="admin"/>
>             </fencedevices>

The above line look funny to me.  The hostname for the fence device is 
"admin"?

> Using the cluster ip address (172.40.2.119), I was able to do an nfs
> mount of the shared lun from a 3rd machine. Started an infinite ls on
> that lun.
> To simulate failover, I just powered-down the node1 and hoping to see
> the read io stop but resume via the node2. But, I see the following
> error message on the node 2.
> Mar  9 12:14:49 node2 fenced[7422]: fence "node1" failed
> Mar  9 12:14:54 node2 fenced[7422]: fencing node "node1"
> Mar  9 12:14:54 node2 fenced[7422]: agent "fence_ilo" reports: Can't
> call method "configure" on an undefined value at /sbin/fence_ilo line
> 169, <> line 4.
> Mar  9 12:14:54 node2 fenced[7422]: fence "node1" failed
> Mar  9 12:14:59 node2 fenced[7422]: fencing node "node1"
> Mar  9 12:14:59 node2 fenced[7422]: agent "fence_ilo" reports: Can't
> call method "configure" on an undefined value at /sbin/fence_ilo line
> 169, <> line 4.
>  
> Seems like I am not doing something correct with respect to fencing.
> Can I setup cluster without fencing first of all?

Yes.  You can use manual fencing.  That should only be used for testing 
purposes though... it is not a supported configuration.


  brassow
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 3471 bytes
Desc: not available
Url :
https://www.redhat.com/archives/linux-cluster/attachments/20070309/c015b8da/
attachment.bin

------------------------------

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

End of Linux-cluster Digest, Vol 35, Issue 13
*********************************************



___________________________________________________________________________________
This message, together with any attachment(s), contains confidential and proprietary information of
ServerEngines LLC and is intended only for the designated recipient(s) named above. Any unauthorized
review, printing, retention, copying, disclosure or distribution is strictly prohibited.  If you are not the
intended recipient of this message, please immediately advise the sender by reply email message and
delete all copies of this message and any attachment(s). Thank you.




From srramasw at cisco.com  Wed Mar 21 01:53:35 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Tue, 20 Mar 2007 18:53:35 -0700
Subject: [Linux-cluster] Kernel oops when GFS2 used as localfs
In-Reply-To: <B14199FA0DBAAF4AA89E83EB41D35435033D15D9@xmb-sjc-22c.amer.cisco.com>
Message-ID: <B14199FA0DBAAF4AA89E83EB41D35435033D1FD4@xmb-sjc-22c.amer.cisco.com>

The problen is solved! Kernel version I tried earlier - 2.6.21-rc1 -
doesn't have the fix for bugzilla# 229831.
 
I moved to Fedora Core 6 (to resolve all the udev problem) and picked
the latest kernel (2.6.21-rc4).  With this bonnie tests run fine. 
 
thanks,
Sridhar


________________________________

	From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Sridhar Ramaswamy
(srramasw)
	Sent: Sunday, March 18, 2007 8:26 PM
	To: linux clustering
	Subject: RE: [Linux-cluster] Kernel oops when GFS2 used as
localfs
	
	
	Thanks Kevin! That helps.
	 
	- Sridhar


________________________________

		From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kevin Anderson
		Sent: Saturday, March 17, 2007 2:37 AM
		To: linux clustering
		Subject: RE: [Linux-cluster] Kernel oops when GFS2 used
as localfs
		
		
		Just opened up bugzilla 229831, you should be able to
see it now.
		
		Kevin
		
		On Sat, 2007-03-17 at 00:11 -0600, David wrote: 

			Once you get it, feel free to share with all of
us.
			
			David 
			
			-----Original Message-----
			From: linux-cluster-bounces at redhat.com
			[mailto:linux-cluster-bounces at redhat.com] On
Behalf Of Sridhar Ramaswamy
			(srramasw)
			Sent: Wednesday, March 14, 2007 12:20 PM
			To: linux clustering
			Subject: RE: [Linux-cluster] Kernel oops when
GFS2 used as localfs
			
			
			Hi folks,
			
			This bugzilla 229831 seems "restricted" :(  Can
someone please forward
			its details? I'm interested in understanding the
actual problem and its
			proposed fix.
			
			Meanwhile I'll try an upstream kernel. Should
2.6.21.rc1 be fine?
			
			Thanks,
			Sridhar
			
			> -----Original Message-----
			> From: linux-cluster-bounces at redhat.com
			> [mailto:linux-cluster-bounces at redhat.com] On
Behalf Of Wendy Cheng
			> Sent: Wednesday, March 14, 2007 6:33 AM
			> To: linux clustering
			> Subject: Re: [Linux-cluster] Kernel oops when
GFS2 used as localfs
			> 
			> Steven Whitehouse wrote:
			> > Hi,
			> >
			> > This looks like Red Hat bugzilla 229831 and
if so then
			> there is already
			> > a fix in Linus' upstream kernel and in the
latest kernel build for 
			> > Fedora (both 5 and 6),
			> >
			> >   
			> yeah... agree. I missed the unlink part of the
trace in previous post
			> and thought it might have something to do with
my ino number 
			> hacking in 
			> the lookup code.
			> 
			> -- Wendy
			> > Steve.
			> >
			> > On Wed, 2007-03-14 at 00:15 -0700, Sridhar
Ramaswamy
			> (srramasw) wrote:
			> >   
			> >> Sure Wendy. Here it is,
			> >>
			> >> "fs/gfs2/inode.c" 1256
			> >> struct inode *gfs2_ilookup(struct
super_block *sb, struct
			> gfs2_inum_host
			> >> *inum)
			> >> {
			> >>         return ilookup5(sb, (unsigned
long)inum->no_formal_ino,
			> >>                         iget_test, inum);
			> >> }
			> >>
			> >> static struct inode *gfs2_iget(struct
super_block *sb, struct 
			> >> gfs2_inum_host *inum) {
			> >>         return iget5_locked(sb, (unsigned
long)inum->no_formal_ino,
			> >>                      iget_test, iget_set,
inum);
			> >> }
			> >>
			> >> BTW - this code is from kernel version
2.6.20.1.
			> >>
			> >> Thanks,
			> >> Sridhar
			> >>
			> >>     
			> >>> -----Original Message-----
			> >>> From: linux-cluster-bounces at redhat.com
			> >>> [mailto:linux-cluster-bounces at redhat.com]
On Behalf Of Wendy Cheng
			> >>> Sent: Tuesday, March 13, 2007 10:29 PM
			> >>> To: linux clustering
			> >>> Subject: Re: [Linux-cluster] Kernel oops
when GFS2 used as localfs
			> >>>
			> >>> Sridhar Ramaswamy (srramasw) wrote:
			> >>>
			> >>>       
			> >>>> Mar 13 15:25:56 cfs1 kernel:
------------[ cut here ]------------
			
			> >>>> Mar 13 15:25:56 cfs1 kernel: kernel BUG
at fs/gfs2/meta_io.c:474!
			> >>>>         
			> >>> I don't have time to pull kernel.org
source code right
			> now. Could you
			> >>> cut-and-paste the following two routines
(from fs/gfs2/inode.c):
			> >>> gfs2_ilookup() and gfs2_iget() so we can
look into this
			> >>> quickly ? They 
			> >>> are all one-liner so shouldn't be too much
troubles. GFS2 
			> >>> currently has 
			> >>> ino issue with its lookup code.
			> >>>
			> >>> -- Wendy
			> >>>
			> >>>       
			> >>>> Mar 13 15:25:56 cfs1 kernel: invalid
opcode: 0000 [#1] Mar 13 
			> >>>> 15:25:56 cfs1 kernel: SMP Mar 13 15:25:56
cfs1 kernel: Modules 
			> >>>> linked in: lock_nolock gfs2 reiserfs nfsd
exportfs nfs lockd 
			> >>>> nfs_acl ipv6 parport_pc
			> lp parport
			> >>>> autofs4 sunrpc dm_mirror dm_mod button
battery ac
			> uhci_hcd ehci_hcd
			> >>>> intel_rng rng_core i2c_i801 i2c_core
e1000 e100 mii
			> floppy ext3 jbd
			> >>>> Mar 13 15:25:56 cfs1 kernel: CPU:    1
			> >>>> Mar 13 15:25:56 cfs1 kernel: EIP:
0060:[<e0c88da5>]    
			> >>>>         
			> >>> Not tainted VLI
			> >>>       
			> >>>> Mar 13 15:25:56 cfs1 kernel: EFLAGS:
00010246   (2.6.20.1 #1)
			> >>>> Mar 13 15:25:56 cfs1 kernel: EIP is at
			> >>>> gfs2_meta_indirect_buffer+0x4c/0x278
[gfs2]
			> >>>> Mar 13 15:25:56 cfs1 kernel: eax:
00000000   ebx: 
			> 00012bf5   ecx: 
			> >>>> ce5a6dd4   edx: dc5eae00
			> >>>> Mar 13 15:25:56 cfs1 kernel: esi:
00000000   edi: 
			> 00000000   ebp: 
			> >>>> dc5ea9a8   esp: ce5a6d58
			> >>>> Mar 13 15:25:56 cfs1 kernel: ds: 007b
es: 007b   ss: 0068
			> >>>> Mar 13 15:25:56 cfs1 kernel: Process
bonnie++ (pid: 5509,
			> >>>>         
			> >>> ti=ce5a6000
			> >>>       
			> >>>> task=dc320570 task.ti=ce5a6000)
			> >>>> Mar 13 15:25:56 cfs1 kernel: Stack:
c156d274 ce5a6dd4 c016eb91
			> >>>> ce5a6dd4 00000000 dc5eae00 00000000
d6f34000
			> >>>> Mar 13 15:25:56 cfs1 kernel:
00000000 00000000 dc5ea9a8 
			> >>>> d57794a8 ce5a6e08 e0c83ad3 00012bf5
00000000
			> >>>> Mar 13 15:25:56 cfs1 kernel:
00000000 ce5a6da0 00000000 
			> >>>> 00000000 dc5ea9a8 d57794a8 e0c84cc9
ce5a6dd4
			> >>>> Mar 13 15:25:56 cfs1 kernel: Call Trace:
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<c016eb91>] iget5_locked+0x3d/0x67
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<e0c83ad3>] 
			> >>>> gfs2_inode_refresh+0x34/0xfe [gfs2]
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<e0c84cc9>] 
			> >>>>         
			> >>> gfs2_createi+0x12c/0x191 [gfs2]
			> >>>       
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<e0c8da3c>]
			> >>>>         
			> >>> gfs2_create+0x5c/0x103 [gfs2]
			> >>>       
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<e0c84be7>]
			> >>>>         
			> >>> gfs2_createi+0x4a/0x191 [gfs2]
			> >>>       
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<e0c820c4>]
			> >>>>         
			> >>> gfs2_glock_nq_num+0x3f/0x64
			> >>>       
			> >>>> [gfs2]
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<c016552e>] vfs_create+0xc3/0x126 
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<c01657f2>]
			> >>>>         
			> >>> open_namei_create+0x47/0x88
			> >>>       
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<c016597d>] open_namei+0x14a/0x539
			
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<c015d27b>] do_filp_open+0x25/0x39
			
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<c01d200a>]
			> >>>>         
			> >>> strncpy_from_user+0x3c/0x5b
			> >>>       
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<c015d431>]
			> get_unused_fd+0xa8/0xb1
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<c015d509>] do_sys_open+0x42/0xbe 
			> >>>> Mar 13 15:25:56 cfs1 kernel:
[<c015d59f>] sys_open+0x1a/0x1c Mar
			
			> >>>> 13 15:25:56 cfs1 kernel:  [<c015d5dd>]
sys_creat+0x1f/0x23 Mar 13
			
			> >>>> 15:25:56 cfs1 kernel:  [<c0103410>]
			> >>>>         
			> >>> sysenter_past_esp+0x5d/0x81
			> >>>       
			> >>>> Mar 13 15:25:56 cfs1 kernel:
======================= Mar 13 
			> >>>> 15:25:56 cfs1 kernel: Code: 80 a8 01 00
00 89 44 24
			> >>>>         
			> >>> 1c 8b 85 f0
			> >>>       
			> >>>> 01 00 00 c7 44 24 20 00 00 00 00 89 54 24
14 85 c0 89 44 24
			> >>>>         
			> >>> 18 c7 44
			> >>>       
			> >>>> 24 10 00 00 00 00 75 04 <0f> 0b eb fe 83
7c 24 1c 00 75 04
			> >>>>         
			> >>> 0f 0b eb fe
			> >>>       
			> >>>> 8d 85 24 04 00 00
			> >>>> Mar 13 15:25:57 cfs1 kernel: EIP:
[<e0c88da5>]
			> >>>> gfs2_meta_indirect_buffer+0x4c/0x278
[gfs2] SS:ESP 0068:ce5a6d58
			> >>>>  
			> >>>>  
			> >>>> (2)
			> >>>>  
			> >>>> Mar 13 17:00:30 cfs1 kernel: Call Trace:
			> >>>> Mar 13 17:00:30 cfs1 kernel:
[<e0badde1>]
			> >>>>         
			> >>> gfs2_unlink+0x53/0xe0 [gfs2]
			> >>>       
			> >>>> Mar 13 17:00:30 cfs1 kernel:
[<e0baddc8>]
			> >>>>         
			> >>> gfs2_unlink+0x3a/0xe0 [gfs2]
			> >>>       
			> >>>> Mar 13 17:00:30 cfs1 kernel:
[<e0badde1>]
			> >>>>         
			> >>> gfs2_unlink+0x53/0xe0 [gfs2]
			> >>>       
			> >>>> Mar 13 17:00:30 cfs1 kernel:
[<c0166572>] vfs_unlink+0xa1/0xc5 
			> >>>> Mar 13 17:00:30 cfs1 kernel:
[<c016662b>] do_unlinkat+0x95/0xf5 
			> >>>> Mar 13 17:00:30 cfs1 kernel:
[<c01187de>]
			> scheduler_tick+0x8f/0x95
			> >>>> Mar 13 17:00:30 cfs1 kernel:
[<c0103410>]
			> >>>>         
			> >>> sysenter_past_esp+0x5d/0x81
			> >>>       
			> >>>> Mar 13 17:00:30 cfs1 kernel:
=======================
			> >>>>
			> >>>>
-------------------------------------------------------------
			> >>>>         
			> >>> -----------
			> >>>       
			> >>>> --
			> >>>> Linux-cluster mailing list
			> >>>> Linux-cluster at redhat.com 
			> >>>>
https://www.redhat.com/mailman/listinfo/linux-cluster
			> >>>>
			> >>>>         
			> >>> --
			> >>> Linux-cluster mailing list
			> >>> Linux-cluster at redhat.com 
			> >>>
https://www.redhat.com/mailman/listinfo/linux-cluster
			> >>>
			> >>>       
			> >> --
			> >> Linux-cluster mailing list
			> >> Linux-cluster at redhat.com 
			> >>
https://www.redhat.com/mailman/listinfo/linux-cluster
			> >>     
			> >
			> >   
			> 
			> --
			> Linux-cluster mailing list
			> Linux-cluster at redhat.com 
			>
https://www.redhat.com/mailman/listinfo/linux-cluster
			> 
			
			--
			Linux-cluster mailing list
			Linux-cluster at redhat.com
	
https://www.redhat.com/mailman/listinfo/linux-cluster
			
			--
			Linux-cluster mailing list
			Linux-cluster at redhat.com
	
https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070320/97cfca69/attachment.htm>

From nirmaltom at hotmail.com  Wed Mar 21 06:08:20 2007
From: nirmaltom at hotmail.com (nirmal tom)
Date: Wed, 21 Mar 2007 11:38:20 +0530
Subject: [Linux-cluster] Minimum kernel version for cluster2.00
In-Reply-To: <31436f4a0703201247r20d4f8fbv83105783ccc3d7c1@mail.gmail.com>
Message-ID: <BAY136-F40399CC43A0D48E940638EAC740@phx.gbl>

hi,
i built on 2.6.19,it directly depends on your stuff.Many people here helped 
me out to get through the problems.When u ran on older versions , likely u 
will get compilation errors and u have to fix them.It is definitely 
posssible, but needs some work on it.

regards,
Nirmal Tom.


>From: "David Shwatrz" <dshwatrz at gmail.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: linux-cluster at redhat.com
>Subject: [Linux-cluster] Minimum kernel version for cluster2.00
>Date: Tue, 20 Mar 2007 21:47:07 +0200
>
>Hello,
>I saw somewhere (don't remember where - it could be on this list ) that
>cluster2.00  can be built with linux -2.6.20.
>My question: is this kernel version, 2.6.20 , is a minimum version ? are
>there any chances that it will be built with kernel version
>lower thatn 2.6.20 ? couldn't find info about it in the source tarball of
>cluster2.00
>
>Regaeds,
>DS


>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster




From nirmaltom at hotmail.com  Wed Mar 21 06:12:22 2007
From: nirmaltom at hotmail.com (nirmal tom)
Date: Wed, 21 Mar 2007 11:42:22 +0530
Subject: [Linux-cluster] Cluster in virtual machines ???
In-Reply-To: <20070321004548.22398.qmail@web63203.mail.re1.yahoo.com>
Message-ID: <BAY136-F2472FABD77CC083A250ADDAC740@phx.gbl>

hi,
Its possible through iscsi,gnbd even NFSv4.But iscsi and gnbd makes sense.
Open source project is

http://iscsitarget.sourceforge.net/

and is working too nice for me

regards,
Nirmal Tom
>From: qua nong <quanong_os at yahoo.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: linux clustering <linux-cluster at redhat.com>
>Subject: [Linux-cluster] Cluster in virtual machines ???
>Date: Wed, 21 Mar 2007 11:45:48 +1100 (EST)
>
>Hi,
>
>I have only 1 physical server with a number of local harddisks (no SAN or 
>external hardisks).
>I would like set up at least 2 virtual machines using either xen or vmware 
>server, then cluster the virtual machines.
>
>I wonder how could  the same disks be presented to both virtual machines. 
>i.e. both virtual machines can share the same disks ?
>Is it possible to configure the iSCSI to both virtual machines ?
>
>The problem is I donnot have any SAN or external storage which  can be 
>shared to both servers.
>
>QN
>
>
>
>
>
>____________________________________________________________________________________
>Get your own web address.
>Have a HUGE year through Yahoo! Small Business.
>http://smallbusiness.yahoo.com/domains/?p=BESTDEAL


>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster




From shailesh at verismonetworks.com  Wed Mar 21 09:48:54 2007
From: shailesh at verismonetworks.com (Shailesh)
Date: Wed, 21 Mar 2007 15:18:54 +0530
Subject: [Linux-cluster] storage clustering without SAN
Message-ID: <1174470534.15862.70.camel@shailesh>

Hi Everyone,
             I am looking for a storage solution where there is NO
common storage area like a SAN. But just a bunch of PC clustered togther
with their internal disk space used collectively as a storage area.

Can you suggest me any information that talks about such an
architecture ?

Can I use the redhat clustering suite with these bunch of PC
connected on a ethernet network ?

Do I have to use GFS , if RHCS is employed ?

I'll appreciate your answer to the above. 

Thanks & Regards
Shailesh P S


 




From hyperbaba at neobee.net  Wed Mar 21 11:00:29 2007
From: hyperbaba at neobee.net (Vladimir Grujic)
Date: Wed, 21 Mar 2007 12:00:29 +0100
Subject: [Linux-cluster] gnbd in non clustered enviroment and lvm2
Message-ID: <200703211200.29622.hyperbaba@neobee.net>

hello,
i'm currently playing with this configuration:

storage:
lvm2 created PV/VG/LV on a hardware raid controller 1T drive
gnbd_export with caching is exporting just one LV

test mashine:
gnbd_import of exported block device

no cluster enviroment (once was operating cluster/gfs and it shut down due the 
crash on gfs filesystem which take 3 days to repair)

block device exported via gnbd is mounted on the test machine
everything works fine, until i extend the size of the exported LV

result is that gnbd still exports block device with old size.
i've tried to remove gnbd import on test machine and got this on storage 
"ERROR size of the exported file /dev/VG/test has changed, aborting" syslog

is there a way to force the gnbd to recognize changes in block device size and 
respond to them without need for clvmd and gnbd exporting entire PV

Vladimir





-- 
------------------------------------------------------------
Any cool program always requires more memory than you have. 
	-- Murphy's Computer Laws n?2
------------------------------------------------------------



From rpeterso at redhat.com  Wed Mar 21 14:06:37 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 21 Mar 2007 09:06:37 -0500
Subject: [Linux-cluster] Re: cluster not doing failover
In-Reply-To: <009801c76b59$89833fd0$2702140a@se19261f2cf9ed>
References: <20070310170007.4934A731DB@hormel.redhat.com>
	<009801c76b59$89833fd0$2702140a@se19261f2cf9ed>
Message-ID: <46013BED.3010608@redhat.com>

Sai Loganathan wrote:
> I just power down node 1 to simulate the failover to node2. Unless I execute
> the command fence_ack_manual -n node1, the system will not move forward and
> wait in fencing. How to fix this error?

Hi Sai,

There is no error.  That's exactly how manual fencing is supposed to work.
The alternative is to use a different fencing method.

Regards,

Bob Peterson
Red Hat Cluster Suite



From rpeterso at redhat.com  Wed Mar 21 14:11:44 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 21 Mar 2007 09:11:44 -0500
Subject: [Linux-cluster] storage clustering without SAN
In-Reply-To: <1174470534.15862.70.camel@shailesh>
References: <1174470534.15862.70.camel@shailesh>
Message-ID: <46013D20.2070705@redhat.com>

Shailesh wrote:
> Hi Everyone,
>              I am looking for a storage solution where there is NO
> common storage area like a SAN. But just a bunch of PC clustered togther
> with their internal disk space used collectively as a storage area.
> 
> Can you suggest me any information that talks about such an
> architecture ?
> 
> Can I use the redhat clustering suite with these bunch of PC
> connected on a ethernet network ?
> 
> Do I have to use GFS , if RHCS is employed ?
> 
> I'll appreciate your answer to the above. 
> 
> Thanks & Regards
> Shailesh P S

Hi Shailesh,

If your storage is not shared, what exactly do you mean by "used
collectively"?

You can use the cluster suite to cluster a bunch of PCs together
so they cooperate, provide High Availability services (through
rgmanager) and so forth, regardless of whether you use GFS.

If your storage isn't shared between the systems, you don't
need GFS.

Regards,

Bob Peterson
Red Hat Cluster Suite



From pcaulfie at redhat.com  Wed Mar 21 14:19:05 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 21 Mar 2007 14:19:05 +0000
Subject: [Linux-cluster] Re: cluster not doing failover
In-Reply-To: <009801c76b59$89833fd0$2702140a@se19261f2cf9ed>
References: <20070310170007.4934A731DB@hormel.redhat.com>
	<009801c76b59$89833fd0$2702140a@se19261f2cf9ed>
Message-ID: <46013ED9.9090205@redhat.com>

Sai Loganathan wrote:

> During shutdown, I get the following error message and system waits there
> infinitely.
> Starting Killall: CMAN: sendmsg failed: -101
> WARNING: dlm_emergency_shutdown
> SM: 00000003 sm_stop: SG stilljoined
> How to fix this error?
> 
>

You need to shut down cman properly during system shutdown, not just wait for it
to be killed.

See the provided examples scripts or just use the 'cman_tool leave' command.

-- 

patrick



From lhh at redhat.com  Wed Mar 21 14:20:56 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 21 Mar 2007 10:20:56 -0400
Subject: [Linux-cluster] gnbd in non clustered enviroment and lvm2
In-Reply-To: <200703211200.29622.hyperbaba@neobee.net>
References: <200703211200.29622.hyperbaba@neobee.net>
Message-ID: <1174486857.3003.2.camel@localhost.localdomain>

On Wed, 2007-03-21 at 12:00 +0100, Vladimir Grujic wrote:

> result is that gnbd still exports block device with old size.

I don't think GNBD is in the habit of re-checking the device sizes; not
sure though.

> i've tried to remove gnbd import on test machine and got this on storage 
> "ERROR size of the exported file /dev/VG/test has changed, aborting" syslog
> 
> is there a way to force the gnbd to recognize changes in block device size and 
> respond to them without need for clvmd and gnbd exporting entire PV

I don't think so - at least, not at the moment.  Ben would know more.

-- Lon




From rainer at ultra-secure.de  Wed Mar 21 14:22:48 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Wed, 21 Mar 2007 15:22:48 +0100
Subject: [Linux-cluster] storage clustering without SAN
In-Reply-To: <1174470534.15862.70.camel@shailesh>
References: <1174470534.15862.70.camel@shailesh>
Message-ID: <46013FB8.7040103@ultra-secure.de>

Shailesh wrote:
> Hi Everyone,
>              I am looking for a storage solution where there is NO
> common storage area like a SAN. But just a bunch of PC clustered togther
> with their internal disk space used collectively as a storage area.
>
> Can you suggest me any information that talks about such an
> architecture ?
>
> Can I use the redhat clustering suite with these bunch of PC
> connected on a ethernet network ?
>
> Do I have to use GFS , if RHCS is employed ?
>
> I'll appreciate your answer to the above. 
>
>   


http://datafarm.apgrid.org/   ???

I have not done any testing with this.
I just stumbled on the link some time ago.


cheers,
Rainer



From lhh at redhat.com  Wed Mar 21 14:24:30 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Wed, 21 Mar 2007 10:24:30 -0400
Subject: [Linux-cluster] Cluster in virtual machines ???
In-Reply-To: <20070321004548.22398.qmail@web63203.mail.re1.yahoo.com>
References: <20070321004548.22398.qmail@web63203.mail.re1.yahoo.com>
Message-ID: <1174487071.3003.6.camel@localhost.localdomain>

On Wed, 2007-03-21 at 11:45 +1100, qua nong wrote:
> Hi,
> 
> I have only 1 physical server with a number of local harddisks (no SAN
> or external hardisks).
> I would like set up at least 2 virtual machines using either xen or
> vmware server, then cluster the virtual machines.
> 
> I wonder how could  the same disks be presented to both virtual
> machines. i.e. both virtual machines can share the same disks ? 
> Is it possible to configure the iSCSI to both virtual machines ?
> 
> The problem is I donnot have any SAN or external storage which  can be
> shared to both servers.

You can use iSCSI/gnbd, or you can enable multiple-writers in the Xen
config files for the block devices - bypassing all that stuff. ;)   I
forgot what the exact flag was, though.

On the client side, if you intend to mount the same file system on
multiple nodes, you'll still need something like GFS running in the
client.

-- Lon




From cjk at techma.com  Wed Mar 21 14:36:41 2007
From: cjk at techma.com (Kovacs, Corey J.)
Date: Wed, 21 Mar 2007 10:36:41 -0400
Subject: =?us-ascii?Q?RE:_=5BLinux-cluster=5D_storage_clustering_without_SAN?=
In-Reply-To: <46013D20.2070705@redhat.com>
References: <1174470534.15862.70.camel@shailesh> <46013D20.2070705@redhat.com>
Message-ID: <7DCE72B3C36E2A45B7580F887EE4948C18FDE2@TMAEMAIL.techma.com>

Sounds like what you want is something along the lines of Lustre or even GPFS
from IBM.
Those systems allow aggregate use of nodes as storage bricks. RHCS could be
used on top 
of such a system as well as in conjunction with. Lustre for instance allows a
node to be 
designated as a failover for another node, but as far as I know requires some
other method 
to ensure the storage device is taken over by the failover node. It all
depends on what you
are trying to accomplish. If you are trying to save money, move ahead
carefully. I'm not 
sure if the storage bricks are meant to be used for general purpose computing
etc once they
are configured to provide storage. Also, due to the current lack of "raid"
personalities in 
things like Lustre, you will want some nice hardware, which by the time you
get done paying
for you might as well have paid for a small SAN.  For raw thoughput though,
it's hard to beat
Luster or GPFS.


Corey Kovacs
Senior Systems Engineer
Technology Management Associates
703.279.6168 (B)
855-6168 (R)


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Peterson
Sent: Wednesday, March 21, 2007 10:12 AM
To: linux clustering
Subject: Re: [Linux-cluster] storage clustering without SAN

Shailesh wrote:
> Hi Everyone,
>              I am looking for a storage solution where there is NO 
> common storage area like a SAN. But just a bunch of PC clustered 
> togther with their internal disk space used collectively as a storage area.
> 
> Can you suggest me any information that talks about such an 
> architecture ?
> 
> Can I use the redhat clustering suite with these bunch of PC connected 
> on a ethernet network ?
> 
> Do I have to use GFS , if RHCS is employed ?
> 
> I'll appreciate your answer to the above. 
> 
> Thanks & Regards
> Shailesh P S

Hi Shailesh,

If your storage is not shared, what exactly do you mean by "used
collectively"?

You can use the cluster suite to cluster a bunch of PCs together so they
cooperate, provide High Availability services (through
rgmanager) and so forth, regardless of whether you use GFS.

If your storage isn't shared between the systems, you don't need GFS.

Regards,

Bob Peterson
Red Hat Cluster Suite

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



From jrjablo at gmail.com  Wed Mar 21 15:16:25 2007
From: jrjablo at gmail.com (Joey Jablonski)
Date: Wed, 21 Mar 2007 09:16:25 -0600
Subject: [Linux-cluster] storage clustering without SAN
In-Reply-To: <7DCE72B3C36E2A45B7580F887EE4948C18FDE2@TMAEMAIL.techma.com>
References: <1174470534.15862.70.camel@shailesh> <46013D20.2070705@redhat.com>
	<7DCE72B3C36E2A45B7580F887EE4948C18FDE2@TMAEMAIL.techma.com>
Message-ID: <78a88ef70703210816h67fc284cuff47feed2912e2ad@mail.gmail.com>

On potential is Gfarm.

http://datafarm.apgrid.org/

--jj


On 3/21/07, Kovacs, Corey J. <cjk at techma.com> wrote:
>
> Sounds like what you want is something along the lines of Lustre or even
> GPFS
> from IBM.
> Those systems allow aggregate use of nodes as storage bricks. RHCS could
> be
> used on top
> of such a system as well as in conjunction with. Lustre for instance
> allows a
> node to be
> designated as a failover for another node, but as far as I know requires
> some
> other method
> to ensure the storage device is taken over by the failover node. It all
> depends on what you
> are trying to accomplish. If you are trying to save money, move ahead
> carefully. I'm not
> sure if the storage bricks are meant to be used for general purpose
> computing
> etc once they
> are configured to provide storage. Also, due to the current lack of "raid"
> personalities in
> things like Lustre, you will want some nice hardware, which by the time
> you
> get done paying
> for you might as well have paid for a small SAN.  For raw thoughput
> though,
> it's hard to beat
> Luster or GPFS.
>
>
> Corey Kovacs
> Senior Systems Engineer
> Technology Management Associates
> 703.279.6168 (B)
> 855-6168 (R)
>
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Peterson
> Sent: Wednesday, March 21, 2007 10:12 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] storage clustering without SAN
>
> Shailesh wrote:
> > Hi Everyone,
> >              I am looking for a storage solution where there is NO
> > common storage area like a SAN. But just a bunch of PC clustered
> > togther with their internal disk space used collectively as a storage
> area.
> >
> > Can you suggest me any information that talks about such an
> > architecture ?
> >
> > Can I use the redhat clustering suite with these bunch of PC connected
> > on a ethernet network ?
> >
> > Do I have to use GFS , if RHCS is employed ?
> >
> > I'll appreciate your answer to the above.
> >
> > Thanks & Regards
> > Shailesh P S
>
> Hi Shailesh,
>
> If your storage is not shared, what exactly do you mean by "used
> collectively"?
>
> You can use the cluster suite to cluster a bunch of PCs together so they
> cooperate, provide High Availability services (through
> rgmanager) and so forth, regardless of whether you use GFS.
>
> If your storage isn't shared between the systems, you don't need GFS.
>
> Regards,
>
> Bob Peterson
> Red Hat Cluster Suite
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
Joey Jablonski
505-239-4947
jrjablo at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070321/dd901768/attachment.htm>

From jrjablo at gmail.com  Wed Mar 21 15:19:37 2007
From: jrjablo at gmail.com (Joey Jablonski)
Date: Wed, 21 Mar 2007 09:19:37 -0600
Subject: [Linux-cluster] storage clustering without SAN
In-Reply-To: <78a88ef70703210816h67fc284cuff47feed2912e2ad@mail.gmail.com>
References: <1174470534.15862.70.camel@shailesh> <46013D20.2070705@redhat.com>
	<7DCE72B3C36E2A45B7580F887EE4948C18FDE2@TMAEMAIL.techma.com>
	<78a88ef70703210816h67fc284cuff47feed2912e2ad@mail.gmail.com>
Message-ID: <78a88ef70703210819r3edf8598uf3e2523262ca630f@mail.gmail.com>

dcache is another option as well:

http://www.dcache.org/

-jj


On 3/21/07, Joey Jablonski <jrjablo at gmail.com> wrote:
>
> On potential is Gfarm.
>
> http://datafarm.apgrid.org/
>
> --jj
>
>
>  On 3/21/07, Kovacs, Corey J. <cjk at techma.com> wrote:
> >
> > Sounds like what you want is something along the lines of Lustre or even
> > GPFS
> > from IBM.
> > Those systems allow aggregate use of nodes as storage bricks. RHCS could
> > be
> > used on top
> > of such a system as well as in conjunction with. Lustre for instance
> > allows a
> > node to be
> > designated as a failover for another node, but as far as I know requires
> > some
> > other method
> > to ensure the storage device is taken over by the failover node. It all
> > depends on what you
> > are trying to accomplish. If you are trying to save money, move ahead
> > carefully. I'm not
> > sure if the storage bricks are meant to be used for general purpose
> > computing
> > etc once they
> > are configured to provide storage. Also, due to the current lack of
> > "raid"
> > personalities in
> > things like Lustre, you will want some nice hardware, which by the time
> > you
> > get done paying
> > for you might as well have paid for a small SAN.  For raw thoughput
> > though,
> > it's hard to beat
> > Luster or GPFS.
> >
> >
> > Corey Kovacs
> > Senior Systems Engineer
> > Technology Management Associates
> > 703.279.6168 (B)
> > 855-6168 (R)
> >
> >
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com
> > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Peterson
> > Sent: Wednesday, March 21, 2007 10:12 AM
> > To: linux clustering
> > Subject: Re: [Linux-cluster] storage clustering without SAN
> >
> > Shailesh wrote:
> > > Hi Everyone,
> > >              I am looking for a storage solution where there is NO
> > > common storage area like a SAN. But just a bunch of PC clustered
> > > togther with their internal disk space used collectively as a storage
> > area.
> > >
> > > Can you suggest me any information that talks about such an
> > > architecture ?
> > >
> > > Can I use the redhat clustering suite with these bunch of PC connected
> >
> > > on a ethernet network ?
> > >
> > > Do I have to use GFS , if RHCS is employed ?
> > >
> > > I'll appreciate your answer to the above.
> > >
> > > Thanks & Regards
> > > Shailesh P S
> >
> > Hi Shailesh,
> >
> > If your storage is not shared, what exactly do you mean by "used
> > collectively"?
> >
> > You can use the cluster suite to cluster a bunch of PCs together so they
> > cooperate, provide High Availability services (through
> > rgmanager) and so forth, regardless of whether you use GFS.
> >
> > If your storage isn't shared between the systems, you don't need GFS.
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat Cluster Suite
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
>
>
>
> --
> Joey Jablonski
> 505-239-4947
> jrjablo at gmail.com




-- 
Joey Jablonski
505-239-4947
jrjablo at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070321/0e6bbd4f/attachment.htm>

From srramasw at cisco.com  Wed Mar 21 22:02:06 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Wed, 21 Mar 2007 15:02:06 -0700
Subject: [Linux-cluster] storage clustering without SAN
In-Reply-To: <46013D20.2070705@redhat.com>
Message-ID: <B14199FA0DBAAF4AA89E83EB41D354350342D63B@xmb-sjc-22c.amer.cisco.com>


Don't we have a potential solution using GNBD and CLVMD? 

So you've bunch of PCs with hard disks. 

1) Each PC create LVM PhysicalVolume (pvcreate) on its storage device
2) Each PC exports the physical device using GNBD non-cached
3) Each PC imports other PCs storage using GNBD
4) Create a VolumeGroup combining GNBD devices (like
/dev/gnbd/gnbd_pc01) and local physical devices (/dev/hda8). CLVMD will
take care of distributing LVM info among cluster nodes.
5) Create GFS on top of an logical-volume created off this above
VolumeGroup

Would such a thing work alteast as described above? Such a setup will
aggregate storage off these PCs into a single volume.

Ofcourse, even if it does "work", it is a different question whether it
make sense to deploy such a solution. Major downside is - even if one PC
shuts down, the whole VolumeGroup would be offline and GFS won't be
available. Fencing doesn't apply here.

Thanks,
Sridhar 

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com 
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Robert Peterson
> Sent: Wednesday, March 21, 2007 7:12 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] storage clustering without SAN
> 
> Shailesh wrote:
> > Hi Everyone,
> >              I am looking for a storage solution where there is NO
> > common storage area like a SAN. But just a bunch of PC 
> clustered togther
> > with their internal disk space used collectively as a storage area.
> > 
> > Can you suggest me any information that talks about such an
> > architecture ?
> > 
> > Can I use the redhat clustering suite with these bunch of PC
> > connected on a ethernet network ?
> > 
> > Do I have to use GFS , if RHCS is employed ?
> > 
> > I'll appreciate your answer to the above. 
> > 
> > Thanks & Regards
> > Shailesh P S
> 
> Hi Shailesh,
> 
> If your storage is not shared, what exactly do you mean by "used
> collectively"?
> 
> You can use the cluster suite to cluster a bunch of PCs together
> so they cooperate, provide High Availability services (through
> rgmanager) and so forth, regardless of whether you use GFS.
> 
> If your storage isn't shared between the systems, you don't
> need GFS.
> 
> Regards,
> 
> Bob Peterson
> Red Hat Cluster Suite
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



From rainer at ultra-secure.de  Wed Mar 21 22:34:09 2007
From: rainer at ultra-secure.de (Rainer Duffner)
Date: Wed, 21 Mar 2007 23:34:09 +0100
Subject: [Linux-cluster] storage clustering without SAN
In-Reply-To: <B14199FA0DBAAF4AA89E83EB41D354350342D63B@xmb-sjc-22c.amer.cisco.com>
References: <B14199FA0DBAAF4AA89E83EB41D354350342D63B@xmb-sjc-22c.amer.cisco.com>
Message-ID: <F6FA5EAD-4C9C-43CD-9064-B7D011EF9A14@ultra-secure.de>


Am 21.03.2007 um 23:02 schrieb Sridhar Ramaswamy (srramasw):

>
> Don't we have a potential solution using GNBD and CLVMD?
>
> So you've bunch of PCs with hard disks.
>
> 1) Each PC create LVM PhysicalVolume (pvcreate) on its storage device
> 2) Each PC exports the physical device using GNBD non-cached
> 3) Each PC imports other PCs storage using GNBD
> 4) Create a VolumeGroup combining GNBD devices (like
> /dev/gnbd/gnbd_pc01) and local physical devices (/dev/hda8). CLVMD  
> will
> take care of distributing LVM info among cluster nodes.
> 5) Create GFS on top of an logical-volume created off this above
> VolumeGroup
>
> Would such a thing work alteast as described above? Such a setup will
> aggregate storage off these PCs into a single volume.
>
> Ofcourse, even if it does "work", it is a different question  
> whether it
> make sense to deploy such a solution. Major downside is - even if  
> one PC
> shuts down, the whole VolumeGroup would be offline and GFS won't be
> available. Fencing doesn't apply here.
>
>

;--)
Usually, "distributed" still implies some sort of redundancy.
Unless you're Google and your data consists of copies of downloaded  
webpages, losing some of your data is not an option.



Rainer



From jvantuyl at engineyard.com  Wed Mar 21 23:02:47 2007
From: jvantuyl at engineyard.com (Jayson Vantuyl)
Date: Wed, 21 Mar 2007 18:02:47 -0500
Subject: [Linux-cluster] Cluster in virtual machines ???
In-Reply-To: <BAY136-F2472FABD77CC083A250ADDAC740@phx.gbl>
References: <BAY136-F2472FABD77CC083A250ADDAC740@phx.gbl>
Message-ID: <84D1EED9-099E-484D-818F-2185B13E5CBC@engineyard.com>

With Xen you can present the disk to both servers simultaneously.  In  
the device entry it is important to specify that the device mode is  
w!.  The ! tells Xen to allow both VMs to access it at the same time.

I do this with AoE (ATA over Ethernet).  kblade or vblade could do  
this for you as well.

On Mar 21, 2007, at 1:12 AM, nirmal tom wrote:

> hi,
> Its possible through iscsi,gnbd even NFSv4.But iscsi and gnbd makes  
> sense.
> Open source project is
>
> http://iscsitarget.sourceforge.net/
>
> and is working too nice for me
>
> regards,
> Nirmal Tom
>> From: qua nong <quanong_os at yahoo.com>
>> Reply-To: linux clustering <linux-cluster at redhat.com>
>> To: linux clustering <linux-cluster at redhat.com>
>> Subject: [Linux-cluster] Cluster in virtual machines ???
>> Date: Wed, 21 Mar 2007 11:45:48 +1100 (EST)
>>
>> Hi,
>>
>> I have only 1 physical server with a number of local harddisks (no  
>> SAN or external hardisks).
>> I would like set up at least 2 virtual machines using either xen  
>> or vmware server, then cluster the virtual machines.
>>
>> I wonder how could  the same disks be presented to both virtual  
>> machines. i.e. both virtual machines can share the same disks ?
>> Is it possible to configure the iSCSI to both virtual machines ?
>>
>> The problem is I donnot have any SAN or external storage which   
>> can be shared to both servers.
>>
>> QN
>>

-- 
Jayson Vantuyl
Systems Architect
Engine Yard
jvantuyl at engineyard.com


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070321/0b544bf1/attachment.htm>

From shailesh at verismonetworks.com  Thu Mar 22 06:01:45 2007
From: shailesh at verismonetworks.com (Shailesh)
Date: Thu, 22 Mar 2007 11:31:45 +0530
Subject: [Linux-cluster] storage clustering without SAN
In-Reply-To: <46013D20.2070705@redhat.com>
References: <1174470534.15862.70.camel@shailesh> <46013D20.2070705@redhat.com>
Message-ID: <1174543305.15862.81.camel@shailesh>

I thank you for your response...

On reading your response I get that  the cluster suite can be 
used independant of the File-system and is not tied to any particular 
file system.

So I should be able to share the local disk space of a node using
NFS/CIFS to other nodes with all nodes under RHCS, right?

Also in such a scenario ,a lock manager will not be needed isnt it ?

- Shailesh

On Wed, 2007-03-21 at 09:11 -0500, Robert Peterson wrote:
> Shailesh wrote:
> > Hi Everyone,
> >              I am looking for a storage solution where there is NO
> > common storage area like a SAN. But just a bunch of PC clustered togther
> > with their internal disk space used collectively as a storage area.
> > 
> > Can you suggest me any information that talks about such an
> > architecture ?
> > 
> > Can I use the redhat clustering suite with these bunch of PC
> > connected on a ethernet network ?
> > 
> > Do I have to use GFS , if RHCS is employed ?
> > 
> > I'll appreciate your answer to the above. 
> > 
> > Thanks & Regards
> > Shailesh P S
> 
> Hi Shailesh,
> 
> If your storage is not shared, what exactly do you mean by "used
> collectively"?
> 
> You can use the cluster suite to cluster a bunch of PCs together
> so they cooperate, provide High Availability services (through
> rgmanager) and so forth, regardless of whether you use GFS.
> 
> If your storage isn't shared between the systems, you don't
> need GFS.
> 
> Regards,
> 
> Bob Peterson
> Red Hat Cluster Suite
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> 



From nirmaltom at hotmail.com  Thu Mar 22 06:28:51 2007
From: nirmaltom at hotmail.com (nirmal tom)
Date: Thu, 22 Mar 2007 11:58:51 +0530
Subject: [Linux-cluster] storage clustering without SAN
In-Reply-To: <1174543305.15862.81.camel@shailesh>
Message-ID: <BAY136-F790ED8834660727CDA5A3AC6B0@phx.gbl>

hi,
i think that u want ur local file sytem to act like san ,isn't it?
If you really want ur file system to be accessed by different nodes 
simultaneously, then a cluster file system is best and u must have it.
To make ur local filesytem act like san, u can use gnbd or iscsi.

the lock managers like dlm are used for gfs.No need if u dont deploy one.
regards,
Nirmal Tom.



>From: Shailesh <shailesh at verismonetworks.com>
>Reply-To: linux clustering <linux-cluster at redhat.com>
>To: linux clustering <linux-cluster at redhat.com>
>Subject: Re: [Linux-cluster] storage clustering without SAN
>Date: Thu, 22 Mar 2007 11:31:45 +0530
>
>I thank you for your response...
>
>On reading your response I get that  the cluster suite can be
>used independant of the File-system and is not tied to any particular
>file system.
>
>So I should be able to share the local disk space of a node using
>NFS/CIFS to other nodes with all nodes under RHCS, right?
>
>Also in such a scenario ,a lock manager will not be needed isnt it ?
>
>- Shailesh
>
>On Wed, 2007-03-21 at 09:11 -0500, Robert Peterson wrote:
> > Shailesh wrote:
> > > Hi Everyone,
> > >              I am looking for a storage solution where there is NO
> > > common storage area like a SAN. But just a bunch of PC clustered 
>togther
> > > with their internal disk space used collectively as a storage area.
> > >
> > > Can you suggest me any information that talks about such an
> > > architecture ?
> > >
> > > Can I use the redhat clustering suite with these bunch of PC
> > > connected on a ethernet network ?
> > >
> > > Do I have to use GFS , if RHCS is employed ?
> > >
> > > I'll appreciate your answer to the above.
> > >
> > > Thanks & Regards
> > > Shailesh P S
> >
> > Hi Shailesh,
> >
> > If your storage is not shared, what exactly do you mean by "used
> > collectively"?
> >
> > You can use the cluster suite to cluster a bunch of PCs together
> > so they cooperate, provide High Availability services (through
> > rgmanager) and so forth, regardless of whether you use GFS.
> >
> > If your storage isn't shared between the systems, you don't
> > need GFS.
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat Cluster Suite
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>https://www.redhat.com/mailman/listinfo/linux-cluster




From grimme at atix.de  Thu Mar 22 08:17:04 2007
From: grimme at atix.de (Marc Grimme)
Date: Thu, 22 Mar 2007 09:17:04 +0100
Subject: [Linux-cluster] Unable to obtain lock
In-Reply-To: <1170096286.30401.50.camel@rei.boston.devel.redhat.com>
References: <200701260919.05228.grimme@atix.de>
	<200701261928.43372.grimme@atix.de>
	<1170096286.30401.50.camel@rei.boston.devel.redhat.com>
Message-ID: <200703220917.04917.grimme@atix.de>

Hello,
again we had the same problem as stated in January. We installed the hotfix 
but it didn't help.
Again the whole cluster freezed, no node was allowed to rejoin the 
fencedomain.
Any ideas or do you need any more information?
Thanks and Regards Marc.

Mar 22 04:04:03 lilr623b clurgmgrd[12855]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:04:06 lilr623a clurgmgrd[20754]: <err> #50: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:04:31 lilr623e clurgmgrd[20331]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:04:33 lilr623b clurgmgrd[12855]: <err> #50: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:04:50 lilr623a clurgmgrd[20754]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:05:18 lilr623b clurgmgrd[12855]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:05:35 lilr623a clurgmgrd[20754]: <err> #50: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:06:03 lilr623b clurgmgrd[12855]: <err> #50: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:06:21 lilr623a clurgmgrd[20754]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:06:33 lilr623b clurgmgrd[12855]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Mar 22 04:07:05 lilr623a clurgmgrd[20754]: <err> #50: Unable to obtain cluster 
lock: Connection timed out
Mar 22 07:09:39 lilr623d kernel: CMAN: node lilr623f-ics0 has been removed 
from the cluster : Missed too many heartbeats
Mar 22 07:09:39 lilr623c kernel: CMAN: node lilr623f-ics0 has been removed 
from the cluster : Missed too many heartbeats
Mar 22 07:09:39 lilr623d kernel: dlm: lt_sharedroot: send_cluster_request to 3 
state 1 recovery
Mar 22 07:10:00 lilr623d kernel: CMAN: node lilr623b-ics0 has been removed 
from the cluster : Missed too many heartbeats
Mar 22 07:10:00 lilr623c kernel: CMAN: removing node lilr623b-ics0 from the 
cluster : Missed too many heartbeats
Mar 22 07:10:05 lilr623c kernel: dlm: lt_sharedroot: dlm_dir_rebuild_local 
failed -1
Mar 22 07:10:05 lilr623d kernel: dlm: lt_sharedroot: dlm_dir_rebuild_local 
failed -1
Mar 22 07:10:05 lilr623c kernel: dlm: lt_scratch: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:05 lilr623d kernel: dlm: lt_scratch: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:10 lilr623c kernel: dlm: lt_products: restbl_rsb_update failed -1
Mar 22 07:10:10 lilr623d kernel: dlm: lt_products: restbl_rsb_update failed -1
Mar 22 07:10:11 lilr623c kernel: dlm: lt_P06user: dlm_dir_rebuild_local 
failed -1
Mar 22 07:10:11 lilr623d kernel: dlm: lt_P06user: dlm_dir_rebuild_local 
failed -1
Mar 22 07:10:11 lilr623c kernel: dlm: lt_P06user1: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:11 lilr623d kernel: dlm: lt_P06user1: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:15 lilr623d kernel: dlm: lt_P06sap: dlm_dir_rebuild_local 
failed -1
Mar 22 07:10:15 lilr623d kernel: dlm: lt_P06origlogA: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:15 lilr623d kernel: dlm: lt_P06origlogB: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:16 lilr623c kernel: dlm: lt_P06sap: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:10:20 lilr623d kernel: dlm: lt_P06origlogC: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:10:21 lilr623c kernel: dlm: lt_P06origlogA: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:10:21 lilr623c kernel: dlm: lt_P06origlogB: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:22 lilr623c kernel: dlm: lt_P06origlogC: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:25 lilr623d kernel: dlm: lt_P06origlogD: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:10:25 lilr623d kernel: dlm: lt_P06mirrlogA: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:25 lilr623d kernel: dlm: lt_P06mirrlogB: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:27 lilr623c kernel: dlm: lt_P06origlogD: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:10:27 lilr623c kernel: dlm: lt_P06mirrlogA: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:28 lilr623c kernel: dlm: lt_P06mirrlogB: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:28 lilr623c kernel: dlm: lt_P06mirrlogC: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:29 lilr623c kernel: dlm: lt_P06mirrlogD: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:30 lilr623c kernel: dlm: lt_P06arch: restbl_rsb_update failed -1
Mar 22 07:10:30 lilr623c kernel: dlm: lt_P06data1: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:30 lilr623d kernel: dlm: lt_P06mirrlogC: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:10:30 lilr623d kernel: dlm: lt_P06mirrlogD: dlm_dir_rebuild_wait 
failed 1
Mar 22 07:10:31 lilr623c kernel: dlm: lt_P06data2: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:35 lilr623c kernel: dlm: lt_P06data3: restbl_rsb_update failed -1
Mar 22 07:10:35 lilr623d kernel: dlm: lt_P06arch: restbl_rsb_update failed -1
Mar 22 07:10:35 lilr623c kernel: dlm: lt_P06data4: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:35 lilr623d kernel: dlm: lt_P06data1: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:36 lilr623d kernel: dlm: lt_P06data2: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:41 lilr623d kernel: dlm: lt_P06data3: restbl_rsb_update failed -1
Mar 22 07:10:41 lilr623d kernel: dlm: lt_P06data4: dlm_dir_rebuild_wait failed 
1
Mar 22 07:10:41 lilr623c kernel: dlm: clvmd: dlm_dir_rebuild_wait failed -1
Mar 22 07:10:41 lilr623c kernel: dlm: Magma: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:41 lilr623d kernel: dlm: clvmd: dlm_dir_rebuild_wait failed 1
Mar 22 07:10:42 lilr623d kernel: dlm: Magma: dlm_dir_rebuild_wait failed 1
Mar 22 07:11:05 lilr623c fenced[15490]: fencing deferred to lilr623a-ics0
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data3.2: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data3.0: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data3.2: jid=5: 
Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data2.0: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data1.2: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=5: 
Looking at journal...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data1.2: jid=5: 
Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data1.0: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=5: Busy
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06arch.0: jid=5: Trying 
to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=4: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06arch.0: jid=5: 
Looking at journal...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=4: 
Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06mirrlogD.0: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogC.2: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data3.0: jid=5: Busy
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogC.2: jid=5: 
Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data3.0: jid=4: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogB.2: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06mirrlogC.0: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogB.2: jid=5: 
Looking at journal...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data1.0: jid=5: Busy
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogA.2: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623d kernel: GFS: fsid=lilr623:lt_P06data1.0: jid=4: 
Trying to acquire journal lock...
Mar 22 07:11:35 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogD.2: jid=5: 
Trying to acquire journal lock...
....
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_products.2: jid=4: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06data3.0: jid=4: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623c kernel: lock_dlm: lm_dlm_cancel 1,2 flags 84
Mar 22 07:11:36 lilr623c kernel: lock_dlm: lm_dlm_cancel skip 1,2 flags 84
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogB.2: jid=5: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_scratch.2: jid=4: Busy
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_P06origlogB.2: jid=4: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623c kernel: GFS: fsid=lilr623:lt_P06mirrlogA.2: jid=4: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06origlogD.0: jid=4: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06mirrlogC.0: jid=4: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06origlogC.0: jid=4: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=5: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06user1.0: jid=5: 
Acquiring the transaction lock...
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06user.0: jid=5: Done
Mar 22 07:11:36 lilr623d kernel: GFS: fsid=lilr623:lt_P06user.0: jid=4: Trying 
to acquire journal lock...
Mar 22 07:11:37 lilr623a clurgmgrd[20754]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Mar 22 07:11:37 lilr623e clurgmgrd[20331]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Mar 22 07:11:37 lilr623e clurgmgrd[20331]: <err> #50: Unable to obtain cluster 
lock: Connection timed out
Mar 22 07:11:37 lilr623e clurgmgrd[20331]: <err> #48: Unable to obtain cluster 
lock: Connection timed out
Mar 22 07:11:37 lilr623d kernel: GFS: fsid=lilr623:lt_sharedroot.2: jid=5: 
Acquiring the transaction lock...
Mar 22 07:11:37 lilr623d kernel: GFS: fsid=lilr623:lt_P06user.0: jid=4: Busy
Mar 22 07:11:37 lilr623c kernel: GFS: fsid=lilr623:lt_P06origlogD.2: jid=5: 
Acquiring the transaction lock...
Mar 22 07:11:37 lilr623a clurgmgrd[20754]: <err> #50: Unable to obtain cluster 
lock: Connection timed out
Mar 22 07:11:37 lilr623c kernel: GFS: fsid=lilr623:lt_P06data4.2: jid=4: 
Acquiring the transaction lock...
Mar 22 07:11:37 lilr623e clurgmgrd[20331]: <err> #50: Unable to obtain cluster 
lock: Connection timed out
Mar 22 07:11:37 lilr623d kernel: GFS: fsid=lilr623:lt_P06data2.0: jid=4: 
Acquiring the transaction lock...
...
Mar 22 07:11:38 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=5: Done
Mar 22 07:11:38 lilr623d kernel: GFS: fsid=lilr623:lt_P06data4.0: jid=4: 
Trying to acquire journal lock...
Mar 22 07:11:38 lilr623d kernel: GFS: fsid=lilr623:lt_P06user1.0: jid=4: Busy
Mar 22 07:11:38 lilr623d kernel: GFS: fsid=lilr623:lt_P06data1.0: jid=4: 
Replayed 0 of 0 blocks
Mar 22 07:11:39 lilr623c kernel: GFS: fsid=lilr623:lt_P06origlogC.2: jid=4: 
Trying to acquire journal lock...
Mar 22 07:11:39 lilr623a shutdown: shutting down for system reboot
Mar 22 07:11:39 lilr623a kernel: dlm: lt_products: restbl_rsb_update failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06origlogB: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06origlogC: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06mirrlogB: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06mirrlogD: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06data2: dlm_dir_rebuild_wait 
failed -1
Mar 22 07:11:39 lilr623a kernel: dlm: lt_P06data3: restbl_rsb_update failed -1
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_P06data4.1: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_P06data1.1: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:39 lilr623d clurgmgrd[20148]: <info> State change: lilr623f-ics0 
DOWN
Mar 22 07:11:39 lilr623e kernel: rh_lkid 2bd03c3
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_P06origlogD.1: jid=5: 
Trying to acquire journal lock...
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_sharedroot.0: jid=4: 
Busy
Mar 22 07:11:39 lilr623e kernel: lockstate 0
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_scratch.1: jid=5: Busy
Mar 22 07:11:39 lilr623a kernel: GFS: fsid=lilr623:lt_P06data4.1: jid=4: Busy
Mar 22 07:11:39 lilr623e kernel: rh_cmd 5
Mar 22 07:11:39 lilr623e kernel: nodeid 5
Mar 22 07:11:39 lilr623e kernel: dlm: Magma: reply from 2 no lock
Mar 22 07:11:39 lilr623e kernel: CMAN: node lilr623b-ics0 has been removed 
from the cluster : Missed too many heartbeats

On Monday 29 January 2007 19:44:46 Lon Hohberger wrote:
> On Fri, 2007-01-26 at 19:28 +0100, Marc Grimme wrote:
> > On Friday 26 January 2007 19:15, Lon Hohberger wrote:
> > > On Fri, 2007-01-26 at 09:19 +0100, Marc Grimme wrote:
> > > > Hello,
> > > > yesterday we saw a clusterfreeze (which seems to come from the
> > > > rgmanager) with RHEL4/U4 GFS installed (see logs) consisting of 6
> > > > nodes x86_64 Architecture. After fencing one node the cluster came
> > > > back to live. Any idea what could have happend?
> > >
> > > Check 'dmesg' and 'cman_tool status'.  Also look at /proc/slabinfo,
> > > specifically 'dlm_lkb' bits.  There's a chance that you hit a bug
> > > that's already fixed. :)
> >
> > dlm_lkb           189628 195177    232   17    1 : tunables  120   60   
> > 8 : slabdata  11481  11481    384
> > nodea
> > dlm_lkb           2074114 2077587    232   17    1 : tunables  120   60  
> >  8 : slabdata 122211 122211    180
> > nodeb
> > dlm_lkb           454319 499392    232   17    1 : tunables  120   60   
> > 8 : slabdata  29376  29376      0
> > nodec
> > dlm_lkb           242144 251719    232   17    1 : tunables  120   60   
> > 8 : slabdata  14807  14807    480
> > noded
> > dlm_lkb           248672 286382    232   17    1 : tunables  120   60   
> > 8 : slabdata  16846  16846    212
> > nodef
> > dlm_lkb            62934  62934    232   17    1 : tunables  120   60   
> > 8 : slabdata   3702   3702      0
>
> You've hit "the bug".
>
> > > Need above information (and possibly more) to answer this.
> >
> > What more?? ;-)
>
> Nothing; test packages here:
>
> http://people.redhat.com/lhh/rgmanager-1.9.54-2.218112hf.i386.rpm
> http://people.redhat.com/lhh/rgmanager-1.9.54-2.218112hf.x86_64.rpm
> http://people.redhat.com/lhh/rgmanager-1.9.54-2.218112hf.src.rpm
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

** Visit us at CeBIT 2007 in Hannover/Germany **
** in Hall 5, Booth G48/2  (15.-21. of March) **

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From hlawatschek at atix.de  Thu Mar 22 10:00:50 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Thu, 22 Mar 2007 11:00:50 +0100
Subject: [Linux-cluster] rgmanager is ping-ponging service
In-Reply-To: <200703191048.21688.hlawatschek@atix.de>
References: <200703161449.16182.hlawatschek@atix.de>
	<1174064666.13796.86.camel@asuka.boston.devel.redhat.com>
	<200703191048.21688.hlawatschek@atix.de>
Message-ID: <200703221100.50772.hlawatschek@atix.de>

Hi Lon,

as this issue is not solved with patch attached in bz#222484, do you have any 
ideas what is going wrong here?

Thanks,
Mark 
On Monday 19 March 2007 10:48:21 Mark Hlawatschek wrote:
> Hi Lon,
>
> thanks for the answer.
> The patch you mentioned in you mail was already applied before the test.
>
> In my oppinion, the problem occurs when more than on IP addresses are
> configured in a service and only one of them fails.
>
> >From my previous mail:
>
> [snip]
> Two IP resources configured:
> <resources>
>           <ip address="10.0.46.46" monitor_link="1"/>
>           <ip address="10.226.3.80" monitor_link="1"/>
>  </resources>
>
> and a service:
> <service autostart="1" domain="FD" name="VIP">
>           <ip ref="10.0.46.46"/>
>           <ip ref="10.226.3.80"/>
> </service>
>
> bond1 is configured
> - node1: 10.0.46.48
> - node2: 10.0.46.47
>
> bond2 is configured:
> - node1: 10.226.3.82
> - node2: 10.226.3.81
>
> Now, if I shut down bond1 on node1. the service ping-pongs on node1
> [snap]
>
> Thanks,
>
> Mark
>
> On Friday 16 March 2007 18:04:25 Lon Hohberger wrote:
> > On Fri, 2007-03-16 at 14:49 +0100, Mark Hlawatschek wrote:
> > > [30905] notice: status on ip "10.0.46.46" returned 1 (generic error)
> > > [30905] notice: Stopping service VIP
> > > <debug>  10.0.46.46 is not configured
> > > <info>   Removing IPv4 address 10.226.3.80 from bond2
> >
> > Noted here:
> >
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=222484
> >
> > Patch here:
> >
> > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/rgmanager/src/resour
> >ce s/ip.sh.diff?cvsroot=cluster&r1=text&tr1=1.22&r2=text&tr2=1.21&f=u
> >
> > Will be included in update 5 of RHCS4.
>
> --
> Gruss / Regards,
>
> ** Visit us at CeBIT 2007 in Hannover/Germany **
> ** in Hall 5, Booth G48/2  (15.-21. of March) **
>
> Dipl.-Ing. Mark Hlawatschek
> http://www.atix.de/
> http://www.open-sharedroot.org/
>
> **
> ATIX - Ges. fuer Informationstechnologie und Consulting mbH
> Einsteinstr. 10 - 85716 Unterschleissheim - Germany
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

** Visit us at CeBIT 2007 in Hannover/Germany **
** in Hall 5, Booth G48/2  (15.-21. of March) **

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany



From g.marshall at dalmany.co.uk  Thu Mar 22 11:27:40 2007
From: g.marshall at dalmany.co.uk (G.Marshall)
Date: Thu, 22 Mar 2007 11:27:40 -0000 (GMT)
Subject: [Linux-cluster] ccsd startup problem
Message-ID: <2061.88.96.235.249.1174562860.squirrel@squirrelmail.tgfslp.dalmany.co.uk>


I am having problems starting ccsd

I have followed the instructions in doc/usage.txt and also on
https://rpeterso.108.redhat.com/files/documents/98/247/STABLE.txt

I have installed a 2.6.20.3 kernel from kernel.org

I have installed openais from svn, on two machines, started them both up
with "aisexec" and can see them joining
Mar 22 10:14:32.430926 [CLM  ] CLM CONFIGURATION CHANGE
Mar 22 10:14:32.430979 [CLM  ] New Configuration:
Mar 22 10:14:32.431041 [CLM  ]  r(0) ip(192.168.2.7)
Mar 22 10:14:32.431104 [CLM  ]  r(0) ip(192.168.2.21)
Mar 22 10:14:32.431153 [CLM  ] Members Left:
Mar 22 10:14:32.431200 [CLM  ] Members Joined:
Mar 22 10:14:32.431259 [CLM  ]  r(0) ip(192.168.2.21)
Mar 22 10:14:32.431345 [SYNC ] This node is within the primary component
and will provide service.
Mar 22 10:14:32.431484 [TOTEM] entering OPERATIONAL state.
Mar 22 10:14:32.434059 [CLM  ] got nodejoin message 192.168.2.7
Mar 22 10:14:32.435451 [CLM  ] got nodejoin message 192.168.2.21

I have installed cluster from cvs and installed a cluster.conf

<?xml version="1.0"?>
<cluster name="alpha" config_version="1">

<cman>
</cman>

<clusternodes>
<clusternode name="node01.cluster">
	<fence>
		<method name="single">
			<device name="human" nodename="node01.cluster"/>
		</method>
	</fence>
</clusternode>

<clusternode name="node02.cluster">
	<fence>
		<method name="single">
			<device name="human" nodename="node02.cluster"/>
		</method>
	</fence>
</clusternode>

<clusternode name="node03.cluster">
	<fence>
		<method name="single">
			<device name="human" nodename="node03.cluster"/>
		</method>
	</fence>
</clusternode>
</clusternodes>

<fencedevices>
	<fencedevice name="human" agent="fence_manual"/>
</fencedevices>

</cluster>

I get the following error
/etc/cluster# ccsd -4In
Starting ccsd DEVEL.1174489543:
 Built: Mar 21 2007 21:48:09
 Copyright (C) Red Hat, Inc.  2004  All rights reserved.
  IP Protocol:: IPv4 only
  Communication:: Local sockets disabled
  No Daemon:: SET

Unable to connect to cluster infrastructure after 30 seconds.
Unable to connect to cluster infrastructure after 60 seconds.
Unable to connect to cluster infrastructure after 90 seconds.

Does anyone have any suggestions please?

Thank you,

Spencer



From markryde at gmail.com  Thu Mar 22 13:58:56 2007
From: markryde at gmail.com (Mark Ryden)
Date: Thu, 22 Mar 2007 15:58:56 +0200
Subject: [Linux-cluster] Xen Cluster with DRBD and GNBD (OCFS2 or GFS) on
	Fedora.
Message-ID: <dac45060703220658q79bd0252m1f1a77b38e8b5d52@mail.gmail.com>

Hello,

I try to set a Xen Cluster with DRBD and GNBD (with OCFS2 or GFS filesystems)
on Fedora.
I tried to find some info on it; I found two detailed howtos:

The first is by Patrick Caulfield, and from more than a year and a half
ago (June 2005).
http://people.redhat.com/pcaulfie/docs/xencluster.html
It is also short and does not talk about filesystems, GNBD, heartbeat,
and more.

The second is :
http://xenamo.sourceforge.net/
It is good one , but it talks about Debian and does not deal with the specifics
for Fedora.

Does anyone know about resources(howtos, etc) for such a task or give some
info/tips/share his knowledge regarding this issue ?

Regards,
Mark



From jbrassow at redhat.com  Thu Mar 22 18:20:33 2007
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Thu, 22 Mar 2007 13:20:33 -0500
Subject: [Linux-cluster] ccsd startup problem
In-Reply-To: <2061.88.96.235.249.1174562860.squirrel@squirrelmail.tgfslp.dalmany.co.uk>
References: <2061.88.96.235.249.1174562860.squirrel@squirrelmail.tgfslp.dalmany.co.uk>
Message-ID: <d8067ac1953f0fb32fd49d540c43e64d@redhat.com>


On Mar 22, 2007, at 6:27 AM, G.Marshall wrote:

> /etc/cluster# ccsd -4In
> Starting ccsd DEVEL.1174489543:
>  Built: Mar 21 2007 21:48:09
>  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
>   IP Protocol:: IPv4 only
>   Communication:: Local sockets disabled
>   No Daemon:: SET
>
> Unable to connect to cluster infrastructure after 30 seconds.
> Unable to connect to cluster infrastructure after 60 seconds.
> Unable to connect to cluster infrastructure after 90 seconds.
>
> Does anyone have any suggestions please?
>

Looks like ccsd has started just fine.  It is simply waiting for you to 
bring up the cluster infrastructure [.i.e do the next steps].

  brassow



From lhh at redhat.com  Thu Mar 22 22:03:41 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 22 Mar 2007 18:03:41 -0400
Subject: [Linux-cluster] rgmanager is ping-ponging service
In-Reply-To: <200703221100.50772.hlawatschek@atix.de>
References: <200703161449.16182.hlawatschek@atix.de>
	<1174064666.13796.86.camel@asuka.boston.devel.redhat.com>
	<200703191048.21688.hlawatschek@atix.de>
	<200703221100.50772.hlawatschek@atix.de>
Message-ID: <1174601021.5158.55.camel@asuka.boston.devel.redhat.com>

On Thu, 2007-03-22 at 11:00 +0100, Mark Hlawatschek wrote:
> Hi Lon,
> 
> as this issue is not solved with patch attached in bz#222484, do you have any 
> ideas what is going wrong here?

This is definitely something that should end up in bugzilla, if it isn't
already (it might be; I didn't check yet).

It looks like what's going on is that we bring up the IP, but when we do
a "ping-self" test, it's failing (even though somehow the link-check is
succeeding?).  We can, I suppose, do this test after bringing up the IP
address to prevent the "start" from succeeding (e.g. attached patch
should do this).

However, I'm not sure it will help.  In your logs, it says that
10.226.3.80 started successfully - however, it didn't even *mention*
10.0.46.46 in the start phase for whatever reason (but it does in the
status phase).

* Could you tell me your netmasks of your 10.x.x.x addresses on bond0
and bond1?

-- Lon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ip.sh-bond.patch
Type: text/x-patch
Size: 576 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070322/b7760b31/attachment.bin>

From wferi at niif.hu  Fri Mar 23 16:42:49 2007
From: wferi at niif.hu (Ferenc Wagner)
Date: Fri, 23 Mar 2007 17:42:49 +0100
Subject: [Linux-cluster] Slowness above 500 RRDs
Message-ID: <87648r6hdi.fsf@tac.ki.iif.hu>

Hi,

I'm trying out GFS.  It would be a single writer / multiple readers
scenario, where one node continuously updates thousands of RRD files,
while the others only read them.  It the writer fails, a reades
assumes its role of updating the files.  I'm doing benchmarks now.

There's a good bunch of RRDs in a directory.  A script scans them for
their last modification times, and then updates each in turn for a
couple of times.  The number of files scanned and the length of the
update rounds are printed.  The results are much different for 500 and
501 files:

filecount=501
  iteration=0 elapsed time=10.425568 s
  iteration=1 elapsed time= 9.766178 s
  iteration=2 elapsed time=20.14514 s
  iteration=3 elapsed time= 2.991397 s
  iteration=4 elapsed time=20.496422 s
total elapsed time=63.824705 s

filecount=500
  iteration=0 elapsed time=6.560811 s
  iteration=1 elapsed time=0.229375 s
  iteration=2 elapsed time=0.202973 s
  iteration=3 elapsed time=0.203439 s
  iteration=4 elapsed time=0.203095 s
total elapsed time=7.399693 s

The files fit in the buffer cache conveniently, each being ~ 50 kB.
I'm using cluster version 1.03, Linux 2.6.18 with 1 GB of memory.  The
GFS is 40 GB, one other node keeps it mounted, but without any
activity.  The test node exercises librrd2 exclusively (besides the
usual daemons, nothing special).

The library issues an fcntl F_WRLCK before updating a file, and
according to strace, this takes much time.  Changing the library to
use flock() instead gives marginal speedup.  Removing the locking
altogether makes all the difference, bringing GFS to about half the
speed of XFS in the same setting (which runs an iteration in 0.03
seconds both for 500 and 501 files).

So, I'd like to be able to work with around 10000 files with good
performance.  Is there anything I could tune?  Removing the locking
altogether doesn't sound like the best idea.

Mounting with noatime or zeroing /proc/cluster/lock_dlm/drop_count
before mount didn't help at all.

I'd be grateful for any advice, and hope all relevant information is
here.
-- 
Thanks,
Feri.



From Nick.Couchman at seakr.com  Fri Mar 23 17:57:18 2007
From: Nick.Couchman at seakr.com (Nick Couchman)
Date: Fri, 23 Mar 2007 11:57:18 -0600
Subject: [Linux-cluster] mkfs.gfs2 issue...
Message-ID: <4603C09E.87A6.0099.1@seakr.com>

I'm currently evaluating GFS2 for some clustering that I want to do, and I've run into a little problem.  I'm using kernel 2.6.20.3 (with GFS2 included) and the GFS2 userspace stuff from the RH Cluster page (sourceware.org/cluster).  I've tried both the "official" 2.0.0 release and the latest CVS version, and both exhibit the same behavior: when I try to make a GFS2 filesystem, mkfs.gfs2 just hangs.  I'm doing this on a 1 GB iSCSI volume, and the host has already transfered 6.9GB of data.  What in the world is it doing?!  If I enable debug output, I get the following: 

Command Line Arguments: 
  qcsize = 1 
  jsize = 32 
  journals = 2 
  override = 0 
  proto = lock_dlm 
  quiet = 0 
  rgsize = optimize for best performance 
  table = fstest:testfs 
  utsize = 1 
  device = /dev/sdb1 
This will destroy any data on /dev/sdb1. 
  It appears to contain a ext3 filesystem. 

Are you sure you want to proceed? [y/n] y 


Partition size = 1955808 

Device Geometry:  (in basic blocks) 
  SubDevice #0: start = 0, length = 1955808, rgf_flags = 0x00000000 

Device Geometry:  (in FS blocks) 
  SubDevice #0: start = 0, length = 244476, rgf_flags = 0x00000000 

Device Size: 244476 

Data Subdevice 0 
  rg sz = 256 
  nrgrp = 4 

subdevice 0:  rg_o = 17, rg_l = 61117 
subdevice 0:  rg_o = 61134, rg_l = 61114 
subdevice 0:  rg_o = 122248, rg_l = 61114 
subdevice 0:  rg_o = 183362, rg_l = 61114 

  ri_addr:: 17  ri_length:: 4  ri_data0:: 21  ri_data:: 61112  ri_bitbytes:: 15278 
  ri_addr:: 61134  ri_length:: 4  ri_data0:: 61138  ri_data:: 61108  ri_bitbytes:: 15277 
  ri_addr:: 122248  ri_length:: 4  ri_data0:: 122252  ri_data:: 61108  ri_bitbytes:: 15277 
  ri_addr:: 183362  ri_length:: 4  ri_data0:: 183366  ri_data:: 61108  ri_bitbytes:: 15277 
Root directory: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 1  no_addr:: 21  di_mode:: 040755  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 21  di_goal_data:: 21  di_flags:: 0x00000001  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 2  di_eattr:: 0 
Master dir: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 2  no_addr:: 22  di_mode:: 040755  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 22  di_goal_data:: 22  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 2  di_eattr:: 0 
Super Block: 
  mh_magic:: 0x01161970  mh_type:: 1  mh_format:: 100  sb_fs_format:: 1801  sb_multihost_format:: 1900  sb_bsize:: 4096  sb_bsize_shift:: 12  no_formal_ino:: 2  no_addr:: 22  no_formal_ino:: 1  no_addr:: 21  sb_lockproto:: lock_dlm  sb_locktable:: fstest:testfs 
Journal 0: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 4  no_addr:: 24  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 33554432  di_blocks:: 8210  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 41  di_goal_data:: 8233  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 2  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
Journal 1: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 5  no_addr:: 8234  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 33554432  di_blocks:: 8210  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 8251  di_goal_data:: 16443  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 2  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
Jindex: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 3  no_addr:: 23  di_mode:: 040700  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 23  di_goal_data:: 23  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 4  di_eattr:: 0 
Inum Range 0: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 7  no_addr:: 16445  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 16  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16445  di_goal_data:: 16445  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
StatFS Change 0: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 8  no_addr:: 16446  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 24  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16446  di_goal_data:: 16446  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
Quota Change 0: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 9  no_addr:: 16447  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 1048576  di_blocks:: 257  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16447  di_goal_data:: 16703  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 1  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
Inum Range 1: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 10  no_addr:: 16704  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 16  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16704  di_goal_data:: 16704  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
StatFS Change 1: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 11  no_addr:: 16705  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 24  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16705  di_goal_data:: 16705  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
Quota Change 1: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 12  no_addr:: 16706  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 1048576  di_blocks:: 257  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16706  di_goal_data:: 16962  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 1  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
per_node: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 6  no_addr:: 16444  di_mode:: 040700  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16444  di_goal_data:: 16444  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 8  di_eattr:: 0 
Inum Inode: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 13  no_addr:: 16963  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 0  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16963  di_goal_data:: 16963  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
StatFS Inode: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 14  no_addr:: 16964  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 0  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16964  di_goal_data:: 16964  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
Resource Index: 
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 15  no_addr:: 16965  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 384  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16965  di_goal_data:: 16965  di_flags:: 0x00000201  di_payload_format:: 1100  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
Root quota: 
  qu_limit:: 0  qu_warn:: 0  qu_value:: 1 
Next Inum: 17 

Statfs: 

...and that's where it stops.  To set things up, I compiled the sources, then did the following (as per the usage instructions): 
1) Wrote configuration file - very simple, two host configuration. 
1) Load modules gfs2, dlm, lock_dlm, and no_lock 
2) Mount configfs in /sys/kernel/config 
3) ccsd 
4) cman_tool join 
5) groupd 
6) fenced 
7) fence_tool join 
8) dlm_controld 
9) gfs_controld 
10) I don't have clvmd, so I didn't start that, but the usage.txt file says it's optional. 
11) mkfs.gfs2 -D -p lock_dlm -t fstest:testfs -j 2 /dev/sdb1 

and it just hangs.  There are no funny messages in either dmesg or /var/log/messages - just normal cluster operation (nodes being added when I start up things on both systems, etc.  Can anyone shed any light on what might be happening here - why it's hanging up and transmitting so much data on formatting such a small volume? 

Thanks, 
Nick 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070323/4cbd11a9/attachment.htm>

From rpeterso at redhat.com  Fri Mar 23 18:11:18 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Fri, 23 Mar 2007 13:11:18 -0500
Subject: [Linux-cluster] mkfs.gfs2 issue...
In-Reply-To: <4603C09E.87A6.0099.1@seakr.com>
References: <4603C09E.87A6.0099.1@seakr.com>
Message-ID: <46041846.3030802@redhat.com>

Nick Couchman wrote:
> I'm currently evaluating GFS2 for some clustering that I want to do, and I've run into a little problem.  I'm using kernel 2.6.20.3 (with GFS2 included) and the GFS2 userspace stuff from the RH Cluster page (sourceware.org/cluster).  I've tried both the "official" 2.0.0 release and the latest CVS version, and both exhibit the same behavior: when I try to make a GFS2 filesystem, mkfs.gfs2 just hangs.  I'm doing this on a 1 GB iSCSI volume, and the host has already transfered 6.9GB of data.  What in the world is it doing?!  If I enable debug output, I get the following: 
> 
> Command Line Arguments: 
>   qcsize = 1 
>   jsize = 32 
>   journals = 2 
>   override = 0 
>   proto = lock_dlm 
>   quiet = 0 
>   rgsize = optimize for best performance 
>   table = fstest:testfs 
>   utsize = 1 
>   device = /dev/sdb1 
> This will destroy any data on /dev/sdb1. 
>   It appears to contain a ext3 filesystem. 
> 
> Are you sure you want to proceed? [y/n] y 
> 
> 
> Partition size = 1955808 
> 
> Device Geometry:  (in basic blocks) 
>   SubDevice #0: start = 0, length = 1955808, rgf_flags = 0x00000000 
> 
> Device Geometry:  (in FS blocks) 
>   SubDevice #0: start = 0, length = 244476, rgf_flags = 0x00000000 
> 
> Device Size: 244476 
> 
> Data Subdevice 0 
>   rg sz = 256 
>   nrgrp = 4 
> 
> subdevice 0:  rg_o = 17, rg_l = 61117 
> subdevice 0:  rg_o = 61134, rg_l = 61114 
> subdevice 0:  rg_o = 122248, rg_l = 61114 
> subdevice 0:  rg_o = 183362, rg_l = 61114 
> 
>   ri_addr:: 17  ri_length:: 4  ri_data0:: 21  ri_data:: 61112  ri_bitbytes:: 15278 
>   ri_addr:: 61134  ri_length:: 4  ri_data0:: 61138  ri_data:: 61108  ri_bitbytes:: 15277 
>   ri_addr:: 122248  ri_length:: 4  ri_data0:: 122252  ri_data:: 61108  ri_bitbytes:: 15277 
>   ri_addr:: 183362  ri_length:: 4  ri_data0:: 183366  ri_data:: 61108  ri_bitbytes:: 15277 
> Root directory: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 1  no_addr:: 21  di_mode:: 040755  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 21  di_goal_data:: 21  di_flags:: 0x00000001  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 2  di_eattr:: 0 
> Master dir: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 2  no_addr:: 22  di_mode:: 040755  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 22  di_goal_data:: 22  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 2  di_eattr:: 0 
> Super Block: 
>   mh_magic:: 0x01161970  mh_type:: 1  mh_format:: 100  sb_fs_format:: 1801  sb_multihost_format:: 1900  sb_bsize:: 4096  sb_bsize_shift:: 12  no_formal_ino:: 2  no_addr:: 22  no_formal_ino:: 1  no_addr:: 21  sb_lockproto:: lock_dlm  sb_locktable:: fstest:testfs 
> Journal 0: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 4  no_addr:: 24  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 33554432  di_blocks:: 8210  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 41  di_goal_data:: 8233  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 2  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> Journal 1: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 5  no_addr:: 8234  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 33554432  di_blocks:: 8210  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 8251  di_goal_data:: 16443  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 2  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> Jindex: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 3  no_addr:: 23  di_mode:: 040700  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 23  di_goal_data:: 23  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 4  di_eattr:: 0 
> Inum Range 0: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 7  no_addr:: 16445  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 16  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16445  di_goal_data:: 16445  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> StatFS Change 0: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 8  no_addr:: 16446  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 24  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16446  di_goal_data:: 16446  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> Quota Change 0: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 9  no_addr:: 16447  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 1048576  di_blocks:: 257  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16447  di_goal_data:: 16703  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 1  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> Inum Range 1: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 10  no_addr:: 16704  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 16  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16704  di_goal_data:: 16704  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> StatFS Change 1: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 11  no_addr:: 16705  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 24  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16705  di_goal_data:: 16705  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> Quota Change 1: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 12  no_addr:: 16706  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 1048576  di_blocks:: 257  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16706  di_goal_data:: 16962  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 1  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> per_node: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 6  no_addr:: 16444  di_mode:: 040700  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16444  di_goal_data:: 16444  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 8  di_eattr:: 0 
> Inum Inode: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 13  no_addr:: 16963  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 0  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16963  di_goal_data:: 16963  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> StatFS Inode: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 14  no_addr:: 16964  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 0  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16964  di_goal_data:: 16964  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> Resource Index: 
>   mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 15  no_addr:: 16965  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 384  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16965  di_goal_data:: 16965  di_flags:: 0x00000201  di_payload_format:: 1100  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0 
> Root quota: 
>   qu_limit:: 0  qu_warn:: 0  qu_value:: 1 
> Next Inum: 17 
> 
> Statfs: 
> 
> ...and that's where it stops.  To set things up, I compiled the sources, then did the following (as per the usage instructions): 
> 1) Wrote configuration file - very simple, two host configuration. 
> 1) Load modules gfs2, dlm, lock_dlm, and no_lock 
> 2) Mount configfs in /sys/kernel/config 
> 3) ccsd 
> 4) cman_tool join 
> 5) groupd 
> 6) fenced 
> 7) fence_tool join 
> 8) dlm_controld 
> 9) gfs_controld 
> 10) I don't have clvmd, so I didn't start that, but the usage.txt file says it's optional. 
> 11) mkfs.gfs2 -D -p lock_dlm -t fstest:testfs -j 2 /dev/sdb1 
> 
> and it just hangs.  There are no funny messages in either dmesg or /var/log/messages - just normal cluster operation (nodes being added when I start up things on both systems, etc.  Can anyone shed any light on what might be happening here - why it's hanging up and transmitting so much data on formatting such a small volume? 
> 
> Thanks, 
> Nick 
Hi Nick,

Since you're building it from source anyway, I recommend recompiling mkfs.gfs2
with gdb debugging enabled.  To do that, change the Makefile so that 
CFLAGS has -g instead of -O2.  Then make; make install.
Then do the command again, and when it hangs, go into gdb and see where
it's hung.  In other words, from another terminal session:

cd /mkfs/source/directory/  e.g. /home/devel/cluster/gfs2/mkfs/
ps ax | grep mkfs.gfs2 (to get the pid)
gdb ./mkfs.gfs2 <pid>
then do a "bt" to get a call stack.
Post the results here or email them directly to me.

The bt output should hopefully tell me what's going on.
If mkfs.gfs2 is broken, open up bugzilla against me and I'll fix it.

I've used mkfs.gfs2 many times and never had it hang.

Regards,

Bob Peterson
Red Hat Cluster Suite



From Nick.Couchman at seakr.com  Fri Mar 23 18:43:58 2007
From: Nick.Couchman at seakr.com (Nick Couchman)
Date: Fri, 23 Mar 2007 12:43:58 -0600
Subject: [Linux-cluster] Re: mkfs.gfs2 issue...
Message-ID: <4603CB8E.87A6.0099.0@seakr.com>

>>> Nick Couchman 03/23/07 11:57 AM >>>
I'm currently evaluating GFS2 for some clustering that I want to do, and I've run into a little problem.  I'm using kernel 2.6.20.3 (with GFS2 included) and the GFS2 userspace stuff from the RH Cluster page (sourceware.org/cluster).  I've tried both the "official" 2.0.0 release and the latest CVS version, and both exhibit the same behavior: when I try to make a GFS2 filesystem, mkfs.gfs2 just hangs.  I'm doing this on a 1 GB iSCSI volume, and the host has already transfered 6.9GB of data.  What in the world is it doing?!  If I enable debug output, I get the following:

Command Line Arguments:
  qcsize = 1
  jsize = 32
  journals = 2
  override = 0
  proto = lock_dlm
  quiet = 0
  rgsize = optimize for best performance
  table = fstest:testfs
  utsize = 1
  device = /dev/sdb1
This will destroy any data on /dev/sdb1.
  It appears to contain a ext3 filesystem.

Are you sure you want to proceed? [y/n] y


Partition size = 1955808

Device Geometry:  (in basic blocks)
  SubDevice #0: start = 0, length = 1955808, rgf_flags = 0x00000000

Device Geometry:  (in FS blocks)
  SubDevice #0: start = 0, length = 244476, rgf_flags = 0x00000000

Device Size: 244476

Data Subdevice 0
  rg sz = 256
  nrgrp = 4

subdevice 0:  rg_o = 17, rg_l = 61117
subdevice 0:  rg_o = 61134, rg_l = 61114
subdevice 0:  rg_o = 122248, rg_l = 61114
subdevice 0:  rg_o = 183362, rg_l = 61114

  ri_addr:: 17  ri_length:: 4  ri_data0:: 21  ri_data:: 61112  ri_bitbytes:: 15278
  ri_addr:: 61134  ri_length:: 4  ri_data0:: 61138  ri_data:: 61108  ri_bitbytes:: 15277
  ri_addr:: 122248  ri_length:: 4  ri_data0:: 122252  ri_data:: 61108  ri_bitbytes:: 15277
  ri_addr:: 183362  ri_length:: 4  ri_data0:: 183366  ri_data:: 61108  ri_bitbytes:: 15277
Root directory:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 1  no_addr:: 21  di_mode:: 040755  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 21  di_goal_data:: 21  di_flags:: 0x00000001  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 2  di_eattr:: 0
Master dir:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 2  no_addr:: 22  di_mode:: 040755  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 22  di_goal_data:: 22  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 2  di_eattr:: 0
Super Block:
  mh_magic:: 0x01161970  mh_type:: 1  mh_format:: 100  sb_fs_format:: 1801  sb_multihost_format:: 1900  sb_bsize:: 4096  sb_bsize_shift:: 12  no_formal_ino:: 2  no_addr:: 22  no_formal_ino:: 1  no_addr:: 21  sb_lockproto:: lock_dlm  sb_locktable:: fstest:testfs
Journal 0:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 4  no_addr:: 24  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 33554432  di_blocks:: 8210  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 41  di_goal_data:: 8233  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 2  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Journal 1:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 5  no_addr:: 8234  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 33554432  di_blocks:: 8210  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 8251  di_goal_data:: 16443  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 2  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Jindex:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 3  no_addr:: 23  di_mode:: 040700  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 23  di_goal_data:: 23  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 4  di_eattr:: 0
Inum Range 0:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 7  no_addr:: 16445  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 16  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16445  di_goal_data:: 16445  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
StatFS Change 0:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 8  no_addr:: 16446  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 24  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16446  di_goal_data:: 16446  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Quota Change 0:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 9  no_addr:: 16447  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 1048576  di_blocks:: 257  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16447  di_goal_data:: 16703  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 1  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Inum Range 1:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 10  no_addr:: 16704  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 16  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16704  di_goal_data:: 16704  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
StatFS Change 1:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 11  no_addr:: 16705  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 24  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16705  di_goal_data:: 16705  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Quota Change 1:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 12  no_addr:: 16706  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 1048576  di_blocks:: 257  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16706  di_goal_data:: 16962  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 1  di_depth:: 0  di_entries:: 0  di_eattr:: 0
per_node:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 6  no_addr:: 16444  di_mode:: 040700  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16444  di_goal_data:: 16444  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 8  di_eattr:: 0
Inum Inode:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 13  no_addr:: 16963  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 0  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16963  di_goal_data:: 16963  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
StatFS Inode:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 14  no_addr:: 16964  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 0  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16964  di_goal_data:: 16964  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Resource Index:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 15  no_addr:: 16965  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 384  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16965  di_goal_data:: 16965  di_flags:: 0x00000201  di_payload_format:: 1100  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Root quota:
  qu_limit:: 0  qu_warn:: 0  qu_value:: 1
Next Inum: 17

Statfs:

...and that's where it stops.  To set things up, I compiled the sources, then did the following (as per the usage instructions):
1) Wrote configuration file - very simple, two host configuration.
1) Load modules gfs2, dlm, lock_dlm, and no_lock
2) Mount configfs in /sys/kernel/config
3) ccsd
4) cman_tool join
5) groupd
6) fenced
7) fence_tool join
8) dlm_controld
9) gfs_controld
10) I don't have clvmd, so I didn't start that, but the usage.txt file says it's optional.
11) mkfs.gfs2 -D -p lock_dlm -t fstest:testfs -j 2 /dev/sdb1

and it just hangs.  There are no funny messages in either dmesg or /var/log/messages - just normal cluster operation (nodes being added when I start up things on both systems, etc.  Can anyone shed any light on what might be happening here - why it's hanging up and transmitting so much data on formatting such a small volume?

Thanks,
Nick

>Hi Nick,

>Since you're building it from source anyway, I recommend recompiling mkfs.gfs2

>with gdb debugging enabled. To do that, change the Makefile so that CFLAGS has -g instead of -O2. Then make; make >install.

>Then do the command again, and when it hangs, go into gdb and see where
>it's hung.  In other words, from another terminal session:

>cd /mkfs/source/directory/  e.g. /home/devel/cluster/gfs2/mkfs/
>ps ax | grep mkfs.gfs2 (to get the pid)
>gdb ./mkfs.gfs2 <pid>
>then do a "bt" to get a call stack.
>Post the results here or email them directly to me.

>The bt output should hopefully tell me what's going on.
>If mkfs.gfs2 is broken, open up bugzilla against me and I'll fix it.

>I've used mkfs.gfs2 many times and never had it hang.

>Regards,

>Bob Peterson
>Red Hat Cluster Suite

Bob,
Thanks for the quick follow-up.  I gave this a shot, but here's the problem: the program seems to be hung on I/O, so it doesn't respond to any signals.  When I try to run gdb against the currently running PID, gdb hangs after the following line:
Attaching to program: mkfs.gfs2, process 3349

and does nothing (sometimes it doesn't even get that far).  I can't use <Ctrl-C> to kill the mkfs process, nor does it respond to a kill with a signal 9.  Also, during this mkfs.gfs2 process, the processor is being used quite heavily (80-100%) by the kernel [scsi_wq_1].  This tends to hang up things like login shells pretty badly, and the only way to get it back is to reboot the machine (in case it matters, this is inside a VMware virtual machine).  I'm going to give a couple of other things a try (like changing the elevator so that it's nicer to other processes, hopefully) and see if I can get gdb to work, but so far no such luck.

Thanks!
Nick 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070323/538e06d2/attachment.htm>

From Nick.Couchman at seakr.com  Fri Mar 23 19:10:38 2007
From: Nick.Couchman at seakr.com (Nick Couchman)
Date: Fri, 23 Mar 2007 13:10:38 -0600
Subject: [Linux-cluster] Re: mkfs.gfs2 issue...
In-Reply-To: <4603CB8E.87A6.0099.0@seakr.com>
References: <4603CB8E.87A6.0099.0@seakr.com>
Message-ID: <4603D1CD.87A6.0099.1@seakr.com>

Nevermind - figured it out.  Sorry to bother everyone.  I realized that I hadn't tried any other filesystem on the kernel I was running and discovered the same behavior from mkfs.ext2.  I went out and grabbed the latest version of open-iscsi and, what do you know, it works (both ext2 and gfs2). 

Again, sorry for not debugging properly... 

--Nick

>>> On Fri, Mar 23, 2007 at 12:43 PM, "Nick Couchman" <Nick.Couchman at seakr.com> wrote:

>>> Nick Couchman 03/23/07 11:57 AM >>>
I'm currently evaluating GFS2 for some clustering that I want to do, and I've run into a little problem.  I'm using kernel 2.6.20.3 (with GFS2 included) and the GFS2 userspace stuff from the RH Cluster page (sourceware.org/cluster).  I've tried both the "official" 2.0.0 release and the latest CVS version, and both exhibit the same behavior: when I try to make a GFS2 filesystem, mkfs.gfs2 just hangs.  I'm doing this on a 1 GB iSCSI volume, and the host has already transfered 6.9GB of data.  What in the world is it doing?!  If I enable debug output, I get the following:

Command Line Arguments:
  qcsize = 1
  jsize = 32
  journals = 2
  override = 0
  proto = lock_dlm
  quiet = 0
  rgsize = optimize for best performance
  table = fstest:testfs
  utsize = 1
  device = /dev/sdb1
This will destroy any data on /dev/sdb1.
  It appears to contain a ext3 filesystem.

Are you sure you want to proceed? [y/n] y


Partition size = 1955808

Device Geometry:  (in basic blocks)
  SubDevice #0: start = 0, length = 1955808, rgf_flags = 0x00000000

Device Geometry:  (in FS blocks)
  SubDevice #0: start = 0, length = 244476, rgf_flags = 0x00000000

Device Size: 244476

Data Subdevice 0
  rg sz = 256
  nrgrp = 4

subdevice 0:  rg_o = 17, rg_l = 61117
subdevice 0:  rg_o = 61134, rg_l = 61114
subdevice 0:  rg_o = 122248, rg_l = 61114
subdevice 0:  rg_o = 183362, rg_l = 61114

  ri_addr:: 17  ri_length:: 4  ri_data0:: 21  ri_data:: 61112  ri_bitbytes:: 15278
  ri_addr:: 61134  ri_length:: 4  ri_data0:: 61138  ri_data:: 61108  ri_bitbytes:: 15277
  ri_addr:: 122248  ri_length:: 4  ri_data0:: 122252  ri_data:: 61108  ri_bitbytes:: 15277
  ri_addr:: 183362  ri_length:: 4  ri_data0:: 183366  ri_data:: 61108  ri_bitbytes:: 15277
Root directory:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 1  no_addr:: 21  di_mode:: 040755  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 21  di_goal_data:: 21  di_flags:: 0x00000001  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 2  di_eattr:: 0
Master dir:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 2  no_addr:: 22  di_mode:: 040755  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 22  di_goal_data:: 22  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 2  di_eattr:: 0
Super Block:
  mh_magic:: 0x01161970  mh_type:: 1  mh_format:: 100  sb_fs_format:: 1801  sb_multihost_format:: 1900  sb_bsize:: 4096  sb_bsize_shift:: 12  no_formal_ino:: 2  no_addr:: 22  no_formal_ino:: 1  no_addr:: 21  sb_lockproto:: lock_dlm  sb_locktable:: fstest:testfs
Journal 0:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 4  no_addr:: 24  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 33554432  di_blocks:: 8210  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 41  di_goal_data:: 8233  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 2  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Journal 1:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 5  no_addr:: 8234  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 33554432  di_blocks:: 8210  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 8251  di_goal_data:: 16443  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 2  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Jindex:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 3  no_addr:: 23  di_mode:: 040700  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 23  di_goal_data:: 23  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 4  di_eattr:: 0
Inum Range 0:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 7  no_addr:: 16445  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 16  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16445  di_goal_data:: 16445  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
StatFS Change 0:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 8  no_addr:: 16446  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 24  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16446  di_goal_data:: 16446  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Quota Change 0:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 9  no_addr:: 16447  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 1048576  di_blocks:: 257  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16447  di_goal_data:: 16703  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 1  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Inum Range 1:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 10  no_addr:: 16704  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 16  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16704  di_goal_data:: 16704  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
StatFS Change 1:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 11  no_addr:: 16705  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 24  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16705  di_goal_data:: 16705  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Quota Change 1:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 12  no_addr:: 16706  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 1048576  di_blocks:: 257  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16706  di_goal_data:: 16962  di_flags:: 0x00000200  di_payload_format:: 0  di_height:: 1  di_depth:: 0  di_entries:: 0  di_eattr:: 0
per_node:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 6  no_addr:: 16444  di_mode:: 040700  di_uid:: 0  di_gid:: 0  di_nlink:: 2  di_size:: 3864  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16444  di_goal_data:: 16444  di_flags:: 0x00000201  di_payload_format:: 1200  di_height:: 0  di_depth:: 0  di_entries:: 8  di_eattr:: 0
Inum Inode:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 13  no_addr:: 16963  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 0  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16963  di_goal_data:: 16963  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
StatFS Inode:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 14  no_addr:: 16964  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 0  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16964  di_goal_data:: 16964  di_flags:: 0x00000201  di_payload_format:: 0  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Resource Index:
  mh_magic:: 0x01161970  mh_type:: 4  mh_format:: 400  no_formal_ino:: 15  no_addr:: 16965  di_mode:: 0100600  di_uid:: 0  di_gid:: 0  di_nlink:: 1  di_size:: 384  di_blocks:: 1  di_atime:: 1174670528  di_mtime:: 1174670528  di_ctime:: 1174670528  di_major:: 0  di_minor:: 0  di_goal_meta:: 16965  di_goal_data:: 16965  di_flags:: 0x00000201  di_payload_format:: 1100  di_height:: 0  di_depth:: 0  di_entries:: 0  di_eattr:: 0
Root quota:
  qu_limit:: 0  qu_warn:: 0  qu_value:: 1
Next Inum: 17

Statfs:

...and that's where it stops.  To set things up, I compiled the sources, then did the following (as per the usage instructions):
1) Wrote configuration file - very simple, two host configuration.
1) Load modules gfs2, dlm, lock_dlm, and no_lock
2) Mount configfs in /sys/kernel/config
3) ccsd
4) cman_tool join
5) groupd
6) fenced
7) fence_tool join
8) dlm_controld
9) gfs_controld
10) I don't have clvmd, so I didn't start that, but the usage.txt file says it's optional.
11) mkfs.gfs2 -D -p lock_dlm -t fstest:testfs -j 2 /dev/sdb1

and it just hangs.  There are no funny messages in either dmesg or /var/log/messages - just normal cluster operation (nodes being added when I start up things on both systems, etc.  Can anyone shed any light on what might be happening here - why it's hanging up and transmitting so much data on formatting such a small volume?

Thanks,
Nick

>Hi Nick,

>Since you're building it from source anyway, I recommend recompiling mkfs.gfs2

>with gdb debugging enabled. To do that, change the Makefile so that CFLAGS has -g instead of -O2. Then make; make >install.

>Then do the command again, and when it hangs, go into gdb and see where
>it's hung.  In other words, from another terminal session:

>cd /mkfs/source/directory/  e.g. /home/devel/cluster/gfs2/mkfs/
>ps ax | grep mkfs.gfs2 (to get the pid)
>gdb ./mkfs.gfs2 <pid>
>then do a "bt" to get a call stack.
>Post the results here or email them directly to me.

>The bt output should hopefully tell me what's going on.
>If mkfs.gfs2 is broken, open up bugzilla against me and I'll fix it.

>I've used mkfs.gfs2 many times and never had it hang.

>Regards,

>Bob Peterson
>Red Hat Cluster Suite

Bob,
Thanks for the quick follow-up.  I gave this a shot, but here's the problem: the program seems to be hung on I/O, so it doesn't respond to any signals.  When I try to run gdb against the currently running PID, gdb hangs after the following line:
Attaching to program: mkfs.gfs2, process 3349

and does nothing (sometimes it doesn't even get that far).  I can't use <Ctrl-C> to kill the mkfs process, nor does it respond to a kill with a signal 9.  Also, during this mkfs.gfs2 process, the processor is being used quite heavily (80-100%) by the kernel [scsi_wq_1].  This tends to hang up things like login shells pretty badly, and the only way to get it back is to reboot the machine (in case it matters, this is inside a VMware virtual machine).  I'm going to give a couple of other things a try (like changing the elevator so that it's nicer to other processes, hopefully) and see if I can get gdb to work, but so far no such luck.

Thanks!
Nick 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070323/c751f6dc/attachment.htm>

From rpeterso at redhat.com  Fri Mar 23 19:06:40 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Fri, 23 Mar 2007 14:06:40 -0500
Subject: [Linux-cluster] Re: mkfs.gfs2 issue...
In-Reply-To: <4603CB8E.87A6.0099.0@seakr.com>
References: <4603CB8E.87A6.0099.0@seakr.com>
Message-ID: <46042540.3000502@redhat.com>

Nick Couchman wrote:
 > Bob,
> Thanks for the quick follow-up.  I gave this a shot, but here's the problem: the program seems to be hung on I/O, so it doesn't respond to any signals.  When I try to run gdb against the currently running PID, gdb hangs after the following line:
> Attaching to program: mkfs.gfs2, process 3349
> 
> and does nothing (sometimes it doesn't even get that far).  I can't use <Ctrl-C> to kill the mkfs process, nor does it respond to a kill with a signal 9.  Also, during this mkfs.gfs2 process, the processor is being used quite heavily (80-100%) by the kernel [scsi_wq_1].  This tends to hang up things like login shells pretty badly, and the only way to get it back is to reboot the machine (in case it matters, this is inside a VMware virtual machine).  I'm going to give a couple of other things a try (like changing the elevator so that it's nicer to other processes, hopefully) and see if I can get gdb to work, but so far no such luck.
> 
> Thanks!
> Nick 
Hi Nick,

Hm.  Right after it says "Statfs:" it's done building all the data
structures in buffers and it goes to write the data out to disk.
So it sounds like the low-level disk IO is somehow broken.

I suppose you could do:

strace mkfs.gfs2 ...

and it may tell us what io it's getting hung on, but I don't think
it will tell us much.  Based on what you're telling me, I suspect 
this isn't a problem with mkfs.gfs2 but rather something in a lower
layer.

The next thing I'd try is writing to the raw device.  In other
words, something like:

dd if=/dev/sdb1 of=/tmp/gronk bs=1M count=1

followed by the opposite:

dd if=/tmp/gronk of=/dev/sdb1 bs=1M count=1

If dd can write to the disk, then mkfs should be able to as well.

Regards,

Bob Peterson
Red Hat Cluster Suite



From zstingx at gmail.com  Sat Mar 24 08:29:09 2007
From: zstingx at gmail.com (Sting Zax)
Date: Sat, 24 Mar 2007 10:29:09 +0200
Subject: [Linux-cluster] cman_tool join causes "No multicast address for
	IPv6 node host4" - why ?
Message-ID: <3a0f1c620703240129w4516bd5ei2bf7f8c6f0151d74@mail.gmail.com>

Hello,
I am trying to install a cluster.
After I run /etc/init.d/ccsd start (and get OK)
I run:
cman_tool join

and I get:
cman_tool: No multicast address for IPv6 node host4
I am not using IPV6 and do not want to use IPV6;
nor did I configured the cluster to use IPV6 anywhere.


and cat /proc/cluster/status shows:
...
...
Cluster name:
Cluster ID: 0
Cluster Member: No
Membership state: Not-in-Cluster

Any idea why this error message ?


Reagrds,
Sting



From Nick.Couchman at seakr.com  Sat Mar 24 23:23:03 2007
From: Nick.Couchman at seakr.com (Nick Couchman)
Date: Sat, 24 Mar 2007 17:23:03 -0600
Subject: [Linux-cluster] cman_tool join causes "No multicast
	address for IPv6 node host4" - why ?
Message-ID: <46055E770200009900014F23@collaborate.seakr.com>

>>> "Sting Zax" <zstingx at gmail.com> 03/24/07 2:29 AM >>>
Hello,
I am trying to install a cluster.
After I run /etc/init.d/ccsd start (and get OK)
I run:
cman_tool join

and I get:
cman_tool: No multicast address for IPv6 node host4
I am not using IPV6 and do not want to use IPV6;
nor did I configured the cluster to use IPV6 anywhere.


and cat /proc/cluster/status shows:
...
...
Cluster name:
Cluster ID: 0
Cluster Member: No
Membership state: Not-in-Cluster

Any idea why this error message ?


Reagrds,
Sting


Sting:
First, make sure that your node names and IP addresses are either in DNS somewhere or in your /etc/hosts files (or both).  This can cause problems if ccsd does not know which IP or network device it should attach to.  Also, have you set up your /etc/cluster/cluster.conf file, yet?  The fact that /proc/cluster/status shows no cluster name and an id of 0 indicates that you may not have done that.  Go out to the Cluster site (sourceware.org/cluster) and look at the usage.txt file to see how you should do initial setup.



From bladilo at rice.edu  Sun Mar 25 03:17:21 2007
From: bladilo at rice.edu (Franco Martin Bladilo)
Date: Sat, 24 Mar 2007 22:17:21 -0500
Subject: [Linux-cluster] RHCS Coraid/ATAoe compatibility
Message-ID: <4605E9C1.1070802@rice.edu>

We are considering the option to build a 4 node HA cluster using RHCS 
and Coraid products (SR1521). ATAoE seems to be the most cost effective 
solution for our case and I was wondering if there are any comments/ 
feedback for such configuration.
Is the quorum partition issue (shared concurrent access) still a 
limitation for these appliances? Can these "service" partitions be 
formatted using GFS? Any other workarounds?

Thanks in advance,

Franco.



From hlawatschek at atix.de  Mon Mar 26 09:28:43 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Mon, 26 Mar 2007 11:28:43 +0200
Subject: [Linux-cluster] rgmanager is ping-ponging service
In-Reply-To: <1174601021.5158.55.camel@asuka.boston.devel.redhat.com>
References: <200703161449.16182.hlawatschek@atix.de>
	<200703221100.50772.hlawatschek@atix.de>
	<1174601021.5158.55.camel@asuka.boston.devel.redhat.com>
Message-ID: <200703261128.43435.hlawatschek@atix.de>

Hi Lon,

> This is definitely something that should end up in bugzilla, if it isn't
> already (it might be; I didn't check yet).
>
> It looks like what's going on is that we bring up the IP, but when we do
> a "ping-self" test, it's failing (even though somehow the link-check is
> succeeding?).  We can, I suppose, do this test after bringing up the IP
> address to prevent the "start" from succeeding (e.g. attached patch
> should do this).
Note, in the test I did a ifdown bond1 to disable the interface. I haven't 
unplugged the cables. 
>
> However, I'm not sure it will help.  In your logs, it says that
> 10.226.3.80 started successfully - however, it didn't even *mention*
> 10.0.46.46 in the start phase for whatever reason (but it does in the
> status phase).
>
> * Could you tell me your netmasks of your 10.x.x.x addresses on bond0
> and bond1?
10.0.46.46 belongs to bond1 with netmask 255.255.248.0
10.226.3.80 belongs to bond2 with netmask 255.255.248.0

Mark

-- 
Gruss / Regards,

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany



From lhh at redhat.com  Mon Mar 26 16:05:30 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 26 Mar 2007 12:05:30 -0400
Subject: [Linux-cluster] Unable to obtain lock
In-Reply-To: <200703220917.04917.grimme@atix.de>
References: <200701260919.05228.grimme@atix.de>
	<200701261928.43372.grimme@atix.de>
	<1170096286.30401.50.camel@rei.boston.devel.redhat.com>
	<200703220917.04917.grimme@atix.de>
Message-ID: <1174925131.2994.19.camel@localhost.localdomain>

On Thu, 2007-03-22 at 09:17 +0100, Marc Grimme wrote:
> Hello,
> again we had the same problem as stated in January. We installed the hotfix 
> but it didn't help.
> Again the whole cluster freezed, no node was allowed to rejoin the 
> fencedomain.
> Any ideas or do you need any more information?
> Thanks and Regards Marc.

This could be a couple of things, but I am certain it's not the same
problem you had in January (though the symptoms are similar).

Can you attach logs from the separate machines (instead of a single log
with all the messages interleaved)?  It would really help clear things
up.

-- Lon




From lhh at redhat.com  Mon Mar 26 16:42:40 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Mon, 26 Mar 2007 12:42:40 -0400
Subject: [Linux-cluster] ccsd startup problem
In-Reply-To: <2061.88.96.235.249.1174562860.squirrel@squirrelmail.tgfslp.dalmany.co.uk>
References: <2061.88.96.235.249.1174562860.squirrel@squirrelmail.tgfslp.dalmany.co.uk>
Message-ID: <1174927361.2994.21.camel@localhost.localdomain>

On Thu, 2007-03-22 at 11:27 +0000, G.Marshall wrote:
> I am having problems starting ccsd
> 
> I have followed the instructions in doc/usage.txt and also on
> https://rpeterso.108.redhat.com/files/documents/98/247/STABLE.txt
> 
> I have installed a 2.6.20.3 kernel from kernel.org
> 
> I have installed openais from svn, on two machines, started them both up
> with "aisexec" and can see them joining
> Mar 22 10:14:32.430926 [CLM  ] CLM CONFIGURATION CHANGE
> Mar 22 10:14:32.430979 [CLM  ] New Configuration:
> Mar 22 10:14:32.431041 [CLM  ]  r(0) ip(192.168.2.7)
> Mar 22 10:14:32.431104 [CLM  ]  r(0) ip(192.168.2.21)
> Mar 22 10:14:32.431153 [CLM  ] Members Left:
> Mar 22 10:14:32.431200 [CLM  ] Members Joined:
> Mar 22 10:14:32.431259 [CLM  ]  r(0) ip(192.168.2.21)
> Mar 22 10:14:32.431345 [SYNC ] This node is within the primary component
> and will provide service.
> Mar 22 10:14:32.431484 [TOTEM] entering OPERATIONAL state.
> Mar 22 10:14:32.434059 [CLM  ] got nodejoin message 192.168.2.7
> Mar 22 10:14:32.435451 [CLM  ] got nodejoin message 192.168.2.21
> 
> I have installed cluster from cvs and installed a cluster.conf
> 
> <?xml version="1.0"?>
> <cluster name="alpha" config_version="1">
> 
> <cman>
> </cman>
> 
> <clusternodes>
> <clusternode name="node01.cluster">
> 	<fence>
> 		<method name="single">
> 			<device name="human" nodename="node01.cluster"/>
> 		</method>
> 	</fence>
> </clusternode>
> 
> <clusternode name="node02.cluster">
> 	<fence>
> 		<method name="single">
> 			<device name="human" nodename="node02.cluster"/>
> 		</method>
> 	</fence>
> </clusternode>
> 
> <clusternode name="node03.cluster">
...


You're missing nodied="1", nodeid="2" in the <clusternode> definitions,
which is required on the newer infrastructure.

-- Lon




From Bowie_Bailey at BUC.com  Mon Mar 26 17:40:43 2007
From: Bowie_Bailey at BUC.com (Bowie Bailey)
Date: Mon, 26 Mar 2007 13:40:43 -0400
Subject: [Linux-cluster] RHCS Coraid/ATAoe compatibility
Message-ID: <4766EEE585A6D311ADF500E018C154E3026852CB@bnifex.cis.buc.com>

Franco Martin Bladilo wrote:
> We are considering the option to build a 4 node HA cluster using RHCS
> and Coraid products (SR1521). ATAoE seems to be the most cost
> effective solution for our case and I was wondering if there are any
> comments/ feedback for such configuration.

I currently have a Coraid SR1521 connected to three servers running
CentOS.  The only comment I would make is to make sure to get drives
that are specifically supported by Coraid.  We originally had some
unsupported drives that caused intermittent problems until we upgraded
the firmware on all the drives.

> Is the quorum partition issue (shared concurrent access) still a
> limitation for these appliances? Can these "service" partitions be
> formatted using GFS? Any other workarounds?

I'm not sure what issue you are referring to here, but I haven't had any
problems.  The Coraid unit is formatted as a single partition with GFS
and is actively shared by all three servers.

-- 
Bowie



From nathan.j.dragun at wmich.edu  Mon Mar 26 19:15:02 2007
From: nathan.j.dragun at wmich.edu (Nathan J Dragun)
Date: Mon, 26 Mar 2007 14:15:02 -0500
Subject: [Linux-cluster] configuring GFS2
Message-ID: <b7e976df4e4.4607d566@wmich.edu>

So, in short, I'm blow away by the sheer amount of information out there in regards to the RH cluster project.  So much, that I find it hard to find particular information.  I'm hoping that I can have a few things explained that I'm having trouble finding information for.

1.  What -exactly- is the process necessary for setting up gfs2.  I'm confused because there are so many 'pieces' to this, I feel like I've lost my way?
    - To give a little background, I'm using an iSCSI target which allows any machine to make the SAN device a local scsi device on the system.  Ex: /dev/sdb   (This in turn allows direct block access, and I can do whatever with it; as if it were literally a system disk.)  I've read something about needing to set up fencing, cluster information, and .... ??  I'm lost. lol
    - I set up a partition with dlm compiled into the kernel (2.6.20.1), and apparently I can only mount when using no_lock, dlm gives me a resource error (I'm assuming this has to do with no cluster information being set, or anything else configured).

2.  Is there a way to grow a gfs2 partition at the moment?  I see something for gfs1, yet no binaries, source, or references to a command except via this web page which outlines the command: http://www.die.net/doc/linux/man/man8/gfs2_grow.8.html

3. Quota.  Fortunately this is one of the last things I have to worry about on the chain of events that need to unfold.  I have not done much research on this last part as a result.  Does gfs's quota work like the quota system that .... well, I don't know if you want to call it a default, or generic, quota system .. or what?


Thank you greatly for your time and insight.  This project is a powerful resource, and I'm very appreciative of all the hard work people have put into it.

Nathan




From cbergstrom at netsyncro.com  Mon Mar 26 19:15:21 2007
From: cbergstrom at netsyncro.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=)
Date: Mon, 26 Mar 2007 22:15:21 +0300
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
Message-ID: <46081BC9.7000200@netsyncro.com>

I'm trying to really clean up my cluster deployment and while looking at 
GFS came across this.

http://people.redhat.com/rkenna/gfs-deploy-tool/html.doc/

To be upfront, I my dist of choice doesn't use an rpm based binary 
package system.  Is there or would there be any objection to add in the 
ability for other backends.. eg.. debs?  My timeframe is about 30 days 
to take a poke at this so I'm in no rush.  Any feedback is appreciated.

Cheers,

Christopher



From jos at xos.nl  Mon Mar 26 19:16:18 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 26 Mar 2007 21:16:18 +0200
Subject: [Linux-cluster] Why no node-specific <quorumd>?
Message-ID: <200703261916.l2QJGJh19329@xos037.xos.nl>

Hi,

I was wondering why <quorumd> is a cluster-global config item.  This
means it *has* to be the same for all nodes.  Is there a good reason
for this?

Of course it is possible to make node-specific distinctions in the
heuristic scripts that are executed.  But at least the votes associated
with each node now can't be different.

--
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From nathan.j.dragun at wmich.edu  Mon Mar 26 19:33:01 2007
From: nathan.j.dragun at wmich.edu (Nathan J Dragun)
Date: Mon, 26 Mar 2007 14:33:01 -0500
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
Message-ID: <c041c533723d.4607d99d@wmich.edu>

----- Original Message -----
From: "C. Bergstr?m" <cbergstrom at netsyncro.com>
Date: Monday, March 26, 2007 2:15 pm
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)

> I'm trying to really clean up my cluster deployment and while 
> looking at 
> GFS came across this.
> 
> http://people.redhat.com/rkenna/gfs-deploy-tool/html.doc/
> 
> To be upfront, I my dist of choice doesn't use an rpm based binary 
> package system.  Is there or would there be any objection to add in 
> the 
> ability for other backends.. eg.. debs?  My timeframe is about 30 
> days 
> to take a poke at this so I'm in no rush.  Any feedback is 
> appreciated.
> Cheers,
> 
> Christopher
> 
You should be able to just install the deb based "rpm" package, which will allow you to install rpm based packages.




From cbergstrom at netsyncro.com  Mon Mar 26 20:00:38 2007
From: cbergstrom at netsyncro.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=)
Date: Mon, 26 Mar 2007 23:00:38 +0300
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
In-Reply-To: <c041c533723d.4607d99d@wmich.edu>
References: <c041c533723d.4607d99d@wmich.edu>
Message-ID: <46082666.9070100@netsyncro.com>

Nathan J Dragun wrote:
> ----- Original Message -----
> From: "C. Bergstr?m" <cbergstrom at netsyncro.com>
> Date: Monday, March 26, 2007 2:15 pm
> Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
>
>   
>> I'm trying to really clean up my cluster deployment and while 
>> looking at 
>> GFS came across this.
>>
>> http://people.redhat.com/rkenna/gfs-deploy-tool/html.doc/
>>
>> To be upfront, I my dist of choice doesn't use an rpm based binary 
>> package system.  Is there or would there be any objection to add in 
>> the 
>> ability for other backends.. eg.. debs?  My timeframe is about 30 
>> days 
>> to take a poke at this so I'm in no rush.  Any feedback is 
>> appreciated.
>> Cheers,
>>
>> Christopher
>>
>>     
> You should be able to just install the deb based "rpm" package, which will allow you to install rpm based packages.
I'll have to actually try using this to understand what this means, but 
from what I read.. gfs-deploy will initiate the install of any 
additional deps (rpms) needed for the target nodes?  Granted I can 
change my images to include the deps by default, but is really the same 
functionality?  w/o having actually tried this. will the deployment fail 
if it doesn't find the "rpms"..

Thanks

C.



From sdake at redhat.com  Mon Mar 26 21:00:04 2007
From: sdake at redhat.com (Steven Dake)
Date: Mon, 26 Mar 2007 14:00:04 -0700
Subject: [Linux-cluster] Why no node-specific <quorumd>?
In-Reply-To: <200703261916.l2QJGJh19329@xos037.xos.nl>
References: <200703261916.l2QJGJh19329@xos037.xos.nl>
Message-ID: <1174942804.9784.5.camel@shih.broked.org>

On Mon, 2007-03-26 at 21:16 +0200, Jos Vos wrote:
> Hi,
> 
> I was wondering why <quorumd> is a cluster-global config item.  This
> means it *has* to be the same for all nodes.  Is there a good reason
> for this?
> 
> Of course it is possible to make node-specific distinctions in the
> heuristic scripts that are executed.  But at least the votes associated
> with each node now can't be different.
> 

I'm not sure why you would want to select one view of a primary
component (aka quorum membership) with one node, and another primary
component with another node.  Then as the nodes make decisions, they
would be inconsistent (ie: consider GFS one node would think writes were
ok, while another may not...  shouldn't they agree?).

Could you explain the use case?

Thanks
-steve

> --
> --    Jos Vos <jos at xos.nl>
> --    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
> --    Amsterdam, The Netherlands        |     Fax: +31 20 6948204
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From jos at xos.nl  Mon Mar 26 21:25:39 2007
From: jos at xos.nl (Jos Vos)
Date: Mon, 26 Mar 2007 23:25:39 +0200
Subject: [Linux-cluster] Why no node-specific <quorumd>?
In-Reply-To: <1174942804.9784.5.camel@shih.broked.org>;
	from sdake@redhat.com on Mon, Mar 26, 2007 at 02:00:04PM -0700
References: <200703261916.l2QJGJh19329@xos037.xos.nl>
	<1174942804.9784.5.camel@shih.broked.org>
Message-ID: <20070326232539.A19333@xos037.xos.nl>

On Mon, Mar 26, 2007 at 02:00:04PM -0700, Steven Dake wrote:

> I'm not sure why you would want to select one view of a primary
> component (aka quorum membership) with one node, and another primary
> component with another node.  Then as the nodes make decisions, they
> would be inconsistent (ie: consider GFS one node would think writes were
> ok, while another may not...  shouldn't they agree?).

Maybe one step back to the CMAN algorithm:

I have been looking for a comprehensive summary of the algorithm on
how CMAN determines node failures (heartbeat *AND* quorum disk should
show the node being "up"?) and how votes are calculated.  But I could
not find it.

I'm afraid it is not clear what the role the <quorumd> votes play:
are they only calculated on each node or are the votes stored on the
quorum disk and read by other nodes?

-- 
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From sebastian.walter at fu-berlin.de  Tue Mar 27 11:33:44 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 27 Mar 2007 13:33:44 +0200
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
In-Reply-To: <46081BC9.7000200@netsyncro.com>
References: <46081BC9.7000200@netsyncro.com>
Message-ID: <46090118.4020806@fu-berlin.de>

Hi Christopher,

I don't want to evangelize anyone, but maybe you should have a look at 
the rocks cluster distribution. Its a wrap-around of any RHEL4 
compatible distribution (CentOS, FC, RHEL), which does the deployment 
job for you. I'm at the moment rolling out a 48 core cluster with a 
timeframe of 1 month, and it's really taking a lot off the shoulders, 
altough it's still short in time. My setup is FC SAN, GFS and automatic 
deployment, and I'm currently writing on a howto for that setup (in case 
you're interested).

http://www.rocksclusters.org

Regards,
Sebastian


C. Bergstr?m wrote:
> I'm trying to really clean up my cluster deployment and while looking 
> at GFS came across this.
>
> http://people.redhat.com/rkenna/gfs-deploy-tool/html.doc/
>
> To be upfront, I my dist of choice doesn't use an rpm based binary 
> package system.  Is there or would there be any objection to add in 
> the ability for other backends.. eg.. debs?  My timeframe is about 30 
> days to take a poke at this so I'm in no rush.  Any feedback is 
> appreciated.
>
> Cheers,
>
> Christopher
>
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From cbergstrom at netsyncro.com  Tue Mar 27 11:50:49 2007
From: cbergstrom at netsyncro.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=)
Date: Tue, 27 Mar 2007 14:50:49 +0300
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
In-Reply-To: <46090118.4020806@fu-berlin.de>
References: <46081BC9.7000200@netsyncro.com> <46090118.4020806@fu-berlin.de>
Message-ID: <46090519.4030506@netsyncro.com>

Sebastian Walter wrote:
>
> C. Bergstr?m wrote:
>> I'm trying to really clean up my cluster deployment and while looking 
>> at GFS came across this.
>>
>> http://people.redhat.com/rkenna/gfs-deploy-tool/html.doc/
>>
>> To be upfront, I my dist of choice doesn't use an rpm based binary 
>> package system.  Is there or would there be any objection to add in 
>> the ability for other backends.. eg.. debs?  My timeframe is about 30 
>> days to take a poke at this so I'm in no rush.  Any feedback is 
>> appreciated.
>>
>
> Hi Christopher,
>
> I don't want to evangelize anyone, but maybe you should have a look at 
> the rocks cluster distribution. Its a wrap-around of any RHEL4 
> compatible distribution (CentOS, FC, RHEL), which does the deployment 
> job for you. I'm at the moment rolling out a 48 core cluster with a 
> timeframe of 1 month, and it's really taking a lot off the shoulders, 
> altough it's still short in time. My setup is FC SAN, GFS and 
> automatic deployment, and I'm currently writing on a howto for that 
> setup (in case you're interested).
>
> http://www.rocksclusters.org
>
I sincerely appreciate you pointing me in this direction.  I'm quite 
interested in any documentation you can point me at as I'm always up for 
learning/reading more.  In this case there's no reason to change dist..  
I'll patch gfs-deploy and either locally maintain the patches or 
hopefully get them accepted upstream.  I didn't realize this was a RH 
and friends only project.

Thanks

C.



From sebastian.walter at fu-berlin.de  Tue Mar 27 12:10:05 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 27 Mar 2007 14:10:05 +0200
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
In-Reply-To: <46090519.4030506@netsyncro.com>
References: <46081BC9.7000200@netsyncro.com> <46090118.4020806@fu-berlin.de>
	<46090519.4030506@netsyncro.com>
Message-ID: <4609099D.2010605@fu-berlin.de>

C. Bergstr?m wrote:
> Sebastian Walter wrote:
>>
>> C. Bergstr?m wrote:
>>> I'm trying to really clean up my cluster deployment and while 
>>> looking at GFS came across this.
>>>
>>> http://people.redhat.com/rkenna/gfs-deploy-tool/html.doc/
>>>
>>>
>>
>>
>> http://www.rocksclusters.org
>>
> I sincerely appreciate you pointing me in this direction.  I'm quite 
> interested in any documentation you can point me at as I'm always up 
> for learning/reading more.  In this case there's no reason to change 
> dist..  I'll patch gfs-deploy and either locally maintain the patches 
> or hopefully get them accepted upstream.  I didn't realize this was a 
> RH and friends only project.
As looking on the gfs-deploy page I can't find anything really exciting 
about this project, most of the features are already covered by the 
Cluster Suite.

http://www.centos.org/modules/news/article.php?storyid=108

Am I wrong? please correct me!

Sebastian



From cbergstrom at netsyncro.com  Tue Mar 27 12:16:44 2007
From: cbergstrom at netsyncro.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=)
Date: Tue, 27 Mar 2007 15:16:44 +0300
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
In-Reply-To: <4609099D.2010605@fu-berlin.de>
References: <46081BC9.7000200@netsyncro.com>
	<46090118.4020806@fu-berlin.de>	<46090519.4030506@netsyncro.com>
	<4609099D.2010605@fu-berlin.de>
Message-ID: <46090B2C.3020406@netsyncro.com>

Sebastian Walter wrote:
> C. Bergstr?m wrote:
>> Sebastian Walter wrote:
>>>
>>> C. Bergstr?m wrote:
>>>> I'm trying to really clean up my cluster deployment and while 
>>>> looking at GFS came across this.
>>>>
>>>> http://people.redhat.com/rkenna/gfs-deploy-tool/html.doc/
>>>>
>>>>
>>>
>>>
>>> http://www.rocksclusters.org
>>>
>> I sincerely appreciate you pointing me in this direction.  I'm quite 
>> interested in any documentation you can point me at as I'm always up 
>> for learning/reading more.  In this case there's no reason to change 
>> dist..  I'll patch gfs-deploy and either locally maintain the patches 
>> or hopefully get them accepted upstream.  I didn't realize this was a 
>> RH and friends only project.
> As looking on the gfs-deploy page I can't find anything really 
> exciting about this project, most of the features are already covered 
> by the Cluster Suite.
gfs-deploy from reading the docs appears to be an almost idiot proof 
tool to deploy......... gfs... Whereas the link you sent is just the 
dependency packages that by themselves have no interface to deploy other 
than manually on each node/target.  Correct me if I'm wrong?



From grimme at atix.de  Tue Mar 27 13:51:22 2007
From: grimme at atix.de (Marc Grimme)
Date: Tue, 27 Mar 2007 15:51:22 +0200
Subject: [Linux-cluster] Unable to obtain lock
In-Reply-To: <1174925131.2994.19.camel@localhost.localdomain>
References: <200701260919.05228.grimme@atix.de>
	<200703220917.04917.grimme@atix.de>
	<1174925131.2994.19.camel@localhost.localdomain>
Message-ID: <200703271551.22521.grimme@atix.de>

On Monday 26 March 2007 18:05:30 Lon Hohberger wrote:
> On Thu, 2007-03-22 at 09:17 +0100, Marc Grimme wrote:
> > Hello,
> > again we had the same problem as stated in January. We installed the
> > hotfix but it didn't help.
> > Again the whole cluster freezed, no node was allowed to rejoin the
> > fencedomain.
> > Any ideas or do you need any more information?
> > Thanks and Regards Marc.
>
> This could be a couple of things, but I am certain it's not the same
> problem you had in January (though the symptoms are similar).
>
> Can you attach logs from the separate machines (instead of a single log
> with all the messages interleaved)?  It would really help clear things
> up.
>
> -- Lon
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Find them attached. But it's a crash from yesterday 26.3. 04:00:00 with the 
same symtoms. 
Do you have any explanation on how the rgmanager can possibly freeze a whole 
cluster. Isn't that a DLM bug?

-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lilr623a.log
Type: text/x-log
Size: 847 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070327/0684ef34/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lilr623b.log
Type: text/x-log
Size: 2030 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070327/0684ef34/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lilr623e.log
Type: text/x-log
Size: 217 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070327/0684ef34/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lilr623f.log
Type: text/x-log
Size: 1344 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070327/0684ef34/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lilr623c.log
Type: text/x-log
Size: 110 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070327/0684ef34/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lilr623d.log
Type: text/x-log
Size: 1605 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070327/0684ef34/attachment-0005.bin>

From sebastian.walter at fu-berlin.de  Tue Mar 27 14:32:03 2007
From: sebastian.walter at fu-berlin.de (Sebastian Walter)
Date: Tue, 27 Mar 2007 16:32:03 +0200
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
In-Reply-To: <46090B2C.3020406@netsyncro.com>
References: <46081BC9.7000200@netsyncro.com>	<46090118.4020806@fu-berlin.de>	<46090519.4030506@netsyncro.com>	<4609099D.2010605@fu-berlin.de>
	<46090B2C.3020406@netsyncro.com>
Message-ID: <46092AE3.8030204@fu-berlin.de>

C. Bergstr?m wrote:
> Sebastian Walter wrote:
>> As looking on the gfs-deploy page I can't find anything really 
>> exciting about this project, most of the features are already covered 
>> by the Cluster Suite.
> gfs-deploy from reading the docs appears to be an almost idiot proof 
> tool to deploy......... gfs... Whereas the link you sent is just the 
> dependency packages that by themselves have no interface to deploy 
> other than manually on each node/target.  Correct me if I'm wrong?
The link gives you the packages (you just import the repository in yum 
as described) and the manuals for the Cluster Suite (CS) and GFS. 
According to the gfs manual, you can't install it without CS at all (for 
Redhat and his friends; at least, not in a idiot-proof way e.g. 
cluster.conf). After a first manual distribution of the cluster.conf 
file the CS arranges the distribution of all services and shared file 
systems. The deployment of the rpm's is done by the rocks cluster 
distribution, which is really comfortable and worth a look.

Good luck!
Sebastian



From hlawatschek at atix.de  Tue Mar 27 15:48:25 2007
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Tue, 27 Mar 2007 17:48:25 +0200
Subject: [Linux-cluster] Cluster maintenance mode ?
Message-ID: <200703271748.25566.hlawatschek@atix.de>

Hi,

I have a question regarding RHEL4 CS and GFS:
 
Is there a way to put the cman into a kind of maintenance mode, where only one 
node is up and no other node is allowed to join ?
If not, are there plans to implement something like that ?

The question is adressing the following issue: 
I'd like to enable automatic GFS filesystem checks. (localfs like every Nth 
mount or after M days... ) This FS check would only be allowed if no other 
node has the filesystem mounted or no other node is in the cluster. How can 
this be assured in a general way ?
   
Thanks, 

Mark
-- 
Gruss / Regards,

Dipl.-Ing. Mark Hlawatschek
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From teigland at redhat.com  Tue Mar 27 16:01:04 2007
From: teigland at redhat.com (David Teigland)
Date: Tue, 27 Mar 2007 11:01:04 -0500
Subject: [Linux-cluster] Cluster maintenance mode ?
In-Reply-To: <200703271748.25566.hlawatschek@atix.de>
References: <200703271748.25566.hlawatschek@atix.de>
Message-ID: <20070327160104.GB28370@redhat.com>

On Tue, Mar 27, 2007 at 05:48:25PM +0200, Mark Hlawatschek wrote:
> Is there a way to put the cman into a kind of maintenance mode, where
> only one node is up and no other node is allowed to join ?  If not, are
> there plans to implement something like that ?
> 
> The question is adressing the following issue: I'd like to enable
> automatic GFS filesystem checks. (localfs like every Nth mount or after
> M days... ) This FS check would only be allowed if no other node has the
> filesystem mounted or no other node is in the cluster. How can this be
> assured in a general way ?

The best approach to this problem IMO is to activate the LV exclusively on
the node you want to fsck from.  Last I heard, though, there were problems
with exclusive LV activation, so this didn't actually work.

Dave



From jos at xos.nl  Tue Mar 27 16:50:34 2007
From: jos at xos.nl (Jos Vos)
Date: Tue, 27 Mar 2007 18:50:34 +0200
Subject: [Linux-cluster] Asymmetric config for two-node cluster with qdisk?
Message-ID: <200703271650.l2RGoYG31011@xos037.xos.nl>

Hi,

Should I put some asymmetry in the cluster config (and/or the qdisk
heuristic scripts) of a two-node cluster to make (only) one node
decide to continue its services in case of a split-brain problem?

Or are heartbeat failures ignored *if* the quorum disk still shows
that the other node is running ok (or does this only affect whether
to fence or not)?

During my tests (disconnecting nodes, shared storage via working)
I got tons of messages like

  qdiskd[4012]: <crit> A master exists, but it's not me?!
  qdiskd[4012]: <crit> Critical Error: More than one master found!

and I finally had to reboot one node to make this "master race" be
solved.

I'm getting more confused now, some samples would help...  As said
yesterday, what's really missing is a clear picture of the algorithm
CMAN uses to determine membership, calculate votes and quorum, etc.

P.S.
While testing disconnectivity I got an almost synchronous (on both
nodes) kernel panic in gfs_lockd (still using the 42.0.3.EL kernel,
will upgrade soon...).

--
--    Jos Vos <jos at xos.nl>
--    X/OS Experts in Open Systems BV   |   Phone: +31 20 6938364
--    Amsterdam, The Netherlands        |     Fax: +31 20 6948204



From nenad at panline.net  Tue Mar 27 16:52:30 2007
From: nenad at panline.net (Nenad Opsenica)
Date: Tue, 27 Mar 2007 18:52:30 +0200
Subject: [Linux-cluster] writev does not work properly on GFS filesystem
Message-ID: <46094BCE.5070409@panline.net>

I have found a problem when trying to use syslog (sysklogd-1.4.1) for 
writing log files to GFS-mounted filesystem, as only timestamp gets logged.
After searching and tracing with strace, it seems that problem is 
somewhere in GFS filesystem writev implementation.


Here is strace of what happens when logging to file on ext3 FS (using  
logger -p user.notice "test-ext3 filesystem" to force logging):

    recv(3, "<13>Mar 27 18:27:57 root: test-e"..., 1022, 0) = 46
    time(NULL)                              = 1175012877
    writev(4, [{"Mar 27 18:27:57", 15}, {" ", 1}, {"pantelija", 9}, {" 
", 1}, {"root: test-ext3 filesystem", 26}, {"\n", 1}], 6) = 53
    fsync(4)                                = 0


And now, when trying to log to file on GFS FS (using logger -p 
user.notice "test-gfs filesystem") - we have got 45 bytes to write, 
syslog is properly formatting text string into 6 array entries, but only 
the first is written - as could be seen from writev return code:

    recv(3, "<13>Mar 27 18:28:49 root: test-g"..., 1022, 0) = 45
    time(NULL)                              = 1175012929
    writev(4, [{"Mar 27 18:28:49", 15}, {" ", 1}, {"pantelija", 9}, {" 
", 1}, {"root: test-gfs filesystem", 25}, {"\n", 1}], 6) = 15
    fsync(4)                                = 0


Kernel is 2.6.20.2 and GFS-kernel is 1.04.00

    Nenad




From wcheng at redhat.com  Tue Mar 27 16:51:36 2007
From: wcheng at redhat.com (Wendy Cheng)
Date: Tue, 27 Mar 2007 12:51:36 -0400
Subject: [Linux-cluster] writev does not work properly on GFS filesystem
In-Reply-To: <46094BCE.5070409@panline.net>
References: <46094BCE.5070409@panline.net>
Message-ID: <46094B98.4000803@redhat.com>

Nenad Opsenica wrote:
> I have found a problem when trying to use syslog (sysklogd-1.4.1) for 
> writing log files to GFS-mounted filesystem, as only timestamp gets 
> logged.
> After searching and tracing with strace, it seems that problem is 
> somewhere in GFS filesystem writev implementation.
>
>
> Here is strace of what happens when logging to file on ext3 FS (using  
> logger -p user.notice "test-ext3 filesystem" to force logging):
>
>    recv(3, "<13>Mar 27 18:27:57 root: test-e"..., 1022, 0) = 46
>    time(NULL)                              = 1175012877
>    writev(4, [{"Mar 27 18:27:57", 15}, {" ", 1}, {"pantelija", 9}, {" 
> ", 1}, {"root: test-ext3 filesystem", 26}, {"\n", 1}], 6) = 53
>    fsync(4)                                = 0
>
>
> And now, when trying to log to file on GFS FS (using logger -p 
> user.notice "test-gfs filesystem") - we have got 45 bytes to write, 
> syslog is properly formatting text string into 6 array entries, but 
> only the first is written - as could be seen from writev return code:
>
>    recv(3, "<13>Mar 27 18:28:49 root: test-g"..., 1022, 0) = 45
>    time(NULL)                              = 1175012929
>    writev(4, [{"Mar 27 18:28:49", 15}, {" ", 1}, {"pantelija", 9}, {" 
> ", 1}, {"root: test-gfs filesystem", 25}, {"\n", 1}], 6) = 15
>    fsync(4)                                = 0
>
>
> Kernel is 2.6.20.2 and GFS-kernel is 1.04.00
>
>    
Yeah... an oversight. I'll fix this soon.

-- Wendy



From dnlombar at ichips.intel.com  Tue Mar 27 19:05:29 2007
From: dnlombar at ichips.intel.com (Lombard, David N)
Date: Tue, 27 Mar 2007 12:05:29 -0700
Subject: [Linux-cluster] Making gfs-deploy-tool dist agnostic (somewhat)
In-Reply-To: <46090118.4020806@fu-berlin.de>
References: <46081BC9.7000200@netsyncro.com> <46090118.4020806@fu-berlin.de>
Message-ID: <20070327190529.GB28692@nlxdcldnl2.cl.intel.com>

On Tue, Mar 27, 2007 at 01:33:44PM +0200, Sebastian Walter wrote:
> Hi Christopher,
> 
> I don't want to evangelize anyone, but maybe you should have a look at 
> the rocks cluster distribution. Its a wrap-around of any RHEL4 
> compatible distribution (CentOS, FC, RHEL), which does the deployment 
> job for you. I'm at the moment rolling out a 48 core cluster with a 
> timeframe of 1 month, and it's really taking a lot off the shoulders, 
> altough it's still short in time. My setup is FC SAN, GFS and automatic 
> deployment, and I'm currently writing on a howto for that setup (in case 
> you're interested).
> 
> http://www.rocksclusters.org

Just to make sure expectations are properly set, Rocks is oriented to
the typical HPC usages.  If that your model, then have a happy time with it.

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.



From jlauro at umflint.edu  Tue Mar 27 20:20:32 2007
From: jlauro at umflint.edu (Lauro, John)
Date: Tue, 27 Mar 2007 16:20:32 -0400
Subject: [Linux-cluster] GFS2 questions
Message-ID: <D4BFED4FB59C104ABC9FF2C745FCD4F703DC24BC@SMB2.umflint.edu>

Hello,

 

Seems like GFS would have it's own area, but... project page directs
to this list...

 

 

I do not need a cluster file system currently.  However, I am
outgrowing ext3 (mainly the volume size limit).  Xfs looks unworkable
due to memory requirements for a large filesystem, and EXT4 is ready
yet.  I was thinking about using JFS, but that isn't supported in
RedHat 5.  So either I have to settle for EXT3 (split the file
system), or pick a different distro, or pick a different file system.

 

Wasn't considering GFS2 at first, but it appears to be built into
Redhat 5.  If anyone can recommend something else that is supported by
RedHat 5 and supports large volumes, that would be good too.

 

Where are the specs for GFS2?  (Max volume size, max file size, max
number of directories in a directory, etc...)?

 

How does performance compare to EXT3 (and JFS) when running GFS2 in
non locking mode?  Given that GFS is designed to work in a cluster, I
am a little concerned it might not be as efficient on a single machine
with a file system not designed to run on a cluster.  Plus I am a
little concerned that GFS2 has not been out as long or in production
as much compared to ext3 and jfs...

 

 

Thanks for any insights.

 

 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070327/6cd37758/attachment.htm>

From davshwartz at gmail.com  Tue Mar 27 20:22:01 2007
From: davshwartz at gmail.com (David Shwatrz)
Date: Tue, 27 Mar 2007 22:22:01 +0200
Subject: [Linux-cluster] using GNBD versus iSCSI
Message-ID: <9ea95a710703271322m34d0bacctf388a4cd7d7bd840@mail.gmail.com>

Hello,
In short: what are the advantages/disadvantages when using GNBD versus iSCSI
for exporting storage in a cluster?
Regards,
Dave S.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070327/f0c76af1/attachment.htm>

From arajekar at max-t.com  Tue Mar 27 21:14:29 2007
From: arajekar at max-t.com (Ashutosh Rajekar)
Date: Tue, 27 Mar 2007 17:14:29 -0400 (EDT)
Subject: [Linux-cluster] fence_node problem with CMAN version 2.0.60-1.el5
In-Reply-To: <45FE548A.5010905@redhat.com>
References: <Pine.LNX.4.63.0703161433460.10489@laurentian>
	<49622.192.168.1.140.1174070249.squirrel@mail.max-t.com>
	<45FE548A.5010905@redhat.com>
Message-ID: <34362.192.168.1.140.1175030069.squirrel@mail.max-t.com>

I just rebuilt RPMs from the source RPMs of RHEL5 released a few days ago
from
(ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/cman-2.0.60-1.el5.src.rpm)


fence_node doesn't seem to work; I get prints like:
"Mar 27 15:05:52 mds1 fence_node[11404]: Fence of "iosrv2" was unsuccessful".

My configuration file looks like the following:

---------------------------------------
<?xml version="1.0"?>
<cluster name="IA" config_version="2">
  <cman>
    <multicast addr="225.0.0.225"/>
  </cman>
  <clusternodes>
    <clusternode name="mds1" votes="128" nodeid="1">
      <multicast addr="225.0.0.225" interface="myri0"/>
      <fence>
        <method name="single">
          <device name="none"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="iosrv2" votes="1" nodeid="3">
      <multicast addr="225.0.0.225" interface="myri0"/>
      <fence>
        <method name="single">
          <device name="none"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="none" agent="fence_none"/>
  </fencedevices>
</cluster>
---------------------------------------

All nodes are running fine, and can see each other. cman_tool status
returns no problem on either node. CCSD is running fine too:

[root at mds1 ~]# ps ax | grep ccsd
 6448 ?        Ssl    0:00 /sbin/ccsd
11476 pts/0    S+     0:00 grep ccsd


Fenced is running fine too:

[root at mds1 ~]# ps ax  |grep fenced
 6472 ?        Ss     0:00 /sbin/fenced
11488 pts/0    S+     0:00 grep fenced

Any ideas?

Regards,
-Ashutosh



From sflee at wou.edu.my  Wed Mar 28 01:29:15 2007
From: sflee at wou.edu.my (Lee Siang Fong)
Date: Wed, 28 Mar 2007 09:29:15 +0800
Subject: [Linux-cluster] LVS+GFS+GNBD
Message-ID: <460A3564.D1F2.004B.0@wou.edu.my>

Dear All,
 
I am planning to use only 2 nodes clusters sharing 1 storage location, which is a Raid 5 SCSI hard drives of another HP DL385 machine. Appreciate if you advise me the following:-
 
1. Can GNBD work well in this case? 
2. I have successfully installed GNBD  and GFS( all in one server) and mount it locally to 2 web cluster nodes(I put mount command in rc.local), but when 1 do simulation on shutting down 2 webservers, it hung; i have to manually shut it down. When I start both servers, it takes very long time to startup, sometimes it also hung? Any suggestion?
3. I hardly  find GFS mounting instruction in RHCS tools, I tried to mount GFS as  Shared Resources, but it can only mount at 1 server at a time.
4. I have used back the old school method, I only used NFS for my data. May  I know if there are any performance issues? ( i think we have little chances to faced data corruption due to most of the clients are merely downloading the contents).
 
 
Thanks a and sorry for my bad english.
 
Rdgs
SFLee
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070328/0a30fcce/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: LMS_cluster.JPG
Type: image/jpeg
Size: 46960 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070328/0a30fcce/attachment.jpe>

From Pawan_Wairagade at syntelinc.com  Wed Mar 28 04:39:30 2007
From: Pawan_Wairagade at syntelinc.com (Wairagade, Pawan)
Date: Wed, 28 Mar 2007 00:39:30 -0400
Subject: [Linux-cluster] hi
Message-ID: <9B808266E5B6DB119E4C0002B3C146A59E9AA3@PUNEEXCH1>

Hi All,

I am new to this linux cluster .

Anyone who have the documents or link for the same . It will help me in my
studies .... waiting for your reply

Thanks in advance . take care



Best Regards,
Pawan

____________________________________________
Confidential:  This electronic message and all contents contain information
from Syntel, Inc. which may be privileged, confidential or otherwise
protected from disclosure. The information is intended to be for the
addressee only. If you are not the addressee, any disclosure, copy,
distribution or use of the contents of this message is prohibited.  If you
have received this electronic message in error, please notify the sender
immediately and destroy the original message and all copies.



From ramon at vanalteren.nl  Wed Mar 28 08:19:09 2007
From: ramon at vanalteren.nl (Ramon van Alteren)
Date: Wed, 28 Mar 2007 10:19:09 +0200
Subject: [Linux-cluster] RHCS Coraid/ATAoe compatibility
In-Reply-To: <4605E9C1.1070802@rice.edu>
References: <4605E9C1.1070802@rice.edu>
Message-ID: <460A24FD.4000509@vanalteren.nl>

Franco Martin Bladilo wrote:
> We are considering the option to build a 4 node HA cluster using RHCS 
> and Coraid products (SR1521). ATAoE seems to be the most cost 
> effective solution for our case and I was wondering if there are any 
> comments/ feedback for such configuration.
We're extensively using gfs + coraid 1521's in our setup.
Currently running 3 5-node clusters with 2 - 5 coraid devices attached.

You should not expect the same performance from gfs over AoE as listed 
in the coraid docs.
Depending on the model version you will not get better performance by 
enabling the secondary nic on the coraid devices, rather the opposite.
They claim to have solved this with their newer models but I haven't 
tested yet.

Make sure you use the coraid supplied AoE drivers, performance is much 
better than the in-kernel ones.
Apart from that, install the cec utility, it allows you to login on the 
coraids remotely and thus enables monitoring of raid arrays by a NMS of 
your choice

On the whole we are reasonable statisfied with the solution we managed 
to build on top of gfs and coraids.
It certainly is cost-effective.
> Is the quorum partition issue (shared concurrent access) still a 
> limitation for these appliances? Can these "service" partitions be 
> formatted using GFS? Any other workarounds?
Can't comment, we use the dlm + fencing via APC powerbars.

Regards,

Ramon van Alteren



From grimme at atix.de  Wed Mar 28 08:24:37 2007
From: grimme at atix.de (Marc Grimme)
Date: Wed, 28 Mar 2007 10:24:37 +0200
Subject: [Linux-cluster] Cluster maintenance mode ?
In-Reply-To: <20070327160104.GB28370@redhat.com>
References: <200703271748.25566.hlawatschek@atix.de>
	<20070327160104.GB28370@redhat.com>
Message-ID: <200703281024.37539.grimme@atix.de>

On Tuesday 27 March 2007 18:01:04 David Teigland wrote:
> On Tue, Mar 27, 2007 at 05:48:25PM +0200, Mark Hlawatschek wrote:
> > Is there a way to put the cman into a kind of maintenance mode, where
> > only one node is up and no other node is allowed to join ?  If not, are
> > there plans to implement something like that ?
> >
> > The question is adressing the following issue: I'd like to enable
> > automatic GFS filesystem checks. (localfs like every Nth mount or after
> > M days... ) This FS check would only be allowed if no other node has the
> > filesystem mounted or no other node is in the cluster. How can this be
> > assured in a general way ?
>
> The best approach to this problem IMO is to activate the LV exclusively on
> the node you want to fsck from.  Last I heard, though, there were problems
> with exclusive LV activation, so this didn't actually work.
>
> Dave
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
And the other question would be can one detect a currupted filesystem. 

I.e. is it possible to detect if the fs was not cleanly unmounted? Are there 
any possibilities?

-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From jvantuyl at engineyard.com  Wed Mar 28 10:49:03 2007
From: jvantuyl at engineyard.com (Jayson Vantuyl)
Date: Wed, 28 Mar 2007 03:49:03 -0700
Subject: [Linux-cluster] RHCS Coraid/ATAoe compatibility
In-Reply-To: <460A24FD.4000509@vanalteren.nl>
References: <4605E9C1.1070802@rice.edu> <460A24FD.4000509@vanalteren.nl>
Message-ID: <1308E1BD-DCD4-4488-BD30-BB92FA6ED0B2@engineyard.com>

The newer units are actually quite a bit faster.  Also, I highly  
recommend connecting your Coraids in a dual-network / not-cross- 
connected fashion.

In other words, given machines A,B,C each with eth2 and eth3;  
switches Alpha and Beta; and Coraids Aleph and Beth each with  
internal connections Uno and Dos, that you connect them as follows.   
For switch Alpha, plug in A-eth2, B-eth2, C-eth2, Aleph-Uno, Beth- 
Uno.  For switch Beta, plug-in A-eth3, B-eth3, C-eth3, Aleph-Dos,  
Beth-Dos.  Do not cross-connect Alpha and Beta!  In this  
configuration, with Jumbo frames, we get pretty good performance.

I've experimented with Quorum Disks and find that I quite like them,  
although we've just never needed them as we also fence with APC power  
bars.  We use GFS extensively and it works quite well.

On Mar 28, 2007, at 1:19 AM, Ramon van Alteren wrote:

> Franco Martin Bladilo wrote:
>> We are considering the option to build a 4 node HA cluster using  
>> RHCS and Coraid products (SR1521). ATAoE seems to be the most cost  
>> effective solution for our case and I was wondering if there are  
>> any comments/ feedback for such configuration.
> We're extensively using gfs + coraid 1521's in our setup.
> Currently running 3 5-node clusters with 2 - 5 coraid devices  
> attached.
>
> You should not expect the same performance from gfs over AoE as  
> listed in the coraid docs.
> Depending on the model version you will not get better performance  
> by enabling the secondary nic on the coraid devices, rather the  
> opposite.
> They claim to have solved this with their newer models but I  
> haven't tested yet.
>
> Make sure you use the coraid supplied AoE drivers, performance is  
> much better than the in-kernel ones.
> Apart from that, install the cec utility, it allows you to login on  
> the coraids remotely and thus enables monitoring of raid arrays by  
> a NMS of your choice
>
> On the whole we are reasonable statisfied with the solution we  
> managed to build on top of gfs and coraids.
> It certainly is cost-effective.
>> Is the quorum partition issue (shared concurrent access) still a  
>> limitation for these appliances? Can these "service" partitions be  
>> formatted using GFS? Any other workarounds?
> Can't comment, we use the dlm + fencing via APC powerbars.
>
> Regards,
>
> Ramon van Alteren
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Jayson Vantuyl
Systems Architect
Engine Yard
jvantuyl at engineyard.com


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070328/115cb060/attachment.htm>

From ramon at vanalteren.nl  Wed Mar 28 12:38:29 2007
From: ramon at vanalteren.nl (Ramon van Alteren)
Date: Wed, 28 Mar 2007 14:38:29 +0200
Subject: [Linux-cluster] RHCS Coraid/ATAoe compatibility
In-Reply-To: <1308E1BD-DCD4-4488-BD30-BB92FA6ED0B2@engineyard.com>
References: <4605E9C1.1070802@rice.edu> <460A24FD.4000509@vanalteren.nl>
	<1308E1BD-DCD4-4488-BD30-BB92FA6ED0B2@engineyard.com>
Message-ID: <460A61C5.1080109@vanalteren.nl>

Jayson Vantuyl wrote:
> The newer units are actually quite a bit faster.  Also, I highly 
> recommend connecting your Coraids in a dual-network / 
> not-cross-connected fashion.
Interesting, any benchmarks on real-life gfs usage ?
I've seen the coraid ones, but we never got that performance out of gfs.

Grtz Ramon



From teigland at redhat.com  Wed Mar 28 13:38:10 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 28 Mar 2007 08:38:10 -0500
Subject: [Linux-cluster] Cluster maintenance mode ?
In-Reply-To: <200703281024.37539.grimme@atix.de>
References: <200703271748.25566.hlawatschek@atix.de>
	<20070327160104.GB28370@redhat.com>
	<200703281024.37539.grimme@atix.de>
Message-ID: <20070328133810.GA22230@redhat.com>

On Wed, Mar 28, 2007 at 10:24:37AM +0200, Marc Grimme wrote:
> I.e. is it possible to detect if the fs was not cleanly unmounted?
> Are there any possibilities?

I believe fsck checks whether the journal was shut down cleanly (part of
unmount) and cleans it up if it wasn't.  Would you like an option for fsck
to quit if it finds the journal was cleanly shut down by unmount?

Dave



From grimme at atix.de  Wed Mar 28 13:59:56 2007
From: grimme at atix.de (Marc Grimme)
Date: Wed, 28 Mar 2007 15:59:56 +0200
Subject: [Linux-cluster] Cluster maintenance mode ?
In-Reply-To: <20070328133810.GA22230@redhat.com>
References: <200703271748.25566.hlawatschek@atix.de>
	<200703281024.37539.grimme@atix.de>
	<20070328133810.GA22230@redhat.com>
Message-ID: <200703281559.57962.grimme@atix.de>

On Wednesday 28 March 2007 15:38:10 you wrote:
> On Wed, Mar 28, 2007 at 10:24:37AM +0200, Marc Grimme wrote:
> > I.e. is it possible to detect if the fs was not cleanly unmounted?
> > Are there any possibilities?
>
> I believe fsck checks whether the journal was shut down cleanly (part of
> unmount) and cleans it up if it wasn't.  Would you like an option for fsck
> to quit if it finds the journal was cleanly shut down by unmount?
That would be an option. 

But basically it would be great to have something to detect a "corrupt" 
filesystem or at least to check if the filesystem was not cleanly unmounted.

Background is that we had some clusters that crashed because of corrupt 
gfs-filesystems. But we couldn't see from the dump that those filesystems 
were corrupt and needed to be fixed.

Regards Marc.
>
> Dave



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From rpeterso at redhat.com  Wed Mar 28 14:01:22 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 28 Mar 2007 09:01:22 -0500
Subject: [Linux-cluster] hi
In-Reply-To: <9B808266E5B6DB119E4C0002B3C146A59E9AA3@PUNEEXCH1>
References: <9B808266E5B6DB119E4C0002B3C146A59E9AA3@PUNEEXCH1>
Message-ID: <460A7532.1070306@redhat.com>

Wairagade, Pawan wrote:
> Hi All,
> 
> I am new to this linux cluster .
> 
> Anyone who have the documents or link for the same . It will help me in my
> studies .... waiting for your reply
> 
> Thanks in advance . take care
> 
> 
> 
> Best Regards,
> Pawan
Hi Pawan,

There's some documentation and links on the cluster project page:
http://sources.redhat.com/cluster/

Regards,

Bob Peterson
Red Hat Cluster Suite



From jvantuyl at engineyard.com  Wed Mar 28 14:17:43 2007
From: jvantuyl at engineyard.com (Jayson Vantuyl)
Date: Wed, 28 Mar 2007 07:17:43 -0700
Subject: [Linux-cluster] RHCS Coraid/ATAoe compatibility
In-Reply-To: <460A61C5.1080109@vanalteren.nl>
References: <4605E9C1.1070802@rice.edu> <460A24FD.4000509@vanalteren.nl>
	<1308E1BD-DCD4-4488-BD30-BB92FA6ED0B2@engineyard.com>
	<460A61C5.1080109@vanalteren.nl>
Message-ID: <DAFEE7AC-9394-41B3-8C19-B22B25DDAE26@engineyard.com>

We don't have benchmarks per se, but I've had Rsync's float anywhere  
between 6 MB/s and 40 MB/s depending on load and file distribution.   
Considering that GFS1 is not extremely friendly to stat'ing 100,000  
files, I've usually felt this was pretty good.  You'll find that any  
metadata-heavy load is pretty slow in GFS in general.  We find that  
it's generally pretty close to local disks.  Good enough for large  
scale file storage but perhaps a bit below what you'd put behind a  
ultra-fast DBMS.

On Mar 28, 2007, at 5:38 AM, Ramon van Alteren wrote:

> Jayson Vantuyl wrote:
>> The newer units are actually quite a bit faster.  Also, I highly  
>> recommend connecting your Coraids in a dual-network / not-cross- 
>> connected fashion.
> Interesting, any benchmarks on real-life gfs usage ?
> I've seen the coraid ones, but we never got that performance out of  
> gfs.
>
> Grtz Ramon
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Jayson Vantuyl
Systems Architect
Engine Yard
jvantuyl at engineyard.com


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070328/2512b1aa/attachment.htm>

From johnsonzjo at gmail.com  Wed Mar 28 14:28:21 2007
From: johnsonzjo at gmail.com (Andy Johnson)
Date: Wed, 28 Mar 2007 16:28:21 +0200
Subject: [Linux-cluster] Unable to access cluster service while starting
	heartbeat
Message-ID: <147a89290703280728w167e3fbay39f7cfeb8d6d15ea@mail.gmail.com>

Hello,
I have ocfs2-tools-1.2.3. ( built from sources)
I try:
 mount /dev/vg1/cluster /mnt/ocfs2/
and get:
ocfs2_hb_ctl: Unable to access cluster service while starting heartbeat
mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: "Operation not
permitted"

What can be the cause of this error ? Any ideas?

(I tried it after service heartbeat stop)

Regards,
Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070328/0c450b4a/attachment.htm>

From teigland at redhat.com  Wed Mar 28 14:36:56 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 28 Mar 2007 09:36:56 -0500
Subject: [Linux-cluster] Cluster maintenance mode ?
In-Reply-To: <200703281559.57962.grimme@atix.de>
References: <200703271748.25566.hlawatschek@atix.de>
	<200703281024.37539.grimme@atix.de>
	<20070328133810.GA22230@redhat.com>
	<200703281559.57962.grimme@atix.de>
Message-ID: <20070328143656.GB22230@redhat.com>

On Wed, Mar 28, 2007 at 03:59:56PM +0200, Marc Grimme wrote:
> On Wednesday 28 March 2007 15:38:10 you wrote:
> > On Wed, Mar 28, 2007 at 10:24:37AM +0200, Marc Grimme wrote:
> > > I.e. is it possible to detect if the fs was not cleanly unmounted?
> > > Are there any possibilities?
> >
> > I believe fsck checks whether the journal was shut down cleanly (part of
> > unmount) and cleans it up if it wasn't.  Would you like an option for fsck
> > to quit if it finds the journal was cleanly shut down by unmount?
> That would be an option. 
> 
> But basically it would be great to have something to detect a "corrupt" 
> filesystem or at least to check if the filesystem was not cleanly unmounted.
> 
> Background is that we had some clusters that crashed because of corrupt 
> gfs-filesystems. But we couldn't see from the dump that those filesystems 
> were corrupt and needed to be fixed.

Ah, I see what you mean -- there's no way to tell that a corrupted fs is
corrupted (apart from mounting it and eventually tripping over the
corruption again.)  So, when a node discovers that the fs is corrupted, it
could set some kind of CORRUPTED flag on the fs.  Mount could then check
for this flag and return an error.  Or we could possibly automate
something like:

  if CORRUPTED flag is set
    try to activate lv exclusively
    if that works, start fsck
  else if CORRUPTED flag is not set
    mount fs

It'll take some thought to figure out how to properly flag an fs when gfs
detects corruption.

Dave
 



From teigland at redhat.com  Wed Mar 28 14:39:32 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 28 Mar 2007 09:39:32 -0500
Subject: [Linux-cluster] Unable to access cluster service while starting
	heartbeat
In-Reply-To: <147a89290703280728w167e3fbay39f7cfeb8d6d15ea@mail.gmail.com>
References: <147a89290703280728w167e3fbay39f7cfeb8d6d15ea@mail.gmail.com>
Message-ID: <20070328143932.GC22230@redhat.com>

On Wed, Mar 28, 2007 at 04:28:21PM +0200, Andy Johnson wrote:
> Hello,
> I have ocfs2-tools-1.2.3. ( built from sources)
> I try:
> mount /dev/vg1/cluster /mnt/ocfs2/
> and get:
> ocfs2_hb_ctl: Unable to access cluster service while starting heartbeat
> mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: "Operation not
> permitted"
> 
> What can be the cause of this error ? Any ideas?
> 
> (I tried it after service heartbeat stop)

I think you're looking for the list here:
  http://oss.oracle.com/mailman/listinfo/ocfs2-devel

linux-cluster is generally about gfs.

Dave



From rpeterso at redhat.com  Wed Mar 28 14:30:24 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 28 Mar 2007 09:30:24 -0500
Subject: [Linux-cluster] Invalid cross-device link when building cluster-2
In-Reply-To: <d0383f90703180902j67d9a824rb51cbaee0676e908@mail.gmail.com>
References: <d0383f90703180902j67d9a824rb51cbaee0676e908@mail.gmail.com>
Message-ID: <460A7C00.9060706@redhat.com>

Ian Brown wrote:
> make[2]: Entering directory `/work/src/cluster-2.00.00/gfs2/mkfs'
> install -m 0755 mkfs.gfs2 //sbin
> ln -f mkfs.gfs2 //sbin/gfs2_jadd
> ln: creating hard link `//sbin/gfs2_jadd' to `mkfs.gfs2': Invalid
> cross-device link
> 
> Any idea what can be wrong here ?
> regards,
> Ian
Hi Ian,

This is bz 233083.  See:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=233083
I've updated cvs with the solution in the HEAD and RHEL5 branches.

Regards,

Bob Peterson
Red Hat Cluster Suite



From grimme at atix.de  Wed Mar 28 14:51:05 2007
From: grimme at atix.de (Marc Grimme)
Date: Wed, 28 Mar 2007 16:51:05 +0200
Subject: [Linux-cluster] Cluster maintenance mode ?
In-Reply-To: <20070328143656.GB22230@redhat.com>
References: <200703271748.25566.hlawatschek@atix.de>
	<200703281559.57962.grimme@atix.de>
	<20070328143656.GB22230@redhat.com>
Message-ID: <200703281651.06819.grimme@atix.de>

On Wednesday 28 March 2007 16:36:56 David Teigland wrote:
> On Wed, Mar 28, 2007 at 03:59:56PM +0200, Marc Grimme wrote:
> > On Wednesday 28 March 2007 15:38:10 you wrote:
> > > On Wed, Mar 28, 2007 at 10:24:37AM +0200, Marc Grimme wrote:
> > > > I.e. is it possible to detect if the fs was not cleanly unmounted?
> > > > Are there any possibilities?
> > >
> > > I believe fsck checks whether the journal was shut down cleanly (part
> > > of unmount) and cleans it up if it wasn't.  Would you like an option
> > > for fsck to quit if it finds the journal was cleanly shut down by
> > > unmount?
> >
> > That would be an option.
> >
> > But basically it would be great to have something to detect a "corrupt"
> > filesystem or at least to check if the filesystem was not cleanly
> > unmounted.
> >
> > Background is that we had some clusters that crashed because of corrupt
> > gfs-filesystems. But we couldn't see from the dump that those filesystems
> > were corrupt and needed to be fixed.
>
> Ah, I see what you mean -- there's no way to tell that a corrupted fs is
> corrupted (apart from mounting it and eventually tripping over the
> corruption again.)  So, when a node discovers that the fs is corrupted, it
> could set some kind of CORRUPTED flag on the fs.  Mount could then check
> for this flag and return an error.  Or we could possibly automate
> something like:
>
>   if CORRUPTED flag is set
>     try to activate lv exclusively
>     if that works, start fsck
>   else if CORRUPTED flag is not set
>     mount fs
>
> It'll take some thought to figure out how to properly flag an fs when gfs
> detects corruption.
That would do the trick. 

The currupted flag and the exclusiveness of the lvm-logical volume.
>
> Dave

Marc.

-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From rpeterso at redhat.com  Wed Mar 28 14:49:55 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 28 Mar 2007 09:49:55 -0500
Subject: [Linux-cluster] Using GFS on Compact Flash
In-Reply-To: <08A9A3213527A6428774900A80DBD8D803A007F6@xmb-sjc-222.amer.cisco.com>
References: <08A9A3213527A6428774900A80DBD8D803A007F6@xmb-sjc-222.amer.cisco.com>
Message-ID: <460A8093.5090809@redhat.com>

Lin Shen (lshen) wrote:
> Hi Bob,
> 
> This is very good info. The CF we use has builtin wear-levelling, wonder
> if this will make running GFS on top of it more resonable. Of course, if
> GFS is like FAT in the way that file system structure is frequently
> re-written in-place and cannot be reallocated or moved after wear
> failure, then the builtin wear-levelling is not good enough. 

The Resource Group information is always in the same place, so the
bitmaps that indicate block usage will always be located at the same
logical blocks of the file system.

> Since we'll need to run a file system on the CF, and most likely one of
> either GFS, Reiser or FAT, we'd really like to compare the extra disk
> writes introduced by the various file systems. I know it's hard to get
> the exact numbers, but do you think GFS is far more worse in this
> regard. 

Not as far as media usage is concerned.  GFS and the cluster suite in
general introduce network traffic to coordinate the cluster locking,
but the writes to the media are very comparable.
 
> BTW, is there also a big difference when GFS is running in local and
> clustered mode?

Again, not as far as media usage is concerned.  When GFS is running
in clustered mode, the additional overhead is due to the locking
between the nodes, which doesn't touch the media.
 
> Last, does GFS issue any disk writes when it's idle, meaning no one is
> using the file system? Do the cluster and fencing stuff also introduce
> disk writes?

It depends on what you mean by "idle".  If your system is sitting idle, 
for the most part, no, it won't be doing anything to the disk.
It won't constantly be reading and writing areas of the media, except 
for routine background things found in many file systems.  

For example: The Linux page cache will take care 
of flushing pages to disk when it's necessary, and possibly "idle."  
Journal entries may be updated as cached pages are written to disk.  
Cached quota changes may be written.  Unlinked inodes are deallocated.
The fstat information (i.e. for "df") may be synched up occasionally.
But this is not a lot of overhead and they would have to happen anyway.

> Lin     

Regards,

Bob Peterson
Red Hat Cluster Suite



From rpeterso at redhat.com  Wed Mar 28 15:37:03 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Wed, 28 Mar 2007 10:37:03 -0500
Subject: [Linux-cluster] NFSv4 with ACL-Support on GFS
In-Reply-To: <45F7D417.8010008@forschungsgruppe.de>
References: <45F7D417.8010008@forschungsgruppe.de>
Message-ID: <460A8B9F.6010306@redhat.com>

Christian Brandes wrote:
> Dear readers,
> 
> I would like to set up an active/active cluster of redundant NFSv4 
> servers with ACLs that have the same GFS file system exported at the 
> same time.
> Is that supported?
> What versions are needed minimum:
>     Kernel
>     GFS-Tools
>     NFS-Tools ...
> 
> At the moment I have a two node cluster of test servers with a GFS 
> exported over NFSv4, but I can not get or set ACLs with getfacl or 
> setfacl, which is the same behavior I had with samba.
> 
> Some weeks ago I tried two SAMBA servers on GFS, though it was said not 
> to work in the Cluster FAQ.
> It seemed to work -- but no ACLs.
> 
> Do you have any ideas?
> 
> Best regards
>     Christian
Hi Christian,

I have a couple questions:

1. Are you mounting gfs on the nfs server with the "-o acl" option?
2. Can you do setfacl and getfacl from the gfs host (i.e. not through nfs)?
3. On what version of the cluster software and gfs are you seeing this?

Clustered samba is still being worked on.

Regards,

Bob Peterson
Red Hat Cluster Suite



From wferi at niif.hu  Wed Mar 28 15:48:10 2007
From: wferi at niif.hu (Wagner Ferenc)
Date: Wed, 28 Mar 2007 17:48:10 +0200
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <87648r6hdi.fsf@tac.ki.iif.hu> (Ferenc Wagner's message of "Fri, 
	23 Mar 2007 17:42:49 +0100")
References: <87648r6hdi.fsf@tac.ki.iif.hu>
Message-ID: <87ps6tl685.fsf@szonett.ki.iif.hu>

Ferenc Wagner <wferi at niif.hu> writes:

> There's a good bunch of RRDs in a directory.  A script scans them for
> their last modification times, and then updates each in turn for a
> couple of times.  The number of files scanned and the length of the
> update rounds are printed.  The results are much different for 500 and
> 501 files:
>
> filecount=501
>   iteration=0 elapsed time=10.425568 s
>   iteration=1 elapsed time= 9.766178 s
>   iteration=2 elapsed time=20.14514 s
>   iteration=3 elapsed time= 2.991397 s
>   iteration=4 elapsed time=20.496422 s
> total elapsed time=63.824705 s
>
> filecount=500
>   iteration=0 elapsed time=6.560811 s
>   iteration=1 elapsed time=0.229375 s
>   iteration=2 elapsed time=0.202973 s
>   iteration=3 elapsed time=0.203439 s
>   iteration=4 elapsed time=0.203095 s
> total elapsed time=7.399693 s

Following up to myself with one more data point: raising
SHRINK_CACHE_MAX from 1000 to 20000 in gfs/dlm/lock_dlm.h helps
significantly, but still isn't enough.  Besides, I don't know what I'm
doing.  Should I tweak the surrounding #defines, too?

I'd be grateful for any advice.
-- 
Thanks,
Feri.



From frederik.ferner at diamond.ac.uk  Wed Mar 28 15:54:19 2007
From: frederik.ferner at diamond.ac.uk (Frederik Ferner)
Date: Wed, 28 Mar 2007 16:54:19 +0100
Subject: [Linux-cluster] qdiskd, multiple master, updates?
Message-ID: <460A8FAB.5050805@diamond.ac.uk>

Hi,

I noticed that there seems to be a fix for my original problem with 
multiple masters for qdiskd in 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=220211.

I assume the fix is going to be in RHEL4U5, correct?
Could somebody give me an estimate when we can expect that to be available?

Lon: Do the test rpms at http://people.redhat.com/lhh/packages.html 
include these fixes? Which rpms do I need to test it? This time I might 
actually get a chance to test.

Many thanks,
Frederik

-- 
Frederik Ferner		
Linux Systems Administrator		phone: +44 1235 77 8624
Diamond Light Source Ltd.		mob:   +44 7917 08 5110



From teigland at redhat.com  Wed Mar 28 16:27:27 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 28 Mar 2007 11:27:27 -0500
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <87ps6tl685.fsf@szonett.ki.iif.hu>
References: <87648r6hdi.fsf@tac.ki.iif.hu> <87ps6tl685.fsf@szonett.ki.iif.hu>
Message-ID: <20070328162726.GF22230@redhat.com>

On Wed, Mar 28, 2007 at 05:48:10PM +0200, Wagner Ferenc wrote:
> Ferenc Wagner <wferi at niif.hu> writes:
> 
> > There's a good bunch of RRDs in a directory.  A script scans them for
> > their last modification times, and then updates each in turn for a
> > couple of times.  The number of files scanned and the length of the
> > update rounds are printed.  The results are much different for 500 and
> > 501 files:
> >
> > filecount=501
> >   iteration=0 elapsed time=10.425568 s
> >   iteration=1 elapsed time= 9.766178 s
> >   iteration=2 elapsed time=20.14514 s
> >   iteration=3 elapsed time= 2.991397 s
> >   iteration=4 elapsed time=20.496422 s
> > total elapsed time=63.824705 s
> >
> > filecount=500
> >   iteration=0 elapsed time=6.560811 s
> >   iteration=1 elapsed time=0.229375 s
> >   iteration=2 elapsed time=0.202973 s
> >   iteration=3 elapsed time=0.203439 s
> >   iteration=4 elapsed time=0.203095 s
> > total elapsed time=7.399693 s
> 
> Following up to myself with one more data point: raising
> SHRINK_CACHE_MAX from 1000 to 20000 in gfs/dlm/lock_dlm.h helps
> significantly, but still isn't enough.  Besides, I don't know what I'm
> doing.  Should I tweak the surrounding #defines, too?

SHRINK_CACHE_MAX is related to fcntl posix locks, did you intend to change
the app to use flock (which is much faster that fcntl)?

Dave



From teigland at redhat.com  Wed Mar 28 16:38:50 2007
From: teigland at redhat.com (David Teigland)
Date: Wed, 28 Mar 2007 11:38:50 -0500
Subject: [Linux-cluster] Slowness above 500 RRDs
In-Reply-To: <20070328162726.GF22230@redhat.com>
References: <87648r6hdi.fsf@tac.ki.iif.hu> <87ps6tl685.fsf@szonett.ki.iif.hu>
	<20070328162726.GF22230@redhat.com>
Message-ID: <20070328163850.GG22230@redhat.com>

On Wed, Mar 28, 2007 at 11:27:27AM -0500, David Teigland wrote:
> On Wed, Mar 28, 2007 at 05:48:10PM +0200, Wagner Ferenc wrote:
> > Ferenc Wagner <wferi at niif.hu> writes:
> > 
> > > There's a good bunch of RRDs in a directory.  A script scans them for
> > > their last modification times, and then updates each in turn for a
> > > couple of times.  The number of files scanned and the length of the
> > > update rounds are printed.  The results are much different for 500 and
> > > 501 files:
> > >
> > > filecount=501
> > >   iteration=0 elapsed time=10.425568 s
> > >   iteration=1 elapsed time= 9.766178 s
> > >   iteration=2 elapsed time=20.14514 s
> > >   iteration=3 elapsed time= 2.991397 s
> > >   iteration=4 elapsed time=20.496422 s
> > > total elapsed time=63.824705 s
> > >
> > > filecount=500
> > >   iteration=0 elapsed time=6.560811 s
> > >   iteration=1 elapsed time=0.229375 s
> > >   iteration=2 elapsed time=0.202973 s
> > >   iteration=3 elapsed time=0.203439 s
> > >   iteration=4 elapsed time=0.203095 s
> > > total elapsed time=7.399693 s
> > 
> > Following up to myself with one more data point: raising
> > SHRINK_CACHE_MAX from 1000 to 20000 in gfs/dlm/lock_dlm.h helps

> SHRINK_CACHE_MAX is related to fcntl posix locks, did you intend to change
> the app to use flock (which is much faster that fcntl)?

> > significantly, but still isn't enough.  Besides, I don't know what I'm
> > doing.  Should I tweak the surrounding #defines, too?

lock_dlm caches dlm locks for old plocks for a while in an attempt to
improve performance and reduce thrashing the dlm -- SHRINK_CACHE_MAX is
the max level of caching, it's fine to change it as you've done.  The fact
that you're hitting it, though, indicates that your app is using plocks
more heavily than gfs/dlm are suited to handle.  Switching to flock will
obviate all of this.  (Or switching to the new dlm and cluster
infrastructure which has a completely different and far better approach to
plocks).

Dave



From zepp157 at yahoo.com  Wed Mar 28 19:38:24 2007
From: zepp157 at yahoo.com (Jon S)
Date: Wed, 28 Mar 2007 12:38:24 -0700 (PDT)
Subject: [Linux-cluster] LVM integration with Resource Group Manager
Message-ID: <20070328193825.93749.qmail@web63506.mail.re1.yahoo.com>

I'd like to have rgmanager manage the logical volumes of LVM abstracted SAN storage as resources of a service.  Congo allows for shared raw devices and LVM hosted GFS filesystems but neither are as good a fit as LVM abstracted shared SAN devices vgimported on one node at a time.  

I'm hoping this is implemented and I just need to FRFM (find the right fine manual).  If not, is it being considered or is it just a bad idea?

btw, I'm running RHEL5 release.

Thanks in advance




 
____________________________________________________________________________________
Don't pick lemons.
See all the new 2007 cars at Yahoo! Autos.
http://autos.yahoo.com/new_cars.html 



From vaden at texoma.net  Thu Mar 29 05:20:11 2007
From: vaden at texoma.net (Larry Vaden)
Date: Thu, 29 Mar 2007 00:20:11 -0500
Subject: [Linux-cluster] Using GFS on Compact Flash
In-Reply-To: <08A9A3213527A6428774900A80DBD8D803A006A2@xmb-sjc-222.amer.cisco.com>
References: <7F6B06837A5DBD49AC6E1650EFF5490601223274@auk52177.ukr.astrium.corp>
	<08A9A3213527A6428774900A80DBD8D803A006A2@xmb-sjc-222.amer.cisco.com>
Message-ID: <c41c77a90703282220r7ec4e57fjd1ff4d2954566a79@mail.gmail.com>

On 3/13/07, Lin Shen (lshen) <lshen at cisco.com> wrote:
>
> Have you done any measurements in regard to the amount of extra
> read/write introduced by GFS+cluster suite+fencing+journaling etc?
> And/or how much CF lifetime reduction is that translated to?

SanDisk and others are now suggesting to investors that the new line
of SDDs which will be here by summer and cost < $600 have a 2,000,000
hour lifespan:

<http://investor.sandisk.com/phoenix.zhtml?c=86495&p=irol-newsArticle_Print&ID=946487&highlight=>

best regards/ldv



From htfrontier at gmail.com  Thu Mar 29 06:19:34 2007
From: htfrontier at gmail.com (Hanny Tidore)
Date: Thu, 29 Mar 2007 14:19:34 +0800
Subject: [Linux-cluster] RHCS without shared Quorum disk
Message-ID: <2fa0bfca0703282319q2e3e34behd6a3360afacb5a3f@mail.gmail.com>

Hi,

Is it possible to setup HA using Redhat Cluster Suite with only 2 servers,
without a shared disk. Which version of RHCS support this.

To my knowledge, RHCS requires a quorum disk (a shared disk). This is an
expensive solution (we have implemented using RHCS version 3). Hence we are
looking at ways to have HA without the shared disk.

Thanks,
Hanny
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070329/8146f8fe/attachment.htm>

From grimme at atix.de  Thu Mar 29 07:14:34 2007
From: grimme at atix.de (Marc Grimme)
Date: Thu, 29 Mar 2007 09:14:34 +0200
Subject: [Linux-cluster] NFSv4 with ACL-Support on GFS
In-Reply-To: <460A8B9F.6010306@redhat.com>
References: <45F7D417.8010008@forschungsgruppe.de>
	<460A8B9F.6010306@redhat.com>
Message-ID: <200703290914.35042.grimme@atix.de>

On Wednesday 28 March 2007 17:37:03 Robert Peterson wrote:
> Christian Brandes wrote:
> > Dear readers,
> >
> > I would like to set up an active/active cluster of redundant NFSv4
> > servers with ACLs that have the same GFS file system exported at the
> > same time.
> > Is that supported?
> > What versions are needed minimum:
> >     Kernel
> >     GFS-Tools
> >     NFS-Tools ...
> >
> > At the moment I have a two node cluster of test servers with a GFS
> > exported over NFSv4, but I can not get or set ACLs with getfacl or
> > setfacl, which is the same behavior I had with samba.
> >
> > Some weeks ago I tried two SAMBA servers on GFS, though it was said not
> > to work in the Cluster FAQ.
> > It seemed to work -- but no ACLs.
> >
> > Do you have any ideas?
> >
> > Best regards
> >     Christian
>
> Hi Christian,
>
> I have a couple questions:
>
> 1. Are you mounting gfs on the nfs server with the "-o acl" option?
> 2. Can you do setfacl and getfacl from the gfs host (i.e. not through nfs)?
> 3. On what version of the cluster software and gfs are you seeing this?
>
> Clustered samba is still being worked on.
Hi Christian,
Yes but samba - without having "shared shares" in a multiple writer 
configuration - works since the beginning even with acl. You need the -o acl 
and everything should be fine.

You can build up a static active/active samba configuration where server a 
serves share a and b serves share b .. and all shares lying in different 
directories of the same GFS FS.  Don't forget a stable and persistent 
id-mapping with Windows and Unix users. You only need some "tweaks" if using 
everything on GFS (i.e. /etc/samba and /var/cache/samba and the like).

Regards 
Marc.
>
> Regards,
>
> Bob Peterson
> Red Hat Cluster Suite
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/               http://www.open-sharedroot.org/

**
ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

Registergericht: Amtsgericht M?nchen
Registernummer: HRB 131682
USt.-Id.: DE209485962

Gesch?ftsf?hrung: Marc Grimme, Mark Hlawatschek, Thomas Merz




From J.vandenHorn at xb.nl  Thu Mar 29 08:04:00 2007
From: J.vandenHorn at xb.nl (Jeroen van den Horn)
Date: Thu, 29 Mar 2007 10:04:00 +0200
Subject: [Linux-cluster] Fenced node never reboots properly
Message-ID: <460B72F0.9020200@xb.nl>

All,

We are running some virtual machines on top of ESX that are giving us 
some performance problems but also provide us with a good test-case for 
fencing operations.

Due to some (as of yet unknown) problem the two nodes in the GFS cluster 
do not properly respond to the heartbeats, so node 1 kicks node 2 out of 
the cluster. Node 2 correctly reports in its syslog that it has been 
asked to leave the cluster, node 1 fences node 2 and node 2 initiates a 
shutdown - so far so good.

However during shutdown node 2 executes /etc/rc6.d/S31umountnfs (it's a 
Debian system) which also attempts to unmount the GFS disk - result: 
kernel OOPS. The system continues shutdown until it says 'Will now 
restart.' but that's the end of it. I've tried setting the 
/proc/sys/kernel/panic and added 'panic=5' to the kernel boot options 
but to no avail.

I'm really at a loss here - does anybody have any suggestions on how to 
solve this problem?

Regards,
Jeroen



From fedele at fis.unical.it  Thu Mar 29 08:29:48 2007
From: fedele at fis.unical.it (Fedele Stabile)
Date: Thu, 29 Mar 2007 10:29:48 +0200
Subject: [Linux-cluster] General question about cluster
Message-ID: <460B78FC.2010708@fis.unical.it>

Dear readers,

i have this equipment:
- 33 PC for student use
-  2 servers and 1 SAN connected via SCSI to the servers
only SAN and servers are powered by UPS

The whole equipment is in a private network, only servers are also connected a public net.

I would export disks on SAN as a common filesystem for all computers.

I'm thinking it would be a good idea to use the cluster suite.

What is your opinion?

I installed CentOS+ClusterSuite and configured cluster to have sufficient votes for servers (30 votes each)
to avoid quorum problems in case of occasional student-PC switching-off.
SAN disks were gfs filesystems and exported via GNBD

Cluster is working fine even if i power-off some of student-PC,
but if i power-off all 33 student-PC the two servers go in hang,

remember that
every student-PS has 1 vote
every server has 30 votes
(i can verify with cman_tool nodes)

so i have (as i can see with cman_tool status):

Nodes: 35
Total_votes: 93
Quorum: 47

also if all nodes are booting simultaneously they hang waiting indefinitely for the clvmd start

Can you help me to find the mistake?


Fedele STABILE



From lists at brimer.org  Thu Mar 29 12:11:28 2007
From: lists at brimer.org (Barry Brimer)
Date: Thu, 29 Mar 2007 07:11:28 -0500 (CDT)
Subject: [Linux-cluster] RHCS without shared Quorum disk
In-Reply-To: <2fa0bfca0703282319q2e3e34behd6a3360afacb5a3f@mail.gmail.com>
References: <2fa0bfca0703282319q2e3e34behd6a3360afacb5a3f@mail.gmail.com>
Message-ID: <Pine.LNX.4.61.0703290709480.12423@localhost.localdomain>



On Thu, 29 Mar 2007, Hanny Tidore wrote:

> Hi,
>
> Is it possible to setup HA using Redhat Cluster Suite with only 2 servers,
> without a shared disk. Which version of RHCS support this.
>
> To my knowledge, RHCS requires a quorum disk (a shared disk). This is an
> expensive solution (we have implemented using RHCS version 3). Hence we are
> looking at ways to have HA without the shared disk.

You can use RHCS without a quorum disk in the current version of RHCS for 
RHEL 4.  I do not believe this functionality made it into the RHEL 3 RHCS.

Barry



From rpeterso at redhat.com  Thu Mar 29 14:12:14 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Thu, 29 Mar 2007 09:12:14 -0500
Subject: [Linux-cluster] General question about cluster
In-Reply-To: <460B78FC.2010708@fis.unical.it>
References: <460B78FC.2010708@fis.unical.it>
Message-ID: <460BC93E.5010302@redhat.com>

Fedele Stabile wrote:
> Dear readers,
> 
> i have this equipment:
> - 33 PC for student use
> -  2 servers and 1 SAN connected via SCSI to the servers
> only SAN and servers are powered by UPS
> 
> The whole equipment is in a private network, only servers are also 
> connected a public net.
> 
> I would export disks on SAN as a common filesystem for all computers.
> 
> I'm thinking it would be a good idea to use the cluster suite.
> 
> What is your opinion?
> 
> I installed CentOS+ClusterSuite and configured cluster to have 
> sufficient votes for servers (30 votes each)
> to avoid quorum problems in case of occasional student-PC switching-off.
> SAN disks were gfs filesystems and exported via GNBD
> 
> Cluster is working fine even if i power-off some of student-PC,
> but if i power-off all 33 student-PC the two servers go in hang,
> 
> remember that
> every student-PS has 1 vote
> every server has 30 votes
> (i can verify with cman_tool nodes)
> 
> so i have (as i can see with cman_tool status):
> 
> Nodes: 35
> Total_votes: 93
> Quorum: 47
> 
> also if all nodes are booting simultaneously they hang waiting 
> indefinitely for the clvmd start
> 
> Can you help me to find the mistake?
> 
> 
> Fedele STABILE
Hi Fedele,

We could probably give you more help if you told us your goal, what
you are trying to achieve with your configuration.  It would also be
nice to know what software you are running for Cluster Suite,
whether it's the "old" (RHEL4, STABLE) infrastructure or the "new"
(RHEL5, HEAD) infrastructure.

If your SAN is only connected to the two servers, then you only
need to cluster those two and use GFS to get the servers to cooperate 
with the file system on the SAN.  Then perhaps you can use something
like nfs to export the SAN data in GFS to the 33 student PCs.

If you're clustering all the student PCs for a specific purpose,
then perhaps you could share that.
Using the student PCs in the cluster may have some disadvantages
as you observed: At least 18 of the nodes (16 student PCs plus the
two servers) need to be up to maintain quorum.

Regards,

Bob Peterson
Red Hat Cluster Suite



From pcaulfie at redhat.com  Thu Mar 29 14:37:05 2007
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 29 Mar 2007 15:37:05 +0100
Subject: [Linux-cluster] General question about cluster
In-Reply-To: <460B78FC.2010708@fis.unical.it>
References: <460B78FC.2010708@fis.unical.it>
Message-ID: <460BCF11.1090509@redhat.com>

Fedele Stabile wrote:
> Dear readers,
> 
> i have this equipment:
> - 33 PC for student use
> -  2 servers and 1 SAN connected via SCSI to the servers
> only SAN and servers are powered by UPS
> 
> The whole equipment is in a private network, only servers are also
> connected a public net.
> 
> I would export disks on SAN as a common filesystem for all computers.
> 
> I'm thinking it would be a good idea to use the cluster suite.
> 
> What is your opinion?
> 
> I installed CentOS+ClusterSuite and configured cluster to have
> sufficient votes for servers (30 votes each)
> to avoid quorum problems in case of occasional student-PC switching-off.
> SAN disks were gfs filesystems and exported via GNBD
> 
> Cluster is working fine even if i power-off some of student-PC,
> but if i power-off all 33 student-PC the two servers go in hang,
>

I don't know why this should be, off-hand, but I would investigate fencing first
off. It might be that there is a huge rush to fence all the nodes together. Or
are you saying that the cluster doesn't keep quorum though it should ? Some more
information would help helpful here. such as syslog outputs.

As Bob mentions, it might be best to have the student machines access the files
over NFS rather than GFS for other reasons.

> remember that
> every student-PS has 1 vote
> every server has 30 votes
> (i can verify with cman_tool nodes)
> 
> so i have (as i can see with cman_tool status):
> 
> Nodes: 35
> Total_votes: 93
> Quorum: 47
> 
> also if all nodes are booting simultaneously they hang waiting
> indefinitely for the clvmd start

Annoyingly this doesn't surprise me. We've had a few reports of cluster with >32
nodes behaving oddly when they are all started up together. Unfortunately we
haven't been able to reproduce this in our labs, so the current advice is "don't
do that"!

-- 

patrick



From Michael.Hagmann at hilti.com  Thu Mar 29 15:36:54 2007
From: Michael.Hagmann at hilti.com (Hagmann, Michael)
Date: Thu, 29 Mar 2007 17:36:54 +0200
Subject: [Linux-cluster] rgmanager-1.9.65-0.x86_64.rpm
In-Reply-To: <460B78FC.2010708@fis.unical.it>
References: <460B78FC.2010708@fis.unical.it>
Message-ID: <9C203D6FD2BF9D49BFF3450201DEDA53014AC6C8@LI-OWL.hag.hilti.com>

Hi all

is anyone here that can give me a list of bugzillas that are fixed in
rgmanager-1.9.65-0.x86_64.rpm, because there no Infos in the changelog.

I d'like to know if all fixes from "rgmanager-1.9.54-2.218112hf" are
included in rgmanager-1.9.65-0.x86_64.rpm. Because we have here serious
Problems with the rgmanger on the Production 6-Node Cluster of our
Intranet.

thx mike

Michael Hagmann
UNIX Systems Engineering
Enterprise Systems Technology

Hilti Corporation
9494 Schaan  Liechtenstein

Department FIBS
Feldkircherstrasse 100   P.O.Box 333
P +423-234 2467  F +423-234 6467
E michael.hagmann at hilti.com
www.hilti.com




From rpeterso at redhat.com  Thu Mar 29 15:46:47 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Thu, 29 Mar 2007 10:46:47 -0500
Subject: [Linux-cluster] Filesystem Freeze
In-Reply-To: <992633B6A0E42B49BC5A41C10A8C841B05905561@MUCEX004.root.local>
References: <992633B6A0E42B49BC5A41C10A8C841B05905561@MUCEX004.root.local>
Message-ID: <460BDF67.5080701@redhat.com>

R?thlein Michael (RI-Solution) wrote:
> Hello,
>  
> We use GFS for about 1 year now on a 4 node cluster connected to a SAN. In the past weeks nearly every day we had troubles desribed here: http://www.open-sharedroot.org/faq/troubleshooting-guide/system-failures/the-whole-cluster-freezes/ as "Freeze of filesystem"
>> There you will not see anything in the syslogs. The only thing you will see is rapidly increasing load on the system an 
>> rapidly increasing amounts of processes in the D State (means waiting for I/O). Here a lock is hung and any process 
>> accessing the resource (normally directory) locked by this lock will end up in the processstate D. 
>  
> Is there any sure way to determine which node causes the lock or any fix for this issue? 
> We use this cluster as web server running apache 2.
> 
> Thanks in advance
> 
> Yours
> 
> Michael
Hi Michael,

Maybe your best bet is to save off a sysrq-t ("Magic Sysrq") output from all nodes 
in the cluster as soon as you notice the hang.  Then open a bugzilla and attach 
all of these outputs to the bz, along with version information, etc.

Regards,

Bob Peterson
Red Hat Cluster Suite



From lhh at redhat.com  Thu Mar 29 18:22:46 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Mar 2007 14:22:46 -0400
Subject: [Linux-cluster] Why no node-specific <quorumd>?
In-Reply-To: <200703261916.l2QJGJh19329@xos037.xos.nl>
References: <200703261916.l2QJGJh19329@xos037.xos.nl>
Message-ID: <20070329182246.GA15242@redhat.com>

On Mon, Mar 26, 2007 at 09:16:18PM +0200, Jos Vos wrote:
> Hi,
> 
> I was wondering why <quorumd> is a cluster-global config item.  This
> means it *has* to be the same for all nodes.  Is there a good reason
> for this?

Primarily, it is designed to solve two (three?) use cases:

(a) two node cluster needing a tiebreaker, and
    a.1.  standard "ip tiebreaker" for example, or
    a.2.  "master is the service owner" case in a network partition,
          for cases where network connectivity to the service is
          not important compared to the service staying online

(b) 4-node cluster, that has a special requirement to be able to go
down to (any) 1 node in a single bound, i.e. Oracle RAC on top of GFS


> Of course it is possible to make node-specific distinctions in the
> heuristic scripts that are executed.  But at least the votes associated
> with each node now can't be different.

Sure, like:

   <heuristic script="/tmp/foo"/>

Where /tmp/foo contains whatever operation(s) you want on each node.

As for the votes, I think you are mistaken about how they're used.  They
do not "add votes" to the node's reported votes.  Qdiskd adds to each
specific node's total perceived votes.

Could you provide an example use case which requires disjoint per-node
qdisk votes, and explain why it can not be solved currently?  I guess I
just don't fully understand what this would solve at the moment.

Weighted votes + weighted qdisk votes gets awful complex pretty darn
quick.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Mar 29 18:34:34 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Mar 2007 14:34:34 -0400
Subject: [Linux-cluster] Why no node-specific <quorumd>?
In-Reply-To: <20070326232539.A19333@xos037.xos.nl>
References: <200703261916.l2QJGJh19329@xos037.xos.nl>
	<1174942804.9784.5.camel@shih.broked.org>
	<20070326232539.A19333@xos037.xos.nl>
Message-ID: <20070329183434.GB15242@redhat.com>

On Mon, Mar 26, 2007 at 11:25:39PM +0200, Jos Vos wrote:
> On Mon, Mar 26, 2007 at 02:00:04PM -0700, Steven Dake wrote:
> 
> > I'm not sure why you would want to select one view of a primary
> > component (aka quorum membership) with one node, and another primary
> > component with another node.  Then as the nodes make decisions, they
> > would be inconsistent (ie: consider GFS one node would think writes were
> > ok, while another may not...  shouldn't they agree?).
> 
> Maybe one step back to the CMAN algorithm:
> 
> I have been looking for a comprehensive summary of the algorithm on
> how CMAN determines node failures (heartbeat *AND* quorum disk should
> show the node being "up"?) and how votes are calculated.  But I could
> not find it.
> 
> I'm afraid it is not clear what the role the <quorumd> votes play:
> are they only calculated on each node or are the votes stored on the
> quorum disk and read by other nodes?

CMAN's quorum device votes are calculated on each node.  However, when
tallied for a total quorum count, they are only counted once.

So, if we have N instances of qdiskd running advertising 1 vote to its
parent CMAN = 1 vote, cluster-wide - even though every node is keeping
track of the quorum device.

Ex: If you have a 4 node + 4-vote qdisk setup:

   When all nodes are online:
      node1: I see 3 others + qdisk = 8
      node2: I see 3 others + qdisk = 8
      node3: I see 3 others + qdisk = 8
      node4: I see 3 others + qdisk = 8

  In a 3:1 split, with properly configured heuristics (where the 1
  node must continue operations):
      node1: I see 0 others + qdisk = 5
      node2: I see 2 others + NO qdisk = 3
      node3: I see 2 others + NO qdisk = 3
      node4: I see 2 others + NO qdisk = 3

  = node1 wins

  In a 3:1 split, with properly configured heuristics (where the 1
  node was partitioned off):
      node1: I see 1 others + NO qdisk = 2
      node2: I see 3 others + qdisk = 6
      node3: I see 3 others + qdisk = 6
      node4: I see 3 others + qdisk = 6

  = nodes 2/3/4 win

When the heuristics fail, qdiskd tells CMAN that qdisk votes are no
longer available, and it advertises over the disk that it's no longer
fit for participation in the cluster.

Normal fencing rules apply (winner partition must fence the dead
partition in a network partition)

Hope this helps.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Mar 29 18:36:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Mar 2007 14:36:51 -0400
Subject: [Linux-cluster] RHCS without shared Quorum disk
In-Reply-To: <2fa0bfca0703282319q2e3e34behd6a3360afacb5a3f@mail.gmail.com>
References: <2fa0bfca0703282319q2e3e34behd6a3360afacb5a3f@mail.gmail.com>
Message-ID: <20070329183651.GC15242@redhat.com>

On Thu, Mar 29, 2007 at 02:19:34PM +0800, Hanny Tidore wrote:
> Hi,
> 
> Is it possible to setup HA using Redhat Cluster Suite with only 2 servers,
> without a shared disk. Which version of RHCS support this.
> 
> To my knowledge, RHCS requires a quorum disk (a shared disk). This is an
> expensive solution (we have implemented using RHCS version 3). Hence we are
> looking at ways to have HA without the shared disk.

On RHCS3, there is no documented method of doing this.  On RHCS4, you
can do it very easily.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Thu Mar 29 18:39:29 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Mar 2007 14:39:29 -0400
Subject: [Linux-cluster] Fenced node never reboots properly
In-Reply-To: <460B72F0.9020200@xb.nl>
References: <460B72F0.9020200@xb.nl>
Message-ID: <20070329183929.GD15242@redhat.com>

On Thu, Mar 29, 2007 at 10:04:00AM +0200, Jeroen van den Horn wrote:
> However during shutdown node 2 executes /etc/rc6.d/S31umountnfs (it's a 
> Debian system) which also attempts to unmount the GFS disk - result: 
> kernel OOPS. The system continues shutdown until it says 'Will now 
> restart.' but that's the end of it. I've tried setting the 
> /proc/sys/kernel/panic and added 'panic=5' to the kernel boot options 
> but to no avail.
> 
> I'm really at a loss here - does anybody have any suggestions on how to 
> solve this problem?

Yes, it's supposed to be killed (immediately) when fenced, not
gracefully attempting to shut down.  What fencing agent are you using?
It sounds like there's a bug.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From c_triantafillou at hotmail.com  Thu Mar 29 18:52:18 2007
From: c_triantafillou at hotmail.com (Christos Triantafillou)
Date: Thu, 29 Mar 2007 20:52:18 +0200
Subject: [Linux-cluster] GFS locks recovery
Message-ID: <BAY123-F1157F882D9E4A0D12477F2936C0@phx.gbl>

Hello,

I am using a RHEL4 cluster with 2 nodes, GFS and fencing.

As a test, I started a process on node2 that got an fcntl() exclusive lock 
on a GFS file
and then a process on node1 that started the same program waiting for an 
exclusive
lock on the same GFS file.
Node2 was then switched off and rebooted.

What I observed was that node1 did not acquire the lock immediately after 
the switch-off  but only when node2 finished rebooting.

A few questions:
1. when a node goes down, shouldn't all its GFS locks be (almost) 
immediately released as part of the fencing proces or the GFS recovery on 
the other nodes?

2. during the lock wait, it was impossible to interrupt/kill the process on 
node1.
Is it possible to interrupt a process waiting on a POSIX lock?

3. if the previous are not possible, would it be preferable to use POSIX 
locks on an NFS file instead?
Or would you recommend using DLM?

Thanks in advance for your answers,
Christos Triantafillou

_________________________________________________________________
Don't just search. Find. Check out the new MSN Search! 
http://search.msn.com/



From lhh at redhat.com  Thu Mar 29 18:56:19 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 29 Mar 2007 14:56:19 -0400
Subject: [Linux-cluster] rgmanager-1.9.65-0.x86_64.rpm
In-Reply-To: <9C203D6FD2BF9D49BFF3450201DEDA53014AC6C8@LI-OWL.hag.hilti.com>
References: <460B78FC.2010708@fis.unical.it>
	<9C203D6FD2BF9D49BFF3450201DEDA53014AC6C8@LI-OWL.hag.hilti.com>
Message-ID: <20070329185619.GE15242@redhat.com>

On Thu, Mar 29, 2007 at 05:36:54PM +0200, Hagmann, Michael wrote:
> Hi all
> 
> is anyone here that can give me a list of bugzillas that are fixed in
> rgmanager-1.9.65-0.x86_64.rpm, because there no Infos in the changelog.
> 
> I d'like to know if all fixes from "rgmanager-1.9.54-2.218112hf" are
> included in rgmanager-1.9.65-0.x86_64.rpm. Because we have here serious
> Problems with the rgmanger on the Production 6-Node Cluster of our
> Intranet.
> 

Note that 212634 is the same as 218112

https://bugzilla.redhat.com/bugzilla/buglist.cgi?product=Red+Hat+Cluster+Suite&version=4&component=rgmanager&bug_status=MODIFIED&bug_status=PASSES_QA&bug_status=ON_QA&bug_status=RELEASE_PENDING

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From teigland at redhat.com  Thu Mar 29 19:31:15 2007
From: teigland at redhat.com (David Teigland)
Date: Thu, 29 Mar 2007 14:31:15 -0500
Subject: [Linux-cluster] GFS locks recovery
In-Reply-To: <BAY123-F1157F882D9E4A0D12477F2936C0@phx.gbl>
References: <BAY123-F1157F882D9E4A0D12477F2936C0@phx.gbl>
Message-ID: <20070329193115.GD21346@redhat.com>

On Thu, Mar 29, 2007 at 08:52:18PM +0200, Christos Triantafillou wrote:
> Hello,
> 
> I am using a RHEL4 cluster with 2 nodes, GFS and fencing.
> 
> As a test, I started a process on node2 that got an fcntl() exclusive lock 
> on a GFS file
> and then a process on node1 that started the same program waiting for an 
> exclusive
> lock on the same GFS file.
> Node2 was then switched off and rebooted.
> 
> What I observed was that node1 did not acquire the lock immediately after 
> the switch-off  but only when node2 finished rebooting.
> 
> A few questions:
> 1. when a node goes down, shouldn't all its GFS locks be (almost) 
> immediately released as part of the fencing proces or the GFS recovery on 
> the other nodes?

Yes.  Did the remaining node have quorum when you killed the other?  If
not, then you should set two_node=1 in cluster.conf so it will.  Fencing,
dlm recovery and gfs recovery won't happen unless there's quorum; after
this recovery, the locks you want should be granted (regardless of whether
the other node has rebooted or not).

> 2. during the lock wait, it was impossible to interrupt/kill the process on 
> node1.  Is it possible to interrupt a process waiting on a POSIX lock?

no

> 3. if the previous are not possible, would it be preferable to use POSIX 
> locks on an NFS file instead?
> Or would you recommend using DLM?

Either, possibly; you'd have to try it out.  GFS works much better with
flock (although that's not interruptible either), if that's an option.

Dave



From isplist at logicore.net  Thu Mar 29 22:20:59 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Thu, 29 Mar 2007 16:20:59 -0600
Subject: [Linux-cluster] Clearing Dead Volumes
Message-ID: <2007329162059.511587@leena>

A while back, my storage was changed without the nodes getting a chance to be 
disconnected from the storage. 

At this point, firing up the nodes has all of them seeing a lot of garbage 
data in place of what used to be their storage volumes. What is the best way 
of clearing this old data so that I can re-create the proper storage devices?

Mike





From c_triantafillou at hotmail.com  Thu Mar 29 22:02:23 2007
From: c_triantafillou at hotmail.com (Christos Triantafillou)
Date: Fri, 30 Mar 2007 00:02:23 +0200
Subject: [Linux-cluster] GFS locks recovery
In-Reply-To: <20070329193115.GD21346@redhat.com>
Message-ID: <BAY123-F17E92A335BBACCEBE4DA22936C0@phx.gbl>

>Yes.  Did the remaining node have quorum when you killed the other?  If
>not, then you should set two_node=1 in cluster.conf so it will.  Fencing,
>dlm recovery and gfs recovery won't happen unless there's quorum; after
>this recovery, the locks you want should be granted (regardless of whether
>the other node has rebooted or not).

Is there a way to see whether a node has the quorum at any given time?
Or whether GFS recovery has taken place?

>Either, possibly; you'd have to try it out.  GFS works much better with
>flock (although that's not interruptible either), if that's an option.

How flock() works better than fcntl() on GFS?

Thanks,
Christos

_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today it's FREE! 
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/



From J.vandenHorn at xb.nl  Fri Mar 30 05:45:18 2007
From: J.vandenHorn at xb.nl (Jeroen van den Horn)
Date: Fri, 30 Mar 2007 07:45:18 +0200
Subject: [Linux-cluster] Fenced node never reboots properly
In-Reply-To: <20070329183929.GD15242@redhat.com>
References: <460B72F0.9020200@xb.nl> <20070329183929.GD15242@redhat.com>
Message-ID: <460CA3EE.6070102@xb.nl>

I'm using fence_vmware which I downloaded from some CVS repository. Good 
to hear that that is the issue - I'll take a look at the source and see 
whether the VMWare API support some sort of 'hard reset'.

Jeroen

Lon Hohberger wrote:
> On Thu, Mar 29, 2007 at 10:04:00AM +0200, Jeroen van den Horn wrote:
>   
>> However during shutdown node 2 executes /etc/rc6.d/S31umountnfs (it's a 
>> Debian system) which also attempts to unmount the GFS disk - result: 
>> kernel OOPS. The system continues shutdown until it says 'Will now 
>> restart.' but that's the end of it. I've tried setting the 
>> /proc/sys/kernel/panic and added 'panic=5' to the kernel boot options 
>> but to no avail.
>>
>> I'm really at a loss here - does anybody have any suggestions on how to 
>> solve this problem?
>>     
>
> Yes, it's supposed to be killed (immediately) when fenced, not
> gracefully attempting to shut down.  What fencing agent are you using?
> It sounds like there's a bug.
>
> -- Lon
>
>   

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070330/719751ca/attachment.htm>

From mnapolis at redhat.com  Fri Mar 30 06:12:09 2007
From: mnapolis at redhat.com (Isauro Michael Napolis)
Date: Fri, 30 Mar 2007 16:12:09 +1000
Subject: [Linux-cluster] RHEL4 U4's qdiskd without disk or partition
Message-ID: <1175235128.11799.123.camel@localhost.localdomain>

hi all,

i have a question regarding using ping for tie
breaker in a redundant NIC with a crossover for the qdisk daemon WITHOUT
the use of a disk/ partition.  Has anybody made this work in RHEL4 U4?

I'm still testing it and here's my very basic setup using
system-config-cluster:

<?xml version="1.0" ?>
<cluster alias="new_cluster" config_version="2" name="new_cluster">
        <quorumd interval="1" label="stewy" min_score="3" tko="10"
votes="3">
                <heuristic interval="2" program="ping -c 1 192.168.5.1"
score="1"/>
        </quorumd>
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="cluster1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="manual_fence"
nodename="cluster1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="cluster2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="manual_fence"
nodename="cluster2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_manual" name="manual_fence"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="test_domain" ordered="0"
restricted="0">
                                <failoverdomainnode name="cluster1"
priority="1"/>
                                <failoverdomainnode name="cluster2"
priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="172.16.189.250" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="test_domain" name="test"
recovery="relocate">
                        <ip ref="172.16.189.250"/>
                </service>
        </rm>
</cluster>

I can start ccsd, cman, fenced and rgmanager but qdiskd fails with qdisk
label 'stewy' not matching with any device.

Basically, I'm trying to do what's mentioned in the 'Note' of the FAQ16
where in we can use qdisk without a disk/ partition:

 http://sources.redhat.com/cluster/faq.html#quorumdiskhow

Any ideas?

Thanks,
Michael



From fedele at fis.unical.it  Fri Mar 30 09:17:55 2007
From: fedele at fis.unical.it (Fedele Stabile)
Date: Fri, 30 Mar 2007 11:17:55 +0200
Subject: [Linux-cluster] General question about cluster
In-Reply-To: <460BCF11.1090509@redhat.com>
References: <460B78FC.2010708@fis.unical.it> <460BCF11.1090509@redhat.com>
Message-ID: <460CD5C3.8020502@fis.unical.it>

Thank you for your suggestions.

I'm using CentOS 4.4 and the purpose of the cluster is sharing of user filesystem.

I know that it's possible via NFS but i would experiment the cluster-suite.

If i don't user CS on student-PC can i use GFS instead NFS to export SAN filesystems to student-PC?

Thank you
Fedele STABILE



From dan.deshayes at algitech.com  Fri Mar 30 12:28:35 2007
From: dan.deshayes at algitech.com (Dan Deshayes)
Date: Fri, 30 Mar 2007 14:28:35 +0200
Subject: [Linux-cluster] dlm_recvd and dlm_sendd causes high load.
Message-ID: <460D0273.2090002@algitech.com>

Hello everyone,

I've setup a high availibility cluster but have some problems with 
dlm_recvd and dlm_sendd processes.
Everything seems to be running smooth except these processes that takes 
half of the cpu power each.

I'm running a 2-node cluster using Centos Beta5 with own compiled 
2.6.20.4 kernel using a shared firewiredisc for
the shared storage with GFS2 lock_dlm.

When i start the clvmd-service the messagelog says:

clustertest-2 kernel: dlm: clvmd: recover 1
clustertest-2 kernel: dlm: clvmd: add member 1
clustertest-2 kernel: dlm: clvmd: add member 2
clustertest-2 kernel: dlm: got connection from 1
clustertest-2 kernel: dlm: clvmd: total members 2 error 0
clustertest-2 kernel: dlm: clvmd: dlm_recover_directory
clustertest-2 kernel: dlm: clvmd: dlm_recover_directory 0 entries
clustertest-2 clvmd: Cluster LVM daemon started - connected to CMAN
clustertest-2 kernel: dlm: clvmd: recover 1 done: 7049 ms

After the startup of clvmd the processes tops the process list.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2473 root      20  -5     0    0    0 S 49.8  0.0   5:06.48 dlm_recvd
 2474 root      20  -5     0    0    0 S 49.8  0.0   5:07.40 dlm_sendd

I've been trying to find some useful information but with no success.

Anyone having an idea or experienced the same problem?
Maybe I can provide more useful information on this?

Thanks in advance,
Redgards Dan.



From lhh at redhat.com  Fri Mar 30 13:54:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Mar 2007 09:54:51 -0400
Subject: [Linux-cluster] RHEL4 U4's qdiskd without disk or partition
In-Reply-To: <1175235128.11799.123.camel@localhost.localdomain>
References: <1175235128.11799.123.camel@localhost.localdomain>
Message-ID: <20070330135451.GF15242@redhat.com>

On Fri, Mar 30, 2007 at 04:12:09PM +1000, Isauro Michael Napolis wrote:
> hi all,
> 
> i have a question regarding using ping for tie
> breaker in a redundant NIC with a crossover for the qdisk daemon WITHOUT
> the use of a disk/ partition.  Has anybody made this work in RHEL4 U4?

It doesn't work right now, but there's no really good reason it *couldn't*
work.

The thing is that, of course, if you have a network partition caused by
a network anomaly - say an ARP storm (caused by maybe a switch loop, for
example) - where you can see the router, but not the other node, both
nodes will think they are quorate.

This isn't a problem - because one node will fence the other node first
;)  So, in the worst case, it becomes a race-to-fence.

Anyone else see any problems with doing this?

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From rpeterso at redhat.com  Fri Mar 30 13:54:05 2007
From: rpeterso at redhat.com (Robert Peterson)
Date: Fri, 30 Mar 2007 08:54:05 -0500
Subject: [Linux-cluster] General question about cluster
In-Reply-To: <460CD5C3.8020502@fis.unical.it>
References: <460B78FC.2010708@fis.unical.it> <460BCF11.1090509@redhat.com>
	<460CD5C3.8020502@fis.unical.it>
Message-ID: <460D167D.8070906@redhat.com>

Fedele Stabile wrote:
> Thank you for your suggestions.
> 
> I'm using CentOS 4.4 and the purpose of the cluster is sharing of user 
> filesystem.
> 
> I know that it's possible via NFS but i would experiment the cluster-suite.
> 
> If i don't user CS on student-PC can i use GFS instead NFS to export SAN 
> filesystems to student-PC?
> 
> Thank you
> Fedele STABILE

Hi Fedele,

One primary use of GFS is to allow cooperation of a "physically shared" 
file system between multiple nodes.  By physically shared, I mean
that the machines have access to the hardware, for example, through a
device driver.  That can be a SAN, iSCSI or whatever.  GFS does not 
export the storage to computers that don't have direct access.

It sounds like you have a SAN attached to your two servers, so
they should probably use GFS so they cooperate on the file system
properly.

It sounds like your student PCs are not connected to your SAN, so they
don't have direct access.  If you want them to have "network access" to
the SAN, you'll have to either use a network file system like NFS or
SAMBA to export the GFS file system from the servers to the PCs.
In this case, only the two servers need to be clustered.
 
Another (more complex) option is to export the block device associated 
with SAN on a server with GNBD (gnbd_export) and import the block device
to your student PCs with gnbd_import and mount the imported block
device as type GFS.  For example, on a server with the SAN, 
after cluster suite is started, do something like:

gnbd_serv
gnbd_export -d /dev/your_vg/your_lv -e san_storage
(where /dev/your_vg/your_lv) is the logical volume associated with
the SAN on your server.)

On your student PCs, do something like:

(Start the cluster sortware)
gnbd_import -i <server name>
mount -tgfs /dev/gnbd/san_storage /mnt/student

This way, all the student PCs are referencing the same storage
through the block device, but that means you would once again
have to put them all in the cluster so all their GFS file 
systems would cooperate on the file system.  And then you'd
have the same quorum problems when the student PCs are shut off.

I haven't used gnbd in a long time, so I'm a bit fuzzy about that
setup.

Regards,

Bob Peterson
Red Hat Cluster Suite



From J.vandenHorn at xb.nl  Fri Mar 30 14:07:16 2007
From: J.vandenHorn at xb.nl (Jeroen van den Horn)
Date: Fri, 30 Mar 2007 16:07:16 +0200
Subject: [Linux-cluster] Fenced node never reboots properly
In-Reply-To: <460CA3EE.6070102@xb.nl>
References: <460B72F0.9020200@xb.nl> <20070329183929.GD15242@redhat.com>
	<460CA3EE.6070102@xb.nl>
Message-ID: <460D1994.9060601@xb.nl>

In response to Lon's suggestion I modified the fence_vmware code and set 
the type of reset to HARD - cluster node now resets properly. Remaining 
issue is that under VMWare we are still experiencing performance issues. 
It's as if a node in the cluster starts 'lagging behind' (also the 
system clock starts drifting) and that after some time one of the nodes 
declares the other dead.

Does anybody have any pointers towards performance issues and/or clock 
drifting with GFS on virtual machines?

Regards,
Jeroen

> I'm using fence_vmware which I downloaded from some CVS repository. 
> Good to hear that that is the issue - I'll take a look at the source 
> and see whether the VMWare API support some sort of 'hard reset'.
>
> Jeroen
>
> Lon Hohberger wrote:
>> On Thu, Mar 29, 2007 at 10:04:00AM +0200, Jeroen van den Horn wrote:
>>   
>>> However during shutdown node 2 executes /etc/rc6.d/S31umountnfs (it's a 
>>> Debian system) which also attempts to unmount the GFS disk - result: 
>>> kernel OOPS. The system continues shutdown until it says 'Will now 
>>> restart.' but that's the end of it. I've tried setting the 
>>> /proc/sys/kernel/panic and added 'panic=5' to the kernel boot options 
>>> but to no avail.
>>>
>>> I'm really at a loss here - does anybody have any suggestions on how to 
>>> solve this problem?
>>>     
>>
>> Yes, it's supposed to be killed (immediately) when fenced, not
>> gracefully attempting to shut down.  What fencing agent are you using?
>> It sounds like there's a bug.
>>
>> -- Lon
>>
>>   
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070330/32ce147a/attachment.htm>

From fedele at fis.unical.it  Fri Mar 30 15:07:50 2007
From: fedele at fis.unical.it (Fedele Stabile)
Date: Fri, 30 Mar 2007 17:07:50 +0200
Subject: [Linux-cluster] General question about cluster
In-Reply-To: <460D167D.8070906@redhat.com>
References: <460B78FC.2010708@fis.unical.it> <460BCF11.1090509@redhat.com>
	<460CD5C3.8020502@fis.unical.it> <460D167D.8070906@redhat.com>
Message-ID: <460D27C6.8030307@fis.unical.it>

Thank you Robert for your answer that cleared up all my doubts.

I tryed to use GNBD on non-clustered student-PC as you suggested and also using the -c option with
gnbd_export on server and -n option with gnbd_import on PC
but it doesn't work because of GFS filesystem that requires the cluster  lock-manager.

At the end i'm using NFS, can you suggest any other possibility?


Fedele



From jlauro at umflint.edu  Fri Mar 30 15:38:08 2007
From: jlauro at umflint.edu (Lauro, John)
Date: Fri, 30 Mar 2007 11:38:08 -0400
Subject: [Linux-cluster] Fenced node never reboots properly
In-Reply-To: <460D1994.9060601@xb.nl>
References: <460B72F0.9020200@xb.nl>
	<20070329183929.GD15242@redhat.com><460CA3EE.6070102@xb.nl>
	<460D1994.9060601@xb.nl>
Message-ID: <D4BFED4FB59C104ABC9FF2C745FCD4F703DC24CC@SMB2.umflint.edu>

What type of vmware environment?  (VI ESX 3, Server, Workstation, or
one of the older platforms?)

 

The Vmware forums have a fair amount of help on how to handle clock
drift.  Are you on AMD or Intel, 32 or 64 bit?

 

________________________________

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeroen van den
Horn
Sent: Friday, March 30, 2007 10:07 AM
To: linux clustering
Subject: Re: [Linux-cluster] Fenced node never reboots properly

 

In response to Lon's suggestion I modified the fence_vmware code and
set the type of reset to HARD - cluster node now resets properly.
Remaining issue is that under VMWare we are still experiencing
performance issues. It's as if a node in the cluster starts 'lagging
behind' (also the system clock starts drifting) and that after some
time one of the nodes declares the other dead.

Does anybody have any pointers towards performance issues and/or clock
drifting with GFS on virtual machines?

Regards,
Jeroen




I'm using fence_vmware which I downloaded from some CVS repository.
Good to hear that that is the issue - I'll take a look at the source
and see whether the VMWare API support some sort of 'hard reset'.

Jeroen

Lon Hohberger wrote: 

On Thu, Mar 29, 2007 at 10:04:00AM +0200, Jeroen van den Horn wrote:
  

	However during shutdown node 2 executes
/etc/rc6.d/S31umountnfs (it's a 
	Debian system) which also attempts to unmount the GFS disk -
result: 
	kernel OOPS. The system continues shutdown until it says 'Will
now 
	restart.' but that's the end of it. I've tried setting the 
	/proc/sys/kernel/panic and added 'panic=5' to the kernel boot
options 
	but to no avail.
	 
	I'm really at a loss here - does anybody have any suggestions
on how to 
	solve this problem?
	    

 
Yes, it's supposed to be killed (immediately) when fenced, not
gracefully attempting to shut down.  What fencing agent are you using?
It sounds like there's a bug.
 
-- Lon
 
  





 



________________________________



 
--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070330/bda5f974/attachment.htm>

From teigland at redhat.com  Fri Mar 30 16:12:29 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 30 Mar 2007 11:12:29 -0500
Subject: [Linux-cluster] GFS locks recovery
In-Reply-To: <BAY123-F17E92A335BBACCEBE4DA22936C0@phx.gbl>
References: <20070329193115.GD21346@redhat.com>
	<BAY123-F17E92A335BBACCEBE4DA22936C0@phx.gbl>
Message-ID: <20070330161229.GB13056@redhat.com>

On Fri, Mar 30, 2007 at 12:02:23AM +0200, Christos Triantafillou wrote:
> >Yes.  Did the remaining node have quorum when you killed the other?  If
> >not, then you should set two_node=1 in cluster.conf so it will.  Fencing,
> >dlm recovery and gfs recovery won't happen unless there's quorum; after
> >this recovery, the locks you want should be granted (regardless of whether
> >the other node has rebooted or not).
> 
> Is there a way to see whether a node has the quorum at any given time?
> Or whether GFS recovery has taken place?

'cman_tool status' will tell you about quorum.  But, if only one node in a
two node cluster is up, and you don't have two_node=1, then we already
know that your cluster didn't have quorum.

group_tool will tell you the recovery status of fencing, dlm and gfs.

> >Either, possibly; you'd have to try it out.  GFS works much better with
> >flock (although that's not interruptible either), if that's an option.
> 
> How flock() works better than fcntl() on GFS?

Despite being used for similar things, they are completely different
commands with completely different implementations.  Since flock() is a
far simpler command, its implementation is far simpler, faster, and works
much better in general.

Dave



From teigland at redhat.com  Fri Mar 30 16:15:00 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 30 Mar 2007 11:15:00 -0500
Subject: [Linux-cluster] dlm_recvd and dlm_sendd causes high load.
In-Reply-To: <460D0273.2090002@algitech.com>
References: <460D0273.2090002@algitech.com>
Message-ID: <20070330161500.GC13056@redhat.com>

On Fri, Mar 30, 2007 at 02:28:35PM +0200, Dan Deshayes wrote:
> Hello everyone,
> 
> I've setup a high availibility cluster but have some problems with 
> dlm_recvd and dlm_sendd processes.
> Everything seems to be running smooth except these processes that takes 
> half of the cpu power each.

That was a bug that's been fixed in more recent kernels.

https://www.redhat.com/archives/linux-cluster/2007-March/msg00068.html

Dave



From lhh at redhat.com  Fri Mar 30 16:38:28 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Mar 2007 12:38:28 -0400
Subject: [Linux-cluster] qdiskd, multiple master, updates?
In-Reply-To: <460A8FAB.5050805@diamond.ac.uk>
References: <460A8FAB.5050805@diamond.ac.uk>
Message-ID: <20070330163828.GB6347@redhat.com>

On Wed, Mar 28, 2007 at 04:54:19PM +0100, Frederik Ferner wrote:
> Hi,
> 
> I noticed that there seems to be a fix for my original problem with 
> multiple masters for qdiskd in 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=220211.

> I assume the fix is going to be in RHEL4U5, correct?

Yes.  It's fixed two ways:

* Add timing increases so that a node can't promote -> bid in the same
cycle, or bid and attain master faster than it can declare another node
online.

* Add resolution in the case that, despite our best efforts,
multi-master occurs *anyway* - due to, perhaps, slow I/O on one node but
fast I/O on another (an unlikely case, but no doubt possible).


> Could somebody give me an estimate when we can expect that to be available?

I think it should be beta soon, and when it does go beta, it should
appear on RHN.

> Lon: Do the test rpms at http://people.redhat.com/lhh/packages.html 
> include these fixes? Which rpms do I need to test it? This time I might 
> actually get a chance to test.

The most recent fixes are not included.  Would you like me to build a
new one?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Mar 30 16:39:26 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Mar 2007 12:39:26 -0400
Subject: [Linux-cluster] LVM integration with Resource Group Manager
In-Reply-To: <20070328193825.93749.qmail@web63506.mail.re1.yahoo.com>
References: <20070328193825.93749.qmail@web63506.mail.re1.yahoo.com>
Message-ID: <20070330163926.GC6347@redhat.com>

On Wed, Mar 28, 2007 at 12:38:24PM -0700, Jon S wrote:
> I'd like to have rgmanager manage the logical volumes of LVM abstracted SAN storage as resources of a service.  Congo allows for shared raw devices and LVM hosted GFS filesystems but neither are as good a fit as LVM abstracted shared SAN devices vgimported on one node at a time.  
> 
> I'm hoping this is implemented and I just need to FRFM (find the right fine manual).  If not, is it being considered or is it just a bad idea?
> 
> btw, I'm running RHEL5 release.

RHEL5.1 to get official support for it, but it's available in CVS right
now (in the RHEL5 branch, even).

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Mar 30 16:45:40 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Mar 2007 12:45:40 -0400
Subject: [Linux-cluster] rgmanager is ping-ponging service
In-Reply-To: <200703261128.43435.hlawatschek@atix.de>
References: <200703161449.16182.hlawatschek@atix.de>
	<200703221100.50772.hlawatschek@atix.de>
	<1174601021.5158.55.camel@asuka.boston.devel.redhat.com>
	<200703261128.43435.hlawatschek@atix.de>
Message-ID: <20070330164540.GD6347@redhat.com>

On Mon, Mar 26, 2007 at 11:28:43AM +0200, Mark Hlawatschek wrote:
> Hi Lon,
> 
> > This is definitely something that should end up in bugzilla, if it isn't
> > already (it might be; I didn't check yet).
> >
> > It looks like what's going on is that we bring up the IP, but when we do
> > a "ping-self" test, it's failing (even though somehow the link-check is
> > succeeding?).  We can, I suppose, do this test after bringing up the IP
> > address to prevent the "start" from succeeding (e.g. attached patch
> > should do this).
> Note, in the test I did a ifdown bond1 to disable the interface. I haven't 
> unplugged the cables. 

Hmm, that sounds like something simple - like the bonding logic not 
correctly checking for the interface(s) being up in ip.sh...

If so, then it should work if you pull the cables.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Mar 30 16:50:14 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Mar 2007 12:50:14 -0400
Subject: [Linux-cluster] Filesystem Freeze
In-Reply-To: <992633B6A0E42B49BC5A41C10A8C841B05905561@MUCEX004.root.local>
References: <992633B6A0E42B49BC5A41C10A8C841B05905561@MUCEX004.root.local>
Message-ID: <20070330165013.GE6347@redhat.com>

On Mon, Mar 19, 2007 at 03:42:50PM +0100, R?thlein Michael (RI-Solution) wrote:
> Hello,
>  
> We use GFS for about 1 year now on a 4 node cluster connected to a SAN. In the past weeks nearly every day we had troubles desribed here: http://www.open-sharedroot.org/faq/troubleshooting-guide/system-failures/the-whole-cluster-freezes/ as "Freeze of filesystem"
> > There you will not see anything in the syslogs. The only thing you will see is rapidly increasing load on the system an 
> > rapidly increasing amounts of processes in the D State (means waiting for I/O). Here a lock is hung and any process 
> > accessing the resource (normally directory) locked by this lock will end up in the processstate D. 

Well, yes... Each process in 'D' adds exactly 1.0 to the system load
average.

> Is there any sure way to determine which node causes the lock or any fix for this issue? 
> We use this cluster as web server running apache 2.

What release of GFS are you using, on what kernel, and what Linux
distribution?

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From lhh at redhat.com  Fri Mar 30 16:55:51 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Mar 2007 12:55:51 -0400
Subject: [Linux-cluster] Advice on setting up Resources for Postgres
In-Reply-To: <200703122012.06172.fajarpri@arinet.org>
References: <200703122012.06172.fajarpri@arinet.org>
Message-ID: <20070330165551.GF6347@redhat.com>

On Mon, Mar 12, 2007 at 08:12:05PM +0700, Fajar Priyanto wrote:
> Hi All,
> I'd like to setup RHCS for Postgres on 2-nodes cluster.
> What is the best way to setup the resources? No GFS.
> - Do I need to setup a script for postgres init.d? Or should I just let it on 
> on both server from chkconfig?

There's a postgres resource agent in CVS here:

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/rgmanager/src/resources/?cvsroot=cluster

... which should help some :)

Otherwise, don't use chkconfig -- use the <script> facility in rgmanager
to reference /etc/init.d/[script] instead.  It should be chkconfig'd
off.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070330/ffb2d345/attachment.sig>

From lhh at redhat.com  Fri Mar 30 16:58:05 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Mar 2007 12:58:05 -0400
Subject: [Linux-cluster] Asymmetric config for two-node cluster with qdisk?
In-Reply-To: <200703271650.l2RGoYG31011@xos037.xos.nl>
References: <200703271650.l2RGoYG31011@xos037.xos.nl>
Message-ID: <20070330165804.GG6347@redhat.com>

On Tue, Mar 27, 2007 at 06:50:34PM +0200, Jos Vos wrote:
> Hi,
> 
> Should I put some asymmetry in the cluster config (and/or the qdisk
> heuristic scripts) of a two-node cluster to make (only) one node
> decide to continue its services in case of a split-brain problem?
> 
> Or are heartbeat failures ignored *if* the quorum disk still shows
> that the other node is running ok (or does this only affect whether
> to fence or not)?
> 
> During my tests (disconnecting nodes, shared storage via working)
> I got tons of messages like
> 
>   qdiskd[4012]: <crit> A master exists, but it's not me?!
>   qdiskd[4012]: <crit> Critical Error: More than one master found!

That's fixed in CVS.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From srramasw at cisco.com  Fri Mar 30 17:46:34 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Fri, 30 Mar 2007 10:46:34 -0700
Subject: [Linux-cluster] GFS2 hangs with dlm messages
Message-ID: <B14199FA0DBAAF4AA89E83EB41D35435034E9735@xmb-sjc-22c.amer.cisco.com>

I'm trying to get some performance numbers out of GFS2. Before
describing the problem, I got to mention the previous dlm_sendd/recvd
spinning issue is no longer seen after moving to 2.6.21-rc4 kernel on
FC6.
 
I've a two node GFS2 setup sharing a disk off node1 using GNBD. I'm
running meta-data heavy tests (i.e create/read/delete tons of small 8k
files) from node2. The test kind of hangs in the middle. I see the
following log mesgs in node1,
 
Mar 29 15:16:23 cfs1 gnbd_serv[2723]: startup succeeded
Mar 29 15:16:37 cfs1 gnbd_clusterd[2729]: connected
Mar 29 15:17:15 cfs1 kernel: GFS2: fsid=: Trying to join cluster
"lock_dlm", "ciscogfs2:sridhar"
Mar 29 15:17:15 cfs1 kernel: dlm: connecting to 2
Mar 29 15:17:15 cfs1 kernel: dlm: got connection from 2
Mar 29 15:17:15 cfs1 kernel: GFS2: fsid=ciscogfs2:sridhar.1: Joined
cluster. Now mounting FS...
Mar 29 15:17:16 cfs1 kernel: GFS2: fsid=ciscogfs2:sridhar.1: jid=1,
already locked for use
Mar 29 15:17:16 cfs1 kernel: GFS2: fsid=ciscogfs2:sridhar.1: jid=1:
Looking at journal...
Mar 29 15:17:16 cfs1 kernel: GFS2: fsid=ciscogfs2:sridhar.1: jid=1: Done
Mar 30 09:58:34 cfs1 kernel: dlm: sridhar: remove fr 2 none
Mar 30 09:58:34 cfs1 kernel: dlm: message size 5457 from 2 too big, buf
len 4632
Mar 30 09:58:34 cfs1 kernel: dlm: sridhar: remove fr 2 none
Mar 30 09:58:35 cfs1 last message repeated 51 times
Mar 30 09:58:35 cfs1 kernel: dlm: message size 13880 from 2 too big, buf
len 85072
Mar 30 09:58:35 cfs1 kernel: dlm: sridhar: remove fr 2 none
Mar 30 09:58:35 cfs1 last message repeated 3 times
Mar 30 09:58:35 cfs1 kernel: dlm: message size 13880 from 2 too big, buf
len 93760
Mar 30 09:58:37 cfs1 kernel: dlm: message size 8224 from 2 too big, buf
len 101136
Mar 30 09:58:37 cfs1 kernel: dlm: message size 8224 from 2 too big, buf
len 101248
Mar 30 09:58:39 cfs1 kernel: dlm: message size 8224 from 2 too big, buf
len 101472

sar/iostat shows there is no major network/disk-io traffic going on
after this problem. strace'ing any of the GFS process hangs. Except I
see tons of activity in 'aisexec' process with lots of sendmsg/recvmsg
going on. It seems some cluster level component - cman or dlm error -
causes GFS2 to lock up.
 
Previous tests with random file size (range 0 to 1MB) went thru' fine.
But I also remember one of previous block test (create/read/rw a1GB
file) had similar problem.
 
Anyone seen such a problem? Any clues to resolve?
 
thanks,
Sridhar
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070330/ae0e1d6b/attachment.htm>

From isplist at logicore.net  Fri Mar 30 19:00:31 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Fri, 30 Mar 2007 13:00:31 -0600
Subject: [Linux-cluster] Can't join cluster?
Message-ID: <200733013031.174981@leena>

# cman_tool status
Protocol version: 5.0.1
Config version: 68
Cluster name: vgcomp
Cluster ID: 6990
Cluster Member: No
Membership state: Not-in-Cluster

The other nodes all join just fine. I can't seem to find a reason for this one 
not to join now? Just like all of the other nodes, all of the services are 
running, everything's the same but it's not joining.

I know this is something simple, can someone give me a few ideas on what to 
look for in finding the solution here.

Mike





From teigland at redhat.com  Fri Mar 30 18:06:34 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 30 Mar 2007 13:06:34 -0500
Subject: [Linux-cluster] GFS2 hangs with dlm messages
In-Reply-To: <B14199FA0DBAAF4AA89E83EB41D35435034E9735@xmb-sjc-22c.amer.cisco.com>
References: <B14199FA0DBAAF4AA89E83EB41D35435034E9735@xmb-sjc-22c.amer.cisco.com>
Message-ID: <20070330180634.GD13056@redhat.com>

On Fri, Mar 30, 2007 at 10:46:34AM -0700, Sridhar Ramaswamy (srramasw) wrote:
> Mar 30 09:58:34 cfs1 kernel: dlm: message size 5457 from 2 too big, buf
> len 4632

I'm not sure, but this patch might be the fix for the message sizes.

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=74a3949ec42455f400a5bb4b5c9edaeda192b42e;hp=ed65d5376177cc04de39d6410a5c2d2d0883f616

> see tons of activity in 'aisexec' process with lots of sendmsg/recvmsg

'cman_tool nodes' and /var/log/messages should give some idea about
cluster membership problems.

Dave



From teigland at redhat.com  Fri Mar 30 18:08:43 2007
From: teigland at redhat.com (David Teigland)
Date: Fri, 30 Mar 2007 13:08:43 -0500
Subject: [Linux-cluster] Can't join cluster?
In-Reply-To: <200733013031.174981@leena>
References: <200733013031.174981@leena>
Message-ID: <20070330180843.GE13056@redhat.com>

On Fri, Mar 30, 2007 at 01:00:31PM -0600, isplist at logicore.net wrote:
> # cman_tool status
> Protocol version: 5.0.1
> Config version: 68
> Cluster name: vgcomp
> Cluster ID: 6990
> Cluster Member: No
> Membership state: Not-in-Cluster
> 
> The other nodes all join just fine. I can't seem to find a reason for
> this one not to join now? Just like all of the other nodes, all of the
> services are running, everything's the same but it's not joining.
> 
> I know this is something simple, can someone give me a few ideas on what
> to look for in finding the solution here.

I'd look in /var/log/messages on all the nodes and run cman_tool join -d.

Dave



From isplist at logicore.net  Fri Mar 30 19:08:48 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Fri, 30 Mar 2007 13:08:48 -0600
Subject: [Linux-cluster] Can't join cluster?
In-Reply-To: <20070330180843.GE13056@redhat.com>
Message-ID: <200733013848.371517@leena>

> I'd look in /var/log/messages on all the nodes and run cman_tool join -d.

Not much in there;

Mar 30 13:06:43 compdev ccsd[2224]: Unable to connect to cluster 
infrastructure after 3180 seconds.
Mar 30 13:07:13 compdev ccsd[2224]: Unable to connect to cluster 
infrastructure after 3210 seconds.
Mar 30 13:07:44 compdev ccsd[2224]: Unable to connect to cluster 
infrastructure after 3240 seconds.

over and over again. 





From isplist at logicore.net  Fri Mar 30 19:31:58 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Fri, 30 Mar 2007 13:31:58 -0600
Subject: [Linux-cluster] Can't join cluster?
In-Reply-To: <200733013848.371517@leena>
Message-ID: <2007330133158.944517@leena>

Ah, I knew it was something simple. The number of nodes changed and so did the 
cluster.conf file so this node and in fact others were not joining :).

Thanks for the help.

Mike





From srramasw at cisco.com  Fri Mar 30 18:55:10 2007
From: srramasw at cisco.com (Sridhar Ramaswamy (srramasw))
Date: Fri, 30 Mar 2007 11:55:10 -0700
Subject: [Linux-cluster] GFS2 hangs with dlm messages
In-Reply-To: <20070330180634.GD13056@redhat.com>
Message-ID: <B14199FA0DBAAF4AA89E83EB41D35435034E97DD@xmb-sjc-22c.amer.cisco.com>

Thanks David. I'll try the DLM patch.

I guess 'cman_tool nodes' shows things are okay over there,

[root at cfs1 cluster-2.00.00]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M      4   2007-03-29 15:15:01  cfs1
   2   M      8   2007-03-29 15:15:03  cfs5

Thanks,
Sridhar

> -----Original Message-----
> From: David Teigland [mailto:teigland at redhat.com] 
> Sent: Friday, March 30, 2007 11:07 AM
> To: Sridhar Ramaswamy (srramasw)
> Cc: linux clustering
> Subject: Re: [Linux-cluster] GFS2 hangs with dlm messages
> 
> On Fri, Mar 30, 2007 at 10:46:34AM -0700, Sridhar Ramaswamy 
> (srramasw) wrote:
> > Mar 30 09:58:34 cfs1 kernel: dlm: message size 5457 from 2 
> too big, buf
> > len 4632
> 
> I'm not sure, but this patch might be the fix for the message sizes.
> 
> http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/steve/gfs2
> -2.6-nmw.git;a=commitdiff;h=74a3949ec42455f400a5bb4b5c9edaeda1
> 92b42e;hp=ed65d5376177cc04de39d6410a5c2d2d0883f616
> 
> > see tons of activity in 'aisexec' process with lots of 
> sendmsg/recvmsg
> 
> 'cman_tool nodes' and /var/log/messages should give some idea about
> cluster membership problems.
> 
> Dave
> 



From isplist at logicore.net  Fri Mar 30 19:38:04 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Fri, 30 Mar 2007 13:38:04 -0600
Subject: [Linux-cluster] clearing volumes
Message-ID: <200733013384.191661@leena>

# lvscan
  inactive          '/dev/VolGroup03/web' [475.62 GB] inherit
  inactive          '/dev/VolGroup02/qm' [483.39 GB] inherit
  inactive          '/dev/VolGroup01/sql' [483.39 GB] inherit
  Couldn't find device with uuid 'B6AeMI-7Ce3-ieQ6-mwqL-1LJ1-SyFO-oluVRv'.
  Couldn't find all physical volumes for volume group vol00.
  Couldn't find device with uuid 'B6AeMI-7Ce3-ieQ6-mwqL-1LJ1-SyFO-oluVRv'.
  Couldn't find all physical volumes for volume group vol00.
  Volume group "vol00" not found

How can I clear out all of the junk in the above? Only the first three are 
useful to me, the rest is just messy stuff left over from a storage removal.

Mike





From lhh at redhat.com  Fri Mar 30 19:56:46 2007
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 30 Mar 2007 15:56:46 -0400
Subject: [Linux-cluster] Virtual machine live migration
Message-ID: <20070330195646.GM6347@redhat.com>

It's been in the RHEL5 branch (and HEAD) for a couple of weeks now,
and we've had pretty good luck with it internally.  However, we could
use some feedback if anyone is willing to help out.

Let me know what you need (software wise - e.g. packages, for example)
and I'll see what I can do.  While it's working now, I would like to
make it better.

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.



From isplist at logicore.net  Fri Mar 30 21:01:45 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Fri, 30 Mar 2007 15:01:45 -0600
Subject: [Linux-cluster] Can't mount?
Message-ID: <200733015145.883799@leena>

Mar 30 14:59:36 compdev kernel: GFS: Trying to join cluster "lock_dlm", 
"vgcomp:web"
Mar 30 14:59:36 compdev kernel: lock_dlm: fence domain not found; check fenced
Mar 30 14:59:36 compdev kernel: GFS: can't mount proto = lock_dlm, table = 
vgcomp:web, hostdata =
Mar 30 14:59:36 compdev hald[2672]: Timed out waiting for hotplug event 419. 
Rebasing to 420





From bmarzins at redhat.com  Fri Mar 30 23:07:06 2007
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Fri, 30 Mar 2007 17:07:06 -0600
Subject: [Linux-cluster] using GNBD versus iSCSI
In-Reply-To: <9ea95a710703271322m34d0bacctf388a4cd7d7bd840@mail.gmail.com>
References: <9ea95a710703271322m34d0bacctf388a4cd7d7bd840@mail.gmail.com>
Message-ID: <20070330230706.GA5158@ether.msp.redhat.com>

On Tue, Mar 27, 2007 at 10:22:01PM +0200, David Shwatrz wrote:
>    Hello,
>    In short: what are the advantages/disadvantages when using GNBD versus
>    iSCSI for exporting storage in a cluster?

The only real advantage of using GNBD is that it has built in fencing. With
iSCSI, you still need some somthing to fence all the machines (unless your scsi
target supports SCSI-3 persistent resrvations). Theoretically, GNBD could
run faster, since it doesn't need to do the work to imitate a SCSI device, but
but there's a lot of work that needs to be done for GNBD to reach it's speed
potential. Since there isn't much active development of GNBD, if iSCSI
isn't already faster than it, it will eventually be. I am pretty much the only
GNBD developer, and aside from bug fixing, I do very little work on it. iSCSI
has an active community of developers. Using iSCSI also allows a much more
seemless transition to a hardware shared storage solution later on.

If you don't have any fencing hardware, and your iSCSI target doesn't support
SCSI-3 persistent reservations, then you should probably go with GNBD. Otherwise
it's up to you.

-Ben

>    Regards,
>    Dave S.

> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From bmarzins at redhat.com  Sat Mar 31 03:37:17 2007
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Fri, 30 Mar 2007 21:37:17 -0600
Subject: [Linux-cluster] LVS+GFS+GNBD
In-Reply-To: <460A3564.D1F2.004B.0@wou.edu.my>
References: <460A3564.D1F2.004B.0@wou.edu.my>
Message-ID: <20070331033717.GB5158@ether.msp.redhat.com>

On Wed, Mar 28, 2007 at 09:29:15AM +0800, Lee Siang Fong wrote:
>    Dear All,
> 
>    I am planning to use only 2 nodes clusters sharing 1 storage location,
>    which is a Raid 5 SCSI hard drives of another HP DL385 machine. Appreciate
>    if you advise me the following:-
> 
>    1. Can GNBD work well in this case?

Yes, GNBD is fine for this setup.

>    2. I have successfully installed GNBD  and GFS( all in one server) and
>    mount it locally to 2 web cluster nodes(I put mount command in rc.local),
>    but when 1 do simulation on shutting down 2 webservers, it hung; i have to
>    manually shut it down. When I start both servers, it takes very long time
>    to startup, sometimes it also hung? Any suggestion?

You may have your cluster set up correctly, but I'm not sure from your
description. If you have the GFS source, look at cluster/doc/min-gfs.txt You
can also look at this file online through the web interface to CVS at
http://sources.redhat.com/cluster/. This has a description of how you should
set up a 2 node cluster, using a third node as a GNBD server.  The hangs could
also be a problem with the init scripts.

>    3. I hardly  find GFS mounting instruction in RHCS tools, I tried to mount
>    GFS as  Shared Resources, but it can only mount at 1 server at a time.

look at cluster/doc/usage.txt and cluster/doc/min-gfs.txt for mounting
instructions. If this doesn't help, could you post a copy of the error
message you are getting?

>    4. I have used back the old school method, I only used NFS for my data.
>    May  I know if there are any performance issues? ( i think we have little
>    chances to faced data corruption due to most of the clients are merely
>    downloading the contents).
> 
> 
>    Thanks a and sorry for my bad english.
> 
>    Rdgs
>    SFLee
> 


> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



From npf at eurotux.com  Sat Mar 31 11:50:24 2007
From: npf at eurotux.com (Nuno Fernandes)
Date: Sat, 31 Mar 2007 12:50:24 +0100
Subject: [Linux-cluster] system-config-cluster doesn't see cluster
Message-ID: <200703311250.24951.npf@eurotux.com>

Hi,

I've created a cluster using system-config-cluster
I'm using rhel5 with kernel 2.6.18-8.el5

# uname -a
Linux xen2.dc.server.pt 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 
x86_64 x86_64 x86_64 GNU/Linux

Apearently everything seems ok:

# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  xen1.dc.server.pt                      1 Online
  xen2.dc.server.pt                      2 Online, Local
  xen3.dc.server.pt                      3 Online

But when i start system-config-cluster in one of the members it reports:

"Because this node is not currently part of a cluster, the management tab for 
this application is not available."

The server where i'm executing system-config-cluster is xen1.dc.server.pt. Why 
does it reports that its not a member of the cluster?

[root at xen1 ~]# cman_tool status
Version: 6.0.1
Config Version: 3
Cluster Name: clt_server
Cluster Id: 52751
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 11
Node name: xen1.dc.server.pt
Node ID: 1
Multicast addresses: 239.192.206.221
Node addresses: 172.16.40.107

I've made several clusters back in rhel4 whithout any problems. I've 
reinstalled the servers twice to make sure that it wasn't any problem, so, 
i'm able to reproduce the problem everytime.

Any info?

Thanks
Nuno Fernandes
-- 
Nuno Pais Fernandes
Cisco Certified Network Associate
Oracle Certified Professional
Eurotux Informatica S.A.
Tel: +351 253257395
Fax: +351 253257396



From Robert.Hell at fabasoft.com  Sat Mar 31 16:38:22 2007
From: Robert.Hell at fabasoft.com (Hell, Robert)
Date: Sat, 31 Mar 2007 18:38:22 +0200
Subject: AW: [Linux-cluster] system-config-cluster doesn't see cluster
References: <200703311250.24951.npf@eurotux.com>
Message-ID: <B710F3299F04664DB6B37C258FDEEB9436E551@FABAMAIL.fabagl.fabasoft.com>

Hi,
 
s-c-cluster uses cman_tool to determine if the on which it is started is a cluster member and to read out node and service state.
The problem is that s-c-cluster uses /sbin/cman_tool but cman_tool is installed in /usr/sbin/cman_tool.
 
Just try: ln -s /usr/sbin/cman_tool /sbin/cman_tool on every node and everything will work fine.
 
See also bugzilla #233621 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=233621 <https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=233621> )
Also be aware of this bugzilla when using the management tab of system-config-cluster: #233633 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=233633)
 
Kind Regards
Robert
 
Ing. Robert Hell
Fabasoft R&D Software GmbH & Co KG
Honauerstrasse 4
4020 Linz
Austria - Europe

________________________________

Von: linux-cluster-bounces at redhat.com im Auftrag von Nuno Fernandes
Gesendet: Sa 31.03.2007 13:50
An: linux-cluster
Betreff: [Linux-cluster] system-config-cluster doesn't see cluster



Hi,

I've created a cluster using system-config-cluster
I'm using rhel5 with kernel 2.6.18-8.el5

# uname -a
Linux xen2.dc.server.pt 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007
x86_64 x86_64 x86_64 GNU/Linux

Apearently everything seems ok:

# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  xen1.dc.server.pt                      1 Online
  xen2.dc.server.pt                      2 Online, Local
  xen3.dc.server.pt                      3 Online

But when i start system-config-cluster in one of the members it reports:

"Because this node is not currently part of a cluster, the management tab for
this application is not available."

The server where i'm executing system-config-cluster is xen1.dc.server.pt. Why
does it reports that its not a member of the cluster?

[root at xen1 ~]# cman_tool status
Version: 6.0.1
Config Version: 3
Cluster Name: clt_server
Cluster Id: 52751
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 11
Node name: xen1.dc.server.pt
Node ID: 1
Multicast addresses: 239.192.206.221
Node addresses: 172.16.40.107

I've made several clusters back in rhel4 whithout any problems. I've
reinstalled the servers twice to make sure that it wasn't any problem, so,
i'm able to reproduce the problem everytime.

Any info?

Thanks
Nuno Fernandes
--
Nuno Pais Fernandes
Cisco Certified Network Associate
Oracle Certified Professional
Eurotux Informatica S.A.
Tel: +351 253257395
Fax: +351 253257396

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster <https://192.84.221.201/https/0/www.redhat.com/mailman/listinfo/linux-cluster> 


-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 7174 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070331/4c500071/attachment.bin>

From isplist at logicore.net  Sat Mar 31 17:58:44 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Sat, 31 Mar 2007 11:58:44 -0600
Subject: [Linux-cluster] Can't mount?
In-Reply-To: <200733015145.883799@leena>
Message-ID: <2007331115844.469730@leena>

Nothing changed, tried it a while later and now everything mounts again. Kinda 
confusing when things change for no obvious reasons.

Mike




On Fri, 30 Mar 2007 15:01:45 -0600, isplist at logicore.net wrote:
> Mar 30 14:59:36 compdev kernel: GFS: Trying to join cluster "lock_dlm",
> 
> "vgcomp:web"
> Mar 30 14:59:36 compdev kernel: lock_dlm: fence domain not found; check
> fenced
> Mar 30 14:59:36 compdev kernel: GFS: can't mount proto = lock_dlm, table =
> vgcomp:web, hostdata =
> Mar 30 14:59:36 compdev hald[2672]: Timed out waiting for hotplug event 419.
> Rebasing to 420
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster






From isplist at logicore.net  Sat Mar 31 23:41:23 2007
From: isplist at logicore.net (isplist at logicore.net)
Date: Sat, 31 Mar 2007 17:41:23 -0600
Subject: [Linux-cluster] Different Distros: GFS
Message-ID: <2007331174123.872339@leena>

I thought I read long ago when starting with GFS that in order to make a 
cluster function, you needed to have the same versions across the nodes. 

However, does that also mean that I could use different distributions?

I have cases where I need to use different Linux versions, RHEL and CentOS but 
need them all to be in the cluster. Is that an issue?

Mike