From ifetch at du.edu  Tue Nov  2 04:45:11 2010
From: ifetch at du.edu (Ivan Fetch)
Date: Mon, 1 Nov 2010 22:45:11 -0600
Subject: [Linux-cluster] Some questions - migrating from Sun to Red Hat cluster
Message-ID: <79C2D3933C76AB41B6D135F3480AC5C957C470EC@EXCH.du.edu>

Hello,

I have been using two CentOS 5.5 virtual machines to learn Linux clustering,
as a potential replacement for Sun (Sparc) clusters. We run Red Hat
Enterprise Linux, but do not yet have any production cluster experience.
I've got a few questions which I'm stuck on:

Is it possible to stop or restart one resource, instead of the entire
resource group (service)? This can be handy when you want to work on a
resource (Apache) without having the cluster restart it out from under you,
but you still want your storage and IP to stay online. It seems like the
clusvcadm command only operates on services; groups of resources.

What is the most common way to create and adjust service definitions -
using Luci, editing cluster.conf by hand, using command-line tools, or
something else?

For a non-global filesystem which follows a service, is HA LVM the way to
go? I have seen some recommendations against HA LVM, because LVM tagging
being reset on a node can allow that node to touch the LVM out of turn.

What is the recommended way to make changes to an HA LVM, or add a new HA
LVM, when lvm.conf on the cluster nodes is already configured to tag? I have
accomplished this by temporarily editing lvm.conf on one node, removing the
tag line, and then making the necessary changes to the LVM - it seems like
there is likely a better way to do this.

Will the use of a quorum disk help to keep one node from fencing the other
at boot (e.g. node1 is running, node2 boots and fences node1)? This fencing
does not happen every time I boot node2 - I may need to reproduce this and
provide logs.

Thank you for your help,

Ivan.

From thomas at sjolshagen.net  Tue Nov  2 10:14:17 2010
From: thomas at sjolshagen.net (Thomas Sjolshagen)
Date: Tue, 02 Nov 2010 06:14:17 -0400
Subject: [Linux-cluster] Some questions - migrating from Sun to Red Hat cluster
In-Reply-To: <79C2D3933C76AB41B6D135F3480AC5C957C470EC@EXCH.du.edu>
References: <79C2D3933C76AB41B6D135F3480AC5C957C470EC@EXCH.du.edu>
Message-ID: <3400c42be3875da5ec83db8002080e5b@www.sjolshagen.net>

On Mon, 1 Nov 2010 22:45:11 -0600, Ivan Fetch wrote:

> Hello,
>
> I have been using two CentOS 5.5 virtual machines to learn Linux
> clustering, as a potential replacement for Sun (Sparc) clusters. We
> run Red Hat Enterprise Linux, but do not yet have any production
> cluster experience. I've got a few questions which I'm stuck on:
>
> Is it possible to stop or restart one resource, instead of the entire
> resource group (service)? This can be handy when you want to work on a
> resource (Apache) without having the cluster restart it out from under
> you, but you still want your storage and IP to stay online. It seems
> like the clusvcadm command only operates on services; groups of
> resources.
I don't know if this is the officially sanctioned way, but I tend to freeze
the group/service (clusvcadm -Z) and then use the start/stop service script
(service httpd reload, etc.) to manipulate the daemons. (I've got a
multi-daemon mail server service that brings up postfix + amavisd + sqlgrey,
++ so this is handy here.)

> What is the most common way to create and adjust service definitions
> - using Luci, editing cluster.conf by hand, using command-line tools,
> or something else?

I'm a die-hard CLI guy, so I tend to prefer editing by hand & validating the
cluster.conf file before loading it/using it (had a couple of typos that
caused me grief as far as keeping things running goes).

> For a non-global filesystem, which follows a service, is HA LVM the
> way to go? I have seen some recommendations against HA LVM, because
> LVM tagging being reset on a node can allow that node to touch the
> LVM out of turn.
>
> What is the recommended way to make changes to an HA LVM, or add a
> new HA LVM, when lvm.conf on the cluster nodes is already configured
> to tag? I have accomplished this by temporarily editing lvm.conf on
> one node, removing the tag line, and then making the necessary changes
> to the LVM - it seems like there is likely a better way to do this.
>
> Will the use of a quorum disk help to keep one node from fencing the
> other at boot (e.g. node1 is running, node2 boots and fences node1)?
> This fencing does not happen every time I boot node2 - I may need to
> reproduce this and provide logs.

I think, perhaps, you may need/want the clean_start setting included so as
to avoid this? IIRC, setting clean_start helped me avoid fencing of the
surviving node at restart. I use the quorum disk to ensure less confusion by
the nodes during reboot scenarios too, though.

hth,

// Thomas

From bturner at redhat.com  Tue Nov  2 16:04:27 2010
From: bturner at redhat.com (Ben Turner)
Date: Tue, 2 Nov 2010 12:04:27 -0400 (EDT)
Subject: [Linux-cluster] Fence Issue on BL 460C G6
In-Reply-To: <1845350465.1095071288713706952.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com>
Message-ID: <142457019.1095461288713867261.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com>

Your nodes don't seem to be able to communicate:

Oct 30 16:08:15 rhel-cluster-node2 fenced[3549]: rhel-cluster-node1.mgmt.local not a cluster member after 3 sec post_join_delay
Oct 30 16:08:15 rhel-cluster-node2 fenced[3549]: fencing node "rhel-cluster-node1.mgmt.local"
Oct 30 16:08:29 rhel-cluster-node2 fenced[3549]: fence "rhel-cluster-node1.mgmt.local" success

I never see them form a cluster:

Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] CLM CONFIGURATION CHANGE
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] New Configuration:
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] r(0) ip(10.4.1.102)
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] Members Left:
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] Members Joined:
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] CLM CONFIGURATION CHANGE
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] New Configuration:
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [CLM ] r(0) ip(10.4.1.102)
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [CLM ] Members Left:
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [CLM ] Members Joined:
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [SYNC ] This node is within the primary component and will provide service.
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [TOTEM] entering OPERATIONAL state.
Are the nodes just rebooting each other in a cycle? If so my guess is that you are having issues routing the multicast traffic. An easy test is to try using broadcast. Change your cman tag to say: If your nodes can form a cluster with that set then you need to evaluate your multicast config. -Ben ----- "Wahyu Darmawan" wrote: > Hi all, > > Thanks. I?ve replaced mainboard on both servers. But there?s another > problem. Both servers active after mainboard replaced. > > > > But, when I restart the node that is active, other node will be > restarted as well. This happened during fencing. > > Repeated occurrence, which would in turn lead to both restart > repeatedly. > > > > Need your suggestion please.. > > Please find the attachment of /var/log/messages/ > > And, here?s my cluster.conf > > > > post_join_delay="3"/> > > votes="1"> > > > > > > > votes="1"> > > > > > > > > votes="2"> > > > > > login="Administrator" name="NODE2-ILO" passwd="password"/> > login="Administrator" name="NODE1-ILO" passwd="password"/> > > > > restricted="0"> > priority="1"/> > priority="1"/> > > > > > > name="IP_Virtual" recovery="relocate"> > > > > > > > > Thanks, > > > > > > > > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dustin Henry > Offutt > Sent: Thursday, October 28, 2010 11:46 PM > To: linux clustering > Subject: Re: [Linux-cluster] Fence Issue on BL 460C G6 > > > > I believe your problem is being caused by "nofailback" being set to > "1". : > > restricted="0"> > > Set it to zero and I believe your problem will be resolved. > > > On Wed, Oct 27, 2010 at 10:43 PM, Wahyu Darmawan < > wahyu at vivastor.co.id > wrote: > > Hi Ben, > Here is my cluster.conf. Need your help please. > > > > > post_join_delay="3"/> > > votes="1"> > > > > > > > votes="1"> > > > > > > > > votes="2"> > > > > > login="Administrator" name="NODE2-ILO" passwd="password"/> > login="Administrator" name="NODE1-ILO" passwd="password"/> > > > > restricted="0"> > priority="1"/> > priority="1"/> > > > > > > name="IP_Virtual" recovery="relocate"> > > > > > > Many thanks, > Wahyu > > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com ] On Behalf Of Ben Turner > Sent: Thursday, October 28, 2010 12:18 AM > To: linux clustering > Subject: Re: [Linux-cluster] Fence Issue on BL 460C G6 > > My guess is there is a problem with fencing. Are you running fence_ilo > with an HP blade? Iirc the iLOs on the blades have a different CLI, I > don't think fence_ilo will work with them. What do you see in the > messages files during these events? If you see failed fence messages > you may want to look into using fence_ipmilan: > > http://sources.redhat.com/cluster/wiki/IPMI_FencingConfig > > If you post a snip of your messages file from this event and your > cluster.conf I will have a better idea of what is going on. > > -b > > > > ----- "Wahyu Darmawan" < wahyu at vivastor.co.id > wrote: > > > Hi all, > > > > > > > > For fencing, I?m using HP iLO and server is BL460c G6. Problem is > > resource is start moving to the passive when the failed node is > power > > on. It is really strange for me. For example, I shutdown the node1 > and > > physically remove the node1 machine from the blade chassis and > monitor > > the clustat output, clustat was still showing that the resource is > on > > node 1, even node 1 is power down and removed from c7000 blade > > chassis. 
> > But when I plugged the failed node1 back into the c7000 blade
> > chassis and it powered on, then clustat showed that the resource
> > started moving to the passive node from the failed node.
> > I'm powering down the blade server with the power button in front of it,
> > then we remove it from the chassis. If we face a hardware problem in
> > our active node and the active node goes down, then how does the resource
> > move to the passive node? In addition, when I rebooted or shut down the
> > machine from the CLI, then the resource moved successfully from the
> > passive node. Furthermore, when I shut down the active node with the
> > "shutdown -hy 0" command, after shutting down, the active node
> > automatically restarts.
> >
> > Please help me.
> >
> > Many Thanks,

From corey.kovacs at gmail.com  Tue Nov  2 20:14:34 2010
From: corey.kovacs at gmail.com (Corey Kovacs)
Date: Tue, 2 Nov 2010 20:14:34 +0000
Subject: [Linux-cluster] ha-lvm
Message-ID: 

Folks,

I have a 5 node cluster backed by an FC SAN with 5 VGs, each with a single LV.

I am using ha_lvm and have lvm.conf configured to use tags as per the
instructions. Things work fine until I try to migrate the volume containing
our home dir (all others work as expected). The umount for that volume fails
and depending on the active config, the node reboots itself (self_fence=1)
or it simply fails and gets disabled.

lsof doesn't reveal anything "holding" onto that mount point, yet the umount
fails consistently (force_umount is enabled).

Furthermore, it appears I have at least one of my VGs with bad tags; is
there a way to show what tags a VG has?

I've gone over the config several times and although I cannot show the
config, here is a basic rundown in case something jumps out...

5 nodes, dl360g5 2xQcore w/16GB ram
EVA8100
2x4GB FC, multipath
5 VGs, each w/a single lv, each with an ext3 fs.
ha lvm is in use as a measure of protection for the ext3 fs's
local locking only via lvm.conf
tags enabled via lvm.conf
initrd's are newer than the lvm.conf changes.

I did notice that the ext3 label in use on the home volume was not of the
form /home (it was /ha_home) from early testing, but I've corrected that and
the umount failure still occurs.

If anyone has any ideas I'd appreciate it.

From Chris.Jankowski at hp.com  Wed Nov  3 02:15:25 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Wed, 3 Nov 2010 02:15:25 +0000
Subject: [Linux-cluster] ha-lvm
In-Reply-To: 
References: 
Message-ID: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net>

Corey,

I vaguely remember from my work on UNIX clusters many years ago that if /dir
is the mount point of a mounted filesystem, then cd into /dir or into any
directory below /dir from an interactive shell will prevent an unmount of
the filesystem, i.e. umount /dir will fail. I believe that this restriction
is because it would create an inconsistency in the state of the shell
process. lsof will not show it.

Of course most users after login end up in the home directory by default.
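A quick way to see that failure mode on a scratch mount (the device and path
here are made up, not taken from Corey's setup):

  # mount /dev/test_vg/test_lv /mnt/test
  # cd /mnt/test
  # umount /mnt/test    <- refused as busy while the shell's working directory is inside the mount
  # cd /
  # umount /mnt/test    <- succeeds once nothing has its cwd or open files on the filesystem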
I believe that Linux will have the same semantics as UNIX. You can test that easily on a standalone Linux box. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Corey Kovacs Sent: Wednesday, 3 November 2010 07:15 To: linux clustering Subject: [Linux-cluster] ha-lvm Folks, I have a 5 node cluster backed by an FC SAN with 5 VG's each with a single LVM. I am using ha_lvm and have lvm.conf configured to use tags as per the instructions. Things work fine until I try to migrate the volume containing our home dir (all others work as expected) The umount for that volume fails and depending on the active config, the node reboots itself (self_fence=1) or it simply fails and get's disabled. lsof doesn't reveal anything "holding" onto that mount point yet the umount fails consistently (force_umount is enabled) Furthermore, it appears I have at least one ov my VG's with bad tags, is there a way to show what tags a VG has? I've gone over the config several times and although I cannot show the config, here is a basic rundown in case something jumps out... 5 nodes, dl360g5 2xQcore w/16GB ram EVA8100 2x4GB FC, multipath 5VG's each w/a single lv each with an ext3 fs. ha lvm in is use as a measure of protection for the ext3 fs's local locking only via lvm.conf tags enabled via lvm.conf initrd's are newer than the lvm.conf changes. I did notice that the ext3 label in use on the home volume was not of the form /home (it was /ha_home) from early testing but I've corrected that and the umount fail still occurs. If anyone has any ideas I'd appreciate it. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From corey.kovacs at gmail.com Wed Nov 3 06:27:22 2010 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Wed, 3 Nov 2010 06:27:22 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net> Message-ID: You are certainly correct. I neglected to mention that I'd also checked for logged in users as well and there were none. Thank for this anyway, I appretiate the feedback. Corey Sent from my iPod On Nov 3, 2010, at 2:15 AM, "Jankowski, Chris" wrote: > Corey, > > I vaguely remember from my work on UNIX clusters many years ago that > if /dir is the mount point of a mounted filesystem then cd /dir or > into any directory below /dir from an interactive shell will prevent > an unmount of the filesystem i.e. umount /dir will fail. I believe > that this restriction is because it will create an inconsistency in > the state of the shell process. lsof will not show it. > > Of course most users after login end up in the home directory by > default. > > I believe that Linux will have the same semantics as UNIX. You can > test that easily on a standalone Linux box. > > Regards, > > Chris Jankowski > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Corey Kovacs > Sent: Wednesday, 3 November 2010 07:15 > To: linux clustering > Subject: [Linux-cluster] ha-lvm > > Folks, > > I have a 5 node cluster backed by an FC SAN with 5 VG's each with a > single LVM. > > I am using ha_lvm and have lvm.conf configured to use tags as per > the instructions. 
Things work fine until I try to migrate the volume > containing our home dir (all others work as expected) The umount for > that volume fails and depending on the active config, the node > reboots itself (self_fence=1) or it simply fails and get's disabled. > > lsof doesn't reveal anything "holding" onto that mount point yet the > umount fails consistently (force_umount is enabled) > > Furthermore, it appears I have at least one ov my VG's with bad > tags, is there a way to show what tags a VG has? > > I've gone over the config several times and although I cannot show > the config, here is a basic rundown in case something jumps out... > > 5 nodes, dl360g5 2xQcore w/16GB ram > EVA8100 > 2x4GB FC, multipath > 5VG's each w/a single lv each with an ext3 fs. > ha lvm in is use as a measure of protection for the ext3 fs's local > locking only via lvm.conf tags enabled via lvm.conf initrd's are > newer than the lvm.conf changes. > > I did notice that the ext3 label in use on the home volume was not > of the form /home (it was /ha_home) from early testing but I've > corrected that and the umount fail still occurs. > > If anyone has any ideas I'd appreciate it. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jonathan.barber at gmail.com Wed Nov 3 10:00:42 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Wed, 3 Nov 2010 10:00:42 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net> Message-ID: On 3 November 2010 02:15, Jankowski, Chris wrote: > Corey, > > I vaguely remember from my work on UNIX clusters many years ago that if /dir is the mount point of a mounted filesystem then cd /dir or into any directory below /dir from an interactive shell will prevent an unmount of the filesystem i.e. umount /dir will fail. ?I believe that this restriction is because it will create an inconsistency in the state of the shell process. lsof will not show it. lsof does show this: $ mkdir /scratch/foo $ cd /scratch/foo $ lsof +D /scratch/foo COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME bash 3060 x01024 cwd DIR 253,4 4096 303105 /scratch/foo lsof 4606 x01024 cwd DIR 253,4 4096 303105 /scratch/foo lsof 4607 x01024 cwd DIR 253,4 4096 303105 /scratch/foo This is on fedora 13 with an ext3 FS, but it also true for RHEL4 and 5. > Of course most users after login end up in the home directory by default. > > I believe that Linux will have the same semantics as UNIX. You can test that easily on a standalone Linux box. > > Regards, > > Chris Jankowski > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Corey Kovacs > Sent: Wednesday, 3 November 2010 07:15 > To: linux clustering > Subject: [Linux-cluster] ha-lvm > > Folks, > > I have a 5 node cluster backed by an FC SAN with 5 VG's each with a single LVM. > > I am using ha_lvm and have lvm.conf configured to use tags as per the instructions. Things work fine until I try to migrate the volume containing our home dir (all others work as expected) The umount for that volume fails and depending on the active config, the node reboots itself (self_fence=1) or it simply fails and get's disabled. 
> > lsof doesn't reveal anything "holding" onto that mount point yet the umount fails consistently (force_umount is enabled) > > Furthermore, it appears I have at least one ov my VG's with bad tags, is there a way to show what tags a VG has? > > I've gone over the config several times and although I cannot show the config, here is a basic rundown in case something jumps out... > > 5 nodes, dl360g5 2xQcore w/16GB ram > EVA8100 > 2x4GB FC, multipath > 5VG's each w/a single lv each with an ext3 fs. > ha lvm in is use as a measure of protection for the ext3 fs's local locking only via lvm.conf tags enabled via lvm.conf initrd's are newer than the lvm.conf changes. > > I did notice that the ext3 label in use on the home volume was not of the form /home (it was /ha_home) from early testing but I've corrected that and the umount fail still occurs. > > If anyone has any ideas I'd appreciate it. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Jonathan Barber From jonathan.barber at gmail.com Wed Nov 3 10:41:56 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Wed, 3 Nov 2010 10:41:56 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: On 2 November 2010 20:14, Corey Kovacs wrote: > Folks, [snip] > lsof doesn't reveal anything "holding" onto that mount point yet the > umount fails consistently (force_umount is enabled) Are you sure that you're specifying the filesystem mount point (as listed in fstab) and not the directory. I've cut myself on the sharp options in lsof before. It might be worth adding the +D argument to traverse all of the directories under the filesystem looking for open files. You could also use fuser command in case it's pre-coffee operator induced error ;) > Furthermore, it appears I have at least one ov my VG's with bad tags, > is there a way to show what tags a VG has? "vgs -o vg_name,vg_tags" Can you umount the volume manually? If you can then it's something to do with the RHCS, otherwise it's something else. > I've gone over the config several times and although I cannot show the > config, here is a basic rundown in case something jumps out... [snip] > > If anyone has any ideas I'd appreciate it. > -- Jonathan Barber From corey.kovacs at gmail.com Wed Nov 3 11:55:12 2010 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Wed, 3 Nov 2010 11:55:12 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: John, This is a cluster managed mount so there is no fstab entry. The lsof options you show... "vgs -o vg_name,vg_tags" are a welcome addition to my tool belt, thanks for that. seems I need to practice what I preach and use the man pages more... I am out today but I'll try these tomorrow. Thanks Corey On Wed, Nov 3, 2010 at 10:41 AM, Jonathan Barber wrote: > On 2 November 2010 20:14, Corey Kovacs wrote: >> Folks, > > [snip] > >> lsof doesn't reveal anything "holding" onto that mount point yet the >> umount fails consistently (force_umount is enabled) > > Are you sure that you're specifying the filesystem mount point (as > listed in fstab) and not the directory. I've cut myself on the sharp > options in lsof before. It might be worth adding the +D argument to > traverse all of the directories under the filesystem looking for open > files. 
> > You could also use fuser command in case it's pre-coffee operator > induced error ;) > >> Furthermore, it appears I have at least one ov my VG's with bad tags, >> is there a way to show what tags a VG has? > > "vgs -o vg_name,vg_tags" > > Can you umount the volume manually? If you can then it's something to > do with the RHCS, otherwise it's something else. > >> I've gone over the config several times and although I cannot show the >> config, here is a basic rundown in case something jumps out... > > [snip] > >> >> If anyone has any ideas I'd appreciate it. >> > > -- > Jonathan Barber > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From jonathan.barber at gmail.com Wed Nov 3 13:13:19 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Wed, 3 Nov 2010 13:13:19 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: On 3 November 2010 11:55, Corey Kovacs wrote: > John, > > This is a cluster managed mount so there is no fstab entry. That doesn't mean you can't umount it from the command line: # umount /path/to/mount/point As commented in another thread the other day, you probably want to do a "clusvcadm -Z servicename" to stop RHCS from taking action if you manage to umount the filesystem. Don't forget to do "clusvcadm -U servicename" afterwards... > The lsof options you show... > > "vgs -o vg_name,vg_tags" > > are a welcome addition to my tool belt, thanks for that. > > seems I need to practice what I preach and use the man pages more... > > I am out today but I'll try these tomorrow. > > Thanks > > Corey > > > > On Wed, Nov 3, 2010 at 10:41 AM, Jonathan Barber > wrote: >> On 2 November 2010 20:14, Corey Kovacs wrote: >>> Folks, >> >> [snip] >> >>> lsof doesn't reveal anything "holding" onto that mount point yet the >>> umount fails consistently (force_umount is enabled) >> >> Are you sure that you're specifying the filesystem mount point (as >> listed in fstab) and not the directory. I've cut myself on the sharp >> options in lsof before. It might be worth adding the +D argument to >> traverse all of the directories under the filesystem looking for open >> files. >> >> You could also use fuser command in case it's pre-coffee operator >> induced error ;) >> >>> Furthermore, it appears I have at least one ov my VG's with bad tags, >>> is there a way to show what tags a VG has? >> >> "vgs -o vg_name,vg_tags" >> >> Can you umount the volume manually? If you can then it's something to >> do with the RHCS, otherwise it's something else. >> >>> I've gone over the config several times and although I cannot show the >>> config, here is a basic rundown in case something jumps out... >> >> [snip] >> >>> >>> If anyone has any ideas I'd appreciate it. >>> >> >> -- >> Jonathan Barber >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Jonathan Barber From mylinuxhalist at gmail.com Wed Nov 3 13:23:27 2010 From: mylinuxhalist at gmail.com (My LinuxHAList) Date: Wed, 3 Nov 2010 09:23:27 -0400 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: One possibility is that, say you try to unmount /mountpoint, however you have another partition mounted at /mountpoint/subdir, that would prevent /mountpoint to be unmounted, without unmounting /mountpoint/subdir first. 
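For instance, with the layout described above (the paths are just the
placeholders from that example):

  # umount /mountpoint          <- refused as busy while /mountpoint/subdir is still mounted
  # umount /mountpoint/subdir
  # umount /mountpoint          <- now succeeds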
You could check the output of mount command. On Wed, Nov 3, 2010 at 9:13 AM, Jonathan Barber wrote: > On 3 November 2010 11:55, Corey Kovacs wrote: >> John, >> >> This is a cluster managed mount so there is no fstab entry. > > That doesn't mean you can't umount it from the command line: > # umount /path/to/mount/point > > As commented in another thread the other day, you probably want to do > a "clusvcadm -Z servicename" to stop RHCS from taking action if you > manage to umount the filesystem. Don't forget to do "clusvcadm -U > servicename" afterwards... > >> The lsof options you show... >> >> "vgs -o vg_name,vg_tags" >> >> are a welcome addition to my tool belt, thanks for that. >> >> seems I need to practice what I preach and use the man pages more... >> >> I am out today but I'll try these tomorrow. >> >> Thanks >> >> Corey >> >> >> >> On Wed, Nov 3, 2010 at 10:41 AM, Jonathan Barber >> wrote: >>> On 2 November 2010 20:14, Corey Kovacs wrote: >>>> Folks, >>> >>> [snip] >>> >>>> lsof doesn't reveal anything "holding" onto that mount point yet the >>>> umount fails consistently (force_umount is enabled) >>> >>> Are you sure that you're specifying the filesystem mount point (as >>> listed in fstab) and not the directory. I've cut myself on the sharp >>> options in lsof before. It might be worth adding the +D argument to >>> traverse all of the directories under the filesystem looking for open >>> files. >>> >>> You could also use fuser command in case it's pre-coffee operator >>> induced error ;) >>> >>>> Furthermore, it appears I have at least one ov my VG's with bad tags, >>>> is there a way to show what tags a VG has? >>> >>> "vgs -o vg_name,vg_tags" >>> >>> Can you umount the volume manually? If you can then it's something to >>> do with the RHCS, otherwise it's something else. >>> >>>> I've gone over the config several times and although I cannot show the >>>> config, here is a basic rundown in case something jumps out... >>> >>> [snip] >>> >>>> >>>> If anyone has any ideas I'd appreciate it. >>>> >>> >>> -- >>> Jonathan Barber >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Jonathan Barber > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From zagar at arlut.utexas.edu Wed Nov 3 17:55:58 2010 From: zagar at arlut.utexas.edu (Randy Zagar) Date: Wed, 03 Nov 2010 12:55:58 -0500 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: <4CD1A22E.2070004@arlut.utexas.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I frequently find that I'm unable to umount volumes, even after lsof and fuser return nothing relevant, and have to "force" a "lazy" umount like so: umount -lf /dir because both "umount /dir" and "umount -f /dir" fail. - -RZ > > On Nov 3, 2010, at 2:15 AM, "Jankowski, Chris" > wrote: > >> Corey, >> >> I vaguely remember from my work on UNIX clusters many years ago >> that if /dir is the mount point of a mounted filesystem then cd >> /dir or into any directory below /dir from an interactive shell >> will prevent an unmount of the filesystem i.e. umount /dir will >> fail. I believe that this restriction is because it will create >> an inconsistency in the state of the shell process. lsof will not >> show it. 
>> >> Of course most users after login end up in the home directory by >> default. >> >> I believe that Linux will have the same semantics as UNIX. You >> can test that easily on a standalone Linux box. >> >> Regards, >> >> Chris Jankowski >> >> >> -----Original Message----- From: linux-cluster-bounces at redhat.com >> [mailto:linux-cluster- bounces at redhat.com] On Behalf Of Corey >> Kovacs Sent: Wednesday, 3 November 2010 07:15 To: linux >> clustering Subject: [Linux-cluster] ha-lvm >> >> Folks, >> >> I have a 5 node cluster backed by an FC SAN with 5 VG's each with >> a single LVM. >> >> I am using ha_lvm and have lvm.conf configured to use tags as per >> the instructions. Things work fine until I try to migrate the >> volume containing our home dir (all others work as expected) The >> umount for that volume fails and depending on the active config, >> the node reboots itself (self_fence=1) or it simply fails and >> get's disabled. >> >> lsof doesn't reveal anything "holding" onto that mount point yet >> the umount fails consistently (force_umount is enabled) >> >> Furthermore, it appears I have at least one ov my VG's with bad >> tags, is there a way to show what tags a VG has? >> >> I've gone over the config several times and although I cannot >> show the config, here is a basic rundown in case something jumps >> out... >> >> 5 nodes, dl360g5 2xQcore w/16GB ram EVA8100 2x4GB FC, multipath >> 5VG's each w/a single lv each with an ext3 fs. ha lvm in is use >> as a measure of protection for the ext3 fs's local locking only >> via lvm.conf tags enabled via lvm.conf initrd's are newer than >> the lvm.conf changes. >> >> I did notice that the ext3 label in use on the home volume was >> not of the form /home (it was /ha_home) from early testing but >> I've corrected that and the umount fail still occurs. >> >> If anyone has any ideas I'd appreciate it. >> >> -- Linux-cluster mailing list Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iEYEARECAAYFAkzRoi4ACgkQKQP9Tvu8x8xq3wCghKNS6//Pv0kDF6RggnCCk0b4 oaEAn3uO3rDQUNAjlaXHr0yojzaUiXU8 =HaFU -----END PGP SIGNATURE----- From dxh at yahoo.com Wed Nov 3 19:30:38 2010 From: dxh at yahoo.com (Don Hoover) Date: Wed, 3 Nov 2010 12:30:38 -0700 (PDT) Subject: [Linux-cluster] iptables Message-ID: <90758.87004.qm@web120718.mail.ne1.yahoo.com> Doing some testing with RHEL6 Beta2+, and I turned on debugging to verify my iptables was working with RHCS. And I noticed that there are some packets send between each node periodically that are going to destination port=0. Dropped by firewall: IN=bond0 OUT= MAC=00:14:38:bc:ab:4d:00:1b:78:ba:80:14:08:00 SRC=10.240.48.180 DST=10.240.48.178 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=19018 DF PROTO=TCP SPT=49555 DPT=0 WINDOW=5840 RES=0x00 SYN URGP=0 Dropped by firewall: IN=bond0 OUT= MAC=00:14:38:bc:ab:4d:00:17:a4:47:99:57:08:00 SRC=10.240.48.179 DST=10.240.48.178 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=32053 DF PROTO=TCP SPT=22430 DPT=0 WINDOW=5840 RES=0x00 SYN URGP=0 Does port 0 need to be opened? This is no where in the docs, I used all the normal port suggested. 
Here is what I am testing with having open: #-A INPUT -m state --state NEW -m tcp -p tcp --dport 137 -j ACCEPT #-A INPUT -m state --state NEW -m tcp -p tcp --dport 138 -j ACCEPT #-A INPUT -m state --state NEW -m udp -p udp --dport 137 -j ACCEPT #-A INPUT -m state --state NEW -m udp -p udp --dport 138 -j ACCEPT ### cman - 5404,5405 udp -A INPUT -m state --state NEW -m udp -p udp --dport 5404 -j ACCEPT -A INPUT -m state --state NEW -m udp -p udp --dport 5405 -j ACCEPT ### ricci - 11111 tcp -A INPUT -m state --state NEW -m tcp -p tcp --dport 11111 -j ACCEPT ### dlm - 21064 tcp -A INPUT -m state --state NEW -m tcp -p tcp --dport 21064 -j ACCEPT ### ccsd - 50006,50008,50008 tcp and 50007 udp -A INPUT -m state --state NEW -m tcp -p tcp --dport 50006 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50008 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50009 -j ACCEPT -A INPUT -m state --state NEW -m udp -p udp --dport 50007 -j ACCEPT ### multicast heartbeat (may be different for each cluster) -A INPUT -s 239.192.0.0/16 -m addrtype --src-type MULTICAST -j ACCEPT -A INPUT -s 224.0.0.0/8 -m addrtype --src-type MULTICAST -j ACCEPT From jonathan.barber at gmail.com Thu Nov 4 10:42:29 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Thu, 4 Nov 2010 10:42:29 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: <4CD1A22E.2070004@arlut.utexas.edu> References: <4CD1A22E.2070004@arlut.utexas.edu> Message-ID: On 3 November 2010 17:55, Randy Zagar wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > I frequently find that I'm unable to umount volumes, even after lsof > and fuser return nothing relevant, and have to "force" a "lazy" umount > like so: > > ? ?umount -lf /dir > > because both "umount /dir" and "umount -f /dir" fail. That's a cool option, but I'd be very worried about corrupting the filesystem if it was mounted on a second node whilst a process was holding the filesystem open on the original node. > - -RZ > >> >> On Nov 3, 2010, at 2:15 AM, "Jankowski, Chris" >> wrote: >> >>> Corey, >>> >>> I vaguely remember from my work on UNIX clusters many years ago >>> that if /dir is the mount point of a mounted filesystem then cd >>> /dir or into any directory below /dir from an interactive shell >>> will prevent an unmount of the filesystem i.e. umount /dir will >>> fail. I believe that this restriction is because it will create >>> an inconsistency in the state of the shell process. lsof will not >>> show it. >>> >>> Of course most users after login end up in the home directory by >>> default. >>> >>> I believe that Linux will have the same semantics as UNIX. You >>> can test that easily on a standalone Linux box. >>> >>> Regards, >>> >>> Chris Jankowski >>> >>> >>> -----Original Message----- From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster- bounces at redhat.com] On Behalf Of Corey >>> Kovacs Sent: Wednesday, 3 November 2010 07:15 To: linux >>> clustering Subject: [Linux-cluster] ha-lvm >>> >>> Folks, >>> >>> I have a 5 node cluster backed by an FC SAN with 5 VG's each with >>> a single LVM. >>> >>> I am using ha_lvm and have lvm.conf configured to use tags as per >>> the instructions. Things work fine until I try to migrate the >>> volume containing our home dir (all others work as expected) The >>> umount for that volume fails and depending on the active config, >>> the node reboots itself (self_fence=1) or it simply fails and >>> get's disabled. 
>>> >>> lsof doesn't reveal anything "holding" onto that mount point yet >>> the umount fails consistently (force_umount is enabled) >>> >>> Furthermore, it appears I have at least one ov my VG's with bad >>> tags, is there a way to show what tags a VG has? >>> >>> I've gone over the config several times and although I cannot >>> show the config, here is a basic rundown in case something jumps >>> out... >>> >>> 5 nodes, dl360g5 2xQcore w/16GB ram EVA8100 2x4GB FC, multipath >>> 5VG's each w/a single lv each with an ext3 fs. ha lvm in is use >>> as a measure of protection for the ext3 fs's local locking only >>> via lvm.conf tags enabled via lvm.conf initrd's are newer than >>> the lvm.conf changes. >>> >>> I did notice that the ext3 label in use on the home volume was >>> not of the form /home (it was /ha_home) from early testing but >>> I've corrected that and the umount fail still occurs. >>> >>> If anyone has any ideas I'd appreciate it. >>> >>> -- Linux-cluster mailing list Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.10 (GNU/Linux) > Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ > > iEYEARECAAYFAkzRoi4ACgkQKQP9Tvu8x8xq3wCghKNS6//Pv0kDF6RggnCCk0b4 > oaEAn3uO3rDQUNAjlaXHr0yojzaUiXU8 > =HaFU > -----END PGP SIGNATURE----- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Jonathan Barber From bmr at redhat.com Thu Nov 4 11:00:18 2010 From: bmr at redhat.com (Bryn M. Reeves) Date: Thu, 04 Nov 2010 11:00:18 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: <4CD1A22E.2070004@arlut.utexas.edu> Message-ID: <4CD29242.6070406@redhat.com> On 11/04/2010 10:42 AM, Jonathan Barber wrote: > On 3 November 2010 17:55, Randy Zagar wrote: >> >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> I frequently find that I'm unable to umount volumes, even after lsof >> and fuser return nothing relevant, and have to "force" a "lazy" umount >> like so: >> >> umount -lf /dir >> >> because both "umount /dir" and "umount -f /dir" fail. > > That's a cool option, but I'd be very worried about corrupting the > filesystem if it was mounted on a second node whilst a process was > holding the filesystem open on the original node. Right; a lazy umount just detaches the root directory of the mounted file system from the namespace. The file system is still mounted following this operation it's just not reachable from the file system namespace (it will be cleaned up properly once it's no longer busy but remains in use until that time). Regards, Bryn. From gianluca.cecchi at gmail.com Mon Nov 8 14:24:08 2010 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Mon, 8 Nov 2010 15:24:08 +0100 Subject: [Linux-cluster] ha-lvm Message-ID: On Wed, 3 Nov 2010 11:55:12 +0000 Corey Kovacs wrote: > John, [snip] > "vgs -o vg_name,vg_tags" > are a welcome addition to my tool belt, thanks for that. On 2 rh el 5.5 clusters I manage, with slightly different level updates, and where I have HA-LVM configured, I don't get anything in vg_tags colums.... Versions of packages are respectively: lvm2-2.02.56-8.el5_5.6 on one cluster nodes lvm2-2.02.56-8.el5_5.5 on another cluster nodes. I'm using something like this in my lvm.conf files for the clusters: volume_list = [ "VolGroup00", "@node01" ] but no tag at all, both on passive and active node..... 
[root at server1 ~]# vgs -o vg_name,vg_tags
  VG          VG Tags
  VG_ORA_APPL
  VG_ORA_DATA
  VG_ORA_LOGS
  VolGroup00

and the first three ones are activated/mounted through HA-LVM

Gianluca

From marco.dominguez at gmail.com  Mon Nov  8 14:50:36 2010
From: marco.dominguez at gmail.com (Marco Andres Dominguez)
Date: Mon, 8 Nov 2010 11:50:36 -0300
Subject: [Linux-cluster] ha-lvm
In-Reply-To: 
References: 
Message-ID: 

Gianluca

The tag could be in the vg or in the lv depending on the configuration. I
usually have it in the lv, so try this:

# lvs -o vg_name,lv_name,lv_tags

I hope it helps.
Regards.
Marco

On Mon, Nov 8, 2010 at 11:24 AM, Gianluca Cecchi wrote:

> On Wed, 3 Nov 2010 11:55:12 +0000 Corey Kovacs wrote:
> > John,
> [snip]
> > "vgs -o vg_name,vg_tags"
> > are a welcome addition to my tool belt, thanks for that.
>
> On 2 RHEL 5.5 clusters I manage, with slightly different update levels,
> and where I have HA-LVM configured, I don't get anything in the
> vg_tags column....
>
> Versions of packages are respectively:
> lvm2-2.02.56-8.el5_5.6 on one cluster's nodes
> lvm2-2.02.56-8.el5_5.5 on another cluster's nodes.
>
> I'm using something like this in my lvm.conf files for the clusters:
> volume_list = [ "VolGroup00", "@node01" ]
>
> but no tag at all, both on passive and active node.....
> [root at server1 ~]# vgs -o vg_name,vg_tags
>   VG          VG Tags
>   VG_ORA_APPL
>   VG_ORA_DATA
>   VG_ORA_LOGS
>   VolGroup00
>
> and the first three ones are activated/mounted through HA-LVM
>
> Gianluca
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From gianluca.cecchi at gmail.com  Tue Nov  9 10:29:00 2010
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Tue, 9 Nov 2010 11:29:00 +0100
Subject: [Linux-cluster] ha-lvm
Message-ID: 

On Mon, 8 Nov 2010 11:50:36 -0300 Marco Andres Dominguez wrote:
> The tag could be in the vg or in the lv depending on the configurations,
> I usually have it in the lv so try this:
>
> # lvs -o vg_name,lv_name,lv_tags
> I hope it helps.
> Regards.
> Marco

Thanks, Marco.
Indeed with the lvs command I can see my tags.. ;-)
Any link with details about ".. tag could be in the vg or in the lv
depending on the configurations,..."?

From rossnick-lists at cybercat.ca  Wed Nov 10 01:53:27 2010
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Tue, 9 Nov 2010 20:53:27 -0500
Subject: [Linux-cluster] Starter Cluster / GFS
Message-ID: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire>

Hi all !

Some of you might know, Apple just discontinued the xServe servers. In the
next few weeks we were about to buy $50k worth of xServes to replace our
aging G5 and Xserve RAID setup.

Our setup is primarily composed of about a dozen xServes and a couple of
Xserve RAID enclosures for storage, all linked up with fibre channel. On top
of this we have Xsan to have a shared filesystem across all servers. Some of
the volumes are mounted on a single server on an as-needed basis.

Now, I'm not that sure I will go again with Xsan / xServes. So I am seeking
alternatives to our Xsan setup. In our server room we have several servers
running CentOS; I am quite familiar with it. I have also grown and learned
with Red Hat from version 5 or so (Red Hat 5 from a decade ago, not RHEL 5).
A user on the CentOS mailing list pointed me to GFS from Red Hat and to this
list.

So today I dug into GFS2 on Red Hat's site and it pretty much fits my need.
It seems to be a very powerful solution.
If I understand correctly, I need to set up a cluster of nodes to use GFS.
Fine with that. But since it's not a real "cluster", do I still need the
quorum to operate the global file system? On our setup, a particular service
runs on a single node from the shared filesystem.

The documentation on Red Hat's site is very technical, but lacks some
beginner's hints. For instance, there's a part about the required number of
journals to create and the size of those. But I cannot find a suggested size
or any rule of thumb for those...

So thanks for any hints.

Regards,

Nicolas Ross

From gordan at bobich.net  Wed Nov 10 08:13:21 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Wed, 10 Nov 2010 08:13:21 +0000
Subject: [Linux-cluster] Starter Cluster / GFS
In-Reply-To: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire>
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire>
Message-ID: <4CDA5421.9090006@bobich.net>

Nicolas Ross wrote:
> So today I dug into GFS2 on Red Hat's site and it pretty much fits my
> need. It seems to be a very powerful solution. If I understand
> correctly, I need to set up a cluster of nodes to use GFS. Fine with
> that. But since it's not a real "cluster", do I still need the quorum to
> operate the global file system? On our setup, a particular service runs
> on a single node from the shared filesystem.

If you want the FS mounted on all nodes at the same time then all those
nodes must be a part of the cluster, and they have to be quorate (majority
of nodes have to be up). You don't need a quorum block device, but it can be
useful when you have only 2 nodes.

If you are only ever going to have the SAN volume mounted on one node at a
time, don't bother with GFS and make the SAN block device a fail-over
resource so that only one node can mount it at a time, and put a normal
non-shared FS on it. You will get better performance.

> The documentation on Red Hat's site is very technical, but lacks some
> beginner's hints. For instance, there's a part about the required number
> of journals to create and the size of those. But I cannot find a
> suggested size or any rule of thumb for those...

The number of journals needs to be equal to or greater than the number of
nodes you have in a cluster, e.g. if you have 5 nodes in a cluster, you need
at least 5 journals. If you think you might upgrade your cluster to 10 nodes
at some point in the future, then create 10 journals, as this needs to be
done at FS creation time.

Gordan

From rossnick-lists at cybercat.ca  Wed Nov 10 12:07:33 2010
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Wed, 10 Nov 2010 07:07:33 -0500
Subject: [Linux-cluster] Starter Cluster / GFS
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net>
Message-ID: <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire>

Thanks

> If you want the FS mounted on all nodes at the same time then all those
> nodes must be a part of the cluster, and they have to be quorate (majority
> of nodes have to be up). You don't need a quorum block device, but it can
> be useful when you have only 2 nodes.

At term, I will have 7 to 10 nodes, but 2 at first for initial setup and
testing. OK, so if I have a 3-node cluster for example, I need at least 2
nodes for the cluster, and thus the GFS, to be up? I cannot have a running
GFS with only one node?

> If you are only ever going to have the SAN volume mounted on one node at
> a time, don't bother with GFS and make the SAN block device a fail-over
> resource so that only one node can mount it at a time, and put a normal
> non-shared FS on it.
You will get better performance. I do need a shared file-system, I am aware of the added latency, we currently have some latency on our xSan setup. But we do also need on some services an additional block-device that is accessed only by one node and is indeed failed-over another node when a node fail. > The number of journals needs to be equal to or greater than the number of > nodes you have in a cluster. e.g. if you have 5 nodes in a cluster, you > need at least 5 journals. If you think you might upgrade your cluster to > 10 nodes at some point in the future, then create 10 journals, as this > needs to be done at FS creation time. That I got. It's the size that I don't know how to figure out. Will 32 megs will be enough ? 64 ? 128 ? Nicolas From gordan at bobich.net Wed Nov 10 12:17:14 2010 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Nov 2010 12:17:14 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> Message-ID: <4CDA8D4A.6010507@bobich.net> Nicolas Ross wrote: > Thanks > >> >> If you want the FS mounted on all nodes at the same time then all >> those nodes must be a part of the cluster, and they have to be quorate >> (majority of nodes have to be up). You don't need a quorum block >> device, but it can be useful when you have only 2 nodes. > > At term, I will have 7 to 10 nodes, but 2 at first for initial setup and > testing. Ok, so if I have a 3 nodes cluster for exemple, I need at least > 2 nodes for the cluster, and thus the gfs, to be up ? I cannot have a > running gfs with only one node ? In a 2-node cluster, you can have running GFS with just one node up. But in that case it is advisble to have a quorum block device on the SAN. With a 3 node cluster, you cannot have quorum with just 1 node, and thus you cannot have GFS running. It will block until quorum is re-established. >> If you are only ever going to have the SAN volume mounted on one >> device at a time, don't bother with GFS and make the SAN block device >> a fail-over resource so that only one node can mount it at a time, and >> put a normal non-shared FS on it. You will get better performance. > > I do need a shared file-system, I am aware of the added latency, we > currently have some latency on our xSan setup. But we do also need on > some services an additional block-device that is accessed only by one > node and is indeed failed-over another node when a node fail. So handle the file system failover for the ones where only one node accesses them at a time and have a shared file system for the areas where multiple nodes need concurrent access. >> The number of journals needs to be equal to or greater than the number >> of nodes you have in a cluster. e.g. if you have 5 nodes in a cluster, >> you need at least 5 journals. If you think you might upgrade your >> cluster to 10 nodes at some point in the future, then create 10 >> journals, as this needs to be done at FS creation time. > > That I got. It's the size that I don't know how to figure out. Will 32 > megs will be enough ? 64 ? 128 ? That depends largely on how big your operations are. I cannot remember what the defaults are, but they are reasonable. In general, big journals can help if you do big I/O operations. In practice, block group sizes can be more important for performance (bigger can help on very large file systems or big files). 
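To make those knobs concrete, a hypothetical mkfs.gfs2 invocation that sets
them explicitly could look like this (the cluster name, device and sizes are
placeholders, not recommendations):

  mkfs.gfs2 -p lock_dlm -t mycluster:shared_fs -j 10 -J 128 -r 1024 /dev/vg_san/lv_shared

Here -j fixes the number of journals at creation time, -J sets the
per-journal size in MB, and -r sets the resource group ("block group") size
in MB.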
Gordan

From rossnick-lists at cybercat.ca  Wed Nov 10 13:53:56 2010
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Wed, 10 Nov 2010 08:53:56 -0500
Subject: [Linux-cluster] Starter Cluster / GFS
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net>
Message-ID: 

> In a 2-node cluster, you can have running GFS with just one node up. But
> in that case it is advisable to have a quorum block device on the SAN.
> With a 3 node cluster, you cannot have quorum with just 1 node, and thus
> you cannot have GFS running. It will block until quorum is re-established.

Ok, I'll keep that in mind and experiment with what it does when I start
playing with the hardware.

> That depends largely on how big your operations are. I cannot remember
> what the defaults are, but they are reasonable. In general, big journals
> can help if you do big I/O operations. In practice, block group sizes can
> be more important for performance (bigger can help on very large file
> systems or big files).

The volume will be composed of 7 1TB disks in RAID 5, so 6 TB. It will host
many, many small files, and some bigger files. But the files that change the
most often will most likely be smaller than the block size. The GFS will not
be used for I/O-intensive tasks; that's where the standalone volumes come
into play. It'll be used to access many files, often. Specifically, Apache
will run from it, with document root, session store, etc. on the GFS.

Regards,

From marco.dominguez at gmail.com  Wed Nov 10 14:07:08 2010
From: marco.dominguez at gmail.com (Marco Andres Dominguez)
Date: Wed, 10 Nov 2010 11:07:08 -0300
Subject: [Linux-cluster] ha-lvm
In-Reply-To: 
References: 
Message-ID: 

I think the differences in the configurations are in the lvm resource in
cluster.conf. If you put something like this: you get the tag in the lv, but
if you put something like this: you get the tag in the vg. I have never used
the second option so I am not 100% sure if it is right; I would have to try
it.

You can have a look at doc: DOC-3068 to get more info on ha-lvm.

Regards
Marco

On Tue, Nov 9, 2010 at 7:29 AM, Gianluca Cecchi wrote:

> On Mon, 8 Nov 2010 11:50:36 -0300 Marco Andres Dominguez wrote:
> > The tag could be in the vg or in the lv depending on the configurations,
> > I usually have it in the lv so try this:
> >
> > # lvs -o vg_name,lv_name,lv_tags
> > I hope it helps.
> > Regards.
> > Marco
>
> Thanks, Marco.
> Indeed with the lvs command I can see my tags.. ;-)
> Any link with details about ".. tag could be in the vg or in the lv
> depending on the configurations,..."?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From gordan at bobich.net  Wed Nov 10 14:12:51 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Wed, 10 Nov 2010 14:12:51 +0000
Subject: [Linux-cluster] Starter Cluster / GFS
In-Reply-To: 
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net>
Message-ID: <4CDAA863.2040100@bobich.net>

Nicolas Ross wrote:
>> That depends largely on how big your operations are. I cannot remember
>> what the defaults are, but they are reasonable. In general, big
>> journals can help if you do big I/O operations.
>> In practice, block group sizes can be more important for performance
>> (bigger can help on very large file systems or big files).
>
> The volume will be composed of 7 1TB disks in RAID 5, so 6 TB.

Be careful with that arrangement. You are right up against the ragged edge
in terms of data safety.

1TB disks are consumer-grade SATA disks with non-recoverable error rates of
about 10^-14. That is one non-recoverable error per 11TB.

Now consider what happens when one of your disks fails. You have to read 6TB
to reconstruct the failed disk. With an error rate of 1 in 11TB, the chance
of another failure occurring in 6TB of reads is about 53%. So the chances
are that during this operation, you are going to have another failure, and
the chances are that your RAID layer will kick the disk out as faulty - at
which point you will find yourself with 2 failed disks in a RAID5 array and
in need of a day or two of downtime to scrub your data to a fresh array and
hope for the best.

RAID5 is ill suited to arrays over 5TB. Using enterprise grade disks will
gain you an improved error rate (10^-15), which makes it good enough - if
you also have regular backups. But enterprise grade disks are much smaller
and much more expensive.

Not to mention that your performance on small writes (smaller than the
stripe width) will be appalling with RAID5 due to the read-modify-write
operation required to construct the parity, which will reduce your effective
performance to that of a single disk.

> It will host many, many small files, and some bigger files. But the files
> that change the most often will most likely be smaller than the block
> size.

That sounds like a scenario from hell for RAID5 (or RAID6).

> The GFS will not be used for I/O-intensive tasks; that's where the
> standalone volumes come into play. It'll be used to access many files,
> often. Specifically, Apache will run from it, with document root, session
> store, etc. on the GFS.

Performance-wise, GFS should be OK for that if you are running with noatime
and the operations are all reads. If you end up with write contention
without partitioning the access to directory subtrees on a per-server basis,
the performance will fall off a cliff pretty quickly.

Gordan

From linux at alteeve.com  Wed Nov 10 16:05:18 2010
From: linux at alteeve.com (Digimer)
Date: Wed, 10 Nov 2010 11:05:18 -0500
Subject: [Linux-cluster] Starter Cluster / GFS
In-Reply-To: <4CDA8D4A.6010507@bobich.net>
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net>
Message-ID: <4CDAC2BE.4010009@alteeve.com>

On 10-11-10 07:17 AM, Gordan Bobic wrote:
>>> If you want the FS mounted on all nodes at the same time then all
>>> those nodes must be a part of the cluster, and they have to be
>>> quorate (majority of nodes have to be up). You don't need a quorum
>>> block device, but it can be useful when you have only 2 nodes.
>>
>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup
>> and testing. OK, so if I have a 3-node cluster for example, I need at
>> least 2 nodes for the cluster, and thus the GFS, to be up? I cannot
>> have a running GFS with only one node?
>
> In a 2-node cluster, you can have running GFS with just one node up. But
> in that case it is advisable to have a quorum block device on the SAN.
> With a 3 node cluster, you cannot have quorum with just 1 node, and thus
> you cannot have GFS running. It will block until quorum is re-established.
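For reference, the 2-node special case mentioned above is the one enabled by
cman's two_node flag in cluster.conf, roughly:

  <cman two_node="1" expected_votes="1"/>

(the values shown are the conventional pairing for that mode, not taken from
any configuration in this thread).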
With a quorum disk, you can in fact have one node left and still have quorum. This is because the quorum drive should have (node-1) votes, thus always giving the last node 50%+1 even with all other nodes being dead. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From gordan at bobich.net Wed Nov 10 16:09:54 2010 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Nov 2010 16:09:54 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDAC2BE.4010009@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> Message-ID: <4CDAC3D2.9050703@bobich.net> Digimer wrote: > On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>> If you want the FS mounted on all nodes at the same time then all >>>> those nodes must be a part of the cluster, and they have to be >>>> quorate (majority of nodes have to be up). You don't need a quorum >>>> block device, but it can be useful when you have only 2 nodes. >>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup >>> and testing. Ok, so if I have a 3 nodes cluster for exemple, I need at >>> least 2 nodes for the cluster, and thus the gfs, to be up ? I cannot >>> have a running gfs with only one node ? >> In a 2-node cluster, you can have running GFS with just one node up. But >> in that case it is advisble to have a quorum block device on the SAN. >> With a 3 node cluster, you cannot have quorum with just 1 node, and thus >> you cannot have GFS running. It will block until quorum is re-established. > > With a quorum disk, you can in fact have one node left and still have > quorum. This is because the quorum drive should have (node-1) votes, > thus always giving the last node 50%+1 even with all other nodes being dead. I've never tried testing that use-case extensively, but I suspect that it is only safe to do with SAN-side fencing. Otherwise two nodes could lose contact with each other and still both have access to the SAN and thus both be individually quorate. Gordan From rossnick-lists at cybercat.ca Wed Nov 10 16:21:55 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 11:21:55 -0500 Subject: [Linux-cluster] Starter Cluster / GFS References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAA863.2040100@bobich.net> Message-ID: <1634100741B94E019943576441CD6873@versa> >> The volume will be composed of 7 1TB disk in raid5, so 6 TB. > > Be careful with that arrangement. You are right up against the ragged edge > in terms of data safety. > > 1TB disks a consumer grade SATA disks with non-recoverable error rates of > about 10^-14. That is one non-recoverable error per 11TB. > > Now consider what happens when one of your disks fails. You have to read > 6TB to reconstruct the failed disk. With error rate of 1 in 11TB, the > chances of another failure occurring in 6TB of reads is about 53%. So the > chances are that during this operation, you are going to have another > failure, and the chances are that your RAID layer will kick the disk out > as faulty - at which point you will find yourself with 2 failed disks in a > RAID5 array and in need of a day or two of downtime to scrub your data to > a fresh array and hope for the best. > > RAID5 is ill suited to arrays over 5TB. 
Using enterprise grade disks will > gain you an improved error rate (10^-15), which makes it good enough - if > you also have regular backups. But enterprise grade disks are much smaller > and much more expensive. > > Not to mention that your performance on small writes (smaller than the > stripe width) will be appalling with RAID5 due to the write-read-write > operation required to construct the parity which will reduce your > effective performance to that of a single disk. Wow... The enclosure I will use (and already have) is an ActiveStorage ActiveRAID in a 16 x 1 TB config. (http://www.getactivestorage.com/activeraid.php). The drives are Hitachi model HDE721010SLA33. From what I could find, the error rate is 1 in 10^15. We will have good backups. One of the nodes will have a local copy of the critical data (about 1 TB) on internally-attached disks. All of the rest of the data will be rsync-ed off site to a secondary identical setup. >> It will host many, many small files, and some biger files. But the files >> that change the most often will mos likely be smaller than the blocsize. > > That sounds like a scenario from hell for RAID5 (or RAID6). What do you suggest to achieve sizes in the range of 6-7 TB, maybe more ? >> The gfs will not be used for io-intensive tasks, that's where the >> standalone volumes comes into play. It'll be used to access many files, >> often. Specificly, apache will run from it, with document root, session >> store, etc on the gfs. > > Performance-wise, GFS should should be OK for that if you are running with > noatime and the operations are all reads. If you end up with write > contention without partitioning the access to directory subtrees on a per > server basis, the performance will fall off a cliff pretty quickly. Can you explain a little bit more ? I'm not sure I fully understand the partitioning into directories ? From rossnick-lists at cybercat.ca Wed Nov 10 16:29:51 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 11:29:51 -0500 Subject: [Linux-cluster] Starter Cluster / GFS References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire><4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> Message-ID: >> With a quorum disk, you can in fact have one node left and still have >> quorum. This is because the quorum drive should have (node-1) votes, >> thus always giving the last node 50%+1 even with all other nodes being >> dead. > > I've never tried testing that use-case extensively, but I suspect that it > is only safe to do with SAN-side fencing. Otherwise two nodes could lose > contact with each other and still both have access to the SAN and thus > both be individually quorate. In our case, a particular node will run a particular service from a particular directory on the disk. So, even if 2 nodes lose contact with each other, they should not end up writing or reading from the same files. Am I wrong ?
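To illustrate what I mean (the directory and service names below are made up), each service lives in its own subtree and only its "home" node ever opens files there:

# /gfs/service-a/  -> only ever touched by node1 (its apache instance, config, data)
# /gfs/service-b/  -> only ever touched by node2
# Quick sanity check: run on node1, this should come back empty, since
# node1 never opens anything under service-b's tree:
lsof +D /gfs/service-b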
From linux at alteeve.com Wed Nov 10 16:41:27 2010 From: linux at alteeve.com (Digimer) Date: Wed, 10 Nov 2010 11:41:27 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDAC3D2.9050703@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> Message-ID: <4CDACB37.3070704@alteeve.com> On 10-11-10 11:09 AM, Gordan Bobic wrote: > Digimer wrote: >> On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>>> If you want the FS mounted on all nodes at the same time then all >>>>> those nodes must be a part of the cluster, and they have to be >>>>> quorate (majority of nodes have to be up). You don't need a quorum >>>>> block device, but it can be useful when you have only 2 nodes. >>>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup >>>> and testing. Ok, so if I have a 3 nodes cluster for exemple, I need at >>>> least 2 nodes for the cluster, and thus the gfs, to be up ? I cannot >>>> have a running gfs with only one node ? >>> In a 2-node cluster, you can have running GFS with just one node up. But >>> in that case it is advisble to have a quorum block device on the SAN. >>> With a 3 node cluster, you cannot have quorum with just 1 node, and thus >>> you cannot have GFS running. It will block until quorum is >>> re-established. >> >> With a quorum disk, you can in fact have one node left and still have >> quorum. This is because the quorum drive should have (node-1) votes, >> thus always giving the last node 50%+1 even with all other nodes being >> dead. > > I've never tried testing that use-case extensively, but I suspect that > it is only safe to do with SAN-side fencing. Otherwise two nodes could > lose contact with each other and still both have access to the SAN and > thus both be individually quorate. > > Gordan Clustered storage *requires* fencing. To not use fencing is like driving tired; It's just a matter of time before something bad happens. That said, I should have been more clear in specifying the requirement for fencing. Now that said, the fencing shouldn't be needed at the SAN side, though that works fine as well. The way it works is: In normal operation, all nodes communicate via corosync. Corosync in turn manages the distributed locking and ensures that locks are ordered across all nodes (virtual synchrony). As soon as communication fails on one or more nodes, locks are no longer issued and all I/O is blocked until: a) The node responds finally or b) A timeout is reached and corosync issues a fence against the incommunicado node(s). Once a fence is issued, nothing will proceed until, and only until, the fence agent returns a successful fence message to the fence daemon. In the case of a split brain (nodes partition and are up but not talking to each other), both partitions will issue a fence against the other node(s). This is now a race, often described as an old-west style duel. Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. With a successful fence, the surviving partition (which could be just one node), will reconfigure and then begin restoring the clustered file system (GFS2 in this case). Once recovery is complete, I/O unblocks and continues. With SAN-side fencing, a fence is in the form of a logic disconnection from the storage network. 
This has no inherent mechanism for recovery, so the sysadmin will have to manually recover the node(s). For this reason, I do not prefer it. With power fencing, by far the most common method which can be implemented via IPMI, addressable PDUs, etc, the node that is fenced is rebooted. The benefit of this method is that the node may well reboot "healthy" and then be able to rejoin the cluster automatically. Of course, if you prefer, you can have nodes powered off and left off. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From jeff.sturm at eprize.com Wed Nov 10 18:04:19 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Wed, 10 Nov 2010 13:04:19 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <1634100741B94E019943576441CD6873@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net> <1634100741B94E019943576441CD6873@versa> Message-ID: <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Nicolas Ross > Sent: Wednesday, November 10, 2010 11:22 AM > To: linux clustering > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > Performance-wise, GFS should should be OK for that if you are running > > with noatime and the operations are all reads. If you end up with > > write contention without partitioning the access to directory subtrees > > on a per server basis, the performance will fall off a cliff pretty quickly. > > Can you explain a little bit more ? I'm not sure I fully understand the partitioning into > directories ? We had to make similar changes to our application. Avoid allowing two (or more) hosts to create small files in the same shared directory within a GFS filesystem. That particular case scales poorly with GFS. If you can partition things so that two hosts will never create files in the same directory (we used a per-host directory structure for our application), or perhaps direct all write operations to one host while other hosts only read from GFS, it should perform well. -Jeff From rossnick-lists at cybercat.ca Wed Nov 10 19:32:21 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 14:32:21 -0500 Subject: [Linux-cluster] Starter Cluster / GFS References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> Message-ID: <58753D1C20B84B8682E080FB080E69E1@versa> > We had to make similar changes to our application. > > Avoid allowing two (or more) hosts to create small files in the same > shared directory within a GFS filesystem. That particular case scales > poorly with GFS. > > If you can partition things so that two hosts will never create files in > the same directory (we used a per-host directory structure for our > application), or perhaps direct all write operations to one host while > other hosts only read from GFS, it should perform well. Ok, I see. Our applications will read/write into its own directory most of the time. In the rare cases when it'll be possible that 2 nodes read/writes to the same directory, it'll be for php sessions files. 
If we ever need to reach to this stage, we'll have to make a custom session handler to put them into a central memcached or something else... From yvette at dbtgroup.com Wed Nov 10 19:38:59 2010 From: yvette at dbtgroup.com (yvette hirth) Date: Wed, 10 Nov 2010 19:38:59 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <1634100741B94E019943576441CD6873@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAA863.2040100@bobich.net> <1634100741B94E019943576441CD6873@versa> Message-ID: <4CDAF4D3.9060105@dbtgroup.com> Nicolas Ross wrote: > What do you suggest to acheive size in the range of 6-7 TB, maybe more ? i suggest RAID10. we have a promise 16x2TB fibre raid array, and we've got two sets of six drives in two RAID10 arrays. RAID10 arrays experience much faster rebuild rates than RAID5. RAID10 offers much faster rebuild times, a nice combination of read v. write performance, but wastes a lot of space... at the end of the day, it's your choice. more on RAID here: http://en.wikipedia.org/wiki/RAID hth yvette hirth From RJM002 at shsu.edu Wed Nov 10 20:50:50 2010 From: RJM002 at shsu.edu (Marti, Robert) Date: Wed, 10 Nov 2010 14:50:50 -0600 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <58753D1C20B84B8682E080FB080E69E1@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> Message-ID: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Nicolas Ross > Sent: Wednesday, November 10, 2010 1:32 PM > To: linux clustering > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > We had to make similar changes to our application. > > > > Avoid allowing two (or more) hosts to create small files in the same > > shared directory within a GFS filesystem. That particular case scales > > poorly with GFS. > > > > If you can partition things so that two hosts will never create files > > in the same directory (we used a per-host directory structure for our > > application), or perhaps direct all write operations to one host while > > other hosts only read from GFS, it should perform well. > > Ok, I see. Our applications will read/write into its own directory most of the > time. In the rare cases when it'll be possible that 2 nodes read/writes to the > same directory, it'll be for php sessions files. If we ever need to reach to this > stage, we'll have to make a custom session handler to put them into a central > memcached or something else... > If that's the case, why look at shared storage at all? 
From Chris.Jankowski at hp.com Wed Nov 10 21:04:01 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Wed, 10 Nov 2010 21:04:01 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> Message-ID: <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> Robert, One reason is that with GFS2 you do not have to do fsck on the surviving node after one node in the cluster failed. Doing fsck ona 20 TB filesystem with heaps of files may take well over an hour. So, if you built your cluster for HA you'd rather avoid it. The locks need to be recovered, but this is much faster operation and fairly time bound. Fsck is not. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marti, Robert Sent: Thursday, 11 November 2010 07:51 To: 'linux clustering' Subject: Re: [Linux-cluster] Starter Cluster / GFS > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Nicolas Ross > Sent: Wednesday, November 10, 2010 1:32 PM > To: linux clustering > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > We had to make similar changes to our application. > > > > Avoid allowing two (or more) hosts to create small files in the same > > shared directory within a GFS filesystem. That particular case > > scales poorly with GFS. > > > > If you can partition things so that two hosts will never create > > files in the same directory (we used a per-host directory structure > > for our application), or perhaps direct all write operations to one > > host while other hosts only read from GFS, it should perform well. > > Ok, I see. Our applications will read/write into its own directory > most of the time. In the rare cases when it'll be possible that 2 > nodes read/writes to the same directory, it'll be for php sessions > files. If we ever need to reach to this stage, we'll have to make a > custom session handler to put them into a central memcached or something else... > If that's the case, why look at shared storage at all? 
-- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From RJM002 at shsu.edu Wed Nov 10 21:37:08 2010 From: RJM002 at shsu.edu (Marti, Robert) Date: Wed, 10 Nov 2010 15:37:08 -0600 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> Message-ID: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079C@EXMBX.SHSU.EDU> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Jankowski, Chris > Sent: Wednesday, November 10, 2010 3:04 PM > To: linux clustering > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > Robert, > > One reason is that with GFS2 you do not have to do fsck on the surviving > node after one node in the cluster failed. > > Doing fsck ona 20 TB filesystem with heaps of files may take well over an > hour. > > So, if you built your cluster for HA you'd rather avoid it. > > The locks need to be recovered, but this is much faster operation and fairly > time bound. Fsck is not. > > Regards, > > Chris Jankowski > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Marti, Robert > Sent: Thursday, 11 November 2010 07:51 > To: 'linux clustering' > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > > bounces at redhat.com] On Behalf Of Nicolas Ross > > Sent: Wednesday, November 10, 2010 1:32 PM > > To: linux clustering > > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > > > We had to make similar changes to our application. > > > > > > Avoid allowing two (or more) hosts to create small files in the same > > > shared directory within a GFS filesystem. That particular case > > > scales poorly with GFS. > > > > > > If you can partition things so that two hosts will never create > > > files in the same directory (we used a per-host directory structure > > > for our application), or perhaps direct all write operations to one > > > host while other hosts only read from GFS, it should perform well. > > > > Ok, I see. Our applications will read/write into its own directory > > most of the time. In the rare cases when it'll be possible that 2 > > nodes read/writes to the same directory, it'll be for php sessions > > files. If we ever need to reach to this stage, we'll have to make a > > custom session handler to put them into a central memcached or > something else... > > > > If that's the case, why look at shared storage at all? > > -- In this scenario, he's not building the apps for HA (single server at a time, except maybe for sessions) he's not using massive filesystems (5-6TB total)... The overhead involved in managing shared storage isn't typically worth it if you're not going to leverage the shared portion of it. 
Rob Marti From rossnick-lists at cybercat.ca Wed Nov 10 23:12:51 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 18:12:51 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> Message-ID: Redundancy for high availability. If a node fails, I can restart the service manually, or automatically on another node, without losing any data. Also, there is some common data between services that needs to be available in real-time. > If that's the case, why look at shared storage at all? From jakov.sosic at srce.hr Wed Nov 10 23:50:12 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Thu, 11 Nov 2010 00:50:12 +0100 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CCBFAD3.9010305@srce.hr> References: <4CCBFAD3.9010305@srce.hr> Message-ID: <4CDB2FB4.8000309@srce.hr> On 10/30/2010 01:00 PM, Jakov Sosic wrote: > Hi! > > What is best practice for keeping and updating configurations of > services that someone runs in cluster? For example, if I run > via cluster agent, then I create /etc/cluster/httpd- on > each node in the domain (cp -r /etc/httpd /etc/cluster/httpd-; cd > /etc/cluster/httpd-; rm -f logs run modules; ln -s .....). > > Now, Im puzzled how do you sync configurations between nodes? I do it > manually currently, but am seeking some automation of the process. > > I do not want to keep configurations of EACH service ona shared disks, > for some services I want to have configurations on each node available. > > > Any thoughts on this one? Well, let me say something then :) I'm thinking about starting a project - developing a set of utilities that would work just like "ccs_tool update /etc/cluster/cluster.conf", but could update any config file in the /etc/ directory. What do you think about this? From rossnick-lists at cybercat.ca Thu Nov 11 00:13:32 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 19:13:32 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079C@EXMBX.SHSU.EDU> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079C@EXMBX.SHSU.EDU> Message-ID: So, if I read you correctly, I would be better off making a big logical volume and smaller partitions inside it to put my services on it, mount the relevant partition on a server-by-server basis, and manage my shared portion otherwise ? > > In this scenario, he's not building the apps for HA (single server at a time, except maybe for sessions) he's not using massive filesystems (5-6TB total)...
> > The overhead involved in managing shared storage isn't typically worth it if you're not going to leverage the shared portion of it. > > Rob Marti From Chris.Jankowski at hp.com Thu Nov 11 02:30:19 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 02:30:19 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CDB2FB4.8000309@srce.hr> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> Message-ID: <036B68E61A28CA49AC2767596576CD596F5848362C@GVW1113EXC.americas.hpqcorp.net> Jakov, If you make it general enough you may end up with rsync. How would you position your tool in the continuum between ccs_tool update .. And rsync? Where would it add value? Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jakov Sosic Sent: Thursday, 11 November 2010 10:50 To: linux clustering Subject: Re: [Linux-cluster] Configurations of services? On 10/30/2010 01:00 PM, Jakov Sosic wrote: > Hi! > > What is best practice for keeping and updating configurations of > services that someone runs in cluster? For example, if I run > via cluster agent, then I create /etc/cluster/httpd- on > each node in the domain (cp -r /etc/httpd /etc/cluster/httpd-; > cd /etc/cluster/httpd-; rm -f logs run modules; ln -s .....). > > Now, Im puzzled how do you sync configurations between nodes? I do it > manually currently, but am seeking some automation of the process. > > I do not want to keep configurations of EACH service ona shared disks, > for some services I want to have configurations on each node available. > > > Any thoughts on this one? Well, let me say something then :) I'm thinking about starting a project - developing set of utilities that would work just like "ccs_tool update /etc/cluster/cluster.conf", but could update any config file in /etc/ directory. What do you think about this? -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Chris.Jankowski at hp.com Thu Nov 11 03:29:44 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 03:29:44 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDACB37.3070704@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> Digimer, 1. Digimer wrote: >>>Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. 2. Your comment did not explain what role the quorum disk plays in the cluster. Also, if there are any useful cluster quorum disk heuristics that can be used in this case. 
Thanks and regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer Sent: Thursday, 11 November 2010 03:41 To: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-10 11:09 AM, Gordan Bobic wrote: > Digimer wrote: >> On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>>> If you want the FS mounted on all nodes at the same time then all >>>>> those nodes must be a part of the cluster, and they have to be >>>>> quorate (majority of nodes have to be up). You don't need a quorum >>>>> block device, but it can be useful when you have only 2 nodes. >>>> At term, I will have 7 to 10 nodes, but 2 at first for initial >>>> setup and testing. Ok, so if I have a 3 nodes cluster for exemple, >>>> I need at least 2 nodes for the cluster, and thus the gfs, to be up >>>> ? I cannot have a running gfs with only one node ? >>> In a 2-node cluster, you can have running GFS with just one node up. >>> But in that case it is advisble to have a quorum block device on the SAN. >>> With a 3 node cluster, you cannot have quorum with just 1 node, and >>> thus you cannot have GFS running. It will block until quorum is >>> re-established. >> >> With a quorum disk, you can in fact have one node left and still have >> quorum. This is because the quorum drive should have (node-1) votes, >> thus always giving the last node 50%+1 even with all other nodes >> being dead. > > I've never tried testing that use-case extensively, but I suspect that > it is only safe to do with SAN-side fencing. Otherwise two nodes could > lose contact with each other and still both have access to the SAN and > thus both be individually quorate. > > Gordan Clustered storage *requires* fencing. To not use fencing is like driving tired; It's just a matter of time before something bad happens. That said, I should have been more clear in specifying the requirement for fencing. Now that said, the fencing shouldn't be needed at the SAN side, though that works fine as well. The way it works is: In normal operation, all nodes communicate via corosync. Corosync in turn manages the distributed locking and ensures that locks are ordered across all nodes (virtual synchrony). As soon as communication fails on one or more nodes, locks are no longer issued and all I/O is blocked until: a) The node responds finally or b) A timeout is reached and corosync issues a fence against the incommunicado node(s). Once a fence is issued, nothing will proceed until, and only until, the fence agent returns a successful fence message to the fence daemon. In the case of a split brain (nodes partition and are up but not talking to each other), both partitions will issue a fence against the other node(s). This is now a race, often described as an old-west style duel. Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. With a successful fence, the surviving partition (which could be just one node), will reconfigure and then begin restoring the clustered file system (GFS2 in this case). Once recovery is complete, I/O unblocks and continues. With SAN-side fencing, a fence is in the form of a logic disconnection from the storage network. This has no inherent mechanism for recovery, so the sysadmin will have to manually recover the node(s). For this reason, I do not prefer it. 
With power fencing, by far the most common method which can be implemented via IPMI, addressable PDUs, etc, the node that is fenced is rebooted. The benefit of this method is that the node may well reboot "healthy" and then be able to rejoin the cluster automatically. Of course, if you prefer, you can have nodes powered off and left off. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From linux at alteeve.com Thu Nov 11 04:29:46 2010 From: linux at alteeve.com (Digimer) Date: Wed, 10 Nov 2010 23:29:46 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDB713A.8080303@alteeve.com> On 10-11-10 10:29 PM, Jankowski, Chris wrote: > Digimer, > > 1. > Digimer wrote: >>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. > > Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). > > What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. That is somewhat frightening. My experience is limited to stock IPMI and Node Assassin. I've not seen a situation where both die. I'd strongly suggest that a bug be filed. > 2. > Your comment did not explain what role the quorum disk plays in the cluster. Also, if there are any useful cluster quorum disk heuristics that can be used in this case. > > Thanks and regards, > > Chris Jankowski Ah, the idea is that, with the quorum disk (ignoring heuristics for the moment), if only one node is left alive, the quorum disk will contribute sufficient votes for quorum to be achieved. Of course, this depends on the node(s) having access to the qdisk still. Now for heuristics; Consider this; you have a 7-node cluster; - Each node gets 1 vote. - The qdisk gets 6 votes. - Total votes is 13, quorum then is >= 7. You cluster partitions, say from a network failure. Six nodes separate from a core switch, while one happens to still have access to a critical route (say, to the Internet). The heuristic test (ie: pinging an external server) will pass for the 1 node and fail for the six others. The one node with access to the critical route will be the one to get the votes of the quorum disk (1 + 6 = 7, quorum!) while the other six will get six votes (1 + 1 + 1 + 1 + 1 + 1 = 6, no quorum). The six nodes will lose and be fenced and will not be able to rejoin the cluster until they regain access to that critical route. 
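In cluster.conf terms, that example would look roughly like this (the label, device-less setup, votes and ping target are made up for illustration - check qdisk(5) for the attributes your version supports):

  <cman expected_votes="13"/>
  <quorumd label="qdisk1" votes="6" interval="1" tko="10" min_score="1">
    <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
  </quorumd>

A node whose heuristic fails drops below min_score and gives up its claim on the quorum disk, so only the side that still passes the ping test gets the extra six votes.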
-- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From Chris.Jankowski at hp.com Thu Nov 11 05:48:10 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 05:48:10 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDB713A.8080303@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F326A@GVW1113EXC.americas.hpqcorp.net> Digimer, Again, the heuristic you gave does not pass the data centre operational sanity test. First of all, in data centres everything is redundant, so you have 2 core switches. Of course you could ping both of them and have some NAND logic. That is not important. The point is that no matter what you'd do, your cluster cannot fix the network. So, fencing nodes on network failure is the last thing you want to do. You loose warm database caches, user sessions and incomplete transactions. Disk quorum times out in 10 seconds or so. A typical network meltdown due to spanning tree recalculation is 40 seconds. If the proposed heuristic was applied to the 7 node clusters they all will murder each other and there will be nothing left. You'd convert a localised, short term network problem into a cluster wide disaster. In fact, I have yet to see a heuristic that would make sense in real world. I cannot think of one. Regards, Chris Jankowski -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Thursday, 11 November 2010 15:30 To: linux clustering Cc: Jankowski, Chris Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-10 10:29 PM, Jankowski, Chris wrote: > Digimer, > > 1. > Digimer wrote: >>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. > > Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). > > What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. That is somewhat frightening. My experience is limited to stock IPMI and Node Assassin. I've not seen a situation where both die. I'd strongly suggest that a bug be filed. > 2. > Your comment did not explain what role the quorum disk plays in the cluster. Also, if there are any useful cluster quorum disk heuristics that can be used in this case. > > Thanks and regards, > > Chris Jankowski Ah, the idea is that, with the quorum disk (ignoring heuristics for the moment), if only one node is left alive, the quorum disk will contribute sufficient votes for quorum to be achieved. Of course, this depends on the node(s) having access to the qdisk still. Now for heuristics; Consider this; you have a 7-node cluster; - Each node gets 1 vote. - The qdisk gets 6 votes. - Total votes is 13, quorum then is >= 7. You cluster partitions, say from a network failure. Six nodes separate from a core switch, while one happens to still have access to a critical route (say, to the Internet). 
The heuristic test (ie: pinging an external server) will pass for the 1 node and fail for the six others. The one node with access to the critical route will be the one to get the votes of the quorum disk (1 + 6 = 7, quorum!) while the other six will get six votes (1 + 1 + 1 + 1 + 1 + 1 = 6, no quorum). The six nodes will lose and be fenced and will not be able to rejoin the cluster until they regain access to that critical route. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From gordan at bobich.net Thu Nov 11 08:56:09 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 08:56:09 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <1634100741B94E019943576441CD6873@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAA863.2040100@bobich.net> <1634100741B94E019943576441CD6873@versa> Message-ID: <4CDBAFA9.8040707@bobich.net> Nicolas Ross wrote: >>> The volume will be composed of 7 1TB disk in raid5, so 6 TB. >> >> Be careful with that arrangement. You are right up against the ragged >> edge >> in terms of data safety. >> >> 1TB disks a consumer grade SATA disks with non-recoverable error rates of >> about 10^-14. That is one non-recoverable error per 11TB. >> >> Now consider what happens when one of your disks fails. You have to read >> 6TB to reconstruct the failed disk. With error rate of 1 in 11TB, the >> chances of another failure occurring in 6TB of reads is about 53%. So the >> chances are that during this operation, you are going to have another >> failure, and the chances are that your RAID layer will kick the disk out >> as faulty - at which point you will find yourself with 2 failed disks >> in a >> RAID5 array and in need of a day or two of downtime to scrub your data to >> a fresh array and hope for the best. >> >> RAID5 is ill suited to arrays over 5TB. Using enterprise grade disks will >> gain you an improved error rate (10^-15), which makes it good enough - if >> you also have regular backups. But enterprise grade disks are much >> smaller >> and much more expensive. >> >> Not to mention that your performance on small writes (smaller than the >> stripe width) will be appalling with RAID5 due to the write-read-write >> operation required to construct the parity which will reduce your >> effective performance to that of a single disk. > > Wow... > > The enclosure I will use (and already have) is an activestorage's > activeraid > in 16 x 1tb config. (http://www.getactivestorage.com/activeraid.php). I dealt with them before. All I'm going to say is - disregard any and all performance figures they claim and work out what the performance is likely to be from basic principles. Provided you stick to that and ignore the marketing specmanship, as far as enterprisey storage appliances go, those are reasonably good value for money. > The > drives are Hitachi model HDE721010SLA33. From what I could find, error rate > is 1 in 10^15. That makes it less bad than my figures above, but still, be careful. >>> It will host many, many small files, and some biger files. But the files >>> that change the most often will mos likely be smaller than the blocsize. >> >> That sounds like a scenario from hell for RAID5 (or RAID6). > > What do you suggest to acheive size in the range of 6-7 TB, maybe more ? 
RAID10 if you need more performance than that of a single disk, unless your I/O operations are always very big (bigger than the RAID stripe width). stripe_width = chunk_size * number_of_disks Smaller disks are good for reducing rebuild times, and more smaller disks will give you better performance. It all depends on the nature of the I/O and the performance you require. >>> The gfs will not be used for io-intensive tasks, that's where the >>> standalone volumes comes into play. It'll be used to access many files, >>> often. Specificly, apache will run from it, with document root, session >>> store, etc on the gfs. >> >> Performance-wise, GFS should should be OK for that if you are running >> with >> noatime and the operations are all reads. If you end up with write >> contention without partitioning the access to directory subtrees on a per >> server basis, the performance will fall off a cliff pretty quickly. > > Can you explain a little bit more ? I'm not sure I fully understand the > partitioning into directories ? Make sure that only one node only accesses a particular directory subtree (until it gets failed over, that is). If you have multiple nodes simultaneously writing to the same directory with any regularity you will experience performance issues. Gordan From gordan at bobich.net Thu Nov 11 08:59:30 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 08:59:30 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire><4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> Message-ID: <4CDBB072.7020001@bobich.net> Nicolas Ross wrote: >>> With a quorum disk, you can in fact have one node left and still have >>> quorum. This is because the quorum drive should have (node-1) votes, >>> thus always giving the last node 50%+1 even with all other nodes >>> being dead. >> >> I've never tried testing that use-case extensively, but I suspect that >> it is only safe to do with SAN-side fencing. Otherwise two nodes could >> lose contact with each other and still both have access to the SAN and >> thus both be individually quorate. > > I our case, a particular node will run a particular service from a > particular directory in the disk. So, even if 2 nodes looses contacts to > each other, they should not end up writing or reading from the same > files. Am I wrong ? If two nodes lose contact to each other, one will fence each other and shut it down. If you don't need concurrent access, then why do you need a cluster file system? Gordan From gordan at bobich.net Thu Nov 11 09:04:20 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:04:20 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDACB37.3070704@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> Message-ID: <4CDBB194.2020601@bobich.net> Digimer wrote: > On 10-11-10 11:09 AM, Gordan Bobic wrote: >> Digimer wrote: >>> On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>>>> If you want the FS mounted on all nodes at the same time then all >>>>>> those nodes must be a part of the cluster, and they have to be >>>>>> quorate (majority of nodes have to be up). 
You don't need a quorum >>>>>> block device, but it can be useful when you have only 2 nodes. >>>>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup >>>>> and testing. Ok, so if I have a 3 nodes cluster for exemple, I need at >>>>> least 2 nodes for the cluster, and thus the gfs, to be up ? I cannot >>>>> have a running gfs with only one node ? >>>> In a 2-node cluster, you can have running GFS with just one node up. But >>>> in that case it is advisble to have a quorum block device on the SAN. >>>> With a 3 node cluster, you cannot have quorum with just 1 node, and thus >>>> you cannot have GFS running. It will block until quorum is >>>> re-established. >>> With a quorum disk, you can in fact have one node left and still have >>> quorum. This is because the quorum drive should have (node-1) votes, >>> thus always giving the last node 50%+1 even with all other nodes being >>> dead. >> I've never tried testing that use-case extensively, but I suspect that >> it is only safe to do with SAN-side fencing. Otherwise two nodes could >> lose contact with each other and still both have access to the SAN and >> thus both be individually quorate. >> >> Gordan > > Clustered storage *requires* fencing. To not use fencing is like driving > tired; It's just a matter of time before something bad happens. That > said, I should have been more clear in specifying the requirement for > fencing. > > Now that said, the fencing shouldn't be needed at the SAN side, though > that works fine as well. The default fencing action, last time I checked, is reboot. Consider the use case where you have a network failure and separate networks for various things, and you lose connectivity between the nodes but they both still have access to the SAN. One node gets fenced, reboots, comes up and connects to the SAN. It connects to the quorum device and has quorum without the other nodes, and mounts the file systems and starts writing - while all the other nodes that have become partitioned off do the same thing. Unless you can fence the nodes from the SAN side, quorum device having a 50% weight is a recipe for disaster. > The way it works is: [...] I'm well aware of how fencing works, but you overlooked one major failure mode that is essentially guaranteed to hose your data if you set up the quorum device to have 50% of the votes. > With SAN-side fencing, a fence is in the form of a logic disconnection > from the storage network. This has no inherent mechanism for recovery, > so the sysadmin will have to manually recover the node(s). For this > reason, I do not prefer it. Then don't use a quorum device with more than an equal weight to the individual nodes. Gordan From gordan at bobich.net Thu Nov 11 09:08:00 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:08:00 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <58753D1C20B84B8682E080FB080E69E1@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> Message-ID: <4CDBB270.9000505@bobich.net> Nicolas Ross wrote: > Ok, I see. Our applications will read/write into its own directory most > of the time. In the rare cases when it'll be possible that 2 nodes > read/writes to the same directory, it'll be for php sessions files. 
If > we ever need to reach to this stage, we'll have to make a custom session > handler to put them into a central memcached or something else... You may be better off moving the session files to an asynchronous storage medium. Something like master-master replicated MySQL or SeznamFS if you want to use the file system. You'll likely save a considerable amount of latency on accesses. You don't need 100% real-time synchronicity on session information in this way. A few milliseconds of lag should be fine and it'll reduce the access latencies by potentially quite a lot. Gordan From gordan at bobich.net Thu Nov 11 09:13:50 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:13:50 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBB3CE.8020603@bobich.net> Jankowski, Chris wrote: > Robert, > > One reason is that with GFS2 you do not have to do fsck on the surviving node > after one node in the cluster failed. You don't have to do fsck after an unclean shutdown anyway, provided you use a journaled file system. GFS2 avoids the need for fsck through journaling same as any other journaled file system, not through some other magic. > Doing fsck ona 20 TB filesystem with heaps of files may take well over an hour. Depends on your file system. fsck on one of my (4TB RAID10 arrays took only about 2 minutes with ext4. Scaling that by 5x to get to 20TB still implies a figure of about 10 minutes, well short of an hour. Gordan From gordan at bobich.net Thu Nov 11 09:18:30 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:18:30 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> Message-ID: <4CDBB4E6.70101@bobich.net> Nicolas Ross wrote: > Redundency for high-availaibility. > > If a node fail, I can restart the service manually, or automaticly on > another node, without loosing any data. You can do that anyway. You make the SAN exported block device and the non-shared FS on that share into dependent services. You make it so the FS service requires the block device service, and make the application providing service depend on the file system service. That ensures they'll all come up in the correct order and fail over together. Using a shared file system gains you nothing in the use cases you are describing, other than reduce the performance. > Also, there are come common data between services that need to be availaible in real-time. That's fair enough, but in that case some volume splitting may be in order (have the common static data on GFS and have everything else on fail-over non-shared file systems). 
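Going back to the dependency chain above, one way to express it in cluster.conf is to nest the resources, which gives you the start/stop ordering (resource names, devices and paths here are made up for illustration):

  <service name="websvc" autostart="1" recovery="relocate">
    <lvm name="websvc-lvm" vg_name="vg_web" lv_name="lv_web">
      <fs name="websvc-fs" device="/dev/vg_web/lv_web" mountpoint="/srv/websvc" fstype="ext3" force_unmount="1">
        <script name="websvc-httpd" file="/etc/init.d/httpd"/>
      </fs>
    </lvm>
  </service>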
For optimal performance, you should unshare as much as possible. Gordan From gordan at bobich.net Thu Nov 11 09:19:47 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:19:47 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CDB2FB4.8000309@srce.hr> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> Message-ID: <4CDBB533.3030805@bobich.net> Jakov Sosic wrote: > On 10/30/2010 01:00 PM, Jakov Sosic wrote: >> Hi! >> >> What is best practice for keeping and updating configurations of >> services that someone runs in cluster? For example, if I run >> via cluster agent, then I create /etc/cluster/httpd- on >> each node in the domain (cp -r /etc/httpd /etc/cluster/httpd-; cd >> /etc/cluster/httpd-; rm -f logs run modules; ln -s .....). >> >> Now, Im puzzled how do you sync configurations between nodes? I do it >> manually currently, but am seeking some automation of the process. >> >> I do not want to keep configurations of EACH service ona shared disks, >> for some services I want to have configurations on each node available. >> >> >> Any thoughts on this one? > > > Well, let me say something then :) I'm thinking about starting a project > - developing set of utilities that would work just like "ccs_tool update > /etc/cluster/cluster.conf", but could update any config file in /etc/ > directory. > > What do you think about this? You may want to look at csync2 before you re-invent that particular wheel. :) Gordan From gordan at bobich.net Thu Nov 11 09:23:45 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:23:45 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBB621.2090809@bobich.net> Jankowski, Chris wrote: > Digimer, > > 1. > Digimer wrote: >>>> Both partitions will try to fence the other, but the slower >>>>will lose and get fenced before it can fence. > > Well, this is certainly not my experience in dealing with modern > rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). > > What actually happens in two node clusters is that both servers > issue the fence request to the iLO or DRAC. It gets processed > and *both* servers get powered off. Ouch!! Your 100% HA cluster > becomes 100% dead cluster. Indeed, I've seen this, too, on a range of hardware. My quick and dirty solution was to doctor the fencing agent to add a different sleep() on each node, in order of survivor preference. There may be a setting in cluster.conf that can be used to achieve the same effect, can't remember off the top of my head. 
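A minimal sketch of that kind of doctoring is a thin wrapper installed in place of the real agent on each node (the agent name and delay values are only examples - give the node you would rather lose the longer sleep):

  #!/bin/sh
  # /usr/sbin/fence_ipmilan_delayed - stagger the fencing shoot-out so that
  # the preferred survivor always gets its shot in first.
  sleep 5                            # e.g. 0 on node1, 5 on node2
  exec /usr/sbin/fence_ipmilan "$@"  # fenced's stdin and arguments pass straight through

You then point the fencedevice entry in cluster.conf at the wrapper instead of the real agent.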
Gordan From gordan at bobich.net Thu Nov 11 09:27:41 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:27:41 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDB713A.8080303@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> Message-ID: <4CDBB70D.6080204@bobich.net> Digimer wrote: > On 10-11-10 10:29 PM, Jankowski, Chris wrote: >> Digimer, >> >> 1. >> Digimer wrote: >>>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. >> Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). >> >> What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. > > That is somewhat frightening. My experience is limited to stock IPMI and > Node Assassin. I've not seen a situation where both die. I'd strongly > suggest that a bug be filed. It's actually fairly predictable and quite common. If the nodes lose connectivity to each other but both are actually alive (e.g. cluster service switch failure), you will get this sort of a shoot-out. The cause is that most out-of-band power-off mechanisms have an inherent lag of several seconds (i.e. it can be a few seconds between when you issue a power-off command and the machine actually powers off). During that race window, both machines may issue a remote power-off before they actually shut down themselves. Gordan From gordan at bobich.net Thu Nov 11 09:31:57 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:31:57 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F326A@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F326A@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBB80D.70301@bobich.net> Jankowski, Chris wrote: > The point is that no matter what you'd do, your cluster cannot fix the network. > So, fencing nodes on network failure is the last thing you want to do. You loose > warm database caches, user sessions and incomplete transactions. Disk quorum times > out in 10 seconds or so. A typical network meltdown due to spanning tree recalculation > is 40 seconds. I'd argue that if you regularly get outages of 40 seconds due to spanning tree rebuilds, you have bigger problems (such as too many machines on the same VLAN). And if you have that many nodes in a cluster (you do keep your cluster interfaces on a dedicated VLAN, right?), you are doing way better than what the claimed limits for RHCS are. 
:) Gordan From Chris.Jankowski at hp.com Thu Nov 11 09:59:13 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 09:59:13 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBB70D.6080204@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> Gordan, I do understand the mechanism. I was trying to gently point out that this behaviour is unacceptable for my commercial IP customers. The customers buy clusters for high availability. Loosing the whole cluster due to single component failure - hearbeat link is not acceptable. The heartbeat link is a huge SPOF. And the cluster design does not support redundant links for heartbeat. Also, none of the commercially available UNIX clusters or Linux clusters (HP ServiceGuard, Veritas, SteelEye) would display this type of behaviour and they do not clobber cluster filesystems. So, it is possible to achieve acceptable reaction to this type of failure. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic Sent: Thursday, 11 November 2010 20:28 To: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS Digimer wrote: > On 10-11-10 10:29 PM, Jankowski, Chris wrote: >> Digimer, >> >> 1. >> Digimer wrote: >>>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. >> Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). >> >> What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. > > That is somewhat frightening. My experience is limited to stock IPMI > and Node Assassin. I've not seen a situation where both die. I'd > strongly suggest that a bug be filed. It's actually fairly predictable and quite common. If the nodes lose connectivity to each other but both are actually alive (e.g. cluster service switch failure), you will get this sort of a shoot-out. The cause is that most out-of-band power-off mechanisms have an inherent lag of several seconds (i.e. it can be a few seconds between when you issue a power-off command and the machine actually powers off). During that race window, both machines may issue a remote power-off before they actually shut down themselves. 
Gordan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Thu Nov 11 10:07:31 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 10:07:31 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBC063.2060605@bobich.net> Jankowski, Chris wrote: > Gordan, > > I do understand the mechanism. I was trying to gently point out that > this behaviour is unacceptable for my commercial IP customers. The customers > buy clusters for high availability. Loosing the whole cluster due to single > component failure - hearbeat link is not acceptable. The heartbeat link is > a huge SPOF. And the cluster design does not support redundant links for > heartbeat. > > Also, none of the commercially available UNIX clusters or Linux clusters > (HP ServiceGuard, Veritas, SteelEye) would display this type of behaviour > and they do not clobber cluster filesystems. So, it is possible to > achieve acceptable reaction to this type of failure. My point was that you can easily overcome the race by introducing a staggered delay into fencing that works around the race condition. I never tried, but are you sure bonded devices don't work for heartbeat? Gordan From Chris.Jankowski at hp.com Thu Nov 11 10:30:43 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 10:30:43 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBC063.2060605@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDBC063.2060605@bobich.net> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> Gordan, I did not ask for bonding. This should work. I asked for multiple independent links - different networking interfaces configured for different IP subnets mapping to different VLANS. STP is, these days, run on a per VLAN basis. Having multiple links in different VLANs protects against important classes of network failures. Bonded interface does not do it. This must be integrated in the clustering software. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic Sent: Thursday, 11 November 2010 21:08 To: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS Jankowski, Chris wrote: > Gordan, > > I do understand the mechanism. I was trying to gently point out that > this behaviour is unacceptable for my commercial IP customers. The > customers buy clusters for high availability. 
Loosing the whole > cluster due to single component failure - hearbeat link is not > acceptable. The heartbeat link is a huge SPOF. And the cluster design > does not support redundant links for heartbeat. > > Also, none of the commercially available UNIX clusters or Linux > clusters (HP ServiceGuard, Veritas, SteelEye) would display this type > of behaviour and they do not clobber cluster filesystems. So, it is > possible to achieve acceptable reaction to this type of failure. My point was that you can easily overcome the race by introducing a staggered delay into fencing that works around the race condition. I never tried, but are you sure bonded devices don't work for heartbeat? Gordan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Thu Nov 11 10:46:25 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 10:46:25 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDBC063.2060605@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBC981.4010807@bobich.net> Jankowski, Chris wrote: > Gordan, > > I did not ask for bonding. This should work. I asked for > multiple independent links - different networking interfaces > configured for different IP subnets mapping to different VLANS. > > STP is, these days, run on a per VLAN basis. Having multiple > links in different VLANs protects against important classes of > network failures. Bonded interface does not do it. This must > be integrated in the clustering software. I don't quite see the point you're making. If your goal is redundant networking, then you can achieve that by having bonded interfaces in each node, and each of the components of the bonded interface should go to a different switch. That will give you both extra bandwidth and a redundant path between all the nodes, which will ensure you don't end up with a partitioned cluster. Gordan From jakov.sosic at srce.hr Thu Nov 11 11:42:10 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Thu, 11 Nov 2010 12:42:10 +0100 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CDBB533.3030805@bobich.net> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> <4CDBB533.3030805@bobich.net> Message-ID: <4CDBD692.4020108@srce.hr> On 11/11/2010 10:19 AM, Gordan Bobic wrote: > Jakov Sosic wrote: >> On 10/30/2010 01:00 PM, Jakov Sosic wrote: >>> Hi! >>> >>> What is best practice for keeping and updating configurations of >>> services that someone runs in cluster? For example, if I run >>> via cluster agent, then I create /etc/cluster/httpd- on >>> each node in the domain (cp -r /etc/httpd /etc/cluster/httpd-; cd >>> /etc/cluster/httpd-; rm -f logs run modules; ln -s .....). >>> >>> Now, Im puzzled how do you sync configurations between nodes? I do it >>> manually currently, but am seeking some automation of the process. 
>>> >>> I do not want to keep configurations of EACH service ona shared disks, >>> for some services I want to have configurations on each node available. >>> >>> >>> Any thoughts on this one? >> >> >> Well, let me say something then :) I'm thinking about starting a project >> - developing set of utilities that would work just like "ccs_tool update >> /etc/cluster/cluster.conf", but could update any config file in /etc/ >> directory. >> >> What do you think about this? > > You may want to look at csync2 before you re-invent that particular > wheel. :) Thank you for your information, I'm getting at it right away... -- Jakov Sosic From jonathan.barber at gmail.com Thu Nov 11 13:25:38 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Thu, 11 Nov 2010 13:25:38 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBC981.4010807@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDBC063.2060605@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> <4CDBC981.4010807@bobich.net> Message-ID: On 11 November 2010 10:46, Gordan Bobic wrote: > Jankowski, Chris wrote: >> >> Gordan, >> >> I did not ask for bonding. ?This should work. ?I asked for >> multiple independent links - different networking interfaces >> configured for different IP subnets mapping to different VLANS. > >> >> >> STP is, these days, run on a per VLAN basis. Having multiple >> links in different VLANs protects against important classes of >> network failures. ?Bonded interface does not do it. This must >> be integrated in the clustering software. > > I don't quite see the point you're making. If your goal is redundant > networking, then you can achieve that by having bonded interfaces in each > node, and each of the components of the bonded interface should go to a > different switch. That will give you both extra bandwidth and a redundant > path between all the nodes, which will ensure you don't end up with a > partitioned cluster. Chris' point is that if the STP has to recalculate (for example if the STP root node dies), then having multiple interfaces in the same VLAN will not help (if the time taken to recalculate is longer than the fencing timeout). But, if he can run the heartbeat across multiple VLANs, and the network supports per-VLAN STP, then he lowers the risk of both VLANs being affected by the same event and therefore reduces the likelihood of a shootout between the cluster nodes. Of course, it depends on the topology of the STP domains as to whether you are guaranteed to maintain at least one path between nodes in the cluster given a STP node failure. 
> Gordan -- Jonathan Barber From gordan at bobich.net Thu Nov 11 13:38:04 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 13:38:04 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDBC063.2060605@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> <4CDBC981.4010807@bobich.net> Message-ID: <4CDBF1BC.7000103@bobich.net> Jonathan Barber wrote: > On 11 November 2010 10:46, Gordan Bobic wrote: >> Jankowski, Chris wrote: >>> Gordan, >>> >>> I did not ask for bonding. This should work. I asked for >>> multiple independent links - different networking interfaces >>> configured for different IP subnets mapping to different VLANS. >>> >>> STP is, these days, run on a per VLAN basis. Having multiple >>> links in different VLANs protects against important classes of >>> network failures. Bonded interface does not do it. This must >>> be integrated in the clustering software. >> I don't quite see the point you're making. If your goal is redundant >> networking, then you can achieve that by having bonded interfaces in each >> node, and each of the components of the bonded interface should go to a >> different switch. That will give you both extra bandwidth and a redundant >> path between all the nodes, which will ensure you don't end up with a >> partitioned cluster. > > Chris' point is that if the STP has to recalculate (for example if the > STP root node dies), then having multiple interfaces in the same VLAN > will not help (if the time taken to recalculate is longer than the > fencing timeout). But, if he can run the heartbeat across multiple > VLANs, and the network supports per-VLAN STP, then he lowers the risk > of both VLANs being affected by the same event and therefore reduces > the likelihood of a shootout between the cluster nodes. > > Of course, it depends on the topology of the STP domains as to whether > you are guaranteed to maintain at least one path between nodes in the > cluster given a STP node failure. Yes, but your cluster VLAN (the one that's monitored for heartbeating) should be isolated, rather than public, so the only nodes on it will be the cluster nodes (and probably the SAN). If with that many nodes your spanning tree recalculation still takes 40 seconds you have network gear that is unfit for purpose anyway. 
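As a concrete illustration of the bonded-interface approach Gordan describes a few messages up (one bond per node, with each leg cabled to a different switch), a minimal RHEL-style sketch; the device names, addresses and the choice of active-backup mode are assumptions:

    # /etc/sysconfig/network-scripts/ifcfg-bond0  (dedicated cluster network)
    DEVICE=bond0
    BOOTPROTO=none
    ONBOOT=yes
    IPADDR=10.10.10.1
    NETMASK=255.255.255.0
    BONDING_OPTS="mode=active-backup miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-eth1  (leg to switch A; the
    # ifcfg-eth2 leg to switch B looks the same)
    DEVICE=eth1
    BOOTPROTO=none
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes

mode=active-backup needs no configuration on the switches themselves, which keeps the two legs genuinely independent; link-aggregation modes such as 802.3ad would tie both legs to switch-side support.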
Gordan From linux at alteeve.com Thu Nov 11 16:38:38 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 11:38:38 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDC1C0E.6010804@alteeve.com> On 10-11-11 04:59 AM, Jankowski, Chris wrote: > Gordan, > > I do understand the mechanism. I was trying to gently point out that this behaviour is unacceptable for my commercial IP customers. The customers buy clusters for high availability. Loosing the whole cluster due to single component failure - hearbeat link is not acceptable. The heartbeat link is a huge SPOF. And the cluster design does not support redundant links for heartbeat. > > Also, none of the commercially available UNIX clusters or Linux clusters (HP ServiceGuard, Veritas, SteelEye) would display this type of behaviour and they do not clobber cluster filesystems. So, it is possible to achieve acceptable reaction to this type of failure. > > Regards, > > Chris Jankowski I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Thu Nov 11 16:44:14 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 11:44:14 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBB621.2090809@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDBB621.2090809@bobich.net> Message-ID: <4CDC1D5E.6030905@alteeve.com> On 10-11-11 04:23 AM, Gordan Bobic wrote: > Jankowski, Chris wrote: >> Digimer, >> >> 1. >> Digimer wrote: >>>>> Both partitions will try to fence the other, but the slower >>>>> will lose and get fenced before it can fence. >> >> Well, this is certainly not my experience in dealing with modern >> rack mounted or blade servers where you use iLO (on HP) or DRAC (on >> Dell). >> >> What actually happens in two node clusters is that both servers >> issue the fence request to the iLO or DRAC. It gets processed >> and *both* servers get powered off. Ouch!! Your 100% HA cluster >> becomes 100% dead cluster. > > Indeed, I've seen this, too, on a range of hardware. My quick and dirty > solution was to doctor the fencing agent to add a different sleep() on > each node, in order of survivor preference. 
There may be a setting in > cluster.conf that can be used to achieve the same effect, can't remember > off the top of my head. > > Gordan I've not seen such an option, though I make no claims to complete knowledge of the options available. I do know that there are pre-device fence options (that is, IPMI has a set of options that differs from DRAC, etc). So perhaps there is an option there. I am very curious to know how this scenario can happen. As I had previously understood it, this should simply not be possible. Obviously it is though... The only thing I can think of is where a fence device is external to the nodes and allows for multiple fence calls at the same time. I would expect that and fence device should terminate a node nearly instantly. If it doesn't or can't, then I would suggest that it not accept a second fence request until after the pending one completes. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Thu Nov 11 16:48:50 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 11:48:50 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBB194.2020601@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <4CDBB194.2020601@bobich.net> Message-ID: <4CDC1E72.9020703@alteeve.com> On 10-11-11 04:04 AM, Gordan Bobic wrote: > Digimer wrote: >> On 10-11-10 11:09 AM, Gordan Bobic wrote: >>> Digimer wrote: >>>> On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>>>>> If you want the FS mounted on all nodes at the same time then all >>>>>>> those nodes must be a part of the cluster, and they have to be >>>>>>> quorate (majority of nodes have to be up). You don't need a quorum >>>>>>> block device, but it can be useful when you have only 2 nodes. >>>>>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup >>>>>> and testing. Ok, so if I have a 3 nodes cluster for exemple, I >>>>>> need at >>>>>> least 2 nodes for the cluster, and thus the gfs, to be up ? I cannot >>>>>> have a running gfs with only one node ? >>>>> In a 2-node cluster, you can have running GFS with just one node >>>>> up. But >>>>> in that case it is advisble to have a quorum block device on the SAN. >>>>> With a 3 node cluster, you cannot have quorum with just 1 node, and >>>>> thus >>>>> you cannot have GFS running. It will block until quorum is >>>>> re-established. >>>> With a quorum disk, you can in fact have one node left and still have >>>> quorum. This is because the quorum drive should have (node-1) votes, >>>> thus always giving the last node 50%+1 even with all other nodes being >>>> dead. >>> I've never tried testing that use-case extensively, but I suspect that >>> it is only safe to do with SAN-side fencing. Otherwise two nodes could >>> lose contact with each other and still both have access to the SAN and >>> thus both be individually quorate. >>> >>> Gordan >> >> Clustered storage *requires* fencing. To not use fencing is like driving >> tired; It's just a matter of time before something bad happens. That >> said, I should have been more clear in specifying the requirement for >> fencing. >> >> Now that said, the fencing shouldn't be needed at the SAN side, though >> that works fine as well. > > The default fencing action, last time I checked, is reboot. 
Consider the > use case where you have a network failure and separate networks for > various things, and you lose connectivity between the nodes but they > both still have access to the SAN. One node gets fenced, reboots, comes > up and connects to the SAN. It connects to the quorum device and has > quorum without the other nodes, and mounts the file systems and starts > writing - while all the other nodes that have become partitioned off do > the same thing. Unless you can fence the nodes from the SAN side, quorum > device having a 50% weight is a recipe for disaster. Agreed, and that is one of the major benefits of qdisk. It prevents a 50/50 split. Regardless though, say you have an eight node cluster and it partitions evenly with no qdisk to tie break. In that case, neither partition has >50% of the votes, so neither should have quorum. In turn, neither should touch the SAN. This is because DLM is required for clustered file systems, and DLM in turn requires quorum. Without quorum, DLM won't run and you will not be able to touch the SAN. :) >> The way it works is: > [...] > > I'm well aware of how fencing works, but you overlooked one major > failure mode that is essentially guaranteed to hose your data if you set > up the quorum device to have 50% of the votes. See above. 50% is not quorum. >> With SAN-side fencing, a fence is in the form of a logic disconnection >> from the storage network. This has no inherent mechanism for recovery, >> so the sysadmin will have to manually recover the node(s). For this >> reason, I do not prefer it. > > Then don't use a quorum device with more than an equal weight to the > individual nodes. > > Gordan How does the number of nodes relate, in this case, to the SAN-side fence recovery? -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From gordan at bobich.net Thu Nov 11 17:59:57 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 17:59:57 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDC1E72.9020703@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <4CDBB194.2020601@bobich.net> <4CDC1E72.9020703@alteeve.com> Message-ID: <4CDC2F1D.2070500@bobich.net> On 11/11/2010 04:48 PM, Digimer wrote: >>> Clustered storage *requires* fencing. To not use fencing is like driving >>> tired; It's just a matter of time before something bad happens. That >>> said, I should have been more clear in specifying the requirement for >>> fencing. >>> >>> Now that said, the fencing shouldn't be needed at the SAN side, though >>> that works fine as well. >> >> The default fencing action, last time I checked, is reboot. Consider the >> use case where you have a network failure and separate networks for >> various things, and you lose connectivity between the nodes but they >> both still have access to the SAN. One node gets fenced, reboots, comes >> up and connects to the SAN. It connects to the quorum device and has >> quorum without the other nodes, and mounts the file systems and starts >> writing - while all the other nodes that have become partitioned off do >> the same thing. Unless you can fence the nodes from the SAN side, quorum >> device having a 50% weight is a recipe for disaster. > > Agreed, and that is one of the major benefits of qdisk. 
It prevents a > 50/50 split. Regardless though, say you have an eight node cluster and > it partitions evenly with no qdisk to tie break. In that case, neither > partition has>50% of the votes, so neither should have quorum. In turn, > neither should touch the SAN. Exactly - qdisk is a tie-breaker. The point I was responding to was the one where somebody suggested giving qdisk a 50% vote weight (i.e. needs only qdisk + 1 node for quorum), which is IMO not a sane way to do it. >> I'm well aware of how fencing works, but you overlooked one major >> failure mode that is essentially guaranteed to hose your data if you set >> up the quorum device to have 50% of the votes. > > See above. 50% is not quorum. No, but 50% + 1 node is quorum, and I'm saying that having qdisk (50%) + 1 node = quorum is not the way to go. >>> With SAN-side fencing, a fence is in the form of a logic disconnection >>> from the storage network. This has no inherent mechanism for recovery, >>> so the sysadmin will have to manually recover the node(s). For this >>> reason, I do not prefer it. >> >> Then don't use a quorum device with more than an equal weight to the >> individual nodes. > > How does the number of nodes relate, in this case, to the SAN-side fence > recovery? It doesn't directly. I'm saying that the only way that giving qdisk 50% of the vote toward quorum is if your fencing is done by the SAN itself. Otherwise any 1 node that comes up has quorum, regardless of how many other are down, which in turn leads to multiple nodes being individually quorate when the connect to the SAN. This situation will trash the shared file system. Gordan From dxh at yahoo.com Thu Nov 11 19:15:03 2010 From: dxh at yahoo.com (Don Hoover) Date: Thu, 11 Nov 2010 11:15:03 -0800 (PST) Subject: [Linux-cluster] What keeps more than one node from grabbing qdisk? Message-ID: <347609.50037.qm@web120711.mail.ne1.yahoo.com> I have seen multiple boxes think they have ownership of the qdisk, is there something that prevents this other than they fence (reboot) each other? And what keeps them from getting into a fence war? From Chris.Jankowski at hp.com Fri Nov 12 01:58:01 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 12 Nov 2010 01:58:01 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDC1D5E.6030905@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDBB621.2090809@bobich.net> <4CDC1D5E.6030905@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3572@GVW1113EXC.americas.hpqcorp.net> Digimer, >>>I am very curious to know how this scenario can happen. As I had previously understood it, this should simply not be possible. Obviously it is though... It actually is very simple. For the mutual simultaneous killing to be guaranteed to happen three conditions are sufficient: 1. The fencing request is generated by the two nodes at the same time. Fulfilled by current design of the fencing. 2. Your fencing device needs to be a separate piece of equipment dedicated to the node to be fenced. Note that iLO or DRAC fulfill the requirement. 3. The implementation of the fencing device needs to be transactional i.e. - accept an order to fence, then execute it after a certain delay. 
Both iLO and DRAC work transactionally and there is sufficient delay.

What happens is simple. Think about it as transactions. Both nodes start at the same time transacting with the corresponding fencing devices. Each fencing device accepts the transaction. Only then, after a small delay, they start executing it. Both fencing devices are at this point committed to the execution and will do what they have been told.

The set of conditions is sufficient in the mathematical sense. In modern networked servers with built-in service processors this set of conditions is almost certainly true for all of them.

The following are possible ways of resolving the problem for this set of sufficient conditions:

1. Invalidate condition 1 - introduce different fixed delays in the fencing agents for each node, e.g. node A - no delay, node B - 2 seconds. This is a good solution, but requires custom programming work. The current cluster design does not allow it as a configuration option.

2. Invalidate condition 2 - a common physical fencing device that will accept only one request from one node. Essentially this serialises the transactions and allows at most one. This is not a clean way to do it, as such a device would be a SPOF.

3. Invalidate condition 3 - make the execution phase conditional on the state of the requestor: in the execution phase, execute the request only if the requestor is still alive. This shrinks, but does not eliminate, the window in which the race condition leads to both nodes going down.

However, I believe that the real solution is to change the mindset of the cluster from "I am the omniscient and omnipotent master of the world and I will shoot anything I do not like" to protecting resources, i.e. protecting shared storage through SCSI reservations, which is what commercial Linux and UNIX clusters do. Alas, the STONITH concept is so ingrained in the minds of the developers of the Linux cluster that this change seems to be impossible to achieve.

--------

Please note that the STONITH concept has other fatal flaws in the modern networked world. Consider, step by step, what would happen to your highly available cluster if a node in the cluster gets completely separated from the network, including its access, its heartbeat and iLO/DRAC network connections. Again, the end result is that you have no access to your supposedly highly available application. From the functional point of view the whole cluster has failed. The core issue, again, is the inadequacy of the STONITH concept. And again, commercial UNIX and Linux clusters deal with this scenario correctly. Their clusters will continue.

Regards,

Chris Jankowski

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer
Sent: Friday, 12 November 2010 03:44
To: linux clustering
Subject: Re: [Linux-cluster] Starter Cluster / GFS

On 10-11-11 04:23 AM, Gordan Bobic wrote:
> Jankowski, Chris wrote:
>> Digimer,
>>
>> 1.
>> Digimer wrote:
>>>>> Both partitions will try to fence the other, but the slower will
>>>>> lose and get fenced before it can fence.
>>
>> Well, this is certainly not my experience in dealing with modern rack
>> mounted or blade servers where you use iLO (on HP) or DRAC (on Dell).
>>
>> What actually happens in two node clusters is that both servers issue
>> the fence request to the iLO or DRAC. It gets processed and *both*
>> servers get powered off. Ouch!! Your 100% HA cluster becomes 100%
>> dead cluster.
> > Indeed, I've seen this, too, on a range of hardware. My quick and > dirty solution was to doctor the fencing agent to add a different > sleep() on each node, in order of survivor preference. There may be a > setting in cluster.conf that can be used to achieve the same effect, > can't remember off the top of my head. > > Gordan I've not seen such an option, though I make no claims to complete knowledge of the options available. I do know that there are pre-device fence options (that is, IPMI has a set of options that differs from DRAC, etc). So perhaps there is an option there. I am very curious to know how this scenario can happen. As I had previously understood it, this should simply not be possible. Obviously it is though... The only thing I can think of is where a fence device is external to the nodes and allows for multiple fence calls at the same time. I would expect that and fence device should terminate a node nearly instantly. If it doesn't or can't, then I would suggest that it not accept a second fence request until after the pending one completes. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Chris.Jankowski at hp.com Fri Nov 12 02:22:16 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 12 Nov 2010 02:22:16 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDC1C0E.6010804@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> Digimer, >>>>I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. Separate hearbeat networks (not a single network with a bonded interface) is what my customers require. I believe this is not available in standard Linux Cluster, as distributed by RedHat. This is completely independent from what fencing device or method is used. >>>>With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. 1. In the world where I work separate power-based devices are not an option. Blade servers do not even have power supplies. They use common power from the blade enclosure. The only access to the power state is through service processor. 2. We are not talking about long delays here. The whole cycle of taking the power off a blade including login to the service processor is less than 1 ms. Delay or lack thereof is not a problem. The transactional nature of the processing is the issue. 
Regards, Chris Jankowski -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Friday, 12 November 2010 03:39 To: linux clustering Cc: Jankowski, Chris Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-11 04:59 AM, Jankowski, Chris wrote: > Gordan, > > I do understand the mechanism. I was trying to gently point out that this behaviour is unacceptable for my commercial IP customers. The customers buy clusters for high availability. Loosing the whole cluster due to single component failure - hearbeat link is not acceptable. The heartbeat link is a huge SPOF. And the cluster design does not support redundant links for heartbeat. > > Also, none of the commercially available UNIX clusters or Linux clusters (HP ServiceGuard, Veritas, SteelEye) would display this type of behaviour and they do not clobber cluster filesystems. So, it is possible to achieve acceptable reaction to this type of failure. > > Regards, > > Chris Jankowski I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Fri Nov 12 02:41:30 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 21:41:30 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDCA95A.2070104@alteeve.com> On 10-11-11 09:22 PM, Jankowski, Chris wrote: > Digimer, > >>>>> I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. > > Separate hearbeat networks (not a single network with a bonded interface) is what my customers require. I believe this is not available in standard Linux Cluster, as distributed by RedHat. This is completely independent from what fencing device or method is used. It is possible. ie: In the above case, should 'an-node02' need to be fenced, the first method 'ipmi' would be used. Should it fail, the next method 'node_assassin' would be tried. >>>>> With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. > > 1. > In the world where I work separate power-based devices are not an option. Blade servers do not even have power supplies. They use common power from the blade enclosure. 
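A minimal sketch of the sort of cluster.conf block being described here - two fence methods for one node, tried in the order they appear. The device names, addresses, credentials and the fence_na parameters are illustrative assumptions rather than Digimer's actual configuration:

    <clusternode name="an-node02.alteeve.com" nodeid="2">
        <fence>
            <!-- Tried first. -->
            <method name="ipmi">
                <device name="ipmi_an02"/>
            </method>
            <!-- Only used if the IPMI call fails. -->
            <method name="node_assassin">
                <device name="batou" port="02"/>
            </method>
        </fence>
    </clusternode>

    <fencedevices>
        <fencedevice name="ipmi_an02" agent="fence_ipmilan"
                     ipaddr="192.168.3.2" login="admin" passwd="secret"/>
        <!-- The Node Assassin agent name and parameters are assumed here. -->
        <fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com"/>
    </fencedevices>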
The only access to the power state is through service processor. Out of curiosity, do the blades have header pins for the power and reset switches? I don't see why they would, but I've not played with traditional blades before. > 2. > We are not talking about long delays here. The whole cycle of taking the power off a blade including login to the service processor is less than 1 ms. Delay or lack thereof is not a problem. The transactional nature of the processing is the issue. > > Regards, > > Chris Jankowski Let me talk to the Red Hat folks and see what they think about configurable per-node user-defined fence delays. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Fri Nov 12 02:47:10 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 21:47:10 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDCAAAE.6020002@alteeve.com> On 10-11-11 09:22 PM, Jankowski, Chris wrote: > 2. > We are not talking about long delays here. The whole cycle of taking the power off a blade including login to the service processor is less than 1 ms. Delay or lack thereof is not a problem. The transactional nature of the processing is the issue. > > Regards, > > Chris Jankowski I forgot to mention; Fence calls can only be sent by nodes with quorum. So a race condition should, as I understand it, be a concern with 2-node clusters only. I'm not entirely sure though on how quorum is determined at the time of partitioning. That is, say you have a three node cluster, and one node disconnects. I need to verify that it checks to see if it has quorum before sending a fence call. I expect that is the case though. 
-- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From Chris.Jankowski at hp.com Fri Nov 12 03:25:41 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 12 Nov 2010 03:25:41 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDCA95A.2070104@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> <4CDCA95A.2070104@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> Digimer, I think you do not make distinction between the network that maintains the hearbeat and the networks to use for fence devices. I'll explain this again. These are two very different things and operated for different purpose. The hearbeat network is between the nodes for the purpose of maintaining cluster membership. The connections from the nodes to your fence devices form the other two networks. In fact speaking of networks in this case is a little limiting. Each of the IP addresses involved may, in principle, be in different IP subnet In the example that you gave, you have two (possibly different) networks for fence devices, as you have two fence devices. However, your cluster membership is maintained through the single hearbeat network implicitly defined through the names of the cluster nodes. I want to have two, independently configurable network like this and heartbeat being sent through both of them. I cannot do this at the moment, as the software will always maintain the hertbeat through the single IP address to which the node name resolves. In your case the heartbeat traffic will always go between an-node01.alteeve.com and an-node02.alteeve.com. What I want is to have hertbeat traffic going between: an-node01h1.alteeve.com and an-node02h1.alteeve.com and between an-node01h2.alteeve.com and an-node02h2.alteeve.com Whereas my application would access the cluster through: an-node01.alteeve.com and an-node02.alteeve.com So I would need minimum of 3 Ethernet interfaces per server and minimum of 6 if all links will be bonded, but this is OK. Regards, Chris Jankowski -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Friday, 12 November 2010 13:42 To: Jankowski, Chris Cc: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-11 09:22 PM, Jankowski, Chris wrote: > Digimer, > >>>>> I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. > > Separate hearbeat networks (not a single network with a bonded interface) is what my customers require. I believe this is not available in standard Linux Cluster, as distributed by RedHat. This is completely independent from what fencing device or method is used. It is possible. ie: In the above case, should 'an-node02' need to be fenced, the first method 'ipmi' would be used. Should it fail, the next method 'node_assassin' would be tried. 
>>>>> With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. > > 1. > In the world where I work separate power-based devices are not an option. Blade servers do not even have power supplies. They use common power from the blade enclosure. The only access to the power state is through service processor. Out of curiosity, do the blades have header pins for the power and reset switches? I don't see why they would, but I've not played with traditional blades before. > 2. > We are not talking about long delays here. The whole cycle of taking the power off a blade including login to the service processor is less than 1 ms. Delay or lack thereof is not a problem. The transactional nature of the processing is the issue. > > Regards, > > Chris Jankowski Let me talk to the Red Hat folks and see what they think about configurable per-node user-defined fence delays. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Fri Nov 12 03:43:33 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 22:43:33 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> <4CDCA95A.2070104@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDCB7E5.40204@alteeve.com> On 10-11-11 10:25 PM, Jankowski, Chris wrote: > Digimer, > > I think you do not make distinction between the network that maintains the hearbeat and the networks to use for fence devices. I'll explain this again. Perhaps. In my clusters, I use at least three interfaces on three separate subnets... I put IPMI on one and NA on the second. > These are two very different things and operated for different purpose. > > The hearbeat network is between the nodes for the purpose of maintaining cluster membership. > The connections from the nodes to your fence devices form the other two networks. > > In fact speaking of networks in this case is a little limiting. Each of the IP addresses involved may, in principle, be in different IP subnet > > In the example that you gave, you have two (possibly different) networks for fence devices, as you have two fence devices. You can use the element to define a second totem ring (redundant ring protocol) to act as a backup, on a second subnet, for backup cluster communication. > However, your cluster membership is maintained through the single hearbeat network implicitly defined through the names of the cluster nodes. I want to have two, independently configurable network like this and heartbeat being sent through both of them. 
I cannot do this at the moment, as the software will always maintain the hertbeat through the single IP address to which the node name resolves. In your case the heartbeat traffic will always go between an-node01.alteeve.com and an-node02.alteeve.com. > > What I want is to have hertbeat traffic going between: > an-node01h1.alteeve.com and an-node02h1.alteeve.com > and between > an-node01h2.alteeve.com and an-node02h2.alteeve.com > Whereas my application would access the cluster through: > an-node01.alteeve.com and an-node02.alteeve.com > > So I would need minimum of 3 Ethernet interfaces per server and minimum of 6 if all links will be bonded, but this is OK. Exactly what I do, though RRP is currently limited to two interfaces due to inherent complexities preventing going beyond that. There is work on a newly-announced project that will allow for n-number of paths, but that's alpha stage at this point. > Regards, > > Chris Jankowski Cheers -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From Chris.Jankowski at hp.com Fri Nov 12 04:10:08 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 12 Nov 2010 04:10:08 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDCB7E5.40204@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> <4CDCA95A.2070104@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> <4CDCB7E5.40204@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3656@GVW1113EXC.americas.hpqcorp.net> Digimer, >>>You can use the element to define a second totem ring (redundant ring protocol) to act as a backup, on a second subnet, for backup cluster communication. Thank you. I was not aware of this option . Is this documented anywhere, so I can read it? Regards, Chris Jankowski -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Friday, 12 November 2010 14:44 To: Jankowski, Chris Cc: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-11 10:25 PM, Jankowski, Chris wrote: > Digimer, > > I think you do not make distinction between the network that maintains the hearbeat and the networks to use for fence devices. I'll explain this again. Perhaps. In my clusters, I use at least three interfaces on three separate subnets... I put IPMI on one and NA on the second. > These are two very different things and operated for different purpose. > > The hearbeat network is between the nodes for the purpose of maintaining cluster membership. > The connections from the nodes to your fence devices form the other two networks. > > In fact speaking of networks in this case is a little limiting. Each > of the IP addresses involved may, in principle, be in different IP > subnet > > In the example that you gave, you have two (possibly different) networks for fence devices, as you have two fence devices. 
You can use the element to define a second totem ring (redundant ring protocol) to act as a backup, on a second subnet, for backup cluster communication. > However, your cluster membership is maintained through the single hearbeat network implicitly defined through the names of the cluster nodes. I want to have two, independently configurable network like this and heartbeat being sent through both of them. I cannot do this at the moment, as the software will always maintain the hertbeat through the single IP address to which the node name resolves. In your case the heartbeat traffic will always go between an-node01.alteeve.com and an-node02.alteeve.com. > > What I want is to have hertbeat traffic going between: > an-node01h1.alteeve.com and an-node02h1.alteeve.com and between > an-node01h2.alteeve.com and an-node02h2.alteeve.com Whereas my > application would access the cluster through: > an-node01.alteeve.com and an-node02.alteeve.com > > So I would need minimum of 3 Ethernet interfaces per server and minimum of 6 if all links will be bonded, but this is OK. Exactly what I do, though RRP is currently limited to two interfaces due to inherent complexities preventing going beyond that. There is work on a newly-announced project that will allow for n-number of paths, but that's alpha stage at this point. > Regards, > > Chris Jankowski Cheers -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Fri Nov 12 04:56:14 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 23:56:14 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3656@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> <4CDCA95A.2070104@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> <4CDCB7E5.40204@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F3656@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDCC8EE.6090500@alteeve.com> On 10-11-11 11:10 PM, Jankowski, Chris wrote: > Digimer, > >>>> You can use the element to define a second totem ring (redundant ring protocol) to act as a backup, on a second subnet, for backup cluster communication. > > Thank you. I was not aware of this option . Is this documented anywhere, so I can read it? > > Regards, > > Chris Jankowski Not officially, but I've been working on documenting all of the options. I make no claim to accuracy, so please read with that in mind. Of course, corrections and feedback are appreciated. :) http://wiki.alteeve.com/index.php/Cluster.conf#Element.3B_altname -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From Chris.Jankowski at hp.com Mon Nov 15 03:30:13 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Mon, 15 Nov 2010 03:30:13 +0000 Subject: [Linux-cluster] XFS as a servicein RHEL 6 Linux Cluster. 
Message-ID: <036B68E61A28CA49AC2767596576CD596F58F8863E@GVW1113EXC.americas.hpqcorp.net> Hi, RHEL 6 now officially supports XFS, as an additional subscription option, I believe. Does the RHEL 6 Linux Cluster provide the necessary module to configure an XFS filesystem as a failover service? Thanks and regards, Chris Jankowski -------------- next part -------------- An HTML attachment was scrubbed... URL: From noreply at boxbe.com Mon Nov 15 09:18:01 2010 From: noreply at boxbe.com (noreply at boxbe.com) Date: Mon, 15 Nov 2010 01:18:01 -0800 (PST) Subject: [Linux-cluster] Starter Cluster / GFS (Action Required) Message-ID: <688108002.22833.1289812681752.JavaMail.prod@app004.boxbe.com> Hello linux clustering, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, vishalspatil at gmail.com About this Notice Boxbe prioritizes and screens email using a personal Guest List and your extended social network. It's free, it removes clutter, and it helps you focus on the people who matter to you. Visit http://www.boxbe.com/how-it-works?tc=5902846179_739730772 End Email Overload -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... From: "Jankowski, Chris" Subject: Re: [Linux-cluster] Starter Cluster / GFS Date: Fri, 12 Nov 2010 04:10:08 +0000 Size: 4854 URL: From radu.rendec at mindbit.ro Mon Nov 15 12:34:22 2010 From: radu.rendec at mindbit.ro (Radu Rendec) Date: Mon, 15 Nov 2010 14:34:22 +0200 Subject: [Linux-cluster] rgmanager blocked Message-ID: <1289824462.3353.70.camel@localhost> Hello, I'm trying to migrate an older Centos 5 / rhcs2 cluster to the newer rhcs3. Being eager to play around, I decided to make my tests on Fedora 14, before Centos 6 is out. Although everything seemed to work fine at the beginning, after a few hours of cluster uptime I came across a strange situation of rgmanager being apparently blocked. The process is still there, but: 1. It no longer produces any output - it's run in a "screen" session, with params "-fd". Normally it's very verbose (I can see a lot of debug messages, including output from agent scripts). It's been more than a week since it blocked, and it hadn't output a sigle line of debug. 2. Resources from node 1 were (automatically) relocated to node 2 when node 1 blocked, but node 2 blocked in a similar manner a few hours later. 3. Now resources are still active on node 2, on both nodes a "clustat" looks like this: Service states unavailable: Temporary failure; try again Cluster Status for ****** @ Mon Nov 15 14:14:22 2010 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ storage1.****** 1 Online, Local storage2.****** 2 Online I've already tried several simple things like: * looking at the process tree for some hung resource agents - no luck; it's just clurgmgrd and its child threads; * looking at the open files of clurgmgrd in /proc/NNN/fd - nothing unusual * tracing (with strace) the main clurgmgrd thread and the children. At this point I'm totally clueless, so any suggestion would be welcome. I can provide further info / logs about the running system / processes. Thanks, Radu Rendec From fdinitto at redhat.com Mon Nov 15 14:48:30 2010 From: fdinitto at redhat.com (Fabio M. 
From Colin.Simpson at iongeo.com Mon Nov 15 18:57:04 2010 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Mon, 15 Nov 2010 18:57:04 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CDBD692.4020108@srce.hr> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> <4CDBB533.3030805@bobich.net> <4CDBD692.4020108@srce.hr> Message-ID: <1289847424.16298.18.camel@cowie> Out of interest (for my own setup), does anyone know if there are any massive negatives to keeping the service config files on a GFS2 volume? It just seems like a nice lazy approach to distributing them to me, especially as on GFS2 you have shared storage anyway. My one thought was that cleanly shutting a service down (or, more likely, checking whether it's down on a node) might require the config file at a point when the GFS2 volume has not been mounted. Thanks Colin On Thu, 2010-11-11 at 12:42 +0100, Jakov Sosic wrote: > >>> I do not want to keep configurations of EACH service on a shared disk, > >>> for some services I want to have configurations on each node available. > >>> > >>> This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.

From jakov.sosic at srce.hr Mon Nov 15 21:29:05 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Mon, 15 Nov 2010 22:29:05 +0100 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <1289847424.16298.18.camel@cowie> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> <4CDBB533.3030805@bobich.net> <4CDBD692.4020108@srce.hr> <1289847424.16298.18.camel@cowie> Message-ID: <4CE1A621.6090508@srce.hr> On 11/15/2010 07:57 PM, Colin Simpson wrote: > Out of interest (for my own setup), does anyone know if there are any > massive negatives to keeping the service config files on a GFS2 volume? > It just seems like a nice lazy approach to distributing them to me, > especially as on GFS2 you have shared storage anyway. > > My one thought was that cleanly shutting a service down (or, more likely, > checking whether it's down on a node) might require the config file at a > point when the GFS2 volume has not been mounted. I fail to see any negatives in that kind of setup. I don't use GFS2 too often, though, so I try to solve this problem differently. @Gordan, I've tried csync2 and it's a great tool. It works exactly the way I wanted! Thank you very much, I'm prepping it for a big push into production environments :) -- Jakov Sosic
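Since csync2 is the alternative Jakov settled on, a brief sketch of the idea: csync2 keeps chosen configuration directories identical across the nodes over its own key-authenticated channel. The group name, host names, and paths below are invented for illustration:

    # /etc/csync2.cfg (the same file on both nodes)
    group cluster_configs {
        host node1.example.com;
        host node2.example.com;
        key  /etc/csync2.key_cluster;   # shared key generated with "csync2 -k"
        include /etc/httpd/conf;
        include /etc/postfix;
        exclude *~;
    }

Running "csync2 -xv" (by hand or from cron) then pushes local changes to the other node.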
From gordan at bobich.net Mon Nov 15 21:46:52 2010 From: gordan at bobich.net (Gordan Bobic) Date: Mon, 15 Nov 2010 21:46:52 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CE1A621.6090508@srce.hr> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> <4CDBB533.3030805@bobich.net> <4CDBD692.4020108@srce.hr> <1289847424.16298.18.camel@cowie> <4CE1A621.6090508@srce.hr> Message-ID: <4CE1AA4C.2050609@bobich.net> On 11/15/2010 09:29 PM, Jakov Sosic wrote: > @Gordan, I've tried csync2 and it's a great tool. It works exactly the > way I wanted! Thank you very much, I'm prepping it for a big push into > production environments :) Glad I could help. :) Gordan

From ag8817282 at gideon.org Wed Nov 17 20:22:53 2010 From: ag8817282 at gideon.org (Andrew Gideon) Date: Wed, 17 Nov 2010 15:22:53 -0500 Subject: [Linux-cluster] GFS+DRBD+Quorum: Help wrap my brain around this Message-ID: <1290025373.7401.1158.camel@carrot> I'm trying to figure out the best solution for GFS+DRBD. My mental block isn't really with GFS, though, but with clustered LVM (I think). I understand the quorum problem with a two-node cluster. And I understand that DRBD is not suitable for use as a quorum disk (presumably because it too would suffer from any partitioning, unlike a physical array connected directly to both nodes). Am I right so far? What I'd really like to do is have a three (or more) node cluster with two nodes having access to the DRBD storage. This solves the quorum problem (effectively having the third node as a quorum server). But when I try to create a volume on a volume group on a device shared by two nodes of a three node cluster, I get an error indicating that the volume group cannot be found on the third node. Which is true: the shared volume isn't available on that node. In the Cluster Logical Volume Manager document, I found: By default, logical volumes created with CLVM on shared storage are visible to all computers that have access to the shared storage. What I've not figured out is how to tell CLVMD (or whomever) that only nodes one and two have access to the shared storage. Is there a way to do this? I've also read, in the GFS2 Overview document: When you configure a GFS2 file system as a cluster file system, you must ensure that all nodes in the cluster have access to the shared storage This suggests that a cluster running GFS must have access to the storage on all nodes. Which would clearly block my idea for a three node cluster with only two nodes having access to the shared storage. I do have one idea, but it sounds like a more complex version of a Rube Goldberg device: A two node cluster with a third machine providing access to a device via iSCSI. The LUN exported from that third system could be used as the quorum disk by the two cluster nodes (effectively making that little iSCSI target the quorum server). This assumes that a failure of the quorum disk in an otherwise healthy two node cluster is survived. I've yet to confirm this. This seems ridiculously complex, so much so that I cannot imagine that there's not a better solution. But I just cannot get my brain wrapped around this well enough to see it. Any suggestions would be very welcome. Thanks...
Andrew

From ag8817282 at gideon.org Wed Nov 17 20:36:19 2010 From: ag8817282 at gideon.org (Andrew Gideon) Date: Wed, 17 Nov 2010 15:36:19 -0500 Subject: [Linux-cluster] A fencing mechanism for Xen (or KVM) guests Message-ID: <1290026179.7401.1167.camel@carrot> I found myself unhappy with what I located for fencing of Xen guests, so I put together a new mechanism. Would this be of interest to anyone else? The node on which fence_node is called uses SSH to connect to the list of hypervisors. The connection is key based, which limits the nodes to execution of the specific fencing command and also lets a given node fence only a guest that's in a specific list. This prevents a node of one cluster from fencing a node of another even if they reside on the same set of hypervisors. The fencing script issues the fence command (via SSH) to each hypervisor. Success of the command requires either (1) a guest of the specified name is found and destroyed on at least one hypervisor or (2) every hypervisor has been visited and reported that there is no such guest running. #2 was an interesting choice, BTW, on which I'd welcome feedback. The alternative would have been to presume that an unreachable hypervisor was down. That didn't seem like the best choice to me, but I'm curious what others might think. Thanks... Andrew

From Colin.Simpson at iongeo.com Wed Nov 17 21:02:35 2010 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Wed, 17 Nov 2010 21:02:35 +0000 Subject: [Linux-cluster] GFS+DRBD+Quorum: Help wrap my brain around this In-Reply-To: <1290025373.7401.1158.camel@carrot> References: <1290025373.7401.1158.camel@carrot> Message-ID: <1290027755.4270.33.camel@cowie> You are right so far in your first paragraph. You cannot totally solve the quorum problem with a two-node cluster. The basic issue you are really trying to address is avoiding a split-brain scenario; that is really all quorum is giving you. So with DRBD your best bet is to do your level best to avoid a split brain with your two nodes. Use decent fencing (maybe multiple fence methods), have redundant bonded network links and interlinks (I'm looking at splitting these over two physical cards on the nodes), set up DRBD's startup waiting appropriately, and be careful at startup (see the scenario below). Then just tell RHCS that you want to run with 2 nodes in cluster.conf (the example snippet was scrubbed from the archive; see the sketch after this message). And in drbd.conf I have:

    startup {
        wfc-timeout 300;       # Wait 300 for initial connection
        degr-wfc-timeout 60;   # Wait only 60 seconds if this node was a degraded cluster
        become-primary-on both;
    }

Many may prefer the system to wait indefinitely in DRBD on some of these conditions (to manually bring stuff up in a bad situation). So basically here I will wait 5 minutes for the other node to join my DRBD before doing any cluster stuff, but wait less (60s) if I was degraded already (I'm assuming my other node is probably broken for an extended period in that case, so I want the surviving server up pretty quickly). I'm still thinking this through just now. On a two-node, non-shared-storage setup you can never fully guard against the scenario of node A being shut down, then node B being shut down later, and then node A being brought up with no way of knowing that it has older data than B, if B is still down. You can mitigate this, though, by ensuring that you set up DRBD to wait long enough (or forever) on boot, and/or by being careful to start things up in the right order after long periods of downtime on one node (the good node needs to be up already).
Just needs a bit of scenario thought. Three nodes just adds needless complexity from what you are saying. That's my thoughts on this, I'm pretty new to this too. Just how I'm thinking this should work just now. Colin On Wed, 2010-11-17 at 15:22 -0500, Andrew Gideon wrote: > I'm trying to figure out the best solution for GFS+DRBD. My mental > block isn't really with GFS, though, but with clustered LVM (I think). > > I understand the quorum problem with a two-node cluster. And I > understand that DRBD is not suitable for use as a quorum disk > (presumably because it too would suffer from any partitioning, unlike a > physical array connected directly to both nodes). > > Am I right so far? > > What I'd really like to do is have a three (or more) node cluster with > two nodes having access to the DRBD storage. This solves the quorum > problem (effectively having the third node as a quorum server). > > But when I try to create a volume on a volume group on a device shared > by two nodes of a three node cluster, I get an error indicating that the > volume group cannot be found on the third node. Which is true: the > shared volume isn't available on that node. > > In the Cluster Logical Volume Manager document, I found: > > By default, logical volumes created with CLVM on shared storage > are visible to all computers that have access to the shared > storage. > > What I've not figured out is how to tell CLVMD (or whomever) that only > nodes one and two have access to the shared storage. Is there a way to > do this? > > I've also read, in the GFS2 Overview document: > > When you configure a GFS2 file system as a cluster file system, > you must ensure that all nodes in the cluster have access to the > shared storage > > This suggests that a cluster running GFS must have access to the storage > on all nodes. Which would clearly block my idea for a three node > cluster with only two nodes having access to the shared storage. > > I do have one idea, but it sounds like a more complex version of a Rube > Goldberg device: A two node cluster with a third machine providing > access to a device via iSCSI. The LUN exported from that third system > could be used as the quorum disk by the two cluster nodes (effectively > making that little iSCSI target the quorum server). > > This assumes that a failure of the quorum disk in an otherwise healthy > two node cluster is survived. I've yet to confirm this. > > This seems ridiculously complex, so much so that I cannot imagine that > there's not a better solution. But I just cannot get my brain wrapped > around this well enough to see it. > > Any suggestions would be very welcome. > > Thanks... > > Andrew > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original. 
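A minimal sketch of the two-node settings described in the message above. The resource name is a placeholder, and the drbd.conf values simply repeat the numbers quoted above. In cluster.conf, a two-node cluster is normally declared with:

    <cman two_node="1" expected_votes="1"/>

and the matching drbd.conf startup section would look something like:

    resource r0 {                    # "r0" is just an example resource name
        startup {
            wfc-timeout      300;    # wait up to 300s for the peer on a normal start
            degr-wfc-timeout 60;     # wait only 60s if this node was last degraded
            become-primary-on both;  # promote both nodes at startup (dual-primary for GFS2)
        }
    }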
From Jost.Rakovec at snt.si Thu Nov 18 12:03:38 2010 From: Jost.Rakovec at snt.si (Rakovec Jost) Date: Thu, 18 Nov 2010 13:03:38 +0100 Subject: [Linux-cluster] A fencing mechanism for Xen (or KVM) guests In-Reply-To: <1290026179.7401.1167.camel@carrot> References: <1290026179.7401.1167.camel@carrot> Message-ID: <3754ED14F3EE0C459DEFE2DF184515FF0F101C7241@SIMAIL.snt-is.com> Hi, I would like to try it. Where can I get your software? thx br jost ________________________________________ From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com] On Behalf Of Andrew Gideon [ag8817282 at gideon.org] Sent: Wednesday, November 17, 2010 9:36 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] A fencing mechanism for Xen (or KVM) guests I found myself unhappy with what I located for fencing of Xen guests, so I put together a new mechanism. Would this be of interest to anyone else? The node on which fence_node is called uses SSH to connect to the list of hypervisors. The connection is key based, which limits the nodes to execution of the specific fencing command and also lets a given node fence only a guest that's in a specific list. This prevents a node of one cluster from fencing a node of another even if they reside on the same set of hypervisors. The fencing script issues the fence command (via SSH) to each hypervisor. Success of the command requires either (1) a guest of the specified name is found and destroyed on at least one hypervisor or (2) every hypervisor has been visited and reported that there is no such guest running. #2 was an interesting choice, BTW, on which I'd welcome feedback. The alternative would have been to presume that an unreachable hypervisor was down. That didn't seem like the best choice to me, but I'm curious what others might think. Thanks... Andrew -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster

From Colin.Simpson at iongeo.com Thu Nov 18 17:14:27 2010 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Thu, 18 Nov 2010 17:14:27 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CE1AA4C.2050609@bobich.net> References: <4CE1AA4C.2050609@bobich.net> Message-ID: <1290100467.25543.3.camel@cowie> Sorry to invade your thread, but I have a query I'd like to post as a new thread; every time I do, it never seems to turn up - it just disappears into a black hole. I tried linux-cluster-owner at redhat.com, but have had no reply there either. Anyone know how I can get permission to start a new thread, or what I'm doing wrong? Thanks Colin On Mon, 2010-11-15 at 21:46 +0000, Gordan Bobic wrote: > On 11/15/2010 09:29 PM, Jakov Sosic wrote: > > > @Gordan, I've tried csync2 and it's a great tool. It works exactly > the > > way I wanted! Thank you very much, I'm prepping it for a big push into > > production environments :) > > Glad I could help. :) > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited.
If you received this email in error, please immediately notify the sender and delete the original.

From dan.candea at quah.ro Thu Nov 18 20:26:18 2010 From: dan.candea at quah.ro (Dan Candea) Date: Thu, 18 Nov 2010 22:26:18 +0200 Subject: [Linux-cluster] clusterfs.sh Message-ID: <4CE58BEA.3030007@quah.ro> Hello, I'm using cluster-3.0.17 and I'm mounting a shared gfs2 storage with force_unmount="1". When rgmanager crashes on one node, it crashes on all the other nodes as well, and a reboot is the only option, because the shared storage is not unmounted and every process using it freezes. Before figuring out why it crashes, I'm trying to understand why the storage is not unmounted. In the log file I receive:

    Not unmounting clusterfs:backupfs - still in use by 1 other service(s)

Does the above message mean that the processes using the fs are not killed, or that my services are not configured correctly? I have something like below
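The configuration the poster refers to ("something like below") did not survive the archive, so what follows is a purely hypothetical illustration - the device, mount point, and service names are invented - of the kind of layout that produces that log line. When a single clusterfs resource is referenced from more than one service, clusterfs.sh skips the forced unmount as long as any other service still references the same file system:

    <rm>
      <resources>
        <clusterfs name="backupfs" device="/dev/vg_shared/lv_backup"
                   mountpoint="/backup" fstype="gfs2" force_unmount="1"/>
      </resources>
      <service name="backup" autostart="1">
        <clusterfs ref="backupfs"/>
      </service>
      <service name="web" autostart="1">
        <clusterfs ref="backupfs"/>
      </service>
    </rm>

Stopping only the "backup" service leaves "web" holding a reference, which is what the "still in use by 1 other service(s)" message reports; processes on the file system are only killed once no other service needs the mount.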