From ifetch at du.edu  Tue Nov  2 04:45:11 2010
From: ifetch at du.edu (Ivan Fetch)
Date: Mon, 1 Nov 2010 22:45:11 -0600
Subject: [Linux-cluster] Some questions - migrating from Sun to Red Hat cluster
Message-ID: <79C2D3933C76AB41B6D135F3480AC5C957C470EC@EXCH.du.edu>

Hello,

I have been using two CentOS 5.5 virtual machines to learn Linux clustering,
as a potential replacement for Sun (Sparc) clusters. We run Red Hat
Enterprise Linux, but do not yet have any production cluster experience.
I've got a few questions which I'm stuck on:

Is it possible to stop or restart one resource, instead of the entire
resource group (service)? This can be handy when you want to work on a
resource (Apache) without having the cluster restart it out from under you,
but you still want your storage and IP to stay online. It seems like the
clusvcadm command only operates on services; groups of resources.

What is the most common way to create and adjust service definitions -
using Luci, editing cluster.conf by hand, using command-line tools, or
something else?

For a non-global filesystem which follows a service, is HA LVM the way to
go? I have seen some recommendations against HA LVM, because LVM tagging
being reset on a node can allow that node to touch the LVM out of turn.

What is the recommended way to make changes to an HA LVM, or add a new HA
LVM, when lvm.conf on the cluster nodes is already configured to tag? I have
accomplished this by temporarily editing lvm.conf on one node, removing the
tag line, and then making the necessary changes to the LVM - it seems like
there is likely a better way to do this.

Will the use of a quorum disk help to keep one node from fencing the other
at boot (e.g. node1 is running, node2 boots and fences node1)? This fencing
does not happen every time I boot node2 - I may need to reproduce this and
provide logs.

Thank you for your help,

Ivan.

From thomas at sjolshagen.net  Tue Nov  2 10:14:17 2010
From: thomas at sjolshagen.net (Thomas Sjolshagen)
Date: Tue, 02 Nov 2010 06:14:17 -0400
Subject: [Linux-cluster] Some questions - migrating from Sun to Red Hat cluster
In-Reply-To: <79C2D3933C76AB41B6D135F3480AC5C957C470EC@EXCH.du.edu>
References: <79C2D3933C76AB41B6D135F3480AC5C957C470EC@EXCH.du.edu>
Message-ID: <3400c42be3875da5ec83db8002080e5b@www.sjolshagen.net>

On Mon, 1 Nov 2010 22:45:11 -0600, Ivan Fetch wrote:

> Hello,
>
> I have been using two CentOS 5.5 virtual machines to learn Linux
> clustering, as a potential replacement for Sun (Sparc) clusters. We
> run Red Hat Enterprise Linux, but do not yet have any production
> cluster experience. I've got a few questions which I'm stuck on:
>
> Is it possible to stop or restart one resource, instead of the entire
> resource group (service)? This can be handy when you want to work on a
> resource (Apache) without having the cluster restart it out from under
> you, but you still want your storage and IP to stay online. It seems
> like the clusvcadm command only operates on services; groups of
> resources.
I don't know if this is the officially sanctioned way, but I tend to freeze
the group/service (clusvcadm -Z) and then use the start/stop service script
(service httpd reload, etc.) to manipulate the daemons. (I've got a
multi-daemon mail server service that brings up postfix + amavisd + sqlgrey,
++ so this is handy here.)

> What is the most common way to create and adjust service definitions
> - using Luci, editing cluster.conf by hand, using command-line tools,
> or something else?

I'm a die-hard CLI guy, so I tend to prefer editing by hand & validating the
cluster.conf file before loading it/using it (had a couple of typos that
caused me grief as far as keeping things running goes).

> For a non-global filesystem, which follows a service, is HA LVM the
> way to go? I have seen some recommendations against HA LVM, because
> LVM tagging being reset on a node can allow that node to touch the
> LVM out of turn.
>
> What is the recommended way to make changes to an HA LVM, or add a
> new HA LVM, when lvm.conf on the cluster nodes is already configured
> to tag? I have accomplished this by temporarily editing lvm.conf on
> one node, removing the tag line, and then making the necessary changes
> to the LVM - it seems like there is likely a better way to do this.
>
> Will the use of a quorum disk help to keep one node from fencing the
> other at boot (e.g. node1 is running, node2 boots and fences node1)?
> This fencing does not happen every time I boot node2 - I may need to
> reproduce this and provide logs.

I think, perhaps, you may need/want the clean_start setting included so as
to avoid this? IIRC, setting clean_start helped me avoid fencing of the
surviving node at restart. I use the quorum disk to ensure less confusion by
the nodes during reboot scenarios too, though.

hth,

// Thomas

From bturner at redhat.com  Tue Nov  2 16:04:27 2010
From: bturner at redhat.com (Ben Turner)
Date: Tue, 2 Nov 2010 12:04:27 -0400 (EDT)
Subject: [Linux-cluster] Fence Issue on BL 460C G6
In-Reply-To: <1845350465.1095071288713706952.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com>
Message-ID: <142457019.1095461288713867261.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com>

Your nodes don't seem to be able to communicate:

Oct 30 16:08:15 rhel-cluster-node2 fenced[3549]: rhel-cluster-node1.mgmt.local not a cluster member after 3 sec post_join_delay
Oct 30 16:08:15 rhel-cluster-node2 fenced[3549]: fencing node "rhel-cluster-node1.mgmt.local"
Oct 30 16:08:29 rhel-cluster-node2 fenced[3549]: fence "rhel-cluster-node1.mgmt.local" success

I never see them form a cluster:

Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] CLM CONFIGURATION CHANGE
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] New Configuration:
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] r(0) ip(10.4.1.102)
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] Members Left:
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] Members Joined:
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] CLM CONFIGURATION CHANGE
Oct 30 16:03:25 rhel-cluster-node2 openais[3511]: [CLM ] New Configuration:
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [CLM ] r(0) ip(10.4.1.102)
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [CLM ] Members Left:
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [CLM ] Members Joined:
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [SYNC ] This node is within the primary component and will provide service.
Oct 30 16:03:26 rhel-cluster-node2 openais[3511]: [TOTEM] entering OPERATIONAL state.
Are the nodes just rebooting each other in a cycle? If so my guess is that you are having issues routing the multicast traffic. An easy test is to try using broadcast. Change your cman tag to say: If your nodes can form a cluster with that set then you need to evaluate your multicast config. -Ben ----- "Wahyu Darmawan" wrote: > Hi all, > > Thanks. I?ve replaced mainboard on both servers. But there?s another > problem. Both servers active after mainboard replaced. > > > > But, when I restart the node that is active, other node will be > restarted as well. This happened during fencing. > > Repeated occurrence, which would in turn lead to both restart > repeatedly. > > > > Need your suggestion please.. > > Please find the attachment of /var/log/messages/ > > And, here?s my cluster.conf > > > > post_join_delay="3"/> > > votes="1"> > > > > > > > votes="1"> > > > > > > > > votes="2"> > > > > > login="Administrator" name="NODE2-ILO" passwd="password"/> > login="Administrator" name="NODE1-ILO" passwd="password"/> > > > > restricted="0"> > priority="1"/> > priority="1"/> > > > > > > name="IP_Virtual" recovery="relocate"> > > > > > > > > Thanks, > > > > > > > > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dustin Henry > Offutt > Sent: Thursday, October 28, 2010 11:46 PM > To: linux clustering > Subject: Re: [Linux-cluster] Fence Issue on BL 460C G6 > > > > I believe your problem is being caused by "nofailback" being set to > "1". : > > restricted="0"> > > Set it to zero and I believe your problem will be resolved. > > > On Wed, Oct 27, 2010 at 10:43 PM, Wahyu Darmawan < > wahyu at vivastor.co.id > wrote: > > Hi Ben, > Here is my cluster.conf. Need your help please. > > > > > post_join_delay="3"/> > > votes="1"> > > > > > > > votes="1"> > > > > > > > > votes="2"> > > > > > login="Administrator" name="NODE2-ILO" passwd="password"/> > login="Administrator" name="NODE1-ILO" passwd="password"/> > > > > restricted="0"> > priority="1"/> > priority="1"/> > > > > > > name="IP_Virtual" recovery="relocate"> > > > > > > Many thanks, > Wahyu > > > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com ] On Behalf Of Ben Turner > Sent: Thursday, October 28, 2010 12:18 AM > To: linux clustering > Subject: Re: [Linux-cluster] Fence Issue on BL 460C G6 > > My guess is there is a problem with fencing. Are you running fence_ilo > with an HP blade? Iirc the iLOs on the blades have a different CLI, I > don't think fence_ilo will work with them. What do you see in the > messages files during these events? If you see failed fence messages > you may want to look into using fence_ipmilan: > > http://sources.redhat.com/cluster/wiki/IPMI_FencingConfig > > If you post a snip of your messages file from this event and your > cluster.conf I will have a better idea of what is going on. > > -b > > > > ----- "Wahyu Darmawan" < wahyu at vivastor.co.id > wrote: > > > Hi all, > > > > > > > > For fencing, I?m using HP iLO and server is BL460c G6. Problem is > > resource is start moving to the passive when the failed node is > power > > on. It is really strange for me. For example, I shutdown the node1 > and > > physically remove the node1 machine from the blade chassis and > monitor > > the clustat output, clustat was still showing that the resource is > on > > node 1, even node 1 is power down and removed from c7000 blade > > chassis. 
> > But when I plugged the failed node1 back into the c7000 blade
> > chassis and it powered on, then clustat showed that the resource
> > started moving to the passive node from the failed node.
> > I'm powering down the blade server with the power button in front of it,
> > then we remove it from the chassis. If we face a hardware problem in
> > our active node and the active node goes down, then how does the resource
> > move to the passive node? In addition, when I rebooted or shut down the
> > machine from the CLI, then the resource moved successfully from the
> > passive node. Furthermore, when I shut down the active node with the
> > "shutdown -hy 0" command, after shutting down, the active node
> > automatically restarts.
> >
> > Please help me.
> >
> > Many Thanks,

From corey.kovacs at gmail.com  Tue Nov  2 20:14:34 2010
From: corey.kovacs at gmail.com (Corey Kovacs)
Date: Tue, 2 Nov 2010 20:14:34 +0000
Subject: [Linux-cluster] ha-lvm
Message-ID: 

Folks,

I have a 5 node cluster backed by an FC SAN with 5 VGs, each with a single LV.

I am using ha_lvm and have lvm.conf configured to use tags as per the
instructions. Things work fine until I try to migrate the volume containing
our home dir (all others work as expected). The umount for that volume fails
and depending on the active config, the node reboots itself (self_fence=1)
or it simply fails and gets disabled.

lsof doesn't reveal anything "holding" onto that mount point, yet the umount
fails consistently (force_umount is enabled).

Furthermore, it appears I have at least one of my VGs with bad tags; is
there a way to show what tags a VG has?

I've gone over the config several times and although I cannot show the
config, here is a basic rundown in case something jumps out...

5 nodes, dl360g5 2xQcore w/16GB ram
EVA8100
2x4GB FC, multipath
5 VGs, each w/a single lv, each with an ext3 fs.
ha lvm is in use as a measure of protection for the ext3 fs's
local locking only via lvm.conf
tags enabled via lvm.conf
initrd's are newer than the lvm.conf changes.

I did notice that the ext3 label in use on the home volume was not of the
form /home (it was /ha_home) from early testing, but I've corrected that and
the umount failure still occurs.

If anyone has any ideas I'd appreciate it.

From Chris.Jankowski at hp.com  Wed Nov  3 02:15:25 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Wed, 3 Nov 2010 02:15:25 +0000
Subject: [Linux-cluster] ha-lvm
In-Reply-To: 
References: 
Message-ID: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net>

Corey,

I vaguely remember from my work on UNIX clusters many years ago that if /dir
is the mount point of a mounted filesystem, then cd into /dir or into any
directory below /dir from an interactive shell will prevent an unmount of
the filesystem, i.e. umount /dir will fail. I believe that this restriction
is because it would create an inconsistency in the state of the shell
process. lsof will not show it.

Of course most users after login end up in the home directory by default.
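A quick way to see that failure mode on a scratch mount (the device and path
here are made up, not taken from Corey's setup):

  # mount /dev/test_vg/test_lv /mnt/test
  # cd /mnt/test
  # umount /mnt/test    <- refused as busy while the shell's working directory is inside the mount
  # cd /
  # umount /mnt/test    <- succeeds once nothing has its cwd or open files on the filesystem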
I believe that Linux will have the same semantics as UNIX. You can test that easily on a standalone Linux box. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Corey Kovacs Sent: Wednesday, 3 November 2010 07:15 To: linux clustering Subject: [Linux-cluster] ha-lvm Folks, I have a 5 node cluster backed by an FC SAN with 5 VG's each with a single LVM. I am using ha_lvm and have lvm.conf configured to use tags as per the instructions. Things work fine until I try to migrate the volume containing our home dir (all others work as expected) The umount for that volume fails and depending on the active config, the node reboots itself (self_fence=1) or it simply fails and get's disabled. lsof doesn't reveal anything "holding" onto that mount point yet the umount fails consistently (force_umount is enabled) Furthermore, it appears I have at least one ov my VG's with bad tags, is there a way to show what tags a VG has? I've gone over the config several times and although I cannot show the config, here is a basic rundown in case something jumps out... 5 nodes, dl360g5 2xQcore w/16GB ram EVA8100 2x4GB FC, multipath 5VG's each w/a single lv each with an ext3 fs. ha lvm in is use as a measure of protection for the ext3 fs's local locking only via lvm.conf tags enabled via lvm.conf initrd's are newer than the lvm.conf changes. I did notice that the ext3 label in use on the home volume was not of the form /home (it was /ha_home) from early testing but I've corrected that and the umount fail still occurs. If anyone has any ideas I'd appreciate it. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From corey.kovacs at gmail.com Wed Nov 3 06:27:22 2010 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Wed, 3 Nov 2010 06:27:22 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net> Message-ID: You are certainly correct. I neglected to mention that I'd also checked for logged in users as well and there were none. Thank for this anyway, I appretiate the feedback. Corey Sent from my iPod On Nov 3, 2010, at 2:15 AM, "Jankowski, Chris" wrote: > Corey, > > I vaguely remember from my work on UNIX clusters many years ago that > if /dir is the mount point of a mounted filesystem then cd /dir or > into any directory below /dir from an interactive shell will prevent > an unmount of the filesystem i.e. umount /dir will fail. I believe > that this restriction is because it will create an inconsistency in > the state of the shell process. lsof will not show it. > > Of course most users after login end up in the home directory by > default. > > I believe that Linux will have the same semantics as UNIX. You can > test that easily on a standalone Linux box. > > Regards, > > Chris Jankowski > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Corey Kovacs > Sent: Wednesday, 3 November 2010 07:15 > To: linux clustering > Subject: [Linux-cluster] ha-lvm > > Folks, > > I have a 5 node cluster backed by an FC SAN with 5 VG's each with a > single LVM. > > I am using ha_lvm and have lvm.conf configured to use tags as per > the instructions. 
Things work fine until I try to migrate the volume > containing our home dir (all others work as expected) The umount for > that volume fails and depending on the active config, the node > reboots itself (self_fence=1) or it simply fails and get's disabled. > > lsof doesn't reveal anything "holding" onto that mount point yet the > umount fails consistently (force_umount is enabled) > > Furthermore, it appears I have at least one ov my VG's with bad > tags, is there a way to show what tags a VG has? > > I've gone over the config several times and although I cannot show > the config, here is a basic rundown in case something jumps out... > > 5 nodes, dl360g5 2xQcore w/16GB ram > EVA8100 > 2x4GB FC, multipath > 5VG's each w/a single lv each with an ext3 fs. > ha lvm in is use as a measure of protection for the ext3 fs's local > locking only via lvm.conf tags enabled via lvm.conf initrd's are > newer than the lvm.conf changes. > > I did notice that the ext3 label in use on the home volume was not > of the form /home (it was /ha_home) from early testing but I've > corrected that and the umount fail still occurs. > > If anyone has any ideas I'd appreciate it. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jonathan.barber at gmail.com Wed Nov 3 10:00:42 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Wed, 3 Nov 2010 10:00:42 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596F5841A5E5@GVW1113EXC.americas.hpqcorp.net> Message-ID: On 3 November 2010 02:15, Jankowski, Chris wrote: > Corey, > > I vaguely remember from my work on UNIX clusters many years ago that if /dir is the mount point of a mounted filesystem then cd /dir or into any directory below /dir from an interactive shell will prevent an unmount of the filesystem i.e. umount /dir will fail. ?I believe that this restriction is because it will create an inconsistency in the state of the shell process. lsof will not show it. lsof does show this: $ mkdir /scratch/foo $ cd /scratch/foo $ lsof +D /scratch/foo COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME bash 3060 x01024 cwd DIR 253,4 4096 303105 /scratch/foo lsof 4606 x01024 cwd DIR 253,4 4096 303105 /scratch/foo lsof 4607 x01024 cwd DIR 253,4 4096 303105 /scratch/foo This is on fedora 13 with an ext3 FS, but it also true for RHEL4 and 5. > Of course most users after login end up in the home directory by default. > > I believe that Linux will have the same semantics as UNIX. You can test that easily on a standalone Linux box. > > Regards, > > Chris Jankowski > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Corey Kovacs > Sent: Wednesday, 3 November 2010 07:15 > To: linux clustering > Subject: [Linux-cluster] ha-lvm > > Folks, > > I have a 5 node cluster backed by an FC SAN with 5 VG's each with a single LVM. > > I am using ha_lvm and have lvm.conf configured to use tags as per the instructions. Things work fine until I try to migrate the volume containing our home dir (all others work as expected) The umount for that volume fails and depending on the active config, the node reboots itself (self_fence=1) or it simply fails and get's disabled. 
> > lsof doesn't reveal anything "holding" onto that mount point yet the umount fails consistently (force_umount is enabled) > > Furthermore, it appears I have at least one ov my VG's with bad tags, is there a way to show what tags a VG has? > > I've gone over the config several times and although I cannot show the config, here is a basic rundown in case something jumps out... > > 5 nodes, dl360g5 2xQcore w/16GB ram > EVA8100 > 2x4GB FC, multipath > 5VG's each w/a single lv each with an ext3 fs. > ha lvm in is use as a measure of protection for the ext3 fs's local locking only via lvm.conf tags enabled via lvm.conf initrd's are newer than the lvm.conf changes. > > I did notice that the ext3 label in use on the home volume was not of the form /home (it was /ha_home) from early testing but I've corrected that and the umount fail still occurs. > > If anyone has any ideas I'd appreciate it. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Jonathan Barber From jonathan.barber at gmail.com Wed Nov 3 10:41:56 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Wed, 3 Nov 2010 10:41:56 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: On 2 November 2010 20:14, Corey Kovacs wrote: > Folks, [snip] > lsof doesn't reveal anything "holding" onto that mount point yet the > umount fails consistently (force_umount is enabled) Are you sure that you're specifying the filesystem mount point (as listed in fstab) and not the directory. I've cut myself on the sharp options in lsof before. It might be worth adding the +D argument to traverse all of the directories under the filesystem looking for open files. You could also use fuser command in case it's pre-coffee operator induced error ;) > Furthermore, it appears I have at least one ov my VG's with bad tags, > is there a way to show what tags a VG has? "vgs -o vg_name,vg_tags" Can you umount the volume manually? If you can then it's something to do with the RHCS, otherwise it's something else. > I've gone over the config several times and although I cannot show the > config, here is a basic rundown in case something jumps out... [snip] > > If anyone has any ideas I'd appreciate it. > -- Jonathan Barber From corey.kovacs at gmail.com Wed Nov 3 11:55:12 2010 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Wed, 3 Nov 2010 11:55:12 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: John, This is a cluster managed mount so there is no fstab entry. The lsof options you show... "vgs -o vg_name,vg_tags" are a welcome addition to my tool belt, thanks for that. seems I need to practice what I preach and use the man pages more... I am out today but I'll try these tomorrow. Thanks Corey On Wed, Nov 3, 2010 at 10:41 AM, Jonathan Barber wrote: > On 2 November 2010 20:14, Corey Kovacs wrote: >> Folks, > > [snip] > >> lsof doesn't reveal anything "holding" onto that mount point yet the >> umount fails consistently (force_umount is enabled) > > Are you sure that you're specifying the filesystem mount point (as > listed in fstab) and not the directory. I've cut myself on the sharp > options in lsof before. It might be worth adding the +D argument to > traverse all of the directories under the filesystem looking for open > files. 
> > You could also use fuser command in case it's pre-coffee operator > induced error ;) > >> Furthermore, it appears I have at least one ov my VG's with bad tags, >> is there a way to show what tags a VG has? > > "vgs -o vg_name,vg_tags" > > Can you umount the volume manually? If you can then it's something to > do with the RHCS, otherwise it's something else. > >> I've gone over the config several times and although I cannot show the >> config, here is a basic rundown in case something jumps out... > > [snip] > >> >> If anyone has any ideas I'd appreciate it. >> > > -- > Jonathan Barber > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From jonathan.barber at gmail.com Wed Nov 3 13:13:19 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Wed, 3 Nov 2010 13:13:19 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: On 3 November 2010 11:55, Corey Kovacs wrote: > John, > > This is a cluster managed mount so there is no fstab entry. That doesn't mean you can't umount it from the command line: # umount /path/to/mount/point As commented in another thread the other day, you probably want to do a "clusvcadm -Z servicename" to stop RHCS from taking action if you manage to umount the filesystem. Don't forget to do "clusvcadm -U servicename" afterwards... > The lsof options you show... > > "vgs -o vg_name,vg_tags" > > are a welcome addition to my tool belt, thanks for that. > > seems I need to practice what I preach and use the man pages more... > > I am out today but I'll try these tomorrow. > > Thanks > > Corey > > > > On Wed, Nov 3, 2010 at 10:41 AM, Jonathan Barber > wrote: >> On 2 November 2010 20:14, Corey Kovacs wrote: >>> Folks, >> >> [snip] >> >>> lsof doesn't reveal anything "holding" onto that mount point yet the >>> umount fails consistently (force_umount is enabled) >> >> Are you sure that you're specifying the filesystem mount point (as >> listed in fstab) and not the directory. I've cut myself on the sharp >> options in lsof before. It might be worth adding the +D argument to >> traverse all of the directories under the filesystem looking for open >> files. >> >> You could also use fuser command in case it's pre-coffee operator >> induced error ;) >> >>> Furthermore, it appears I have at least one ov my VG's with bad tags, >>> is there a way to show what tags a VG has? >> >> "vgs -o vg_name,vg_tags" >> >> Can you umount the volume manually? If you can then it's something to >> do with the RHCS, otherwise it's something else. >> >>> I've gone over the config several times and although I cannot show the >>> config, here is a basic rundown in case something jumps out... >> >> [snip] >> >>> >>> If anyone has any ideas I'd appreciate it. >>> >> >> -- >> Jonathan Barber >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Jonathan Barber From mylinuxhalist at gmail.com Wed Nov 3 13:23:27 2010 From: mylinuxhalist at gmail.com (My LinuxHAList) Date: Wed, 3 Nov 2010 09:23:27 -0400 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: One possibility is that, say you try to unmount /mountpoint, however you have another partition mounted at /mountpoint/subdir, that would prevent /mountpoint to be unmounted, without unmounting /mountpoint/subdir first. 
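For instance, with the layout described above (the paths are just the
placeholders from that example):

  # umount /mountpoint          <- refused as busy while /mountpoint/subdir is still mounted
  # umount /mountpoint/subdir
  # umount /mountpoint          <- now succeeds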
You could check the output of mount command. On Wed, Nov 3, 2010 at 9:13 AM, Jonathan Barber wrote: > On 3 November 2010 11:55, Corey Kovacs wrote: >> John, >> >> This is a cluster managed mount so there is no fstab entry. > > That doesn't mean you can't umount it from the command line: > # umount /path/to/mount/point > > As commented in another thread the other day, you probably want to do > a "clusvcadm -Z servicename" to stop RHCS from taking action if you > manage to umount the filesystem. Don't forget to do "clusvcadm -U > servicename" afterwards... > >> The lsof options you show... >> >> "vgs -o vg_name,vg_tags" >> >> are a welcome addition to my tool belt, thanks for that. >> >> seems I need to practice what I preach and use the man pages more... >> >> I am out today but I'll try these tomorrow. >> >> Thanks >> >> Corey >> >> >> >> On Wed, Nov 3, 2010 at 10:41 AM, Jonathan Barber >> wrote: >>> On 2 November 2010 20:14, Corey Kovacs wrote: >>>> Folks, >>> >>> [snip] >>> >>>> lsof doesn't reveal anything "holding" onto that mount point yet the >>>> umount fails consistently (force_umount is enabled) >>> >>> Are you sure that you're specifying the filesystem mount point (as >>> listed in fstab) and not the directory. I've cut myself on the sharp >>> options in lsof before. It might be worth adding the +D argument to >>> traverse all of the directories under the filesystem looking for open >>> files. >>> >>> You could also use fuser command in case it's pre-coffee operator >>> induced error ;) >>> >>>> Furthermore, it appears I have at least one ov my VG's with bad tags, >>>> is there a way to show what tags a VG has? >>> >>> "vgs -o vg_name,vg_tags" >>> >>> Can you umount the volume manually? If you can then it's something to >>> do with the RHCS, otherwise it's something else. >>> >>>> I've gone over the config several times and although I cannot show the >>>> config, here is a basic rundown in case something jumps out... >>> >>> [snip] >>> >>>> >>>> If anyone has any ideas I'd appreciate it. >>>> >>> >>> -- >>> Jonathan Barber >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Jonathan Barber > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From zagar at arlut.utexas.edu Wed Nov 3 17:55:58 2010 From: zagar at arlut.utexas.edu (Randy Zagar) Date: Wed, 03 Nov 2010 12:55:58 -0500 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: Message-ID: <4CD1A22E.2070004@arlut.utexas.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I frequently find that I'm unable to umount volumes, even after lsof and fuser return nothing relevant, and have to "force" a "lazy" umount like so: umount -lf /dir because both "umount /dir" and "umount -f /dir" fail. - -RZ > > On Nov 3, 2010, at 2:15 AM, "Jankowski, Chris" > wrote: > >> Corey, >> >> I vaguely remember from my work on UNIX clusters many years ago >> that if /dir is the mount point of a mounted filesystem then cd >> /dir or into any directory below /dir from an interactive shell >> will prevent an unmount of the filesystem i.e. umount /dir will >> fail. I believe that this restriction is because it will create >> an inconsistency in the state of the shell process. lsof will not >> show it. 
>> >> Of course most users after login end up in the home directory by >> default. >> >> I believe that Linux will have the same semantics as UNIX. You >> can test that easily on a standalone Linux box. >> >> Regards, >> >> Chris Jankowski >> >> >> -----Original Message----- From: linux-cluster-bounces at redhat.com >> [mailto:linux-cluster- bounces at redhat.com] On Behalf Of Corey >> Kovacs Sent: Wednesday, 3 November 2010 07:15 To: linux >> clustering Subject: [Linux-cluster] ha-lvm >> >> Folks, >> >> I have a 5 node cluster backed by an FC SAN with 5 VG's each with >> a single LVM. >> >> I am using ha_lvm and have lvm.conf configured to use tags as per >> the instructions. Things work fine until I try to migrate the >> volume containing our home dir (all others work as expected) The >> umount for that volume fails and depending on the active config, >> the node reboots itself (self_fence=1) or it simply fails and >> get's disabled. >> >> lsof doesn't reveal anything "holding" onto that mount point yet >> the umount fails consistently (force_umount is enabled) >> >> Furthermore, it appears I have at least one ov my VG's with bad >> tags, is there a way to show what tags a VG has? >> >> I've gone over the config several times and although I cannot >> show the config, here is a basic rundown in case something jumps >> out... >> >> 5 nodes, dl360g5 2xQcore w/16GB ram EVA8100 2x4GB FC, multipath >> 5VG's each w/a single lv each with an ext3 fs. ha lvm in is use >> as a measure of protection for the ext3 fs's local locking only >> via lvm.conf tags enabled via lvm.conf initrd's are newer than >> the lvm.conf changes. >> >> I did notice that the ext3 label in use on the home volume was >> not of the form /home (it was /ha_home) from early testing but >> I've corrected that and the umount fail still occurs. >> >> If anyone has any ideas I'd appreciate it. >> >> -- Linux-cluster mailing list Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ iEYEARECAAYFAkzRoi4ACgkQKQP9Tvu8x8xq3wCghKNS6//Pv0kDF6RggnCCk0b4 oaEAn3uO3rDQUNAjlaXHr0yojzaUiXU8 =HaFU -----END PGP SIGNATURE----- From dxh at yahoo.com Wed Nov 3 19:30:38 2010 From: dxh at yahoo.com (Don Hoover) Date: Wed, 3 Nov 2010 12:30:38 -0700 (PDT) Subject: [Linux-cluster] iptables Message-ID: <90758.87004.qm@web120718.mail.ne1.yahoo.com> Doing some testing with RHEL6 Beta2+, and I turned on debugging to verify my iptables was working with RHCS. And I noticed that there are some packets send between each node periodically that are going to destination port=0. Dropped by firewall: IN=bond0 OUT= MAC=00:14:38:bc:ab:4d:00:1b:78:ba:80:14:08:00 SRC=10.240.48.180 DST=10.240.48.178 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=19018 DF PROTO=TCP SPT=49555 DPT=0 WINDOW=5840 RES=0x00 SYN URGP=0 Dropped by firewall: IN=bond0 OUT= MAC=00:14:38:bc:ab:4d:00:17:a4:47:99:57:08:00 SRC=10.240.48.179 DST=10.240.48.178 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=32053 DF PROTO=TCP SPT=22430 DPT=0 WINDOW=5840 RES=0x00 SYN URGP=0 Does port 0 need to be opened? This is no where in the docs, I used all the normal port suggested. 
Here is what I am testing with having open: #-A INPUT -m state --state NEW -m tcp -p tcp --dport 137 -j ACCEPT #-A INPUT -m state --state NEW -m tcp -p tcp --dport 138 -j ACCEPT #-A INPUT -m state --state NEW -m udp -p udp --dport 137 -j ACCEPT #-A INPUT -m state --state NEW -m udp -p udp --dport 138 -j ACCEPT ### cman - 5404,5405 udp -A INPUT -m state --state NEW -m udp -p udp --dport 5404 -j ACCEPT -A INPUT -m state --state NEW -m udp -p udp --dport 5405 -j ACCEPT ### ricci - 11111 tcp -A INPUT -m state --state NEW -m tcp -p tcp --dport 11111 -j ACCEPT ### dlm - 21064 tcp -A INPUT -m state --state NEW -m tcp -p tcp --dport 21064 -j ACCEPT ### ccsd - 50006,50008,50008 tcp and 50007 udp -A INPUT -m state --state NEW -m tcp -p tcp --dport 50006 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50008 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50009 -j ACCEPT -A INPUT -m state --state NEW -m udp -p udp --dport 50007 -j ACCEPT ### multicast heartbeat (may be different for each cluster) -A INPUT -s 239.192.0.0/16 -m addrtype --src-type MULTICAST -j ACCEPT -A INPUT -s 224.0.0.0/8 -m addrtype --src-type MULTICAST -j ACCEPT From jonathan.barber at gmail.com Thu Nov 4 10:42:29 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Thu, 4 Nov 2010 10:42:29 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: <4CD1A22E.2070004@arlut.utexas.edu> References: <4CD1A22E.2070004@arlut.utexas.edu> Message-ID: On 3 November 2010 17:55, Randy Zagar wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > I frequently find that I'm unable to umount volumes, even after lsof > and fuser return nothing relevant, and have to "force" a "lazy" umount > like so: > > ? ?umount -lf /dir > > because both "umount /dir" and "umount -f /dir" fail. That's a cool option, but I'd be very worried about corrupting the filesystem if it was mounted on a second node whilst a process was holding the filesystem open on the original node. > - -RZ > >> >> On Nov 3, 2010, at 2:15 AM, "Jankowski, Chris" >> wrote: >> >>> Corey, >>> >>> I vaguely remember from my work on UNIX clusters many years ago >>> that if /dir is the mount point of a mounted filesystem then cd >>> /dir or into any directory below /dir from an interactive shell >>> will prevent an unmount of the filesystem i.e. umount /dir will >>> fail. I believe that this restriction is because it will create >>> an inconsistency in the state of the shell process. lsof will not >>> show it. >>> >>> Of course most users after login end up in the home directory by >>> default. >>> >>> I believe that Linux will have the same semantics as UNIX. You >>> can test that easily on a standalone Linux box. >>> >>> Regards, >>> >>> Chris Jankowski >>> >>> >>> -----Original Message----- From: linux-cluster-bounces at redhat.com >>> [mailto:linux-cluster- bounces at redhat.com] On Behalf Of Corey >>> Kovacs Sent: Wednesday, 3 November 2010 07:15 To: linux >>> clustering Subject: [Linux-cluster] ha-lvm >>> >>> Folks, >>> >>> I have a 5 node cluster backed by an FC SAN with 5 VG's each with >>> a single LVM. >>> >>> I am using ha_lvm and have lvm.conf configured to use tags as per >>> the instructions. Things work fine until I try to migrate the >>> volume containing our home dir (all others work as expected) The >>> umount for that volume fails and depending on the active config, >>> the node reboots itself (self_fence=1) or it simply fails and >>> get's disabled. 
>>> >>> lsof doesn't reveal anything "holding" onto that mount point yet >>> the umount fails consistently (force_umount is enabled) >>> >>> Furthermore, it appears I have at least one ov my VG's with bad >>> tags, is there a way to show what tags a VG has? >>> >>> I've gone over the config several times and although I cannot >>> show the config, here is a basic rundown in case something jumps >>> out... >>> >>> 5 nodes, dl360g5 2xQcore w/16GB ram EVA8100 2x4GB FC, multipath >>> 5VG's each w/a single lv each with an ext3 fs. ha lvm in is use >>> as a measure of protection for the ext3 fs's local locking only >>> via lvm.conf tags enabled via lvm.conf initrd's are newer than >>> the lvm.conf changes. >>> >>> I did notice that the ext3 label in use on the home volume was >>> not of the form /home (it was /ha_home) from early testing but >>> I've corrected that and the umount fail still occurs. >>> >>> If anyone has any ideas I'd appreciate it. >>> >>> -- Linux-cluster mailing list Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.10 (GNU/Linux) > Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/ > > iEYEARECAAYFAkzRoi4ACgkQKQP9Tvu8x8xq3wCghKNS6//Pv0kDF6RggnCCk0b4 > oaEAn3uO3rDQUNAjlaXHr0yojzaUiXU8 > =HaFU > -----END PGP SIGNATURE----- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Jonathan Barber From bmr at redhat.com Thu Nov 4 11:00:18 2010 From: bmr at redhat.com (Bryn M. Reeves) Date: Thu, 04 Nov 2010 11:00:18 +0000 Subject: [Linux-cluster] ha-lvm In-Reply-To: References: <4CD1A22E.2070004@arlut.utexas.edu> Message-ID: <4CD29242.6070406@redhat.com> On 11/04/2010 10:42 AM, Jonathan Barber wrote: > On 3 November 2010 17:55, Randy Zagar wrote: >> >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> I frequently find that I'm unable to umount volumes, even after lsof >> and fuser return nothing relevant, and have to "force" a "lazy" umount >> like so: >> >> umount -lf /dir >> >> because both "umount /dir" and "umount -f /dir" fail. > > That's a cool option, but I'd be very worried about corrupting the > filesystem if it was mounted on a second node whilst a process was > holding the filesystem open on the original node. Right; a lazy umount just detaches the root directory of the mounted file system from the namespace. The file system is still mounted following this operation it's just not reachable from the file system namespace (it will be cleaned up properly once it's no longer busy but remains in use until that time). Regards, Bryn. From gianluca.cecchi at gmail.com Mon Nov 8 14:24:08 2010 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Mon, 8 Nov 2010 15:24:08 +0100 Subject: [Linux-cluster] ha-lvm Message-ID: On Wed, 3 Nov 2010 11:55:12 +0000 Corey Kovacs wrote: > John, [snip] > "vgs -o vg_name,vg_tags" > are a welcome addition to my tool belt, thanks for that. On 2 rh el 5.5 clusters I manage, with slightly different level updates, and where I have HA-LVM configured, I don't get anything in vg_tags colums.... Versions of packages are respectively: lvm2-2.02.56-8.el5_5.6 on one cluster nodes lvm2-2.02.56-8.el5_5.5 on another cluster nodes. I'm using something like this in my lvm.conf files for the clusters: volume_list = [ "VolGroup00", "@node01" ] but no tag at all, both on passive and active node..... 
[root at server1 ~]# vgs -o vg_name,vg_tags
  VG          VG Tags
  VG_ORA_APPL
  VG_ORA_DATA
  VG_ORA_LOGS
  VolGroup00

and the first three ones are activated/mounted through HA-LVM

Gianluca

From marco.dominguez at gmail.com  Mon Nov  8 14:50:36 2010
From: marco.dominguez at gmail.com (Marco Andres Dominguez)
Date: Mon, 8 Nov 2010 11:50:36 -0300
Subject: [Linux-cluster] ha-lvm
In-Reply-To: 
References: 
Message-ID: 

Gianluca

The tag could be in the vg or in the lv depending on the configuration. I
usually have it in the lv, so try this:

# lvs -o vg_name,lv_name,lv_tags

I hope it helps.
Regards.
Marco

On Mon, Nov 8, 2010 at 11:24 AM, Gianluca Cecchi wrote:

> On Wed, 3 Nov 2010 11:55:12 +0000 Corey Kovacs wrote:
> > John,
> [snip]
> > "vgs -o vg_name,vg_tags"
> > are a welcome addition to my tool belt, thanks for that.
>
> On 2 RHEL 5.5 clusters I manage, with slightly different update levels,
> and where I have HA-LVM configured, I don't get anything in the
> vg_tags column....
>
> Versions of packages are respectively:
> lvm2-2.02.56-8.el5_5.6 on one cluster's nodes
> lvm2-2.02.56-8.el5_5.5 on another cluster's nodes.
>
> I'm using something like this in my lvm.conf files for the clusters:
> volume_list = [ "VolGroup00", "@node01" ]
>
> but no tag at all, both on passive and active node.....
> [root at server1 ~]# vgs -o vg_name,vg_tags
>   VG          VG Tags
>   VG_ORA_APPL
>   VG_ORA_DATA
>   VG_ORA_LOGS
>   VolGroup00
>
> and the first three ones are activated/mounted through HA-LVM
>
> Gianluca
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From gianluca.cecchi at gmail.com  Tue Nov  9 10:29:00 2010
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Tue, 9 Nov 2010 11:29:00 +0100
Subject: [Linux-cluster] ha-lvm
Message-ID: 

On Mon, 8 Nov 2010 11:50:36 -0300 Marco Andres Dominguez wrote:
> The tag could be in the vg or in the lv depending on the configurations,
> I usually have it in the lv so try this:
>
> # lvs -o vg_name,lv_name,lv_tags
> I hope it helps.
> Regards.
> Marco

Thanks, Marco.
Indeed with the lvs command I can see my tags.. ;-)
Any link with details about ".. tag could be in the vg or in the lv
depending on the configurations,..."?

From rossnick-lists at cybercat.ca  Wed Nov 10 01:53:27 2010
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Tue, 9 Nov 2010 20:53:27 -0500
Subject: [Linux-cluster] Starter Cluster / GFS
Message-ID: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire>

Hi all !

Some of you might know, Apple just discontinued the xServe servers. In the
next few weeks we were about to buy $50k worth of xServes to replace our
aging G5 and Xserve RAID setup.

Our setup is primarily composed of about a dozen xServes and a couple of
Xserve RAID enclosures for storage, all linked up with fibre channel. On top
of this we have Xsan to have a shared filesystem across all servers. Some of
the volumes are mounted on a single server on an as-needed basis.

Now, I'm not that sure I will go again with Xsan / xServes. So I am seeking
alternatives to our Xsan setup. In our server room we have several servers
running CentOS; I am quite familiar with it. I have also grown and learned
with Red Hat from version 5 or so (Red Hat 5 from a decade ago, not RHEL 5).
A user on the CentOS mailing list pointed me to GFS from Red Hat and to this
list.

So today I dug into GFS2 on Red Hat's site and it pretty much fits my need.
It seems to be a very powerful solution.
If I understand correctly, I need to set up a cluster of nodes to use GFS.
Fine with that. But since it's not a real "cluster", do I still need the
quorum to operate the global file system? On our setup, a particular service
runs on a single node from the shared filesystem.

The documentation on Red Hat's site is very technical, but lacks some
beginner's hints. For instance, there's a part about the required number of
journals to create and the size of those. But I cannot find a suggested size
or any rule of thumb for those...

So thanks for any hints.

Regards,

Nicolas Ross

From gordan at bobich.net  Wed Nov 10 08:13:21 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Wed, 10 Nov 2010 08:13:21 +0000
Subject: [Linux-cluster] Starter Cluster / GFS
In-Reply-To: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire>
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire>
Message-ID: <4CDA5421.9090006@bobich.net>

Nicolas Ross wrote:
> So today I dug into GFS2 on Red Hat's site and it pretty much fits my
> need. It seems to be a very powerful solution. If I understand
> correctly, I need to set up a cluster of nodes to use GFS. Fine with
> that. But since it's not a real "cluster", do I still need the quorum to
> operate the global file system? On our setup, a particular service runs
> on a single node from the shared filesystem.

If you want the FS mounted on all nodes at the same time then all those
nodes must be a part of the cluster, and they have to be quorate (majority
of nodes have to be up). You don't need a quorum block device, but it can be
useful when you have only 2 nodes.

If you are only ever going to have the SAN volume mounted on one node at a
time, don't bother with GFS and make the SAN block device a fail-over
resource so that only one node can mount it at a time, and put a normal
non-shared FS on it. You will get better performance.

> The documentation on Red Hat's site is very technical, but lacks some
> beginner's hints. For instance, there's a part about the required number
> of journals to create and the size of those. But I cannot find a
> suggested size or any rule of thumb for those...

The number of journals needs to be equal to or greater than the number of
nodes you have in a cluster, e.g. if you have 5 nodes in a cluster, you need
at least 5 journals. If you think you might upgrade your cluster to 10 nodes
at some point in the future, then create 10 journals, as this needs to be
done at FS creation time.

Gordan

From rossnick-lists at cybercat.ca  Wed Nov 10 12:07:33 2010
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Wed, 10 Nov 2010 07:07:33 -0500
Subject: [Linux-cluster] Starter Cluster / GFS
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net>
Message-ID: <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire>

Thanks

> If you want the FS mounted on all nodes at the same time then all those
> nodes must be a part of the cluster, and they have to be quorate (majority
> of nodes have to be up). You don't need a quorum block device, but it can
> be useful when you have only 2 nodes.

At term, I will have 7 to 10 nodes, but 2 at first for initial setup and
testing. OK, so if I have a 3-node cluster for example, I need at least 2
nodes for the cluster, and thus the GFS, to be up? I cannot have a running
GFS with only one node?

> If you are only ever going to have the SAN volume mounted on one node at
> a time, don't bother with GFS and make the SAN block device a fail-over
> resource so that only one node can mount it at a time, and put a normal
> non-shared FS on it.
You will get better performance. I do need a shared file-system, I am aware of the added latency, we currently have some latency on our xSan setup. But we do also need on some services an additional block-device that is accessed only by one node and is indeed failed-over another node when a node fail. > The number of journals needs to be equal to or greater than the number of > nodes you have in a cluster. e.g. if you have 5 nodes in a cluster, you > need at least 5 journals. If you think you might upgrade your cluster to > 10 nodes at some point in the future, then create 10 journals, as this > needs to be done at FS creation time. That I got. It's the size that I don't know how to figure out. Will 32 megs will be enough ? 64 ? 128 ? Nicolas From gordan at bobich.net Wed Nov 10 12:17:14 2010 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Nov 2010 12:17:14 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> Message-ID: <4CDA8D4A.6010507@bobich.net> Nicolas Ross wrote: > Thanks > >> >> If you want the FS mounted on all nodes at the same time then all >> those nodes must be a part of the cluster, and they have to be quorate >> (majority of nodes have to be up). You don't need a quorum block >> device, but it can be useful when you have only 2 nodes. > > At term, I will have 7 to 10 nodes, but 2 at first for initial setup and > testing. Ok, so if I have a 3 nodes cluster for exemple, I need at least > 2 nodes for the cluster, and thus the gfs, to be up ? I cannot have a > running gfs with only one node ? In a 2-node cluster, you can have running GFS with just one node up. But in that case it is advisble to have a quorum block device on the SAN. With a 3 node cluster, you cannot have quorum with just 1 node, and thus you cannot have GFS running. It will block until quorum is re-established. >> If you are only ever going to have the SAN volume mounted on one >> device at a time, don't bother with GFS and make the SAN block device >> a fail-over resource so that only one node can mount it at a time, and >> put a normal non-shared FS on it. You will get better performance. > > I do need a shared file-system, I am aware of the added latency, we > currently have some latency on our xSan setup. But we do also need on > some services an additional block-device that is accessed only by one > node and is indeed failed-over another node when a node fail. So handle the file system failover for the ones where only one node accesses them at a time and have a shared file system for the areas where multiple nodes need concurrent access. >> The number of journals needs to be equal to or greater than the number >> of nodes you have in a cluster. e.g. if you have 5 nodes in a cluster, >> you need at least 5 journals. If you think you might upgrade your >> cluster to 10 nodes at some point in the future, then create 10 >> journals, as this needs to be done at FS creation time. > > That I got. It's the size that I don't know how to figure out. Will 32 > megs will be enough ? 64 ? 128 ? That depends largely on how big your operations are. I cannot remember what the defaults are, but they are reasonable. In general, big journals can help if you do big I/O operations. In practice, block group sizes can be more important for performance (bigger can help on very large file systems or big files). 
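To make those knobs concrete, a hypothetical mkfs.gfs2 invocation that sets
them explicitly could look like this (the cluster name, device and sizes are
placeholders, not recommendations):

  mkfs.gfs2 -p lock_dlm -t mycluster:shared_fs -j 10 -J 128 -r 1024 /dev/vg_san/lv_shared

Here -j fixes the number of journals at creation time, -J sets the
per-journal size in MB, and -r sets the resource group ("block group") size
in MB.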
Gordan

From rossnick-lists at cybercat.ca  Wed Nov 10 13:53:56 2010
From: rossnick-lists at cybercat.ca (Nicolas Ross)
Date: Wed, 10 Nov 2010 08:53:56 -0500
Subject: [Linux-cluster] Starter Cluster / GFS
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net>
Message-ID: 

> In a 2-node cluster, you can have running GFS with just one node up. But
> in that case it is advisable to have a quorum block device on the SAN.
> With a 3 node cluster, you cannot have quorum with just 1 node, and thus
> you cannot have GFS running. It will block until quorum is re-established.

Ok, I'll keep that in mind and experiment with what it does when I start
playing with the hardware.

> That depends largely on how big your operations are. I cannot remember
> what the defaults are, but they are reasonable. In general, big journals
> can help if you do big I/O operations. In practice, block group sizes can
> be more important for performance (bigger can help on very large file
> systems or big files).

The volume will be composed of 7 1TB disks in RAID 5, so 6 TB. It will host
many, many small files, and some bigger files. But the files that change the
most often will most likely be smaller than the block size. The GFS will not
be used for I/O-intensive tasks; that's where the standalone volumes come
into play. It'll be used to access many files, often. Specifically, Apache
will run from it, with document root, session store, etc. on the GFS.

Regards,

From marco.dominguez at gmail.com  Wed Nov 10 14:07:08 2010
From: marco.dominguez at gmail.com (Marco Andres Dominguez)
Date: Wed, 10 Nov 2010 11:07:08 -0300
Subject: [Linux-cluster] ha-lvm
In-Reply-To: 
References: 
Message-ID: 

I think the differences in the configurations are in the lvm resource in
cluster.conf. If you put something like this: you get the tag in the lv, but
if you put something like this: you get the tag in the vg. I have never used
the second option so I am not 100% sure if it is right; I would have to try
it.

You can have a look at doc: DOC-3068 to get more info on ha-lvm.

Regards
Marco

On Tue, Nov 9, 2010 at 7:29 AM, Gianluca Cecchi wrote:

> On Mon, 8 Nov 2010 11:50:36 -0300 Marco Andres Dominguez wrote:
> > The tag could be in the vg or in the lv depending on the configurations,
> > I usually have it in the lv so try this:
> >
> > # lvs -o vg_name,lv_name,lv_tags
> > I hope it helps.
> > Regards.
> > Marco
>
> Thanks, Marco.
> Indeed with the lvs command I can see my tags.. ;-)
> Any link with details about ".. tag could be in the vg or in the lv
> depending on the configurations,..."?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From gordan at bobich.net  Wed Nov 10 14:12:51 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Wed, 10 Nov 2010 14:12:51 +0000
Subject: [Linux-cluster] Starter Cluster / GFS
In-Reply-To: 
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net>
Message-ID: <4CDAA863.2040100@bobich.net>

Nicolas Ross wrote:
>> That depends largely on how big your operations are. I cannot remember
>> what the defaults are, but they are reasonable. In general, big
>> journals can help if you do big I/O operations.
>> In practice, block group sizes can be more important for performance
>> (bigger can help on very large file systems or big files).
>
> The volume will be composed of 7 1TB disks in RAID 5, so 6 TB.

Be careful with that arrangement. You are right up against the ragged edge
in terms of data safety.

1TB disks are consumer-grade SATA disks with non-recoverable error rates of
about 10^-14. That is one non-recoverable error per 11TB.

Now consider what happens when one of your disks fails. You have to read 6TB
to reconstruct the failed disk. With an error rate of 1 in 11TB, the chance
of another failure occurring in 6TB of reads is about 53%. So the chances
are that during this operation, you are going to have another failure, and
the chances are that your RAID layer will kick the disk out as faulty - at
which point you will find yourself with 2 failed disks in a RAID5 array and
in need of a day or two of downtime to scrub your data to a fresh array and
hope for the best.

RAID5 is ill suited to arrays over 5TB. Using enterprise grade disks will
gain you an improved error rate (10^-15), which makes it good enough - if
you also have regular backups. But enterprise grade disks are much smaller
and much more expensive.

Not to mention that your performance on small writes (smaller than the
stripe width) will be appalling with RAID5 due to the read-modify-write
operation required to construct the parity, which will reduce your effective
performance to that of a single disk.

> It will host many, many small files, and some bigger files. But the files
> that change the most often will most likely be smaller than the block
> size.

That sounds like a scenario from hell for RAID5 (or RAID6).

> The GFS will not be used for I/O-intensive tasks; that's where the
> standalone volumes come into play. It'll be used to access many files,
> often. Specifically, Apache will run from it, with document root, session
> store, etc. on the GFS.

Performance-wise, GFS should be OK for that if you are running with noatime
and the operations are all reads. If you end up with write contention
without partitioning the access to directory subtrees on a per-server basis,
the performance will fall off a cliff pretty quickly.

Gordan

From linux at alteeve.com  Wed Nov 10 16:05:18 2010
From: linux at alteeve.com (Digimer)
Date: Wed, 10 Nov 2010 11:05:18 -0500
Subject: [Linux-cluster] Starter Cluster / GFS
In-Reply-To: <4CDA8D4A.6010507@bobich.net>
References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net>
Message-ID: <4CDAC2BE.4010009@alteeve.com>

On 10-11-10 07:17 AM, Gordan Bobic wrote:
>>> If you want the FS mounted on all nodes at the same time then all
>>> those nodes must be a part of the cluster, and they have to be
>>> quorate (majority of nodes have to be up). You don't need a quorum
>>> block device, but it can be useful when you have only 2 nodes.
>>
>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup
>> and testing. OK, so if I have a 3-node cluster for example, I need at
>> least 2 nodes for the cluster, and thus the GFS, to be up? I cannot
>> have a running GFS with only one node?
>
> In a 2-node cluster, you can have running GFS with just one node up. But
> in that case it is advisable to have a quorum block device on the SAN.
> With a 3 node cluster, you cannot have quorum with just 1 node, and thus
> you cannot have GFS running. It will block until quorum is re-established.
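For reference, the 2-node special case mentioned above is the one enabled by
cman's two_node flag in cluster.conf, roughly:

  <cman two_node="1" expected_votes="1"/>

(the values shown are the conventional pairing for that mode, not taken from
any configuration in this thread).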
With a quorum disk, you can in fact have one node left and still have quorum. This is because the quorum drive should have (node-1) votes, thus always giving the last node 50%+1 even with all other nodes being dead. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From gordan at bobich.net Wed Nov 10 16:09:54 2010 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 10 Nov 2010 16:09:54 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDAC2BE.4010009@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> Message-ID: <4CDAC3D2.9050703@bobich.net> Digimer wrote: > On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>> If you want the FS mounted on all nodes at the same time then all >>>> those nodes must be a part of the cluster, and they have to be >>>> quorate (majority of nodes have to be up). You don't need a quorum >>>> block device, but it can be useful when you have only 2 nodes. >>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup >>> and testing. Ok, so if I have a 3 nodes cluster for exemple, I need at >>> least 2 nodes for the cluster, and thus the gfs, to be up ? I cannot >>> have a running gfs with only one node ? >> In a 2-node cluster, you can have running GFS with just one node up. But >> in that case it is advisble to have a quorum block device on the SAN. >> With a 3 node cluster, you cannot have quorum with just 1 node, and thus >> you cannot have GFS running. It will block until quorum is re-established. > > With a quorum disk, you can in fact have one node left and still have > quorum. This is because the quorum drive should have (node-1) votes, > thus always giving the last node 50%+1 even with all other nodes being dead. I've never tried testing that use-case extensively, but I suspect that it is only safe to do with SAN-side fencing. Otherwise two nodes could lose contact with each other and still both have access to the SAN and thus both be individually quorate. Gordan From rossnick-lists at cybercat.ca Wed Nov 10 16:21:55 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 11:21:55 -0500 Subject: [Linux-cluster] Starter Cluster / GFS References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAA863.2040100@bobich.net> Message-ID: <1634100741B94E019943576441CD6873@versa> >> The volume will be composed of 7 1TB disk in raid5, so 6 TB. > > Be careful with that arrangement. You are right up against the ragged edge > in terms of data safety. > > 1TB disks a consumer grade SATA disks with non-recoverable error rates of > about 10^-14. That is one non-recoverable error per 11TB. > > Now consider what happens when one of your disks fails. You have to read > 6TB to reconstruct the failed disk. With error rate of 1 in 11TB, the > chances of another failure occurring in 6TB of reads is about 53%. So the > chances are that during this operation, you are going to have another > failure, and the chances are that your RAID layer will kick the disk out > as faulty - at which point you will find yourself with 2 failed disks in a > RAID5 array and in need of a day or two of downtime to scrub your data to > a fresh array and hope for the best. > > RAID5 is ill suited to arrays over 5TB. 
Using enterprise grade disks will > gain you an improved error rate (10^-15), which makes it good enough - if > you also have regular backups. But enterprise grade disks are much smaller > and much more expensive. > > Not to mention that your performance on small writes (smaller than the > stripe width) will be appalling with RAID5 due to the write-read-write > operation required to construct the parity which will reduce your > effective performance to that of a single disk. Wow... The enclosure I will use (and already have) is an ActiveStorage ActiveRAID in a 16 x 1 TB config. (http://www.getactivestorage.com/activeraid.php). The drives are Hitachi model HDE721010SLA33. From what I could find, the error rate is 1 in 10^15. We will have good backups. One of the nodes will have a local copy of the critical data (about 1 TB) on internally-attached disks. All of the rest of the data will be rsync-ed off site to a secondary identical setup. >> It will host many, many small files, and some biger files. But the files >> that change the most often will mos likely be smaller than the blocsize. > > That sounds like a scenario from hell for RAID5 (or RAID6). What do you suggest to achieve sizes in the range of 6-7 TB, maybe more ? >> The gfs will not be used for io-intensive tasks, that's where the >> standalone volumes comes into play. It'll be used to access many files, >> often. Specificly, apache will run from it, with document root, session >> store, etc on the gfs. > > Performance-wise, GFS should should be OK for that if you are running with > noatime and the operations are all reads. If you end up with write > contention without partitioning the access to directory subtrees on a per > server basis, the performance will fall off a cliff pretty quickly. Can you explain a little bit more ? I'm not sure I fully understand the partitioning into directories ? From rossnick-lists at cybercat.ca Wed Nov 10 16:29:51 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 11:29:51 -0500 Subject: [Linux-cluster] Starter Cluster / GFS References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire><4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> Message-ID: >> With a quorum disk, you can in fact have one node left and still have >> quorum. This is because the quorum drive should have (node-1) votes, >> thus always giving the last node 50%+1 even with all other nodes being >> dead. > > I've never tried testing that use-case extensively, but I suspect that it > is only safe to do with SAN-side fencing. Otherwise two nodes could lose > contact with each other and still both have access to the SAN and thus > both be individually quorate. In our case, a particular node will run a particular service from a particular directory on the disk. So, even if 2 nodes lose contact with each other, they should not end up writing or reading from the same files. Am I wrong ?
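To illustrate what I mean (the directory and service names below are made up), each service lives in its own subtree and only its "home" node ever opens files there:

# /gfs/service-a/  -> only ever touched by node1 (its apache instance, config, data)
# /gfs/service-b/  -> only ever touched by node2
# Quick sanity check: run on node1, this should come back empty, since
# node1 never opens anything under service-b's tree:
lsof +D /gfs/service-b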
From linux at alteeve.com Wed Nov 10 16:41:27 2010 From: linux at alteeve.com (Digimer) Date: Wed, 10 Nov 2010 11:41:27 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDAC3D2.9050703@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> Message-ID: <4CDACB37.3070704@alteeve.com> On 10-11-10 11:09 AM, Gordan Bobic wrote: > Digimer wrote: >> On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>>> If you want the FS mounted on all nodes at the same time then all >>>>> those nodes must be a part of the cluster, and they have to be >>>>> quorate (majority of nodes have to be up). You don't need a quorum >>>>> block device, but it can be useful when you have only 2 nodes. >>>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup >>>> and testing. Ok, so if I have a 3 nodes cluster for exemple, I need at >>>> least 2 nodes for the cluster, and thus the gfs, to be up ? I cannot >>>> have a running gfs with only one node ? >>> In a 2-node cluster, you can have running GFS with just one node up. But >>> in that case it is advisble to have a quorum block device on the SAN. >>> With a 3 node cluster, you cannot have quorum with just 1 node, and thus >>> you cannot have GFS running. It will block until quorum is >>> re-established. >> >> With a quorum disk, you can in fact have one node left and still have >> quorum. This is because the quorum drive should have (node-1) votes, >> thus always giving the last node 50%+1 even with all other nodes being >> dead. > > I've never tried testing that use-case extensively, but I suspect that > it is only safe to do with SAN-side fencing. Otherwise two nodes could > lose contact with each other and still both have access to the SAN and > thus both be individually quorate. > > Gordan Clustered storage *requires* fencing. To not use fencing is like driving tired; It's just a matter of time before something bad happens. That said, I should have been more clear in specifying the requirement for fencing. Now that said, the fencing shouldn't be needed at the SAN side, though that works fine as well. The way it works is: In normal operation, all nodes communicate via corosync. Corosync in turn manages the distributed locking and ensures that locks are ordered across all nodes (virtual synchrony). As soon as communication fails on one or more nodes, locks are no longer issued and all I/O is blocked until: a) The node responds finally or b) A timeout is reached and corosync issues a fence against the incommunicado node(s). Once a fence is issued, nothing will proceed until, and only until, the fence agent returns a successful fence message to the fence daemon. In the case of a split brain (nodes partition and are up but not talking to each other), both partitions will issue a fence against the other node(s). This is now a race, often described as an old-west style duel. Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. With a successful fence, the surviving partition (which could be just one node), will reconfigure and then begin restoring the clustered file system (GFS2 in this case). Once recovery is complete, I/O unblocks and continues. With SAN-side fencing, a fence is in the form of a logic disconnection from the storage network. 
This has no inherent mechanism for recovery, so the sysadmin will have to manually recover the node(s). For this reason, I do not prefer it. With power fencing, by far the most common method which can be implemented via IPMI, addressable PDUs, etc, the node that is fenced is rebooted. The benefit of this method is that the node may well reboot "healthy" and then be able to rejoin the cluster automatically. Of course, if you prefer, you can have nodes powered off and left off. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From jeff.sturm at eprize.com Wed Nov 10 18:04:19 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Wed, 10 Nov 2010 13:04:19 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <1634100741B94E019943576441CD6873@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net> <1634100741B94E019943576441CD6873@versa> Message-ID: <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Nicolas Ross > Sent: Wednesday, November 10, 2010 11:22 AM > To: linux clustering > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > Performance-wise, GFS should should be OK for that if you are running > > with noatime and the operations are all reads. If you end up with > > write contention without partitioning the access to directory subtrees > > on a per server basis, the performance will fall off a cliff pretty quickly. > > Can you explain a little bit more ? I'm not sure I fully understand the partitioning into > directories ? We had to make similar changes to our application. Avoid allowing two (or more) hosts to create small files in the same shared directory within a GFS filesystem. That particular case scales poorly with GFS. If you can partition things so that two hosts will never create files in the same directory (we used a per-host directory structure for our application), or perhaps direct all write operations to one host while other hosts only read from GFS, it should perform well. -Jeff From rossnick-lists at cybercat.ca Wed Nov 10 19:32:21 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 14:32:21 -0500 Subject: [Linux-cluster] Starter Cluster / GFS References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> Message-ID: <58753D1C20B84B8682E080FB080E69E1@versa> > We had to make similar changes to our application. > > Avoid allowing two (or more) hosts to create small files in the same > shared directory within a GFS filesystem. That particular case scales > poorly with GFS. > > If you can partition things so that two hosts will never create files in > the same directory (we used a per-host directory structure for our > application), or perhaps direct all write operations to one host while > other hosts only read from GFS, it should perform well. Ok, I see. Our applications will read/write into its own directory most of the time. In the rare cases when it'll be possible that 2 nodes read/writes to the same directory, it'll be for php sessions files. 
If we ever need to reach to this stage, we'll have to make a custom session handler to put them into a central memcached or something else... From yvette at dbtgroup.com Wed Nov 10 19:38:59 2010 From: yvette at dbtgroup.com (yvette hirth) Date: Wed, 10 Nov 2010 19:38:59 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <1634100741B94E019943576441CD6873@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAA863.2040100@bobich.net> <1634100741B94E019943576441CD6873@versa> Message-ID: <4CDAF4D3.9060105@dbtgroup.com> Nicolas Ross wrote: > What do you suggest to acheive size in the range of 6-7 TB, maybe more ? i suggest RAID10. we have a promise 16x2TB fibre raid array, and we've got two sets of six drives in two RAID10 arrays. RAID10 arrays experience much faster rebuild rates than RAID5. RAID10 offers much faster rebuild times, a nice combination of read v. write performance, but wastes a lot of space... at the end of the day, it's your choice. more on RAID here: http://en.wikipedia.org/wiki/RAID hth yvette hirth From RJM002 at shsu.edu Wed Nov 10 20:50:50 2010 From: RJM002 at shsu.edu (Marti, Robert) Date: Wed, 10 Nov 2010 14:50:50 -0600 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <58753D1C20B84B8682E080FB080E69E1@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> Message-ID: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Nicolas Ross > Sent: Wednesday, November 10, 2010 1:32 PM > To: linux clustering > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > We had to make similar changes to our application. > > > > Avoid allowing two (or more) hosts to create small files in the same > > shared directory within a GFS filesystem. That particular case scales > > poorly with GFS. > > > > If you can partition things so that two hosts will never create files > > in the same directory (we used a per-host directory structure for our > > application), or perhaps direct all write operations to one host while > > other hosts only read from GFS, it should perform well. > > Ok, I see. Our applications will read/write into its own directory most of the > time. In the rare cases when it'll be possible that 2 nodes read/writes to the > same directory, it'll be for php sessions files. If we ever need to reach to this > stage, we'll have to make a custom session handler to put them into a central > memcached or something else... > If that's the case, why look at shared storage at all? 
From Chris.Jankowski at hp.com Wed Nov 10 21:04:01 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Wed, 10 Nov 2010 21:04:01 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> Message-ID: <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> Robert, One reason is that with GFS2 you do not have to do fsck on the surviving node after one node in the cluster failed. Doing fsck ona 20 TB filesystem with heaps of files may take well over an hour. So, if you built your cluster for HA you'd rather avoid it. The locks need to be recovered, but this is much faster operation and fairly time bound. Fsck is not. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Marti, Robert Sent: Thursday, 11 November 2010 07:51 To: 'linux clustering' Subject: Re: [Linux-cluster] Starter Cluster / GFS > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Nicolas Ross > Sent: Wednesday, November 10, 2010 1:32 PM > To: linux clustering > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > We had to make similar changes to our application. > > > > Avoid allowing two (or more) hosts to create small files in the same > > shared directory within a GFS filesystem. That particular case > > scales poorly with GFS. > > > > If you can partition things so that two hosts will never create > > files in the same directory (we used a per-host directory structure > > for our application), or perhaps direct all write operations to one > > host while other hosts only read from GFS, it should perform well. > > Ok, I see. Our applications will read/write into its own directory > most of the time. In the rare cases when it'll be possible that 2 > nodes read/writes to the same directory, it'll be for php sessions > files. If we ever need to reach to this stage, we'll have to make a > custom session handler to put them into a central memcached or something else... > If that's the case, why look at shared storage at all? 
-- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From RJM002 at shsu.edu Wed Nov 10 21:37:08 2010 From: RJM002 at shsu.edu (Marti, Robert) Date: Wed, 10 Nov 2010 15:37:08 -0600 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> Message-ID: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079C@EXMBX.SHSU.EDU> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Jankowski, Chris > Sent: Wednesday, November 10, 2010 3:04 PM > To: linux clustering > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > Robert, > > One reason is that with GFS2 you do not have to do fsck on the surviving > node after one node in the cluster failed. > > Doing fsck ona 20 TB filesystem with heaps of files may take well over an > hour. > > So, if you built your cluster for HA you'd rather avoid it. > > The locks need to be recovered, but this is much faster operation and fairly > time bound. Fsck is not. > > Regards, > > Chris Jankowski > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Marti, Robert > Sent: Thursday, 11 November 2010 07:51 > To: 'linux clustering' > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > > bounces at redhat.com] On Behalf Of Nicolas Ross > > Sent: Wednesday, November 10, 2010 1:32 PM > > To: linux clustering > > Subject: Re: [Linux-cluster] Starter Cluster / GFS > > > > > We had to make similar changes to our application. > > > > > > Avoid allowing two (or more) hosts to create small files in the same > > > shared directory within a GFS filesystem. That particular case > > > scales poorly with GFS. > > > > > > If you can partition things so that two hosts will never create > > > files in the same directory (we used a per-host directory structure > > > for our application), or perhaps direct all write operations to one > > > host while other hosts only read from GFS, it should perform well. > > > > Ok, I see. Our applications will read/write into its own directory > > most of the time. In the rare cases when it'll be possible that 2 > > nodes read/writes to the same directory, it'll be for php sessions > > files. If we ever need to reach to this stage, we'll have to make a > > custom session handler to put them into a central memcached or > something else... > > > > If that's the case, why look at shared storage at all? > > -- In this scenario, he's not building the apps for HA (single server at a time, except maybe for sessions) he's not using massive filesystems (5-6TB total)... The overhead involved in managing shared storage isn't typically worth it if you're not going to leverage the shared portion of it. 
Rob Marti From rossnick-lists at cybercat.ca Wed Nov 10 23:12:51 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 18:12:51 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> Message-ID: Redundancy for high availability. If a node fails, I can restart the service manually, or automatically on another node, without losing any data. Also, there is some common data between services that needs to be available in real-time. > If that's the case, why look at shared storage at all? From jakov.sosic at srce.hr Wed Nov 10 23:50:12 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Thu, 11 Nov 2010 00:50:12 +0100 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CCBFAD3.9010305@srce.hr> References: <4CCBFAD3.9010305@srce.hr> Message-ID: <4CDB2FB4.8000309@srce.hr> On 10/30/2010 01:00 PM, Jakov Sosic wrote: > Hi! > > What is best practice for keeping and updating configurations of > services that someone runs in cluster? For example, if I run > via cluster agent, then I create /etc/cluster/httpd- on > each node in the domain (cp -r /etc/httpd /etc/cluster/httpd-; cd > /etc/cluster/httpd-; rm -f logs run modules; ln -s .....). > > Now, Im puzzled how do you sync configurations between nodes? I do it > manually currently, but am seeking some automation of the process. > > I do not want to keep configurations of EACH service ona shared disks, > for some services I want to have configurations on each node available. > > > Any thoughts on this one? Well, let me say something then :) I'm thinking about starting a project - developing a set of utilities that would work just like "ccs_tool update /etc/cluster/cluster.conf", but could update any config file in the /etc/ directory. What do you think about this? From rossnick-lists at cybercat.ca Thu Nov 11 00:13:32 2010 From: rossnick-lists at cybercat.ca (Nicolas Ross) Date: Wed, 10 Nov 2010 19:13:32 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079C@EXMBX.SHSU.EDU> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079C@EXMBX.SHSU.EDU> Message-ID: So, if I read you correctly, I would be better off making a big logical volume and smaller partitions inside it to put my services on it, mount the relevant partition on a server-by-server basis, and manage my shared portion otherwise ? > > In this scenario, he's not building the apps for HA (single server at a time, except maybe for sessions) he's not using massive filesystems (5-6TB total)...
> > The overhead involved in managing shared storage isn't typically worth it if you're not going to leverage the shared portion of it. > > Rob Marti From Chris.Jankowski at hp.com Thu Nov 11 02:30:19 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 02:30:19 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CDB2FB4.8000309@srce.hr> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> Message-ID: <036B68E61A28CA49AC2767596576CD596F5848362C@GVW1113EXC.americas.hpqcorp.net> Jakov, If you make it general enough you may end up with rsync. How would you position your tool in the continuum between ccs_tool update .. And rsync? Where would it add value? Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jakov Sosic Sent: Thursday, 11 November 2010 10:50 To: linux clustering Subject: Re: [Linux-cluster] Configurations of services? On 10/30/2010 01:00 PM, Jakov Sosic wrote: > Hi! > > What is best practice for keeping and updating configurations of > services that someone runs in cluster? For example, if I run > via cluster agent, then I create /etc/cluster/httpd- on > each node in the domain (cp -r /etc/httpd /etc/cluster/httpd-; > cd /etc/cluster/httpd-; rm -f logs run modules; ln -s .....). > > Now, Im puzzled how do you sync configurations between nodes? I do it > manually currently, but am seeking some automation of the process. > > I do not want to keep configurations of EACH service ona shared disks, > for some services I want to have configurations on each node available. > > > Any thoughts on this one? Well, let me say something then :) I'm thinking about starting a project - developing set of utilities that would work just like "ccs_tool update /etc/cluster/cluster.conf", but could update any config file in /etc/ directory. What do you think about this? -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Chris.Jankowski at hp.com Thu Nov 11 03:29:44 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 03:29:44 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDACB37.3070704@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> Digimer, 1. Digimer wrote: >>>Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. 2. Your comment did not explain what role the quorum disk plays in the cluster. Also, if there are any useful cluster quorum disk heuristics that can be used in this case. 
Thanks and regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer Sent: Thursday, 11 November 2010 03:41 To: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-10 11:09 AM, Gordan Bobic wrote: > Digimer wrote: >> On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>>> If you want the FS mounted on all nodes at the same time then all >>>>> those nodes must be a part of the cluster, and they have to be >>>>> quorate (majority of nodes have to be up). You don't need a quorum >>>>> block device, but it can be useful when you have only 2 nodes. >>>> At term, I will have 7 to 10 nodes, but 2 at first for initial >>>> setup and testing. Ok, so if I have a 3 nodes cluster for exemple, >>>> I need at least 2 nodes for the cluster, and thus the gfs, to be up >>>> ? I cannot have a running gfs with only one node ? >>> In a 2-node cluster, you can have running GFS with just one node up. >>> But in that case it is advisble to have a quorum block device on the SAN. >>> With a 3 node cluster, you cannot have quorum with just 1 node, and >>> thus you cannot have GFS running. It will block until quorum is >>> re-established. >> >> With a quorum disk, you can in fact have one node left and still have >> quorum. This is because the quorum drive should have (node-1) votes, >> thus always giving the last node 50%+1 even with all other nodes >> being dead. > > I've never tried testing that use-case extensively, but I suspect that > it is only safe to do with SAN-side fencing. Otherwise two nodes could > lose contact with each other and still both have access to the SAN and > thus both be individually quorate. > > Gordan Clustered storage *requires* fencing. To not use fencing is like driving tired; It's just a matter of time before something bad happens. That said, I should have been more clear in specifying the requirement for fencing. Now that said, the fencing shouldn't be needed at the SAN side, though that works fine as well. The way it works is: In normal operation, all nodes communicate via corosync. Corosync in turn manages the distributed locking and ensures that locks are ordered across all nodes (virtual synchrony). As soon as communication fails on one or more nodes, locks are no longer issued and all I/O is blocked until: a) The node responds finally or b) A timeout is reached and corosync issues a fence against the incommunicado node(s). Once a fence is issued, nothing will proceed until, and only until, the fence agent returns a successful fence message to the fence daemon. In the case of a split brain (nodes partition and are up but not talking to each other), both partitions will issue a fence against the other node(s). This is now a race, often described as an old-west style duel. Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. With a successful fence, the surviving partition (which could be just one node), will reconfigure and then begin restoring the clustered file system (GFS2 in this case). Once recovery is complete, I/O unblocks and continues. With SAN-side fencing, a fence is in the form of a logic disconnection from the storage network. This has no inherent mechanism for recovery, so the sysadmin will have to manually recover the node(s). For this reason, I do not prefer it. 
With power fencing, by far the most common method which can be implemented via IPMI, addressable PDUs, etc, the node that is fenced is rebooted. The benefit of this method is that the node may well reboot "healthy" and then be able to rejoin the cluster automatically. Of course, if you prefer, you can have nodes powered off and left off. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From linux at alteeve.com Thu Nov 11 04:29:46 2010 From: linux at alteeve.com (Digimer) Date: Wed, 10 Nov 2010 23:29:46 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDB713A.8080303@alteeve.com> On 10-11-10 10:29 PM, Jankowski, Chris wrote: > Digimer, > > 1. > Digimer wrote: >>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. > > Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). > > What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. That is somewhat frightening. My experience is limited to stock IPMI and Node Assassin. I've not seen a situation where both die. I'd strongly suggest that a bug be filed. > 2. > Your comment did not explain what role the quorum disk plays in the cluster. Also, if there are any useful cluster quorum disk heuristics that can be used in this case. > > Thanks and regards, > > Chris Jankowski Ah, the idea is that, with the quorum disk (ignoring heuristics for the moment), if only one node is left alive, the quorum disk will contribute sufficient votes for quorum to be achieved. Of course, this depends on the node(s) having access to the qdisk still. Now for heuristics; Consider this; you have a 7-node cluster; - Each node gets 1 vote. - The qdisk gets 6 votes. - Total votes is 13, quorum then is >= 7. You cluster partitions, say from a network failure. Six nodes separate from a core switch, while one happens to still have access to a critical route (say, to the Internet). The heuristic test (ie: pinging an external server) will pass for the 1 node and fail for the six others. The one node with access to the critical route will be the one to get the votes of the quorum disk (1 + 6 = 7, quorum!) while the other six will get six votes (1 + 1 + 1 + 1 + 1 + 1 = 6, no quorum). The six nodes will lose and be fenced and will not be able to rejoin the cluster until they regain access to that critical route. 
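In cluster.conf terms, that example would look roughly like this (the label, device-less setup, votes and ping target are made up for illustration - check qdisk(5) for the attributes your version supports):

  <cman expected_votes="13"/>
  <quorumd label="qdisk1" votes="6" interval="1" tko="10" min_score="1">
    <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
  </quorumd>

A node whose heuristic fails drops below min_score and gives up its claim on the quorum disk, so only the side that still passes the ping test gets the extra six votes.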
-- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From Chris.Jankowski at hp.com Thu Nov 11 05:48:10 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 05:48:10 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDB713A.8080303@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F326A@GVW1113EXC.americas.hpqcorp.net> Digimer, Again, the heuristic you gave does not pass the data centre operational sanity test. First of all, in data centres everything is redundant, so you have 2 core switches. Of course you could ping both of them and have some NAND logic. That is not important. The point is that no matter what you'd do, your cluster cannot fix the network. So, fencing nodes on network failure is the last thing you want to do. You loose warm database caches, user sessions and incomplete transactions. Disk quorum times out in 10 seconds or so. A typical network meltdown due to spanning tree recalculation is 40 seconds. If the proposed heuristic was applied to the 7 node clusters they all will murder each other and there will be nothing left. You'd convert a localised, short term network problem into a cluster wide disaster. In fact, I have yet to see a heuristic that would make sense in real world. I cannot think of one. Regards, Chris Jankowski -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Thursday, 11 November 2010 15:30 To: linux clustering Cc: Jankowski, Chris Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-10 10:29 PM, Jankowski, Chris wrote: > Digimer, > > 1. > Digimer wrote: >>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. > > Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). > > What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. That is somewhat frightening. My experience is limited to stock IPMI and Node Assassin. I've not seen a situation where both die. I'd strongly suggest that a bug be filed. > 2. > Your comment did not explain what role the quorum disk plays in the cluster. Also, if there are any useful cluster quorum disk heuristics that can be used in this case. > > Thanks and regards, > > Chris Jankowski Ah, the idea is that, with the quorum disk (ignoring heuristics for the moment), if only one node is left alive, the quorum disk will contribute sufficient votes for quorum to be achieved. Of course, this depends on the node(s) having access to the qdisk still. Now for heuristics; Consider this; you have a 7-node cluster; - Each node gets 1 vote. - The qdisk gets 6 votes. - Total votes is 13, quorum then is >= 7. You cluster partitions, say from a network failure. Six nodes separate from a core switch, while one happens to still have access to a critical route (say, to the Internet). 
The heuristic test (ie: pinging an external server) will pass for the 1 node and fail for the six others. The one node with access to the critical route will be the one to get the votes of the quorum disk (1 + 6 = 7, quorum!) while the other six will get six votes (1 + 1 + 1 + 1 + 1 + 1 = 6, no quorum). The six nodes will lose and be fenced and will not be able to rejoin the cluster until they regain access to that critical route. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From gordan at bobich.net Thu Nov 11 08:56:09 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 08:56:09 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <1634100741B94E019943576441CD6873@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAA863.2040100@bobich.net> <1634100741B94E019943576441CD6873@versa> Message-ID: <4CDBAFA9.8040707@bobich.net> Nicolas Ross wrote: >>> The volume will be composed of 7 1TB disk in raid5, so 6 TB. >> >> Be careful with that arrangement. You are right up against the ragged >> edge >> in terms of data safety. >> >> 1TB disks a consumer grade SATA disks with non-recoverable error rates of >> about 10^-14. That is one non-recoverable error per 11TB. >> >> Now consider what happens when one of your disks fails. You have to read >> 6TB to reconstruct the failed disk. With error rate of 1 in 11TB, the >> chances of another failure occurring in 6TB of reads is about 53%. So the >> chances are that during this operation, you are going to have another >> failure, and the chances are that your RAID layer will kick the disk out >> as faulty - at which point you will find yourself with 2 failed disks >> in a >> RAID5 array and in need of a day or two of downtime to scrub your data to >> a fresh array and hope for the best. >> >> RAID5 is ill suited to arrays over 5TB. Using enterprise grade disks will >> gain you an improved error rate (10^-15), which makes it good enough - if >> you also have regular backups. But enterprise grade disks are much >> smaller >> and much more expensive. >> >> Not to mention that your performance on small writes (smaller than the >> stripe width) will be appalling with RAID5 due to the write-read-write >> operation required to construct the parity which will reduce your >> effective performance to that of a single disk. > > Wow... > > The enclosure I will use (and already have) is an activestorage's > activeraid > in 16 x 1tb config. (http://www.getactivestorage.com/activeraid.php). I dealt with them before. All I'm going to say is - disregard any and all performance figures they claim and work out what the performance is likely to be from basic principles. Provided you stick to that and ignore the marketing specmanship, as far as enterprisey storage appliances go, those are reasonably good value for money. > The > drives are Hitachi model HDE721010SLA33. From what I could find, error rate > is 1 in 10^15. That makes it less bad than my figures above, but still, be careful. >>> It will host many, many small files, and some biger files. But the files >>> that change the most often will mos likely be smaller than the blocsize. >> >> That sounds like a scenario from hell for RAID5 (or RAID6). > > What do you suggest to acheive size in the range of 6-7 TB, maybe more ? 
RAID10 if you need more performance than that of a single disk, unless your I/O operations are always very big (bigger than the RAID stripe width). stripe_width = chunk_size * number_of_disks Smaller disks are good for reducing rebuild times, and more smaller disks will give you better performance. It all depends on the nature of the I/O and the performance you require. >>> The gfs will not be used for io-intensive tasks, that's where the >>> standalone volumes comes into play. It'll be used to access many files, >>> often. Specificly, apache will run from it, with document root, session >>> store, etc on the gfs. >> >> Performance-wise, GFS should should be OK for that if you are running >> with >> noatime and the operations are all reads. If you end up with write >> contention without partitioning the access to directory subtrees on a per >> server basis, the performance will fall off a cliff pretty quickly. > > Can you explain a little bit more ? I'm not sure I fully understand the > partitioning into directories ? Make sure that only one node only accesses a particular directory subtree (until it gets failed over, that is). If you have multiple nodes simultaneously writing to the same directory with any regularity you will experience performance issues. Gordan From gordan at bobich.net Thu Nov 11 08:59:30 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 08:59:30 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire><4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> Message-ID: <4CDBB072.7020001@bobich.net> Nicolas Ross wrote: >>> With a quorum disk, you can in fact have one node left and still have >>> quorum. This is because the quorum drive should have (node-1) votes, >>> thus always giving the last node 50%+1 even with all other nodes >>> being dead. >> >> I've never tried testing that use-case extensively, but I suspect that >> it is only safe to do with SAN-side fencing. Otherwise two nodes could >> lose contact with each other and still both have access to the SAN and >> thus both be individually quorate. > > I our case, a particular node will run a particular service from a > particular directory in the disk. So, even if 2 nodes looses contacts to > each other, they should not end up writing or reading from the same > files. Am I wrong ? If two nodes lose contact to each other, one will fence each other and shut it down. If you don't need concurrent access, then why do you need a cluster file system? Gordan From gordan at bobich.net Thu Nov 11 09:04:20 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:04:20 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDACB37.3070704@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> Message-ID: <4CDBB194.2020601@bobich.net> Digimer wrote: > On 10-11-10 11:09 AM, Gordan Bobic wrote: >> Digimer wrote: >>> On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>>>> If you want the FS mounted on all nodes at the same time then all >>>>>> those nodes must be a part of the cluster, and they have to be >>>>>> quorate (majority of nodes have to be up). 
You don't need a quorum >>>>>> block device, but it can be useful when you have only 2 nodes. >>>>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup >>>>> and testing. Ok, so if I have a 3 nodes cluster for exemple, I need at >>>>> least 2 nodes for the cluster, and thus the gfs, to be up ? I cannot >>>>> have a running gfs with only one node ? >>>> In a 2-node cluster, you can have running GFS with just one node up. But >>>> in that case it is advisble to have a quorum block device on the SAN. >>>> With a 3 node cluster, you cannot have quorum with just 1 node, and thus >>>> you cannot have GFS running. It will block until quorum is >>>> re-established. >>> With a quorum disk, you can in fact have one node left and still have >>> quorum. This is because the quorum drive should have (node-1) votes, >>> thus always giving the last node 50%+1 even with all other nodes being >>> dead. >> I've never tried testing that use-case extensively, but I suspect that >> it is only safe to do with SAN-side fencing. Otherwise two nodes could >> lose contact with each other and still both have access to the SAN and >> thus both be individually quorate. >> >> Gordan > > Clustered storage *requires* fencing. To not use fencing is like driving > tired; It's just a matter of time before something bad happens. That > said, I should have been more clear in specifying the requirement for > fencing. > > Now that said, the fencing shouldn't be needed at the SAN side, though > that works fine as well. The default fencing action, last time I checked, is reboot. Consider the use case where you have a network failure and separate networks for various things, and you lose connectivity between the nodes but they both still have access to the SAN. One node gets fenced, reboots, comes up and connects to the SAN. It connects to the quorum device and has quorum without the other nodes, and mounts the file systems and starts writing - while all the other nodes that have become partitioned off do the same thing. Unless you can fence the nodes from the SAN side, quorum device having a 50% weight is a recipe for disaster. > The way it works is: [...] I'm well aware of how fencing works, but you overlooked one major failure mode that is essentially guaranteed to hose your data if you set up the quorum device to have 50% of the votes. > With SAN-side fencing, a fence is in the form of a logic disconnection > from the storage network. This has no inherent mechanism for recovery, > so the sysadmin will have to manually recover the node(s). For this > reason, I do not prefer it. Then don't use a quorum device with more than an equal weight to the individual nodes. Gordan From gordan at bobich.net Thu Nov 11 09:08:00 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:08:00 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <58753D1C20B84B8682E080FB080E69E1@versa> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> Message-ID: <4CDBB270.9000505@bobich.net> Nicolas Ross wrote: > Ok, I see. Our applications will read/write into its own directory most > of the time. In the rare cases when it'll be possible that 2 nodes > read/writes to the same directory, it'll be for php sessions files. 
If > we ever need to reach to this stage, we'll have to make a custom session > handler to put them into a central memcached or something else... You may be better off moving the session files to an asynchronous storage medium. Something like master-master replicated MySQL or SeznamFS if you want to use the file system. You'll likely save a considerable amount of latency on accesses. You don't need 100% real-time synchronicity on session information in this way. A few milliseconds of lag should be fine and it'll reduce the access latencies by potentially quite a lot. Gordan From gordan at bobich.net Thu Nov 11 09:13:50 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:13:50 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> <036B68E61A28CA49AC2767596576CD596F58483534@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBB3CE.8020603@bobich.net> Jankowski, Chris wrote: > Robert, > > One reason is that with GFS2 you do not have to do fsck on the surviving node > after one node in the cluster failed. You don't have to do fsck after an unclean shutdown anyway, provided you use a journaled file system. GFS2 avoids the need for fsck through journaling same as any other journaled file system, not through some other magic. > Doing fsck ona 20 TB filesystem with heaps of files may take well over an hour. Depends on your file system. fsck on one of my (4TB RAID10 arrays took only about 2 minutes with ext4. Scaling that by 5x to get to 20TB still implies a figure of about 10 minutes, well short of an hour. Gordan From gordan at bobich.net Thu Nov 11 09:18:30 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:18:30 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net><1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net><4CDAA863.2040100@bobich.net><1634100741B94E019943576441CD6873@versa> <64D0546C5EBBD147B75DE133D798665F06A126C9@hugo.eprize.local> <58753D1C20B84B8682E080FB080E69E1@versa> <8FAC1E47484E43469AA28DBF35C955E4BDF5ED079B@EXMBX.SHSU.EDU> Message-ID: <4CDBB4E6.70101@bobich.net> Nicolas Ross wrote: > Redundency for high-availaibility. > > If a node fail, I can restart the service manually, or automaticly on > another node, without loosing any data. You can do that anyway. You make the SAN exported block device and the non-shared FS on that share into dependent services. You make it so the FS service requires the block device service, and make the application providing service depend on the file system service. That ensures they'll all come up in the correct order and fail over together. Using a shared file system gains you nothing in the use cases you are describing, other than reduce the performance. > Also, there are come common data between services that need to be availaible in real-time. That's fair enough, but in that case some volume splitting may be in order (have the common static data on GFS and have everything else on fail-over non-shared file systems). 
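Going back to the dependency chain above, one way to express it in cluster.conf is to nest the resources, which gives you the start/stop ordering (resource names, devices and paths here are made up for illustration):

  <service name="websvc" autostart="1" recovery="relocate">
    <lvm name="websvc-lvm" vg_name="vg_web" lv_name="lv_web">
      <fs name="websvc-fs" device="/dev/vg_web/lv_web" mountpoint="/srv/websvc" fstype="ext3" force_unmount="1">
        <script name="websvc-httpd" file="/etc/init.d/httpd"/>
      </fs>
    </lvm>
  </service>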
For optimal performance, you should unshare as much as possible. Gordan From gordan at bobich.net Thu Nov 11 09:19:47 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:19:47 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CDB2FB4.8000309@srce.hr> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> Message-ID: <4CDBB533.3030805@bobich.net> Jakov Sosic wrote: > On 10/30/2010 01:00 PM, Jakov Sosic wrote: >> Hi! >> >> What is best practice for keeping and updating configurations of >> services that someone runs in cluster? For example, if I run >> via cluster agent, then I create /etc/cluster/httpd- on >> each node in the domain (cp -r /etc/httpd /etc/cluster/httpd-; cd >> /etc/cluster/httpd-; rm -f logs run modules; ln -s .....). >> >> Now, Im puzzled how do you sync configurations between nodes? I do it >> manually currently, but am seeking some automation of the process. >> >> I do not want to keep configurations of EACH service ona shared disks, >> for some services I want to have configurations on each node available. >> >> >> Any thoughts on this one? > > > Well, let me say something then :) I'm thinking about starting a project > - developing set of utilities that would work just like "ccs_tool update > /etc/cluster/cluster.conf", but could update any config file in /etc/ > directory. > > What do you think about this? You may want to look at csync2 before you re-invent that particular wheel. :) Gordan From gordan at bobich.net Thu Nov 11 09:23:45 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:23:45 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBB621.2090809@bobich.net> Jankowski, Chris wrote: > Digimer, > > 1. > Digimer wrote: >>>> Both partitions will try to fence the other, but the slower >>>>will lose and get fenced before it can fence. > > Well, this is certainly not my experience in dealing with modern > rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). > > What actually happens in two node clusters is that both servers > issue the fence request to the iLO or DRAC. It gets processed > and *both* servers get powered off. Ouch!! Your 100% HA cluster > becomes 100% dead cluster. Indeed, I've seen this, too, on a range of hardware. My quick and dirty solution was to doctor the fencing agent to add a different sleep() on each node, in order of survivor preference. There may be a setting in cluster.conf that can be used to achieve the same effect, can't remember off the top of my head. 
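A minimal sketch of that kind of doctoring is a thin wrapper installed in place of the real agent on each node (the agent name and delay values are only examples - give the node you would rather lose the longer sleep):

  #!/bin/sh
  # /usr/sbin/fence_ipmilan_delayed - stagger the fencing shoot-out so that
  # the preferred survivor always gets its shot in first.
  sleep 5                            # e.g. 0 on node1, 5 on node2
  exec /usr/sbin/fence_ipmilan "$@"  # fenced's stdin and arguments pass straight through

You then point the fencedevice entry in cluster.conf at the wrapper instead of the real agent.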
Gordan From gordan at bobich.net Thu Nov 11 09:27:41 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:27:41 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDB713A.8080303@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> Message-ID: <4CDBB70D.6080204@bobich.net> Digimer wrote: > On 10-11-10 10:29 PM, Jankowski, Chris wrote: >> Digimer, >> >> 1. >> Digimer wrote: >>>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. >> Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). >> >> What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. > > That is somewhat frightening. My experience is limited to stock IPMI and > Node Assassin. I've not seen a situation where both die. I'd strongly > suggest that a bug be filed. It's actually fairly predictable and quite common. If the nodes lose connectivity to each other but both are actually alive (e.g. cluster service switch failure), you will get this sort of a shoot-out. The cause is that most out-of-band power-off mechanisms have an inherent lag of several seconds (i.e. it can be a few seconds between when you issue a power-off command and the machine actually powers off). During that race window, both machines may issue a remote power-off before they actually shut down themselves. Gordan From gordan at bobich.net Thu Nov 11 09:31:57 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 09:31:57 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F326A@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F326A@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBB80D.70301@bobich.net> Jankowski, Chris wrote: > The point is that no matter what you'd do, your cluster cannot fix the network. > So, fencing nodes on network failure is the last thing you want to do. You loose > warm database caches, user sessions and incomplete transactions. Disk quorum times > out in 10 seconds or so. A typical network meltdown due to spanning tree recalculation > is 40 seconds. I'd argue that if you regularly get outages of 40 seconds due to spanning tree rebuilds, you have bigger problems (such as too many machines on the same VLAN). And if you have that many nodes in a cluster (you do keep your cluster interfaces on a dedicated VLAN, right?), you are doing way better than what the claimed limits for RHCS are. 
:) Gordan From Chris.Jankowski at hp.com Thu Nov 11 09:59:13 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 09:59:13 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBB70D.6080204@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> Gordan, I do understand the mechanism. I was trying to gently point out that this behaviour is unacceptable for my commercial IP customers. The customers buy clusters for high availability. Loosing the whole cluster due to single component failure - hearbeat link is not acceptable. The heartbeat link is a huge SPOF. And the cluster design does not support redundant links for heartbeat. Also, none of the commercially available UNIX clusters or Linux clusters (HP ServiceGuard, Veritas, SteelEye) would display this type of behaviour and they do not clobber cluster filesystems. So, it is possible to achieve acceptable reaction to this type of failure. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic Sent: Thursday, 11 November 2010 20:28 To: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS Digimer wrote: > On 10-11-10 10:29 PM, Jankowski, Chris wrote: >> Digimer, >> >> 1. >> Digimer wrote: >>>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence. >> Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell). >> >> What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off. Ouch!! Your 100% HA cluster becomes 100% dead cluster. > > That is somewhat frightening. My experience is limited to stock IPMI > and Node Assassin. I've not seen a situation where both die. I'd > strongly suggest that a bug be filed. It's actually fairly predictable and quite common. If the nodes lose connectivity to each other but both are actually alive (e.g. cluster service switch failure), you will get this sort of a shoot-out. The cause is that most out-of-band power-off mechanisms have an inherent lag of several seconds (i.e. it can be a few seconds between when you issue a power-off command and the machine actually powers off). During that race window, both machines may issue a remote power-off before they actually shut down themselves. 
Gordan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Thu Nov 11 10:07:31 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 10:07:31 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBC063.2060605@bobich.net> Jankowski, Chris wrote: > Gordan, > > I do understand the mechanism. I was trying to gently point out that > this behaviour is unacceptable for my commercial IP customers. The customers > buy clusters for high availability. Loosing the whole cluster due to single > component failure - hearbeat link is not acceptable. The heartbeat link is > a huge SPOF. And the cluster design does not support redundant links for > heartbeat. > > Also, none of the commercially available UNIX clusters or Linux clusters > (HP ServiceGuard, Veritas, SteelEye) would display this type of behaviour > and they do not clobber cluster filesystems. So, it is possible to > achieve acceptable reaction to this type of failure. My point was that you can easily overcome the race by introducing a staggered delay into fencing that works around the race condition. I never tried, but are you sure bonded devices don't work for heartbeat? Gordan From Chris.Jankowski at hp.com Thu Nov 11 10:30:43 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 11 Nov 2010 10:30:43 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBC063.2060605@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDBC063.2060605@bobich.net> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> Gordan, I did not ask for bonding. This should work. I asked for multiple independent links - different networking interfaces configured for different IP subnets mapping to different VLANS. STP is, these days, run on a per VLAN basis. Having multiple links in different VLANs protects against important classes of network failures. Bonded interface does not do it. This must be integrated in the clustering software. Regards, Chris Jankowski -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Gordan Bobic Sent: Thursday, 11 November 2010 21:08 To: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS Jankowski, Chris wrote: > Gordan, > > I do understand the mechanism. I was trying to gently point out that > this behaviour is unacceptable for my commercial IP customers. The > customers buy clusters for high availability. 
Loosing the whole > cluster due to single component failure - hearbeat link is not > acceptable. The heartbeat link is a huge SPOF. And the cluster design > does not support redundant links for heartbeat. > > Also, none of the commercially available UNIX clusters or Linux > clusters (HP ServiceGuard, Veritas, SteelEye) would display this type > of behaviour and they do not clobber cluster filesystems. So, it is > possible to achieve acceptable reaction to this type of failure. My point was that you can easily overcome the race by introducing a staggered delay into fencing that works around the race condition. I never tried, but are you sure bonded devices don't work for heartbeat? Gordan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Thu Nov 11 10:46:25 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 10:46:25 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDBC063.2060605@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDBC981.4010807@bobich.net> Jankowski, Chris wrote: > Gordan, > > I did not ask for bonding. This should work. I asked for > multiple independent links - different networking interfaces > configured for different IP subnets mapping to different VLANS. > > STP is, these days, run on a per VLAN basis. Having multiple > links in different VLANs protects against important classes of > network failures. Bonded interface does not do it. This must > be integrated in the clustering software. I don't quite see the point you're making. If your goal is redundant networking, then you can achieve that by having bonded interfaces in each node, and each of the components of the bonded interface should go to a different switch. That will give you both extra bandwidth and a redundant path between all the nodes, which will ensure you don't end up with a partitioned cluster. Gordan From jakov.sosic at srce.hr Thu Nov 11 11:42:10 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Thu, 11 Nov 2010 12:42:10 +0100 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CDBB533.3030805@bobich.net> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> <4CDBB533.3030805@bobich.net> Message-ID: <4CDBD692.4020108@srce.hr> On 11/11/2010 10:19 AM, Gordan Bobic wrote: > Jakov Sosic wrote: >> On 10/30/2010 01:00 PM, Jakov Sosic wrote: >>> Hi! >>> >>> What is best practice for keeping and updating configurations of >>> services that someone runs in cluster? For example, if I run >>> via cluster agent, then I create /etc/cluster/httpd- on >>> each node in the domain (cp -r /etc/httpd /etc/cluster/httpd-; cd >>> /etc/cluster/httpd-; rm -f logs run modules; ln -s .....). >>> >>> Now, Im puzzled how do you sync configurations between nodes? I do it >>> manually currently, but am seeking some automation of the process. 
>>> >>> I do not want to keep configurations of EACH service ona shared disks, >>> for some services I want to have configurations on each node available. >>> >>> >>> Any thoughts on this one? >> >> >> Well, let me say something then :) I'm thinking about starting a project >> - developing set of utilities that would work just like "ccs_tool update >> /etc/cluster/cluster.conf", but could update any config file in /etc/ >> directory. >> >> What do you think about this? > > You may want to look at csync2 before you re-invent that particular > wheel. :) Thank you for your information, I'm getting at it right away... -- Jakov Sosic From jonathan.barber at gmail.com Thu Nov 11 13:25:38 2010 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Thu, 11 Nov 2010 13:25:38 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBC981.4010807@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDBC063.2060605@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> <4CDBC981.4010807@bobich.net> Message-ID: On 11 November 2010 10:46, Gordan Bobic wrote: > Jankowski, Chris wrote: >> >> Gordan, >> >> I did not ask for bonding. ?This should work. ?I asked for >> multiple independent links - different networking interfaces >> configured for different IP subnets mapping to different VLANS. > >> >> >> STP is, these days, run on a per VLAN basis. Having multiple >> links in different VLANs protects against important classes of >> network failures. ?Bonded interface does not do it. This must >> be integrated in the clustering software. > > I don't quite see the point you're making. If your goal is redundant > networking, then you can achieve that by having bonded interfaces in each > node, and each of the components of the bonded interface should go to a > different switch. That will give you both extra bandwidth and a redundant > path between all the nodes, which will ensure you don't end up with a > partitioned cluster. Chris' point is that if the STP has to recalculate (for example if the STP root node dies), then having multiple interfaces in the same VLAN will not help (if the time taken to recalculate is longer than the fencing timeout). But, if he can run the heartbeat across multiple VLANs, and the network supports per-VLAN STP, then he lowers the risk of both VLANs being affected by the same event and therefore reduces the likelihood of a shootout between the cluster nodes. Of course, it depends on the topology of the STP domains as to whether you are guaranteed to maintain at least one path between nodes in the cluster given a STP node failure. 
> Gordan -- Jonathan Barber From gordan at bobich.net Thu Nov 11 13:38:04 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 13:38:04 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDBC063.2060605@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3439@GVW1113EXC.americas.hpqcorp.net> <4CDBC981.4010807@bobich.net> Message-ID: <4CDBF1BC.7000103@bobich.net> Jonathan Barber wrote: > On 11 November 2010 10:46, Gordan Bobic wrote: >> Jankowski, Chris wrote: >>> Gordan, >>> >>> I did not ask for bonding. This should work. I asked for >>> multiple independent links - different networking interfaces >>> configured for different IP subnets mapping to different VLANS. >>> >>> STP is, these days, run on a per VLAN basis. Having multiple >>> links in different VLANs protects against important classes of >>> network failures. Bonded interface does not do it. This must >>> be integrated in the clustering software. >> I don't quite see the point you're making. If your goal is redundant >> networking, then you can achieve that by having bonded interfaces in each >> node, and each of the components of the bonded interface should go to a >> different switch. That will give you both extra bandwidth and a redundant >> path between all the nodes, which will ensure you don't end up with a >> partitioned cluster. > > Chris' point is that if the STP has to recalculate (for example if the > STP root node dies), then having multiple interfaces in the same VLAN > will not help (if the time taken to recalculate is longer than the > fencing timeout). But, if he can run the heartbeat across multiple > VLANs, and the network supports per-VLAN STP, then he lowers the risk > of both VLANs being affected by the same event and therefore reduces > the likelihood of a shootout between the cluster nodes. > > Of course, it depends on the topology of the STP domains as to whether > you are guaranteed to maintain at least one path between nodes in the > cluster given a STP node failure. Yes, but your cluster VLAN (the one that's monitored for heartbeating) should be isolated, rather than public, so the only nodes on it will be the cluster nodes (and probably the SAN). If with that many nodes your spanning tree recalculation still takes 40 seconds you have network gear that is unfit for purpose anyway. 
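As a concrete illustration of the bonded-interface approach Gordan describes a few messages up (one bond per node, with each leg cabled to a different switch), a minimal RHEL-style sketch; the device names, addresses and the choice of active-backup mode are assumptions:

    # /etc/sysconfig/network-scripts/ifcfg-bond0  (dedicated cluster network)
    DEVICE=bond0
    BOOTPROTO=none
    ONBOOT=yes
    IPADDR=10.10.10.1
    NETMASK=255.255.255.0
    BONDING_OPTS="mode=active-backup miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-eth1  (leg to switch A; the
    # ifcfg-eth2 leg to switch B looks the same)
    DEVICE=eth1
    BOOTPROTO=none
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes

mode=active-backup needs no configuration on the switches themselves, which keeps the two legs genuinely independent; link-aggregation modes such as 802.3ad would tie both legs to switch-side support.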
Gordan From linux at alteeve.com Thu Nov 11 16:38:38 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 11:38:38 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDC1C0E.6010804@alteeve.com> On 10-11-11 04:59 AM, Jankowski, Chris wrote: > Gordan, > > I do understand the mechanism. I was trying to gently point out that this behaviour is unacceptable for my commercial IP customers. The customers buy clusters for high availability. Loosing the whole cluster due to single component failure - hearbeat link is not acceptable. The heartbeat link is a huge SPOF. And the cluster design does not support redundant links for heartbeat. > > Also, none of the commercially available UNIX clusters or Linux clusters (HP ServiceGuard, Veritas, SteelEye) would display this type of behaviour and they do not clobber cluster filesystems. So, it is possible to achieve acceptable reaction to this type of failure. > > Regards, > > Chris Jankowski I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Thu Nov 11 16:44:14 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 11:44:14 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBB621.2090809@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDBB621.2090809@bobich.net> Message-ID: <4CDC1D5E.6030905@alteeve.com> On 10-11-11 04:23 AM, Gordan Bobic wrote: > Jankowski, Chris wrote: >> Digimer, >> >> 1. >> Digimer wrote: >>>>> Both partitions will try to fence the other, but the slower >>>>> will lose and get fenced before it can fence. >> >> Well, this is certainly not my experience in dealing with modern >> rack mounted or blade servers where you use iLO (on HP) or DRAC (on >> Dell). >> >> What actually happens in two node clusters is that both servers >> issue the fence request to the iLO or DRAC. It gets processed >> and *both* servers get powered off. Ouch!! Your 100% HA cluster >> becomes 100% dead cluster. > > Indeed, I've seen this, too, on a range of hardware. My quick and dirty > solution was to doctor the fencing agent to add a different sleep() on > each node, in order of survivor preference. 
There may be a setting in > cluster.conf that can be used to achieve the same effect, can't remember > off the top of my head. > > Gordan I've not seen such an option, though I make no claims to complete knowledge of the options available. I do know that there are pre-device fence options (that is, IPMI has a set of options that differs from DRAC, etc). So perhaps there is an option there. I am very curious to know how this scenario can happen. As I had previously understood it, this should simply not be possible. Obviously it is though... The only thing I can think of is where a fence device is external to the nodes and allows for multiple fence calls at the same time. I would expect that and fence device should terminate a node nearly instantly. If it doesn't or can't, then I would suggest that it not accept a second fence request until after the pending one completes. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Thu Nov 11 16:48:50 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 11:48:50 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDBB194.2020601@bobich.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <4CDBB194.2020601@bobich.net> Message-ID: <4CDC1E72.9020703@alteeve.com> On 10-11-11 04:04 AM, Gordan Bobic wrote: > Digimer wrote: >> On 10-11-10 11:09 AM, Gordan Bobic wrote: >>> Digimer wrote: >>>> On 10-11-10 07:17 AM, Gordan Bobic wrote: >>>>>>> If you want the FS mounted on all nodes at the same time then all >>>>>>> those nodes must be a part of the cluster, and they have to be >>>>>>> quorate (majority of nodes have to be up). You don't need a quorum >>>>>>> block device, but it can be useful when you have only 2 nodes. >>>>>> At term, I will have 7 to 10 nodes, but 2 at first for initial setup >>>>>> and testing. Ok, so if I have a 3 nodes cluster for exemple, I >>>>>> need at >>>>>> least 2 nodes for the cluster, and thus the gfs, to be up ? I cannot >>>>>> have a running gfs with only one node ? >>>>> In a 2-node cluster, you can have running GFS with just one node >>>>> up. But >>>>> in that case it is advisble to have a quorum block device on the SAN. >>>>> With a 3 node cluster, you cannot have quorum with just 1 node, and >>>>> thus >>>>> you cannot have GFS running. It will block until quorum is >>>>> re-established. >>>> With a quorum disk, you can in fact have one node left and still have >>>> quorum. This is because the quorum drive should have (node-1) votes, >>>> thus always giving the last node 50%+1 even with all other nodes being >>>> dead. >>> I've never tried testing that use-case extensively, but I suspect that >>> it is only safe to do with SAN-side fencing. Otherwise two nodes could >>> lose contact with each other and still both have access to the SAN and >>> thus both be individually quorate. >>> >>> Gordan >> >> Clustered storage *requires* fencing. To not use fencing is like driving >> tired; It's just a matter of time before something bad happens. That >> said, I should have been more clear in specifying the requirement for >> fencing. >> >> Now that said, the fencing shouldn't be needed at the SAN side, though >> that works fine as well. > > The default fencing action, last time I checked, is reboot. 
Consider the > use case where you have a network failure and separate networks for > various things, and you lose connectivity between the nodes but they > both still have access to the SAN. One node gets fenced, reboots, comes > up and connects to the SAN. It connects to the quorum device and has > quorum without the other nodes, and mounts the file systems and starts > writing - while all the other nodes that have become partitioned off do > the same thing. Unless you can fence the nodes from the SAN side, quorum > device having a 50% weight is a recipe for disaster. Agreed, and that is one of the major benefits of qdisk. It prevents a 50/50 split. Regardless though, say you have an eight node cluster and it partitions evenly with no qdisk to tie break. In that case, neither partition has >50% of the votes, so neither should have quorum. In turn, neither should touch the SAN. This is because DLM is required for clustered file systems, and DLM in turn requires quorum. Without quorum, DLM won't run and you will not be able to touch the SAN. :) >> The way it works is: > [...] > > I'm well aware of how fencing works, but you overlooked one major > failure mode that is essentially guaranteed to hose your data if you set > up the quorum device to have 50% of the votes. See above. 50% is not quorum. >> With SAN-side fencing, a fence is in the form of a logic disconnection >> from the storage network. This has no inherent mechanism for recovery, >> so the sysadmin will have to manually recover the node(s). For this >> reason, I do not prefer it. > > Then don't use a quorum device with more than an equal weight to the > individual nodes. > > Gordan How does the number of nodes relate, in this case, to the SAN-side fence recovery? -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From gordan at bobich.net Thu Nov 11 17:59:57 2010 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 11 Nov 2010 17:59:57 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDC1E72.9020703@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <4CDBB194.2020601@bobich.net> <4CDC1E72.9020703@alteeve.com> Message-ID: <4CDC2F1D.2070500@bobich.net> On 11/11/2010 04:48 PM, Digimer wrote: >>> Clustered storage *requires* fencing. To not use fencing is like driving >>> tired; It's just a matter of time before something bad happens. That >>> said, I should have been more clear in specifying the requirement for >>> fencing. >>> >>> Now that said, the fencing shouldn't be needed at the SAN side, though >>> that works fine as well. >> >> The default fencing action, last time I checked, is reboot. Consider the >> use case where you have a network failure and separate networks for >> various things, and you lose connectivity between the nodes but they >> both still have access to the SAN. One node gets fenced, reboots, comes >> up and connects to the SAN. It connects to the quorum device and has >> quorum without the other nodes, and mounts the file systems and starts >> writing - while all the other nodes that have become partitioned off do >> the same thing. Unless you can fence the nodes from the SAN side, quorum >> device having a 50% weight is a recipe for disaster. > > Agreed, and that is one of the major benefits of qdisk. 
It prevents a > 50/50 split. Regardless though, say you have an eight node cluster and > it partitions evenly with no qdisk to tie break. In that case, neither > partition has>50% of the votes, so neither should have quorum. In turn, > neither should touch the SAN. Exactly - qdisk is a tie-breaker. The point I was responding to was the one where somebody suggested giving qdisk a 50% vote weight (i.e. needs only qdisk + 1 node for quorum), which is IMO not a sane way to do it. >> I'm well aware of how fencing works, but you overlooked one major >> failure mode that is essentially guaranteed to hose your data if you set >> up the quorum device to have 50% of the votes. > > See above. 50% is not quorum. No, but 50% + 1 node is quorum, and I'm saying that having qdisk (50%) + 1 node = quorum is not the way to go. >>> With SAN-side fencing, a fence is in the form of a logic disconnection >>> from the storage network. This has no inherent mechanism for recovery, >>> so the sysadmin will have to manually recover the node(s). For this >>> reason, I do not prefer it. >> >> Then don't use a quorum device with more than an equal weight to the >> individual nodes. > > How does the number of nodes relate, in this case, to the SAN-side fence > recovery? It doesn't directly. I'm saying that the only way that giving qdisk 50% of the vote toward quorum is if your fencing is done by the SAN itself. Otherwise any 1 node that comes up has quorum, regardless of how many other are down, which in turn leads to multiple nodes being individually quorate when the connect to the SAN. This situation will trash the shared file system. Gordan From dxh at yahoo.com Thu Nov 11 19:15:03 2010 From: dxh at yahoo.com (Don Hoover) Date: Thu, 11 Nov 2010 11:15:03 -0800 (PST) Subject: [Linux-cluster] What keeps more than one node from grabbing qdisk? Message-ID: <347609.50037.qm@web120711.mail.ne1.yahoo.com> I have seen multiple boxes think they have ownership of the qdisk, is there something that prevents this other than they fence (reboot) each other? And what keeps them from getting into a fence war? From Chris.Jankowski at hp.com Fri Nov 12 01:58:01 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 12 Nov 2010 01:58:01 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDC1D5E.6030905@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDBB621.2090809@bobich.net> <4CDC1D5E.6030905@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3572@GVW1113EXC.americas.hpqcorp.net> Digimer, >>>I am very curious to know how this scenario can happen. As I had previously understood it, this should simply not be possible. Obviously it is though... It actually is very simple. For the mutual simultaneous killing to be guaranteed to happen three conditions are sufficient: 1. The fencing request is generated by the two nodes at the same time. Fulfilled by current design of the fencing. 2. Your fencing device needs to be a separate piece of equipment dedicated to the node to be fenced. Note that iLO or DRAC fulfill the requirement. 3. The implementation of the fencing device needs to be transactional i.e. - accept an order to fence, then execute it after a certain delay. 
Both iLO and DRAC work transactionally and there is sufficient delay.

What happens is simple. Think about it as transactions. Both nodes start at the same time transacting with the corresponding fencing devices. Each fencing device accepts the transaction. Only then, after a small delay, they start executing it. Both fencing devices are at this point committed to the execution and will do what they have been told.

The set of conditions is sufficient in the mathematical sense. In modern networked servers with built-in service processors this set of conditions is almost certainly true for all of them.

The following are possible ways of resolving the problem for this set of sufficient conditions:

1. Invalidate condition 1 - introduce different fixed delays in the fencing agents for each node, e.g. node A - no delay, node B - 2 seconds. This is a good solution, but requires custom programming work. The current cluster design does not allow it as a configuration option.

2. Invalidate condition 2 - a common physical fencing device that will accept only one request from one node. Essentially this serialises the transactions and allows at most one. This is not a clean way to do it, as such a device would be a SPOF.

3. Invalidate condition 3 - make the execution phase conditional on the state of the requestor: in the execution phase, execute the request only if the requestor is still alive. This shrinks, but does not eliminate, the window in which the race condition leads to both nodes going down.

However, I believe that the real solution is to change the mindset of the cluster from "I am the omniscient and omnipotent master of the world and I will shoot anything I do not like" to protecting resources, i.e. protecting shared storage through SCSI reservations, which is what commercial Linux and UNIX clusters do. Alas, the STONITH concept is so ingrained in the minds of the developers of the Linux cluster that this change seems to be impossible to achieve.

--------

Please note that the STONITH concept has other fatal flaws in the modern networked world. Consider, step by step, what would happen to your highly available cluster if a node in the cluster gets completely separated from the network, including its access, its heartbeat and iLO/DRAC network connections. Again, the end result is that you have no access to your supposedly highly available application. From the functional point of view the whole cluster has failed. The core issue, again, is the inadequacy of the STONITH concept. And again, commercial UNIX and Linux clusters deal with this scenario correctly. Their clusters will continue.

Regards,

Chris Jankowski

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Digimer
Sent: Friday, 12 November 2010 03:44
To: linux clustering
Subject: Re: [Linux-cluster] Starter Cluster / GFS

On 10-11-11 04:23 AM, Gordan Bobic wrote:
> Jankowski, Chris wrote:
>> Digimer,
>>
>> 1.
>> Digimer wrote:
>>>>> Both partitions will try to fence the other, but the slower will
>>>>> lose and get fenced before it can fence.
>>
>> Well, this is certainly not my experience in dealing with modern rack
>> mounted or blade servers where you use iLO (on HP) or DRAC (on Dell).
>>
>> What actually happens in two node clusters is that both servers issue
>> the fence request to the iLO or DRAC. It gets processed and *both*
>> servers get powered off. Ouch!! Your 100% HA cluster becomes 100%
>> dead cluster.
> > Indeed, I've seen this, too, on a range of hardware. My quick and > dirty solution was to doctor the fencing agent to add a different > sleep() on each node, in order of survivor preference. There may be a > setting in cluster.conf that can be used to achieve the same effect, > can't remember off the top of my head. > > Gordan I've not seen such an option, though I make no claims to complete knowledge of the options available. I do know that there are pre-device fence options (that is, IPMI has a set of options that differs from DRAC, etc). So perhaps there is an option there. I am very curious to know how this scenario can happen. As I had previously understood it, this should simply not be possible. Obviously it is though... The only thing I can think of is where a fence device is external to the nodes and allows for multiple fence calls at the same time. I would expect that and fence device should terminate a node nearly instantly. If it doesn't or can't, then I would suggest that it not accept a second fence request until after the pending one completes. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Chris.Jankowski at hp.com Fri Nov 12 02:22:16 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 12 Nov 2010 02:22:16 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDC1C0E.6010804@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> Digimer, >>>>I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. Separate hearbeat networks (not a single network with a bonded interface) is what my customers require. I believe this is not available in standard Linux Cluster, as distributed by RedHat. This is completely independent from what fencing device or method is used. >>>>With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. 1. In the world where I work separate power-based devices are not an option. Blade servers do not even have power supplies. They use common power from the blade enclosure. The only access to the power state is through service processor. 2. We are not talking about long delays here. The whole cycle of taking the power off a blade including login to the service processor is less than 1 ms. Delay or lack thereof is not a problem. The transactional nature of the processing is the issue. 
Regards, Chris Jankowski -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Friday, 12 November 2010 03:39 To: linux clustering Cc: Jankowski, Chris Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-11 04:59 AM, Jankowski, Chris wrote: > Gordan, > > I do understand the mechanism. I was trying to gently point out that this behaviour is unacceptable for my commercial IP customers. The customers buy clusters for high availability. Loosing the whole cluster due to single component failure - hearbeat link is not acceptable. The heartbeat link is a huge SPOF. And the cluster design does not support redundant links for heartbeat. > > Also, none of the commercially available UNIX clusters or Linux clusters (HP ServiceGuard, Veritas, SteelEye) would display this type of behaviour and they do not clobber cluster filesystems. So, it is possible to achieve acceptable reaction to this type of failure. > > Regards, > > Chris Jankowski I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Fri Nov 12 02:41:30 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 21:41:30 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDCA95A.2070104@alteeve.com> On 10-11-11 09:22 PM, Jankowski, Chris wrote: > Digimer, > >>>>> I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. > > Separate hearbeat networks (not a single network with a bonded interface) is what my customers require. I believe this is not available in standard Linux Cluster, as distributed by RedHat. This is completely independent from what fencing device or method is used. It is possible. ie: In the above case, should 'an-node02' need to be fenced, the first method 'ipmi' would be used. Should it fail, the next method 'node_assassin' would be tried. >>>>> With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. > > 1. > In the world where I work separate power-based devices are not an option. Blade servers do not even have power supplies. They use common power from the blade enclosure. 
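A minimal sketch of the sort of cluster.conf block being described here - two fence methods for one node, tried in the order they appear. The device names, addresses, credentials and the fence_na parameters are illustrative assumptions rather than Digimer's actual configuration:

    <clusternode name="an-node02.alteeve.com" nodeid="2">
        <fence>
            <!-- Tried first. -->
            <method name="ipmi">
                <device name="ipmi_an02"/>
            </method>
            <!-- Only used if the IPMI call fails. -->
            <method name="node_assassin">
                <device name="batou" port="02"/>
            </method>
        </fence>
    </clusternode>

    <fencedevices>
        <fencedevice name="ipmi_an02" agent="fence_ipmilan"
                     ipaddr="192.168.3.2" login="admin" passwd="secret"/>
        <!-- The Node Assassin agent name and parameters are assumed here. -->
        <fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com"/>
    </fencedevices>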
The only access to the power state is through service processor. Out of curiosity, do the blades have header pins for the power and reset switches? I don't see why they would, but I've not played with traditional blades before. > 2. > We are not talking about long delays here. The whole cycle of taking the power off a blade including login to the service processor is less than 1 ms. Delay or lack thereof is not a problem. The transactional nature of the processing is the issue. > > Regards, > > Chris Jankowski Let me talk to the Red Hat folks and see what they think about configurable per-node user-defined fence delays. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Fri Nov 12 02:47:10 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 21:47:10 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDCAAAE.6020002@alteeve.com> On 10-11-11 09:22 PM, Jankowski, Chris wrote: > 2. > We are not talking about long delays here. The whole cycle of taking the power off a blade including login to the service processor is less than 1 ms. Delay or lack thereof is not a problem. The transactional nature of the processing is the issue. > > Regards, > > Chris Jankowski I forgot to mention; Fence calls can only be sent by nodes with quorum. So a race condition should, as I understand it, be a concern with 2-node clusters only. I'm not entirely sure though on how quorum is determined at the time of partitioning. That is, say you have a three node cluster, and one node disconnects. I need to verify that it checks to see if it has quorum before sending a fence call. I expect that is the case though. 
-- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From Chris.Jankowski at hp.com Fri Nov 12 03:25:41 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 12 Nov 2010 03:25:41 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDCA95A.2070104@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> <4CDCA95A.2070104@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> Digimer, I think you do not make distinction between the network that maintains the hearbeat and the networks to use for fence devices. I'll explain this again. These are two very different things and operated for different purpose. The hearbeat network is between the nodes for the purpose of maintaining cluster membership. The connections from the nodes to your fence devices form the other two networks. In fact speaking of networks in this case is a little limiting. Each of the IP addresses involved may, in principle, be in different IP subnet In the example that you gave, you have two (possibly different) networks for fence devices, as you have two fence devices. However, your cluster membership is maintained through the single hearbeat network implicitly defined through the names of the cluster nodes. I want to have two, independently configurable network like this and heartbeat being sent through both of them. I cannot do this at the moment, as the software will always maintain the hertbeat through the single IP address to which the node name resolves. In your case the heartbeat traffic will always go between an-node01.alteeve.com and an-node02.alteeve.com. What I want is to have hertbeat traffic going between: an-node01h1.alteeve.com and an-node02h1.alteeve.com and between an-node01h2.alteeve.com and an-node02h2.alteeve.com Whereas my application would access the cluster through: an-node01.alteeve.com and an-node02.alteeve.com So I would need minimum of 3 Ethernet interfaces per server and minimum of 6 if all links will be bonded, but this is OK. Regards, Chris Jankowski -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Friday, 12 November 2010 13:42 To: Jankowski, Chris Cc: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-11 09:22 PM, Jankowski, Chris wrote: > Digimer, > >>>>> I can't speak to heartbeat, but under RHCS you can have multiple fence methods and devices, and they will used in the order that they are found in the configuration file. > > Separate hearbeat networks (not a single network with a bonded interface) is what my customers require. I believe this is not available in standard Linux Cluster, as distributed by RedHat. This is completely independent from what fencing device or method is used. It is possible. ie: In the above case, should 'an-node02' need to be fenced, the first method 'ipmi' would be used. Should it fail, the next method 'node_assassin' would be tried. 
>>>>> With the power-based devices I've used (again, just IPMI and NA), the poweroff call is more or less instant. I've not seen, personally, a lag exceeding a second with these devices. I would consider a fence device that does not disable a node in <1 second to be flawed. > > 1. > In the world where I work separate power-based devices are not an option. Blade servers do not even have power supplies. They use common power from the blade enclosure. The only access to the power state is through service processor. Out of curiosity, do the blades have header pins for the power and reset switches? I don't see why they would, but I've not played with traditional blades before. > 2. > We are not talking about long delays here. The whole cycle of taking the power off a blade including login to the service processor is less than 1 ms. Delay or lack thereof is not a problem. The transactional nature of the processing is the issue. > > Regards, > > Chris Jankowski Let me talk to the Red Hat folks and see what they think about configurable per-node user-defined fence delays. -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Fri Nov 12 03:43:33 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 22:43:33 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> <4CDCA95A.2070104@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDCB7E5.40204@alteeve.com> On 10-11-11 10:25 PM, Jankowski, Chris wrote: > Digimer, > > I think you do not make distinction between the network that maintains the hearbeat and the networks to use for fence devices. I'll explain this again. Perhaps. In my clusters, I use at least three interfaces on three separate subnets... I put IPMI on one and NA on the second. > These are two very different things and operated for different purpose. > > The hearbeat network is between the nodes for the purpose of maintaining cluster membership. > The connections from the nodes to your fence devices form the other two networks. > > In fact speaking of networks in this case is a little limiting. Each of the IP addresses involved may, in principle, be in different IP subnet > > In the example that you gave, you have two (possibly different) networks for fence devices, as you have two fence devices. You can use the element to define a second totem ring (redundant ring protocol) to act as a backup, on a second subnet, for backup cluster communication. > However, your cluster membership is maintained through the single hearbeat network implicitly defined through the names of the cluster nodes. I want to have two, independently configurable network like this and heartbeat being sent through both of them. 
I cannot do this at the moment, as the software will always maintain the hertbeat through the single IP address to which the node name resolves. In your case the heartbeat traffic will always go between an-node01.alteeve.com and an-node02.alteeve.com. > > What I want is to have hertbeat traffic going between: > an-node01h1.alteeve.com and an-node02h1.alteeve.com > and between > an-node01h2.alteeve.com and an-node02h2.alteeve.com > Whereas my application would access the cluster through: > an-node01.alteeve.com and an-node02.alteeve.com > > So I would need minimum of 3 Ethernet interfaces per server and minimum of 6 if all links will be bonded, but this is OK. Exactly what I do, though RRP is currently limited to two interfaces due to inherent complexities preventing going beyond that. There is work on a newly-announced project that will allow for n-number of paths, but that's alpha stage at this point. > Regards, > > Chris Jankowski Cheers -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From Chris.Jankowski at hp.com Fri Nov 12 04:10:08 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 12 Nov 2010 04:10:08 +0000 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <4CDCB7E5.40204@alteeve.com> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> <4CDCA95A.2070104@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> <4CDCB7E5.40204@alteeve.com> Message-ID: <036B68E61A28CA49AC2767596576CD596F584F3656@GVW1113EXC.americas.hpqcorp.net> Digimer, >>>You can use the element to define a second totem ring (redundant ring protocol) to act as a backup, on a second subnet, for backup cluster communication. Thank you. I was not aware of this option . Is this documented anywhere, so I can read it? Regards, Chris Jankowski -----Original Message----- From: Digimer [mailto:linux at alteeve.com] Sent: Friday, 12 November 2010 14:44 To: Jankowski, Chris Cc: linux clustering Subject: Re: [Linux-cluster] Starter Cluster / GFS On 10-11-11 10:25 PM, Jankowski, Chris wrote: > Digimer, > > I think you do not make distinction between the network that maintains the hearbeat and the networks to use for fence devices. I'll explain this again. Perhaps. In my clusters, I use at least three interfaces on three separate subnets... I put IPMI on one and NA on the second. > These are two very different things and operated for different purpose. > > The hearbeat network is between the nodes for the purpose of maintaining cluster membership. > The connections from the nodes to your fence devices form the other two networks. > > In fact speaking of networks in this case is a little limiting. Each > of the IP addresses involved may, in principle, be in different IP > subnet > > In the example that you gave, you have two (possibly different) networks for fence devices, as you have two fence devices. 
You can use the element to define a second totem ring (redundant ring protocol) to act as a backup, on a second subnet, for backup cluster communication. > However, your cluster membership is maintained through the single hearbeat network implicitly defined through the names of the cluster nodes. I want to have two, independently configurable network like this and heartbeat being sent through both of them. I cannot do this at the moment, as the software will always maintain the hertbeat through the single IP address to which the node name resolves. In your case the heartbeat traffic will always go between an-node01.alteeve.com and an-node02.alteeve.com. > > What I want is to have hertbeat traffic going between: > an-node01h1.alteeve.com and an-node02h1.alteeve.com and between > an-node01h2.alteeve.com and an-node02h2.alteeve.com Whereas my > application would access the cluster through: > an-node01.alteeve.com and an-node02.alteeve.com > > So I would need minimum of 3 Ethernet interfaces per server and minimum of 6 if all links will be bonded, but this is OK. Exactly what I do, though RRP is currently limited to two interfaces due to inherent complexities preventing going beyond that. There is work on a newly-announced project that will allow for n-number of paths, but that's alpha stage at this point. > Regards, > > Chris Jankowski Cheers -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From linux at alteeve.com Fri Nov 12 04:56:14 2010 From: linux at alteeve.com (Digimer) Date: Thu, 11 Nov 2010 23:56:14 -0500 Subject: [Linux-cluster] Starter Cluster / GFS In-Reply-To: <036B68E61A28CA49AC2767596576CD596F584F3656@GVW1113EXC.americas.hpqcorp.net> References: <4EC8CB8D678E4E4BA366D0215D33D19E@Aspire> <4CDA5421.9090006@bobich.net> <1F3ED3F7706C4BFE8CF683ACE86270C3@Aspire> <4CDA8D4A.6010507@bobich.net> <4CDAC2BE.4010009@alteeve.com> <4CDAC3D2.9050703@bobich.net> <4CDACB37.3070704@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584836A4@GVW1113EXC.americas.hpqcorp.net> <4CDB713A.8080303@alteeve.com> <4CDBB70D.6080204@bobich.net> <036B68E61A28CA49AC2767596576CD596F584F3413@GVW1113EXC.americas.hpqcorp.net> <4CDC1C0E.6010804@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F359F@GVW1113EXC.americas.hpqcorp.net> <4CDCA95A.2070104@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F3611@GVW1113EXC.americas.hpqcorp.net> <4CDCB7E5.40204@alteeve.com> <036B68E61A28CA49AC2767596576CD596F584F3656@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4CDCC8EE.6090500@alteeve.com> On 10-11-11 11:10 PM, Jankowski, Chris wrote: > Digimer, > >>>> You can use the element to define a second totem ring (redundant ring protocol) to act as a backup, on a second subnet, for backup cluster communication. > > Thank you. I was not aware of this option . Is this documented anywhere, so I can read it? > > Regards, > > Chris Jankowski Not officially, but I've been working on documenting all of the options. I make no claim to accuracy, so please read with that in mind. Of course, corrections and feedback are appreciated. :) http://wiki.alteeve.com/index.php/Cluster.conf#Element.3B_altname -- Digimer E-Mail: digimer at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From Chris.Jankowski at hp.com Mon Nov 15 03:30:13 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Mon, 15 Nov 2010 03:30:13 +0000 Subject: [Linux-cluster] XFS as a servicein RHEL 6 Linux Cluster. 
Message-ID: <036B68E61A28CA49AC2767596576CD596F58F8863E@GVW1113EXC.americas.hpqcorp.net> Hi, RHEL 6 now officially supports XFS, as an additional subscription option, I believe. Does the RHEL 6 Linux Cluster provide the necessary module to configure an XFS filesystem as a failover service? Thanks and regards, Chris Jankowski -------------- next part -------------- An HTML attachment was scrubbed... URL: From noreply at boxbe.com Mon Nov 15 09:18:01 2010 From: noreply at boxbe.com (noreply at boxbe.com) Date: Mon, 15 Nov 2010 01:18:01 -0800 (PST) Subject: [Linux-cluster] Starter Cluster / GFS (Action Required) Message-ID: <688108002.22833.1289812681752.JavaMail.prod@app004.boxbe.com> Hello linux clustering, You will not receive any more courtesy notices from our members for two days. Messages you have sent will remain in a lower priority mailbox for our member to review at their leisure. Future messages will be more likely to be viewed if you are on our member's priority Guest List. Thank you, vishalspatil at gmail.com About this Notice Boxbe prioritizes and screens email using a personal Guest List and your extended social network. It's free, it removes clutter, and it helps you focus on the people who matter to you. Visit http://www.boxbe.com/how-it-works?tc=5902846179_739730772 End Email Overload -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded message was scrubbed... From: "Jankowski, Chris" Subject: Re: [Linux-cluster] Starter Cluster / GFS Date: Fri, 12 Nov 2010 04:10:08 +0000 Size: 4854 URL: From radu.rendec at mindbit.ro Mon Nov 15 12:34:22 2010 From: radu.rendec at mindbit.ro (Radu Rendec) Date: Mon, 15 Nov 2010 14:34:22 +0200 Subject: [Linux-cluster] rgmanager blocked Message-ID: <1289824462.3353.70.camel@localhost> Hello, I'm trying to migrate an older Centos 5 / rhcs2 cluster to the newer rhcs3. Being eager to play around, I decided to make my tests on Fedora 14, before Centos 6 is out. Although everything seemed to work fine at the beginning, after a few hours of cluster uptime I came across a strange situation of rgmanager being apparently blocked. The process is still there, but: 1. It no longer produces any output - it's run in a "screen" session, with params "-fd". Normally it's very verbose (I can see a lot of debug messages, including output from agent scripts). It's been more than a week since it blocked, and it hadn't output a sigle line of debug. 2. Resources from node 1 were (automatically) relocated to node 2 when node 1 blocked, but node 2 blocked in a similar manner a few hours later. 3. Now resources are still active on node 2, on both nodes a "clustat" looks like this: Service states unavailable: Temporary failure; try again Cluster Status for ****** @ Mon Nov 15 14:14:22 2010 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ storage1.****** 1 Online, Local storage2.****** 2 Online I've already tried several simple things like: * looking at the process tree for some hung resource agents - no luck; it's just clurgmgrd and its child threads; * looking at the open files of clurgmgrd in /proc/NNN/fd - nothing unusual * tracing (with strace) the main clurgmgrd thread and the children. At this point I'm totally clueless, so any suggestion would be welcome. I can provide further info / logs about the running system / processes. Thanks, Radu Rendec From fdinitto at redhat.com Mon Nov 15 14:48:30 2010 From: fdinitto at redhat.com (Fabio M. 
From Colin.Simpson at iongeo.com Mon Nov 15 18:57:04 2010 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Mon, 15 Nov 2010 18:57:04 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CDBD692.4020108@srce.hr> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> <4CDBB533.3030805@bobich.net> <4CDBD692.4020108@srce.hr> Message-ID: <1289847424.16298.18.camel@cowie> Out of interest (for my own setup), does anyone know if there are any massive negatives to keeping the service config files on a GFS2 volume? It just seems like a nice lazy approach to distributing them to me, especially as on GFS2 you have shared storage anyway. My one thought was that cleanly shutting a service down (or, more likely, checking whether it's down on a node) might require the config file at a point when the GFS2 volume has not been mounted. Thanks Colin On Thu, 2010-11-11 at 12:42 +0100, Jakov Sosic wrote: > >>> I do not want to keep configurations of EACH service on a shared disk, > >>> for some services I want to have configurations on each node available. > >>> > >>> This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.

From jakov.sosic at srce.hr Mon Nov 15 21:29:05 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Mon, 15 Nov 2010 22:29:05 +0100 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <1289847424.16298.18.camel@cowie> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> <4CDBB533.3030805@bobich.net> <4CDBD692.4020108@srce.hr> <1289847424.16298.18.camel@cowie> Message-ID: <4CE1A621.6090508@srce.hr> On 11/15/2010 07:57 PM, Colin Simpson wrote: > Out of interest (for my own setup), does anyone know if there are any > massive negatives to keeping the service config files on a GFS2 volume? > It just seems like a nice lazy approach to distributing them to me, > especially as on GFS2 you have shared storage anyway. > > My one thought was that cleanly shutting a service down (or, more likely, > checking whether it's down on a node) might require the config file at a > point when the GFS2 volume has not been mounted. I fail to see any negatives in that kind of setup. I don't use GFS2 too often, though, so I try to solve this problem differently. @Gordan, I've tried csync2 and it's a great tool. It works exactly the way I wanted! Thank you very much, I'm prepping it for a big push into production environments :) -- Jakov Sosic
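Since csync2 is the alternative Jakov settled on, a brief sketch of the idea: csync2 keeps chosen configuration directories identical across the nodes over its own key-authenticated channel. The group name, host names, and paths below are invented for illustration:

    # /etc/csync2.cfg (the same file on both nodes)
    group cluster_configs {
        host node1.example.com;
        host node2.example.com;
        key  /etc/csync2.key_cluster;   # shared key generated with "csync2 -k"
        include /etc/httpd/conf;
        include /etc/postfix;
        exclude *~;
    }

Running "csync2 -xv" (by hand or from cron) then pushes local changes to the other node.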
From gordan at bobich.net Mon Nov 15 21:46:52 2010 From: gordan at bobich.net (Gordan Bobic) Date: Mon, 15 Nov 2010 21:46:52 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CE1A621.6090508@srce.hr> References: <4CCBFAD3.9010305@srce.hr> <4CDB2FB4.8000309@srce.hr> <4CDBB533.3030805@bobich.net> <4CDBD692.4020108@srce.hr> <1289847424.16298.18.camel@cowie> <4CE1A621.6090508@srce.hr> Message-ID: <4CE1AA4C.2050609@bobich.net> On 11/15/2010 09:29 PM, Jakov Sosic wrote: > @Gordan, I've tried csync2 and it's a great tool. It works exactly the > way I wanted! Thank you very much, I'm prepping it for a big push into > production environments :) Glad I could help. :) Gordan

From ag8817282 at gideon.org Wed Nov 17 20:22:53 2010 From: ag8817282 at gideon.org (Andrew Gideon) Date: Wed, 17 Nov 2010 15:22:53 -0500 Subject: [Linux-cluster] GFS+DRBD+Quorum: Help wrap my brain around this Message-ID: <1290025373.7401.1158.camel@carrot> I'm trying to figure out the best solution for GFS+DRBD. My mental block isn't really with GFS, though, but with clustered LVM (I think). I understand the quorum problem with a two-node cluster. And I understand that DRBD is not suitable for use as a quorum disk (presumably because it too would suffer from any partitioning, unlike a physical array connected directly to both nodes). Am I right so far? What I'd really like to do is have a three (or more) node cluster with two nodes having access to the DRBD storage. This solves the quorum problem (effectively having the third node as a quorum server). But when I try to create a volume on a volume group on a device shared by two nodes of a three node cluster, I get an error indicating that the volume group cannot be found on the third node. Which is true: the shared volume isn't available on that node. In the Cluster Logical Volume Manager document, I found: By default, logical volumes created with CLVM on shared storage are visible to all computers that have access to the shared storage. What I've not figured out is how to tell CLVMD (or whomever) that only nodes one and two have access to the shared storage. Is there a way to do this? I've also read, in the GFS2 Overview document: When you configure a GFS2 file system as a cluster file system, you must ensure that all nodes in the cluster have access to the shared storage This suggests that a cluster running GFS must have access to the storage on all nodes. Which would clearly block my idea for a three node cluster with only two nodes having access to the shared storage. I do have one idea, but it sounds like a more complex version of a Rube Goldberg device: A two node cluster with a third machine providing access to a device via iSCSI. The LUN exported from that third system could be used as the quorum disk by the two cluster nodes (effectively making that little iSCSI target the quorum server). This assumes that a failure of the quorum disk in an otherwise healthy two node cluster is survived. I've yet to confirm this. This seems ridiculously complex, so much so that I cannot imagine that there's not a better solution. But I just cannot get my brain wrapped around this well enough to see it. Any suggestions would be very welcome. Thanks...
Andrew

From ag8817282 at gideon.org Wed Nov 17 20:36:19 2010 From: ag8817282 at gideon.org (Andrew Gideon) Date: Wed, 17 Nov 2010 15:36:19 -0500 Subject: [Linux-cluster] A fencing mechanism for Xen (or KVM) guests Message-ID: <1290026179.7401.1167.camel@carrot> I found myself unhappy with what I located for fencing of Xen guests, so I put together a new mechanism. Would this be of interest to anyone else? The node on which fence_node is called uses SSH to connect to the list of hypervisors. The connection is key based, which limits the nodes to execution of the specific fencing command and also lets a given node fence only a guest that's in a specific list. This prevents a node of one cluster from fencing a node of another even if they reside on the same set of hypervisors. The fencing script issues the fence command (via SSH) to each hypervisor. Success of the command requires either (1) a guest of the specified name is found and destroyed on at least one hypervisor or (2) every hypervisor has been visited and reported that there is no such guest running. #2 was an interesting choice, BTW, on which I'd welcome feedback. The alternative would have been to presume that an unreachable hypervisor was down. That didn't seem like the best choice to me, but I'm curious what others might think. Thanks... Andrew

From Colin.Simpson at iongeo.com Wed Nov 17 21:02:35 2010 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Wed, 17 Nov 2010 21:02:35 +0000 Subject: [Linux-cluster] GFS+DRBD+Quorum: Help wrap my brain around this In-Reply-To: <1290025373.7401.1158.camel@carrot> References: <1290025373.7401.1158.camel@carrot> Message-ID: <1290027755.4270.33.camel@cowie> You are right so far in your first paragraph. You cannot totally solve the quorum problem with a two-node cluster. The basic issue you are really trying to address is avoiding a split-brain scenario; that is really all quorum is giving you. So with DRBD your best bet is to do your level best to avoid a split brain with your two nodes. Use decent fencing (maybe multiple fence methods), have redundant bonded network links and interlinks (I'm looking at splitting these over two physical cards on the nodes), set up DRBD's startup waiting appropriately, and be careful at startup (see the scenario below). Then just tell RHCS that you want to run with 2 nodes in cluster.conf (the example snippet was scrubbed from the archive; see the sketch after this message). And in drbd.conf I have:

    startup {
        wfc-timeout 300;       # Wait 300 for initial connection
        degr-wfc-timeout 60;   # Wait only 60 seconds if this node was a degraded cluster
        become-primary-on both;
    }

Many may prefer the system to wait indefinitely in DRBD on some of these conditions (to manually bring stuff up in a bad situation). So basically here I will wait 5 minutes for the other node to join my DRBD before doing any cluster stuff, but wait less (60s) if I was degraded already (I'm assuming my other node is probably broken for an extended period in that case, so I want the surviving server up pretty quickly). I'm still thinking this through just now. On a two-node, non-shared-storage setup you can never fully guard against the scenario of node A being shut down, then node B being shut down later, and then node A being brought up with no way of knowing that it has older data than B, if B is still down. You can mitigate this, though, by ensuring that you set up DRBD to wait long enough (or forever) on boot, and/or by being careful to start things up in the right order after long periods of downtime on one node (the good node needs to be up already).
Just needs a bit of scenario thought. Three nodes just adds needless complexity from what you are saying. That's my thoughts on this, I'm pretty new to this too. Just how I'm thinking this should work just now. Colin On Wed, 2010-11-17 at 15:22 -0500, Andrew Gideon wrote: > I'm trying to figure out the best solution for GFS+DRBD. My mental > block isn't really with GFS, though, but with clustered LVM (I think). > > I understand the quorum problem with a two-node cluster. And I > understand that DRBD is not suitable for use as a quorum disk > (presumably because it too would suffer from any partitioning, unlike a > physical array connected directly to both nodes). > > Am I right so far? > > What I'd really like to do is have a three (or more) node cluster with > two nodes having access to the DRBD storage. This solves the quorum > problem (effectively having the third node as a quorum server). > > But when I try to create a volume on a volume group on a device shared > by two nodes of a three node cluster, I get an error indicating that the > volume group cannot be found on the third node. Which is true: the > shared volume isn't available on that node. > > In the Cluster Logical Volume Manager document, I found: > > By default, logical volumes created with CLVM on shared storage > are visible to all computers that have access to the shared > storage. > > What I've not figured out is how to tell CLVMD (or whomever) that only > nodes one and two have access to the shared storage. Is there a way to > do this? > > I've also read, in the GFS2 Overview document: > > When you configure a GFS2 file system as a cluster file system, > you must ensure that all nodes in the cluster have access to the > shared storage > > This suggests that a cluster running GFS must have access to the storage > on all nodes. Which would clearly block my idea for a three node > cluster with only two nodes having access to the shared storage. > > I do have one idea, but it sounds like a more complex version of a Rube > Goldberg device: A two node cluster with a third machine providing > access to a device via iSCSI. The LUN exported from that third system > could be used as the quorum disk by the two cluster nodes (effectively > making that little iSCSI target the quorum server). > > This assumes that a failure of the quorum disk in an otherwise healthy > two node cluster is survived. I've yet to confirm this. > > This seems ridiculously complex, so much so that I cannot imagine that > there's not a better solution. But I just cannot get my brain wrapped > around this well enough to see it. > > Any suggestions would be very welcome. > > Thanks... > > Andrew > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original. 
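A minimal sketch of the two-node settings described in the message above. The resource name is a placeholder, and the drbd.conf values simply repeat the numbers quoted above. In cluster.conf, a two-node cluster is normally declared with:

    <cman two_node="1" expected_votes="1"/>

and the matching drbd.conf startup section would look something like:

    resource r0 {                    # "r0" is just an example resource name
        startup {
            wfc-timeout      300;    # wait up to 300s for the peer on a normal start
            degr-wfc-timeout 60;     # wait only 60s if this node was last degraded
            become-primary-on both;  # promote both nodes at startup (dual-primary for GFS2)
        }
    }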
From Jost.Rakovec at snt.si Thu Nov 18 12:03:38 2010 From: Jost.Rakovec at snt.si (Rakovec Jost) Date: Thu, 18 Nov 2010 13:03:38 +0100 Subject: [Linux-cluster] A fencing mechanism for Xen (or KVM) guests In-Reply-To: <1290026179.7401.1167.camel@carrot> References: <1290026179.7401.1167.camel@carrot> Message-ID: <3754ED14F3EE0C459DEFE2DF184515FF0F101C7241@SIMAIL.snt-is.com> Hi, I would like to try it. Where can I get your software? thx br jost ________________________________________ From: linux-cluster-bounces at redhat.com [linux-cluster-bounces at redhat.com] On Behalf Of Andrew Gideon [ag8817282 at gideon.org] Sent: Wednesday, November 17, 2010 9:36 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] A fencing mechanism for Xen (or KVM) guests I found myself unhappy with what I located for fencing of Xen guests, so I put together a new mechanism. Would this be of interest to anyone else? The node on which fence_node is called uses SSH to connect to the list of hypervisors. The connection is key based, which limits the nodes to execution of the specific fencing command and also lets a given node fence only a guest that's in a specific list. This prevents a node of one cluster from fencing a node of another even if they reside on the same set of hypervisors. The fencing script issues the fence command (via SSH) to each hypervisor. Success of the command requires either (1) a guest of the specified name is found and destroyed on at least one hypervisor or (2) every hypervisor has been visited and reported that there is no such guest running. #2 was an interesting choice, BTW, on which I'd welcome feedback. The alternative would have been to presume that an unreachable hypervisor was down. That didn't seem like the best choice to me, but I'm curious what others might think. Thanks... Andrew -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster

From Colin.Simpson at iongeo.com Thu Nov 18 17:14:27 2010 From: Colin.Simpson at iongeo.com (Colin Simpson) Date: Thu, 18 Nov 2010 17:14:27 +0000 Subject: [Linux-cluster] Configurations of services? In-Reply-To: <4CE1AA4C.2050609@bobich.net> References: <4CE1AA4C.2050609@bobich.net> Message-ID: <1290100467.25543.3.camel@cowie> Sorry to invade your thread, but I have a query I'd like to post as a new thread; every time I do, it never seems to turn up - it just disappears into a black hole. I tried linux-cluster-owner at redhat.com, but have had no reply there either. Anyone know how I can get permission to start a new thread, or what I'm doing wrong? Thanks Colin On Mon, 2010-11-15 at 21:46 +0000, Gordan Bobic wrote: > On 11/15/2010 09:29 PM, Jakov Sosic wrote: > > > @Gordan, I've tried csync2 and it's a great tool. It works exactly > the > > way I wanted! Thank you very much, I'm prepping it for a big push into > > production environments :) > > Glad I could help. :) > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited.
If you received this email in error, please immediately notify the sender and delete the original.

From dan.candea at quah.ro Thu Nov 18 20:26:18 2010 From: dan.candea at quah.ro (Dan Candea) Date: Thu, 18 Nov 2010 22:26:18 +0200 Subject: [Linux-cluster] clusterfs.sh Message-ID: <4CE58BEA.3030007@quah.ro> Hello, I'm using cluster-3.0.17 and I'm mounting a shared gfs2 storage with force_unmount="1". When rgmanager crashes on one node, it crashes on all the other nodes as well, and a reboot is the only option, because the shared storage is not unmounted and every process using it freezes. Before figuring out why it crashes, I'm trying to understand why the storage is not unmounted. In the log file I receive:

    Not unmounting clusterfs:backupfs - still in use by 1 other service(s)

Does the above message mean that the processes using the fs are not killed, or that my services are not configured correctly? I have something like below
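The configuration the poster refers to ("something like below") did not survive the archive, so what follows is a purely hypothetical illustration - the device, mount point, and service names are invented - of the kind of layout that produces that log line. When a single clusterfs resource is referenced from more than one service, clusterfs.sh skips the forced unmount as long as any other service still references the same file system:

    <rm>
      <resources>
        <clusterfs name="backupfs" device="/dev/vg_shared/lv_backup"
                   mountpoint="/backup" fstype="gfs2" force_unmount="1"/>
      </resources>
      <service name="backup" autostart="1">
        <clusterfs ref="backupfs"/>
      </service>
      <service name="web" autostart="1">
        <clusterfs ref="backupfs"/>
      </service>
    </rm>

Stopping only the "backup" service leaves "web" holding a reference, which is what the "still in use by 1 other service(s)" message reports; processes on the file system are only killed once no other service needs the mount.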