From nik600 at gmail.com Sat Feb 1 18:35:25 2014 From: nik600 at gmail.com (nik600) Date: Sat, 1 Feb 2014 19:35:25 +0100 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine Message-ID: Dear all i need some clarification about clustering with rhel 6.4 i have a cluster with 2 node in active/passive configuration, i simply want to have a virtual ip and migrate it between 2 nodes. i've noticed that if i reboot or manually shut down a node the failover works correctly, but if i power-off one node the cluster doesn't failover on the other node. Another stange situation is that if power off all the nodes and then switch on only one the cluster doesn't start on the active node. I've read manual and documentation at https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html and i've understand that the problem is related to fencing, but the problem is that my 2 nodes are on 2 virtual machine , i can't control hardware and can't issue any custom command on the host-side. I've tried to use fence_xvm but i'm not sure about it because if my VM has powered-off, how can it reply to fence_vxm messags? Here my logs when i power off the VM: ==> /var/log/cluster/fenced.log <== Feb 01 18:50:22 fenced fencing node mynode02 Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm result: error from agent Feb 01 18:50:53 fenced fence mynode02 failed I've tried to force the manual fence with: fence_ack_manual mynode02 and in this case the failover works properly. The point is: as i'm not using any shared filesystem but i'm only sharing apache with a virtual ip, i won't have any split-brain scenario so i don't need fencing, or not? So, is there the possibility to have a simple "dummy" fencing? here is my config.xml: Thanks to all in advance. -- /*************/ nik600 http://www.kumbe.it -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Sat Feb 1 18:43:35 2014 From: lists at alteeve.ca (Digimer) Date: Sat, 01 Feb 2014 13:43:35 -0500 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: References: Message-ID: <52ED4057.4060309@alteeve.ca> On 01/02/14 01:35 PM, nik600 wrote: > Dear all > > i need some clarification about clustering with rhel 6.4 > > i have a cluster with 2 node in active/passive configuration, i simply > want to have a virtual ip and migrate it between 2 nodes. > > i've noticed that if i reboot or manually shut down a node the failover > works correctly, but if i power-off one node the cluster doesn't > failover on the other node. > > Another stange situation is that if power off all the nodes and then > switch on only one the cluster doesn't start on the active node. > > I've read manual and documentation at > > https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html > > and i've understand that the problem is related to fencing, but the > problem is that my 2 nodes are on 2 virtual machine , i can't control > hardware and can't issue any custom command on the host-side. > > I've tried to use fence_xvm but i'm not sure about it because if my VM > has powered-off, how can it reply to fence_vxm messags? 
> > Here my logs when i power off the VM: > > ==> /var/log/cluster/fenced.log <== > Feb 01 18:50:22 fenced fencing node mynode02 > Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm result: > error from agent > Feb 01 18:50:53 fenced fence mynode02 failed > > I've tried to force the manual fence with: > > fence_ack_manual mynode02 > > and in this case the failover works properly. > > The point is: as i'm not using any shared filesystem but i'm only > sharing apache with a virtual ip, i won't have any split-brain scenario > so i don't need fencing, or not? > > So, is there the possibility to have a simple "dummy" fencing? > > here is my config.xml: > > > > post_join_delay="0"/> > > > > > > name="mynode01"/> > > > > > > > name="mynode02"/> > > > > > > > > > > > ordered="0" restricted="0"> > priority="1"/> > priority="2"/> > > > > recovery="relocate"> > sleeptime="2"/> > server_root="/etc/httpd" shutdown_wait="0"/> > > > > > Thanks to all in advance. The fence_virtd/fence_xvm agent works by using multicast to talk to the VM host. So the "off" confirmation comes from the hypervisor, not the target. Depending on your setup, you might find better luck with fence_virsh (I have to use this as there is a known multicast issue with Fedora hosts). Can you try, as a test if nothing else, if 'fence_virsh' will work for you? fence_virsh -a -l root -p -n -o status If this works, it should be trivial to add to cluster.conf. If that works, then you have a working fence method. However, I would recommend switching back to fence_xvm if you can. The fence_virsh agent is dependent on libvirtd running, which some consider a risk. hth -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From nik600 at gmail.com Sat Feb 1 20:50:03 2014 From: nik600 at gmail.com (nik600) Date: Sat, 1 Feb 2014 21:50:03 +0100 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: <52ED4057.4060309@alteeve.ca> References: <52ED4057.4060309@alteeve.ca> Message-ID: My problem is that i don't have root access at host level. Il 01/feb/2014 19:49 "Digimer" ha scritto: > On 01/02/14 01:35 PM, nik600 wrote: > >> Dear all >> >> i need some clarification about clustering with rhel 6.4 >> >> i have a cluster with 2 node in active/passive configuration, i simply >> want to have a virtual ip and migrate it between 2 nodes. >> >> i've noticed that if i reboot or manually shut down a node the failover >> works correctly, but if i power-off one node the cluster doesn't >> failover on the other node. >> >> Another stange situation is that if power off all the nodes and then >> switch on only one the cluster doesn't start on the active node. >> >> I've read manual and documentation at >> >> https://access.redhat.com/site/documentation/en-US/Red_ >> Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html >> >> and i've understand that the problem is related to fencing, but the >> problem is that my 2 nodes are on 2 virtual machine , i can't control >> hardware and can't issue any custom command on the host-side. >> >> I've tried to use fence_xvm but i'm not sure about it because if my VM >> has powered-off, how can it reply to fence_vxm messags? 
>> >> Here my logs when i power off the VM: >> >> ==> /var/log/cluster/fenced.log <== >> Feb 01 18:50:22 fenced fencing node mynode02 >> Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm result: >> error from agent >> Feb 01 18:50:53 fenced fence mynode02 failed >> >> I've tried to force the manual fence with: >> >> fence_ack_manual mynode02 >> >> and in this case the failover works properly. >> >> The point is: as i'm not using any shared filesystem but i'm only >> sharing apache with a virtual ip, i won't have any split-brain scenario >> so i don't need fencing, or not? >> >> So, is there the possibility to have a simple "dummy" fencing? >> >> here is my config.xml: >> >> >> >> > post_join_delay="0"/> >> >> >> >> >> >> > name="mynode01"/> >> >> >> >> >> >> >> > name="mynode02"/> >> >> >> >> >> >> >> >> >> >> >> > ordered="0" restricted="0"> >> > priority="1"/> >> > priority="2"/> >> >> >> >> > recovery="relocate"> >> > sleeptime="2"/> >> > server_root="/etc/httpd" shutdown_wait="0"/> >> >> >> >> >> Thanks to all in advance. >> > > The fence_virtd/fence_xvm agent works by using multicast to talk to the VM > host. So the "off" confirmation comes from the hypervisor, not the target. > > Depending on your setup, you might find better luck with fence_virsh (I > have to use this as there is a known multicast issue with Fedora hosts). > Can you try, as a test if nothing else, if 'fence_virsh' will work for you? > > fence_virsh -a -l root -p -n target vm> -o status > > If this works, it should be trivial to add to cluster.conf. If that works, > then you have a working fence method. However, I would recommend switching > back to fence_xvm if you can. The fence_virsh agent is dependent on > libvirtd running, which some consider a risk. > > hth > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Sat Feb 1 21:04:51 2014 From: lists at alteeve.ca (Digimer) Date: Sat, 01 Feb 2014 16:04:51 -0500 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: References: <52ED4057.4060309@alteeve.ca> Message-ID: <52ED6173.5060404@alteeve.ca> Ooooh, I'm not sure what option you have then. I suppose fence_virtd/fence_xvm is your best option, but you're going to need to have the admin configure the fence_virtd side. On 01/02/14 03:50 PM, nik600 wrote: > My problem is that i don't have root access at host level. > > Il 01/feb/2014 19:49 "Digimer" > ha scritto: > > On 01/02/14 01:35 PM, nik600 wrote: > > Dear all > > i need some clarification about clustering with rhel 6.4 > > i have a cluster with 2 node in active/passive configuration, i > simply > want to have a virtual ip and migrate it between 2 nodes. > > i've noticed that if i reboot or manually shut down a node the > failover > works correctly, but if i power-off one node the cluster doesn't > failover on the other node. > > Another stange situation is that if power off all the nodes and then > switch on only one the cluster doesn't start on the active node. 
> > I've read manual and documentation at > > https://access.redhat.com/__site/documentation/en-US/Red___Hat_Enterprise_Linux/6/html/__Cluster_Administration/index.__html > > > and i've understand that the problem is related to fencing, but the > problem is that my 2 nodes are on 2 virtual machine , i can't > control > hardware and can't issue any custom command on the host-side. > > I've tried to use fence_xvm but i'm not sure about it because if > my VM > has powered-off, how can it reply to fence_vxm messags? > > Here my logs when i power off the VM: > > ==> /var/log/cluster/fenced.log <== > Feb 01 18:50:22 fenced fencing node mynode02 > Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm > result: > error from agent > Feb 01 18:50:53 fenced fence mynode02 failed > > I've tried to force the manual fence with: > > fence_ack_manual mynode02 > > and in this case the failover works properly. > > The point is: as i'm not using any shared filesystem but i'm only > sharing apache with a virtual ip, i won't have any split-brain > scenario > so i don't need fencing, or not? > > So, is there the possibility to have a simple "dummy" fencing? > > here is my config.xml: > > > > post_join_delay="0"/> > > > > > > name="mynode01"/> > > > > > > > name="mynode02"/> > > > > > > > > > > > nofailback="0" > ordered="0" restricted="0"> > name="mynode01" > priority="1"/> > name="mynode02" > priority="2"/> > > > > name="MYSERVICE" > recovery="relocate"> > monitor_link="on" > sleeptime="2"/> > server_root="/etc/httpd" shutdown_wait="0"/> > > > > > Thanks to all in advance. > > > The fence_virtd/fence_xvm agent works by using multicast to talk to > the VM host. So the "off" confirmation comes from the hypervisor, > not the target. > > Depending on your setup, you might find better luck with fence_virsh > (I have to use this as there is a known multicast issue with Fedora > hosts). Can you try, as a test if nothing else, if 'fence_virsh' > will work for you? > > fence_virsh -a -l root -p -n for target vm> -o status > > If this works, it should be trivial to add to cluster.conf. If that > works, then you have a working fence method. However, I would > recommend switching back to fence_xvm if you can. The fence_virsh > agent is dependent on libvirtd running, which some consider a risk. > > hth > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person > without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/__mailman/listinfo/linux-cluster > > > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From nik600 at gmail.com Sat Feb 1 21:11:45 2014 From: nik600 at gmail.com (nik600) Date: Sat, 1 Feb 2014 22:11:45 +0100 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: <52ED6173.5060404@alteeve.ca> References: <52ED4057.4060309@alteeve.ca> <52ED6173.5060404@alteeve.ca> Message-ID: Ok but is not possible to ignore fence? Il 01/feb/2014 22:09 "Digimer" ha scritto: > Ooooh, I'm not sure what option you have then. I suppose > fence_virtd/fence_xvm is your best option, but you're going to need to have > the admin configure the fence_virtd side. > > On 01/02/14 03:50 PM, nik600 wrote: > >> My problem is that i don't have root access at host level. 
>> >> Il 01/feb/2014 19:49 "Digimer" > > ha scritto: >> >> On 01/02/14 01:35 PM, nik600 wrote: >> >> Dear all >> >> i need some clarification about clustering with rhel 6.4 >> >> i have a cluster with 2 node in active/passive configuration, i >> simply >> want to have a virtual ip and migrate it between 2 nodes. >> >> i've noticed that if i reboot or manually shut down a node the >> failover >> works correctly, but if i power-off one node the cluster doesn't >> failover on the other node. >> >> Another stange situation is that if power off all the nodes and >> then >> switch on only one the cluster doesn't start on the active node. >> >> I've read manual and documentation at >> >> https://access.redhat.com/__site/documentation/en-US/Red__ >> _Hat_Enterprise_Linux/6/html/__Cluster_Administration/index.__html >> > Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html> >> >> and i've understand that the problem is related to fencing, but >> the >> problem is that my 2 nodes are on 2 virtual machine , i can't >> control >> hardware and can't issue any custom command on the host-side. >> >> I've tried to use fence_xvm but i'm not sure about it because if >> my VM >> has powered-off, how can it reply to fence_vxm messags? >> >> Here my logs when i power off the VM: >> >> ==> /var/log/cluster/fenced.log <== >> Feb 01 18:50:22 fenced fencing node mynode02 >> Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm >> result: >> error from agent >> Feb 01 18:50:53 fenced fence mynode02 failed >> >> I've tried to force the manual fence with: >> >> fence_ack_manual mynode02 >> >> and in this case the failover works properly. >> >> The point is: as i'm not using any shared filesystem but i'm only >> sharing apache with a virtual ip, i won't have any split-brain >> scenario >> so i don't need fencing, or not? >> >> So, is there the possibility to have a simple "dummy" fencing? >> >> here is my config.xml: >> >> >> >> > post_join_delay="0"/> >> >> >> > votes="1"> >> >> >> > domain="mynode01" >> name="mynode01"/> >> >> >> >> > votes="1"> >> >> >> > domain="mynode02" >> name="mynode02"/> >> >> >> >> >> >> >> >> >> >> >> > nofailback="0" >> ordered="0" restricted="0"> >> > name="mynode01" >> priority="1"/> >> > name="mynode02" >> priority="2"/> >> >> >> >> > name="MYSERVICE" >> recovery="relocate"> >> > monitor_link="on" >> sleeptime="2"/> >> > server_root="/etc/httpd" shutdown_wait="0"/> >> >> >> >> >> Thanks to all in advance. >> >> >> The fence_virtd/fence_xvm agent works by using multicast to talk to >> the VM host. So the "off" confirmation comes from the hypervisor, >> not the target. >> >> Depending on your setup, you might find better luck with fence_virsh >> (I have to use this as there is a known multicast issue with Fedora >> hosts). Can you try, as a test if nothing else, if 'fence_virsh' >> will work for you? >> >> fence_virsh -a -l root -p -n > for target vm> -o status >> >> If this works, it should be trivial to add to cluster.conf. If that >> works, then you have a working fence method. However, I would >> recommend switching back to fence_xvm if you can. The fence_virsh >> agent is dependent on libvirtd running, which some consider a risk. >> >> hth >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person >> without access to education? 
>> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/__mailman/listinfo/linux-cluster >> >> >> >> >> > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Sat Feb 1 21:22:08 2014 From: lists at alteeve.ca (Digimer) Date: Sat, 01 Feb 2014 16:22:08 -0500 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: References: <52ED4057.4060309@alteeve.ca> <52ED6173.5060404@alteeve.ca> Message-ID: <52ED6580.7070700@alteeve.ca> No. When a node is lost, fenced is called. Fenced informs DLM that a fence is pending and DLM stops issuing locks. Only after fenced confirms successful fence is DLM told. The DLM will reap locks held by the now-fenced node and recovery can begin. Anything using DLM; rgmanager, clvmd, gfs2, will block. This is by design. If you ever allowed a cluster to make an assumption about the state of a lost node, you risk a split-brain. If a split-brain was tolerable, you wouldn't need an HA cluster. :) digimer On 01/02/14 04:11 PM, nik600 wrote: > Ok but is not possible to ignore fence? > > Il 01/feb/2014 22:09 "Digimer" > ha scritto: > > Ooooh, I'm not sure what option you have then. I suppose > fence_virtd/fence_xvm is your best option, but you're going to need > to have the admin configure the fence_virtd side. > > On 01/02/14 03:50 PM, nik600 wrote: > > My problem is that i don't have root access at host level. > > Il 01/feb/2014 19:49 "Digimer" > >> ha scritto: > > On 01/02/14 01:35 PM, nik600 wrote: > > Dear all > > i need some clarification about clustering with rhel 6.4 > > i have a cluster with 2 node in active/passive > configuration, i > simply > want to have a virtual ip and migrate it between 2 nodes. > > i've noticed that if i reboot or manually shut down a > node the > failover > works correctly, but if i power-off one node the > cluster doesn't > failover on the other node. > > Another stange situation is that if power off all the > nodes and then > switch on only one the cluster doesn't start on the > active node. > > I've read manual and documentation at > > https://access.redhat.com/____site/documentation/en-US/Red_____Hat_Enterprise_Linux/6/html/____Cluster_Administration/index.____html > > > > > > and i've understand that the problem is related to > fencing, but the > problem is that my 2 nodes are on 2 virtual machine , i > can't > control > hardware and can't issue any custom command on the > host-side. > > I've tried to use fence_xvm but i'm not sure about it > because if > my VM > has powered-off, how can it reply to fence_vxm messags? > > Here my logs when i power off the VM: > > ==> /var/log/cluster/fenced.log <== > Feb 01 18:50:22 fenced fencing node mynode02 > Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent > fence_xvm > result: > error from agent > Feb 01 18:50:53 fenced fence mynode02 failed > > I've tried to force the manual fence with: > > fence_ack_manual mynode02 > > and in this case the failover works properly. > > The point is: as i'm not using any shared filesystem > but i'm only > sharing apache with a virtual ip, i won't have any > split-brain > scenario > so i don't need fencing, or not? 
> > So, is there the possibility to have a simple "dummy" > fencing? > > here is my config.xml: > > > > post_join_delay="0"/> > > > nodeid="1" votes="1"> > > > domain="mynode01" > name="mynode01"/> > > > > nodeid="2" votes="1"> > > > domain="mynode02" > name="mynode02"/> > > > > > > name="mynode01"/> > name="mynode02"/> > > > > nofailback="0" > ordered="0" restricted="0"> > name="mynode01" > priority="1"/> > name="mynode02" > priority="2"/> > > > > name="MYSERVICE" > recovery="relocate"> > monitor_link="on" > sleeptime="2"/> > server_root="/etc/httpd" shutdown_wait="0"/> > > > > > Thanks to all in advance. > > > The fence_virtd/fence_xvm agent works by using multicast to > talk to > the VM host. So the "off" confirmation comes from the > hypervisor, > not the target. > > Depending on your setup, you might find better luck with > fence_virsh > (I have to use this as there is a known multicast issue > with Fedora > hosts). Can you try, as a test if nothing else, if > 'fence_virsh' > will work for you? > > fence_virsh -a -l root -p -n > for target vm> -o status > > If this works, it should be trivial to add to cluster.conf. > If that > works, then you have a working fence method. However, I would > recommend switching back to fence_xvm if you can. The > fence_virsh > agent is dependent on libvirtd running, which some consider > a risk. > > hth > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person > without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > > > https://www.redhat.com/____mailman/listinfo/linux-cluster > > __> > > > > > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person > without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/__mailman/listinfo/linux-cluster > > > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From ben at zentrix.be Fri Feb 7 16:13:15 2014 From: ben at zentrix.be (Benjamin Budts) Date: Fri, 7 Feb 2014 17:13:15 +0100 Subject: [Linux-cluster] manual intervention 1 node when fencing fails due to complete power outage Message-ID: <012b01cf241f$81c79300$8556b900$@zentrix.be> Gents, I have a 2 node setup (with quorum disk), redhat 6.5 & a luci mgmt console. Everything has been configured and we?re doing failover tests now. Couple of questions I have : ? When I simulate a complete power failure of a servers pdu?s (no more access to idrac fencing or APC PDU fencing) I can see that the fencing of that node who was running the application fails ? I noticed unless fencing returns an OK I?m stuck and my application won?t start on my 2nd node. Which is ok I guess, because no fencing could mean there is still I/O on my san. Clustat also shows on the active node that the 1st node is still running the application. How can I intervene manually, so as to force a start of the application on the node that is still alive ? Is there a way to tell the cluster, don?t take into account node 1 anymore and don?t try to fence anymore, just start the application on the node that is still ok ? I can?t possibly wait until power returns to that server. Downtime could be too long. ? If I tell a node to leave the cluster in Luci, I would like it to remain a non-cluster member after the reboot of that node. 
It rejoins the cluster automatically after a reboot. Any way to prevent this ? Thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Fri Feb 7 17:55:32 2014 From: lists at alteeve.ca (Digimer) Date: Fri, 07 Feb 2014 12:55:32 -0500 Subject: [Linux-cluster] manual intervention 1 node when fencing fails due to complete power outage In-Reply-To: <012b01cf241f$81c79300$8556b900$@zentrix.be> References: <012b01cf241f$81c79300$8556b900$@zentrix.be> Message-ID: <52F51E14.6030102@alteeve.ca> On 07/02/14 11:13 AM, Benjamin Budts wrote: > Gents, We're not all gents. ;) > I have a 2 node setup (with quorum disk), redhat 6.5 & a luci mgmt console. > > Everything has been configured and we?re doing failover tests now. > > Couple of questions I have : > > ?When I simulate a complete power failure of a servers pdu?s (no more > access to idrac fencing or APC PDU fencing) I can see that the fencing > of that node who was running the application fails ?I noticed unless > fencing returns an OK I?m stuck and my application won?t start on my > 2^nd node. Which is ok I guess, because no fencing could mean there is > still I/O on my san. This is expected. If a lost node can't be put into a known state, there is no safe way to proceed. To do so would be to risk a split brain at least, and data loss/corruption at worst. The way I deal with this is to have nodes with redundant power supplies and use two PDUs and two UPSes. This way, the failure of on cirtcuit / UPS / PDU doesn't knock out the power to the mainboard of the nodes, so you don't lose IPMI. > Clustat also shows on the active node that the 1^st node is still > running the application. That's likely because rgmanager uses DLM, and DLM blocks until the fence succeeds, so it can't update it's view. > How can I intervene manually, so as to force a start of the application > on the node that is still alive ? If you are *100% ABSOLUTELY SURE* that the lost node has been powered off, then you can run 'fence_ack_manual'. Please be super careful about this though. If you do this, in the heat of the moment with clients or bosses yelling at you, and the peer isn't really off (ie: it's only hung), you risk serious problems. I can not emphasis strongly enough the caution needed when using this command. > Is there a way to tell the cluster, don?t take into account node 1 > anymore and don?t try to fence anymore, just start the application on > the node that is still ok ? No. That would risk a split brain and data corruption. The only safe option for the cluster, if the face of a failed fence, is to hang. As bad as it is to hang, it's better than risking corruption. > I can?t possibly wait until power returns to that server. Downtime could > be too long. See the solution I mentioned earlier. > ?If I tell a node to leave the cluster in Luci, I would like it to > remain a non-cluster member after the reboot of that node. It rejoins > the cluster automatically after a reboot. Any way to prevent this ? > > Thx Don't let cman and rgmanager start on boot. This is always my policy. If a node failed and got fenced, I want it to reboot, so that I can log into it and figure out what happened, but I do _not_ want it back in the cluster until I've determined it is healthy. hth -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
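For reference, the commands involved above look roughly like this on RHEL 6 (the node name is only an example carried over from the earlier fence_xvm thread; treat this as a sketch, not a tested recipe):

  # Only after confirming the lost node really is powered off:
  fence_ack_manual mynode02

  # Stop cman and rgmanager from rejoining the cluster automatically at boot:
  chkconfig cman off
  chkconfig rgmanager off

  # Once the node has been checked and is known healthy, rejoin it by hand:
  service cman start
  service rgmanager start
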
From Mark.Vallevand at UNISYS.com Fri Feb 7 19:48:12 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Fri, 7 Feb 2014 13:48:12 -0600 Subject: [Linux-cluster] Collocating cloned resources Message-ID: <99C8B2929B39C24493377AC7A121E21FC5D8BC6C9F@USEA-EXCH8.na.uis.unisys.com> I'm pretty sure I can collocate cloned resources. If so, will the clone instance number in the resource agents (OCF_RESKEY_CRM_meta_clone) be the same for the instances running on the same node? Regards. Mark K Vallevand Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Vallevand at UNISYS.com Fri Feb 7 20:32:27 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Fri, 7 Feb 2014 14:32:27 -0600 Subject: [Linux-cluster] What happens if a node running a cloned resource crashes Message-ID: <99C8B2929B39C24493377AC7A121E21FC5D8BC6D62@USEA-EXCH8.na.uis.unisys.com> If I have 5 nodes in my cluster and I have a cloned resource running on 4 of them (clone_max=4), and one of the 4 crashes, will an instance of the cloned resource be started on the 5th node? Regards. Mark K Vallevand Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From arnold at arnoldarts.de Fri Feb 7 21:18:06 2014 From: arnold at arnoldarts.de (Arnold Krille) Date: Fri, 7 Feb 2014 22:18:06 +0100 Subject: [Linux-cluster] What happens if a node running a cloned resource crashes In-Reply-To: <99C8B2929B39C24493377AC7A121E21FC5D8BC6D62@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FC5D8BC6D62@USEA-EXCH8.na.uis.unisys.com> Message-ID: <20140207221806.42a98416@xingu.arnoldarts.de> On Fri, 7 Feb 2014 14:32:27 -0600 "Vallevand, Mark K" wrote: > If I have 5 nodes in my cluster and I have a cloned resource running > on 4 of them (clone_max=4), and one of the 4 crashes, will an > instance of the cloned resource be started on the 5th node? Thats the idea! You tell the cluster to run up to 4 cloned resources on the cluster as long as possible (high availability!). And as long as possible, it will run these four resources. It might even run several of these resources on one node. You have to set clone-node-max appropriately to prevent all four resources running on one node. Have fun, Arnold -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 230 bytes Desc: not available URL: From Mark.Vallevand at UNISYS.com Fri Feb 7 21:42:10 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Fri, 7 Feb 2014 15:42:10 -0600 Subject: [Linux-cluster] Collocating cloned resources In-Reply-To: <99C8B2929B39C24493377AC7A121E21FC5D8BC6C9F@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FC5D8BC6C9F@USEA-EXCH8.na.uis.unisys.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FC5D8BC6E8F@USEA-EXCH8.na.uis.unisys.com> Assuming clone-node-max=1 so that one instance of each resource runs on each node, that is. Regards. Mark K Vallevand Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vallevand, Mark K Sent: Friday, February 07, 2014 01:48 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] Collocating cloned resources I'm pretty sure I can collocate cloned resources. If so, will the clone instance number in the resource agents (OCF_RESKEY_CRM_meta_clone) be the same for the instances running on the same node? Regards. Mark K Vallevand Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Vallevand at UNISYS.com Fri Feb 7 21:42:15 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Fri, 7 Feb 2014 15:42:15 -0600 Subject: [Linux-cluster] What happens if a node running a cloned resource crashes In-Reply-To: <20140207221806.42a98416@xingu.arnoldarts.de> References: <99C8B2929B39C24493377AC7A121E21FC5D8BC6D62@USEA-EXCH8.na.uis.unisys.com> <20140207221806.42a98416@xingu.arnoldarts.de> Message-ID: <99C8B2929B39C24493377AC7A121E21FC5D8BC6E90@USEA-EXCH8.na.uis.unisys.com> Good. Thanks. Regards. Mark K Vallevand?? Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Arnold Krille Sent: Friday, February 07, 2014 03:18 PM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] What happens if a node running a cloned resource crashes On Fri, 7 Feb 2014 14:32:27 -0600 "Vallevand, Mark K" wrote: > If I have 5 nodes in my cluster and I have a cloned resource running > on 4 of them (clone_max=4), and one of the 4 crashes, will an > instance of the cloned resource be started on the 5th node? Thats the idea! 
You tell the cluster to run up to 4 cloned resources on the cluster as long as possible (high availability!). And as long as possible, it will run these four resources. It might even run several of these resources on one node. You have to set clone-node-max appropriately to prevent all four resources running on one node. Have fun, Arnold From ben at zentrix.be Mon Feb 10 14:12:28 2014 From: ben at zentrix.be (Benjamin Budts) Date: Mon, 10 Feb 2014 15:12:28 +0100 Subject: [Linux-cluster] backup best practice when using Luci Message-ID: <018e01cf266a$21624170$6426c450$@zentrix.be> Ladies & Gents (I won't make that same mistake again ;) ), First, thank you to the lady who helped me explain how to force an OK on fencing that is failing. A 2 node config & Luci : I would liketo put a backup solution in place for the cluster config / nodeconfig/ fencing etc... What would you recommend ? Or does Luci archive versions of config-files somewhere ? Basically, if shit hits the fan I would like to untar a golden image of a config on luci and push it back to the nodes. Thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Mon Feb 10 14:43:22 2014 From: lists at alteeve.ca (Digimer) Date: Mon, 10 Feb 2014 09:43:22 -0500 Subject: [Linux-cluster] backup best practice when using Luci In-Reply-To: <018e01cf266a$21624170$6426c450$@zentrix.be> References: <018e01cf266a$21624170$6426c450$@zentrix.be> Message-ID: <52F8E58A.8080808@alteeve.ca> On 10/02/14 09:12 AM, Benjamin Budts wrote: > Ladies & Gents (I won?t make that same mistake again ;) ), > > First, thank you to the lady who helped me explain how to force an OK on > fencing that is failing. > > A 2 node config & Luci : > > I would liketo put a backup solution in place for the cluster config / > nodeconfig/ fencing etc... > > What would you recommend ? Or does Luci archive versions of > config-files somewhere ? > > Basically, if shit hits the fan I would like to untar a golden image of > a config on luci and push it back to the nodes? > > Thx The main config file is /etc/cluster/cluster.conf. A copy of this file should be on all nodes at once, so even if you didn't have a backup proper, you should be able to copy it from another node. Beyond that, I personally backup (sample taken from a node called 'an-c05n01'; ==== mkdir ~/base cd ~/base mkdir root mkdir -p etc/sysconfig/network-scripts/ mkdir -p etc/udev/rules.d/ # Root user rsync -av /root/.bashrc root/ rsync -av /root/.ssh root/ # Directories rsync -av /etc/ssh etc/ rsync -av /etc/apcupsd etc/ rsync -av /etc/cluster etc/ rsync -av /etc/drbd.* etc/ rsync -av /etc/lvm etc/ # Specific files. rsync -av /etc/sysconfig/network-scripts/ifcfg-{eth*,bond*,vbr*} etc/sysconfig/network-scripts/ rsync -av /etc/udev/rules.d/70-persistent-net.rules etc/udev/rules.d/ rsync -av /etc/sysconfig/network etc/sysconfig/ rsync -av /etc/hosts etc/ rsync -av /etc/ntp.conf etc/ # Save recreating user accounts. rsync -av /etc/passwd etc/ rsync -av /etc/group etc/ rsync -av /etc/shadow etc/ rsync -av /etc/gshadow etc/ # If you have the cluster built and want to backup it's configs. mkdir etc/cluster mkdir etc/lvm rsync -av /etc/cluster/cluster.conf etc/cluster/ rsync -av /etc/lvm/lvm.conf etc/lvm/ # NOTE: DRBD won't work until you've manually created the partitions. 
rsync -av /etc/drbd.d etc/ # If you had to manually set the UUID in libvirtd; mkdir etc/libvirt rsync -av /etc/libvirt/libvirt.conf etc/libvirt/ # If you're running RHEL and want to backup your registration info; rsync -av /etc/sysconfig/rhn etc/sysconfig/ # Pack it up # NOTE: Change the name to suit your node. tar -cvf base_an-c05n01.tar etc root ==== I then push the resulting tar file to my PXE server. I have a kickstart script that does a minimal rhel6 install, plus the cluster stuff, and then has a %post script that downloads this tar and extracts it. This way, when the node needs to be rebuilt, it's 95% ready to go. I still need to do things like 'drbdadm create-md ', but it's still very quick to restore a node. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From ckonstanski at pippiandcarlos.com Mon Feb 10 23:12:37 2014 From: ckonstanski at pippiandcarlos.com (Carlos Konstanski) Date: Mon, 10 Feb 2014 16:12:37 -0700 Subject: [Linux-cluster] changing an ocf:heartbeat:Filesystem config Message-ID: <52F95CE5.9080304@pippiandcarlos.com> I have a need to change the device node for an ocf:heartbeat:Filesystem resource. It currently points to /dev/sda. I want to change it to /dev/disk/by-uuid/e4284038-1b26-418e-b205-93395373379b. The reason for this change is to make sure that a renaming of disks by udev does not break my cluster. The above example is a lab environment. the production environment is messier. What is the easiest and/or least impacting way to make this change? If these two requirements are mutually exclusive, then please lean toward least impacting. Thanks! From ckonstanski at pippiandcarlos.com Mon Feb 10 23:26:01 2014 From: ckonstanski at pippiandcarlos.com (Carlos Konstanski) Date: Mon, 10 Feb 2014 16:26:01 -0700 Subject: [Linux-cluster] changing an ocf:heartbeat:Filesystem config In-Reply-To: <52F95CE5.9080304@pippiandcarlos.com> References: <52F95CE5.9080304@pippiandcarlos.com> Message-ID: <52F96009.3070703@pippiandcarlos.com> Is this what I want to do? crm configure save > filename Edit the file crm configure load replace filename On 02/10/2014 04:12 PM, Carlos Konstanski wrote: > I have a need to change the device node for an ocf:heartbeat:Filesystem > resource. It currently points to /dev/sda. I want to change it to > /dev/disk/by-uuid/e4284038-1b26-418e-b205-93395373379b. > > The reason for this change is to make sure that a renaming of disks by > udev does not break my cluster. The above example is a lab environment. > the production environment is messier. > > What is the easiest and/or least impacting way to make this change? If > these two requirements are mutually exclusive, then please lean toward > least impacting. > > Thanks! > From andrew at beekhof.net Mon Feb 10 23:35:32 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Tue, 11 Feb 2014 10:35:32 +1100 Subject: [Linux-cluster] changing an ocf:heartbeat:Filesystem config In-Reply-To: <52F96009.3070703@pippiandcarlos.com> References: <52F95CE5.9080304@pippiandcarlos.com> <52F96009.3070703@pippiandcarlos.com> Message-ID: that is a quite reasonable approach On 11 Feb 2014, at 10:26 am, Carlos Konstanski wrote: > Is this what I want to do? > > crm configure save > filename > Edit the file > crm configure load replace filename > > On 02/10/2014 04:12 PM, Carlos Konstanski wrote: >> I have a need to change the device node for an ocf:heartbeat:Filesystem >> resource. It currently points to /dev/sda. 
I want to change it to >> /dev/disk/by-uuid/e4284038-1b26-418e-b205-93395373379b. >> >> The reason for this change is to make sure that a renaming of disks by >> udev does not break my cluster. The above example is a lab environment. >> the production environment is messier. >> >> What is the easiest and/or least impacting way to make this change? If >> these two requirements are mutually exclusive, then please lean toward >> least impacting. >> >> Thanks! >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From ckonstanski at pippiandcarlos.com Mon Feb 10 23:47:27 2014 From: ckonstanski at pippiandcarlos.com (Carlos Konstanski) Date: Mon, 10 Feb 2014 16:47:27 -0700 Subject: [Linux-cluster] changing an ocf:heartbeat:Filesystem config In-Reply-To: References: <52F95CE5.9080304@pippiandcarlos.com> <52F96009.3070703@pippiandcarlos.com> Message-ID: <52F9650F.1060406@pippiandcarlos.com> Thanks! I'll do it. It worked well in my lab environment. On 02/10/2014 04:35 PM, Andrew Beekhof wrote: > that is a quite reasonable approach > > On 11 Feb 2014, at 10:26 am, Carlos Konstanski wrote: > >> Is this what I want to do? >> >> crm configure save > filename >> Edit the file >> crm configure load replace filename >> >> On 02/10/2014 04:12 PM, Carlos Konstanski wrote: >>> I have a need to change the device node for an ocf:heartbeat:Filesystem >>> resource. It currently points to /dev/sda. I want to change it to >>> /dev/disk/by-uuid/e4284038-1b26-418e-b205-93395373379b. >>> >>> The reason for this change is to make sure that a renaming of disks by >>> udev does not break my cluster. The above example is a lab environment. >>> the production environment is messier. >>> >>> What is the easiest and/or least impacting way to make this change? If >>> these two requirements are mutually exclusive, then please lean toward >>> least impacting. >>> >>> Thanks! >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > From Ralph.Grothe at itdz-berlin.de Tue Feb 11 10:17:17 2014 From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de) Date: Tue, 11 Feb 2014 11:17:17 +0100 Subject: [Linux-cluster] =?iso-8859-1?q?Question_regarding_typed_resources?= =?iso-8859-1?q?=B4_parent_child_vs=2E_sibling_ordering?= Message-ID: Hello, My actions and questions relate to a RHEL 5 RHCS cluster. Though I studied carefully the official RHEL 5 Cluster Admin Guide with special emphasis on the chapter "HA Resource Behavior" there remain certain things unclear to me. First of all, I have to mention that my cluster.conf?s parent-child-sibling hierarchies whithin the service scopes could successfully be checked in as valid cluster configuration (i.e. "ccs_tool update /etc/cluster/cluster.conf" succeeded). My first question is whether it is feasible to use the tag, which originally is meant to map inheritance, and populate such a block although I don't make any use of inheritance in my configuration? I simply find that its use makes the appearance of the block much more readable an tidier. Now to my main concern. Would such a block be valid and start and stop resources in the proper order (i.e. according to my intention)? e.g. ... ... ... 
I am asking because I read in the mentioned doc above that for a typed resource (such as ip, lvm, fs,...) there exists a strict start and stop sequence for siblings. In my parent-child hierarchy above I am reversing this start order by making the ip resource a parent of the lvm resource which in sibling context would have a higher starting precedence than the ip resource. Of course I had a second thought in mind when rigging up this seemingly oblique hierarchy of typed resources. Because there are scheduled maintenance downtimes I wanted to ease the activation of a whole bunch of a service's resources like LVM LVs, mountpoints and IP addresses with a single rg_test invocation when a service has previously been disabled. I then could issue according to the above config snippet just a e.g. rg_test test /etc/cluster/cluster.conf start ip 10.20.30.40 and have all resources activated apart from the Oracle DB instance. There is yet another issue that puzzles me. If I look at the starting sequence by issuing rg_test noop /etc/cluster/cluster.conf start service srv-a then the resource script:sid-a_statechg_notify gets executed before the resources oracledb:SID-A and script:oracle_em. This would imply to me that any resource of type script has a higher starting precedence than any resource of type oracledb because in my config above they are siblings. I actually would have thought it to be the other way round, i.e. that script resources have the lowest starting precedence of all. Unfortunately, in Table D.1. "Child Resource Type Start and Stop Order" on page 112 of the cluster administration guide the typed resource oracledb does not appear. Many thanks for your patience having read this far. Regards, Ralph From rmitchel at redhat.com Tue Feb 11 23:41:30 2014 From: rmitchel at redhat.com (Ryan Mitchell) Date: Wed, 12 Feb 2014 10:41:30 +1100 Subject: [Linux-cluster] =?iso-8859-1?q?Question_regarding_typed_resources?= =?iso-8859-1?q?=B4_parent_child_vs=2E_sibling_ordering?= In-Reply-To: References: Message-ID: <52FAB52A.4010809@redhat.com> Hi, On 02/11/2014 09:17 PM, Ralph.Grothe at itdz-berlin.de wrote: > Hello, > > > My actions and questions relate to a RHEL 5 RHCS cluster. > > Though I studied carefully the official RHEL 5 Cluster Admin > Guide with special emphasis on the chapter "HA Resource Behavior" > there remain certain things unclear to me. > > First of all, I have to mention that my cluster.conf?s > parent-child-sibling hierarchies whithin the service scopes could > successfully be checked in as valid cluster configuration (i.e. > "ccs_tool update /etc/cluster/cluster.conf" succeeded). > > > My first question is whether it is feasible to use the > tag, which originally is meant to map inheritance, > and populate such a block although I don't make any use of > inheritance in my configuration? > I simply find that its use makes the appearance of the block > much more readable an tidier. I don't entirely follow, but I'll take a guess that you are asking if it is compulsory to define resources in the section, and then referencing them in the section>? If that is what you mean, then I can confirm that the recommended method is to define resources in the tags and to reference those definitions in the tags. But it is also possible to leave the section blank, and declare the resources when they are specified in the section. Both are possible. Maybe I misunderstood you, or I misunderstood what you were referring to when you mentioned inheritance. Please clarify if I did not answer your question. 
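As a minimal illustration of the two styles described above (every name and address below is invented, not taken from the poster's configuration):

  <rm>
      <resources>
          <!-- defined once under <resources>... -->
          <ip address="192.168.122.10" monitor_link="on"/>
          <script file="/etc/init.d/myapp" name="myapp-script"/>
      </resources>
      <service autostart="1" name="svc-defined-by-ref" recovery="relocate">
          <!-- ...and referenced here by primary attribute -->
          <ip ref="192.168.122.10"/>
          <script ref="myapp-script"/>
      </service>
      <service autostart="1" name="svc-declared-inline" recovery="relocate">
          <!-- or declared inline, with no <resources> entry at all -->
          <ip address="192.168.122.11" monitor_link="on"/>
      </service>
  </rm>

Both forms are valid cluster.conf; the referenced style simply leaves a single definition to maintain when the same resource is used by more than one service.
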
> Now to my main concern. > > Would such a block be valid and start and stop > resources in the proper order (i.e. according to my intention)? > > e.g. > > > > > ... > > > ... > > > > > > > > > > > > > > > > __independent_subtree="1" __max_restarts="2" > __restart_expire_time="0"/> > > > ... > > > You have not stated your intentional starting order (to this point in the email), but my understanding is that this services will start in the following order: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
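
To make the ordering question concrete, here is a purely hypothetical sketch of the kind of hierarchy being discussed. rgmanager starts a parent before its children and stops the children before the parent, so explicit nesting like this puts the ip ahead of the lvm and fs resources, regardless of the default sibling ordering for typed resources (all names and devices are invented; check the order actually computed for your own cluster.conf with rg_test):

  <service autostart="1" domain="example-domain" name="svc-example" recovery="relocate">
      <ip address="192.168.122.10" monitor_link="on">
          <lvm lv_name="lv_data" name="lvm-example" vg_name="vg_example">
              <fs device="/dev/vg_example/lv_data" fstype="ext4" mountpoint="/data" name="fs-example"/>
          </lvm>
      </ip>
  </service>

  # print the computed start and stop sequences without touching any resources:
  rg_test noop /etc/cluster/cluster.conf start service svc-example
  rg_test noop /etc/cluster/cluster.conf stop service svc-example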