From nik600 at gmail.com Sat Feb 1 18:35:25 2014 From: nik600 at gmail.com (nik600) Date: Sat, 1 Feb 2014 19:35:25 +0100 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine Message-ID: Dear all i need some clarification about clustering with rhel 6.4 i have a cluster with 2 node in active/passive configuration, i simply want to have a virtual ip and migrate it between 2 nodes. i've noticed that if i reboot or manually shut down a node the failover works correctly, but if i power-off one node the cluster doesn't failover on the other node. Another stange situation is that if power off all the nodes and then switch on only one the cluster doesn't start on the active node. I've read manual and documentation at https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html and i've understand that the problem is related to fencing, but the problem is that my 2 nodes are on 2 virtual machine , i can't control hardware and can't issue any custom command on the host-side. I've tried to use fence_xvm but i'm not sure about it because if my VM has powered-off, how can it reply to fence_vxm messags? Here my logs when i power off the VM: ==> /var/log/cluster/fenced.log <== Feb 01 18:50:22 fenced fencing node mynode02 Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm result: error from agent Feb 01 18:50:53 fenced fence mynode02 failed I've tried to force the manual fence with: fence_ack_manual mynode02 and in this case the failover works properly. The point is: as i'm not using any shared filesystem but i'm only sharing apache with a virtual ip, i won't have any split-brain scenario so i don't need fencing, or not? So, is there the possibility to have a simple "dummy" fencing? here is my config.xml: Thanks to all in advance. -- /*************/ nik600 http://www.kumbe.it -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Sat Feb 1 18:43:35 2014 From: lists at alteeve.ca (Digimer) Date: Sat, 01 Feb 2014 13:43:35 -0500 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: References: Message-ID: <52ED4057.4060309@alteeve.ca> On 01/02/14 01:35 PM, nik600 wrote: > Dear all > > i need some clarification about clustering with rhel 6.4 > > i have a cluster with 2 node in active/passive configuration, i simply > want to have a virtual ip and migrate it between 2 nodes. > > i've noticed that if i reboot or manually shut down a node the failover > works correctly, but if i power-off one node the cluster doesn't > failover on the other node. > > Another stange situation is that if power off all the nodes and then > switch on only one the cluster doesn't start on the active node. > > I've read manual and documentation at > > https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html > > and i've understand that the problem is related to fencing, but the > problem is that my 2 nodes are on 2 virtual machine , i can't control > hardware and can't issue any custom command on the host-side. > > I've tried to use fence_xvm but i'm not sure about it because if my VM > has powered-off, how can it reply to fence_vxm messags? 
> > Here my logs when i power off the VM: > > ==> /var/log/cluster/fenced.log <== > Feb 01 18:50:22 fenced fencing node mynode02 > Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm result: > error from agent > Feb 01 18:50:53 fenced fence mynode02 failed > > I've tried to force the manual fence with: > > fence_ack_manual mynode02 > > and in this case the failover works properly. > > The point is: as i'm not using any shared filesystem but i'm only > sharing apache with a virtual ip, i won't have any split-brain scenario > so i don't need fencing, or not? > > So, is there the possibility to have a simple "dummy" fencing? > > here is my config.xml: > > > > post_join_delay="0"/> > > > > > > name="mynode01"/> > > > > > > > name="mynode02"/> > > > > > > > > > > > ordered="0" restricted="0"> > priority="1"/> > priority="2"/> > > > > recovery="relocate"> > sleeptime="2"/> > server_root="/etc/httpd" shutdown_wait="0"/> > > > > > Thanks to all in advance. The fence_virtd/fence_xvm agent works by using multicast to talk to the VM host. So the "off" confirmation comes from the hypervisor, not the target. Depending on your setup, you might find better luck with fence_virsh (I have to use this as there is a known multicast issue with Fedora hosts). Can you try, as a test if nothing else, if 'fence_virsh' will work for you? fence_virsh -a -l root -p -n -o status If this works, it should be trivial to add to cluster.conf. If that works, then you have a working fence method. However, I would recommend switching back to fence_xvm if you can. The fence_virsh agent is dependent on libvirtd running, which some consider a risk. hth -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From nik600 at gmail.com Sat Feb 1 20:50:03 2014 From: nik600 at gmail.com (nik600) Date: Sat, 1 Feb 2014 21:50:03 +0100 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: <52ED4057.4060309@alteeve.ca> References: <52ED4057.4060309@alteeve.ca> Message-ID: My problem is that i don't have root access at host level. Il 01/feb/2014 19:49 "Digimer" ha scritto: > On 01/02/14 01:35 PM, nik600 wrote: > >> Dear all >> >> i need some clarification about clustering with rhel 6.4 >> >> i have a cluster with 2 node in active/passive configuration, i simply >> want to have a virtual ip and migrate it between 2 nodes. >> >> i've noticed that if i reboot or manually shut down a node the failover >> works correctly, but if i power-off one node the cluster doesn't >> failover on the other node. >> >> Another stange situation is that if power off all the nodes and then >> switch on only one the cluster doesn't start on the active node. >> >> I've read manual and documentation at >> >> https://access.redhat.com/site/documentation/en-US/Red_ >> Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html >> >> and i've understand that the problem is related to fencing, but the >> problem is that my 2 nodes are on 2 virtual machine , i can't control >> hardware and can't issue any custom command on the host-side. >> >> I've tried to use fence_xvm but i'm not sure about it because if my VM >> has powered-off, how can it reply to fence_vxm messags? 
>> >> Here my logs when i power off the VM: >> >> ==> /var/log/cluster/fenced.log <== >> Feb 01 18:50:22 fenced fencing node mynode02 >> Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm result: >> error from agent >> Feb 01 18:50:53 fenced fence mynode02 failed >> >> I've tried to force the manual fence with: >> >> fence_ack_manual mynode02 >> >> and in this case the failover works properly. >> >> The point is: as i'm not using any shared filesystem but i'm only >> sharing apache with a virtual ip, i won't have any split-brain scenario >> so i don't need fencing, or not? >> >> So, is there the possibility to have a simple "dummy" fencing? >> >> here is my config.xml: >> >> >> >> > post_join_delay="0"/> >> >> >> >> >> >> > name="mynode01"/> >> >> >> >> >> >> >> > name="mynode02"/> >> >> >> >> >> >> >> >> >> >> >> > ordered="0" restricted="0"> >> > priority="1"/> >> > priority="2"/> >> >> >> >> > recovery="relocate"> >> > sleeptime="2"/> >> > server_root="/etc/httpd" shutdown_wait="0"/> >> >> >> >> >> Thanks to all in advance. >> > > The fence_virtd/fence_xvm agent works by using multicast to talk to the VM > host. So the "off" confirmation comes from the hypervisor, not the target. > > Depending on your setup, you might find better luck with fence_virsh (I > have to use this as there is a known multicast issue with Fedora hosts). > Can you try, as a test if nothing else, if 'fence_virsh' will work for you? > > fence_virsh -a -l root -p -n target vm> -o status > > If this works, it should be trivial to add to cluster.conf. If that works, > then you have a working fence method. However, I would recommend switching > back to fence_xvm if you can. The fence_virsh agent is dependent on > libvirtd running, which some consider a risk. > > hth > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Sat Feb 1 21:04:51 2014 From: lists at alteeve.ca (Digimer) Date: Sat, 01 Feb 2014 16:04:51 -0500 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: References: <52ED4057.4060309@alteeve.ca> Message-ID: <52ED6173.5060404@alteeve.ca> Ooooh, I'm not sure what option you have then. I suppose fence_virtd/fence_xvm is your best option, but you're going to need to have the admin configure the fence_virtd side. On 01/02/14 03:50 PM, nik600 wrote: > My problem is that i don't have root access at host level. > > Il 01/feb/2014 19:49 "Digimer" > ha scritto: > > On 01/02/14 01:35 PM, nik600 wrote: > > Dear all > > i need some clarification about clustering with rhel 6.4 > > i have a cluster with 2 node in active/passive configuration, i > simply > want to have a virtual ip and migrate it between 2 nodes. > > i've noticed that if i reboot or manually shut down a node the > failover > works correctly, but if i power-off one node the cluster doesn't > failover on the other node. > > Another stange situation is that if power off all the nodes and then > switch on only one the cluster doesn't start on the active node. 
> > I've read manual and documentation at > > https://access.redhat.com/__site/documentation/en-US/Red___Hat_Enterprise_Linux/6/html/__Cluster_Administration/index.__html > > > and i've understand that the problem is related to fencing, but the > problem is that my 2 nodes are on 2 virtual machine , i can't > control > hardware and can't issue any custom command on the host-side. > > I've tried to use fence_xvm but i'm not sure about it because if > my VM > has powered-off, how can it reply to fence_vxm messags? > > Here my logs when i power off the VM: > > ==> /var/log/cluster/fenced.log <== > Feb 01 18:50:22 fenced fencing node mynode02 > Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm > result: > error from agent > Feb 01 18:50:53 fenced fence mynode02 failed > > I've tried to force the manual fence with: > > fence_ack_manual mynode02 > > and in this case the failover works properly. > > The point is: as i'm not using any shared filesystem but i'm only > sharing apache with a virtual ip, i won't have any split-brain > scenario > so i don't need fencing, or not? > > So, is there the possibility to have a simple "dummy" fencing? > > here is my config.xml: > > > > post_join_delay="0"/> > > > > > > name="mynode01"/> > > > > > > > name="mynode02"/> > > > > > > > > > > > nofailback="0" > ordered="0" restricted="0"> > name="mynode01" > priority="1"/> > name="mynode02" > priority="2"/> > > > > name="MYSERVICE" > recovery="relocate"> > monitor_link="on" > sleeptime="2"/> > server_root="/etc/httpd" shutdown_wait="0"/> > > > > > Thanks to all in advance. > > > The fence_virtd/fence_xvm agent works by using multicast to talk to > the VM host. So the "off" confirmation comes from the hypervisor, > not the target. > > Depending on your setup, you might find better luck with fence_virsh > (I have to use this as there is a known multicast issue with Fedora > hosts). Can you try, as a test if nothing else, if 'fence_virsh' > will work for you? > > fence_virsh -a -l root -p -n for target vm> -o status > > If this works, it should be trivial to add to cluster.conf. If that > works, then you have a working fence method. However, I would > recommend switching back to fence_xvm if you can. The fence_virsh > agent is dependent on libvirtd running, which some consider a risk. > > hth > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person > without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/__mailman/listinfo/linux-cluster > > > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From nik600 at gmail.com Sat Feb 1 21:11:45 2014 From: nik600 at gmail.com (nik600) Date: Sat, 1 Feb 2014 22:11:45 +0100 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: <52ED6173.5060404@alteeve.ca> References: <52ED4057.4060309@alteeve.ca> <52ED6173.5060404@alteeve.ca> Message-ID: Ok but is not possible to ignore fence? Il 01/feb/2014 22:09 "Digimer" ha scritto: > Ooooh, I'm not sure what option you have then. I suppose > fence_virtd/fence_xvm is your best option, but you're going to need to have > the admin configure the fence_virtd side. > > On 01/02/14 03:50 PM, nik600 wrote: > >> My problem is that i don't have root access at host level. 
>> >> Il 01/feb/2014 19:49 "Digimer" > > ha scritto: >> >> On 01/02/14 01:35 PM, nik600 wrote: >> >> Dear all >> >> i need some clarification about clustering with rhel 6.4 >> >> i have a cluster with 2 node in active/passive configuration, i >> simply >> want to have a virtual ip and migrate it between 2 nodes. >> >> i've noticed that if i reboot or manually shut down a node the >> failover >> works correctly, but if i power-off one node the cluster doesn't >> failover on the other node. >> >> Another stange situation is that if power off all the nodes and >> then >> switch on only one the cluster doesn't start on the active node. >> >> I've read manual and documentation at >> >> https://access.redhat.com/__site/documentation/en-US/Red__ >> _Hat_Enterprise_Linux/6/html/__Cluster_Administration/index.__html >> > Hat_Enterprise_Linux/6/html/Cluster_Administration/index.html> >> >> and i've understand that the problem is related to fencing, but >> the >> problem is that my 2 nodes are on 2 virtual machine , i can't >> control >> hardware and can't issue any custom command on the host-side. >> >> I've tried to use fence_xvm but i'm not sure about it because if >> my VM >> has powered-off, how can it reply to fence_vxm messags? >> >> Here my logs when i power off the VM: >> >> ==> /var/log/cluster/fenced.log <== >> Feb 01 18:50:22 fenced fencing node mynode02 >> Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent fence_xvm >> result: >> error from agent >> Feb 01 18:50:53 fenced fence mynode02 failed >> >> I've tried to force the manual fence with: >> >> fence_ack_manual mynode02 >> >> and in this case the failover works properly. >> >> The point is: as i'm not using any shared filesystem but i'm only >> sharing apache with a virtual ip, i won't have any split-brain >> scenario >> so i don't need fencing, or not? >> >> So, is there the possibility to have a simple "dummy" fencing? >> >> here is my config.xml: >> >> >> >> > post_join_delay="0"/> >> >> >> > votes="1"> >> >> >> > domain="mynode01" >> name="mynode01"/> >> >> >> >> > votes="1"> >> >> >> > domain="mynode02" >> name="mynode02"/> >> >> >> >> >> >> >> >> >> >> >> > nofailback="0" >> ordered="0" restricted="0"> >> > name="mynode01" >> priority="1"/> >> > name="mynode02" >> priority="2"/> >> >> >> >> > name="MYSERVICE" >> recovery="relocate"> >> > monitor_link="on" >> sleeptime="2"/> >> > server_root="/etc/httpd" shutdown_wait="0"/> >> >> >> >> >> Thanks to all in advance. >> >> >> The fence_virtd/fence_xvm agent works by using multicast to talk to >> the VM host. So the "off" confirmation comes from the hypervisor, >> not the target. >> >> Depending on your setup, you might find better luck with fence_virsh >> (I have to use this as there is a known multicast issue with Fedora >> hosts). Can you try, as a test if nothing else, if 'fence_virsh' >> will work for you? >> >> fence_virsh -a -l root -p -n > for target vm> -o status >> >> If this works, it should be trivial to add to cluster.conf. If that >> works, then you have a working fence method. However, I would >> recommend switching back to fence_xvm if you can. The fence_virsh >> agent is dependent on libvirtd running, which some consider a risk. >> >> hth >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person >> without access to education? 
>> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/__mailman/listinfo/linux-cluster >> >> >> >> >> > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Sat Feb 1 21:22:08 2014 From: lists at alteeve.ca (Digimer) Date: Sat, 01 Feb 2014 16:22:08 -0500 Subject: [Linux-cluster] how to handle fence for a simple apache active/passive cluster with virtual ip on 2 virtual machine In-Reply-To: References: <52ED4057.4060309@alteeve.ca> <52ED6173.5060404@alteeve.ca> Message-ID: <52ED6580.7070700@alteeve.ca> No. When a node is lost, fenced is called. Fenced informs DLM that a fence is pending and DLM stops issuing locks. Only after fenced confirms successful fence is DLM told. The DLM will reap locks held by the now-fenced node and recovery can begin. Anything using DLM; rgmanager, clvmd, gfs2, will block. This is by design. If you ever allowed a cluster to make an assumption about the state of a lost node, you risk a split-brain. If a split-brain was tolerable, you wouldn't need an HA cluster. :) digimer On 01/02/14 04:11 PM, nik600 wrote: > Ok but is not possible to ignore fence? > > Il 01/feb/2014 22:09 "Digimer" > ha scritto: > > Ooooh, I'm not sure what option you have then. I suppose > fence_virtd/fence_xvm is your best option, but you're going to need > to have the admin configure the fence_virtd side. > > On 01/02/14 03:50 PM, nik600 wrote: > > My problem is that i don't have root access at host level. > > Il 01/feb/2014 19:49 "Digimer" > >> ha scritto: > > On 01/02/14 01:35 PM, nik600 wrote: > > Dear all > > i need some clarification about clustering with rhel 6.4 > > i have a cluster with 2 node in active/passive > configuration, i > simply > want to have a virtual ip and migrate it between 2 nodes. > > i've noticed that if i reboot or manually shut down a > node the > failover > works correctly, but if i power-off one node the > cluster doesn't > failover on the other node. > > Another stange situation is that if power off all the > nodes and then > switch on only one the cluster doesn't start on the > active node. > > I've read manual and documentation at > > https://access.redhat.com/____site/documentation/en-US/Red_____Hat_Enterprise_Linux/6/html/____Cluster_Administration/index.____html > > > > > > and i've understand that the problem is related to > fencing, but the > problem is that my 2 nodes are on 2 virtual machine , i > can't > control > hardware and can't issue any custom command on the > host-side. > > I've tried to use fence_xvm but i'm not sure about it > because if > my VM > has powered-off, how can it reply to fence_vxm messags? > > Here my logs when i power off the VM: > > ==> /var/log/cluster/fenced.log <== > Feb 01 18:50:22 fenced fencing node mynode02 > Feb 01 18:50:53 fenced fence mynode02 dev 0.0 agent > fence_xvm > result: > error from agent > Feb 01 18:50:53 fenced fence mynode02 failed > > I've tried to force the manual fence with: > > fence_ack_manual mynode02 > > and in this case the failover works properly. > > The point is: as i'm not using any shared filesystem > but i'm only > sharing apache with a virtual ip, i won't have any > split-brain > scenario > so i don't need fencing, or not? 
> > So, is there the possibility to have a simple "dummy" > fencing? > > here is my config.xml: > > > > post_join_delay="0"/> > > > nodeid="1" votes="1"> > > > domain="mynode01" > name="mynode01"/> > > > > nodeid="2" votes="1"> > > > domain="mynode02" > name="mynode02"/> > > > > > > name="mynode01"/> > name="mynode02"/> > > > > nofailback="0" > ordered="0" restricted="0"> > name="mynode01" > priority="1"/> > name="mynode02" > priority="2"/> > > > > name="MYSERVICE" > recovery="relocate"> > monitor_link="on" > sleeptime="2"/> > server_root="/etc/httpd" shutdown_wait="0"/> > > > > > Thanks to all in advance. > > > The fence_virtd/fence_xvm agent works by using multicast to > talk to > the VM host. So the "off" confirmation comes from the > hypervisor, > not the target. > > Depending on your setup, you might find better luck with > fence_virsh > (I have to use this as there is a known multicast issue > with Fedora > hosts). Can you try, as a test if nothing else, if > 'fence_virsh' > will work for you? > > fence_virsh -a -l root -p -n > for target vm> -o status > > If this works, it should be trivial to add to cluster.conf. > If that > works, then you have a working fence method. However, I would > recommend switching back to fence_xvm if you can. The > fence_virsh > agent is dependent on libvirtd running, which some consider > a risk. > > hth > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person > without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > > > https://www.redhat.com/____mailman/listinfo/linux-cluster > > __> > > > > > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person > without access to education? > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/__mailman/listinfo/linux-cluster > > > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From ben at zentrix.be Fri Feb 7 16:13:15 2014 From: ben at zentrix.be (Benjamin Budts) Date: Fri, 7 Feb 2014 17:13:15 +0100 Subject: [Linux-cluster] manual intervention 1 node when fencing fails due to complete power outage Message-ID: <012b01cf241f$81c79300$8556b900$@zentrix.be> Gents, I have a 2 node setup (with quorum disk), redhat 6.5 & a luci mgmt console. Everything has been configured and we?re doing failover tests now. Couple of questions I have : ? When I simulate a complete power failure of a servers pdu?s (no more access to idrac fencing or APC PDU fencing) I can see that the fencing of that node who was running the application fails ? I noticed unless fencing returns an OK I?m stuck and my application won?t start on my 2nd node. Which is ok I guess, because no fencing could mean there is still I/O on my san. Clustat also shows on the active node that the 1st node is still running the application. How can I intervene manually, so as to force a start of the application on the node that is still alive ? Is there a way to tell the cluster, don?t take into account node 1 anymore and don?t try to fence anymore, just start the application on the node that is still ok ? I can?t possibly wait until power returns to that server. Downtime could be too long. ? If I tell a node to leave the cluster in Luci, I would like it to remain a non-cluster member after the reboot of that node. 
It rejoins the cluster automatically after a reboot. Any way to prevent this ? Thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Fri Feb 7 17:55:32 2014 From: lists at alteeve.ca (Digimer) Date: Fri, 07 Feb 2014 12:55:32 -0500 Subject: [Linux-cluster] manual intervention 1 node when fencing fails due to complete power outage In-Reply-To: <012b01cf241f$81c79300$8556b900$@zentrix.be> References: <012b01cf241f$81c79300$8556b900$@zentrix.be> Message-ID: <52F51E14.6030102@alteeve.ca> On 07/02/14 11:13 AM, Benjamin Budts wrote: > Gents, We're not all gents. ;) > I have a 2 node setup (with quorum disk), redhat 6.5 & a luci mgmt console. > > Everything has been configured and we?re doing failover tests now. > > Couple of questions I have : > > ?When I simulate a complete power failure of a servers pdu?s (no more > access to idrac fencing or APC PDU fencing) I can see that the fencing > of that node who was running the application fails ?I noticed unless > fencing returns an OK I?m stuck and my application won?t start on my > 2^nd node. Which is ok I guess, because no fencing could mean there is > still I/O on my san. This is expected. If a lost node can't be put into a known state, there is no safe way to proceed. To do so would be to risk a split brain at least, and data loss/corruption at worst. The way I deal with this is to have nodes with redundant power supplies and use two PDUs and two UPSes. This way, the failure of on cirtcuit / UPS / PDU doesn't knock out the power to the mainboard of the nodes, so you don't lose IPMI. > Clustat also shows on the active node that the 1^st node is still > running the application. That's likely because rgmanager uses DLM, and DLM blocks until the fence succeeds, so it can't update it's view. > How can I intervene manually, so as to force a start of the application > on the node that is still alive ? If you are *100% ABSOLUTELY SURE* that the lost node has been powered off, then you can run 'fence_ack_manual'. Please be super careful about this though. If you do this, in the heat of the moment with clients or bosses yelling at you, and the peer isn't really off (ie: it's only hung), you risk serious problems. I can not emphasis strongly enough the caution needed when using this command. > Is there a way to tell the cluster, don?t take into account node 1 > anymore and don?t try to fence anymore, just start the application on > the node that is still ok ? No. That would risk a split brain and data corruption. The only safe option for the cluster, if the face of a failed fence, is to hang. As bad as it is to hang, it's better than risking corruption. > I can?t possibly wait until power returns to that server. Downtime could > be too long. See the solution I mentioned earlier. > ?If I tell a node to leave the cluster in Luci, I would like it to > remain a non-cluster member after the reboot of that node. It rejoins > the cluster automatically after a reboot. Any way to prevent this ? > > Thx Don't let cman and rgmanager start on boot. This is always my policy. If a node failed and got fenced, I want it to reboot, so that I can log into it and figure out what happened, but I do _not_ want it back in the cluster until I've determined it is healthy. hth -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
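For reference, the commands involved above look roughly like this on RHEL 6 (the node name is only an example carried over from the earlier fence_xvm thread; treat this as a sketch, not a tested recipe):

  # Only after confirming the lost node really is powered off:
  fence_ack_manual mynode02

  # Stop cman and rgmanager from rejoining the cluster automatically at boot:
  chkconfig cman off
  chkconfig rgmanager off

  # Once the node has been checked and is known healthy, rejoin it by hand:
  service cman start
  service rgmanager start
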
From Mark.Vallevand at UNISYS.com Fri Feb 7 19:48:12 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Fri, 7 Feb 2014 13:48:12 -0600 Subject: [Linux-cluster] Collocating cloned resources Message-ID: <99C8B2929B39C24493377AC7A121E21FC5D8BC6C9F@USEA-EXCH8.na.uis.unisys.com> I'm pretty sure I can collocate cloned resources. If so, will the clone instance number in the resource agents (OCF_RESKEY_CRM_meta_clone) be the same for the instances running on the same node? Regards. Mark K Vallevand Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Vallevand at UNISYS.com Fri Feb 7 20:32:27 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Fri, 7 Feb 2014 14:32:27 -0600 Subject: [Linux-cluster] What happens if a node running a cloned resource crashes Message-ID: <99C8B2929B39C24493377AC7A121E21FC5D8BC6D62@USEA-EXCH8.na.uis.unisys.com> If I have 5 nodes in my cluster and I have a cloned resource running on 4 of them (clone_max=4), and one of the 4 crashes, will an instance of the cloned resource be started on the 5th node? Regards. Mark K Vallevand Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From arnold at arnoldarts.de Fri Feb 7 21:18:06 2014 From: arnold at arnoldarts.de (Arnold Krille) Date: Fri, 7 Feb 2014 22:18:06 +0100 Subject: [Linux-cluster] What happens if a node running a cloned resource crashes In-Reply-To: <99C8B2929B39C24493377AC7A121E21FC5D8BC6D62@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FC5D8BC6D62@USEA-EXCH8.na.uis.unisys.com> Message-ID: <20140207221806.42a98416@xingu.arnoldarts.de> On Fri, 7 Feb 2014 14:32:27 -0600 "Vallevand, Mark K" wrote: > If I have 5 nodes in my cluster and I have a cloned resource running > on 4 of them (clone_max=4), and one of the 4 crashes, will an > instance of the cloned resource be started on the 5th node? Thats the idea! You tell the cluster to run up to 4 cloned resources on the cluster as long as possible (high availability!). And as long as possible, it will run these four resources. It might even run several of these resources on one node. You have to set clone-node-max appropriately to prevent all four resources running on one node. Have fun, Arnold -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 230 bytes Desc: not available URL: From Mark.Vallevand at UNISYS.com Fri Feb 7 21:42:10 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Fri, 7 Feb 2014 15:42:10 -0600 Subject: [Linux-cluster] Collocating cloned resources In-Reply-To: <99C8B2929B39C24493377AC7A121E21FC5D8BC6C9F@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FC5D8BC6C9F@USEA-EXCH8.na.uis.unisys.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FC5D8BC6E8F@USEA-EXCH8.na.uis.unisys.com> Assuming clone-node-max=1 so that one instance of each resource runs on each node, that is. Regards. Mark K Vallevand Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vallevand, Mark K Sent: Friday, February 07, 2014 01:48 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] Collocating cloned resources I'm pretty sure I can collocate cloned resources. If so, will the clone instance number in the resource agents (OCF_RESKEY_CRM_meta_clone) be the same for the instances running on the same node? Regards. Mark K Vallevand Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Vallevand at UNISYS.com Fri Feb 7 21:42:15 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Fri, 7 Feb 2014 15:42:15 -0600 Subject: [Linux-cluster] What happens if a node running a cloned resource crashes In-Reply-To: <20140207221806.42a98416@xingu.arnoldarts.de> References: <99C8B2929B39C24493377AC7A121E21FC5D8BC6D62@USEA-EXCH8.na.uis.unisys.com> <20140207221806.42a98416@xingu.arnoldarts.de> Message-ID: <99C8B2929B39C24493377AC7A121E21FC5D8BC6E90@USEA-EXCH8.na.uis.unisys.com> Good. Thanks. Regards. Mark K Vallevand?? Mark.Vallevand at Unisys.com May you live in interesting times, may you come to the attention of important people and may all your wishes come true. THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Arnold Krille Sent: Friday, February 07, 2014 03:18 PM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] What happens if a node running a cloned resource crashes On Fri, 7 Feb 2014 14:32:27 -0600 "Vallevand, Mark K" wrote: > If I have 5 nodes in my cluster and I have a cloned resource running > on 4 of them (clone_max=4), and one of the 4 crashes, will an > instance of the cloned resource be started on the 5th node? Thats the idea! 
You tell the cluster to run up to 4 cloned resources on the cluster as long as possible (high availability!). And as long as possible, it will run these four resources. It might even run several of these resources on one node. You have to set clone-node-max appropriately to prevent all four resources running on one node. Have fun, Arnold From ben at zentrix.be Mon Feb 10 14:12:28 2014 From: ben at zentrix.be (Benjamin Budts) Date: Mon, 10 Feb 2014 15:12:28 +0100 Subject: [Linux-cluster] backup best practice when using Luci Message-ID: <018e01cf266a$21624170$6426c450$@zentrix.be> Ladies & Gents (I won't make that same mistake again ;) ), First, thank you to the lady who helped me explain how to force an OK on fencing that is failing. A 2 node config & Luci : I would liketo put a backup solution in place for the cluster config / nodeconfig/ fencing etc... What would you recommend ? Or does Luci archive versions of config-files somewhere ? Basically, if shit hits the fan I would like to untar a golden image of a config on luci and push it back to the nodes. Thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Mon Feb 10 14:43:22 2014 From: lists at alteeve.ca (Digimer) Date: Mon, 10 Feb 2014 09:43:22 -0500 Subject: [Linux-cluster] backup best practice when using Luci In-Reply-To: <018e01cf266a$21624170$6426c450$@zentrix.be> References: <018e01cf266a$21624170$6426c450$@zentrix.be> Message-ID: <52F8E58A.8080808@alteeve.ca> On 10/02/14 09:12 AM, Benjamin Budts wrote: > Ladies & Gents (I won?t make that same mistake again ;) ), > > First, thank you to the lady who helped me explain how to force an OK on > fencing that is failing. > > A 2 node config & Luci : > > I would liketo put a backup solution in place for the cluster config / > nodeconfig/ fencing etc... > > What would you recommend ? Or does Luci archive versions of > config-files somewhere ? > > Basically, if shit hits the fan I would like to untar a golden image of > a config on luci and push it back to the nodes? > > Thx The main config file is /etc/cluster/cluster.conf. A copy of this file should be on all nodes at once, so even if you didn't have a backup proper, you should be able to copy it from another node. Beyond that, I personally backup (sample taken from a node called 'an-c05n01'; ==== mkdir ~/base cd ~/base mkdir root mkdir -p etc/sysconfig/network-scripts/ mkdir -p etc/udev/rules.d/ # Root user rsync -av /root/.bashrc root/ rsync -av /root/.ssh root/ # Directories rsync -av /etc/ssh etc/ rsync -av /etc/apcupsd etc/ rsync -av /etc/cluster etc/ rsync -av /etc/drbd.* etc/ rsync -av /etc/lvm etc/ # Specific files. rsync -av /etc/sysconfig/network-scripts/ifcfg-{eth*,bond*,vbr*} etc/sysconfig/network-scripts/ rsync -av /etc/udev/rules.d/70-persistent-net.rules etc/udev/rules.d/ rsync -av /etc/sysconfig/network etc/sysconfig/ rsync -av /etc/hosts etc/ rsync -av /etc/ntp.conf etc/ # Save recreating user accounts. rsync -av /etc/passwd etc/ rsync -av /etc/group etc/ rsync -av /etc/shadow etc/ rsync -av /etc/gshadow etc/ # If you have the cluster built and want to backup it's configs. mkdir etc/cluster mkdir etc/lvm rsync -av /etc/cluster/cluster.conf etc/cluster/ rsync -av /etc/lvm/lvm.conf etc/lvm/ # NOTE: DRBD won't work until you've manually created the partitions. 
rsync -av /etc/drbd.d etc/ # If you had to manually set the UUID in libvirtd; mkdir etc/libvirt rsync -av /etc/libvirt/libvirt.conf etc/libvirt/ # If you're running RHEL and want to backup your registration info; rsync -av /etc/sysconfig/rhn etc/sysconfig/ # Pack it up # NOTE: Change the name to suit your node. tar -cvf base_an-c05n01.tar etc root ==== I then push the resulting tar file to my PXE server. I have a kickstart script that does a minimal rhel6 install, plus the cluster stuff, and then has a %post script that downloads this tar and extracts it. This way, when the node needs to be rebuilt, it's 95% ready to go. I still need to do things like 'drbdadm create-md ', but it's still very quick to restore a node. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From ckonstanski at pippiandcarlos.com Mon Feb 10 23:12:37 2014 From: ckonstanski at pippiandcarlos.com (Carlos Konstanski) Date: Mon, 10 Feb 2014 16:12:37 -0700 Subject: [Linux-cluster] changing an ocf:heartbeat:Filesystem config Message-ID: <52F95CE5.9080304@pippiandcarlos.com> I have a need to change the device node for an ocf:heartbeat:Filesystem resource. It currently points to /dev/sda. I want to change it to /dev/disk/by-uuid/e4284038-1b26-418e-b205-93395373379b. The reason for this change is to make sure that a renaming of disks by udev does not break my cluster. The above example is a lab environment. the production environment is messier. What is the easiest and/or least impacting way to make this change? If these two requirements are mutually exclusive, then please lean toward least impacting. Thanks! From ckonstanski at pippiandcarlos.com Mon Feb 10 23:26:01 2014 From: ckonstanski at pippiandcarlos.com (Carlos Konstanski) Date: Mon, 10 Feb 2014 16:26:01 -0700 Subject: [Linux-cluster] changing an ocf:heartbeat:Filesystem config In-Reply-To: <52F95CE5.9080304@pippiandcarlos.com> References: <52F95CE5.9080304@pippiandcarlos.com> Message-ID: <52F96009.3070703@pippiandcarlos.com> Is this what I want to do? crm configure save > filename Edit the file crm configure load replace filename On 02/10/2014 04:12 PM, Carlos Konstanski wrote: > I have a need to change the device node for an ocf:heartbeat:Filesystem > resource. It currently points to /dev/sda. I want to change it to > /dev/disk/by-uuid/e4284038-1b26-418e-b205-93395373379b. > > The reason for this change is to make sure that a renaming of disks by > udev does not break my cluster. The above example is a lab environment. > the production environment is messier. > > What is the easiest and/or least impacting way to make this change? If > these two requirements are mutually exclusive, then please lean toward > least impacting. > > Thanks! > From andrew at beekhof.net Mon Feb 10 23:35:32 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Tue, 11 Feb 2014 10:35:32 +1100 Subject: [Linux-cluster] changing an ocf:heartbeat:Filesystem config In-Reply-To: <52F96009.3070703@pippiandcarlos.com> References: <52F95CE5.9080304@pippiandcarlos.com> <52F96009.3070703@pippiandcarlos.com> Message-ID: that is a quite reasonable approach On 11 Feb 2014, at 10:26 am, Carlos Konstanski wrote: > Is this what I want to do? > > crm configure save > filename > Edit the file > crm configure load replace filename > > On 02/10/2014 04:12 PM, Carlos Konstanski wrote: >> I have a need to change the device node for an ocf:heartbeat:Filesystem >> resource. It currently points to /dev/sda. 
I want to change it to >> /dev/disk/by-uuid/e4284038-1b26-418e-b205-93395373379b. >> >> The reason for this change is to make sure that a renaming of disks by >> udev does not break my cluster. The above example is a lab environment. >> the production environment is messier. >> >> What is the easiest and/or least impacting way to make this change? If >> these two requirements are mutually exclusive, then please lean toward >> least impacting. >> >> Thanks! >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From ckonstanski at pippiandcarlos.com Mon Feb 10 23:47:27 2014 From: ckonstanski at pippiandcarlos.com (Carlos Konstanski) Date: Mon, 10 Feb 2014 16:47:27 -0700 Subject: [Linux-cluster] changing an ocf:heartbeat:Filesystem config In-Reply-To: References: <52F95CE5.9080304@pippiandcarlos.com> <52F96009.3070703@pippiandcarlos.com> Message-ID: <52F9650F.1060406@pippiandcarlos.com> Thanks! I'll do it. It worked well in my lab environment. On 02/10/2014 04:35 PM, Andrew Beekhof wrote: > that is a quite reasonable approach > > On 11 Feb 2014, at 10:26 am, Carlos Konstanski wrote: > >> Is this what I want to do? >> >> crm configure save > filename >> Edit the file >> crm configure load replace filename >> >> On 02/10/2014 04:12 PM, Carlos Konstanski wrote: >>> I have a need to change the device node for an ocf:heartbeat:Filesystem >>> resource. It currently points to /dev/sda. I want to change it to >>> /dev/disk/by-uuid/e4284038-1b26-418e-b205-93395373379b. >>> >>> The reason for this change is to make sure that a renaming of disks by >>> udev does not break my cluster. The above example is a lab environment. >>> the production environment is messier. >>> >>> What is the easiest and/or least impacting way to make this change? If >>> these two requirements are mutually exclusive, then please lean toward >>> least impacting. >>> >>> Thanks! >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > From Ralph.Grothe at itdz-berlin.de Tue Feb 11 10:17:17 2014 From: Ralph.Grothe at itdz-berlin.de (Ralph.Grothe at itdz-berlin.de) Date: Tue, 11 Feb 2014 11:17:17 +0100 Subject: [Linux-cluster] =?iso-8859-1?q?Question_regarding_typed_resources?= =?iso-8859-1?q?=B4_parent_child_vs=2E_sibling_ordering?= Message-ID: Hello, My actions and questions relate to a RHEL 5 RHCS cluster. Though I studied carefully the official RHEL 5 Cluster Admin Guide with special emphasis on the chapter "HA Resource Behavior" there remain certain things unclear to me. First of all, I have to mention that my cluster.conf?s parent-child-sibling hierarchies whithin the service scopes could successfully be checked in as valid cluster configuration (i.e. "ccs_tool update /etc/cluster/cluster.conf" succeeded). My first question is whether it is feasible to use the tag, which originally is meant to map inheritance, and populate such a block although I don't make any use of inheritance in my configuration? I simply find that its use makes the appearance of the block much more readable an tidier. Now to my main concern. Would such a block be valid and start and stop resources in the proper order (i.e. according to my intention)? e.g. ... ... ... 
I am asking because I read in the mentioned doc above that for a typed resource (such as ip, lvm, fs,...) there exists a strict start and stop sequence for siblings. In my parent-child hierarchy above I am reversing this start order by making the ip resource a parent of the lvm resource which in sibling context would have a higher starting precedence than the ip resource. Of course I had a second thought in mind when rigging up this seemingly oblique hierarchy of typed resources. Because there are scheduled maintenance downtimes I wanted to ease the activation of a whole bunch of a service's resources like LVM LVs, mountpoints and IP addresses with a single rg_test invocation when a service has previously been disabled. I then could issue according to the above config snippet just a e.g. rg_test test /etc/cluster/cluster.conf start ip 10.20.30.40 and have all resources activated apart from the Oracle DB instance. There is yet another issue that puzzles me. If I look at the starting sequence by issuing rg_test noop /etc/cluster/cluster.conf start service srv-a then the resource script:sid-a_statechg_notify gets executed before the resources oracledb:SID-A and script:oracle_em. This would imply to me that any resource of type script has a higher starting precedence than any resource of type oracledb because in my config above they are siblings. I actually would have thought it to be the other way round, i.e. that script resources have the lowest starting precedence of all. Unfortunately, in Table D.1. "Child Resource Type Start and Stop Order" on page 112 of the cluster administration guide the typed resource oracledb does not appear. Many thanks for your patience having read this far. Regards, Ralph From rmitchel at redhat.com Tue Feb 11 23:41:30 2014 From: rmitchel at redhat.com (Ryan Mitchell) Date: Wed, 12 Feb 2014 10:41:30 +1100 Subject: [Linux-cluster] =?iso-8859-1?q?Question_regarding_typed_resources?= =?iso-8859-1?q?=B4_parent_child_vs=2E_sibling_ordering?= In-Reply-To: References: Message-ID: <52FAB52A.4010809@redhat.com> Hi, On 02/11/2014 09:17 PM, Ralph.Grothe at itdz-berlin.de wrote: > Hello, > > > My actions and questions relate to a RHEL 5 RHCS cluster. > > Though I studied carefully the official RHEL 5 Cluster Admin > Guide with special emphasis on the chapter "HA Resource Behavior" > there remain certain things unclear to me. > > First of all, I have to mention that my cluster.conf?s > parent-child-sibling hierarchies whithin the service scopes could > successfully be checked in as valid cluster configuration (i.e. > "ccs_tool update /etc/cluster/cluster.conf" succeeded). > > > My first question is whether it is feasible to use the > tag, which originally is meant to map inheritance, > and populate such a block although I don't make any use of > inheritance in my configuration? > I simply find that its use makes the appearance of the block > much more readable an tidier. I don't entirely follow, but I'll take a guess that you are asking if it is compulsory to define resources in the section, and then referencing them in the section>? If that is what you mean, then I can confirm that the recommended method is to define resources in the tags and to reference those definitions in the tags. But it is also possible to leave the section blank, and declare the resources when they are specified in the section. Both are possible. Maybe I misunderstood you, or I misunderstood what you were referring to when you mentioned inheritance. Please clarify if I did not answer your question. 
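As a minimal illustration of the two styles described above (every name and address below is invented, not taken from the poster's configuration):

  <rm>
      <resources>
          <!-- defined once under <resources>... -->
          <ip address="192.168.122.10" monitor_link="on"/>
          <script file="/etc/init.d/myapp" name="myapp-script"/>
      </resources>
      <service autostart="1" name="svc-defined-by-ref" recovery="relocate">
          <!-- ...and referenced here by primary attribute -->
          <ip ref="192.168.122.10"/>
          <script ref="myapp-script"/>
      </service>
      <service autostart="1" name="svc-declared-inline" recovery="relocate">
          <!-- or declared inline, with no <resources> entry at all -->
          <ip address="192.168.122.11" monitor_link="on"/>
      </service>
  </rm>

Both forms are valid cluster.conf; the referenced style simply leaves a single definition to maintain when the same resource is used by more than one service.
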
> Now to my main concern. > > Would such a block be valid and start and stop > resources in the proper order (i.e. according to my intention)? > > e.g. > > > > > ... > > > ... > > > > > > > > > > > > > > > > __independent_subtree="1" __max_restarts="2" > __restart_expire_time="0"/> > > > ... > > > You have not stated your intentional starting order (to this point in the email), but my understanding is that this services will start in the following order: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
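
To make the ordering question concrete, here is a purely hypothetical sketch of the kind of hierarchy being discussed. rgmanager starts a parent before its children and stops the children before the parent, so explicit nesting like this puts the ip ahead of the lvm and fs resources, regardless of the default sibling ordering for typed resources (all names and devices are invented; check the order actually computed for your own cluster.conf with rg_test):

  <service autostart="1" domain="example-domain" name="svc-example" recovery="relocate">
      <ip address="192.168.122.10" monitor_link="on">
          <lvm lv_name="lv_data" name="lvm-example" vg_name="vg_example">
              <fs device="/dev/vg_example/lv_data" fstype="ext4" mountpoint="/data" name="fs-example"/>
          </lvm>
      </ip>
  </service>

  # print the computed start and stop sequences without touching any resources:
  rg_test noop /etc/cluster/cluster.conf start service svc-example
  rg_test noop /etc/cluster/cluster.conf stop service svc-example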