[Linux-cluster] Two-node DRBD - Fail-Over Active/Passive Cluster.

vincent.blondel at ing.be
Wed Feb 16 05:55:35 UTC 2011


>>> below the cluster.conf file ...
>>>
>>>
>>> <?xml version="1.0"?>
>>> <cluster name="cluster" config_version="6">
>>>    <!-- post_join_delay: number of seconds the daemon will wait before
>>>                          fencing any victims after a node joins the domain
>>>         post_fail_delay: number of seconds the daemon will wait before
>>>                        fencing any victims after a domain member fails
>>>         clean_start    : prevent any startup fencing the daemon might do.
>>>                        It indicates that the daemon should assume all nodes
>>>                        are in a clean state to start. -->
>>>    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>>>    <clusternodes>
>>>      <clusternode name="reporter1.lab.intranet" votes="1" nodeid="1">
>>>        <fence>
>>>          <!-- Handle fencing manually -->
>>>          <method name="human">
>>>            <device name="human" nodename="reporter1.lab.intranet"/>
>>>          </method>
>>>        </fence>
>>>      </clusternode>
>>>      <clusternode name="reporter2.lab.intranet" votes="1" nodeid="2">
>>>        <fence>
>>>          <!-- Handle fencing manually -->
>>>          <method name="human">
>>>            <device name="human" nodename="reporter2.lab.intranet"/>
>>>          </method>
>>>        </fence>
>>>      </clusternode>
>>>    </clusternodes>
>>>    <!-- cman two nodes specification -->
>>>    <cman expected_votes="1" two_node="1"/>
>>>    <fencedevices>
>>>      <!-- Define manual fencing -->
>>>      <fencedevice name="human" agent="fence_manual"/>
>>>    </fencedevices>
>>>    <rm>
>>>       <failoverdomains>
>>>          <failoverdomain name="example_pri" nofailback="0" ordered="1" restricted="0">
>>>             <failoverdomainnode name="reporter1.lab.intranet" priority="1"/>
>>>             <failoverdomainnode name="reporter2.lab.intranet" priority="2"/>
>>>          </failoverdomain>
>>>       </failoverdomains>
>>>       <resources>
>>>             <ip address="10.30.30.92" monitor_link="on" sleeptime="10"/>
>>>             <apache config_file="conf/httpd.conf" name="example_server" server_root="/etc/httpd" shutdown_wait="0"/>
>>>        </resources>
>>>        <service autostart="1" domain="example_pri" exclusive="0" name="example_apache" recovery="relocate">
>>>                  <ip ref="10.30.30.92"/>
>>>                  <apache ref="example_server"/>
>>>        </service>
>>>    </rm>
>>> </cluster>
>>>
>>> and this is the result I get on both servers ...
>>>
>>> [root@reporter1 ~]# clustat
>>> Cluster Status for cluster @ Mon Feb 14 22:22:53 2011
>>> Member Status: Quorate
>>>
>>>   Member Name                                      ID   Status
>>>   ------ ----                                      ---- ------
>>>   reporter1.lab.intranet                               1 Online, Local, rgmanager
>>>   reporter2.lab.intranet                               2 Online, rgmanager
>>>
>>>   Service Name                            Owner (Last)                            State
>>>   ------- ----                            ----- ------                            -----
>>>   service:example_apache                  (none)                                  stopped
>>>
>>> as you can see, everything is stopped, or in other words nothing runs... so my questions are:
>
>Having a read through /var/log/messages for possible causes would be a
>good start.
>
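
A quick way to narrow that down is to filter the log for the cluster daemons, for example (a sketch, runnable on either node):

[root@reporter1 ~]# grep -E 'corosync|fenced|rgmanager|clurgmgrd' /var/log/messages | tail -n 50

rgmanager logs service start/stop attempts and resource agent errors to syslog, so any failed start of service:example_apache should show up in that output.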

this is what I see in the /var/log/messages file ...

Feb 16 07:36:54 reporter1 corosync[1250]:   [MAIN  ] Corosync Cluster Engine ('1.2.3'): started and ready to provide service.
Feb 16 07:36:54 reporter1 corosync[1250]:   [MAIN  ] Corosync built-in features: nss rdma
Feb 16 07:36:54 reporter1 corosync[1250]:   [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Feb 16 07:36:54 reporter1 corosync[1250]:   [MAIN  ] Successfully parsed cman config
Feb 16 07:36:54 reporter1 corosync[1250]:   [TOTEM ] Initializing transport (UDP/IP).
Feb 16 07:36:54 reporter1 corosync[1250]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Feb 16 07:36:55 reporter1 corosync[1250]:   [TOTEM ] The network interface [10.30.30.90] is now up.
Feb 16 07:36:55 reporter1 corosync[1250]:   [QUORUM] Using quorum provider quorum_cman
Feb 16 07:36:55 reporter1 corosync[1250]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Feb 16 07:36:55 reporter1 corosync[1250]:   [CMAN  ] CMAN 3.0.12 (built Aug 17 2010 14:08:49) started
Feb 16 07:36:55 reporter1 corosync[1250]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Feb 16 07:36:55 reporter1 corosync[1250]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Feb 16 07:36:55 reporter1 corosync[1250]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Feb 16 07:36:55 reporter1 corosync[1250]:   [SERV  ] Service engine loaded: corosync configuration service
Feb 16 07:36:55 reporter1 corosync[1250]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Feb 16 07:36:55 reporter1 corosync[1250]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Feb 16 07:36:55 reporter1 corosync[1250]:   [SERV  ] Service engine loaded: corosync profile loading service
Feb 16 07:36:55 reporter1 corosync[1250]:   [QUORUM] Using quorum provider quorum_cman
Feb 16 07:36:55 reporter1 corosync[1250]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Feb 16 07:36:55 reporter1 corosync[1250]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Feb 16 07:36:55 reporter1 corosync[1250]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 16 07:36:55 reporter1 corosync[1250]:   [CMAN  ] quorum regained, resuming activity
Feb 16 07:36:55 reporter1 corosync[1250]:   [QUORUM] This node is within the primary component and will provide service.
Feb 16 07:36:55 reporter1 corosync[1250]:   [QUORUM] Members[1]: 1
Feb 16 07:36:55 reporter1 corosync[1250]:   [QUORUM] Members[1]: 1
Feb 16 07:36:55 reporter1 corosync[1250]:   [CPG   ] downlist received left_list: 0
Feb 16 07:36:55 reporter1 corosync[1250]:   [CPG   ] chosen downlist from node r(0) ip(10.30.30.90)
Feb 16 07:36:55 reporter1 corosync[1250]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 16 07:36:56 reporter1 fenced[1302]: fenced 3.0.12 started
Feb 16 07:36:57 reporter1 dlm_controld[1319]: dlm_controld 3.0.12 started
Feb 16 07:36:57 reporter1 gfs_controld[1374]: gfs_controld 3.0.12 started
Feb 16 07:37:03 reporter1 corosync[1250]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 16 07:37:03 reporter1 corosync[1250]:   [QUORUM] Members[2]: 1 2
Feb 16 07:37:03 reporter1 corosync[1250]:   [QUORUM] Members[2]: 1 2
Feb 16 07:37:03 reporter1 corosync[1250]:   [CPG   ] downlist received left_list: 0
Feb 16 07:37:03 reporter1 corosync[1250]:   [CPG   ] downlist received left_list: 0
Feb 16 07:37:03 reporter1 corosync[1250]:   [CPG   ] chosen downlist from node r(0) ip(10.30.30.90)
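
For completeness: once the cluster is quorate, a stopped service can be enabled by hand with rgmanager's clusvcadm (a minimal sketch, reusing the service name from the cluster.conf above):

[root@reporter1 ~]# clusvcadm -e service:example_apache
[root@reporter1 ~]# clustat

Adding "-m reporter1.lab.intranet" to the enable pins the start to that member; if the enable fails, the error from the ip or apache resource agent should appear in /var/log/messages on the node that attempted the start.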

>>> do I have to manually configure the load-balanced IP 10.30.30.92 as an alias IP on both sides, or is it done automatically by Red Hat Cluster?
>
>RHCS will automatically assign the IP to an interface that is on the
>same subnet. You most definitely shouldn't create the IP manually on any
>of the nodes.
>
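
A quick way to verify that behaviour once the service is running (a sketch, assuming the 10.30.30.0/24 subnet sits on eth0 - adjust the interface name):

[root@reporter1 ~]# ip addr show dev eth0

10.30.30.92 should appear as a secondary address only on the node that currently owns service:example_apache, and it should not be configured in any ifcfg-* file or added by hand on either node.
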
>>> I just made a simple try with Apache, but I do not find any reference to the start/stop script for Apache in the examples; is that normal?
>>> do you have any best practices regarding this setup?
>
>I'm not familiar with the <apache> tag in cluster.conf; I usually
>configure most things as init script resources.
>
>Gordan
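
As an illustration of that init-script approach, the <apache> resource could be replaced by a <script> resource pointing at the distribution's httpd init script; a minimal sketch (the resource name "httpd_init" and the /etc/init.d/httpd path are assumptions, adjust to your installation):

       <resources>
             <ip address="10.30.30.92" monitor_link="on" sleeptime="10"/>
             <script file="/etc/init.d/httpd" name="httpd_init"/>
       </resources>
       <service autostart="1" domain="example_pri" exclusive="0" name="example_apache" recovery="relocate">
                 <ip ref="10.30.30.92"/>
                 <script ref="httpd_init"/>
       </service>

With this layout rgmanager starts, stops and status-checks Apache through the init script itself, so no separate <apache> configuration is needed. In that case httpd should not also be enabled at boot (chkconfig httpd off), and config_version in cluster.conf needs to be bumped whenever the file changes.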