Hi Martijn,

I would like to know how to set up a 3-node cluster environment. Can you suggest a good book, or point me to some technical articles that I can follow and try out on my test server?

<div class="gmail_quote">On Tue, Aug 17, 2010 at 12:03 AM, Martijn <<a href="mailto:martijn@tenheuvel.net">martijn@tenheuvel.net</a>> wrote: <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
-original message-
Subject: [Linux-cluster] what happens during orderly shutdown ?
From: "Martin Waite" <<a href="mailto:Martin.Waite@datacash.com">Martin.Waite@datacash.com</a>>
Date: 16-08-2010 19:05

Hi,

We have hit a snag during testing of a 3-node cluster, where only nodes 1 and 2 are allowed to run the service. If we shut down the node running the service, the service first starts on the surviving node, but is then stopped.

This is not something we would normally expect to do, but some of the environment monitoring in our servers can trigger this situation, for example when the temperature gets too high.

What I think happens is the following:

1. host 1: shutdown calls on rgmanager to stop; this takes several seconds while the service is stopped.
2. host 2: rgmanager starts to relocate the service to the surviving node.
3. host 1: shutdown calls on cman to stop.
4. host 2: while the service is being relocated, cman enters the gather state because host 1 has dropped out. The service is still starting and triggers "Failed changing service status" in the log below.
5. host 1: halts.
6. host 2: the service completes startup. The cluster reconfiguration completes.
7. host 2: the service is stopped.

Is this a bug in the cluster system, or just an unfortunate sequence of events that the cluster cannot cope with, i.e. node failure while a service is relocating?
regards,
Martin

Here is the log from node 2:

Aug 16 16:58:40 gmhcluster2 clurgmgrd[4132]: <notice> Member 1 shutting down
Aug 16 16:58:40 gmhcluster2 openais[4084]: [TOTEM] Retransmit List: 86
Aug 16 16:58:46 gmhcluster2 clurgmgrd[4132]: <notice> Starting stopped service service:MASTERVIP
Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] entering GATHER state from 2.
Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <err> #47: Failed changing service status
Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning> #71: Relocating failed service service:MASTERVIP
Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning> #70: Failed to relocate service:MASTERVIP; restarting locally
Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <notice> Starting stopped service service:MASTERVIP
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering GATHER state from 0.
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Creating commit token because I am the rep.
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Saving state aru 90 high seq received 90
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Storing new sequence id for ring 1e4
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering COMMIT state.
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering RECOVERY state.
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position [0] member 172.30.31.4:
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous ring seq 480 rep 172.30.31.3
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90 high delivered 90 received flag 1
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position [1] member 172.30.31.5:
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous ring seq 480 rep 172.30.31.3
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90 high delivered 90 received flag 1
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Did not need to originate any messages in recovery.
Aug 16 16:59:15 gmhcluster2 kernel: dlm: closing connection to node 1
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Sending initial ORF token
Aug 16 16:59:15 gmhcluster2 clurgmgrd: [4132]: <info> Executing /usr/sbin/dc-cluster-resourceagent start
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] CLM CONFIGURATION CHANGE
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] New Configuration:
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] r(0) ip(172.30.31.4)
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] r(0) ip(172.30.31.5)
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] Members Left:
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] r(0) ip(172.30.31.3)
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] Members Joined:
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] CLM CONFIGURATION CHANGE
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] New Configuration:
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] r(0) ip(172.30.31.4)
Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - ACTIVESITE in state started
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] r(0) ip(172.30.31.5)
Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - ACTIVESITE service is running - ACTIVESITE in state started
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] Members Left:
Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - Service MASTERVIP: Performing precautionary updates to databases before setting up VIP
Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM ] Members Joined:
Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - Service MASTERVIP: running precautionary update [touch /tmp/dc-set-dr-counters]
Aug 16 16:59:15 gmhcluster2 openais[4084]: [SYNC ] This node is within the primary component and will provide service.
Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - Service MASTERVIP: running precautionary update [touch /tmp/dc-set-dr-auto-increments]
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering OPERATIONAL state.
Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM ] got nodejoin message 172.30.31.4
Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM ] got nodejoin message 172.30.31.5
Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG ] got joinlist message from node 3
Aug 16 16:59:16 gmhcluster2 clurgmgrd: [4132]: <info> Adding IPv4 address 172.30.30.93/24 to bond0
Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG ] got joinlist message from node 2
Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Service service:MASTERVIP started
Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Stopping service service:MASTERVIP
Aug 16 16:59:17 gmhcluster2 clurgmgrd: [4132]: <info> Removing IPv4 address 172.30.30.93/24 from bond0
Aug 16 16:59:27 gmhcluster2 clurgmgrd: [4132]: <info> Executing /usr/sbin/dc-cluster-resourceagent stop
Aug 16 16:59:27 gmhcluster2 clurgmgrd[4132]: <notice> Service service:MASTERVIP is stopped

-- Linux-cluster mailing list <a href="mailto:Linux-cluster@redhat.com">Linux-cluster@redhat.com</a> <a href="https://www.redhat.com/mailman/listinfo/linux-cluster"
target="_blank">https://www.redhat.com/mailman/listinfo/linux-cluster</a> </blockquote></div>
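
The sequence of events suggested in the quoted message can be cross-checked against the node 2 log with a small script. What follows is only a minimal Python sketch and is not part of rgmanager or openais. The event labels, the function names `parse_events` and `started_then_stopped`, and the 5-second window are assumptions made for illustration. The sample lines are copied from the log above, and the year 2010 is taken from the date of the post, since syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Event labels are this sketch's own names (hypothetical); the regexes match
# message text copied from the node 2 log quoted above.
PATTERNS = [
    ("peer_shutdown", re.compile(r"Member (\d+) shutting down")),
    ("service_starting", re.compile(r"Starting stopped service (\S+)")),
    ("totem_gather", re.compile(r"\[TOTEM\] entering (GATHER) state")),
    ("status_error", re.compile(r"(Failed changing service status)")),
    ("totem_operational", re.compile(r"\[TOTEM\] entering (OPERATIONAL) state")),
    ("service_started", re.compile(r"Service (\S+) started")),
    ("service_stopping", re.compile(r"Stopping service (\S+)")),
    ("service_stopped", re.compile(r"Service (\S+) is stopped")),
]

# A few lines copied verbatim from the log in the quoted message.
SAMPLE_LOG = """
Aug 16 16:58:40 gmhcluster2 clurgmgrd[4132]: <notice> Member 1 shutting down
Aug 16 16:58:46 gmhcluster2 clurgmgrd[4132]: <notice> Starting stopped service service:MASTERVIP
Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] entering GATHER state from 2.
Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <err> #47: Failed changing service status
Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering OPERATIONAL state.
Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Service service:MASTERVIP started
Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Stopping service service:MASTERVIP
Aug 16 16:59:27 gmhcluster2 clurgmgrd[4132]: <notice> Service service:MASTERVIP is stopped
"""


def parse_events(lines, year=2010):
    """Return (timestamp, kind, detail) tuples for recognised log lines.

    The year is an assumption (syslog omits it). Month names are parsed
    with %b, which expects an English/C locale.
    """
    events = []
    for line in lines:
        parts = line.split(None, 4)  # month, day, time, host, message
        if len(parts) < 5:
            continue
        month, day, clock, _host, message = parts
        try:
            stamp = datetime.strptime(
                f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a syslog-style line
        for kind, pattern in PATTERNS:
            match = pattern.search(message)
            if match:
                events.append((stamp, kind, match.group(1)))
                break
    return events


def started_then_stopped(events, window_seconds=5):
    """Find services that began stopping shortly after reporting 'started'."""
    last_start = {}
    hits = []
    for stamp, kind, detail in events:
        if kind == "service_started":
            last_start[detail] = stamp
        elif kind == "service_stopping" and detail in last_start:
            gap = (stamp - last_start[detail]).total_seconds()
            if gap <= window_seconds:
                hits.append((detail, gap))
    return hits


if __name__ == "__main__":
    timeline = parse_events(SAMPLE_LOG.strip().splitlines())
    for stamp, kind, detail in timeline:
        print(stamp.time(), kind, detail)
    print(started_then_stopped(timeline))
```

On the sample lines, `started_then_stopped` reports `service:MASTERVIP` with a gap of 0 seconds between "started" and "Stopping", which corresponds to steps 6 and 7 of the list in the quoted message. The timeline also shows the first GATHER transition falling between "Starting stopped service" and "Failed changing service status", as described in step 4. The script only orders and labels events; it does not explain why rgmanager stopped the service.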