[Linux-cluster] what happens during orderly shutdown ?

Mon Aug 16 16:53:40 UTC 2010

Hi,

We have hit a snag during testing of a 3-node cluster, where only nodes
1 and 2 are allowed to run the service.

If we shutdown the node running the service, the service first starts on
the surviving node, but is then stopped.  This is not an activity we
would normally expect to do, but some of the environment monitoring in
our servers can trigger this situation - temperature getting too high
etc.

What I think happens is the following:

1.    host 1: shutdown calls on rgmanager to stop:  this takes several
seconds, stopping the service.

2.    host 2: rgmanager starts to relocate the service to the surviving
node

3.    host 1: shutdown calls on cman to stop

4.    host 2: while relocating the service, cman enters gather state
because host 1 has dropped out.  The service is still starting and
triggers "Failed changing service status" in log below.

5.    host 1: halts

6.    host 2: the service completes start up.  The cluster
reconfiguration completes.  

7.    host 2: the service is stopped.

Is this a bug in the cluster system, or just an unfortunate sequence of
events that the cluster cannot cope with, i.e. node failure while a
service is relocating ?

regards,

Martin

Here is the log from node 2:

Aug 16 16:58:40 gmhcluster2 clurgmgrd[4132]: <notice> Member 1 shutting
down

Aug 16 16:58:40 gmhcluster2 openais[4084]: [TOTEM] Retransmit List: 86

Aug 16 16:58:46 gmhcluster2 clurgmgrd[4132]: <notice> Starting stopped
service service:MASTERVIP

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] The token was lost in
the OPERATIONAL state.

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Receive multicast
socket recv buffer size (320000 bytes).

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Transmit multicast
socket send buffer size (262142 bytes).

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] entering GATHER state
from 2.

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <err> #47: Failed changing
service status

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning> #71: Relocating
failed service service:MASTERVIP

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning> #70: Failed to
relocate service:MASTERVIP; restarting locally

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <notice> Starting stopped
service service:MASTERVIP

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering GATHER state
from 0.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Creating commit token
because I am the rep.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Saving state aru 90
high seq received 90

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Storing new sequence
id for ring 1e4

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering COMMIT
state.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering RECOVERY
state.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position [0] member
172.30.31.4:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous ring seq 480
rep 172.30.31.3

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90 high delivered
90 received flag 1

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position [1] member
172.30.31.5:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous ring seq 480
rep 172.30.31.3

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90 high delivered
90 received flag 1

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Did not need to
originate any messages in recovery.

Aug 16 16:59:15 gmhcluster2 kernel: dlm: closing connection to node 1

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Sending initial ORF
token

Aug 16 16:59:15 gmhcluster2 clurgmgrd: [4132]: <info> Executing
/usr/sbin/dc-cluster-resourceagent start

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] CLM CONFIGURATION
CHANGE

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] New Configuration:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
ip(172.30.31.4)

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
ip(172.30.31.5)

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Left:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
ip(172.30.31.3)

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Joined:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] CLM CONFIGURATION
CHANGE

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] New Configuration:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
ip(172.30.31.4)

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
INFO - ACTIVESITE in state started

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
ip(172.30.31.5)

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
INFO - ACTIVESITE service is running - ACTIVESITE in state started

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Left:

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
INFO - Service MASTERVIP: Performing precautionary updates to databases
before setting up VIP

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Joined:

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
INFO - Service MASTERVIP: running precautionary update [touch
/tmp/dc-set-dr-counters] 

Aug 16 16:59:15 gmhcluster2 openais[4084]: [SYNC ] This node is within
the primary component and will provide service.

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
INFO - Service MASTERVIP: running precautionary update [touch
/tmp/dc-set-dr-auto-increments]

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering OPERATIONAL
state.

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM  ] got nodejoin message
172.30.31.4

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM  ] got nodejoin message
172.30.31.5

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG  ] got joinlist message
from node 3

Aug 16 16:59:16 gmhcluster2 clurgmgrd: [4132]: <info> Adding IPv4
address 172.30.30.93/24 to bond0

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG  ] got joinlist message
from node 2

Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Service
service:MASTERVIP started

Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Stopping service
service:MASTERVIP

Aug 16 16:59:17 gmhcluster2 clurgmgrd: [4132]: <info> Removing IPv4
address 172.30.30.93/24 from bond0

Aug 16 16:59:27 gmhcluster2 clurgmgrd: [4132]: <info> Executing
/usr/sbin/dc-cluster-resourceagent stop

Aug 16 16:59:27 gmhcluster2 clurgmgrd[4132]: <notice> Service
service:MASTERVIP is stopped

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20100816/f1543f3f/attachment.htm>