[Linux-cluster] fencing loop in a 2-node partitioned cluster

Gianluca Cecchi gianluca.cecchi at gmail.com
Tue Feb 24 13:50:08 UTC 2009


And these are the logs I see on the two nodes:
the first node:
Feb 23 16:26:38 oracs1 openais[6020]: [TOTEM] The token was lost in
the OPERATIONAL state.
Feb 23 16:26:38 oracs1 openais[6020]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Feb 23 16:26:38 oracs1 openais[6020]: [TOTEM] Transmit multicast
socket send buffer size (288000 bytes).
Feb 23 16:26:38 oracs1 openais[6020]: [TOTEM] entering GATHER state from 2.
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] entering GATHER state from 0.
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] Creating commit token
because I am the rep.
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] Saving state aru 36 high
seq received 36
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] Storing new sequence id
for ring 4d944
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] entering COMMIT state.
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] entering RECOVERY state.
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] position [0] member 192.168.16.1:
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] previous ring seq 317760
rep 192.168.16.1
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] aru 36 high delivered 36
received flag 1
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] Did not need to
originate any messages in recovery.
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] Sending initial ORF token
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ] New Configuration:
Feb 23 16:26:43 oracs1 kernel: dlm: closing connection to node 2
Feb 23 16:26:43 oracs1 fenced[6078]: node02 not a cluster member after
0 sec post_fail_delay
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ]   r(0) ip(192.168.16.1)
Feb 23 16:26:43 oracs1 clurgmgrd[6868]: <info> State change: node02 DOWN
Feb 23 16:26:43 oracs1 fenced[6078]: fencing node "node02"
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ] Members Left:
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ]   r(0) ip(192.168.16.8)
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ] Members Joined:
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ] New Configuration:
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ]   r(0) ip(192.168.16.1)
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ] Members Left:
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ] Members Joined:
Feb 23 16:26:43 oracs1 openais[6020]: [SYNC ] This node is within the
primary component and will provide service.
Feb 23 16:26:43 oracs1 openais[6020]: [TOTEM] entering OPERATIONAL state.
Feb 23 16:26:43 oracs1 openais[6020]: [CLM  ] got nodejoin message 192.168.16.1
Feb 23 16:26:43 oracs1 openais[6020]: [CPG  ] got joinlist message from node 1
Feb 23 16:26:48 oracs1 clurgmgrd[6868]: <info> Waiting for node #2 to be fenced

The other node:
Feb 23 16:26:38 oracs2 openais[6027]: [TOTEM] The token was lost in
the OPERATIONAL state.
Feb 23 16:26:38 oracs2 openais[6027]: [TOTEM] Receive multicast socket
recv buffer size (288000 bytes).
Feb 23 16:26:38 oracs2 openais[6027]: [TOTEM] Transmit multicast
socket send buffer size (288000 bytes).
Feb 23 16:26:38 oracs2 openais[6027]: [TOTEM] entering GATHER state from 2.
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] entering GATHER state from 0.
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] Creating commit token
because I am the rep.
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] Saving state aru 36 high
seq received 36
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] Storing new sequence id
for ring 4d944
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] entering COMMIT state.
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] entering RECOVERY state.
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] position [0] member 192.168.16.8:
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] previous ring seq 317760
rep 192.168.16.1
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] aru 36 high delivered 36
received flag 1
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] Did not need to
originate any messages in recovery.
Feb 23 16:26:43 oracs2 openais[6027]: [TOTEM] Sending initial ORF token
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ] New Configuration:
Feb 23 16:26:43 oracs2 kernel: dlm: closing connection to node 1
Feb 23 16:26:43 oracs2 fenced[6085]: node01 not a cluster member after
0 sec post_fail_delay
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ]   r(0) ip(192.168.16.8)
Feb 23 16:26:43 oracs2 clurgmgrd[6880]: <info> State change: node01 DOWN
Feb 23 16:26:43 oracs2 fenced[6085]: fencing node "node01"
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ] Members Left:
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ]   r(0) ip(192.168.16.1)
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ] Members Joined:
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ] New Configuration:
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ]   r(0) ip(192.168.16.8)
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ] Members Left:
Feb 23 16:26:43 oracs2 openais[6027]: [CLM  ] Members Joined:
Feb 23 16:26:43 oracs2 openais[6027]: [SYNC ] This node is within the
primary component and will provide service.
Feb 23 16:26:44 oracs2 openais[6027]: [TOTEM] entering OPERATIONAL state.
Feb 23 16:26:44 oracs2 openais[6027]: [CLM  ] got nodejoin message 192.168.16.8
Feb 23 16:26:44 oracs2 openais[6027]: [CPG  ] got joinlist message from node 2
Feb 23 16:26:48 oracs2 clurgmgrd[6880]: <info> Waiting for node #1 to be fenced

The result is that both nodes ended up powered off and I had to power
them back on manually.
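The "0 sec post_fail_delay" in both fenced logs comes from the
fence_daemon settings; in my cluster.conf that part is essentially the
default, roughly (a sketch, only the delay value is the one actually
visible in the logs):

    <fence_daemon post_fail_delay="0"/>

So as soon as the token is lost, each node immediately declares the
other one failed and starts fencing it, which on a failure of the
interconnect itself means the two nodes shoot each other.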

As I also have debug enabled for qdisk, these are its logs (a sketch of
the matching quorumd config follows them):
node01:
Feb 23 09:39:42 oracs1 qdiskd[6062]: <debug> Heuristic: 'ping -c1 -w1
10.4.5.250' score=1 interval=2 tko=3
Feb 23 09:39:42 oracs1 qdiskd[6062]: <debug> 1 heuristics loaded
Feb 23 09:39:42 oracs1 qdiskd[6062]: <debug> Quorum Daemon: 1
heuristics, 3 interval, 5 tko, 1 votes
Feb 23 09:39:42 oracs1 qdiskd[6062]: <debug> Run Flags: 00000031
Feb 23 09:39:43 oracs1 qdiskd[6062]: <info> Quorum Partition:
/dev/dm-5 Label: acsquorum
Feb 23 09:39:43 oracs1 qdiskd[6063]: <info> Quorum Daemon Initializing
Feb 23 09:39:43 oracs1 qdiskd[6063]: <debug> I/O Size: 512  Page Size: 4096
Feb 23 09:39:44 oracs1 qdiskd[6063]: <info> Heuristic: 'ping -c1 -w1
10.4.5.250' UP
Feb 23 09:39:58 oracs1 qdiskd[6063]: <info> Initial score 1/1
Feb 23 09:39:58 oracs1 qdiskd[6063]: <info> Initialization complete
Feb 23 09:39:58 oracs1 qdiskd[6063]: <notice> Score sufficient for
master operation (1/1; required=1); upgrading
Feb 23 09:40:04 oracs1 qdiskd[6063]: <debug> Making bid for master
Feb 23 09:40:10 oracs1 qdiskd[6063]: <info> Assuming master role
Feb 23 09:42:52 oracs1 qdiskd[6063]: <debug> Node 2 is UP


node02:
Feb 23 16:12:34 oracs2 qdiskd[6069]: <debug> Heuristic: 'ping -c1 -w1
10.4.5.250' score=1 interval=2 tko=3
Feb 23 16:12:34 oracs2 qdiskd[6069]: <debug> 1 heuristics loaded
Feb 23 16:12:34 oracs2 qdiskd[6069]: <debug> Quorum Daemon: 1
heuristics, 3 interval, 5 tko, 1 votes
Feb 23 16:12:34 oracs2 qdiskd[6069]: <debug> Run Flags: 00000031
Feb 23 16:12:34 oracs2 qdiskd[6069]: <info> Quorum Partition:
/dev/dm-10 Label: acsquorum
Feb 23 16:12:34 oracs2 qdiskd[6070]: <info> Quorum Daemon Initializing
Feb 23 16:12:34 oracs2 qdiskd[6070]: <debug> I/O Size: 512  Page Size: 4096
Feb 23 16:12:36 oracs2 qdiskd[6070]: <info> Heuristic: 'ping -c1 -w1
10.4.5.250' UP
Feb 23 16:12:41 oracs2 qdiskd[6070]: <debug> Node 1 is UP
Feb 23 16:12:44 oracs2 qdiskd[6070]: <info> Node 1 is the master
Feb 23 16:12:50 oracs2 qdiskd[6070]: <info> Initial score 1/1
Feb 23 16:12:50 oracs2 qdiskd[6070]: <info> Initialization complete
Feb 23 16:12:50 oracs2 qdiskd[6070]: <notice> Score sufficient for
master operation (1/1; required=1); upgrading
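
The quorumd parameters shown in those debug lines (1 heuristic,
interval 3, tko 5, 1 vote, label acsquorum) correspond to a cluster.conf
section roughly like this (reconstructed from the logs above, not a
copy of the actual file):

    <quorumd interval="3" tko="5" votes="1" label="acsquorum">
        <heuristic program="ping -c1 -w1 10.4.5.250" score="1" interval="2" tko="3"/>
    </quorumd>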


Gianluca



