[Linux-cluster] Node is randomly fenced

Fabio M. Di Nitto fdinitto at redhat.com
Fri Jun 13 04:02:34 UTC 2014


On 06/12/2014 09:06 PM, Digimer wrote:
> Hrm, I'm not really sure that I am able to interpret this without making
> guesses. I'm cc'ing one of the devs (who I hope will poke the right
> person if he's not able to help at the moment). Let's see what he has to
> say.
> 
> I am curious now, too. :)

Chrissie/Honza: can you please take a look at this thread and see if
there is a latent bug?

I find it odd that the "Process pause detected" message is kicking in so
many times without a fencing action.
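
For whoever picks this up, here is a rough way to quantify that from the quoted log. This is only a sketch, not corosync tooling: the function name summarize_pauses and the sample list are made up for illustration, and the regex assumes the exact line format shown in this thread (including the mail quote prefix and the wrapped continuation lines, which are simply skipped). It groups the "Process pause detected" lines by timestamp and reports how many there were and the range of reported pause lengths.

```python
import re

# Matches lines such as:
#   Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
# A leading mail quote prefix (">> ") is tolerated because search() is used.
PAUSE_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+corosync\s+\[TOTEM\s*\]\s+"
    r"Process pause detected for (?P<ms>\d+) ms"
)


def summarize_pauses(lines):
    """Group pause messages by log timestamp.

    Returns a dict {timestamp: (count, min_ms, max_ms)} in first-seen order.
    Lines that are not pause messages (state changes, wrapped
    continuation lines) are ignored.
    """
    summary = {}
    for line in lines:
        match = PAUSE_RE.search(line)
        if match is None:
            continue
        stamp = match.group("stamp")
        ms = int(match.group("ms"))
        count, low, high = summary.get(stamp, (0, ms, ms))
        summary[stamp] = (count + 1, min(low, ms), max(high, ms))
    return summary


# Hypothetical sample: a few lines copied from the log in this thread.
sample = [
    ">> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,",
    ">> flushing membership messages.",
    ">> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,",
    ">> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12.",
    ">> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms,",
]

for stamp, (count, low, high) in summarize_pauses(sample).items():
    print(f"{stamp}: {count} messages, pause {low}-{high} ms")
```

Run against the full log, this should make it easier to see that the reported pause climbs from 32947 ms at 14:44:49 to 38109 ms at 14:44:54, roughly one second per second of wall clock, while the membership keeps re-forming with all four nodes.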

Fabio

> 
> On 12/06/14 03:02 PM, Schaefer, Micah wrote:
>> Node4 was fenced again. I was able to get some debug logs (below),
>> including a new message:
>>
>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL
>> state."
>>
>>
>> Rest of corosync logs
>>
>> http://pastebin.com/iYFbkbhb
>>
>>
>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state.
>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the
>> membership and a new membership was formed.
>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>> flushing membership messages.
>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33294 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33363 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33432 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33494 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33495 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33564 ms,
>> flushing membership messages.
>> Jun 12 14:44:50 corosync [TOTEM ] got commit token
>> Jun 12 14:44:50 corosync [TOTEM ] Saving state aru 86 high seq
>> received 86
>> Jun 12 14:44:50 corosync [TOTEM ] Storing new sequence id for ring 6324
>> Jun 12 14:44:50 corosync [TOTEM ] entering COMMIT state.
>> Jun 12 14:44:50 corosync [TOTEM ] got commit token
>> Jun 12 14:44:50 corosync [TOTEM ] entering RECOVERY state.
>> Jun 12 14:44:50 corosync [TOTEM ] TRANS [0] member 10.70.100.101:
>> Jun 12 14:44:50 corosync [TOTEM ] TRANS [1] member 10.70.100.102:
>> Jun 12 14:44:50 corosync [TOTEM ] TRANS [2] member 10.70.100.103:
>> Jun 12 14:44:50 corosync [TOTEM ] TRANS [3] member 10.70.100.104:
>> Jun 12 14:44:50 corosync [TOTEM ] position [0] member 10.70.100.101:
>> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep
>> 10.70.100.101
>> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:50 corosync [TOTEM ] position [1] member 10.70.100.102:
>> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep
>> 10.70.100.101
>> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:50 corosync [TOTEM ] position [2] member 10.70.100.103:
>> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep
>> 10.70.100.101
>> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:50 corosync [TOTEM ] position [3] member 10.70.100.104:
>> Jun 12 14:44:50 corosync [TOTEM ] previous ring seq 25376 rep
>> 10.70.100.101
>> Jun 12 14:44:50 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:50 corosync [TOTEM ] Did not need to originate any messages
>> in recovery.
>> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 0, aru ffffffff
>> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 1, aru 0
>> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 2, aru 0
>> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:50 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 3, aru 0
>> Jun 12 14:44:50 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:50 corosync [TOTEM ] retrans flag count 4 token aru 0
>> install
>> seq 0 aru 0 0
>> Jun 12 14:44:50 corosync [TOTEM ] Resetting old ring state
>> Jun 12 14:44:50 corosync [TOTEM ] recovery to regular 1-0
>> Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 1
>> Jun 12 14:44:50 corosync [TOTEM ] entering OPERATIONAL state.
>> Jun 12 14:44:50 corosync [TOTEM ] A processor joined or left the
>> membership and a new membership was formed.
>> Jun 12 14:44:50 corosync [TOTEM ] waiting_trans_ack changed to 0
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms,
>> flushing membership messages.
>> Jun 12 14:44:51 corosync [TOTEM ] entering GATHER state from 12.
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms,
>> flushing membership messages.
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms,
>> flushing membership messages.
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34338 ms,
>> flushing membership messages.
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms,
>> flushing membership messages.
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms,
>> flushing membership messages.
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms,
>> flushing membership messages.
>> Jun 12 14:44:51 corosync [TOTEM ] Process pause detected for 34407 ms,
>> flushing membership messages.
>> Jun 12 14:44:51 corosync [TOTEM ] got commit token
>> Jun 12 14:44:51 corosync [TOTEM ] Saving state aru 86 high seq
>> received 86
>> Jun 12 14:44:51 corosync [TOTEM ] Storing new sequence id for ring 6328
>> Jun 12 14:44:51 corosync [TOTEM ] entering COMMIT state.
>> Jun 12 14:44:51 corosync [TOTEM ] got commit token
>> Jun 12 14:44:51 corosync [TOTEM ] entering RECOVERY state.
>> Jun 12 14:44:51 corosync [TOTEM ] TRANS [0] member 10.70.100.101:
>> Jun 12 14:44:51 corosync [TOTEM ] TRANS [1] member 10.70.100.102:
>> Jun 12 14:44:51 corosync [TOTEM ] TRANS [2] member 10.70.100.103:
>> Jun 12 14:44:51 corosync [TOTEM ] TRANS [3] member 10.70.100.104:
>> Jun 12 14:44:51 corosync [TOTEM ] position [0] member 10.70.100.101:
>> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep
>> 10.70.100.101
>> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:51 corosync [TOTEM ] position [1] member 10.70.100.102:
>> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep
>> 10.70.100.101
>> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:51 corosync [TOTEM ] position [2] member 10.70.100.103:
>> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep
>> 10.70.100.101
>> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:51 corosync [TOTEM ] position [3] member 10.70.100.104:
>> Jun 12 14:44:51 corosync [TOTEM ] previous ring seq 25380 rep
>> 10.70.100.101
>> Jun 12 14:44:51 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:51 corosync [TOTEM ] Did not need to originate any messages
>> in recovery.
>> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 0, aru ffffffff
>> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 1, aru 0
>> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 2, aru 0
>> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:51 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 3, aru 0
>> Jun 12 14:44:51 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:51 corosync [TOTEM ] retrans flag count 4 token aru 0
>> install
>> seq 0 aru 0 0
>> Jun 12 14:44:51 corosync [TOTEM ] Resetting old ring state
>> Jun 12 14:44:51 corosync [TOTEM ] recovery to regular 1-0
>> Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 1
>> Jun 12 14:44:51 corosync [TOTEM ] entering OPERATIONAL state.
>> Jun 12 14:44:51 corosync [TOTEM ] A processor joined or left the
>> membership and a new membership was formed.
>> Jun 12 14:44:51 corosync [TOTEM ] waiting_trans_ack changed to 0
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] entering GATHER state from 12.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35177 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35246 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35316 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35385 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35454 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] Process pause detected for 35455 ms,
>> flushing membership messages.
>> Jun 12 14:44:52 corosync [TOTEM ] got commit token
>> Jun 12 14:44:52 corosync [TOTEM ] Saving state aru 86 high seq
>> received 86
>> Jun 12 14:44:52 corosync [TOTEM ] Storing new sequence id for ring 632c
>> Jun 12 14:44:52 corosync [TOTEM ] entering COMMIT state.
>> Jun 12 14:44:52 corosync [TOTEM ] got commit token
>> Jun 12 14:44:52 corosync [TOTEM ] entering RECOVERY state.
>> Jun 12 14:44:52 corosync [TOTEM ] TRANS [0] member 10.70.100.101:
>> Jun 12 14:44:52 corosync [TOTEM ] TRANS [1] member 10.70.100.102:
>> Jun 12 14:44:52 corosync [TOTEM ] TRANS [2] member 10.70.100.103:
>> Jun 12 14:44:52 corosync [TOTEM ] TRANS [3] member 10.70.100.104:
>> Jun 12 14:44:52 corosync [TOTEM ] position [0] member 10.70.100.101:
>> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep
>> 10.70.100.101
>> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:52 corosync [TOTEM ] position [1] member 10.70.100.102:
>> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep
>> 10.70.100.101
>> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:52 corosync [TOTEM ] position [2] member 10.70.100.103:
>> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep
>> 10.70.100.101
>> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:52 corosync [TOTEM ] position [3] member 10.70.100.104:
>> Jun 12 14:44:52 corosync [TOTEM ] previous ring seq 25384 rep
>> 10.70.100.101
>> Jun 12 14:44:52 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:52 corosync [TOTEM ] Did not need to originate any messages
>> in recovery.
>> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 0, aru ffffffff
>> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 1, aru 0
>> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 2, aru 0
>> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:52 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 3, aru 0
>> Jun 12 14:44:52 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:52 corosync [TOTEM ] retrans flag count 4 token aru 0
>> install
>> seq 0 aru 0 0
>> Jun 12 14:44:52 corosync [TOTEM ] Resetting old ring state
>> Jun 12 14:44:52 corosync [TOTEM ] recovery to regular 1-0
>> Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 1
>> Jun 12 14:44:52 corosync [TOTEM ] entering OPERATIONAL state.
>> Jun 12 14:44:52 corosync [TOTEM ] A processor joined or left the
>> membership and a new membership was formed.
>> Jun 12 14:44:52 corosync [TOTEM ] waiting_trans_ack changed to 0
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36223 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] entering GATHER state from 12.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36224 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36293 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36362 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36431 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36432 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] Process pause detected for 36501 ms,
>> flushing membership messages.
>> Jun 12 14:44:53 corosync [TOTEM ] got commit token
>> Jun 12 14:44:53 corosync [TOTEM ] Saving state aru 86 high seq
>> received 86
>> Jun 12 14:44:53 corosync [TOTEM ] Storing new sequence id for ring 6330
>> Jun 12 14:44:53 corosync [TOTEM ] entering COMMIT state.
>> Jun 12 14:44:53 corosync [TOTEM ] got commit token
>> Jun 12 14:44:53 corosync [TOTEM ] entering RECOVERY state.
>> Jun 12 14:44:53 corosync [TOTEM ] TRANS [0] member 10.70.100.101:
>> Jun 12 14:44:53 corosync [TOTEM ] TRANS [1] member 10.70.100.102:
>> Jun 12 14:44:53 corosync [TOTEM ] TRANS [2] member 10.70.100.103:
>> Jun 12 14:44:53 corosync [TOTEM ] TRANS [3] member 10.70.100.104:
>> Jun 12 14:44:53 corosync [TOTEM ] position [0] member 10.70.100.101:
>> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep
>> 10.70.100.101
>> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:53 corosync [TOTEM ] position [1] member 10.70.100.102:
>> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep
>> 10.70.100.101
>> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:53 corosync [TOTEM ] position [2] member 10.70.100.103:
>> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep
>> 10.70.100.101
>> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:53 corosync [TOTEM ] position [3] member 10.70.100.104:
>> Jun 12 14:44:53 corosync [TOTEM ] previous ring seq 25388 rep
>> 10.70.100.101
>> Jun 12 14:44:53 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:53 corosync [TOTEM ] Did not need to originate any messages
>> in recovery.
>> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 0, aru ffffffff
>> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 1, aru 0
>> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 2, aru 0
>> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:53 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 3, aru 0
>> Jun 12 14:44:53 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:53 corosync [TOTEM ] retrans flag count 4 token aru 0
>> install
>> seq 0 aru 0 0
>> Jun 12 14:44:53 corosync [TOTEM ] Resetting old ring state
>> Jun 12 14:44:53 corosync [TOTEM ] recovery to regular 1-0
>> Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 1
>> Jun 12 14:44:53 corosync [TOTEM ] entering OPERATIONAL state.
>> Jun 12 14:44:53 corosync [TOTEM ] A processor joined or left the
>> membership and a new membership was formed.
>> Jun 12 14:44:53 corosync [TOTEM ] waiting_trans_ack changed to 0
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms,
>> flushing membership messages.
>> Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12.
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37267 ms,
>> flushing membership messages.
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms,
>> flushing membership messages.
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37268 ms,
>> flushing membership messages.
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms,
>> flushing membership messages.
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 37337 ms,
>> flushing membership messages.
>> Jun 12 14:44:54 corosync [TOTEM ] got commit token
>> Jun 12 14:44:54 corosync [TOTEM ] Saving state aru 86 high seq
>> received 86
>> Jun 12 14:44:54 corosync [TOTEM ] Storing new sequence id for ring 6334
>> Jun 12 14:44:54 corosync [TOTEM ] entering COMMIT state.
>> Jun 12 14:44:54 corosync [TOTEM ] got commit token
>> Jun 12 14:44:54 corosync [TOTEM ] entering RECOVERY state.
>> Jun 12 14:44:54 corosync [TOTEM ] TRANS [0] member 10.70.100.101:
>> Jun 12 14:44:54 corosync [TOTEM ] TRANS [1] member 10.70.100.102:
>> Jun 12 14:44:54 corosync [TOTEM ] TRANS [2] member 10.70.100.103:
>> Jun 12 14:44:54 corosync [TOTEM ] TRANS [3] member 10.70.100.104:
>> Jun 12 14:44:54 corosync [TOTEM ] position [0] member 10.70.100.101:
>> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep
>> 10.70.100.101
>> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:54 corosync [TOTEM ] position [1] member 10.70.100.102:
>> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep
>> 10.70.100.101
>> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:54 corosync [TOTEM ] position [2] member 10.70.100.103:
>> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep
>> 10.70.100.101
>> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:54 corosync [TOTEM ] position [3] member 10.70.100.104:
>> Jun 12 14:44:54 corosync [TOTEM ] previous ring seq 25392 rep
>> 10.70.100.101
>> Jun 12 14:44:54 corosync [TOTEM ] aru 86 high delivered 86 received
>> flag 1
>> Jun 12 14:44:54 corosync [TOTEM ] Did not need to originate any messages
>> in recovery.
>> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 0, aru ffffffff
>> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 1, aru 0
>> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 2, aru 0
>> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:54 corosync [TOTEM ] token retrans flag is 0 my set retrans
>> flag0 retrans queue empty 1 count 3, aru 0
>> Jun 12 14:44:54 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
>> Jun 12 14:44:54 corosync [TOTEM ] retrans flag count 4 token aru 0
>> install
>> seq 0 aru 0 0
>> Jun 12 14:44:54 corosync [TOTEM ] Resetting old ring state
>> Jun 12 14:44:54 corosync [TOTEM ] recovery to regular 1-0
>> Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 1
>> Jun 12 14:44:54 corosync [TOTEM ] entering OPERATIONAL state.
>> Jun 12 14:44:54 corosync [TOTEM ] A processor joined or left the
>> membership and a new membership was formed.
>> Jun 12 14:44:54 corosync [TOTEM ] waiting_trans_ack changed to 0
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms,
>> flushing membership messages.
>> Jun 12 14:44:54 corosync [TOTEM ] entering GATHER state from 12.
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms,
>> flushing membership messages.
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38108 ms,
>> flushing membership messages.
>> Jun 12 14:44:54 corosync [TOTEM ] Process pause detected for 38109 ms,
>> flushing membership messages.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 6/12/14, 1:55 PM, "Schaefer, Micah" <Micah.Schaefer at jhuapl.edu> wrote:
>>
>>> I just found that the clock on node1 was off by about a minute and a half
>>> compared to the rest of the nodes.
>>>
>>> I am running ntp, so I'm not sure why the time wasn't synced up. I wonder
>>> whether node1, being behind, would think it was not receiving updates
>>> from the other nodes?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 6/12/14, 1:29 PM, "Digimer" <lists at alteeve.ca> wrote:
>>>
>>>> Even if the token changes stop the immediate fencing, don't leave it
>>>> please. There is something fundamentally wrong that you need to
>>>> identify/fix.
>>>>
>>>> Keep us posted!
>>>>
>>>> On 12/06/14 01:24 PM, Schaefer, Micah wrote:
>>>>> The servers do not run any tasks other than the tasks in the cluster
>>>>> service group.
>>>>>
>>>>> Nodes 3 and 4 are physical servers with a lot of horsepower, and
>>>>> nodes 1 and 2 are virtual machines with far fewer resources available.
>>>>>
>>>>> I adjusted the token settings and will watch for any change.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 6/12/14, 1:08 PM, "Digimer" <lists at alteeve.ca> wrote:
>>>>>
>>>>>> On 12/06/14 12:48 PM, Schaefer, Micah wrote:
>>>>>>> As far as the switch goes, both are Cisco Catalyst 6509-E, no
>>>>>>> spanning tree changes are happening and all the ports have
>>>>>>> port-fast enabled for these servers. My switch logging level is
>>>>>>> very high and I have no messages in relation to the time frames
>>>>>>> or ports.
>>>>>>>
>>>>>>> TOTEM reports that "A processor joined or left the membership...",
>>>>>>> but that isn't enough detail.
>>>>>>>
>>>>>>> Also note that I did not have these issues until adding the new
>>>>>>> servers, node3 and node4, to the cluster. Node1 and node2 do not
>>>>>>> fence each other (unless a real issue is there), and they are on
>>>>>>> different switches.
>>>>>>
>>>>>> Then I can't imagine it being network anymore. Seeing as both node 3
>>>>>> and 4 get fenced, it's likely not hardware either. Are the workloads
>>>>>> on 3 and 4 much higher (or are the computers much slower) than 1 and
>>>>>> 2? I'm wondering if the nodes are simply not keeping up with corosync
>>>>>> traffic. You might try adjusting the corosync token timeout and
>>>>>> retransmit counts to see if that reduces the node losses.
>>>>>>
>>>>>> -- 
>>>>>> Digimer
>>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>>> without
>>>>>> access to education?
>>>>>>
>>>>>> -- 
>>>>>> Linux-cluster mailing list
>>>>>> Linux-cluster at redhat.com
>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> 
> 



