[Linux-cluster] Two-node cluster: Node attempts stateful merge after clean reboot

Thu Sep 12 17:25:29 UTC 2013

On 12/09/13 02:57, Pascal Ehlert wrote:
>
> On 11/09/13 7:31 PM, Digimer wrote:
>> That log message does show the node joining. Can you reliably
>> reproduce this? If so, can you please 'tail -f -n 0 /var/log/messages'
>> on both nodes, break the cluster and wait for the node to restart,
>> 'tail' the rebooted node's /var/log/messages, wait the six minutes and
>> then, after the second fence occurs, post both nodes' logs?
>>
> I was indeed able to reliably reproduce this and that's where my
> confusion came from. I don't understand why the node seems to be joining
> (and leaving immediately afterwards as per the log), all within the
> 360-second post-join fence delay, and still gets fenced.
>
> As this is a semi-production system (we had to move quickly), I went
> with a qdisk-based approach now, using a small iSCSI disk from a remote
> site. This works very well and reliably as far as I can tell from the
> testing that I have done so far. I would still be interested to hear why
> the initial approach failed.
>
> How would manually starting the cluster services have made a difference
> anyway? Does that mean that one should join the cluster and fence domain
> first to ensure a stateless join and only then start rgmanager? Isn't
> that something that could be achieved with some delays in the startup
> scripts as well?
>
> Either way, thank you all for helping out this quickly!

I honestly don't know why it would join and then get fenced. That's most
likely a network issue, but I couldn't guess any more than that.
Regardless, you have an issue, as this behaviour is certainly not normal.
You may have masked it with qdisk, but please don't leave things as they
are. This is worthy of further investigation.

In this case, manually starting the cluster would probably not change
anything. It would, however, make debugging easier, because you could
start tailing the logs before attempting to start the cluster. We'll
really need to see the logs in order to go much further.
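
A minimal sketch of that manual start, assuming a RHEL 6 style setup in
which the 'cman' and 'rgmanager' init scripts are the only cluster
services involved (if clvmd or gfs2 are in use, they would normally go
between the two). The function name and the RUN variable are made up for
illustration. By default it only prints the commands; nothing is started
unless RUN is set to an empty string.

```shell
#!/bin/sh
# Before running this, start capturing logs on BOTH nodes in a spare
# terminal:
#   tail -f -n 0 /var/log/messages
#
# RUN is a hypothetical dry-run switch: when unset, commands are only
# echoed. Export RUN= (empty) and run as root to really start the services.
RUN=${RUN-echo}

start_cluster_manually() {
    # cman first (membership and fence domain), then rgmanager.
    $RUN service cman start &&
    $RUN service rgmanager start
}

start_cluster_manually
```

Run as-is, it prints the two 'service' commands in the order they would
be executed.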

If you can schedule a maintenance window, please reproduce this and post
the logs here. I am very curious as to what might be going on. In the
meantime, run 'cman_tool status', record the multicast address and make
sure that multicast group is persistent in your switches.
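
To record the multicast address, something along these lines should do.
It is a sketch that assumes 'cman_tool status' prints a line beginning
with "Multicast addresses:", as it does on the versions I have seen; the
function name is made up for illustration.

```shell
#!/bin/sh
# Read 'cman_tool status' output on stdin and print only the value of
# the "Multicast addresses:" line, with surrounding whitespace removed.
mcast_addr() {
    awk '/^Multicast addresses:/ {
        sub(/^[^:]*:[ \t]*/, "")   # drop the label and the colon
        sub(/[ \t]+$/, "")         # drop trailing whitespace
        print
    }'
}

# Only query the cluster if cman_tool is actually installed on this host.
if command -v cman_tool >/dev/null 2>&1; then
    cman_tool status | mcast_addr
fi
```

Run it on each node; both should report the same address.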

There is a small chance that one of the services under rgmanager's
control is causing an interruption. Again, I'm guessing.

digimer

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?