[Linux-cluster] Fencing problem w/ 2-node VM when a VM host dies
Digimer
lists at alteeve.ca
Fri Dec 4 19:00:04 UTC 2015
On 04/12/15 01:52 PM, Kelvin Edmison wrote:
>
>
> On 12/04/2015 12:49 PM, Digimer wrote:
>> On 04/12/15 09:14 AM, Kelvin Edmison wrote:
>>>
>>> On 12/03/2015 09:31 PM, Digimer wrote:
>>>> On 03/12/15 08:39 PM, Kelvin Edmison wrote:
>>>>> On 12/03/2015 06:14 PM, Digimer wrote:
>>>>>> On 03/12/15 02:19 PM, Kelvin Edmison wrote:
>>>>>>> I am hoping that someone can help me understand the problems I'm having
>>>>>>> with linux clustering for VMs.
>>>>>>>
>>>>>>> I am clustering 2 VMs on two separate VM hosts, trying to ensure that a
>>>>>>> service is always available. The hosts and guests are both RHEL 6.7.
>>>>>>> The goal is to have only one of the two VMs running at a time.
>>>>>>>
>>>>>>> The configuration works when we test/simulate VM deaths and graceful VM
>>>>>>> host shutdowns, and administrative switchovers (i.e. clusvcadm -r).
>>>>>>>
>>>>>>> However, when we simulate the sudden isolation of host A (e.g. ifdown
>>>>>>> eth0), two things happen:
>>>>>>> 1) the VM on host B does not start, and repeated fence_xvm errors appear
>>>>>>> in the logs on host B
>>>>>>> 2) when the 'failed' node is returned to service, the cman service on
>>>>>>> host B dies.
>>>>>> If the node's host is dead, then there is no way for the survivor to
>>>>>> determine the state of the lost VM node. The cluster is not allowed to
>>>>>> take "no answer" as confirmation of fence success.
>>>>>>
>>>>>> If your hosts have IPMI, then you could add fence_ipmilan as a backup
>>>>>> method where, if fence_xvm fails, it moves on and reboots the host
>>>>>> itself.
>>>>> Thank you for the suggestion. The hosts do have ipmi. I'll explore it
>>>>> but I'm a little concerned about what it means for the other
>>>>> non-clustered VM workloads that exist on these two servers.
>>>>>
>>>>> Do you have any thoughts as to why host B's cman process is dying when
>>>>> 'host A' returns?
>>>>>
>>>>> Thanks,
>>>>> Kelvin
>>>> It's not dying, it's blocking. When a node is lost, dlm blocks until
>>>> fenced tells it that the fence was successful. If fenced can't contact
>>>> the lost node's fence method(s), then it doesn't succeed and dlm stays
>>>> blocked. To anything that uses DLM, like rgmanager, it appears like the
>>>> host is hung but it is by design. The logic is that, as bad as it is to
>>>> hang, it's better than risking a split-brain.
>>> When I said the cman service is dying, I should have further qualified
>>> it. I mean that the corosync process is no longer running (ps -ef | grep
>>> corosync does not show it) and after recovering the failed host A,
>>> manual intervention (service cman start) was required on host B to
>>> recover full cluster services.
>>>
>>> [root@host2 ~]# for SERVICE in ricci fence_virtd cman rgmanager; do
>>> printf "%-12s " $SERVICE; service $SERVICE status; done
>>> ricci        ricci (pid 5469) is running...
>>> fence_virtd  fence_virtd (pid 4862) is running...
>>> cman         Found stale pid file
>>> rgmanager    rgmanager (pid 5366) is running...
>>>
>>>
>>> Thanks,
>>> Kelvin
>> Oh now that is interesting...
>>
>> You'll want input from Fabio, Chrissie or one of the other core devs, I
>> suspect.
>>
>> If this is RHEL proper, can you open a rhbz ticket? If it's CentOS, and
>> if you can reproduce it reliably, can you create a new thread with the
>> reproducer?
> It's RHEL proper in both host and guest, and we can reproduce it reliably.
Excellent!
Please reply here with the rhbz#. I'm keen to see what comes of it.
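
If it helps for the bug report, here is a minimal sketch of a helper that
records the state described above (cman reporting a stale pid file while
corosync is gone). It only pattern-matches the status text shown in the
quoted output. The function names classify_status and report_services are
hypothetical, and the matched strings are assumed from that output rather
than taken from the init scripts themselves.

```shell
#!/bin/sh
# Sketch only: classify the text printed by "service <name> status".
# classify_status is a hypothetical helper, not part of cman or rgmanager.
# The patterns are assumed from the status output quoted in this thread.
classify_status() {
    case "$1" in
        *"stale pid file"*) echo "stale" ;;
        *"is running"*)     echo "running" ;;
        *)                  echo "unknown" ;;
    esac
}

# Walk the same service list as the loop quoted above and print one
# classified line per service, then note whether corosync is alive.
report_services() {
    for SERVICE in ricci fence_virtd cman rgmanager; do
        OUTPUT=$(service "$SERVICE" status 2>&1)
        printf '%-12s %s\n' "$SERVICE" "$(classify_status "$OUTPUT")"
    done
    if pgrep -x corosync >/dev/null 2>&1; then
        printf '%-12s %s\n' "corosync" "running"
    else
        printf '%-12s %s\n' "corosync" "not running"
    fi
}
```

Calling report_services on host B right after host A rejoins should show
cman as stale and corosync as not running, if the reproducer behaves as
described.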
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?