[Linux-cluster] Adding a stop timeout to a VM service using 'ccs'

Thu Mar 20 02:35:55 UTC 2014

On 19/03/14 10:12 PM, Pavel Herrmann wrote:
> Hi
>
> On Wednesday 19 of March 2014 21:26:56 Digimer wrote:
>> On 19/03/14 07:45 PM, Digimer wrote:
>>> On 19/03/14 06:31 PM, Chris Feist wrote:
>>>> On 03/18/2014 08:27 PM, Digimer wrote:
>>>>> Hi all,
>>>>>
>>>>>     I would like to tell rgmanager to give more time for VMs to stop. I
>>>>> want this:
>>>>>
>>>>> <vm name="vm01-win2008" domain="primary_n01" autostart="0"
>>>>> path="/shared/definitions/" exclusive="0" recovery="restart"
>>>>> max_restarts="2"
>>>>> restart_expire_time="600">
>>>>>
>>>>>     <action name="stop" timeout="10m" />
>>>>>
>>>>> </vm>
>>>>>
>>>>> I already use ccs to create the entry:
>>>>>
>>>>> <vm name="vm01-win2008" domain="primary_n01" autostart="0"
>>>>> path="/shared/definitions/" exclusive="0" recovery="restart"
>>>>> max_restarts="2"
>>>>> restart_expire_time="600"/>
>>>>>
>>>>> via:
>>>>>
>>>>> ccs -h localhost --activate --sync --password "secret" \
>>>>>    --addvm vm01-win2008 \
>>>>>    --domain="primary_n01" \
>>>>>    path="/shared/definitions/" \
>>>>>    autostart="0" \
>>>>>    exclusive="0" \
>>>>>    recovery="restart" \
>>>>>    max_restarts="2" \
>>>>>    restart_expire_time="600"
>>>>>
>>>>> I'm hoping it's a simple additional switch. :)
>>>>
>>>> Unfortunately currently ccs doesn't support setting resource actions.
>>>> However it's my understanding that rgmanager doesn't check timeouts
>>>> unless __enforce_timeouts is set to "1".  So you shouldn't be seeing a
>>>> vm resource go to failed if it takes a long time to stop.  Are you
>>>> trying to make the vm resource fail if it takes longer than 10 minutes
>>>> to stop?
>>>
>>> I was afraid you were going to say that. :(
>>>
>>> The problem is that after calling 'disable' against the VM service,
>>> rgmanager waits two minutes. If the service isn't closed in that time,
>>> the server is forced off (at least, this was the behaviour when I last
>>> tested this).
>>>
>>> The concern is that, by default, Windows installs queue updates to
>>> install when the system shuts down. During this time, Windows makes it
>>> very clear that you should not power off the system during the updates.
>>> So if this timer is hit, and the VM is forced off, the guest OS can be
>>> damaged.
>>>
>>> Of course, we can debate the (lack of) wisdom of this behaviour, and while I
>>> already document this concern (and even warn people to check for updates
>>> before stopping the server), it's not sufficient. If a user doesn't read
>>> the warning, or simply forgets to check, the consequences can be
>>> non-trivial.
>>>
>>> If ccs can't be made to add this attribute, and if the behaviour
>>> persists (I will test shortly after sending this reply), then I will
>>> have to edit the cluster.conf directly, something I am loath to do if at
>>> all avoidable.
>>>
>>> Cheers
>>
>> Confirmed;
>>
>> I called disable on a VM with gnome running, so that I could abort the
>> VM's shut down.
>>
>> an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
>> Wed Mar 19 21:06:29 EDT 2014
>> Local machine disabling vm:vm01-rhel6...Success
>> Wed Mar 19 21:08:36 EDT 2014
>>
>> 2 minutes and 7 seconds, then rgmanager forced off the VM. Had this been
>> a Windows guest in the middle of installing updates, it would be highly
>> likely to be screwed now.
>
> Is this really the best way to handle such an event?
>
> From what I remember, Windows can (or could, I don't have any 'modern' Windows
> lying around) be told to shut down without updating. Maybe a wiser approach
> would be to make the stop event (which I believe is delivered to the guest as
> pressing the ACPI power button) trigger a shutdown without updates.
>
> Keep in mind that doing system updates on a timer is dangerous, irrespective of
> the actual time.
>
> regards
> Pavel Herrmann

This assumes that we can modify how Windows behaves. Unless there is a 
magic ACPI event that Windows will reliably interpret as "power off 
without updating", we can't rely on this.

We have clients (and I am sure we aren't the only ones) who install 
their own OSes without any input from us. As mentioned earlier, we do 
document the risks, but that's not good enough. We can't force users to 
read.

So we have a choice: take mitigating steps, or let the user shoot 
themselves in the foot "because they should have known better". As 
personally satisfying as option #2 might seem, option #1 is the more 
professional approach, I would _strongly_ argue.
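
To illustrate what the direct cluster.conf edit mentioned earlier would 
involve, here is a minimal sketch in Python. It is not a tested procedure. 
The function name add_stop_timeout, the cluster name "an-cluster" and the 
starting config_version of 7 are hypothetical; only the <vm> attributes and 
the <action name="stop" timeout="10m" /> child come from this thread. 
Whether rgmanager then waits that long before forcing the VM off is the 
open question above, so treat this as the edit only, not a confirmed fix.

```python
import xml.etree.ElementTree as ET


def add_stop_timeout(conf_xml, vm_name, timeout="10m"):
    """Return cluster.conf text with a stop-action timeout on the named <vm>."""
    root = ET.fromstring(conf_xml)

    # Find the <vm> resource by name, wherever it sits under <cluster>.
    vm = next((v for v in root.iter("vm") if v.get("name") == vm_name), None)
    if vm is None:
        raise ValueError("no <vm name=%r> found" % vm_name)

    # Reuse an existing stop action if one is already defined.
    action = next(
        (a for a in vm.findall("action") if a.get("name") == "stop"), None
    )
    if action is None:
        action = ET.SubElement(vm, "action", {"name": "stop"})
    action.set("timeout", timeout)

    # cluster.conf changes need a higher config_version on <cluster>.
    version = int(root.get("config_version", "0"))
    root.set("config_version", str(version + 1))

    return ET.tostring(root, encoding="unicode")


# Hypothetical sample; the cluster name and config_version are made up.
sample = """<cluster name="an-cluster" config_version="7">
  <rm>
    <vm name="vm01-win2008" domain="primary_n01" autostart="0"
        path="/shared/definitions/" exclusive="0" recovery="restart"
        max_restarts="2" restart_expire_time="600"/>
  </rm>
</cluster>"""

if __name__ == "__main__":
    print(add_stop_timeout(sample, "vm01-win2008"))
```

Validating the result and pushing it out to the other nodes is deliberately 
left out of the sketch; those are the steps ccs would normally take care of.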

digimer

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?