"default" watchdog device - ?

Tue Apr 5 16:27:46 UTC 2022

On 29/03/2022 20:25, Nir Soffer wrote:
> On Wed, Mar 16, 2022 at 1:55 PM lejeczek <peljasz at yahoo.co.uk> wrote:
>>
>>
>> On 15/03/2022 11:21, Daniel P. Berrangé wrote:
>>> On Tue, Mar 15, 2022 at 10:39:50AM +0000, lejeczek wrote:
>>>> Hi guys.
>>>>
>>>> Without explicitly, manually using watchdog device for a VM, the VM (centOS
>>>> 8 Stream 4.18.0-365.el8.x86_64) shows '/dev/watchdog' exists.
>>>> To double check - 'dumpxml' does not show any such device - what kind of a
>>>> 'watchdog' that is?
>>> The kernel can always provide a pure software watchdog IIRC. It can be
>>> useful if a userspace app wants a watchdog. The limitation is that it
>>> relies on the kernel remaining functional, as there's no hardware
>>> backing it up.
>>>
>>> Regards,
>>> Daniel
>> On a related note - with the 'i6300esb' watchdog, which I
>> tested and believe is working,
>> I often get this in 'dmesg' in my VMs:
>> ...
>> watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
>> rcu: INFO: rcu_sched self-detected stall on CPU
>> ...
>> The above is from Ubuntu and CentOS alike, and when this
>> happens the console via VNC responds until the first 'enter',
>> then is non-responsive.
>> This happens after VM(s) was migrated between hosts, but
>> anyway..
>> I do not see what I expected from 'watchdog' - there is no
>> action whatsoever, which should be 'reset'. VM remains in
>> such 'frozen' state forever.
>>
>> any & all shared thoughts much appreciated.
>> L.
> You need to run some userspace tool that will open the watchdog
> device, and pet it periodically, telling the kernel that userspace is alive.
>
> If this tool stops petting the watchdog, maybe because of a soft lockup
> or other trouble, the watchdog device will reset the VM.
>
> watchdog(8) may be the tool you need.
>
> See also
> https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst
>
> Nir
>
I do not think that the 'i6300esb' watchdog works under those 
soft lockups; whether the problem is on the qemu or the OS 
end I cannot say.
With:
     <watchdog model='i6300esb' action='reset'/>
in the domain XML, the OS sees:
-> $ llr /dev/watchdog*
crw-------. 1 root root  10, 130 Apr  5 16:59 /dev/watchdog
crw-------. 1 root root 248,   0 Apr  5 16:59 /dev/watchdog0
crw-------. 1 root root 248,   1 Apr  5 16:59 /dev/watchdog1
and
-> $ wdctl
Device:        /dev/watchdog
Identity:      i6300ESB timer [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
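
The listing above shows two numbered devices (watchdog0 and 
watchdog1), while wdctl was only asked about /dev/watchdog. A 
minimal sketch of how one could see which driver sits behind 
each node, assuming the guest kernel exposes the watchdog 
sysfs attributes (CONFIG_WATCHDOG_SYSFS); the function name 
and the chosen attribute list are only illustrative:

```python
from pathlib import Path

# Attribute names follow the kernel's sysfs-class-watchdog ABI. They are an
# assumption about this guest: they exist only with CONFIG_WATCHDOG_SYSFS,
# and some (e.g. timeleft) only if the driver supports them.
ATTRS = ("identity", "state", "timeout", "timeleft", "nowayout")


def list_watchdogs(sysfs_root="/sys/class/watchdog"):
    """Return {device name: {attribute: value or None}} for each watchdog."""
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        info = {}
        for attr in ATTRS:
            try:
                info[attr] = (dev / attr).read_text().strip()
            except OSError:
                # Attribute missing or unreadable for this kernel/driver.
                info[attr] = None
        result[dev.name] = info
    return result
```

Printing list_watchdogs() in the guest should show an 
'identity' per device, which tells the i6300ESB timer apart 
from whatever else registered itself.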

If the HW watchdog worked, then 'i6300esb' should reset 
the VM when nothing is pinging the watchdog. I read that it 
is possible to exit the 'software' watchdog without causing 
the HW watchdog to take action. I do not know if that is 
what happens here when I just 'systemctl stop watchdog'.
In '/etc/watchdog.conf' I do not point to any specific 
device, which I believe makes watchdogd do its thing.
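
For reference, the "ping" and the clean exit mentioned above 
can be sketched as follows, going by the kernel watchdog API 
document Nir linked: any write is a keepalive, and a driver 
with "magic close" disarms only if 'V' is the last thing 
written before close (unless nowayout is set). The function 
names are made up for illustration and this is not what 
watchdogd itself does internally:

```python
import time


def pet_watchdog(dev, interval, count, sleep=time.sleep):
    """Send `count` keepalives to an open watchdog file, `interval` s apart."""
    for _ in range(count):
        dev.write(b"\0")  # any write counts as a keepalive ping
        dev.flush()
        sleep(interval)


def disarm(dev):
    """Magic close: 'V' must be the last write before the file is closed."""
    dev.write(b"V")
    dev.flush()


# Hypothetical usage inside the guest, as root. Opening the device arms the
# timer; leaving out disarm() should lead to the 'reset' action once the
# timeout (30 seconds in the wdctl output) expires after the last ping.
#
#   with open("/dev/watchdog", "wb", buffering=0) as dev:
#       pet_watchdog(dev, interval=10, count=6)
#       disarm(dev)
```

The device can normally be held open by only one process, so 
the watchdog service would have to be stopped first.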
Simple test:
-> $ cat >> /dev/watchdog
followed by pressing 'Enter' twice 
does invoke the 'reset' action and that, together with 
'wdctl', led me to believe the HW watchdog is working. But!...
The main issue I have is those "soft lockups" where the VM's 
OS becomes frozen but there is nothing from the watchdog, no 
action - even though, while the VM is in such a frozen state, 
the host shows high CPU for the VM.

I do not do anything fancy, so I really wonder if what I see 
is that rare.
The soft lockups occur, I think, usually (I cannot say 
exclusively) during or after VM live migration.
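
To line the lockups up against migration times, one could 
pull the events out of the kernel log. A rough sketch, 
assuming the messages look like the two 'dmesg' lines quoted 
earlier in this thread; the regexes and names are only 
illustrative. Note that these messages come from the kernel's 
own lockup detector, which is separate from the /dev/watchdog 
device:

```python
import re

# Patterns modelled on the two dmesg lines quoted earlier in the thread.
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s!")
RCU_STALL = re.compile(r"rcu_\w+ self-detected stall on CPU")


def scan_kernel_log(lines):
    """Return (kind, cpu, seconds) tuples for lockup/stall lines found."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append(("soft_lockup", int(m.group(1)), int(m.group(2))))
        elif RCU_STALL.search(line):
            # No CPU number or duration on this line itself.
            events.append(("rcu_stall", None, None))
    return events
```

It can be fed the lines of 'dmesg' or 'journalctl -k' output 
saved from an affected VM.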

thanks, L.