[PATCH V4 0/5] Introduce Advanced Watch Dog module

Zhang, Chen chen.zhang at intel.com
Mon Mar 9 09:32:11 UTC 2020


On 3/4/2020 9:37 PM, Paolo Bonzini wrote:
> On 04/03/20 09:06, Zhang, Chen wrote:
>>> Hi Eric and Paolo, Can you give some comments about this series?
>>>
>>>
>> No news for a while...
>> We already have some users (cloud service providers) trying to use this module in their products.
>> But they also need to follow the Qemu upstream code.
> My main comment about this series is that it's not clear why it is
> needed and how to use it.  The documentation includes a demo, but no
> description of what is an awd_node, a notification_node and an
> opt_script.  I can more or less understand the notification_node and
> opt_script role from the documentation, but not entirely because, for
> example, the two-host demo has hardcoded IP addresses without saying
> which host is which IP address.

Hi Paolo,

Sorry for the slow reply, and thank you for your comments.

Let me summarize your main points and suggestions:

1. Why AWD is needed.

Advanced Watch Dog is a universal monitoring module on the VMM side. It
can be used to detect a network failure (VMM to guest, VMM to VMM, or
VMM to another remote server) and perform a previously configured
operation. The current AWD patch simply accepts any input as the signal
to refresh the watchdog timer, and we could also define an interactive
protocol here. For the outputs, the user can pre-write some commands or
messages in the AWD opt-script. We noticed that there is no way for VMMs
to communicate directly. Some people may think we don't need such a
thing (upper-layer software like OpenStack can handle it), but when we
engaged with real customers we found that they need a lightweight and
efficient mechanism to solve some practical problems.

For example, in Edge Computing cases they think high-level software is
too heavy to use at the edge, or that it is hard to manage and combine
with VM instances. AWD gives users basic VM/host network monitoring
tools and a basic fault tolerance and recovery solution.

For the COLO FT/HA solution, we already have some CSPs trying to use AWD
with COLO.

2. Documentation issues, including how to use it.

I will address all your comments and fill in the missing details in the
documentation.

3. Communication protocol issue.

The current AWD has no protocol: any data it receives is treated as a
heartbeat signal.
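
To make the current behavior concrete, here is a minimal sketch of
"any data refreshes the timer". It is not the AWD code from the patches;
the class and method names are hypothetical and the clock is injectable
only so the logic can be exercised without waiting:

```python
import time


class HeartbeatWatchdog:
    """Toy watchdog: any received data refreshes the timer (hypothetical names)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()
        self.fired = False

    def feed(self, data):
        # No protocol: any non-empty payload counts as a heartbeat.
        if data:
            self.last_seen = self.clock()

    def poll(self):
        # Return True exactly once, when the timeout elapses without a heartbeat.
        if not self.fired and self.clock() - self.last_seen >= self.timeout_s:
            self.fired = True
            return True
        return False
```

A caller would invoke feed() with whatever arrives on the socket and
poll() periodically; a True result is the point where the pre-set
operation would run.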

I think using the QMP format is fine with me.
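
With a QMP-style format, the receiver could refresh the timer only on a
well-formed command instead of on arbitrary bytes. A rough sketch,
assuming the "notify-watchdog" command name that Paolo gave as an
example (it is not an existing QEMU command) and a hypothetical helper
name:

```python
import json


def is_heartbeat(line):
    """Return True only for a QMP-style {"execute": "notify-watchdog"} message."""
    try:
        msg = json.loads(line)
    except ValueError:
        # Not valid JSON, so it is not a heartbeat under this scheme.
        return False
    # "notify-watchdog" is an assumed command name taken from the review comment.
    return isinstance(msg, dict) and msg.get("execute") == "notify-watchdog"
```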

4. Implementation issue.

Making the AWD script an optional feature is OK with me.

And reporting the triggering of the watchdog via QMP events is enough
for current usage.

But it looks like it has a limitation when notifying outside QEMU. I
don't know which is the better choice.

If the QMP events solution is better, I will fix it in the next version.
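
As a sketch of what the QMP events approach could look like on the wire
and on the management side: the "event"/"data"/"timestamp" layout
follows the usual QMP event format, while the event name "AWD_TIMEOUT",
the "id" field, and both function names are hypothetical and only for
illustration:

```python
import json


def make_awd_event(node_id, seconds, microseconds):
    # "AWD_TIMEOUT" and the "id" field are assumptions, not an existing QEMU event.
    return json.dumps({
        "event": "AWD_TIMEOUT",
        "data": {"id": node_id},
        "timestamp": {"seconds": seconds, "microseconds": microseconds},
    })


def handle_line(line):
    # Management-side handler: return the node id for an AWD event, else None.
    msg = json.loads(line)
    if msg.get("event") == "AWD_TIMEOUT":
        return msg["data"]["id"]
    return None
```

A management tool reading the QMP socket could call handle_line() on
each line and run its own recovery action, which is the role the
opt-script plays today.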


I don't know if I understand your meaning correctly.

Please give me more guidance on this series.  :-)

Thanks

Zhang Chen


>
> The documentation does not describe the protocol, which is absolutely
> necessary, and does not describe _why_ the protocol was designed like
> that.  Without such documentation it's not clear if, for example, the
> watchdog protocol could be implemented as QMP commands (e.g.
> start-watchdog, stop-watchdog, notify-watchdog).  Another possibility
> could be to use the systemd watchdog protocol, which consists of
> essentially three commands (WATCHDOG=1, WATCHDOG=trigger,
> WATCHDOG_USEC=...) which are transmitted as datagrams.  Documentation is
> important for reviewers to judge the merits of the protocol without (or
> before) diving into the code.
>
> In the demo, the opt_script mechanism is currently using the "human"
> monitor as opposed to QMP.  The human monitor interface is not stable
> and not meant for consumption by management interface.  It is not clear
> if this is just a sample usage, and in practice the notification_node
> would be outside of QEMU, or not.  In general I would prefer to have the
> script as an optional feature, and report the triggering of the watchdog
> via QMP events.
>
> Paolo
>




