frequent network collapse possibly due to bridging

Mon Jan 24 22:30:16 UTC 2022

On 1/24/22 4:35 AM, Martin Kletzander wrote:
> On Fri, Jan 21, 2022 at 08:42:58AM -0600, Hakan E. Duran wrote:
>> Hi,
>>
>> I would like some help to troubleshoot the problem I have been having
>> lately with my VM host, which contains 5 VMs, one of which is for
>> pi-hole, unbound services. It has been a relatively common occurrence in
>> the last few weeks for me to find that the host machine has lost its
>> network when I get back home from work. Restoring the VM/VMs does not fix
>> the problem; the host needs to be restarted, otherwise I lose both name
>> resolution and the internet connection (I cannot even ping IPs such as
>> 8.8.8.8). Since I use the pi-hole VM as the DNS
>> server for my LAN, this means that my whole LAN gets disconnected from
>> the internet until the host machine is rebooted. The host machine has a
>> somewhat complicated network setup: the two gigabit connections are bonded
>> and bridged to the VMs; however, this setup has served me well
>> for several years now. The problem, on the other hand, appeared a few
>> weeks ago. This doesn't happen every day but often enough to be annoying
>> and disruptive for my family.
>>
> 
> Always good to check what has changed those weeks ago, but I understand
> it is difficult to find out what you were updating and where.
> 
>> My question is, how can I troubleshoot this problem and figure out
>> whether it is truly due to network bridging somehow collapsing or not? I
>> tried to find some log files but all I could find were the
>> /var/log/libvirt/qemu/$VM files, and the particular log file for the pi-hole
>> VM reported the following lines; however, I am not sure if they are
>> associated with a real crash or just due to shutting down and restarting
>> the host (please excuse the word-wrapping):
>>
>> char device redirected to /dev/pts/2 (label charserial0)
>> qxl_send_events: spice-server bug: guest stopped, ignoring
>> 2022-01-20T23:41:17.012445Z qemu-system-x86_64: terminating on signal 15 from pid 1 (/sbin/init)
> 
> That is probably from restarting the host, since QEMU got SIGTERM'd by
> init.  Maybe it was restarted at a bad time and there is some
> inconsistency on the disk?  Using something like libvirt-guests, which
> can manage your machines when rebooting, would be a good idea.
> 
>> 2022-01-20 23:41:17.716+0000: shutting down, reason=crashed
>> 2022-01-20 23:42:46.059+0000: starting up libvirt version: 7.10.0, qemu
>> version: 6.2.0, kernel: 5.10.89-1-MANJARO, hostname: -redacted-
>>
>> Please excuse my ignorance but is there a way to restart the
>> networking without rebooting the host machine? This will not solve my
> 
> You can do:
> 
> virsh net-destroy <network_name>
> virsh net-start <network_name>
> 
> but depending on what the network looks like, how it is set up etc. you
> might need to restart some of the VMs or manually plug them in.

The connection between any guest tap device and a host bridge device 
will be broken by virsh net-destroy, and not restored by virsh net-start 
(because the network driver has no good way of notifying the QEMU driver 
that it has restarted a network). This is something that's been on my 
"list of annoying things I should fix some day" for a very long time, 
but I've never been motivated enough to figure out a clean solution.

In the meantime, if you destroy/start a network, you can get all the 
guest tap devices reconnected by restarting libvirtd:

    systemctl restart libvirtd.service

or if you're using split daemons:

    systemctl restart virtqemud.service
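
If you want the whole sequence in one place, here is a rough sketch, not something I've tested on your setup. The helper names (run, qemu_daemon_unit, bounce_network) and the DRY_RUN variable are made up for illustration, and the "is virtqemud active?" check is only a guess at which daemon layout is in use; pass the unit explicitly if that guess is wrong for your host.

```shell
#!/bin/sh
# Print each command instead of running it when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Guess which daemon hosts the QEMU driver: virtqemud if it is active, else libvirtd.
# This is a heuristic, not something libvirt itself provides.
qemu_daemon_unit() {
    if systemctl is-active --quiet virtqemud.service 2>/dev/null; then
        echo virtqemud.service
    else
        echo libvirtd.service
    fi
}

# Restart a libvirt network, then restart the daemon so the QEMU driver
# reattaches the guest tap devices. Stops at the first command that fails.
# $1 = network name, $2 = optional systemd unit to restart.
bounce_network() {
    net=$1
    unit=${2:-$(qemu_daemon_unit)}
    run virsh net-destroy "$net" &&
    run virsh net-start "$net" &&
    run systemctl restart "$unit"
}
```

For example, `DRY_RUN=1 bounce_network <network_name>` just prints the three commands so you can check them before running them for real as root.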

One of the things the QEMU driver does when it's initializing is to 
check where each guest tap device *should* be connected, compare that to 
where it *is* connected, and if those don't match then fix it.
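
You can do a similar comparison by hand to see whether a tap device has fallen off its bridge. The following is only a sketch of the idea, not what the QEMU driver actually runs: it reads the "master" symlink that the kernel exposes in sysfs for an attached bridge port. The function names and the SYSFS_ROOT variable are made up for illustration (SYSFS_ROOT exists only so the functions can be tried against a fake directory tree).

```shell
#!/bin/sh
# Root of sysfs; overridable only so the functions can be exercised on a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the bridge a network device is currently attached to,
# or nothing if it has no master device.
current_bridge() {
    link="$SYSFS_ROOT/class/net/$1/master"
    [ -L "$link" ] || return 0
    basename "$(readlink "$link")"
}

# Compare the expected bridge with the actual one for a single tap device.
# $1 = tap device name, $2 = bridge it should be attached to.
check_tap() {
    dev=$1
    expected=$2
    actual=$(current_bridge "$dev")
    if [ "$actual" = "$expected" ]; then
        echo "$dev: ok ($expected)"
    else
        echo "$dev: MISMATCH expected=$expected actual=${actual:-none}"
        return 1
    fi
}
```

The tap name and the expected bridge for each guest can be read from the Interface and Source columns of "virsh domiflist <domain>" (for interfaces of type "network", Source is the network name rather than the bridge, so you'd have to look up that network's bridge first). With hypothetical names, "check_tap vnet0 br0" reports whether vnet0 is still attached to br0.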