Ways to deal with broken machine types

Tue Mar 23 16:54:47 UTC 2021

On Tue, 23 Mar 2021 16:04:11 +0100
Thomas Lamprecht <t.lamprecht at proxmox.com> wrote:

> On 23.03.21 15:55, Vitaly Cheptsov wrote:
> >> On 23 March 2021, at 17:48, Michael S. Tsirkin <mst at redhat.com> wrote:
> >>
> >> The issue is with people who installed a VM using 5.1 qemu,
> >> migrated to 5.2, booted there and set a config on a device
> >> e.g. IP on a NIC.
> >> They now have a 5.1 machine type but changing uid back
> >> like we do will break these VMs.
> >>
> >> Unlikely to be common but let's at least create a way for these people
> >> to use these VMs.
> >>  
> > They can simply set the 5.2 VM version in such a case. I do not want to
> > let this legacy hack be enabled in any modern QEMU VM version, as it
> > violates the ACPI specification and makes life more difficult for various
> > other software like bootloaders and operating systems.
> 
> Yeah here I agree with Vitaly, if they already used 5.2 and made some configurations
> for those "new" devices they can just keep using 5.2?
> 
> If some of the devices got configured on 5.1 and some on 5.2 there's nothing we can
> do anyway, from a QEMU POV - there the user always needs to choose one machine version
> and fix up the device configured while on the other machine.

According to testing, it appears that the issue affects virtio drivers, so it may lead to
failure to boot the guest (and there was at least 1 report about virtio-scsi being affected).

Let me hijack this thread to go beyond the scope of this case.

I agree that for this particular bug we've done all we could, but
there is a broader issue to discuss here.

We have machine versions to deal with hw compatibility issues and that covers most of the cases,
but occasionally we notice a problem well after the release(s),
so users may be stuck with a broken VM and need to manually fix the configuration (and/or the VM).
Figuring out what's wrong and how to fix it is far from trivial. So let's discuss whether we
can help to ease this pain. Yes, it will be too late for the first victims, but late is still
better than never.

I'll try to sum up the idea Michael suggested (here comes my unorganized brain-dump):

1. We can keep in the VM's config the QEMU version it was created on
   and, as a minimum, warn the user with a pointer to known issues if the version
   in the config mismatches the version of the QEMU actually used, with a knob to
   silence the warning for a particular mismatch.

When an issue becomes known and resolved, we know for sure how and what
changed, and can embed instructions on which options to use to fix up the VM's
config so that the old HW config is preserved, depending on the QEMU version
the VM was installed on.
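
To make idea 1 a bit more concrete, here is a rough sketch of the mismatch
check as a mgmt layer might do it. Everything in it is made up for illustration:
the VMConfig fields, the KNOWN_ISSUES table and the check_version() helper are
hypothetical names, not anything QEMU or an existing mgmt layer provides.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical table of known issues, keyed by
# (QEMU version the VM was created on, QEMU version currently used).
KNOWN_ISSUES = {
    ("5.1", "5.2"): "see the known-issues notes for VMs moved from 5.1 to 5.2",
}


@dataclass
class VMConfig:
    name: str
    created_on: str  # QEMU version the VM was created on (assumed config field)
    # The "knob": (created_on, running) pairs the user chose to silence.
    silenced: set = field(default_factory=set)


def check_version(cfg: VMConfig, running: str) -> Optional[str]:
    """Return a warning string on a version mismatch, or None."""
    if cfg.created_on == running:
        return None
    pair = (cfg.created_on, running)
    if pair in cfg.silenced:
        return None
    hint = KNOWN_ISSUES.get(pair, "no known issues recorded for this pair")
    return (f"warning: VM '{cfg.name}' was created on QEMU {cfg.created_on} "
            f"but is run on QEMU {running}: {hint}")
```

The silencing is per (created, running) pair on purpose, so that moving the VM
to yet another QEMU version brings the warning back.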

Some more ideas:
   2. let the mgmt layer keep a fixup list and apply the fixups to the config if available
       (the user would need to upgrade mgmt or update the fixup list somehow)
   3. let the mgmt layer pass the VM's QEMU version to the currently used QEMU, so
      that QEMU could maintain and apply fixups based on QEMU version + machine type.
      The user will have to upgrade to a newer QEMU to get/use new fixups.
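
As an illustration of idea 2, a fixup list kept on the mgmt side could look
roughly like the sketch below. It is only a sketch under assumptions: the FIXUPS
format, the apply_fixups() helper and the 'example-device.legacy-uid' property
are all hypothetical and do not correspond to real QEMU properties. Matching on
QEMU version + machine type is borrowed from idea 3.

```python
# Hypothetical fixup list: each entry says which VMs it applies to and which
# properties to add to their config. The property name is invented.
FIXUPS = [
    {
        "created_on": "5.1",
        "machine": "pc-i440fx-5.1",
        "set": {"example-device.legacy-uid": "on"},
    },
]


def apply_fixups(created_on, machine, props, fixups=FIXUPS):
    """Return a copy of props with all matching fixups applied.

    Properties the user has set explicitly are left alone (an assumption
    made in this sketch, not something agreed on in the thread).
    """
    result = dict(props)
    for fixup in fixups:
        if fixup["created_on"] == created_on and fixup["machine"] == machine:
            for key, value in fixup["set"].items():
                result.setdefault(key, value)
    return result
```

For example, apply_fixups("5.1", "pc-i440fx-5.1", {}) would return the config
with the hypothetical property added, while a VM created on another version
gets its properties back unchanged.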

In my opinion both would lead to an explosion of 'possibly needed' properties for each
change we introduce in hw/firmware (read ACPI) and very possibly a lot of conditional
branches in QEMU code. And I'm afraid QEMU will become hard to maintain =>
more bugs in the future.
It will also lead to an explosion of the test matrix for downstreams who care about testing.

If we proactively gate changes on properties, we can just update the fixup lists in mgmt,
without the need to update QEMU (aka Insite rules), at the cost of complexity on the QEMU side.

Alternatively, we can be conservative in spawning new properties, that is, create
them only when an issue is fixed, and require users to update QEMU so that fixups can
be applied to the VM.

Feel free to shoot the messenger down or suggest ways we can deal with the problem.