[vfio-users] GPU driver crashes when running a second VM if either VM has a virtual disk stored on physical media other than the root disk. Tested on three X58 chipset MBs

Brian Yglesias brian at atlanticdigitalsolutions.com
Mon Nov 20 09:58:49 UTC 2017


Zir,

Thanks for the response.  I really thought that would work, since the problem does seem to follow the chipset, but alas, no change.  I added intremap=off per your suggestion and found the nox2apic parameter, which I presume makes the kernel use xAPIC instead of x2APIC.

I searched for information about setting a bit in the DMAR table to do the same, but I wasn't able to find anything about that.  I'm not sure whether nox2apic achieves that.
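
As a quick sanity check that the parameters above actually reached the kernel, here is a minimal sketch, assuming Python 3. The helper names parse_cmdline and check_params are hypothetical, and the sample string is the kernel command line from the dmesg output in this message. It only confirms that a parameter is present on the command line; it says nothing about whether nox2apic has any effect on the DMAR table.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so values such as 'ZFS=rpool/...' stay whole.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def check_params(cmdline, wanted):
    """Return {name: True/False} for each parameter name in 'wanted'."""
    params = parse_cmdline(cmdline)
    return {name: name in params for name in wanted}


# Command line copied from the dmesg output in this message.
SAMPLE = (
    "BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.10.17-3-pve root=ZFS=rpool/ROOT/pve-1 "
    "ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet "
    "vfio_iommu_type1.allow_unsafe_interrupts=1 intel_iommu=on intremap=off nox2apic"
)

for name, present in check_params(SAMPLE, ["intel_iommu", "intremap", "nox2apic"]).items():
    print(name, "present" if present else "missing")
```

On a running host the same functions could be given the contents of /proc/cmdline instead of the sample string.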




root@ads-120elmst-proxmox-1:~# dmesg | grep -i apic
[    0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.10.17-3-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet vfio_iommu_type1.allow_unsafe_interrupts=1 intel_iommu=on intremap=off nox2apic
[    0.000000] ACPI: APIC 0x000000009F780390 0000D8 (v01 051111 APIC2126 20110511 MSFT 00000097)
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] IOAPIC[0]: apic_id 6, version 32, address 0xfec00000, GSI 0-23
[    0.000000] IOAPIC[1]: apic_id 7, version 32, address 0xfec8a000, GSI 24-47
[    0.000000] Kernel command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.10.17-3-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet vfio_iommu_type1.allow_unsafe_interrupts=1 intel_iommu=on intremap=off nox2apic
[    0.090605] Switched APIC routing to physical flat.
[    0.091091] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.415386] ACPI: Using IOAPIC for interrupt routing
[    0.946833] intel_idle: lapic_timer_reliable_states 0xffffffff



Here's what the system log looks like when running both VMs (it seems to be a bit more resilient with the 2017 OS updates, but it still crashes sooner rather than later):

Nov 19 22:12:47 ads-120elmst-proxmox-1 kernel: [ 2322.345972] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 3.803 msecs
Nov 19 22:31:32 ads-120elmst-proxmox-1 kernel: [ 3447.384827] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 3.812 msecs
Nov 19 22:34:25 ads-120elmst-proxmox-1 kernel: [ 3620.283114] kvm_get_msr_common: 3 callbacks suppressed
Nov 19 22:34:57 ads-120elmst-proxmox-1 kernel: [ 3652.379106] kvm_get_msr_common: 134 callbacks suppressed
Nov 19 22:39:23 ads-120elmst-proxmox-1 kernel: [ 3918.236976] kvm_get_msr_common: 134 callbacks suppressed
Nov 19 22:52:32 ads-120elmst-proxmox-1 kernel: [ 4707.406853] usb 9-1.2: reset full-speed USB device number 5 using xhci_hcd
Nov 19 22:52:33 ads-120elmst-proxmox-1 kernel: [ 4708.055537] usb 9-1.4: reset full-speed USB device number 7 using xhci_hcd





Nov 19 22:53:04 ads-120elmst-proxmox-1 kernel: [ 4739.445246] vmbr0: port 3(tap110i0) entered disabled state
Nov 19 22:53:04 ads-120elmst-proxmox-1 kernel: [ 4739.554581] input: Logitech G510s Gaming Keyboard as /devices/pci0000:00/0000:00:02.0/0000:02:00.0/usb9/9-1/9-1.4/9-1.4:1.0/0003:046D:C22D.000F/input/input20
Nov 19 22:53:04 ads-120elmst-proxmox-1 kernel: [ 4739.612961] hid-generic 0003:046D:C22D.000F: input,hidraw0: USB HID v1.11 Keyboard [Logitech G510s Gaming Keyboard] on usb-0000:02:00.0-1.4/input0
Nov 19 22:53:04 ads-120elmst-proxmox-1 kernel: [ 4739.648397] input: Logitech G510s Gaming Keyboard as /devices/pci0000:00/0000:00:02.0/0000:02:00.0/usb9/9-1/9-1.4/9-1.4:1.1/0003:046D:C22D.0010/input/input21
Nov 19 22:53:04 ads-120elmst-proxmox-1 kernel: [ 4739.705008] hid-generic 0003:046D:C22D.0010: input,hiddev0,hidraw1: USB HID v1.11 Device [Logitech G510s Gaming Keyboard] on usb-0000:02:00.0-1.4/input1
Nov 19 22:53:04 ads-120elmst-proxmox-1 kernel: [ 4739.708743] input: Logitech USB Receiver as /devices/pci0000:00/0000:00:02.0/0000:02:00.0/usb9/9-1/9-1.2/9-1.2:1.0/0003:046D:C531.0011/input/input22
Nov 19 22:53:04 ads-120elmst-proxmox-1 kernel: [ 4739.709067] hid-generic 0003:046D:C531.0011: input,hidraw2: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-0000:02:00.0-1.2/input0
Nov 19 22:53:04 ads-120elmst-proxmox-1 kernel: [ 4739.722617] input: Logitech USB Receiver as /devices/pci0000:00/0000:00:02.0/0000:02:00.0/usb9/9-1/9-1.2/9-1.2:1.1/0003:046D:C531.0012/input/input23
Nov 19 22:53:04 ads-120elmst-proxmox-1 kernel: [ 4739.781100] hid-generic 0003:046D:C531.0012: input,hiddev0,hidraw3: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-0000:02:00.0-1.2/input1
Nov 19 22:53:05 ads-120elmst-proxmox-1 kernel: [ 4739.816936] hid-generic 0003:2101:8501.0013: hiddev0,hidraw4: USB HID v1.11 Device [Action Star USB HID] on usb-0000:02:00.0-1.1/input0
Nov 19 22:53:06 ads-120elmst-proxmox-1 kernel: [ 4741.055914] usb 9-2.2: reset low-speed USB device number 8 using xhci_hcd
Nov 19 22:53:38 ads-120elmst-proxmox-1 kernel: [ 4773.399598] vmbr0: port 2(tap111i0) entered disabled state
Nov 19 22:53:38 ads-120elmst-proxmox-1 kernel: [ 4773.464740] input: Logitech USB Optical Mouse as /devices/pci0000:00/0000:00:02.0/0000:02:00.0/usb9/9-2/9-2.2/9-2.2:1.0/0003:046D:C077.0014/input/input24
Nov 19 22:53:38 ads-120elmst-proxmox-1 kernel: [ 4773.465021] hid-generic 0003:046D:C077.0014: input,hidraw5: USB HID v1.11 Mouse [Logitech USB Optical Mouse] on usb-0000:02:00.0-2.2/input0
Nov 19 22:53:38 ads-120elmst-proxmox-1 kernel: [ 4773.482748] hid-generic 0003:2101:8501.0015: hiddev0,hidraw6: USB HID v1.11 Device [Action Star USB HID] on usb-0000:02:00.0-2.1/input0



Nov 19 22:54:32 ads-120elmst-proxmox-1 kernel: [ 4827.661898] device tap110i0 entered promiscuous mode
Nov 19 22:54:32 ads-120elmst-proxmox-1 kernel: [ 4827.668725] vmbr0: port 2(tap110i0) entered blocking state
Nov 19 22:54:32 ads-120elmst-proxmox-1 kernel: [ 4827.668726] vmbr0: port 2(tap110i0) entered disabled state
Nov 19 22:54:32 ads-120elmst-proxmox-1 kernel: [ 4827.668800] vmbr0: port 2(tap110i0) entered blocking state
Nov 19 22:54:32 ads-120elmst-proxmox-1 kernel: [ 4827.668801] vmbr0: port 2(tap110i0) entered forwarding state
Nov 19 22:54:33 ads-120elmst-proxmox-1 kernel: [ 4827.964488] vmbr0: port 2(tap110i0) entered disabled state
Nov 19 22:54:38 ads-120elmst-proxmox-1 kernel: [ 4833.565853] usb 9-2.2: USB disconnect, device number 8
Nov 19 22:54:40 ads-120elmst-proxmox-1 kernel: [ 4835.038318] usb 9-2.2: new low-speed USB device number 9 using xhci_hcd
Nov 19 22:54:40 ads-120elmst-proxmox-1 kernel: [ 4835.148988] usb 9-2.2: New USB device found, idVendor=046d, idProduct=c077
Nov 19 22:54:40 ads-120elmst-proxmox-1 kernel: [ 4835.148991] usb 9-2.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Nov 19 22:54:40 ads-120elmst-proxmox-1 kernel: [ 4835.148993] usb 9-2.2: Product: USB Optical Mouse
Nov 19 22:54:40 ads-120elmst-proxmox-1 kernel: [ 4835.148994] usb 9-2.2: Manufacturer: Logitech
Nov 19 22:54:40 ads-120elmst-proxmox-1 kernel: [ 4835.153141] input: Logitech USB Optical Mouse as /devices/pci0000:00/0000:00:02.0/0000:02:00.0/usb9/9-2/9-2.2/9-2.2:1.0/0003:046D:C077.0016/input/input25
Nov 19 22:54:40 ads-120elmst-proxmox-1 kernel: [ 4835.153521] hid-generic 0003:046D:C077.0016: input,hidraw5: USB HID v1.11 Mouse [Logitech USB Optical Mouse] on usb-0000:02:00.0-2.2/input0
Nov 19 22:55:42 ads-120elmst-proxmox-1 kernel: [ 4897.081151] usb 9-2.2: USB disconnect, device number 9
Nov 19 22:55:43 ads-120elmst-proxmox-1 kernel: [ 4898.582198] usb 9-2.2: new low-speed USB device number 10 using xhci_hcd
Nov 19 22:55:43 ads-120elmst-proxmox-1 kernel: [ 4898.694029] usb 9-2.2: New USB device found, idVendor=046d, idProduct=c077
Nov 19 22:55:43 ads-120elmst-proxmox-1 kernel: [ 4898.694031] usb 9-2.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Nov 19 22:55:43 ads-120elmst-proxmox-1 kernel: [ 4898.694033] usb 9-2.2: Product: USB Optical Mouse
Nov 19 22:55:43 ads-120elmst-proxmox-1 kernel: [ 4898.694035] usb 9-2.2: Manufacturer: Logitech
Nov 19 22:55:43 ads-120elmst-proxmox-1 kernel: [ 4898.701760] input: Logitech USB Optical Mouse as /devices/pci0000:00/0000:00:02.0/0000:02:00.0/usb9/9-2/9-2.2/9-2.2:1.0/0003:046D:C077.0017/input/input26
Nov 19 22:55:43 ads-120elmst-proxmox-1 kernel: [ 4898.702086] hid-generic 0003:046D:C077.0017: input,hidraw5: USB HID v1.11 Mouse [Logitech USB Optical Mouse] on usb-0000:02:00.0-2.2/input0
Nov 19 22:55:48 ads-120elmst-proxmox-1 kernel: [ 4903.677507] device tap110i0 entered promiscuous mode
Nov 19 22:55:48 ads-120elmst-proxmox-1 kernel: [ 4903.688302] vmbr0: port 2(tap110i0) entered blocking state
Nov 19 22:55:48 ads-120elmst-proxmox-1 kernel: [ 4903.688304] vmbr0: port 2(tap110i0) entered disabled state
Nov 19 22:55:48 ads-120elmst-proxmox-1 kernel: [ 4903.688387] vmbr0: port 2(tap110i0) entered blocking state
Nov 19 22:55:48 ads-120elmst-proxmox-1 kernel: [ 4903.688388] vmbr0: port 2(tap110i0) entered forwarding state
Nov 19 22:55:50 ads-120elmst-proxmox-1 kernel: [ 4904.869968] vfio_ecap_init: 0000:04:00.0 hiding ecap 0x19 at 0x900
Nov 19 22:55:52 ads-120elmst-proxmox-1 kernel: [ 4906.874808] device tap111i0 entered promiscuous mode
Nov 19 22:55:52 ads-120elmst-proxmox-1 kernel: [ 4906.880827] vmbr0: port 3(tap111i0) entered blocking state
Nov 19 22:55:52 ads-120elmst-proxmox-1 kernel: [ 4906.880829] vmbr0: port 3(tap111i0) entered disabled state
Nov 19 22:55:52 ads-120elmst-proxmox-1 kernel: [ 4906.880895] vmbr0: port 3(tap111i0) entered blocking state
Nov 19 22:55:52 ads-120elmst-proxmox-1 kernel: [ 4906.880896] vmbr0: port 3(tap111i0) entered forwarding state
Nov 19 22:55:53 ads-120elmst-proxmox-1 kernel: [ 4908.513876] usb 9-1.1: reset high-speed USB device number 4 using xhci_hcd
Nov 19 22:55:54 ads-120elmst-proxmox-1 kernel: [ 4908.938243] usb 9-1.2: reset full-speed USB device number 5 using xhci_hcd
Nov 19 22:55:54 ads-120elmst-proxmox-1 kernel: [ 4909.297968] usb 9-1.4: reset full-speed USB device number 7 using xhci_hcd
Nov 19 22:55:55 ads-120elmst-proxmox-1 kernel: [ 4909.781433] usb 9-1.2: reset full-speed USB device number 5 using xhci_hcd
Nov 19 22:55:55 ads-120elmst-proxmox-1 kernel: [ 4910.176352] usb 9-1.4: reset full-speed USB device number 7 using xhci_hcd
Nov 19 22:56:06 ads-120elmst-proxmox-1 kernel: [ 4921.624815] vfio_ecap_init: 0000:05:00.0 hiding ecap 0x1e at 0x258
Nov 19 22:56:06 ads-120elmst-proxmox-1 kernel: [ 4921.624824] vfio_ecap_init: 0000:05:00.0 hiding ecap 0x19 at 0x900
Nov 19 22:56:09 ads-120elmst-proxmox-1 kernel: [ 4924.086319] usb 9-1.4: reset full-speed USB device number 7 using xhci_hcd
Nov 19 22:56:09 ads-120elmst-proxmox-1 kernel: [ 4924.282305] usb 9-1.2: reset full-speed USB device number 5 using xhci_hcd
Nov 19 22:56:09 ads-120elmst-proxmox-1 kernel: [ 4924.480909] usb 9-1.1: reset high-speed USB device number 4 using xhci_hcd
Nov 19 22:56:09 ads-120elmst-proxmox-1 kernel: [ 4924.666275] usb 9-1.2: reset full-speed USB device number 5 using xhci_hcd
Nov 19 22:56:10 ads-120elmst-proxmox-1 kernel: [ 4924.855264] usb 9-1.4: reset full-speed USB device number 7 using xhci_hcd
Nov 19 22:56:10 ads-120elmst-proxmox-1 kernel: [ 4925.076862] usb 9-1.1: reset high-speed USB device number 4 using xhci_hcd
Nov 19 22:56:10 ads-120elmst-proxmox-1 kernel: [ 4925.262218] usb 9-1.2: reset full-speed USB device number 5 using xhci_hcd



There are more USB resets than I remember from the last time I tried this.
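
To put a number on that rather than rely on memory, here is a minimal sketch, assuming Python 3, that tallies the "reset ... USB device" lines per USB port. The function name count_usb_resets is hypothetical, and the sample lines are copied from the log above; a full comparison would need the complete logs from both runs.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   usb 9-1.2: reset full-speed USB device number 5 using xhci_hcd
RESET_RE = re.compile(r"usb (\S+): reset (\S+) USB device number (\d+) using (\S+)")


def count_usb_resets(lines):
    """Count reset messages per USB port path (e.g. '9-1.2')."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the log in this message; the disconnect line is not a reset.
SAMPLE_LOG = """\
Nov 19 22:52:32 ads-120elmst-proxmox-1 kernel: [ 4707.406853] usb 9-1.2: reset full-speed USB device number 5 using xhci_hcd
Nov 19 22:52:33 ads-120elmst-proxmox-1 kernel: [ 4708.055537] usb 9-1.4: reset full-speed USB device number 7 using xhci_hcd
Nov 19 22:53:06 ads-120elmst-proxmox-1 kernel: [ 4741.055914] usb 9-2.2: reset low-speed USB device number 8 using xhci_hcd
Nov 19 22:54:38 ads-120elmst-proxmox-1 kernel: [ 4833.565853] usb 9-2.2: USB disconnect, device number 8
Nov 19 22:55:53 ads-120elmst-proxmox-1 kernel: [ 4908.513876] usb 9-1.1: reset high-speed USB device number 4 using xhci_hcd
Nov 19 22:55:54 ads-120elmst-proxmox-1 kernel: [ 4908.938243] usb 9-1.2: reset full-speed USB device number 5 using xhci_hcd
"""

for port, n in sorted(count_usb_resets(SAMPLE_LOG.splitlines()).items()):
    print(port, n)
```

Running the same function over a saved copy of the kernel log from each attempt would show whether the reset count really went up.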

Thanks again for the response, and if you have any other thoughts on the matter I'd love to hear them.

-Brian
  

----- Original Message -----
From: "vfio-users-request" <vfio-users-request at redhat.com>
To: "vfio-users" <vfio-users at redhat.com>
Sent: Sunday, November 19, 2017 12:00:11 PM
Subject: vfio-users Digest, Vol 28, Issue 16

Message: 4
Date: Sun, 19 Nov 2017 08:53:32 +0000
From: Zir Blazer <zir_blazer at hotmail.com>
To: "vfio-users at redhat.com" <vfio-users at redhat.com>
Subject: Re: [vfio-users] GPU driver crashes when running a second VM
        if either VM has a virtual disk stored on physical media other than
        the root disk. Tested on three X58 chipset MBs
Message-ID:
        <CY4PR15MB146404EDEBB7CAA99B74972EF32D0 at CY4PR15MB1464.namprd15.prod.outlook.com>
        
Content-Type: text/plain; charset="iso-8859-1"

The Nehalem era X58, 5500 and 5520 Chipsets had a notoriously broken Interrupt Remapping implementation:
https://support.citrix.com/article/CTX136517 (Love that symptoms list)
https://www.netiq.com/support/kb/doc.php?id=7014344
https://serverfault.com/questions/745593/does-disabling-vt-d-and-interrupt-remapping-break-msi-x



Interrupt Remapping was directly related to some x2APIC, MSI-X (Not sure if MSI) and IOMMU features which obviously on X58 platforms don't work as intended. You can try to force disabling them with Kernel Parameters (intremap=off and something else to force old xAPIC) and see if it improves. Google around also the X2APIC Disable Bit in the ACPI DMAR Table (I recall that I wrote something related to it). VFIO had also an allow_unsafe_interrupts=1 option that was also related to Nehalem broken Chipsets.
Basically, you are trying to use early era Hardware that was quite buggy, so it is expected that Passthrough would be problematic. Have fun!