[vfio-users] VFIO and random host crashes

Colin Godsey crgodsey at gmail.com
Sat May 21 15:55:26 UTC 2016


Hmm, I’m still on 4.4, but I did notice I didn’t have THP disabled. Going
to try that.
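
For anyone wanting to check the same thing on their host, here is a minimal sketch, not a definitive tool. It assumes the usual sysfs file /sys/kernel/mm/transparent_hugepage/enabled, which marks the active mode in square brackets; the helper name active_thp_mode is just something I made up for illustration.

```python
from pathlib import Path

# Standard sysfs location for the THP setting; the active mode is bracketed,
# e.g. "always madvise [never]".
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed token from the sysfs contents, or None if absent."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("THP sysfs file not found on this system")
```

If this reports anything other than "never", THP is still in play, and transparent_hugepage=never on the kernel command line would be the way to rule it out.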

(Also, a small note: I’ve tried various combinations of edge/MSI IRQs for
the devices, with no real difference.)
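
To confirm which interrupt mode each passed-through device actually ended up with, something like the following rough sketch could scan /proc/interrupts. It assumes vfio-pci labels its handlers with names beginning vfio-intx, vfio-msi or vfio-msix (worth verifying on your own kernel); the function name vfio_irq_kinds is hypothetical.

```python
import re

# Assumed handler naming, e.g. "vfio-intx(0000:01:00.0)" or
# "vfio-msi[0](0000:02:00.0)". "msix" is listed before "msi" so the
# longer name wins.
VFIO_IRQ = re.compile(r"(vfio-(intx|msix|msi)[^\s,]*)")


def vfio_irq_kinds(text):
    """Map each vfio handler name found in the text to intx, msi or msix."""
    kinds = {}
    for line in text.splitlines():
        for match in VFIO_IRQ.finditer(line):
            kinds[match.group(1)] = match.group(2)
    return kinds


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        for name, kind in vfio_irq_kinds(f.read()).items():
            print(kind, name)
```

Handlers only show up while a guest has the device open, so this needs to run with the VM up.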

I replaced the PSU on my system yesterday, and it improved the stability a
bunch! But not totally. I got about 4 hours of burn-in followed by 4 hours
of gaming, and it still crashed at the end. I didn’t get any artifacts at
all, and I swear my GPUs were running a bit faster (I finally saw my cards
hit the ‘pwr’ limit, instead of just ‘voltage rel’). Same hang as before:
0 logs, system won’t restart, etc.

One of these cards is a horrible 3-year-old monster that by far produces
the most heat of anything in the system. Probably more than the other
card, PSU and CPU combined. When this card was running solo, I did notice
what I think were device ‘resets’. The screen would blank out and come back.

I’m wondering if maybe these hard device-side resets could cause this. GPU
resets can be made more frequent by heat, usage, age/faults, etc.

I remember really weird behavior when trying to detach/attach devices that
were already bound with VFIO. Perhaps there’s something similar here when
the device disconnects itself, possibly bricking the DMAR or interrupt
remapping.

I remember reading that the reset switch functions over some kind of
ACPI-based interrupt. Bricking the interrupt handler could result in the
system being unable to log or respond to… anything. Does anybody know how
Linux might respond to a missing EOI or something like that with VFIO? I
understand VFIO is supposed to give us great device emulation, but with
interrupt remapping… an interrupt is an interrupt. I’m assuming any
critical failure in IRQ handling for VFIO devices would be just as bad as
one for a host-level device.

Has anyone ever, like… yanked a passthrough’d PCI card out while a guest
was running? Might try that today… I just don’t want to damage anything
further.

On Thu, May 19, 2016 at 10:00 AM Alex Williamson <
alex.l.williamson at gmail.com> wrote:

> I'm not convinced the ones I've seen are power related, my system has been
> running videos in a loop for days, completed a few folding @home jobs, done
> some compiling, and I even added another (idle, low power) video card since
> the last hang.  Storage doesn't jive with my observations either.  I'm
> still leaning towards some isolcpus/nohz_full interaction, but I haven't
> yet started to add those options back.  If you're running a v4.5 kernel,
> please be sure to rule out a transparent hugepage issue with
> transparent_hugepage=never on the kernel command line or run the latest
> stable v4.5.5 release (now fixed).
>