[vfio-users] VFIO and random host crashes

Colin Godsey crgodsey at gmail.com
Mon May 23 14:38:37 UTC 2016


I’m now 98% certain I was seeing the ‘Skylake freeze’, or one of the other
random bugs in the Skylake microcode. Updated to the newest BIOS and
everything is rock solid (my old BIOS was only 5 months old).

There were two big issues regarding C-states and ‘complex workloads’ that
would result in a CPU fault. I’ve now also learned that this type of hard
freeze (where the reset button doesn’t work and there are 0 system logs) is
almost always some kind of CPU fault. The chip basically stops doing anything
(including handling interrupts) until power cycled.
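
To compare what the CPU is running before and after a BIOS update, here is a
minimal sketch (not part of the original message). It assumes an x86 Linux
host, where /proc/cpuinfo carries a "microcode" field per logical CPU. The
function name is hypothetical.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo text.

    Assumes the x86 Linux layout, one "microcode : 0x..." line per
    logical CPU. Normally every CPU reports the same revision.
    """
    pattern = r"^microcode\s*:\s*(\S+)"
    return set(re.findall(pattern, cpuinfo_text, flags=re.MULTILINE))


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(sorted(microcode_revisions(f.read())))
    except OSError as exc:
        # Non-Linux hosts have no /proc/cpuinfo.
        print(f"could not read /proc/cpuinfo: {exc}")
```

Running it once before and once after flashing the BIOS shows whether the
update actually shipped a newer microcode revision.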

On Sat, May 21, 2016 at 9:55 AM Colin Godsey <crgodsey at gmail.com> wrote:

> Hmm, I’m still on 4.4 but I did notice I didn’t have THP disabled- going
> to try that.
>
> (also just for a small note- I’ve tried various combinations of edge/msi
> IRQs for the devices, no difference really)
>
> I replaced the PSU on my system yesterday, and it improved the stability a
> bunch! But not totally. I got about 4 hours of burn-in followed by 4 hours
> of gaming- still crashed at the end. I didn’t get any artifacts at all, and
> I swear my GPUs were running a bit faster (I finally saw my cards hit the
> ‘pwr’ limit, instead of just ‘voltage rel’). But, it still crashed in the
> end. Same hang as before- 0 logs, system won’t restart, etc.
>
> One of these cards is a horrible 3-yr old monster that by far produces more
> heat than anything else in the system. Probably more than the other
> card, PSU and CPU combined. When this card was running solo, I did notice
> what I think were device ‘resets’. Screen would blank out and come back.
>
> I’m wondering if maybe these hard device-side resets could cause this- GPU
> resets can be amplified by heat, usage, age/faults etc.
>
> I remember really weird behavior trying to detach/attach devices that were
> already bound with VFIO- perhaps there’s something similar here when the
> device disconnects itself, possibly bricking the DMAR or interrupt
> remapping.
>
> I remember reading that the reset switch functions over some kind of ACPI
> based interrupt- bricking the interrupt handler could result in the system
> being unable to log or respond to… anything. Does anybody know how Linux
> might respond to a missing EOI or something like that with VFIO? I
> understand VFIO is supposed to give us great device emulation, but with
> interrupt remapping… an interrupt is an interrupt. I’m assuming any
> critical failure in IRQ handling for VFIO devices would be just as bad as a
> host-level device.
>
> Has anyone ever, like… yanked a passthrough’d PCI card out while a guest
> was running? Might try that today… just don’t want to damage anything
> further.
>
> On Thu, May 19, 2016 at 10:00 AM Alex Williamson <
> alex.l.williamson at gmail.com> wrote:
>
>> I'm not convinced the ones I've seen are power related, my system has
>> been running videos in a loop for days, completed a few folding @home jobs,
>> done some compiling, and I even added another (idle, low power) video card
>> since the last hang.  Storage doesn't jibe with my observations either.
>> I'm still leaning towards some isolcpus/nohz_full interaction, but I
>> haven't yet started to add those options back.  If you're running a v4.5
>> kernel, please be sure to rule out a transparent hugepage issue with
>> transparent_hugepage=never on the kernel command line or run the latest
>> stable v4.5.5 release (now fixed).
>>
>
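
The transparent hugepage setting discussed in the quoted thread can be
checked at runtime. A minimal sketch (not part of the original thread),
assuming the usual sysfs file /sys/kernel/mm/transparent_hugepage/enabled,
which lists the modes with the active one in square brackets. The function
name is hypothetical.

```python
import re

THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(sysfs_text):
    """Return the bracketed (active) mode from the THP 'enabled' file text.

    Example content: "always madvise [never]". Returns None if no
    bracketed entry is present.
    """
    match = re.search(r"\[(\w+)\]", sysfs_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        with open(THP_PATH) as f:
            print(active_thp_mode(f.read()))
    except OSError as exc:
        # The file is absent on kernels built without THP support.
        print(f"could not read {THP_PATH}: {exc}")
```

With transparent_hugepage=never on the kernel command line, the active mode
should come back as "never".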