[vfio-users] VFIO and random host crashes
Zycorax Tokoroa
zycorax at phoxden.xyz
Thu May 19 08:37:03 UTC 2016
> I’ve been running a dual gaming VM rig (2x dedicated GPU) for a little
> bit now, and everything works perfectly except when both VMs are under
> load, after an hour or so I get a hard crash and/or reboot. It will
> either reboot itself, or will hang so bad the physical ‘reset’ button on
> the box doesnt work.
>
> There is 0 evidence in the linux logs about the crash, I literally just
> see one of a few standard cron jobs as the syslog, then the next line is
> the kernel boot/start-up. Only real evidence I get is that- rarely I can
> hear windows crash first. Or windows will crash and Ill get maybe
> another second or 2 of ’top’ before the whole system goes down. I find
> it extremely odd that there’s some sort of (albeit fast) degradation,
> but absolutely nothing interesting in the logs.
This seems very similar to the issue I have. The system freezes with
nothing in the logs for this specific problem, and there is hardly any
warning sign beforehand. Every device, including wifi, USB, SATA disks,
etc., acts as if it were powered off or severed from the machine. My only
solution is to use the reset button, which causes a rather long reboot.
> So, I’m pretty sure it’s something hardware related- either PSU or my
> mobo is crap and is underpowered somewhere. During load, there are about
> 5 drives, 2 GTX GPUs, and GBe (~200mbps) all under constant load, so it
> seems likely it could be something chipset related.
My setup is similar, with 6 drives (though I mainly use two for the VM:
a rotational HD and an SSD for caching; the bcache device is passed as a
disk to the VM) and two GTX GPUs. I'm sure it's not a power issue, as
removing one GPU from the motherboard, and thus not using passthrough for
graphics, still gets me the issue. The load that causes me to crash is
usually heavy I/O on the VM disk.
> *So my question is really: is there ANY kind of kernel/vfio software
> level issue that could cause this crash? Or does this just sound like
> hardware?* I’ve tried several different power configurations at this
> point, I just want to be as sure as possible it’s hardware before i
> start replacing more things =\
I'm starting to think it's motherboard related, as it doesn't make sense
that only a few people get these issues. Perhaps correlating onboard
components could pin it down to something more specific.
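
A rough sketch of how that correlation could be done, assuming each
affected user saves the output of `lspci -nn` to a text file. The
function names and the file-per-user workflow are my own illustration,
not an existing tool; it only intersects the [vendor:device] IDs found
in every report, so it can hint at shared onboard parts but proves
nothing about the cause.

```python
import re
import sys

# Matches the [vendor:device] pair that `lspci -nn` prints, e.g. [1234:abcd].
# The device class, printed as [0106], has no colon and is not matched.
ID_RE = re.compile(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]")


def device_ids(lspci_text):
    """Return the set of vendor:device IDs found in one lspci -nn report."""
    ids = set()
    for line in lspci_text.splitlines():
        ids.update(ID_RE.findall(line.lower()))
    return ids


def common_devices(*reports):
    """Return the sorted IDs that appear in every report given."""
    sets = [device_ids(r) for r in reports]
    return sorted(set.intersection(*sets)) if sets else []


if __name__ == "__main__":
    # Usage (hypothetical file names): python common.py host_a.txt host_b.txt
    texts = []
    for path in sys.argv[1:]:
        with open(path) as fh:
            texts.append(fh.read())
    for dev in common_devices(*texts):
        print(dev)
```

Any ID printed is present on all of the machines compared, which would
be a starting point for looking at a shared controller or its driver.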
> This is an up to date Ubuntu Xenial, not really running anything
> special. I’ve gotten away with running my VMs almost as pure as
> possible, no funny workarounds or anything. OVMF, Windows 10, hyper-v
> flags. Skylake i7 @ z170M.
Both Xenial and Wily have this issue for me. I'm using an X99-Deluxe from
ASUS with an i7-5930K. No kernel patching, default libvirt and qemu-kvm
packages, default setup.
FWIW I used to isolate 6 of the 12 logical cores of my processor and
pin the vCPUs to them. I haven't seen the host choking; not tuning just
gives slightly worse performance on the VM.
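
For reference, the isolation and pinning described above could look
roughly like the output of this sketch. It prints an `isolcpus=` kernel
parameter and a libvirt `<cputune>` fragment with one `<vcpupin>` per
vCPU. Which six logical CPUs get isolated is an assumption here (6-11);
the right choice depends on how the host numbers its hyperthread
siblings, and the helper name is only for illustration.

```python
def build_cputune(host_cpus):
    """Return a libvirt <cputune> fragment pinning vCPU i to host_cpus[i]."""
    lines = ["<cputune>"]
    for vcpu, host_cpu in enumerate(host_cpus):
        lines.append(f"  <vcpupin vcpu='{vcpu}' cpuset='{host_cpu}'/>")
    lines.append("</cputune>")
    return "\n".join(lines)


# Assumption: logical CPUs 6-11 (6 of 12) are reserved for the guest.
ISOLATED = list(range(6, 12))

if __name__ == "__main__":
    # Kernel command line parameter keeping the scheduler off those CPUs.
    print("isolcpus=" + ",".join(str(c) for c in ISOLATED))
    # Fragment to place inside the <domain> element of the guest XML.
    print(build_cputune(ISOLATED))
```

The fragment goes into the domain XML (e.g. via `virsh edit`), with the
guest's `<vcpu>` count matching the number of pinned vCPUs.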
Zycorax Tokoroa